Python for re.match re.sub

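Problem description:

I am working with a CSV file. It contains a list of Sources (plain SSL links), Places, Websites (<a>non-SSL links</a>), Direcciones and Emails. When some piece of data is not available, it is simply not shown. Like this: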

httpsgoogledotcom, GooglePlace2, Direcciones, Montain View, Email, [email protected]

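However, the Website 'a html tag' link always appears twice, followed by several commas. The commas, in turn, are followed sometimes by Direcciones and sometimes by the next Source (https). As a result, if the process is not interrupted, it can keep 'replacing' for hours and create a file containing gigabytes of redundant and misplaced information. Let's take four entries as an example of Reutput.csv: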

> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> ,,Direcciones, Montain View, Email, [email protected] 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected] 
> httpsgoogledotcom, GooglePlace, Website, "<a> href='httpgoogledotcom'></a>",,,,,,,,,,,,,, 
> "<a href='httpgoogledotcom'></a>",,,,,,,,,,,,, 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected]

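So the idea is to remove the unnecessary Website 'a html tag' link and the excess commas, while respecting the newline (\n) and without getting stuck in a loop. Like this: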

> httpsgoogledotcom, GooglePlace, Website, "<a href='httpgoogledotcom'></a>",Direcciones, Montain View, Email, [email protected] 
> httpsbingdotcom, BingPlace, Direcciones,MicroWorld, Email, [email protected] 
> httpsgoogledotcom, GooglePlace,Website, <a href='httpgoogledotcom'></a>" 
> httpsbingdotcom, BingPlace, Direcciones, MicroWorld, Email, [email protected]

This is the latest version of the code:

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf: 
    text = str(reuf.read()) 
    for lines in text: 
     d = re.match('</a>".*D?',text,re.DOTALL) 
     if d is not None: 
      if not 'https' in d: 
       replace = re.sub(d,'</a>",Direc',lines) 
     h = re.match('</a>".*?http',text,re.DOTALL|re.MULTILINE) 
     if h is not None: 
      if not 'Direc' in h: 
       replace = re.sub(h,'</a>"\nhttp',lines) 
     replace = str(replace) 
     putuf.write(replace)

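Right now I get a Put.csv in which the last row is repeated forever. Why does it loop? I have tried several ways of handling this code, but unfortunately I am still stuck. Thanks in advance.

Answer

In the end I worked out the code myself. I am posting it here in the hope that someone finds it useful. Anyway, thanks for the help and the downvotes!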

import re

with open('Reutput.csv') as reuf, open('Put.csv', 'w') as putuf:
    text = reuf.read()
    # Pass 1: collapse each </a>" ... Direc gap, unless a new record (https) starts inside it.
    s = re.compile('</a>".*?Direc', re.DOTALL)
    replace = s.sub(lambda m: m.group(0) if 'https' in m.group(0) else '</a>",Direc', text)
    # Pass 2: collapse each </a>" ... https gap, unless it contains Direc, keeping the newline.
    s = re.compile('</a>".*?https', re.DOTALL)
    replace = s.sub(lambda m: m.group(0) if 'Direc' in m.group(0) else '</a>"\nhttps', replace)
    # Write the cleaned text once, after both passes.
    putuf.write(replace)
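
The two substitutions in this answer can also be folded into a single pass. What follows is only a sketch under stated assumptions, not the poster's code: the names GAP, join_gap and clean are hypothetical, and the sample is a shortened stand-in for Reutput.csv. It relies on a lazy match that stops at whichever marker (Direc or https) comes first after a closing </a>", so the check for the other marker is implied by the match itself.

```python
import re

# Lazily match from a closing </a>" up to whichever marker comes first.
GAP = re.compile(r'</a>".*?(Direc|https)', re.DOTALL)

def join_gap(match):
    """Pick the replacement according to the marker that ended the gap."""
    if match.group(1) == 'Direc':
        return '</a>",Direc'   # same record continues: pull it onto one line
    return '</a>"\nhttps'      # next record starts: keep exactly one newline

def clean(text):
    """Collapse duplicated links and comma runs in one pass over the text."""
    return GAP.sub(join_gap, text)

# Shortened, made-up sample in the shape of the question's data.
sample = (
    'httpsg, GPlace, Website, "<a href=\'httpg\'></a>",,,,\n'
    '"<a href=\'httpg\'></a>",,,\n'
    ',,Direcciones, Montain View\n'
    'httpsb, BPlace, Direcciones, MicroWorld\n'
    'httpsg, GPlace, Website, "<a href=\'httpg\'></a>",,,,\n'
    '"<a href=\'httpg\'></a>",,,\n'
    'httpsb, BPlace, Direcciones, MicroWorld\n'
)

print(clean(sample))
```

Because the replacement is chosen per match, a gap that ends in a new https record is never merged into the following Direcciones field, which is the risk with a global re.sub applied to the whole text.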

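Answer

When there is no match, groups will be None. You need to guard against this (or restructure the regular expression so that it always matches something).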

groups = re.search('</a>".*?Direc', lines, re.DOTALL)
if groups is not None:
    # test the matched text; a Match object itself does not support "in"
    if 'https' not in groups.group(0):

Notice the added not None condition and the extra indentation of the following lines that it governs.
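
A minimal sketch of that guard, assuming a shortened sample line; the helper name gap_without_https is hypothetical. It also shows one common reason for getting None in the first place: re.match only succeeds at the very start of the string, while re.search scans the whole string.

```python
import re

line = 'GooglePlace, Website, "<a href=\'httpgoogledotcom\'></a>",,,,Direcciones'
pattern = '</a>".*?Direc'

# re.match is anchored at position 0, so it finds nothing here.
anchored = re.match(pattern, line, re.DOTALL)

# re.search looks anywhere in the string and finds the gap.
found = re.search(pattern, line, re.DOTALL)

def gap_without_https(text):
    """Return the matched gap if it contains no 'https', otherwise None."""
    m = re.search(pattern, text, re.DOTALL)
    if m is None:           # guard: the pattern did not match at all
        return None
    matched = m.group(0)    # the matched text, not the Match object
    return matched if 'https' not in matched else None
```

Testing 'https' against m.group(0) rather than against m matters, because the in operator raises a TypeError on a Match object.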

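I tried adding else: replace = lines, but no luck. – Abueesp

See the example code. – tripleee

Update: I tried it and got a blank file, so you are right, the match for groups must be None. Why? And how can I fix Reutput.csv then? Thanks in advance, tripleee. – Abueesp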
