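Python RegEx: extracting specific text
Question:
I am new to RegEx. I am using Python to go through a web page and pick out certain text. So far I can pick out the part I need, but only together with extra characters. In the example below, the text I am trying to extract is "Need This".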
import re
test = '<area alt=Need This <span class="view">view 1</span>||time to spend in view:<br /> ' \
'<div class=sadfca3 24swcdsa c4566 54dscz>' \
'<span class=asafwef1 41sd fd3532 safwef>' \
'<img class=sfecs 234af wefw47 5awef>' \
'</span> ' \
'<span class=sad536 fwfad23 4s214 fsadfw>' \
'<img class=&we234 fsafsdf 2323 asdfsd>' \
'</span>' \
'<span class=afasui2 34 ewiasd23 4fjlwe;>' \
'<img class=sfawejac2 42jk hewwef32 4uafasd>' \
'</span> ' \
'<span class=gdfjuia w8 aw ijfaw a909>' \
'<img class=asfwejhjdkh f 8sd 8 awiosa;f98a 8a' \
'</span> <div class=afkj waj 98u2oi kjaf09></div>" href="jkhafu.php">'
print("findall")
print(re.findall(r'<area alt=?.*<span class=', str(test), re.I|re.M))
print("finditer")
print(re.finditer(r'<area alt=+.*<span class=', str(test), re.I|re.M))
print("match")
print(re.match(r'<area alt=+.*<span class=', str(test), re.I|re.M))
print("search")
print(re.search(r'<area alt=+.*<span class=', str(test), re.I|re.M))
print("split")
print(re.split(r'<area alt=+.*<span class=', str(test), flags=re.I|re.M))  # flags must be a keyword here; the third positional argument of re.split is maxsplit
re.match and re.search come close to what I need. Here are the results from the example above:
findall
['<area alt=Need This <span class="view">view 1</span>||time to spend in view:<br /> <div class=sadfca3 24swcdsa c4566 54dscz><span class=asafwef1 41sd fd3532 safwef><img class=sfecs 234af wefw47 5awef></span> <span class=sad536 fwfad23 4s214 fsadfw><img class=&we234 fsafsdf 2323 asdfsd></span><span class=afasui2 34 ewiasd23 4fjlwe;><img class=sfawejac2 42jk hewwef32 4uafasd></span> <span class=']
finditer
<callable_iterator object at 0x00493750>
match
<_sre.SRE_Match object; span=(0, 405), match='<area alt=Need This <span class="view">v>
search
<_sre.SRE_Match object; span=(0, 405), match='<area alt=Need This <span class="view">v>
split
['', 'gdfjuia w8 aw ijfaw a909><img class=asfwejhjdkh f 8sd 8 awiosa;f98a 8a</span> <div class=afkj waj 98u2oi kjaf09></div>" href="jkhafu.php">']
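How can I use a regular expression in Python 3.4 to get only "Need This" from the string named test in the example above?
Any help would be greatly appreciated!
Answer
Use lookbehind and lookahead assertions: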
(?<=area alt=).*?(?=\s+<span class=)
Code:
>>> m = re.search(r'(?<=area alt=).*?(?=\s+<span class=)', test).group()
>>> m
'Need This'
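The key ingredients in the pattern above are the lazy quantifier `.*?` and the fact that the delimiters are excluded from the match. As a rough sketch of the same idea, the snippet below compares a greedy match, a lazy match, and a capturing group. The `sample` string is a shortened stand-in for the question's `test` string (an assumption made for brevity), and the capturing-group variant is an alternative, not the answerer's method.

```python
import re

# Shortened stand-in for the question's `test` string (an assumption for brevity).
sample = ('<area alt=Need This <span class="view">view 1</span>||time to spend in view:<br /> '
          '<span class=a1><img class=b2></span>" href="jkhafu.php">')

# Greedy: .* runs to the LAST '<span class=' in the string.
greedy = re.search(r'<area alt=.*<span class=', sample).group()

# Lazy: .*? stops at the FIRST '<span class='.
lazy = re.search(r'<area alt=.*?<span class=', sample).group()

# Capturing group: keep only the text between the two delimiters.
captured = re.search(r'<area alt=(.*?)\s+<span class=', sample).group(1)

print(greedy)
print(lazy)
print(captured)  # Need This
```

The greedy version explains the oversized match in the question's output; the lazy versions stop at the first `<span class=`.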
Works perfectly! Thank you! – user908759 2014-09-02 23:46:19
You're welcome... – 2014-09-02 23:49:44
Answer
You can use this expression:
area alt=([\w\s]+)<
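The code is:
import re
# Python 3: the ur'' prefix is a syntax error, and str is already Unicode
p = re.compile(r'area alt=([\w\s]+)<')
test_str = "YOUR TEXT HERE"
# search, not match: the pattern does not begin at the start of the string
m = p.search(test_str)
if m:
    print(m.group(1).strip())  # strip the trailing space captured by \s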
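One detail worth checking with this pattern: `re.match` only matches at the start of the string, whereas `re.search` scans the whole string. The sketch below shows the difference, along with the trailing whitespace that `[\w\s]+` captures. The `sample` string is a shortened stand-in for the question's `test` string, which is an assumption for illustration.

```python
import re

pattern = re.compile(r'area alt=([\w\s]+)<')

# Shortened stand-in for the question's `test` string (an assumption).
sample = '<area alt=Need This <span class="view">view 1</span>'

# match() only tries position 0, where the string starts with '<', so it fails.
at_start = pattern.match(sample)
print(at_start)  # None

# search() scans forward and finds the pattern starting at index 1.
found = pattern.search(sample)
raw = found.group(1)
print(repr(raw))    # 'Need This ' (note the trailing space)
print(raw.strip())  # Need This
```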
Related, if you ever do any more complex HTML parsing: http://stackoverflow.com/a/1732454/406772 – 2014-09-02 23:17:16
Your HTML is not actually valid. Share a link to the web page, or provide the relevant page HTML as it is. – alecxe 2014-09-02 23:17:24
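Following up on the comments about HTML parsing: if the real page turns out to be well-formed, with quoted attributes, an HTML parser may be more robust than a regex. Here is a minimal sketch using the standard library's `html.parser`. The markup and the `AltCollector` class are illustrative assumptions, not the asker's actual page or code.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the alt attribute of every <area> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == 'area':
            for name, value in attrs:
                if name == 'alt':
                    self.alts.append(value)


# Well-formed stand-in markup; not the actual page from the question.
html = '<map name="m"><area alt="Need This" href="jkhafu.php"></map>'

collector = AltCollector()
collector.feed(html)
print(collector.alts)  # ['Need This']
```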