Extracting data from the title tag using BeautifulSoup?
I want to extract the title of a link after fetching its HTML with the BeautifulSoup library in Python. Basically, the whole title tag is:
<title>Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"</title>
The data I want to extract is the part between the &quot; entities, which is just this: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)
This is what I tried:
import urllib
import urllib.request
from bs4 import BeautifulSoup
link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
try:
    List = list()
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h, "html.parser")
    for i in data.find_all("title"):
        List.append(i.text)
    print(List[0])
except urllib.error.HTTPError as err:
    pass
I also tried:
for i in data.find_all("title.&quot;"):
for i in data.find_all("title>&quot;"):
for i in data.find_all("&quot;"):
and
for i in data.find_all("quot"):
But none of them work.
Just split the text on the colon:
In [1]: h = """<title>Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"</title>"""
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(h, "lxml")
In [4]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"
Actually, looking at the page, you don't need to split at all: the text is in the p tag inside div.js-tweet-text-container:
In [8]: import requests
In [9]: from bs4 import BeautifulSoup
In [10]: soup = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml")
In [11]: print(soup.select_one("div.js-tweet-text-container p").text)
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)
In [12]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"
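So either way gives you the same text; the only difference is that the title version keeps the surrounding quotation marks.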
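The two approaches above can be combined so that the code still returns something when the tweet container is missing. The following is a minimal offline sketch, not the answerer's code: `SAMPLE_HTML` is a hand-written stand-in for the real page, and `tweet_text` is a hypothetical helper name. It assumes the page uses the `div.js-tweet-text-container` markup shown above, which may not hold for the live site.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the fetched page (an assumption, not real Twitter HTML).
SAMPLE_HTML = """
<html><head><title>Someone on Twitter: "Hello world"</title></head>
<body><div class="js-tweet-text-container"><p>Hello world</p></div></body></html>
"""


def tweet_text(html):
    """Return the tweet text, preferring the tweet container over the title."""
    soup = BeautifulSoup(html, "html.parser")

    # First choice: the <p> inside the tweet text container.
    node = soup.select_one("div.js-tweet-text-container p")
    if node is not None:
        return node.get_text(strip=True)

    # Fallback: split the title once on ": " and drop the surrounding quotes.
    if soup.title is not None and soup.title.string:
        return soup.title.string.split(": ", 1)[-1].strip().strip('"')

    # Neither the container nor a usable title was found.
    return None


print(tweet_text(SAMPLE_HTML))
```

Checking for `None` matters here because `select_one` returns `None` when nothing matches, and calling `.text` on that would raise an `AttributeError`.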
Caunnungham, this works! Thanks for pointing it out. `print(soup.select_one("div.js-tweet-text-container p").text)` – Amar
Once you have parsed the HTML:
data = BeautifulSoup(h,"html.parser")
find the title like this:
title = data.find("title").string # this is without <title> tag
Now find the two quotation marks (") in the string. There are many ways to do that. I would use a regular expression:
import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))
You never search for &quot; or any other &name; sequence, because BeautifulSoup converts them into the actual characters they represent.
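To illustrate the point about entities, here is a small sketch on a made-up title (the HTML string is an assumption, not the real page). It shows that after parsing, the title string contains plain `"` and `&` characters, and that searching for a tag named `quot` finds nothing because `&quot;` is an entity, not a tag.

```python
from bs4 import BeautifulSoup

# Made-up markup that uses entities the way the page source does.
html = "<title>Someone on Twitter: &quot;Hello &amp; goodbye&quot;</title>"

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string

# The parser has already decoded &quot; and &amp; into real characters.
print(title)

# There is no element called "quot", so this list is empty.
quot_tags = soup.find_all("quot")
print(quot_tags)
```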
EDIT:
A regex that does not capture the quotes is:
re.search(r'(?<=").*(?=")', title)
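As a quick check of the difference between the two patterns, the sketch below runs both on a shortened, made-up title string. The lookbehind `(?<=")` and lookahead `(?=")` require a quote on each side of the match without including it in the result.

```python
import re

# Shortened sample title (an assumption for illustration).
title = 'Imaan Z Hazir on Twitter: "REALLY, AMERICA? (3)"'

# Pattern 1: the quotes are part of the match.
with_quotes = re.search(r'".*"', title)

# Pattern 2: the quotes are only asserted, not consumed.
without_quotes = re.search(r'(?<=").*(?=")', title)

if with_quotes and without_quotes:
    print(with_quotes.group(0))     # "REALLY, AMERICA? (3)"
    print(without_quotes.group(0))  # REALLY, AMERICA? (3)
```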
Here is a simple, complete example that uses a regular expression to extract the text inside the quotes:
import re
import urllib.request  # "import urllib" alone does not load the request submodule
from bs4 import BeautifulSoup
link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
quote = re.match(r'^.*\"(.*)\"', title)
if quote:  # re.match returns None when the title has no quoted part
    print(quote.group(1))
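What happens here is that, after fetching the page source and finding the title, we run a regular expression against the title to extract the text inside the quotes.
We tell the regex to match any number of characters (.*) from the start of the string (^) up to an opening quote (\"), and then to capture the text between that quote and the closing quote (the second \").
Then we print the captured text by asking Python for the first captured group (the part of the regex between the parentheses).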
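One detail worth checking is how the greedy `^.*` prefix behaves when the tweet itself contains quotation marks. The sketch below uses a made-up title and a hypothetical helper, `extract_quoted`, to compare the pattern above with a non-greedy prefix. It also guards against titles with no quotes, where `re.match` returns `None`.

```python
import re

# Made-up title whose tweet text contains its own quotation marks.
title = 'User on Twitter: "He said "hi" to me"'

# Greedy prefix: ^.* swallows as much as it can, so the capture
# starts after the LAST quote that still has another quote after it.
greedy = re.match(r'^.*"(.*)"', title)


def extract_quoted(text):
    """Return the text between the first and last quote, or None."""
    # Non-greedy prefix: ^.*? stops at the FIRST quote, and the greedy
    # capture then runs to the last quote in the string.
    match = re.match(r'^.*?"(.*)"', text)
    return match.group(1) if match else None


print(repr(greedy.group(1)))
print(repr(extract_quoted(title)))
print(extract_quoted("a title without quotes"))
```

For titles with a single quoted span, such as the one in the question, both patterns give the same result; they only differ when extra quotes appear inside the tweet.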
There is more about regex match objects in Python here - https://docs.python.org/3/library/re.html#match-objects
I would expect BeautifulSoup to convert `&quot;` to `"`, so you only need to look for `"` – zvone
@zvone What is that `"`? Do you mean `title"`? – Amar