Extracting data from the title tag using BeautifulSoup?

Question:

I want to extract the title of a link after fetching its HTML with the BeautifulSoup library in Python. Basically, the whole title tag is:

<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>

The data I want to extract is the part between the &quot; entities, that is, only this: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3). I tried this:

import urllib.error
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
try:
    List = list()
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h, "html.parser")
    for i in data.find_all("title"):
        List.append(i.text)
        print(List[0])
except urllib.error.HTTPError as err:
    pass

I also tried:

for i in data.find_all("title.&quot"): 

for i in data.find_all("title>&quot"): 

for i in data.find_all("&quot"):

and

for i in data.find_all("quot"):

But none of them worked.

I would expect BeautifulSoup to convert '&quot;' into '"', so you just need to look for '"' – zvone

@zvone What is this '"'? Do you mean something like 'title"'? – Amar

Answer

Just split the text on the colon:

In [1]: h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>""" 

In [2]: from bs4 import BeautifulSoup 

In [3]: soup = BeautifulSoup(h, "lxml") 

In [4]: print(soup.title.text.split(": ", 1)[1]) 
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"
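The split can also be wrapped in a small helper. This is only a sketch and not part of the original answer: the name tweet_text_from_title is hypothetical, and it assumes the title has the form Name on Twitter: "text".

```python
def tweet_text_from_title(title):
    """Return the part of the title after the first ': ', without the outer quotes."""
    # partition splits on the first ': ' only, so a later ': ' in the text stays intact.
    prefix, sep, rest = title.partition(": ")
    if not sep:
        return title  # no ': ' found; fall back to the whole title
    # Remove one pair of surrounding double quotes, if present.
    if len(rest) >= 2 and rest[0] == '"' and rest[-1] == '"':
        rest = rest[1:-1]
    return rest


title = 'Imaan Z Hazir on Twitter: "REALLY, AMERICA? (3)"'
print(tweet_text_from_title(title))
```

Using partition rather than an index into the split result avoids an IndexError when the title contains no colon.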

Actually, looking at the page, you don't need to split at all; the text is in the p tag inside div.js-tweet-text-container:

In [8]: import requests 

In [9]: from bs4 import BeautifulSoup 


In [10]: soup = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml") 


In [11]: print(soup.select_one("div.js-tweet-text-container p").text) 
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3) 

In [12]: print(soup.title.text.split(": ", 1)[1]) 
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

So you can do it either way and get the same result.
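Since the markup of the live page can change, the selector is easier to try on a static snippet. The HTML below is a made-up fragment that only imitates the structure described in this answer (a p inside div.js-tweet-text-container); it is an assumption, not a copy of the real page. html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described in the answer.
html_doc = """
<html><head><title>Someone on Twitter: &quot;Hello, world&quot;</title></head>
<body><div class="js-tweet-text-container"><p>Hello, world</p></div></body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select_one returns the first matching element, or None when nothing matches.
node = soup.select_one("div.js-tweet-text-container p")
tweet_text = node.text if node is not None else None
print(tweet_text)

# The title-based approach gives the same text, still wrapped in quotes.
title_text = soup.title.text.split(": ", 1)[1]
print(title_text)
```

Checking for None before reading .text matters in practice, because a page that no longer contains that div would otherwise raise an AttributeError.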

Caunnungham this works! Thanks for letting me know. 'print(soup.select_one("div.js-tweet-text-container p").text)' – Amar

Answer

Once you have parsed the HTML:

data = BeautifulSoup(h,"html.parser")

find the title like this:

title = data.find("title").string # this is without <title> tag

Now find the two quotes (") in the string. There are many ways to do that. I would use a regular expression:

import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))

You never search for &quot; or any other &NAME; sequence, because BeautifulSoup converts them into the actual characters they represent.
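A small check of that claim, as a sketch on a shortened, made-up title: after parsing, the title string contains a literal " and no &quot;. It also shows why find_all("&quot") finds nothing, since find_all looks for tag names and not for text.

```python
from bs4 import BeautifulSoup

# Shortened, made-up title used only for illustration.
h = "<title>Someone on Twitter: &quot;Hello&quot;</title>"
data = BeautifulSoup(h, "html.parser")
title = data.find("title").string

print(title)                    # the entity has been decoded to a real quote character
print("&quot;" in title)        # False
print('"' in title)             # True
print(data.find_all("&quot"))   # [] because there is no tag with that name
```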

Edit:

A regex that does not capture the quotes is:

re.search(r'(?<=").*(?=")', title)
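To compare the two patterns side by side, here is a sketch on a made-up title string. Note that .* is greedy, so both patterns span from the first quote to the last quote in the string.

```python
import re

title = 'Someone on Twitter: "Hello, world"'  # made-up example

# Plain pattern: the match includes the surrounding quotes.
with_quotes = re.search(r'".*"', title)
# Lookbehind/lookahead: the quotes must be there but are not part of the match.
without_quotes = re.search(r'(?<=").*(?=")', title)

if with_quotes and without_quotes:
    print(with_quotes.group(0))     # "Hello, world"
    print(without_quotes.group(0))  # Hello, world
```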

Answer

Here is a simple, complete example that uses a regular expression to extract the text inside the quotes:

import re
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"

# Fetch the page and parse it
r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
# Capture whatever sits between the opening and the closing quote
quote = re.match(r'^.*\"(.*)\"', title)
print(quote.group(1))

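What happens here is that, after fetching the page source and finding the title, we apply a regular expression to the title to extract the text inside the quotes.

We tell the regex to match any number of characters from the start of the string (^.*) up to the opening quote (\"), and then to capture the text between it and the closing quote (the second \").

We then print the captured text by telling Python to print the first captured group (the part between the parentheses in the regex).

More about Python match objects here - https://docs.python.org/3/library/re.html#match-objects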
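The same pattern can be tried without any network access. This sketch uses made-up titles and a hypothetical helper named quoted_text; it also guards against re.match returning None when the title contains no quotes, a case the example above does not handle.

```python
import re


def quoted_text(title):
    """Return the text between the quotes, or None if there is no quoted part."""
    m = re.match(r'^.*"(.*)"', title)
    return m.group(1) if m else None


print(quoted_text('Someone on Twitter: "Hello, world"'))  # Hello, world
print(quoted_text('No quotes here'))                       # None
```

One thing to keep in mind: the leading .* is greedy, so if the title contains more than two quote characters, the group captures only the text between the last two.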
