Parsing an HTML page with BeautifulSoup

Question:

I have started using BeautifulSoup to parse HTML.
For example, for the page "http://en.wikipedia.org/wiki/PLCB1":

import sys
import urllib2

from BeautifulSoup import BeautifulSoup

# Raise Python's recursion limit before parsing
sys.setrecursionlimit(10000)

site = "http://en.wikipedia.org/wiki/PLCB1"
hdr = {'User-Agent': 'Mozilla/5.0'}  # send a browser-like User-Agent header
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find('table', {'class': 'infobox'})
# print table
rows = table.findAll("th")
for x in rows:
    print "x - ", x.string

I get None as the output in some cases where the <th> contains a URL. Why is that?

Output:

x - Phospholipase C, beta 1 (phosphoinositide-specific) 
x - Identifiers 
x - None 
x - External IDs 
x - None 
x - None 
x - Molecular function 
x - Cellular component 
x - Biological process 
x - RNA expression pattern 
x - Orthologs 
x - Species 
x - None 
x - None 
x - None 
x - RefSeq (mRNA) 
x - RefSeq (protein) 
x - Location (UCSC) 
x - None

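For example, after Location there is one more <th> that contains "PubMed search", but it is shown as None. I want to know why this happens.

Second: is there a way to get each <th> and its respective <td> into a dictionary, so that the result becomes easy to parse?

Answer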

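Element.string only contains a value when the text sits directly inside the element. Text in nested elements is not included.

If you are using BeautifulSoup 4, use Element.stripped_strings instead: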

print ''.join(x.stripped_strings)

For BeautifulSoup 3, you need to search for all text elements:

print ''.join([unicode(t).strip() for t in x.findAll(text=True)])
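
To make the difference concrete, here is a minimal sketch for BeautifulSoup 4 on Python 3. The HTML fragment is a hypothetical reduction of the infobox header cells (the "/wiki/PubMed" link target in particular is made up for illustration), and cell_text is a helper name introduced here, not part of the library.

```python
from bs4 import BeautifulSoup

# Hypothetical header cells modelled on the infobox rows in the question.
html = """
<table>
  <tr><th>External IDs</th></tr>
  <tr><th><a href="/wiki/Human_Genome_Organisation">Symbols</a></th></tr>
  <tr><th><a href="/wiki/PubMed">PubMed</a> search</th></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_text(cell):
    """Join every text fragment inside the cell, however deeply nested."""
    return " ".join(cell.stripped_strings)


for th in soup.find_all("th"):
    # .string is None as soon as the cell has more than one child node
    print(repr(th.string), "->", repr(cell_text(th)))
```

Note that BeautifulSoup 4 lets .string fall through a single nested child tag, so only the mixed cell (a link followed by text) yields None there, whereas BeautifulSoup 3 also returned None for the link-only cell. The sketch joins with a space rather than the empty string so that adjacent fragments do not run together.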

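If you want to combine the <th> and <td> elements into a dictionary, you can loop over all the <th> elements, use .findNextSibling() to locate the corresponding <td> element, and combine that with the .findAll(text=True) trick above to build yourself a dictionary: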

info = {}
rows = table.findAll("th")
for headercell in rows:
    valuecell = headercell.findNextSibling('td')
    if valuecell is None:
        # header cells without a following <td> in the same row are skipped
        continue
    header = ''.join([unicode(t).strip() for t in headercell.findAll(text=True)])
    value = ''.join([unicode(t).strip() for t in valuecell.findAll(text=True)])
    info[header] = value
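
For readers on BeautifulSoup 4 and Python 3, the same idea might look roughly like the sketch below. The table is a trimmed-down, hypothetical stand-in for the real infobox, and infobox_to_dict is an illustrative name rather than anything the library provides.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down infobox; the real page has many more rows.
html = """
<table class="infobox">
  <tr><th colspan="4">Identifiers</th></tr>
  <tr>
    <th><a href="/wiki/Human_Genome_Organisation">Symbols</a></th>
    <td colspan="3"><span class="plainlinks">PLCB1; EIEE12; PI-PLC</span></td>
  </tr>
</table>
"""


def infobox_to_dict(table):
    """Map each header cell's text to the text of the <td> that follows it."""
    info = {}
    for header_cell in table.find_all("th"):
        value_cell = header_cell.find_next_sibling("td")
        if value_cell is None:
            continue  # section headings such as "Identifiers" have no value cell
        header = " ".join(header_cell.stripped_strings)
        value = " ".join(value_cell.stripped_strings)
        info[header] = value
    return info


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="infobox")
print(infobox_to_dict(table))
```

find_next_sibling("td") only looks at siblings within the same row, which is why heading rows that span the whole table are skipped instead of picking up a value from a later row.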

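This only works with bs4. @sam may be using an earlier version of BeautifulSoup instead. (The -1 is not mine, by the way.) – unutbu 2013-02-16 14:48:18

@unutbu: bugger.. updated to include a BS3 option – 2013-02-16 14:48:37

It gives a TypeError – sam 2013-02-16 14:49:54

Answer

If you inspect the HTML,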

<th colspan="4" style="text-align:center; background-color: #ddd">Identifiers</th> 
</tr> 
<tr class=""> 
<th style="background-color: #c3fdb8"><a href="/wiki/Human_Genome_Organisation" title="Human Genome Organisation">Symbols</a></th> 
<td colspan="3" class="" style="background-color: #eee"><span class="plainlinks"><a rel="nofollow" class="external text" href="http://www.genenames.org/data/hgnc_data.php?hgnc_id=15917">PLCB1</a>; EIEE12; PI-PLC; PLC-154; PLC-I; PLC154; PLCB1A; PLCB1B</span></td> 
</tr> 
<tr class=""> 
<th style="background-color: #c3fdb8">External IDs</th>

you will see that between Identifiers and External IDs there is a <th> tag with no text of its own, only an <a> tag:

<th style="background-color: #c3fdb8"><a href="/wiki/Human_Genome_Organisation" title="Human Genome Organisation">Symbols</a></th>

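This <th> has no text. So x.string is None.

Sure, 'x.string' is None, but how do you solve this? :-P – 2013-02-16 14:52:13

@MartijnPieters: I was getting to that, but you answered too fast :) – unutbu 2013-02-16 14:53:14

What about the last case, where there is a <...> as well as a <...> tag? – sam 2013-02-16 14:53:53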
