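BeautifulSoup: remove special characters from tag contents

Question:

I want to compare a string with the content of an HTML page, but the special characters in the page make the comparison harder. So before comparing I want to remove all special characters and whitespace from the HTML page, while leaving all tags unchanged.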

<div class="abc bcd"> 
     <div class="inner1"> Hai ! this is first inner div;</div> 
     <div class="inner2"> "this is second div... " </div> 
</div> 

This should be converted to

<div class="abc bcd"> 
      <div class="inner1">Haithisisfirstinnerdiv</div> 
      <div class="inner2">thisisseconddiv</div> 
</div> 

How can this be done?

Find out how to replace text with BeautifulSoup. – Blender 2013-05-12 03:57:47

Find all leaf tags and change their strings.

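import string

from bs4 import BeautifulSoup
from bs4.element import Tag

alphabet = string.ascii_letters  # a-z and A-Z

def replace(soup):
    # Walk the tree and rewrite the text of every tag that holds a single string.
    for child in soup.children:
        if not isinstance(child, Tag):
            continue  # skip bare strings (whitespace between tags)
        if child.string is not None:
            child.string = ''.join(ch for ch in child.string if ch in alphabet)
        else:
            replace(child)

orig_string = """
<div class="abc bcd">
     <div class="inner1"> Hai ! this is first inner div;</div>
     <div class="inner2"> "this is second div... " </div>
</div> """

# 'lxml' wraps the fragment in <html><body>, as in the output below (needs the lxml package).
soup = BeautifulSoup(orig_string, 'lxml')
print(soup.prettify())  # original HTML
replace(soup)
print()
print(soup.prettify())  # new HTML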

Output (the original HTML first, then the new HTML):

<html> 
<body> 
    <div class="abc bcd"> 
    <div class="inner1"> 
    Hai ! this is first inner div; 
    </div> 
    <div class="inner2"> 
    "this is second div... " 
    </div> 
    </div> 
</body> 
</html> 

<html> 
<body> 
    <div class="abc bcd"> 
    <div class="inner1"> 
    Haithisisfirstinnerdiv 
    </div> 
    <div class="inner2"> 
    thisisseconddiv 
    </div> 
    </div> 
</body> 
</html> 

Just a small thing: 'import string; string.letters' yields the lowercase and uppercase letters :) – TerryA 2013-05-12 04:34:41

For Unicode awareness, don't enumerate all the letters. Instead, import 'unicodedata' and, rather than 'ch in alphabet', use 'unicodedata.category(ch)[0] == 'L''. – icktoofay 2013-05-12 05:04:42

Also, the 'child =' in your 'child = replace(child)' serves no purpose. – icktoofay 2013-05-12 05:05:01
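
The Unicode-aware filter suggested in the comments could be sketched roughly as follows, assuming Python 3. The helper names `keep_letters` and `strip_non_letters` are hypothetical and not part of the original answer. The idea is to keep every character whose Unicode general category starts with 'L' (a letter), so accented and non-Latin letters survive while digits, punctuation and whitespace are dropped.

```python
import unicodedata

from bs4 import BeautifulSoup
from bs4.element import Tag


def keep_letters(text):
    """Return text with every non-letter character removed (hypothetical helper)."""
    return ''.join(ch for ch in text if unicodedata.category(ch)[0] == 'L')


def strip_non_letters(node):
    """Recursively clean every tag that holds exactly one string (hypothetical helper)."""
    for child in node.children:
        if not isinstance(child, Tag):
            continue  # bare strings between tags are left alone
        if child.string is not None:
            child.string = keep_letters(child.string)
        else:
            strip_non_letters(child)


# html.parser is used here only to avoid an extra dependency.
soup = BeautifulSoup('<div><p> Héllo, wörld! 42 </p><p>"日本語..."</p></div>', 'html.parser')
strip_non_letters(soup)
print(soup)  # <div><p>Héllowörld</p><p>日本語</p></div>
```

Unlike `str.isalnum()`, this check drops digits as well, so it is only a fit when letters alone should be compared.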

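First, note that calling BeautifulSoup() repairs broken or partial HTML, which is why the output gains <html> and <body> tags. Here is one way to get rid of the whitespace and special characters: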

>>> from bs4 import BeautifulSoup 
>>> html = """<div class="abc bcd"> 
    <div class="inner1"> Hai ! this is first inner div;</div> 
    <div class="inner2"> "this is second div... " </div> 
</div>""" 
>>> soup = BeautifulSoup(html, 'lxml')  # lxml adds the <html><body> wrapper
>>> for divtag in soup.find_all('div'):
...     if 'inner' in divtag['class'][0]:
...         divtag.string = ''.join(i for i in divtag.string if i.isalnum())
...
>>> print(soup)
<html><body><div class="abc bcd"> 
<div class="inner1">Haithisisfirstinnerdiv</div> 
<div class="inner2">thisisseconddiv</div> 
</div></body></html>
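
Both answers rely on `.string`, which is `None` when a tag mixes text with child tags, so such text is not cleaned. As a sketch of one way around that, assuming Python 3 and bs4 4.4 or later (for the `string=` argument), every text node can be rewritten in place. The helper name `clean_text_nodes` is hypothetical and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def clean_text_nodes(soup):
    """Strip non-alphanumeric characters from every text node (hypothetical helper)."""
    # find_all returns a list, so replacing nodes while looping over it is safe.
    for text in soup.find_all(string=True):
        if isinstance(text, Comment):
            continue  # leave HTML comments as they are
        text.replace_with(''.join(ch for ch in text if ch.isalnum()))


html = '<div class="abc bcd"><div class="inner1"> Hai ! <b>this</b> is first;</div></div>'
soup = BeautifulSoup(html, 'html.parser')
clean_text_nodes(soup)
print(soup)  # <div class="abc bcd"><div class="inner1">Hai<b>this</b>isfirst</div></div>
```

Whitespace-only nodes between tags are emptied too, so indentation in the source disappears while every tag and attribute stays in place.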