Extracting text from a div tag with BeautifulSoup 4 in Python

Question:

I want to extract text from a div tag using BeautifulSoup 4 and Python. The HTML below is stored in a file (example.html).

My HTML:

<table class="NZX1058422900" cols="20" style="border-collapse: collapse; width: 1496px;" cellspacing="0" cellpadding="0" border="0"> 
<tbody> 
<td class="A10dbmytr2499b"> 
<div class="VWP1058422499" alt="Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed" title="Total Cases: 5 - Level 1, Level 2, On Hold 2 - Completed">5/2</div> 
</td> 
</tbody> 
</table> 

I want the output to look like this:
Total Cases: 
5 - Level 1, Level 2, or On Hold 
2 - Completed

My code so far:

from bs4 import BeautifulSoup 
openFile = open("C:\\example.html") 
readFile = openFile.read() 
soup = BeautifulSoup(readFile, "lxml")

I tried the following code without any success:

soup.find("div", class_="VWP1058422499")

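Can anyone help me extract the data above?

Answer

Expanding on @so1989's answer: since you also want to know how to print the result in the format you specified, I suggest this approach: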

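from bs4 import BeautifulSoup

# Read the saved page; the path is the one given in the question
with open("C:\\example.html") as openFile:
    readFile = openFile.read()

soup = BeautifulSoup(readFile, "lxml")
alt = soup.find("div", {"class": "VWP1058422499"}).get("alt").split()

# Each count is the word just before a standalone '-', so start a new line there
for i, word in enumerate(alt):
    if word == '-':
        alt[i-1] = '\n' + alt[i-1]

# Rejoin with spaces, dropping the space left in front of each line break
alt = ' '.join(alt).replace(' \n', '\n')
print(alt)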
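
For the formatting step on its own, a regular expression is another option. This is only a sketch: it assumes every count in the attribute is written as a number followed by " - ", as in the question's sample, and format_cases is a hypothetical helper name, not something from BeautifulSoup.

```python
import re

def format_cases(alt_text):
    """Put each '<number> - ...' group of the alt text on its own line.

    Hypothetical helper; assumes counts always look like '5 - ...'.
    """
    # Replace the whitespace in front of every '<digits> - ' with a newline.
    return re.sub(r"\s+(?=\d+ - )", "\n", alt_text)

sample = "Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed"
print(format_cases(sample))
```

The lookahead leaves the numbers in "Level 1," and "Level 2," alone because they are not followed by " - ".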

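Thanks everyone for the answers! @so1989 But I get an "AttributeError: 'NoneType' object has no attribute 'get'" error on: alt = soup.find("div", {"class":"VWP1058422499"}).get("alt"). Any idea how to fix this? I can't call the .get method. – LinuxUser

@LinuxUser Can you post the URL of the page you are trying to scrape? – so1989

@LinuxUser I tested it with the HTML you gave us, read from a file, and it works fine. Could the error be related to the file location or the site URL? –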
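
That AttributeError means soup.find() returned None, i.e. no div with that class exists in the markup that was actually parsed. A minimal sketch of checking for that before calling .get() follows. The names extract_alt and SAMPLE_HTML are hypothetical, the markup is trimmed from the question, and the sketch uses Python's built-in "html.parser" instead of "lxml" so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Markup trimmed from the question; in practice this would be the file contents.
SAMPLE_HTML = """
<table><tbody><td class="A10dbmytr2499b">
<div class="VWP1058422499" alt="Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed">5/2</div>
</td></tbody></table>
"""

def extract_alt(html):
    """Return the div's alt attribute, or None when the div is not present.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser, not lxml
    div = soup.find("div", class_="VWP1058422499")
    if div is None:
        # find() found nothing, so calling .get() here would raise AttributeError
        return None
    return div.get("alt")

print(extract_alt(SAMPLE_HTML))
print(extract_alt("<p>no matching div here</p>"))  # prints None
```

If this returns None for the real file, printing the parsed soup is a quick way to see whether the file holds the expected HTML at all.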

Answer

alt = soup.find("div", {"class":"VWP1058422499"}).get("alt")
print(alt)  # alt is already a plain string, so it has no .text attribute
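
To make the difference between the attribute and the tag's text concrete, here is a small sketch. It assumes the div from the question is passed in as a string and uses the built-in "html.parser"; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = ('<div class="VWP1058422499" '
        'alt="Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed">5/2</div>')
soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser, not lxml
div = soup.find("div", class_="VWP1058422499")

alt_value = div.get("alt")     # value of the alt attribute, a plain str
visible_text = div.get_text()  # text between the opening and closing tags

print(alt_value)
print(visible_text)
```

So .get("alt") is what yields the "Total Cases: ..." string, while .get_text() (or .text) on the tag gives only "5/2".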

Kudos to you. I hope you don't mind that I decided to improve on your answer. –
