Removing extra tables from Python web scraping results
My code produces extra tables that I want to remove. I want to keep only the first table and drop all the others.
My code:
import csv
import os

import pandas as pd
import requests
from bs4 import BeautifulSoup

cwd = os.getcwd()
print(cwd)
# Raw string so the backslashes in the Windows path are not treated as escapes
os.chdir(r'c:\Users\STaiwo\Desktop\My R code')
# verify=False disables TLS certificate verification
page = requests.get("https://www.flyingblue.com/earn-and-spend-miles/airlines/partner/180/china-eastern.html", verify=False)
print(page.content)  # raw HTML content of the page
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())  # the same HTML, pretty-printed
for table in soup.findAll('tbody'):
    print('Table')
    list_of_rows = []
    for row in table.findAll('tr')[1:]:  # skip the header row
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    print(list_of_rows)
The result I currently get:
Table [['First Class', 'F,U', '150%'], ['P', '125%'], ['Business Class', 'J,C,D,I', '125%'], ['Premium Economy Class', 'W', '110%'], ['Economy Class', 'Y,B', '100%'], ['E,H,M', '75%'], ['L,N,R,S,V,K', '50%'], ['T', '30%'], ['Not eligible for accrual', 'Z,Q,G', '0%']]
Table []
Table []
Table [['Distance in miles: 6,482', 'Total'], ['Booking sub-class: 125%', '8,103'], ['8,103']]
Table [['Distance in miles: 6,482', 'Total'], ['Booking sub-class: 125%', 'Elite bonus: 75%', '12,965'], ['8,103', '4,862']]
Table [['Distance in miles: 6,482', 'Total'], ['Booking sub-class: 50%', '3,241'], ['3,241']]
Table [['Distance in miles: 6,482', 'Total'], ['Booking sub-class: 50%', 'Elite bonus: N/A', '3241'], ['3241', '0']]
The result I want:
Table [['First Class', 'F,U', '150%'], ['P', '125%'], ['Business Class', 'J,C,D,I', '125%'], ['Premium Economy Class', 'W', '110%'], ['Economy Class', 'Y,B', '100%'], ['E,H,M', '75%'], ['L,N,R,S,V,K', '50%'], ['T', '30%'], ['Not eligible for accrual', 'Z,Q,G', '0%']]
Try adding [:1] to soup.findAll('tbody'), i.e. soup.findAll('tbody')[:1]. That limits the loop to the first table only.
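To illustrate the slicing suggestion, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The markup and the helper name rows_of are made up for the example; only the [:1] slice on the tbody list comes from the advice above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables, only the first is wanted.
html = """
<table><tbody>
<tr><th>Cabin class</th><th>Booking class</th><th>Miles</th></tr>
<tr><td>First Class</td><td>F, U</td><td>150%</td></tr>
<tr><td>P</td><td>125%</td></tr>
</tbody></table>
<table><tbody>
<tr><td>Distance in miles: 6,482</td><td>Total</td></tr>
<tr><td>Booking sub-class: 125%</td><td>8,103</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')


def rows_of(table):
    """Return the text of every <td>, row by row, skipping the header row."""
    return [[cell.text.strip() for cell in row.find_all('td')]
            for row in table.find_all('tr')[1:]]


# [:1] keeps only the first <tbody>; the remaining tables are never visited.
first_only = [rows_of(t) for t in soup.find_all('tbody')[:1]]
print(first_only)
```

Note that this depends on the wanted table staying first on the page; if the publisher reorders the tables, the slice will silently pick up a different one.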
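Inspecting the HTML, I see that several tables share the same id, namely inlineTable. To select the right one even if the publisher moves it elsewhere on the page, you need some other way to identify it. I noticed that the heading 'Classe de cabine' is unique to this table; in the English version of the page it probably appears as 'Cabin class'. Let's use that.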
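First, get all the tables with that id. Then look through each table's text for 'Classe de cabine'. When you find it, output its rows, skipping the header row.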
>>> import requests
>>> page = requests.get('https://www.flyingblue.com/earn-and-spend-miles/airlines/partner/180/china-eastern.html').text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page, 'lxml')
>>> required_tables = soup.select('#inlineTable')
>>> len(required_tables)
7
>>> for table in required_tables:
...     if 'Classe de cabine' in table.text:
...         rows = table.findAll('tr')
...         for row in rows[1:]:
...             row
...
<tr class="table-highlite-light">
<td rowspan="2" width="33%">Première Classe</td>
<td width="33%">F, U</td>
<td width="33%">150 %</td>
</tr>
<tr class="table-highlite-light">
<td>P</td>
<td>125 %</td>
</tr>
<tr class="table-highlite-light">
<td>Classe Affaires</td>
<td>J, C, D, I</td>
<td>125 %</td>
</tr>
<tr class="table-highlite-light">
<td>Premium Economy Classe</td>
<td>W</td>
<td>110 %</td>
</tr>
<tr class="table-highlite-light">
<td rowspan="4">Classe Économique</td>
<td>Y, B</td>
<td>100 %</td>
</tr>
<tr class="table-highlite-light">
<td>E, H, M</td>
<td>75 %</td>
</tr>
<tr class="table-highlite-light">
<td>L, N, R, S, V, K</td>
<td>50 %</td>
</tr>
<tr class="table-highlite-light">
<td>T</td>
<td>30%</td>
</tr>
<tr class="table-highlite-light">
<td>Non éligible pour l’accumulation</td>
<td>Z, Q, G</td>
<td>0 %</td>
</tr>
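The session above prints the raw <tr> elements. To get the list-of-lists shape asked for in the question, the same selection can be wrapped in a small function. This is a sketch under assumptions: the inline HTML is a trimmed, hypothetical stand-in for the real page (header labels other than 'Classe de cabine' are invented), and rows_for_heading is a name introduced here, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup; the real page reuses id="inlineTable" too.
html = """
<table id="inlineTable">
<tr><th>Distance</th><th>Total</th></tr>
<tr><td>Booking sub-class: 125 %</td><td>8,103</td></tr>
</table>
<table id="inlineTable">
<tr><th>Classe de cabine</th><th>Classe de réservation</th><th>Miles</th></tr>
<tr><td rowspan="2">Première Classe</td><td>F, U</td><td>150 %</td></tr>
<tr><td>P</td><td>125 %</td></tr>
</table>
"""


def rows_for_heading(html_text, heading):
    """Return cell texts of the first #inlineTable whose text contains heading."""
    soup = BeautifulSoup(html_text, 'html.parser')
    for table in soup.select('#inlineTable'):
        if heading in table.text:
            # Skip the header row, then collect the stripped text of each <td>.
            return [[td.get_text(strip=True) for td in tr.find_all('td')]
                    for tr in table.find_all('tr')[1:]]
    return []  # no table carries that heading


cabin_rows = rows_for_heading(html, 'Classe de cabine')
print(cabin_rows)
```

Because of the rowspan on the cabin-class cell, continuation rows such as the 'P' row come back with two cells instead of three, which matches the shape of the output shown in the question.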