Python - BeautifulSoup: getting a tag inside a tag
Question:
How do I get a tag that is nested inside another tag?
Here is the td tag:
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a></td>
I want the value of the href attribute inside it, mainly the .htm link:
<a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a>
The tags I have look like this:
<tr>
<td scope="row">1</td>
<td scope="row">10-K</td>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a></td>
<td scope="row">10-K</td>
<td scope="row">2724989</td>
</tr>
<tr class="blueRow">
<td scope="row">2</td>
<td scope="row">EXHIBIT 21.1</td>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/exhibit211q42016.htm">exhibit211q42016.htm</a></td>
<td scope="row">EX-21.1</td>
<td scope="row">21455</td>
</tr>
<tr>
<td scope="row">3</td>
<td scope="row">EXHIBIT 23.1</td>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/exhibit231q42016.htm">exhibit231q42016.htm</a></td>
<td scope="row">EX-23.1</td>
<td scope="row">4354</td>
</tr>
Code that fetches the page containing all the tags:
import requests
from bs4 import BeautifulSoup

# Filing index page that contains the table shown above.
base_url = "https://www.sec.gov/Archives/edgar/data/1085621/000108562117000004/" \
           "0001085621-17-000004-index.htm"
response = requests.get(base_url)
base_data = response.content
base_soup = BeautifulSoup(base_data, "html.parser")
Answer
You can use find_all to get all the td tags first, and then search each of those tags for anchors:
links = []
for tag in base_soup.find_all('td', {'scope': 'row'}):
    # Collect the href of every anchor nested inside this td.
    for anchor in tag.find_all('a'):
        links.append(anchor['href'])
print(links)
Output:
['/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm',
'/Archives/edgar/data/1085621/000108562117000004/exhibit211q42016.htm',
...
'/Archives/edgar/data/1085621/000108562117000004/acta-20161231_lab.xml',
'/Archives/edgar/data/1085621/000108562117000004/acta-20161231_pre.xml']
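The nested loop can also be collapsed into a single pass with a CSS selector. This is a minimal sketch, assuming Beautiful Soup 4.7 or later (where select is backed by the soupsieve package). It runs on a small inline copy of two rows from the question rather than the live page, and the names SAMPLE_HTML, sample_soup and single_pass_links are made up for the illustration:

```python
from bs4 import BeautifulSoup

# Inline sample copied from the rows in the question (not fetched from sec.gov).
SAMPLE_HTML = """
<table>
<tr>
<td scope="row">1</td>
<td scope="row">10-K</td>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a></td>
</tr>
<tr class="blueRow">
<td scope="row">2</td>
<td scope="row">EXHIBIT 21.1</td>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/exhibit211q42016.htm">exhibit211q42016.htm</a></td>
</tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 'td[scope="row"] a[href]' matches every <a> that has an href attribute and
# is a descendant of a <td scope="row">, so no inner loop is needed.
single_pass_links = [a["href"] for a in sample_soup.select('td[scope="row"] a[href]')]
print(single_pass_links)
```

The selector does the same job as the two find_all calls: the first part picks the td cells, the second part picks the anchors inside them.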
If you want, you can add a small filter to drop the links that do not end in .htm:
filtered_links = list(filter(lambda x: x.endswith('.htm'), links))
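The hrefs collected here are relative to the site root, so they have to be combined with the page URL before they can be requested. Here is a small sketch using urljoin from the standard library; the links list is a hard-coded subset of the output shown above (an assumption for the example, not a live fetch), and htm_links and absolute_links are illustrative names:

```python
from urllib.parse import urljoin

base_url = ("https://www.sec.gov/Archives/edgar/data/1085621/000108562117000004/"
            "0001085621-17-000004-index.htm")

# Hard-coded subset of the scraped hrefs, used only for illustration.
links = [
    "/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm",
    "/Archives/edgar/data/1085621/000108562117000004/acta-20161231_pre.xml",
]

# Keep only the .htm links, then resolve each one against the page URL.
htm_links = [link for link in links if link.endswith(".htm")]
absolute_links = [urljoin(base_url, link) for link in htm_links]
print(absolute_links)
```

Because each href starts with a slash, urljoin keeps only the scheme and host of base_url and replaces the whole path.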
To get only the first link, here is a slightly different version that fits your use case.
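link = None
for tag in base_soup.find_all('td', {'scope': 'row'}):
    children = tag.findChildren()
    if len(children) > 0:
        try:
            # Stop at the first td whose first child carries an href.
            link = children[0]['href']
            break
        except KeyError:
            # The first child has no href attribute; try the next td.
            continue
print(link)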
Because the loop stops at the first match, for the rows shown above this prints '/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm'.
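If only the first .htm link is needed, a CSS attribute selector can find it without an explicit loop. This is a sketch assuming Beautiful Soup 4.7 or later, where select_one supports the [href$="..."] suffix match. The inline sample is hypothetical in one respect: an .xml row is deliberately placed first, unlike on the real page, to show the suffix match skipping it. SAMPLE_HTML, sample_soup, first_anchor and first_htm are names invented for the example:

```python
from bs4 import BeautifulSoup

# Inline sample; the .xml row is placed first on purpose (not the real page order).
SAMPLE_HTML = """
<table>
<tr>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/acta-20161231_lab.xml">acta-20161231_lab.xml</a></td>
</tr>
<tr>
<td scope="row">10-K</td>
<td scope="row"><a href="/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm">actuacorp12312016.htm</a></td>
</tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one returns the first matching element in document order, or None.
# a[href$=".htm"] matches anchors whose href ends with ".htm".
first_anchor = sample_soup.select_one('td[scope="row"] a[href$=".htm"]')
first_htm = first_anchor["href"] if first_anchor is not None else None
print(first_htm)
```

Checking for None keeps the code from raising a TypeError when the page has no matching link.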
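This is a very nice solution, thank you. Is there any way to avoid looping twice? Can it be reduced to a single for loop, something like base_soup.find_all('td',{'scope':'row'{a}})? – Theo
I only want the first htm link, '/Archives/edgar/data/1085621/000108562117000004/actuacorp12312016.htm' – Theo
@Theo Give me a few minutes, I'll update. –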