网络爬虫笔记3,信息提取之Beautiul Soup库
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
基于bs4库HTML的内容遍历法
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>>
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>>
如何让HTML内容更加友好的显示?
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>\n <body>\n <p class="title">\n <b>\n The demo python introduces several python courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
信息标记的三种形式
注释以感叹号开头
三种信息标记形式的比较
信息提取的一般方法
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
print(link.get('href'))
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string = "Basic Python")
['Basic Python']
>>> soup.find_all(string = re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
>>>
import requests from bs4 import BeautifulSoup import bs4 # 获取网页信息的通用框架 def getHtmlText(url): try: r = requests.get(url, timeout=30) r.raise_for_status() r.encoding = r.apparent_encoding return r.text except: return '爬取失败' # 填充列表 def fillUnivList(ulist, html): soup = BeautifulSoup(html, 'lxml') for tr in soup.find('tbody').children: # 检查网页代码可以发现数据都储存在tboyd标签中,这里需要对tbody的儿子节点进行遍历 if isinstance(tr, bs4.element.Tag): # 检测标签类型,如果不是bs4库支持的Tag类型,就过滤掉,这里需要先导入bs4库 tds = tr('td') # 解析出tr标签中的td标签后,将其储存在列表tds中 ulist.append([tds[0].string, tds[1].string, tds[3].string]) # 我们需要的是排名、学校名称和总分 # 格式化后,输出列表数据 def printUnivList(ulist, num): tplt = '{:<10}\t{:<10}\t{:<10}' # 定义输出模板为变量tplt,\t为横向制表符,<为左对齐,10为每列的宽度 print(tplt.format('排名', '学校名称', '总分')) # format()方法做格式化输出 for i in range(num): u = ulist[i] print(tplt.format(u[0], u[1], u[2])) def main(): uinfo = [] url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2017.html' html = getHtmlText(url) fillUnivList(uinfo, html) printUnivList(uinfo, 10) # 选取前10所学校信息 main()
执行结果
C:\Users\Amber\AppData\Local\Programs\Python\Python36\python.exe C:/Users/Amber/PycharmProjects/untitled4/test.py
排名 学校名称 总分
1 清华大学 94.0
2 北京大学 81.2
3 浙江大学 77.8
4 上海交通大学 77.5
5 复旦大学 71.1
6 中国科学技术大学 65.9
7 南京大学 65.3
8 华中科技大学 63.0
9 中山大学 62.7
10 哈尔滨工业大学 61.6
Process finished with exit code 0
# CrawUnivRankingA.py import requests from bs4 import BeautifulSoup import bs4 def getHTMLText(url): try: r = requests.get(url, timeout=30) r.raise_for_status() r.encoding = r.apparent_encoding return r.text except: return "" def fillUnivList(ulist, html): soup = BeautifulSoup(html, "html.parser") for tr in soup.find('tbody').children: if isinstance(tr, bs4.element.Tag): tds = tr('td') ulist.append([tds[0].string, tds[1].string, tds[3].string]) def printUnivList(ulist, num): print("{:^10}\t{:^6}\t{:^10}".format("排名", "学校名称", "总分")) for i in range(num): u = ulist[i] print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2])) def main(): uinfo = [] url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html' html = getHTMLText(url) fillUnivList(uinfo, html printUnivList(uinfo, 20) # 20 univs main()
C:\Users\Amber\AppData\Local\Programs\Python\Python36\python.exe C:/Users/Amber/PycharmProjects/untitled4/test.py
排名 学校名称 总分
1 清华大学 94.0
2 北京大学 81.2
3 浙江大学 77.8
4 上海交通大学 77.5
5 复旦大学 71.1
6 中国科学技术大学 65.9
7 南京大学 65.3
8 华中科技大学 63.0
9 中山大学 62.7
10 哈尔滨工业大学 61.6
11 同济大学 60.8
12 东南大学 59.8
13 武汉大学 58.4
14 北京航空航天大学 58.3
15 南开大学 58.2
16 四川大学 57.4
16 西安交通大学 57.4
18 天津大学 56.2
19 华南理工大学 56.1
20 北京师范大学 55.1
Process finished with exit code 0
格式调整后:
# CrawUnivRankingB.py import requests from bs4 import BeautifulSoup import bs4 def getHTMLText(url): try: r = requests.get(url, timeout=30) r.raise_for_status() r.encoding = r.apparent_encoding return r.text except: return "" def fillUnivList(ulist, html): soup = BeautifulSoup(html, "html.parser") for tr in soup.find('tbody').children: if isinstance(tr, bs4.element.Tag): tds = tr('td') ulist.append([tds[0].string, tds[1].string, tds[3].string]) def printUnivList(ulist, num): tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}" print(tplt.format("排名", "学校名称", "总分", chr(12288))) for i in range(num): u = ulist[i] print(tplt.format(u[0], u[1], u[2], chr(12288))) def main(): uinfo = [] url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html' html = getHTMLText(url) fillUnivList(uinfo, html) printUnivList(uinfo, 20) # 20 univs main()
结果:
C:\Users\Amber\AppData\Local\Programs\Python\Python36\python.exe C:/Users/Amber/PycharmProjects/untitled4/test.py
排名 学校名称 总分
1 清华大学 95.9
2 北京大学 82.6
3 浙江大学 80
4 上海交通大学 78.7
5 复旦大学 70.9
6 南京大学 66.1
7 中国科学技术大学 65.5
8 哈尔滨工业大学 63.5
9 华中科技大学 62.9
10 中山大学 62.1
11 东南大学 61.4
12 天津大学 60.8
13 同济大学 59.8
14 北京航空航天大学 59.6
15 四川大学 59.4
16 武汉大学 59.1
17 西安交通大学 58.9
18 南开大学 58.3
19 大连理工大学 56.9
20 山东大学 56.3
Process finished with exit code 0