Web Crawler Notes 3: Information Extraction with the Beautiful Soup Library

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

Traversing HTML content with the bs4 library

>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>>
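The `.contents` list above mixes tags with newline strings, which is why its length is 5. As a minimal sketch (the HTML snippet and variable names here are made up for illustration, not taken from the demo page), `.children` iterates over the same direct children, while `.descendants` walks every nested node:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened stand-in for the demo page
html = ('<html><head><title>Demo</title></head>\n'
        '<body>\n<p class="title"><b>Bold text</b></p>\n'
        '<p class="course">See <a id="link1" href="http://example.com/1">one</a></p>\n'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

# .children yields direct children only; filter out the newline strings
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants yields children, grandchildren, and so on, in document order
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

Filtering with `isinstance(..., Tag)` is the same check the ranking script later in these notes applies to the children of `tbody`.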

>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

</body></html>

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>>
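As the transcript shows, `.next_sibling` can return a plain string such as ' and ' rather than a tag. A small sketch of two ways to reach the next tag sibling, assuming a made-up snippet shaped like the demo paragraph (the URLs and text are placeholders):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph with two links separated by text
html = ('<p class="course">Intro text: '
        '<a id="link1" href="http://example.com/1">Basic</a> and '
        '<a id="link2" href="http://example.com/2">Advanced</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# .next_siblings yields every following node under the same parent, text included
all_siblings = list(first.next_siblings)

# Option 1: keep only the Tag objects
tag_ids = [s["id"] for s in all_siblings if isinstance(s, Tag)]

# Option 2: let Beautiful Soup search for the next sibling that is an <a> tag
next_link = first.find_next_sibling("a")

print(len(all_siblings), tag_ids, next_link["id"])
```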

How can HTML content be displayed in a more readable way?

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Three forms of information markup

Comments begin with an exclamation mark (as in the XML form <!-- ... -->).
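Assuming the three forms meant here are XML, JSON, and YAML (an assumption, since the notes do not name them), the sketch below marks up the same made-up record in each form. Only XML and JSON are parsed, using the standard library; the YAML version is shown as text because parsing it would need a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# XML: tags carry the structure; the comment line is ignored by the parser
xml_text = """<person>
  <!-- this line is a comment -->
  <name>Demo User</name>
  <course>Basic Python</course>
  <course>Advanced Python</course>
</person>"""

# JSON: typed key/value pairs; there is no comment syntax
json_text = '{"name": "Demo User", "course": ["Basic Python", "Advanced Python"]}'

# YAML: untyped key/value pairs, indentation for nesting, "#" for comments
yaml_text = """name: Demo User   # shown for comparison only, not parsed here
course:
  - Basic Python
  - Advanced Python
"""

root = ET.fromstring(xml_text)
from_xml = {"name": root.findtext("name"),
            "course": [c.text for c in root.findall("course")]}
from_json = json.loads(json_text)

print(from_xml == from_json)
```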

Comparison of the three information markup forms

General methods of information extraction

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string = "Basic Python")
['Basic Python']
>>> soup.find_all(string = re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
>>>
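The last result leaves out 'Basic Python' and 'Advanced Python' because the pattern 'python' is case-sensitive. The sketch below reproduces that behaviour with the standard `re` module alone, using four strings copied from the demo page; passing a pattern compiled with `re.IGNORECASE` to `find_all(string=...)` should widen the match in the same way.

```python
import re

# Text strings taken from the demo page
strings = [
    "This is a python demo page",
    "The demo python introduces several python courses.",
    "Basic Python",
    "Advanced Python",
]

# Case-sensitive search: only lowercase "python" matches
case_sensitive = [s for s in strings if re.search("python", s)]

# Case-insensitive search: "Python" matches as well
case_insensitive = [s for s in strings if re.search("python", s, re.IGNORECASE)]

print(len(case_sensitive), len(case_insensitive))
```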


import requests
from bs4 import BeautifulSoup
import bs4


# General framework for fetching a web page
def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return 'Crawl failed'


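# Fill the list
def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, 'lxml')
    for tr in soup.find('tbody').children:
        # Inspecting the page source shows the data is stored in the tbody tag, so iterate over tbody's child nodes
        if isinstance(tr, bs4.element.Tag):
            # Check the node type: anything that is not a bs4 Tag is filtered out (this is why bs4 is imported)
            tds = tr('td')
            # The td tags found inside this tr are stored in the list tds
            ulist.append([tds[0].string, tds[1].string, tds[3].string])
            # We need the rank, the university name, and the total score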


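# Print the list data after formatting
def printUnivList(ulist, num):
    tplt = '{:<10}\t{:<10}\t{:<10}'
    # Output template tplt: \t is a horizontal tab, < left-aligns, 10 is the width of each column
    print(tplt.format('排名', '学校名称', '总分'))
    # Header row (rank, university name, total score), formatted with format()
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2]))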


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2017.html'
    html = getHtmlText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 10)
    # Print the first 10 universities


main()

Execution result

C:\Users\Amber\AppData\Local\Programs\Python\Python36\python.exe C:/Users/Amber/PycharmProjects/untitled4/test.py
排名            学校名称          总分        
1             清华大学          94.0      
2             北京大学          81.2      
3             浙江大学          77.8      
4             上海交通大学        77.5      
5             复旦大学          71.1      
6             中国科学技术大学      65.9      
7             南京大学          65.3      
8             华中科技大学        63.0      
9             中山大学          62.7      
10            哈尔滨工业大学       61.6      

Process finished with exit code 0
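One fragile spot in the script above: when the download fails, the fetch function returns a short error string, `soup.find('tbody')` then returns `None`, and `.children` raises an `AttributeError`. A hedged sketch of a more defensive row parser follows; `parse_rows` and the sample table (including its third column) are hypothetical and only mimic the column positions used above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def parse_rows(html):
    """Return [rank, name, score] rows, or [] when no <tbody> is present."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:            # e.g. the download failed and html is empty
        return []
    rows = []
    for tr in tbody.children:
        if not isinstance(tr, Tag):   # skip newline strings between rows
            continue
        tds = tr("td")                # shorthand for tr.find_all("td")
        if len(tds) < 4:              # skip rows that lack the expected columns
            continue
        rows.append([tds[0].string, tds[1].string, tds[3].string])
    return rows

# Made-up table with the same column layout: rank, name, (unused), score
sample = """<table><tbody>
<tr><td>1</td><td>Univ A</td><td>Region X</td><td>94.0</td></tr>
<tr><td>2</td><td>Univ B</td><td>Region Y</td><td>81.2</td></tr>
</tbody></table>"""

good_rows = parse_rows(sample)
empty_rows = parse_rows("")
print(good_rows, empty_rows)
```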

# CrawUnivRankingA.py
import requests
from bs4 import BeautifulSoup
import bs4


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名", "学校名称", "总分"))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # 20 univs


main()

C:\Users\Amber\AppData\Local\Programs\Python\Python36\python.exe C:/Users/Amber/PycharmProjects/untitled4/test.py
排名            学校名称          总分        
1             清华大学          94.0      
2             北京大学          81.2      
3             浙江大学          77.8      
4             上海交通大学        77.5      
5             复旦大学          71.1      
6             中国科学技术大学      65.9      
7             南京大学          65.3      
8             华中科技大学        63.0      
9             中山大学          62.7      
10            哈尔滨工业大学       61.6      
11            同济大学          60.8      
12            东南大学          59.8      
13            武汉大学          58.4      
14            北京航空航天大学      58.3      
15            南开大学          58.2      
16            四川大学          57.4      
16            西安交通大学        57.4      
18            天津大学          56.2      
19            华南理工大学        56.1      
20            北京师范大学        55.1      

Process finished with exit code 0

After adjusting the format:

# CrawUnivRankingB.py
import requests
from bs4 import BeautifulSoup
import bs4


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # 20 univs


main()

Result:

C:\Users\Amber\AppData\Local\Programs\Python\Python36\python.exe C:/Users/Amber/PycharmProjects/untitled4/test.py
    排名           学校名称           总分    
    1            清华大学          95.9   
    2            北京大学          82.6   
    3            浙江大学           80    
    4           上海交通大学         78.7   
    5            复旦大学          70.9   
    6            南京大学          66.1   
    7          中国科学技术大学        65.5   
    8          哈尔滨工业大学         63.5   
    9           华中科技大学         62.9   
    10           中山大学          62.1   
    11           东南大学          61.4   
    12           天津大学          60.8   
    13           同济大学          59.8   
    14         北京航空航天大学        59.6   
    15           四川大学          59.4   
    16           武汉大学          59.1   
    17          西安交通大学         58.9   
    18           南开大学          58.3   
    19          大连理工大学         56.9   
    20           山东大学          56.3   

Process finished with exit code 0
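The adjustment in CrawUnivRankingB.py comes down to the fill character. By default `format()` pads with the ASCII space, which is narrower than a Chinese character in most fonts, so the columns drift; `chr(12288)` is the full-width ideographic space (U+3000), which has the same width as the characters it pads. A small sketch, where `center_cjk` is a hypothetical helper name:

```python
def center_cjk(text, width):
    """Center text in `width` characters, padding with U+3000 instead of ASCII space."""
    # The nested fields {1} and {2} supply the fill character and the width
    return "{0:{1}^{2}}".format(text, chr(12288), width)

ascii_padded = "{:^10}".format("清华大学")   # padded with 6 ASCII spaces
cjk_padded = center_cjk("清华大学", 10)      # padded with 6 full-width spaces

print(repr(ascii_padded))
print(repr(cjk_padded))
```

Both strings are 10 characters long; only the second one lines up visually when every row contains full-width characters.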