The BeautifulSoup Library in Python

Parsers

BeautifulSoup(mk, 'html.parser')

BeautifulSoup(mk, 'lxml')

BeautifulSoup(mk, 'xml')

BeautifulSoup(mk, 'html5lib')
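As a minimal sketch of the constructor calls above, the snippet below parses a small string with the built-in parser. The sample markup assigned to mk is a hypothetical stand-in, and only 'html.parser' is used because the other three parsers depend on the separately installed lxml or html5lib packages.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for `mk`.
mk = "<html><body><p class='title'>Hello</p></body></html>"

# 'html.parser' ships with Python; 'lxml', 'xml' and 'html5lib'
# only work after the lxml / html5lib packages are installed.
soup = BeautifulSoup(mk, "html.parser")

first_p = soup.p          # first <p> tag in the document
print(first_p.name)       # the tag's name
print(first_p.string)     # the text inside the tag
print(first_p["class"])   # class is multi-valued, so this is a list
```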


Sibling (parallel) nodes are nodes that share the same parent node.
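A short sketch of sibling navigation, assuming a made-up fragment with no whitespace between tags (with whitespace, a sibling may be a text node rather than a tag):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <a>, <b>, <i> are siblings inside <div>;
# <p> sits outside the <div>, so it is not their sibling.
html = "<div><a>one</a><b>two</b><i>three</i></div><p>outside</p>"
soup = BeautifulSoup(html, "html.parser")

middle = soup.b
prev_tag = middle.previous_sibling   # the <a> tag
next_tag = middle.next_sibling       # the <i> tag
print(prev_tag.name, next_tag.name)

# Siblings share one parent node.
print(prev_tag.parent is middle.parent)

# next_siblings iterates over every following sibling in order.
following = [t.name for t in soup.a.next_siblings]
print(following)
```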


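prettify()

Prints the markup in a cleaner, more readable form, with each tag and string on its own line.

Comments are written as <!-- comment -->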
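The following sketch shows both points on a made-up fragment: prettify() returns an indented string, and a comment in the markup is parsed into a Comment object rather than ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment containing a comment and a tag.
html = "<p><!-- a comment --><b>bold text</b></p>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()   # returns a str, one tag or string per line
print(pretty)

first = soup.p.contents[0]             # first child of <p>
is_comment = isinstance(first, Comment)
print(is_comment, str(first).strip())  # the comment text without the markers
```

Checking for Comment matters because .string on a tag that holds only a comment returns the comment text, which is easy to mistake for visible page content.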


Parameters of find_all:

name: the tag name to search for

attrs: the tag attribute values to search for


recursive: whether to search all descendants rather than only direct children; defaults to True


string: the text to search for in the string region between <>...</>
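The four parameters can be compared side by side in a small sketch. The markup, class names and link targets below are invented for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two direct <a> children and one nested in <p>.
html = (
    "<div id='top'>"
    "<a class='link' href='/x'>first</a>"
    "<p><a class='link' href='/y'>second</a></p>"
    "<a class='other' href='/z'>third</a>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# name: match on tag name, searching all descendants by default.
by_name = div.find_all("a")

# attrs: restrict matches by attribute value.
by_attrs = div.find_all("a", attrs={"class": "link"})

# recursive=False: look at direct children only.
direct_only = div.find_all("a", recursive=False)

# string: search the text nodes instead of the tags.
by_string = div.find_all(string="second")

print(len(by_name))
print([a.string for a in by_attrs])
print([a.string for a in direct_only])
print(by_string)
```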


soup.find_all(['a','b']) retrieves all a and b tags

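import re

for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

re is Python's regular expression library. find_all tests a compiled pattern against each tag name with search(), so '^b' prints every tag whose name starts with b (for example body and b); without the ^ anchor, 'b' would match any tag name that contains b.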

<tag>(..) is equivalent to <tag>.find_all(..)

soup(..) is equivalent to soup.find_all(..)
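A quick way to confirm the shorthand is to compare both forms on the same input. The fragment here is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <b> tags and one <a> tag.
soup = BeautifulSoup("<div><b>x</b><a>y</a><b>z</b></div>", "html.parser")

short_form = soup("b")           # calling the object directly
long_form = soup.find_all("b")   # the explicit method call
print(short_form == long_form)

# The same shorthand works on any tag, with any find_all argument.
nested = soup.div(["a", "b"])
print([t.name for t in nested])
```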


Formatted output with format
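One detail worth sketching is column alignment for Chinese text. The default fill character is the half-width space, which is narrower than a Chinese character, so centered columns drift. Passing the full-width space chr(12288) as the fill character is one common workaround. The template and row values below are placeholders, not real ranking data.

```python
# chr(12288) is U+3000, the full-width (ideographic) space.
FULLWIDTH_SPACE = chr(12288)

# {1:{3}^8} centers field 1 in 8 characters, using argument 3 as the fill.
tplt = "{0:^6}|{1:{3}^8}|{2:^6}"

# Placeholder row: rank, a made-up school name, a made-up score.
row = tplt.format("1", "示例大学", "95.9", FULLWIDTH_SPACE)
print(row)

# The middle column is padded with full-width spaces only.
middle = row.split("|")[1]
print(len(middle), middle.count(FULLWIDTH_SPACE))
```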


Python web scraper example

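import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Guess the encoding from the body so Chinese text decodes correctly.
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    """Append [rank, school, province, score] for each table row to ulist."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty page, or no ranking table in it
        return
    for tr in tbody.children:
        # Children also include whitespace strings; keep only real tags.
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            if len(tds) >= 4:
                # .string is None for cells with nested markup; fall back to "".
                ulist.append([td.string or "" for td in tds[:4]])


def printUnivlist(ulist, num):
    """Print the first num rows as an aligned table."""
    # chr(12288) is the full-width space, used as fill so Chinese names align.
    tplt = "{0:^10}\t{1:{4}^15}\t{2:^10}\t{3:^10}"
    print(tplt.format("排名", "学校", "省市", "总分", chr(12288)))
    # Slicing avoids an IndexError when fewer than num rows were scraped.
    for u in ulist[:num]:
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))
    print("Suc" + str(num))


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivlist(uinfo, 20)


if __name__ == "__main__":
    main()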
