Python: The BeautifulSoup Library

 


Parsers

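BeautifulSoup(mk, 'html.parser')   (Python's built-in HTML parser, no extra install)

BeautifulSoup(mk, 'lxml')   (lxml's HTML parser, requires the lxml package)

BeautifulSoup(mk, 'xml')   (lxml's XML parser, requires the lxml package)

BeautifulSoup(mk, 'html5lib')   (requires the html5lib package)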
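
As a minimal sketch of the first call above, the snippet below parses a small document with the built-in parser. The markup in `mk` is made-up sample data, not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; `mk` mirrors the variable name used in the notes.
mk = "<html><head><title>Demo</title></head><body><p class='intro'>Hello</p></body></html>"

# 'html.parser' ships with Python, so nothing beyond bs4 itself is needed.
soup = BeautifulSoup(mk, "html.parser")

title_text = soup.title.string   # text inside <title>
p_class = soup.p["class"]        # class is multi-valued, so bs4 returns a list
print(title_text, p_class)
```

Swapping in 'lxml' or 'html5lib' should only require installing the corresponding package first.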


Sibling (parallel) traversal only happens between nodes that share the same parent.
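
The same-parent rule can be checked with a short sketch. The markup is invented for illustration. On real pages the next sibling is often a whitespace string rather than a tag, so the sample deliberately has no whitespace between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> tags share a <div>; the <span> sits outside it.
mk = "<div><p>one</p><p>two</p></div><span>outside</span>"
soup = BeautifulSoup(mk, "html.parser")

first_p = soup.find("p")
second_p = first_p.next_sibling              # the second <p>, same parent <div>
same_parent = first_p.parent is second_p.parent
after_last = second_p.next_sibling           # None: <span> has a different parent
print(second_p.string, same_parent, after_last)
```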


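prettify()

Prints the markup in a cleaner, more readable layout, with each tag and string on its own line.

Comments: <!-- comment -->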
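
A small sketch of both points, using invented markup: `prettify()` returns the re-indented document as a string, and a comment can be told apart from ordinary text by checking for the `Comment` type.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup with one comment and one ordinary string.
mk = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(mk, "html.parser")

pretty = soup.prettify()                         # one tag or string per line
is_comment = isinstance(soup.b.string, Comment)  # the comment node
is_plain = isinstance(soup.p.string, Comment)    # ordinary text is not a Comment
print(pretty)
print(is_comment, is_plain)
```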


name: the tag name to search for

attrs: a search string for tag attribute values


recursive: whether to search all descendants; defaults to True


string: a search string matched against the text region inside <>...</>
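
The four find_all() parameters listed above can be sketched together as follows. The markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <a> directly under the <div>, one nested inside a <p>.
mk = ('<body><div id="top"><a href="/home" class="nav">Home</a>'
      '<p><a href="/docs">Docs</a></p></div></body>')
soup = BeautifulSoup(mk, "html.parser")
div = soup.find("div")

by_name = div.find_all("a")                            # name: every <a> descendant
by_attrs = div.find_all("a", attrs={"class": "nav"})   # attrs: filter on attribute values
direct_only = div.find_all("a", recursive=False)       # recursive=False: direct children only
by_string = soup.find_all(string="Docs")               # string: matches text nodes
print(len(by_name), len(by_attrs), len(direct_only), by_string)
```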


soup.find_all(['a','b']) finds all a and b tags

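import re

for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

re is Python's regular expression library.

BeautifulSoup applies the pattern with search(), so the ^ anchor is needed: this prints every tag whose name starts with b. Without the ^, any tag whose name merely contains a b (such as tbody) would also match.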

<tag>(..) is equivalent to <tag>.find_all(..)

soup(..) is equivalent to soup.find_all(..)
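
To illustrate the call shorthand together with list and regex filters, here is a brief sketch on invented markup:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup containing <body>, <b>, <a> and <br> tags.
mk = "<html><body><b>bold</b><a href='#'>link</a><br/></body></html>"
soup = BeautifulSoup(mk, "html.parser")

same_result = soup("a") == soup.find_all("a")            # calling the object is find_all
names_list = [t.name for t in soup(["a", "b"])]          # a list matches any listed name
names_b = [t.name for t in soup(re.compile("^b"))]       # names starting with b
print(same_result, names_list, names_b)
```

Results come back in document order, which is why the list filter yields b before a here.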


Formatted output with format()
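
A minimal sketch of centered columns with str.format(), assuming Chinese text in one column. The fill character is passed as an extra argument and referenced inside the format spec; chr(12288) is the full-width (ideographic) space, which pads Chinese text more evenly than an ASCII space. The rows are sample data chosen for illustration.

```python
# Sample rows: rank, university name, region (illustrative data only).
rows = [("1", "清华大学", "北京"), ("2", "浙江大学", "浙江")]

# {3} inside the spec of field 1 is replaced by the fill character argument.
tplt = "{0:^6}{1:{3}^10}{2:^6}"
fill = chr(12288)  # U+3000, full-width space

lines = [tplt.format(rank, name, region, fill) for rank, name, region in rows]
for line in lines:
    print(line)
```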


Python crawler example

import requests
from bs4 import BeautifulSoup
import bs4
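def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""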
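def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty page or no table found
        return
    for tr in tbody.children:
        # .children also yields whitespace strings, so keep only tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            if len(tds) >= 4:
                ulist.append([td.string for td in tds[:4]])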
        
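def printUnivlist(ulist, num):
    # {4} is the fill character; chr(12288) is the full-width space,
    # which keeps the Chinese university names aligned
    tplt = "{0:^10}\t{1:{4}^15}\t{2:^10}\t{3:^10}"
    # Header row: rank, university, province/city, total score
    print(tplt.format("排名", "学校", "省市", "总分", chr(12288)))
    for i in range(min(num, len(ulist))):  # avoid IndexError on short lists
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))

    print("Suc" + str(num))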
def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivlist(uinfo, 20)  # print the top 20 entries


if __name__ == '__main__':
    main()