Python 2.7: scraping photos of pretty girls with urllib2 and lxml's etree/XPath

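    This method is suitable for beginners and uses a function-based (procedural) style. The site hosts a very large number of images, so only the 校花 (campus beauties) category is crawled here. Let's get started.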


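First, import the modules (only lxml is third-party; os and urllib2 ship with Python 2.7) and define the main block: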

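import os       # used to create the output directory
import urllib2
from lxml import etree

# Note: in the actual .py file this block must come after the function
# definitions shown later, because it calls loadPage().
if __name__ == "__main__":
    urls = []  # listing-page URLs of the 校花 category
    url = "http://www.mm131.com/xiaohua/"
    urls.append(url)  # page 1 is the category URL itself
    for pn in range(2, 7):  # the category has six pages in total
        page = 'list_2_' + str(pn) + '.html'
        fullurl = url + page    # build the complete URL
        urls.append(fullurl)
    for link in urls:
        loadPage(link)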

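Watching the address bar while paging through the category reveals the pattern: page 1 is the category URL itself, and pages 2 through 6 are list_2_2.html through list_2_6.html under it.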
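
The pattern can be captured in a small helper. This is a minimal sketch, assuming the list_2_<n>.html naming used in the main block; the function name `build_page_urls` and its `last_page` parameter are hypothetical and not part of the original script.

```python
def build_page_urls(base_url, last_page):
    """Return the listing-page URLs for one category.

    Page 1 is the category URL itself; pages 2..last_page follow the
    list_2_<n>.html pattern observed in the address bar (an assumption
    taken from this article, not a documented site API).
    """
    urls = [base_url]
    for pn in range(2, last_page + 1):
        urls.append(base_url + "list_2_" + str(pn) + ".html")
    return urls


# The 校花 category had six pages when the article was written.
page_urls = build_page_urls("http://www.mm131.com/xiaohua/", 6)
```

Keeping the page count as a parameter means the same helper could be reused if the category grows, instead of editing a hard-coded loop bound.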


Next, the request function is called on each of these six page URLs:

def loadPage(link):  # Request a listing page and collect the link to each photo set (its first image page)
    headers = {  # Request headers that mimic a browser, to reduce the risk of the IP being blocked
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36",
    }
    request = urllib2.Request(link, headers=headers)  # Send the request and read the response
    html = urllib2.urlopen(request).read()
    content = etree.HTML(html)  # Parse the returned HTML page
    # Extract the links to all photo sets on this page
    link_list = content.xpath('//dl[@class="list-left public-box"]//dd//a[@target="_blank"]/@href')
    for link in link_list:
        morePage(link)
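
To see what the XPath expression selects, here is a small offline sketch. It assumes a made-up, well-formed HTML fragment shaped like the listing page (the real markup may differ, and the second link is invented), and it uses the standard library's `xml.etree.ElementTree` in place of lxml so that it runs without extra installs. ElementTree supports only a subset of XPath and cannot return attributes with `/@href`, so the attribute is read with `.get()` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the listing page; not copied from the site.
SAMPLE = """
<html><body>
  <dl class="list-left public-box">
    <dd><a target="_blank" href="http://www.mm131.com/xiaohua/2001.html">set A</a></dd>
    <dd><a target="_blank" href="http://www.mm131.com/xiaohua/2002.html">set B</a></dd>
    <dd class="page"><a href="list_2_2.html">next page</a></dd>
  </dl>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Same path as in loadPage(), minus the trailing /@href step.
anchors = root.findall('.//dl[@class="list-left public-box"]//dd//a[@target="_blank"]')

# Only anchors that open in a new tab are photo sets; the pager link is skipped.
album_links = [a.get("href") for a in anchors]
```

The `@target="_blank"` predicate is what separates photo-set links from the pagination link in this fragment, which is presumably why the original expression includes it.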

Each photo set contains more than one image, so the next step is to collect all the pages of a set:

def morePage(link):  # Collect the pages of one photo set (one image per page)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36",
    }
    request = urllib2.Request(link, headers=headers)
    html = urllib2.urlopen(request).read()
    u = "http://www.mm131.com/xiaohua/"  # base URL used to complete relative links
    content = etree.HTML(html)  # Parse the HTML page
    # Extract the pagination links of the set
    link_list = content.xpath('//div[@class="content-page"]//a/@href')

The links that XPath returns here are relative paths, not complete URLs.


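Next, build the complete URLs. This loop is the rest of the body of morePage, so it is indented to match:

    for link in link_list:
        fullurl = u + link  # prepend the base URL to the relative link
        #print(fullurl)
        loadImg(fullurl)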
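
String concatenation works here because every link is relative to the same directory. As a hedged alternative, the standard library's `urljoin` resolves a link against a base URL and leaves already-absolute links unchanged. The relative links below are hypothetical examples of what the pagination XPath might return.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

base = "http://www.mm131.com/xiaohua/"

# Hypothetical relative links, for illustration only.
relative_links = ["2001_2.html", "2001_3.html"]

# Resolve each relative link against the category base URL.
full_urls = [urljoin(base, link) for link in relative_links]

# An absolute link passes through untouched instead of being glued to the base.
already_absolute = urljoin(base, "http://example.com/a.html")
```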

The loop calls the function that fetches the image link from each page:

def loadImg(fullurl):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36",
        "Referer": "http://www.mm131.com/xiaohua/2001.html", # Referer is added here to avoid being redirected
    }
    request = urllib2.Request(fullurl, headers=headers)
    html = urllib2.urlopen(request).read()
    content = etree.HTML(html)
    link_list = content.xpath('//div[@class="content-pic"]//a//img/@src')  # Image URL(s) on this page
    filenames = content.xpath('//div[@class="content"]/h5/text()')         # Title text, used as the file name
    for link in link_list:  # Download each image URL found
        writeImg(link,filenames)
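
The same headers dictionary is repeated in every function. One way to avoid the repetition is a small helper that builds the request. This is only a sketch: the name `build_request` and the `referer` parameter are hypothetical, and the import is written so that it runs on both Python 2.7 and Python 3. Creating the `Request` object does not contact the network.

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request         # Python 2.7

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36")


def build_request(url, referer=None):
    """Create a Request with browser-like headers; nothing is sent yet."""
    headers = {"User-Agent": USER_AGENT}
    if referer is not None:
        # Some sites check Referer before serving a page or an image.
        headers["Referer"] = referer
    return Request(url, headers=headers)


req = build_request("http://www.mm131.com/xiaohua/2001_2.html",
                    referer="http://www.mm131.com/xiaohua/2001.html")
```

The result can be passed to `urlopen` exactly like the requests built by hand in the functions above.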

Finally, save the images:

def writeImg(link, filenames):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        "Referer": "http://www.mm131.com/xiaohua/2001.html",
    }
    request = urllib2.Request(link, headers = headers)
    image = urllib2.urlopen(request).read()  # raw bytes of the image
    post = link[-4:]  # file extension taken from the end of the URL, e.g. ".jpg"
    file_path = "./Girls/"  # output directory
    if not os.path.exists(file_path):
        os.mkdir(file_path)  # create it on first use
    for filename in filenames:
        name = filename + post
        with open(file_path+name, "wb") as f:
            f.write(image)
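
Taking the last four characters of the URL as the extension only works for three-letter extensions with no query string. A slightly more defensive variant is sketched below, assuming the title text may contain characters that are not valid in file names. The helper names `build_image_path` and `save_image` are hypothetical, and the usage lines write fake bytes to a temporary directory rather than downloading anything.

```python
import os
import tempfile

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def build_image_path(directory, title, image_url):
    """Combine a cleaned-up title with the extension found in the URL path."""
    # splitext on the URL path ignores any ?query part; fall back to .jpg.
    ext = os.path.splitext(urlparse(image_url).path)[1] or ".jpg"
    # Replace characters that are not allowed in file names on Windows.
    safe_title = "".join("_" if ch in '\\/:*?"<>|' else ch for ch in title).strip()
    return os.path.join(directory, safe_title + ext)


def save_image(data, path):
    """Write image bytes to path, creating the parent directory if needed."""
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    with open(path, "wb") as f:
        f.write(data)


# Usage with placeholder data in a temporary directory.
out_dir = os.path.join(tempfile.mkdtemp(), "Girls")
path = build_image_path(out_dir, "sample: set (1)", "http://example.com/pic/2001/1.jpg")
save_image(b"fake image bytes", path)
```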

Then run the .py file to download all the images, a little over a thousand in total. Enjoy browsing them at your leisure.