Scraping Hupu Bxj (虎扑步行街): Downloading "秋名山论美" Wallpaper Images
I've been learning web scraping recently, and since I browse Hupu all the time, I decided to put it into practice by scraping the beauty wallpaper images posted there. Here is a write-up of the experience.
The project mainly uses two libraries, requests and BeautifulSoup, plus urllib to download the images.
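For anyone new to these libraries, here is a minimal sketch of the fetch-and-parse pattern used throughout this post (the URL, User-Agent string, and selector below are placeholders, not the real targets):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder User-Agent
response = requests.get('https://example.com', headers=headers)  # placeholder URL
soup = BeautifulSoup(response.content, 'lxml')  # parse the HTML with the lxml parser
for a in soup.select('a'):  # select() takes a CSS selector and returns matching tags
    print(a.get('href'))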
First, go to Hupu Bxj and search for the keyword "秋名山论美" to get the list of matching threads.
Press F12 to open the browser's inspector, switch to the Network tab, refresh the page, and copy the request headers. Looking at the URL, the 1 in &page=1 is the page number, so we can build the list-page URLs in a loop:
for i in range(1, 17):
    page_num = '&page='
    i = str(i)
    url = 'https://my.hupu.com/search?q=%E3%80%90%E7%A7%8B%E5%90%8D%E5%B1%B1%E8%AE%BA%E7%BE%8E%E3%80%91' + page_num + i
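As an aside, the percent-encoded string in that URL is just the UTF-8 URL encoding of the bracketed keyword 【秋名山论美】. If you'd rather not hard-code it, urllib.parse.quote can generate it (a small sketch):

from urllib.parse import quote

keyword = '【秋名山论美】'  # quote() percent-encodes the keyword for use in the URL
urls = ['https://my.hupu.com/search?q=' + quote(keyword) + '&page=' + str(i)
        for i in range(1, 17)]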
Next, right-click the title of each issue and inspect where it sits in the page source.
Then use a CSS selector with BeautifulSoup's select() to extract the thread URLs:
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
img_ = soup.select('.mytopic.topiclisttr tbody tr .p_title a')
for _url in img_:
    img_url = _url['href']
    url_list.append(img_url)
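One caveat: if the href values on the search page turn out to be relative paths rather than full URLs, urllib.parse.urljoin can normalize them against the page URL (a defensive sketch, not something the original code needed):

from urllib.parse import urljoin

for _url in img_:
    img_url = urljoin(url, _url['href'])  # no-op for absolute hrefs, resolves relative ones
    url_list.append(img_url)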
Then open one of the 秋名山论美 threads, right-click a wallpaper image and inspect it to find its position in the DOM, and extract it with a selector in the same way.
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
img_ = soup.select(".floor .floor-show .floor_box tbody tr td .quote-content img")

Next, create a folder for each issue. Inspecting the image URLs shows that the part before the '?' is the actual image address, so split on '?' to extract it, then download and save the images:
if not os.path.exists(path):
    os.mkdir(path)
os.chdir(path)
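Incidentally, os.makedirs with exist_ok=True collapses that check-and-create pair into a single call and also creates any missing parent directories:

os.makedirs(path, exist_ok=True)  # replaces the exists()/mkdir() pair above
os.chdir(path)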
try:
    if '?' in img['data-original']:
        img_url = img['data-original'].split('?')[0]
    else:
        continue
except KeyError:
    print(i)
    continue

try:
    content = urllib.request.urlopen(img_url)
except urllib.error.HTTPError:
    continue
content = content.read()
with open(name, 'wb') as f:
    f.write(content)
time.sleep(0.2)
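If you prefer to stay within requests instead of mixing in urllib, the download step inside the per-image loop can be written like this (a sketch; the timeout value is my own addition, not part of the original code):

try:
    resp = requests.get(img_url, headers=headers, timeout=10)
    resp.raise_for_status()  # raise on 4xx/5xx instead of catching HTTPError separately
except requests.RequestException:
    continue  # skip this image, same as the urllib version
with open(name, 'wb') as f:
    f.write(resp.content)
time.sleep(0.2)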
Finally, the full source code for reference:
import requests
import urllib.request
import urllib.error
import os
import re
import time
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36'
}

url_list = []
for i in range(1, 17):  # issue 17's number is written in Chinese characters; there are 15 issues in all, so number_i can be used for naming instead
    page_num = '&page='
    i = str(i)
    url = 'https://my.hupu.com/search?q=%E3%80%90%E7%A7%8B%E5%90%8D%E5%B1%B1%E8%AE%BA%E7%BE%8E%E3%80%91' + page_num + i
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    img_ = soup.select('.mytopic.topiclisttr tbody tr .p_title a')
    for _url in img_:
        img_url = _url['href']
        url_list.append(img_url)

# number_i = 15
for url in url_list:
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    img_ = soup.select(".floor .floor-show .floor_box tbody tr td .quote-content img")
    # *********************************************************************
    title = soup.select(".subhead span")[0].string
    s = ''
    s = s.join(title)
    print(s)
    try:
        number = re.findall(r'第(\d+)期', s)[0]  # extract the issue number from the thread title
    except IndexError:
        continue
    # number = str(number_i)
    # number_i -= 1
    path = 'E:\\project\\pachong' + '\\' + number  # backslashes escaped for the Windows path
    # ************************************************************************
    if not os.path.exists(path):
        os.mkdir(path)
    os.chdir(path)
    for i, img in enumerate(img_):
        i = str(i)
        print(img)
        try:
            if '?' in img['data-original']:
                img_url = img['data-original'].split('?')[0]  # the part before '?' is the real image URL
            else:
                continue
        except KeyError:
            print(i)
            continue
        print(i, img_url)
        name = number + '_' + i + '.' + 'jpg'
        try:
            content = urllib.request.urlopen(img_url)
        except urllib.error.HTTPError:
            continue
        content = content.read()
        with open(name, 'wb') as f:
            f.write(content)
        time.sleep(0.2)
    time.sleep(1)