Python Web Scraping (5): Hands-on [4. Scraping Amazon]

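Goal: search for a product on the Amazon site and scrape the products (name and price) from the first 10 result pages.

Step 1: Request the site and disguise the crawler

Amazon restricts crawlers fairly strictly, so set custom headers, cookies, and a proxy IP.

To get the cookies: press F12 and enter document.cookie in the console (it is a property, so no parentheses).

Note: requests expects the cookies as a dictionary, e.g. {'a':'1','b':'2','c':'3'}

It is best to convert them by hand. I used Notepad to replace every = with : and it broke, because the cookie values themselves also contain =.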
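The manual conversion can also be scripted. The following is a minimal sketch, assuming the string copied from document.cookie has the usual "name=value; name=value" shape; the helper name cookie_string_to_dict is hypothetical. It splits each pair on the first = only, so any = inside a value is left alone.

```python
def cookie_string_to_dict(raw):
    """Convert a 'k1=v1; k2=v2' cookie string into a dict (hypothetical helper)."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments, e.g. after a trailing ';'
        # partition splits on the FIRST '=' only, so '=' inside the value survives
        key, _, value = pair.partition('=')
        cookies[key] = value
    return cookies


# Made-up sample string; note the '=' characters inside the value of b
cookie = cookie_string_to_dict('a=1; b=x=y==; c=3')
print(cookie)
```

The resulting dictionary can be passed directly as the cookies argument of requests.get.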

 

import requests

url = 'https://www.amazon.cn/s/field-keywords=spark'

# Browser-like User-Agent header
head = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

# Proxy IP. Note: requests selects a proxy by URL scheme, so this "http" entry
# is not applied to the https URL above; that would need an "https" key.
proxy_id = { "http": "http://61.135.155.82:443"}

# Cookies copied from the browser, as a dictionary
cookie = {'session-id':'459-4568418-5692641','ubid-acbcn':'459-5049899-3055220','x-wl-uid':'1AK7YMFc9IzusayDn2fT6Topjz3iAOpR3EeA2UQSqco8fo5PbK2aCpyBA/fdPMfKFqZRHc4IeyuU=','session-token':'"OH1wPvfOj6Tylq2nnJcdn5wyxycR/lqyGsGU3+lUtU4mbC0ZD9s8/4Oihd1BlskUQG8zRbLVs9vfWXuiJmnRlDT4x35ircp2uLxOLNYQ4j5pzdFJIqqoZUnhHSJUq2yK80P3LqH8An7faXRCPW9BIqX1wu0WmHlSS9vYAPKA/2SGdV9b//EljYjIVCBjOuR/dKRiYEeGK3li0RJOVz7+vMWg7Rnzbx89QxlbCp0WyquZyVxG6f2mNw=="','session-id-time':'2082787201l'}

r = requests.get(url, headers=head, proxies=proxy_id, cookies=cookie)

# apparent_encoding is the encoding guessed from the response body
r.encoding = r.apparent_encoding

# Inspect the returned HTML
r.text
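One detail of the proxy setting is worth checking: requests looks up the proxy by the scheme of the requested URL, so a dictionary with only an "http" key does not route an https request through the proxy. The sketch below is a simplified illustration of that lookup, not the library's actual code (requests also supports per-host keys and an "all" key); the helper name pick_proxy is hypothetical.

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Simplified scheme-based lookup (hypothetical helper):
    return the proxy registered for the URL's scheme, or None."""
    scheme = urlparse(url).scheme
    return proxies.get(scheme)


url = 'https://www.amazon.cn/s/field-keywords=spark'
proxy_id = {"http": "http://61.135.155.82:443"}

# Only an "http" entry: nothing matches the https URL
print(pick_proxy(url, proxy_id))

# Adding an "https" entry makes the lookup succeed
both = dict(proxy_id, https=proxy_id["http"])
print(pick_proxy(url, both))
```

Whether this particular proxy address can actually carry https traffic is not something the original post confirms.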

 

 

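Step 2: Parse the page

Inspecting the page shows that each product name sits in an h2 tag and each price in a span tag. Start by extracting the product names.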

 

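# Parse the page, locating elements with bs4
from bs4 import BeautifulSoup

html = r.text  # HTML text of the response from step 1

# Get the product names
soup = BeautifulSoup(html, 'html.parser')
name = soup.find_all('h2')
name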


 

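Each product shows several prices, so only the first one (Amazon's own price) is scraped.

The price is located relative to the product name, so a name can never be paired with the wrong price.

Inspecting the page shows that the price is in a span tag: go up three parents from the product name, move three siblings forward, and look at the span tags there.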

name[0].parent.parent.parent.next_sibling.next_sibling.next_sibling('span')[1].string

 

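# Get the product prices
namelist = []
pricelist = []
for i in range(len(name)):
    namelist.append(name[i].string)
    try:
        pricelist.append(name[i].parent.parent.parent.next_sibling.next_sibling.next_sibling('span')[1].string)
    except (AttributeError, IndexError, TypeError):
        # No price at the expected position: store a placeholder
        pricelist.append("null")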
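A note on the sibling hops: in Beautiful Soup, next_sibling also returns the whitespace text nodes between tags, which may be why several hops are needed to reach the price container. The sketch below uses made-up markup (an assumption, not Amazon's real page structure) to show this, along with find_next_sibling, which skips text nodes and returns the next matching tag.

```python
from bs4 import BeautifulSoup

# Toy markup for illustration only; the real Amazon page is structured differently
html = """<ul>
<li><h2>Spark Guide</h2></li>
<li><span>List price</span><span>59.00</span></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")
name_li = h2.parent

# The immediate next sibling is the newline between the two <li> tags
first_hop = name_li.next_sibling

# find_next_sibling skips text nodes and returns the next <li> tag
price_li = name_li.find_next_sibling("li")
price = price_li.find_all("span")[1].string

print(repr(first_hop), price)
```

Navigating by tag name this way tends to be less sensitive to whitespace changes than counting next_sibling hops, though either approach depends on the page layout staying the same.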

 

Step 3: Output

# Print the results
print("{}\t{}".format("Product name", "Price"))
for i in range(len(name)):
    print("{}\t{}".format(name[i].string, pricelist[i]))

 

# Or write the results to a table
import pandas as pd

shangpin = []
for i in range(len(name)):
    shangpin.append([name[i].string, pricelist[i]])
table = pd.DataFrame(data=shangpin, columns=['Product name', 'Price'])
table.to_csv('D:/yamasun.csv', index=False)
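If pandas is not available, the same table can be produced with the standard csv module. This is a minimal sketch under that assumption; the helper name rows_to_csv and the sample rows are made up for illustration.

```python
import csv
import io


def rows_to_csv(names, prices):
    """Return CSV text: a header row, then one row per product (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Product name", "Price"])
    # zip pairs each name with the price at the same position
    writer.writerows(zip(names, prices))
    return buffer.getvalue()


# Made-up sample data
text = rows_to_csv(["Spark Guide", "Learning Spark"], ["59.00", "null"])
print(text)
```

To write a file instead, the same writer can be attached to open(path, "w", newline="", encoding="utf-8-sig"); the utf-8-sig encoding helps spreadsheet programs display Chinese product names correctly.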


 

 

Adding pagination:

# Build the URL of each result page
goods = 'spark'  # product keyword
pages = 10       # number of pages to scrape
for n in range(pages):
    page = n + 1
    url = 'https://www.amazon.cn/s/field-keywords=' + goods + '&page=' + str(page)
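The URL construction can be wrapped in a small function. The sketch below keeps the URL pattern from the loop above and, as an added assumption, percent-encodes the keyword with urllib.parse.quote so that keywords containing spaces or Chinese characters still yield a valid URL. The helper name build_page_urls is hypothetical.

```python
from urllib.parse import quote

BASE = 'https://www.amazon.cn/s/field-keywords='


def build_page_urls(goods, pages):
    """Return the search URLs for pages 1..pages (hypothetical helper)."""
    keyword = quote(goods)  # percent-encode spaces and non-ASCII characters
    return [BASE + keyword + '&page=' + str(page) for page in range(1, pages + 1)]


urls = build_page_urls('spark', 10)
for url in urls:
    print(url)
```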

 

 

 

Final code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

namelist = []   # product names
pricelist = []  # product prices
shangpin = []   # [name, price] rows

goods = 'spark'  # search keyword
pages = 10       # number of pages to scrape

# Disguise the crawler (the same settings are reused for every page)
head = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
proxy_id = { "http": "http://61.135.155.82:443"}
cookie = {'session-id':'459-4568418-5692641', 'ubid-acbcn':'459-5049899-3055220','x-wl-uid':'1AK7YMFc9IzusayDn2fT6Topjz3iAOpR3EeA2UQSqco8fo5PbK2aCpyBA/fdPMfKFqZRHc4IeyuU=','session-token':'"OH1wPvfOj6Tylq2nnJcdn5wyxycR/lqyGsGU3+lUtU4mbC0ZD9s8/4Oihd1BlskUQG8zRbLVs9vfWXuiJmnRlDT4x35ircp2uLxOLNYQ4j5pzdFJIqqoZUnhHSJUq2yK80P3LqH8An7faXRCPW9BIqX1wu0WmHlSS9vYAPKA/2SGdV9b//EljYjIVCBjOuR/dKRiYEeGK3li0RJOVz7+vMWg7Rnzbx89QxlbCp0WyquZyVxG6f2mNw=="','csm-hit':'tb:0J5M3DH92ZKHNKA0QBAF+b-0J5M3DH92ZKHNKA0QBAF|1544276572483&adb:adblk_no','session-id-time':'2082787201l'}

for n in range(pages):
    # Fetch the page
    page = n + 1
    url = 'https://www.amazon.cn/s/field-keywords=' + goods + '&page=' + str(page)
    r = requests.get(url, headers=head, proxies=proxy_id, cookies=cookie)
    # Set the encoding; apparent_encoding is guessed from the response body
    r.encoding = r.apparent_encoding
    html = r.text

    # Parse the page (locating elements with bs4)
    # Get the product names
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.find_all('h2')

    # Get the price that belongs to each product name
    for h2 in name:
        try:
            price = h2.parent.parent.parent.next_sibling.next_sibling.next_sibling('span')[1].string
        except (AttributeError, IndexError, TypeError):
            # No price at the expected position: store a placeholder
            price = "null"
        namelist.append(h2.string)
        pricelist.append(price)
        shangpin.append([h2.string, price])

# Print the results
# print("{}\t{}".format("Product name", "Price"))
# for i in range(len(namelist)):
#     print("{}\t{}".format(namelist[i], pricelist[i]))

# Write the results to a table
table = pd.DataFrame(data=shangpin, columns=['Product name', 'Price'])
table.to_csv('D:/yamasun.csv', index=False)

 

 

Final result: the product names and prices are saved to D:/yamasun.csv.