How to build a single item across multiple pages in Scrapy?
Problem description:
The situation is the following: I want to scrape product details from a product detail page (page A) that describes a product. This page contains a link to a page listing the sellers of this product (page B), and each seller entry contains a link to another page (page C) with that seller's details. Here is an example schema:
Page A:
- product_name
- link to the sellers of this product (page B)
Page B: sellers
- a list, each entry containing:
  - seller_name
  - selling_price
  - link to the seller detail page (page C)
Page C:
- seller_address
This is the JSON I would like to obtain after crawling:
{
    "product_name": "product1",
    "sellers": [
        {
            "seller_name": "seller1",
            "seller_price": 100,
            "seller_address": "address1"
        },
        (...)
    ]
}
What I have tried: passing the product information from the parse method to a second parse method via the meta object. This works fine for 2 levels, but I have 3, and I want a single item at the end.
Is this possible in Scrapy?
EDIT:
As requested, here is a scaled-down example of what I am trying to do. I know it is not expected to work as-is, but I cannot figure out how to make it return only one composed object:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/products/product1'
    ]

    def parse(self, response):
        # assume this object was obtained after
        # some xpath processing
        product_name = 'product1'
        link_to_sellers = 'http://example.com/products/product1/sellers'

        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
            'product': {
                'product_name': product_name,
                'sellers': []
            }
        })

    def parse_sellers(self, response):
        product = response.meta['product']

        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            },
        ]

        for seller in sellers:
            product['sellers'].append(seller)
            yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

    def parse_seller(self, response):
        seller = response.meta['seller']

        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'

        seller['seller_address'] = seller_address
        yield seller
Answer:
You need to change your logic a little so that it requests only one seller address at a time, and only moves on to the next seller once the current one is finished.
def parse_sellers(self, response):
    meta = response.meta

    # assume this object was obtained after
    # some xpath processing
    sellers = [
        {
            'seller_name': 'seller1',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller1',
        },
        {
            'seller_name': 'seller2',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller2',
        },
        {
            'seller_name': 'seller3',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller3',
        },
    ]

    if sellers:
        # take one seller, keep the rest in meta for later
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        # no sellers at all: the product is complete as it is
        yield meta['product']

    # for seller in sellers:
    #     product['sellers'].append(seller)
    #     yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

def parse_seller(self, response):
    meta = response.meta
    current_seller = meta['current_seller']
    sellers = meta['pending_sellers']

    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    current_seller['seller_address'] = seller_address
    meta['product']['sellers'].append(current_seller)

    if sellers:
        # more sellers left: chain a request for the next one
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        # all sellers processed: yield the fully assembled product
        yield meta['product']
But this is still not a great approach, because a seller may sell more than one product. When you reach the same seller again for another product, your request for the seller's address will be rejected by the duplicate filter. You can work around that by adding dont_filter=True to the request, but that means a lot of unnecessary hits on the website.
So you need to add database handling directly in your code to check whether you already have a seller's details: if you do, use them; if not, fetch the seller detail page.
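Here is a minimal sketch of that idea, using an in-memory dict on the spider instead of a real database (a database lookup would follow the same pattern). The class name, the seller_cache attribute and the hard-coded placeholder data are assumptions for illustration, not part of the original answer:

import scrapy


class CachingExampleSpider(scrapy.Spider):
    # Sketch only: combines the one-request-at-a-time chaining above with a
    # simple in-memory cache of seller addresses, so a seller detail page is
    # requested at most once per crawl.
    name = 'cachingexamplespider'
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/products/product1']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # maps seller_detail_url -> seller_address, shared across products
        self.seller_cache = {}

    def parse(self, response):
        # assume these values were obtained after some xpath processing
        product = {'product_name': 'product1', 'sellers': []}
        link_to_sellers = 'http://example.com/products/product1/sellers'
        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers,
                             meta={'product': product})

    def parse_sellers(self, response):
        meta = response.meta
        # assume this list was obtained after some xpath processing
        sellers = [
            {'seller_name': 'seller1', 'seller_price': 100,
             'seller_detail_url': 'http://example.com/sellers/seller1'},
        ]

        # fill in every seller whose address is already in the cache
        pending = []
        for seller in sellers:
            address = self.seller_cache.get(seller['seller_detail_url'])
            if address is not None:
                seller['seller_address'] = address
                meta['product']['sellers'].append(seller)
            else:
                pending.append(seller)

        if pending:
            current = pending.pop()
            meta['pending_sellers'] = pending
            meta['current_seller'] = current
            yield scrapy.Request(current['seller_detail_url'],
                                 callback=self.parse_seller, meta=meta)
        else:
            # nothing left to fetch: the product is already complete
            yield meta['product']

    def parse_seller(self, response):
        meta = response.meta
        current = meta['current_seller']

        # assume this value was obtained after some xpath processing
        current['seller_address'] = 'seller_address1'
        # remember the address so other products can reuse it without a request
        self.seller_cache[current['seller_detail_url']] = current['seller_address']
        meta['product']['sellers'].append(current)

        pending = meta['pending_sellers']
        if pending:
            current = pending.pop()
            meta['pending_sellers'] = pending
            meta['current_seller'] = current
            yield scrapy.Request(current['seller_detail_url'],
                                 callback=self.parse_seller, meta=meta)
        else:
            yield meta['product']

With this in place, repeated sellers are filled in from the cache, so dont_filter=True is not needed and the site is not hit more than necessary.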
Answer:
I think a pipeline can help here.
Assuming the yielded seller is in the following format (this can be achieved with some trivial modifications to your code; a sketch of such a modification follows the example below):
seller = {
    'product_name': 'product1',
    'seller': {
        'seller_name': 'seller1',
        'seller_price': 100,
        'seller_address': 'address1',
    }
}
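One possible way to produce that shape from the question's spider is sketched below. It is not part of the original answer and assumes that parse_sellers also puts the product_name into the meta of each seller request:

# Sketch only: assumes parse_sellers passed 'product_name' along in meta,
# e.g. meta={'seller': seller, 'product_name': product['product_name']}.
def parse_seller(self, response):
    seller = response.meta['seller']

    # assume this value was obtained after some xpath processing
    seller_address = 'seller_address1'

    yield {
        'product_name': response.meta['product_name'],
        'seller': {
            'seller_name': seller['seller_name'],
            'seller_price': seller['seller_price'],
            'seller_address': seller_address,
        },
    }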
A pipeline like the one below will then collect the sellers, group them by product_name, and export them to a file "items.jl" after the crawl (note that this is only a sketch of the idea, so it is not guaranteed to work):
import json


class CollectorPipeline(object):

    def __init__(self):
        self.collection = {}

    def open_spider(self, spider):
        self.collection = {}

    def close_spider(self, spider):
        with open("items.jl", "w") as fp:
            for _, product in self.collection.items():
                fp.write(json.dumps(product))
                fp.write("\n")

    def process_item(self, item, spider):
        product = self.collection.get(item["product_name"], dict())
        product["product_name"] = item["product_name"]
        sellers = product.get("sellers", list())
        sellers.append(item["seller"])
        product["sellers"] = sellers
        self.collection[item["product_name"]] = product
        return item
By the way, you need to modify your settings.py to enable this pipeline, as described in the Scrapy documentation.
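For example, a minimal sketch of that setting; 'myproject.pipelines' is a placeholder for the module where CollectorPipeline actually lives in your project:

# settings.py (sketch): register the pipeline so Scrapy actually runs it
ITEM_PIPELINES = {
    'myproject.pipelines.CollectorPipeline': 300,
}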
Please post the code you used. –
I think this is a fairly common use case, but I will post some example code tonight, thanks. –
Code posted, thanks in advance. –