How to get a single item across multiple websites in Scrapy?

Problem description:

I have the following situation: how can I build a single item across multiple pages in Scrapy?

I want to scrape product details from a product detail page (page A) that describes the product. This page contains a link to a page listing the sellers of this product (page B), and each seller entry on page B links to another page (page C) with that seller's details. Here is an example schema:

Page A:

  • product_name
  • a link to the sellers of this product (page B)

Page B: sellers

  • a list of sellers, each entry containing:
    • seller_name
    • seller_price
    • a link to the seller detail page (page C)

Page C:

  • seller_address

This is the JSON I would like to get after crawling:

{
    "product_name": "product1",
    "sellers": [
        {
            "seller_name": "seller1",
            "seller_price": 100,
            "seller_address": "address1"
        },
        (...)
    ]
}

What I have tried: passing the product info from the first parse method to the second parse method in the meta object. That works fine for 2 levels, but I have 3, and I want a single item.

Is this possible with Scrapy?

Edit:

As requested, here is a cut-down example of what I am trying to do. I know it is not expected to work as-is, but I cannot figure out how to make it return just one composed object:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    allowed_domains = ["example.com"]

    start_urls = [
        'http://example.com/products/product1'
    ]

    def parse(self, response):
        # assume this object was obtained after
        # some xpath processing
        product_name = 'product1'
        link_to_sellers = 'http://example.com/products/product1/sellers'

        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
            'product': {
                'product_name': product_name,
                'sellers': []
            }
        })

    def parse_sellers(self, response):
        product = response.meta['product']

        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            }
        ]

        for seller in sellers:
            product['sellers'].append(seller)
            yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

    def parse_seller(self, response):
        seller = response.meta['seller']

        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'

        seller['seller_address'] = seller_address

        yield seller

Post the code you are using. –

I think this is a fairly common use case, but I will post some example code tonight, thanks. –

Code posted, thanks in advance. –

You need to change your logic a little so that it queries only one seller address at a time, and moves on to the next seller only once the current one is finished.

def parse_sellers(self, response):
    meta = response.meta

    # assume this object was obtained after
    # some xpath processing
    sellers = [
        {
            'seller_name': 'seller1',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller1',
        },
        {
            'seller_name': 'seller2',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller2',
        },
        {
            'seller_name': 'seller3',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller3',
        }
    ]

    if sellers:
        # take one seller now, keep the rest pending for later
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        # no sellers at all: emit the product as-is
        yield meta['product']

    # for seller in sellers:
    #     product['sellers'].append(seller)
    #     yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

def parse_seller(self, response):
    meta = response.meta
    current_seller = meta['current_seller']
    sellers = meta['pending_sellers']

    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    current_seller['seller_address'] = seller_address

    meta['product']['sellers'].append(current_seller)
    if sellers:
        # more sellers pending: chain the next seller-detail request
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller

        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        # last seller done: emit the fully assembled product
        yield meta['product']

But this is still not a great approach, because a seller may sell more than one product. So when you reach a product from the same seller again, your request for the seller's address will be rejected by the duplicate filter. You can work around that by adding dont_filter=True to the request, but that means a lot of unnecessary hits on the website.
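
For illustration, the seller-detail request above would then look something like this (dont_filter=True just disables the duplicate filter for this one request):

# same request as before, but never dropped by the dupefilter
yield scrapy.Request(
    current_seller['seller_detail_url'],
    callback=self.parse_seller,
    meta=meta,
    dont_filter=True,  # repeated sellers are fetched again, at the cost of extra hits
)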

So you really need database handling directly in your code, to check whether you already have a seller's details: if you do, reuse them; if you don't, fetch the detail page.
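
As a rough sketch of that idea, here is what an in-memory version could look like (seller_cache and schedule_seller are made-up names, not part of your code; a real implementation would query a database instead of a dict):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'examplespider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # made-up in-memory cache: seller_detail_url -> seller_address
        # (a real version would look this up in your database instead)
        self.seller_cache = {}

    def schedule_seller(self, current_seller, meta):
        url = current_seller['seller_detail_url']
        if url in self.seller_cache:
            # details already known: reuse them and skip the extra request
            current_seller['seller_address'] = self.seller_cache[url]
            meta['product']['sellers'].append(current_seller)
            return None
        # not seen before: fetch the seller detail page as above
        # (parse_seller would also have to store the address into self.seller_cache)
        return scrapy.Request(url, callback=self.parse_seller, meta=meta)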

I think a pipeline can help here.

Assuming the yielded seller items have the following format (which can be achieved with some trivial modifications to your code; a sketch of such a modification follows the example):

seller = {
    'product_name': 'product1',
    'seller': {
        'seller_name': 'seller1',
        'seller_price': 100,
        'seller_address': 'address1',
    }
}
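
One way to get items into that shape (a sketch only, assuming product_name is also passed along in the request meta, which your current code does not do yet):

def parse_seller(self, response):
    seller = response.meta['seller']

    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    # yield one flat item per (product, seller) pair; the pipeline below
    # regroups them by product_name
    yield {
        'product_name': response.meta['product_name'],
        'seller': {
            'seller_name': seller['seller_name'],
            'seller_price': seller['seller_price'],
            'seller_address': seller_address,
        },
    }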

A pipeline like the one below would then collect the sellers, group them by their product_name, and export them to a file "items.jl" after the crawl (note that this is just a sketch of the idea, so it is not guaranteed to work):

import json


class CollectorPipeline(object):

    def __init__(self):
        self.collection = {}

    def open_spider(self, spider):
        self.collection = {}

    def close_spider(self, spider):
        # dump one product (with all its collected sellers) per line
        with open("items.jl", "w") as fp:
            for _, product in self.collection.items():
                fp.write(json.dumps(product))
                fp.write("\n")

    def process_item(self, item, spider):
        # group incoming seller items by product_name
        product = self.collection.get(item["product_name"], dict())
        product["product_name"] = item["product_name"]
        sellers = product.get("sellers", list())
        sellers.append(item["seller"])
        product["sellers"] = sellers
        self.collection[item["product_name"]] = product

        return item

BTW, you need to modify your settings.py to enable the pipeline, as described in the Scrapy documentation.
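
A minimal sketch of that settings.py entry (the dotted path is only an example; it depends on where CollectorPipeline actually lives in your project):

# settings.py: register the pipeline so Scrapy routes items through it
ITEM_PIPELINES = {
    'myproject.pipelines.CollectorPipeline': 300,  # lower number = runs earlier
}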