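How can I scrape multiple HTML pages in parallel with BeautifulSoup in Python?

Problem description:

I'm building a web scraping application in Python with the Django web framework. I need to scrape multiple URLs using the BeautifulSoup library. Here is a snapshot of the code I have written: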

import requests
from bs4 import BeautifulSoup

for url in websites:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.find_all("a", {"class": "dev-link"})

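The pages are scraped sequentially here, and I want to run this in parallel. I don't know much about threading in Python. Can someone tell me how I can do the scraping in parallel? Any help would be appreciated.

How many web pages do you want to scrape at the same time? – Exprator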

Answer

You can use Hadoop (http://hadoop.apache.org/) to run your jobs in parallel. It is a very good tool for running parallel tasks.

Answer

Try this solution.

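import threading
import requests
from bs4 import BeautifulSoup

results = {}  # url -> list of matching <a> tags

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # A thread target's return value is discarded, so store the result instead
    results[url] = soup.find_all("a", {"class": "dev-link"})

threads = [threading.Thread(target=fetch_links, args=(url,))
           for url in websites]

for t in threads:
    t.start()

# Wait for all downloads to finish before reading results
for t in threads:
    t.join()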

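Downloading page content with requests.get() is a blocking operation, so Python threads can actually improve performance here: the GIL is released while a thread waits on network I/O, which lets the other downloads proceed.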
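For what it's worth, the same idea can also be written with the standard library's concurrent.futures.ThreadPoolExecutor, which caps the number of threads and hands back each call's return value. This is only a minimal sketch: the names scrape_all and max_workers, the 10-second timeout, and the "html.parser" choice are my assumptions, not part of the answer above.

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def fetch_links(url):
    """Download one page and return its <a class="dev-link"> tags."""
    r = requests.get(url, timeout=10)  # timeout value is an assumption
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("a", {"class": "dev-link"})


def scrape_all(urls, fetch=fetch_links, max_workers=8):
    """Run fetch(url) for every URL in a thread pool.

    Returns a dict mapping each URL to whatever fetch returned for it.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result
        return dict(zip(urls, pool.map(fetch, urls)))
```

Usage would be something like links_by_url = scrape_all(websites). Note that if any download raises, the exception surfaces when the results are collected, so you may want a try/except inside fetch_links for flaky sites.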

Answer

If you want to use multithreading:

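import threading
import requests
from bs4 import BeautifulSoup

class Scrapper(threading.Thread):
    def __init__(self, threadId, name, url):
        threading.Thread.__init__(self)
        self.name = name
        self.id = threadId
        self.url = url
        self.links = []  # filled in by run()

    def run(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # run()'s return value is discarded, so keep the result on the instance
        self.links = soup.find_all("a")

# List the websites to scrape in the list below
websites = []
threads = []
for i, url in enumerate(websites, start=1):
    thread = Scrapper(i, "thread" + str(i), url)
    # start() executes run() in a new thread; calling run() directly would block
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
    # print(thread.links)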

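This may be helpful when it comes to scraping with Python.

Answer

Scrapy is probably the way to go.

Scrapy uses the Twisted (Twisted Matrix) library to run requests concurrently, so you don't have to worry about threads and the Python GIL.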

If you must use BeautifulSoup, check out this library.
