How can I make sure BS4 requests are made through the proxies on my list?

Question:

I have a list of proxies like this one that I want to use for scraping with Python:

proxies_ls = [ '149.56.89.166:3128', 
      '194.44.176.116:8080', 
      '14.203.99.67:8080', 
      '185.87.65.204:63909', 
      '103.206.161.234:63909', 
      '110.78.177.100:65103']

I also wrote a function, crawlSite(url), that scrapes a URL using BS4 and the requests module. Here is the code:

# Libraries for crawling and regex
from bs4 import BeautifulSoup 
import requests 
from fake_useragent import UserAgent 
import re 

# Library for dates
import datetime 
from time import gmtime, strftime 

# Library for writing the logs
import os 
import errno 

# Library for the random delay
import time 
import random 

print('BOT iniciado: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S')) 

proxies_ls = [ '149.56.89.166:3128', 
      '194.44.176.116:8080', 
      '14.203.99.67:8080', 
      '185.87.65.204:63909', 
      '103.206.161.234:63909', 
      '110.78.177.100:65103'] 

def crawlSite(url): 
    #Chrome emulation 
    ua=UserAgent() 
    header={'user-agent':ua.chrome} 
    random.shuffle(proxies_ls) 

    #Random delay 
    print('antes do delay: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S')) 
    tempoRandom=random.randint(1,5) 
    time.sleep(tempoRandom) 

    try: 
     randProxy=random.choice(proxies_ls) 
     # Getting the webpage, creating a Response object emulated with chrome with a 30sec timeout. 
     response = requests.get(url,proxies = {'https':randProxy},headers=header,timeout=30) 
     print(response) 
     print('Resposta obtida: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S')) 

     #Avoid HTTP request errors 
     if response.status_code == 404: 
      raise ConnectionError("HTTP Response [404] - The requested resource could not be found") 
     elif response.status_code == 409:    
      raise ConnectionError("HTTP Response [409] - Possible Cloudflare DNS resolution error") 
     elif response.status_code == 403: 
      raise ConnectionError("HTTP Response [403] - Permission denied error") 
     elif response.status_code == 503: 
      raise ConnectionError("HTTP Response [503] - Service unavailable error") 
     print('RR Status {}'.format(response.status_code)) 
     # Extracting the source code of the page. 
     data = response.text 

    except ConnectionError: 
     try: 
      proxies_ls.remove(randProxy) 
     except ValueError: 
      pass 
     randProxy=random.choice(proxies_ls) 

    return BeautifulSoup(data, 'lxml')

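What I want to do is make sure that only the proxies in the list are used for the connections. The random part

randProxy=random.choice(proxies_ls)

works fine, but the part that checks whether the proxy is valid does not. Mainly because I still get a 200 response when using a "made-up" proxy.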

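If I reduce the list to:

proxies_ls = ['149.56.89.166:3128']

with a proxy that does not work, I still get a 200 response! (I tested it with a proxy checker such as https://pt.infobyip.com/proxychecker.php and it does not work...)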

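So my questions are (I will enumerate them so it is easier):
a) Why am I getting this 200 response instead of a 4xx response?
b) How can I force the request to use the proxies?

Thanks,

Eunito.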

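Read the documentation carefully: http://docs.python-requests.org/en/master/user/advanced/#proxies. You need to specify the protocol in the proxy dictionary: requests.get(url, proxies={'https': 'http://%s' % randProxy}, ...). Right now you are only passing an IP address and port. – Kalkran

Hi @Kalkran, you are right! But even with that correction, using the single proxy mentioned above (proxies_ls = ['149.56.89.166:3128']), I still get a 200... – Eunito

Could it be because you are crawling an HTTP site rather than an HTTPS site? You are only giving it a proxy for HTTPS sites. – Kalkran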
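
The two points raised in these comments come down to how the proxies dictionary is keyed: requests looks up the proxy by the scheme of the URL being fetched, so a dictionary with only an 'https' key leaves http:// URLs unproxied. A minimal sketch, assuming the entries in the list are plain HTTP proxies. `build_proxies` and `proxy_for` are hypothetical helpers, and `proxy_for` is only a simplified standard-library model of the scheme lookup, not requests' own code (which also considers per-host keys and environment variables):

```python
from urllib.parse import urlsplit


def build_proxies(address):
    """Map both URL schemes to the same proxy, with an explicit proxy scheme."""
    proxy_url = 'http://%s' % address
    return {'http': proxy_url, 'https': proxy_url}


def proxy_for(url, proxies):
    """Simplified model: return the proxy registered for the URL's scheme, or None."""
    return proxies.get(urlsplit(url).scheme)


# Only an 'https' key: an http:// URL finds no proxy and would be fetched directly.
https_only = {'https': 'http://149.56.89.166:3128'}
print(proxy_for('http://example.com/', https_only))

# Both keys: the same URL is now routed through the proxy.
print(proxy_for('http://example.com/', build_proxies('149.56.89.166:3128')))
```

With the question's code, the dictionary would be passed as requests.get(url, proxies=build_proxies(randProxy), headers=header, timeout=30).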

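Answer

So basically, if I understand your question correctly, you just want to check whether the proxy is valid. requests defines exceptions for this, so you can do something like: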

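import requests
from requests.exceptions import ProxyError

try:
    # url, randProxy and header are the variables from the question's crawlSite()
    response = requests.get(url, proxies={'https': randProxy}, headers=header, timeout=30)
except ProxyError:
    # The proxy is invalid or unreachable
    print('Invalid proxy: ' + randProxy)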
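
Building on the exception handling above, the same idea can be extended to the whole list: try each proxy in random order and drop the ones that fail. This is a rough sketch under stated assumptions, not a definitive implementation. `fetch_via_working_proxy` and its `fetch` parameter are hypothetical names: `fetch` is any callable taking (url, proxies), for example a thin wrapper around requests.get. The sketch catches OSError because requests' exceptions, ProxyError included, derive from it.

```python
import random


def fetch_via_working_proxy(url, proxy_addresses, fetch):
    """Try proxies in random order and drop each one whose request raises.

    `fetch(url, proxies)` is supplied by the caller (hypothetical here).
    Failed proxies are removed from `proxy_addresses` in place.
    """
    candidates = list(proxy_addresses)  # copy, so the original can be mutated
    random.shuffle(candidates)
    for address in candidates:
        proxy_url = 'http://%s' % address
        try:
            return fetch(url, {'http': proxy_url, 'https': proxy_url})
        except OSError as exc:
            print('Discarding proxy %s: %s' % (address, exc))
            proxy_addresses.remove(address)  # do not pick this proxy again
    raise RuntimeError('No working proxy left in the list')
```

With requests, `fetch` could be lambda u, p: requests.get(u, proxies=p, headers=header, timeout=30), where header is the dictionary from the question. Note that a 200 response only shows the request succeeded; to confirm the proxy was actually used, the URL's scheme must have a matching key in the proxies dictionary.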
