Disabling SSL certificate verification in Scrapy

Problem description:

I am currently working through an issue with Scrapy. Whenever I use Scrapy to scrape an HTTPS site whose certificate's CN value matches the server's domain name, Scrapy works great. On the other hand, whenever I try to scrape a site whose certificate's CN value does not match the server's domain name, I get the following:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 415, in dataReceived
    self._write(bytes)
  File "/usr/local/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 554, in _write
    sent = self._tlsConnection.send(toSend)
  File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 1270, in send
    result = _lib.SSL_write(self._ssl, buf, len(buf))
  File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 926, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
    return wrapped(connection, where, ret)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/_sslverify.py", line 1154, in _identityVerifyingInfoCallback
    verifyHostname(connection, self._hostnameASCII)
  File "/usr/local/lib/python2.7/dist-packages/service_identity/pyopenssl.py", line 30, in verify_hostname
    obligatory_ids=[DNS_ID(hostname)],
  File "/usr/local/lib/python2.7/dist-packages/service_identity/_common.py", line 235, in __init__
    raise ValueError("Invalid DNS-ID.")
exceptions.ValueError: Invalid DNS-ID.

I have looked through as much documentation as I could find, and as far as I can tell, Scrapy has no way to disable SSL certificate verification. Even the documentation for the Scrapy Request object (where I would expect this functionality to live) makes no reference to it:

http://doc.scrapy.org/en/1.0/topics/request-response.html#scrapy.http.Request
https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/__init__.py

There is also no Scrapy setting that addresses the problem:

http://doc.scrapy.org/en/1.0/topics/settings.html

Short of using Scrapy as-is and modifying its source as needed, does anyone have any ideas on how to disable SSL certificate verification?

Thanks!

Looking at the docs, it seems you can modify the DOWNLOAD_HANDLERS or DOWNLOAD_HANDLERS_BASE settings to change how Scrapy handles https. From there you would probably have to create your own modified HttpDownloadHandler that can get past the error you are receiving. – Monkpit

/me daydreams at his desk. That certainly looks promising. Could you write it up as an answer so that I can accept it, and I will then add the code I used for anyone else's future reference? – MoarCodePlz

From the settings documentation you linked to, it looks like you can modify the DOWNLOAD_HANDLERS setting.

From the docs:

""" 
    A dict containing the request download handlers enabled by default in 
    Scrapy. You should never modify this setting in your project, modify 
    DOWNLOAD_HANDLERS instead. 
""" 

DOWNLOAD_HANDLERS_BASE = { 
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', 
    'http': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler', 
    'https': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler', 
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler', 
} 

Then, in your own settings, override it like so:

""" 
    Configure your download handlers with something custom to override 
    the default https handler 
""" 
DOWNLOAD_HANDLERS = { 
    'https': 'my.custom.downloader.handler.https.HttpsDownloaderIgnoreCNError', 
} 

So by defining a custom handler for the https protocol, you should be able to handle the error you are getting and let Scrapy continue about its business.
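For future reference, here is a minimal sketch of what such a handler might look like on Scrapy 1.0 (the version of the docs linked above). The module path and class name simply mirror the hypothetical my.custom.downloader.handler.https.HttpsDownloaderIgnoreCNError dotted path from the settings example, and the sketch relies on version-dependent internals (HTTP11DownloadHandler's private _contextFactory attribute, ScrapyClientContextFactory's optional hostname argument), so treat it as a starting point rather than a drop-in solution:

# my/custom/downloader/handler/https.py -- sketch only; the module path
# mirrors the hypothetical dotted path used in the settings above, and
# the internals it touches are specific to Scrapy 1.0.
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler


class NonVerifyingContextFactory(ScrapyClientContextFactory):
    """Context factory that skips hostname verification."""

    def getContext(self, hostname=None, port=None):
        # Calling the parent without a hostname skips the
        # ClientTLSOptions wrapping, which is what installs the
        # identity-verifying info callback that raises "Invalid DNS-ID."
        return ScrapyClientContextFactory.getContext(self)


class HttpsDownloaderIgnoreCNError(HTTP11DownloadHandler):
    """HTTPS download handler that swaps in the permissive factory."""

    def __init__(self, *args, **kwargs):
        super(HttpsDownloaderIgnoreCNError, self).__init__(*args, **kwargs)
        # _contextFactory is where Scrapy 1.0's HTTP11DownloadHandler
        # keeps the factory it hands to Twisted's Agent; later versions
        # wire this differently.
        self._contextFactory = NonVerifyingContextFactory()

With the DOWNLOAD_HANDLERS setting above pointing at this class, every https:// request goes through the permissive handler. Sites with well-matched certificates keep working; they just lose hostname verification, so only use something like this where that trade-off is acceptable.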

This is awesome and looks particularly applicable to the problem I am running into. I am going to play around with the code to see if I can pull this off and will post my solution here. Thanks! – MoarCodePlz

@MoarCodePlz Did you ever find a solution? Any interest in posting some links? – Dawson