Notes on Scraping Proxy IPs

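For work I crawled several hundred thousand records plus several million images. That needed proxy IPs, and as a programmer, naturally the first move was to scrape some…
The candidate sources were:
Kuaidaili (快代理)
89IP
Xici Proxy (西祠代理)
Zdaye (站大爷)
Mayi Proxy (蚂蚁代理)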

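Xici, Kuaidaili and 89IP pose no real difficulty. Don't hit Xici too frequently or it will ban your IP, though the ban lifts after about a day. Kuaidaili and 89IP follow exactly the same pattern, so the scraping code barely needs changing between them. These are entry-level, so I won't go through them in detail and will focus on scraping Mayi Proxy.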
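For reference, the shared pattern of the easy sites can be sketched roughly as below. This is only an illustration: the table markup, the `extract_proxies` and `crawl` helpers, and the 3-second delay are all assumptions, not taken from any of the sites above. The pause between requests is there because of the Xici ban mentioned earlier.

```python
import re
import time

# Hypothetical markup: one <tr> per proxy, IP in the first <td>, port in the second.
ROW_RE = re.compile(
    r"<tr>\s*<td[^>]*>\s*(\d{1,3}(?:\.\d{1,3}){3})\s*</td>"
    r"\s*<td[^>]*>\s*(\d{1,5})\s*</td>",
    re.S,
)


def extract_proxies(html):
    """Return a list of (ip, port) tuples found in a plain-text proxy table."""
    return [(ip, int(port)) for ip, port in ROW_RE.findall(html)]


def crawl(pages, fetch, delay=3.0):
    """Fetch each page with fetch(url) -> str, pausing between requests."""
    found = []
    for n, url in enumerate(pages):
        if n:  # no need to wait before the very first request
            time.sleep(delay)
        found.extend(extract_proxies(fetch(url)))
    return found


# Made-up sample page, only to exercise the parser.
sample = """
<table>
  <tr><th>IP</th><th>PORT</th></tr>
  <tr><td>1.2.3.4</td><td>8080</td><td>HTTP</td></tr>
  <tr><td data-title="IP"> 10.0.0.7 </td><td data-title="PORT"> 3128 </td></tr>
</table>
"""
print(extract_proxies(sample))  # [('1.2.3.4', 8080), ('10.0.0.7', 3128)]
```

In real use `fetch` would be something like `lambda url: requests.get(url, headers=headers).text`, and the regular expression would need adjusting to each site's actual markup.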
The Mayi Proxy listing page looks like this:
[Screenshot of the Mayi Proxy listing page]
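As you can see, the port numbers are obfuscated, which is exactly what caught my interest. Looking closer, each port number is covered with a tangle of multicoloured lines. A human can still read it without trouble, but scraping and recognizing it automatically takes more work. OCR is the obvious option: Python has a third-party library, pytesseract, that can handle this. Calling the Baidu OCR API would also work; I have used it before and it works well, but this time I wanted to try something different, so I went with pytesseract. Installation went fairly smoothly, and with that done it was time to get to work.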
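First I opened the Mayi Proxy page in a browser, downloaded one port image, and ran pytesseract on it. As expected, recognition failed, which shows the site owner's obfuscation does its job, and that only made it more interesting. The digits in the image are black while the obfuscating lines are coloured, so I tried setting every nonzero pixel value (anything other than pure black) to 255, and that did it:
[Image: the port image after thresholding, with only the black digits left]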
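The thresholding step can be tried in isolation on a small synthetic array. The sketch below is a vectorised equivalent of the idea, not the exact code used later; the `keep_black_only` name and the toy image are made up for illustration.

```python
import numpy as np


def keep_black_only(pixels):
    """Turn every value that is not pure black (0) into white (255)."""
    pixels = np.asarray(pixels)
    return np.where(pixels < 1, 0, 255).astype(np.uint8)


# Synthetic 4x6 RGB "image": white background, a black blob standing in
# for a digit, and two coloured lines standing in for the obfuscation.
img = np.full((4, 6, 3), 255, dtype=np.uint8)
img[1:3, 1:3] = (0, 0, 0)      # black "digit"
img[:, 4] = (200, 30, 30)      # red noise line
img[0, :] = (30, 30, 220)      # blue noise line

clean = keep_black_only(img)
# The black blob survives; both coloured lines are now white.
print(clean[:, :, 0])
```

One caveat: the test is applied per channel, so a fully saturated colour such as (255, 0, 0) keeps its zero channels and is not whitened completely. If the noise lines on a real image behave that way, thresholding on the per-pixel maximum across channels (or on a grayscale copy) would be a safer variant.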

Now on to the pytesseract recognition step:
[Screenshot: pytesseract output for the cleaned image]

It recognizes the port perfectly~
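Even when the digits are read correctly, the string that comes back from OCR may carry trailing whitespace or, on a bad image, stray characters. A small validation step like the one below keeps bad reads out of the results. `parse_port` is a hypothetical helper, not part of pytesseract.

```python
import re


def parse_port(ocr_text):
    """Return the port as an int, or None if the OCR text is not a valid port."""
    compact = "".join(ocr_text.split())          # drop all whitespace
    if not re.fullmatch(r"[0-9]{1,5}", compact):
        return None                              # letters or symbols: misread
    port = int(compact)
    return port if 1 <= port <= 65535 else None


print(parse_port("8080\n\x0c"))  # 8080
print(parse_port(" 31 28 "))     # 3128
print(parse_port("8O80"))        # None (letter O instead of zero)
print(parse_port("999999"))      # None (too many digits)
```

Rejecting a misread outright is a deliberate choice here: silently stripping the non-digits would turn "8O80" into 880, which is a plausible-looking but wrong port.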

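Right, with that solved, the actual scraping can begin. I'll skip the walkthrough of the scraping itself...
(Downloading the images actually ran into a problem too, but I got it solved in the end. I'm a bit tired, so I'll fill that in next time when I have a moment. Code first.)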

import PIL.Image  # a bare "import PIL" does not load the Image submodule
import requests
import numpy as np
import pytesseract


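from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from matplotlib import pyplot as plt

# Run Chrome headless so no browser window opens.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-gpu")

# Selenium 4 takes the driver path through a Service object;
# the older executable_path keyword is no longer accepted by recent releases.
driver = webdriver.Chrome(service=Service(r"D:\chromedriver.exe"), options=options)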

url = "http://www.mayidaili.com/free/1"

headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
 'Accept-Encoding': 'gzip, deflate',
 'Accept-Language': 'zh-CN,zh;q=0.9',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Cookie': 'Hm_lvt_dad083bfc015b67e98395a37701615ca=1554796082; JSESSIONID=C6FF5039054C48DCFDB16B97F1174DA5; proxy_token=xACHriPa; Hm_lpvt_dad083bfc015b67e98395a37701615ca=1554808505',
 'Host': 'www.mayidaili.com',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

if __name__ == "__main__":
    url = "http://www.mayidaili.com/free/{i}"
    session = requests.Session()
    for i in range(2, 11):
        url_ = url.format(i=i)
        print(url_.center(100, "*"))

        driver.get(url_)  # driver.get() returns None; the HTML is read from page_source
        content = driver.page_source
        html = etree.HTML(content)
        ips       = list(map(lambda obj: obj.strip(), html.xpath("//tr/td[1]/text()")))
        port_imgs = html.xpath("//tr/td[2]/img[@class='js-proxy-img']/@src")
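        # A pitfall I hit here: with the headers copied from the browser, the
        # request returns a successful status_code but the image does not download.
        # In the end the approach below (cookies taken from the driver) solved it.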
        cookies = driver.get_cookies()
        s = ""
        for cookie in cookies:
            s += "{}={}; ".format(cookie['name'], cookie['value'])
        s = s[:-2]
        headers["Cookie"] = s

        ports = []
        for img_url in port_imgs:
            resp = requests.get(img_url, headers=headers)
            print(resp.status_code)
            if resp.status_code != 200:
                # Skip failed downloads instead of re-reading a stale test.jpg.
                continue
            with open("test.jpg", 'wb') as file:
                file.write(resp.content)
            img = np.array(PIL.Image.open('test.jpg'))
            # Keep pure-black pixels (the digits); turn everything else white.
            img_ = np.where(img < 1, img, 255).astype(np.uint8)
            img = PIL.Image.fromarray(img_)
            print(img)
            port = pytesseract.image_to_string(img).strip()
            if port:
                # Show the cleaned image with the OCR result as its title.
                plt.gca().set_title(port)
                plt.imshow(img_)
                plt.pause(2)
                plt.clf()
                try:
                    ports.append(int(port))
                except ValueError:
                    # OCR returned something that is not a number; ignore it.
                    pass
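The cookie workaround from the listing above can be pulled out into a small helper. The sketch below uses made-up cookie values in the shape that `driver.get_cookies()` returns (a list of dicts with `name` and `value` keys); `cookies_to_header` is a hypothetical name.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


# Sample data shaped like driver.get_cookies() output; the values are invented.
sample_cookies = [
    {"name": "JSESSIONID", "value": "ABC123", "domain": "example.com"},
    {"name": "proxy_token", "value": "xyz", "domain": "example.com"},
]

headers = {"User-Agent": "Mozilla/5.0"}
headers["Cookie"] = cookies_to_header(sample_cookies)
print(headers["Cookie"])  # JSESSIONID=ABC123; proxy_token=xyz
```

Using `"; ".join(...)` avoids the manual trimming of the trailing separator. Another option, untested against this site, would be to copy each cookie into the `requests.Session` with `session.cookies.set(name, value)` and download the images through that session.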
