Python: manually adding a urlretrieve file-download method to the requests module!
The requests module is the successor to the urllib module. For passing in headers, cookies, data and the like, requests is definitely the more convenient of the two, but it has no counterpart to urllib.request.urlretrieve:
urlretrieve(url, filename=None, reporthook=None, params=None)  (the stdlib version takes data= where my port will take params=)
Pass in a url and a file path and the file gets downloaded. With requests you have to write that download loop by hand every single time, which I find too much trouble, and urlretrieve also supports a progress callback. So I wanted to see whether I could port this urlretrieve method over to the requests module.
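For reference, this is roughly how the standard library version is called; the hook function and the example URL below are made up for illustration:

import sys
from urllib.request import urlretrieve

def hook(blocknum, bs, size):
    # called once before the first block, then once after every block read
    sys.stdout.write('%d / %d bytes\r' % (blocknum * bs, size))

urlretrieve('https://example.com/a.jpg', filename='a.jpg', reporthook=hook)  # hypothetical URL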
Key points: 1. How do you find the Python module you want on disk? Type path in cmd, pick out the Python directory from the output, open it, then Ctrl+F to search for the module.
2. Downloading a file boils down to: open the URL with contextlib.closing ---> with open the target file ---> write.
3. The reporthook callback boils down to: while the file is being written chunk by chunk, three arguments (the bytes written per block, the block count, and the total size taken from the headers) are passed out each time for the callback to handle.
4. It turns out that in today's packages, the methods are written in other .py files and then pulled into __init__.py (see the sketch after this list).
5. How to use r.iter_content().
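As a small illustration of point 4, here is a minimal sketch of that pattern; the names mypkg, tools.py and download are invented for the example:

# mypkg/tools.py -- the implementation lives here
def download(url, filename):
    ...

# mypkg/__init__.py -- re-export it, so callers can write mypkg.download(...)
from .tools import download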
Go into the urllib folder, open request.py, and find the urlretrieve method. Here it is in full:
def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """
    url_type, path = splittype(url)  # parses the URL; ignore this

    with contextlib.closing(urlopen(url, data)) as fp:  # open the URL
        headers = fp.info()  # response headers

        # Just return the local path and the "headers" for file://
        # URLs. No sense in performing a copy unless requested.
        if url_type == "file" and not filename:
            return os.path.normpath(path), headers  # ignore

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')  # open the target file
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)  # ignore
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8  # bytes written per block
            size = -1
            read = 0
            blocknum = 0  # number of blocks written; blocknum * bs = bytes written so far
            if "content-length" in headers:
                size = int(headers["Content-Length"])  # size is the total file size

            if reporthook:
                reporthook(blocknum, bs, size)  # run the callback once before writing

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)  # write
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)  # run the callback once per block written

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
By contrast, the usual way to download a file with the requests module looks like this:
import requests
from contextlib import closing

# target and filename are defined elsewhere in the original script
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
with closing(requests.get(url=target, stream=True, headers=headers)) as r:
    with open('%d.jpg' % filename, 'ab+') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()
Now let's wrap that up as our own urlretrieve:
import contextlib
import requests

def urlretrieve(url, filename=None, reporthook=None, params=None):
    '''Download url to filename, using contextlib.closing and iter_content.'''
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
    with contextlib.closing(requests.get(url, stream=True, headers=headers,
                                         params=params)) as fp:  # open the URL
        header = fp.headers  # response headers
        with open(filename, 'wb+') as tfp:  # open the file ('w' truncates, 'a' would append)
            bs = 1024
            size = -1
            blocknum = 0
            if "content-length" in header:
                size = int(header["Content-Length"])  # nominal total size of the file
            if reporthook:
                reporthook(blocknum, bs, size)  # run the callback once before writing
            for chunk in fp.iter_content(chunk_size=bs):
                if chunk:
                    tfp.write(chunk)  # write one block
                    tfp.flush()
                    blocknum += 1
                    if reporthook:
                        reporthook(blocknum, bs, size)  # run the callback once per block written
Test it:
import sys

def Schedule(a, b, c):
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    sys.stdout.write("  %.2f%%  downloaded: %d  total size: %d" % (per, a * b, c) + '\r')
    sys.stdout.flush()

url = 'https://images.unsplash.com/photo-1503025768915-494859bd53b2?ixlib=rb-0.3.5&q=85&fm=jpg&crop=entropy&cs=srgb&dl=tommy-344440-unsplash.jpg&s=1382cd0338e13f6460ed68182d35cac9'
urlretrieve(url=url, filename='111.jpg', reporthook=Schedule)
OK, it works.
Now let's put this method into the requests module itself.
First, in the requests package folder, paste the method we just wrote at the very end of api.py,
and add import contextlib at the top as well.
Then in __init__.py, append urlretrieve to the names imported from api.
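For a typical requests 2.x layout, the edited import line would look roughly like this (the exact list of names differs between versions, so treat this as a sketch):

# requests/__init__.py
from .api import request, get, head, post, patch, put, delete, options, urlretrieve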
OK, now it can be called directly:
import requests, sys

def Schedule(a, b, c):
    per = 100.0 * a * b / c  # a = block count, b = bytes per block, c = total file size
    if per > 100:
        per = 100
    sys.stdout.write("  %.2f%%  downloaded: %d  total size: %d" % (per, a * b, c) + '\r')
    sys.stdout.flush()

url = 'https://images.unsplash.com/photo-1503025768915-494859bd53b2?ixlib=rb-0.3.5&q=85&fm=jpg&crop=entropy&cs=srgb&dl=tommy-344440-unsplash.jpg&s=1382cd0338e13f6460ed68182d35cac9'
requests.urlretrieve(url=url, filename='111.jpg', reporthook=Schedule)
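One caveat on this approach: edits inside site-packages disappear the next time requests is upgraded. A less invasive sketch, assuming you keep the urlretrieve function from above in a module of your own (mymodule here is hypothetical), is to attach it at runtime:

import requests
from mymodule import urlretrieve  # hypothetical module containing the urlretrieve above

requests.urlretrieve = urlretrieve  # monkey-patch at runtime instead of editing site-packages
requests.urlretrieve('https://example.com/a.jpg', filename='a.jpg')  # hypothetical URL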