Unzipping files in batches with Python
Problem description:
I have 2 TB of data, and I must unzip the files before I can run some analysis. However, because of limited hard-disk space, I cannot unzip everything at once. My idea is to unzip the first two thousand files, run the analysis, and then repeat with the next 2000. How can I do this?
import os, glob
import zipfile
root = 'C:\\Users\\X\\*'
directory = 'C:\\Users\\X'
extension = ".zip"
to_save = 'C:\\Users\\X\\to_save'
#x = os.listdir(path)[:2000]
for folder in glob.glob(root):
    if folder.endswith(extension):  # check for ".zip" extension
        try:
            print(folder)
            os.chdir(to_save)
            zipfile.ZipFile(os.path.join(directory, folder)).extractall(os.path.join(directory, os.path.splitext(folder)[0]))
        except:
            pass
Answer:
Maybe something like this:
import os
import glob
import zipfile
root = 'C:\\Users\\X\\*'
directory = 'C:\\Users\\X'
extension = ".zip"
to_save = 'C:\\Users\\X\\to_save'
# list comp of all '.zip' folders
folders = [folder for folder in glob.glob(root) if folder.endswith(extension)]
# only executes while there are folders remaining to be processed
while folders:
    # only grabs the next 2000 folders if there are at least that many
    if len(folders) >= 2000:
        temp = folders[:2000]
    # otherwise gets all the remaining (e.g. if 1152 were left)
    else:
        temp = folders[:]
    # list comp that rebuilds with elements not pulled into 'temp'
    folders = [folder for folder in folders if folder not in temp]
    # this was all your code, I just swapped 'x' in place of 'folder'
    for x in temp:
        try:
            print(x)
            os.chdir(to_save)
            zipfile.ZipFile(os.path.join(directory, x)).extractall(os.path.join(directory, os.path.splitext(x)[0]))
        except:
            pass
This builds a temporary list of the .zip files and then removes those elements from the original list. The only downside is that folders is modified, so if you need it anywhere else, it will eventually be empty.
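The same batching can also be sketched without mutating the list, by stepping through it in fixed-size slices. This is only a sketch using the paths assumed in the question; the analysis/cleanup step is left as a placeholder:

```python
import glob
import os
import zipfile

BATCH = 2000
root = 'C:\\Users\\X\\*'  # path as in the question

# collect every .zip path once
zips = [f for f in glob.glob(root) if f.endswith('.zip')]

# step through the list in slices of BATCH; the last slice is
# automatically shorter when fewer than BATCH items remain
for start in range(0, len(zips), BATCH):
    for path in zips[start:start + BATCH]:
        # extract next to the archive, into a folder named after it
        with zipfile.ZipFile(path) as zf:
            zf.extractall(os.path.splitext(path)[0])
    # ... run the analysis here, then delete the extracted files
    # to free disk space before the next batch ...
```

Slicing past the end of a list is safe in Python, so no explicit length check is needed for the final partial batch.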
Thank you very much for your answer. I found a different solution: saving the file paths to a csv instead of keeping them in a list – edyvedy13
Do you really think it's a duplicate? – edyvedy13
What I need is to get the first 2000 first, i.e. the files numbered 1-2000; then 2001-4000 – edyvedy13
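The CSV approach the commenter mentions can be sketched roughly as below. The file name zip_paths.csv and its location are hypothetical, and each row is assumed to hold a single path:

```python
import csv
import glob
import os
import tempfile

# hypothetical location for the path list; the data itself lives
# under C:\Users\X as in the question
csv_path = os.path.join(tempfile.gettempdir(), 'zip_paths.csv')

# record every .zip path once, one path per row
paths = [p for p in glob.glob('C:\\Users\\X\\*') if p.endswith('.zip')]
with open(csv_path, 'w', newline='') as f:
    csv.writer(f).writerows([p] for p in paths)

# later, read the file back and take rows 0-1999, then 2000-3999, etc.
with open(csv_path, newline='') as f:
    rows = [row[0] for row in csv.reader(f)]
first_batch = rows[:2000]
```

Persisting the list this way also means progress can survive a restart, since the remaining paths are on disk rather than only in memory.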