Unzipping a number of files iteratively with Python

Problem description:

I have 2 TB of data and I have to unzip the files to do some analysis. However, because of hard-disk space constraints, I can't unzip all of them at once. My idea is to unzip the first two thousand, run the analysis, and then repeat with the next 2000. How can I do this?

import os, glob 
import zipfile 


root = 'C:\\Users\\X\\*' 
directory = 'C:\\Users\\X' 
extension = ".zip" 
to_save = 'C:\\Users\\X\\to_save' 

#x = os.listdir(path)[:2000] 
for folder in glob.glob(root):
    if folder.endswith(extension):  # check for ".zip" extension
        try:
            print(folder)
            os.chdir(to_save)
            zipfile.ZipFile(os.path.join(directory, folder)).extractall(os.path.join(directory, os.path.splitext(folder)[0]))
        except:
            pass

Do you really think it's a duplicate? – edyvedy13


What I need is to get the first 2000, i.e. the files listed at positions 1–2000; then 2001–4000 – edyvedy13

Something like this?:

import os 
import glob 
import zipfile 

root = 'C:\\Users\\X\\*' 
directory = 'C:\\Users\\X' 
extension = ".zip" 
to_save = 'C:\\Users\\X\\to_save' 

# list comp of all '.zip' folders 
folders = [folder for folder in glob.glob(root) if folder.endswith(extension)] 

# only executes while there are folders remaining to be processed
while folders:
    # only grabs the next 2000 folders if there are at least that many
    if len(folders) >= 2000:
        temp = folders[:2000]
    # otherwise gets all the remaining (e.g. if 1152 were left)
    else:
        temp = folders[:]

    # list comp that rebuilds with elements not pulled into 'temp'
    folders = [folder for folder in folders if folder not in temp]

    # this was all your code, I just swapped 'x' in place of 'folder'
    for x in temp:
        try:
            print(x)
            os.chdir(to_save)
            zipfile.ZipFile(os.path.join(directory, x)).extractall(os.path.join(directory, os.path.splitext(x)[0]))
        except:
            pass

This builds a temporary list of the .zip files and then removes those elements from the original list. The only downside is that folders is modified in place, so if you need it somewhere else, it will eventually be empty.
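If you do need folders intact afterwards, a non-destructive variant (a sketch, not part of the answer above) is to walk the list in fixed-size slices instead of rebuilding it each pass:

```python
# Sketch: process a list in fixed-size batches without mutating it.
# `batches` is a hypothetical helper, not from the original answer.
def batches(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# example data standing in for the glob results
folders = [f"archive_{i}.zip" for i in range(5)]

for batch in batches(folders, 2):
    print(batch)  # the extract-and-analyse step would go here
```

Slicing past the end of a list is safe in Python, so the final short batch needs no special case, and folders is left untouched for later use.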


Thanks a lot for your answer. I found a different solution: saving the file paths to a CSV instead of keeping them in a list – edyvedy13
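The commenter didn't share their CSV-based solution; a minimal sketch of what it might look like (the file name and data here are assumptions) is to write every zip path to a CSV once, then read the rows back and slice off a batch at a time:

```python
import csv

# Hypothetical bookkeeping: persist the zip paths to a CSV,
# then read them back so batches survive between runs.
paths = ["a.zip", "b.zip", "c.zip"]  # stand-ins for glob results

with open("zip_paths.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for p in paths:
        writer.writerow([p])

with open("zip_paths.csv", newline="") as f:
    rows = [row[0] for row in csv.reader(f)]

print(rows[:2])  # first batch of (here) two paths
```

Unlike the in-memory list, the CSV persists between runs, so the analysis can be resumed after each batch without re-globbing the 2 TB directory.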