Splitting a large CSV file with Python
I have a csv file with 5 million rows. I want to split the file into a user-specified number of rows per file.
I have written the following code, but it takes too long to run. Can anyone help me optimize it?
import csv
print "Please delete the previously created files, if any."
filepath = raw_input("Enter the File path: ")
line_count = 0
filenum = 1
try:
    in_file = raw_input("Enter Input File name: ")
    if in_file[-4:] == ".csv":
        split_size = int(raw_input("Enter size: "))
        print "Split Size ---", split_size
        print in_file, "will split into", split_size, "rows per file named as OutPut-file_*.csv (* = 1,2,3 and so on)"
        with open(in_file, 'r') as file1:
            row_count = 0
            reader = csv.reader(file1)
            for line in file1:
                # print line
                with open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a") as out_file:
                    if row_count < split_size:
                        out_file.write(line)
                        row_count = row_count + 1
                    else:
                        # note: this boundary line is never written, so one row is lost per file
                        filenum = filenum + 1
                        row_count = 0
                line_count = line_count + 1
        print "Total Files Written --", filenum
    else:
        print "Please enter the name of the file correctly."
except IOError as e:
    print "Oops..! Please enter correct file path values", e
except ValueError:
    print "Oops..! Please enter correct values"
I have also tried it myself without the `with open`.
Oops! You re-open the output file for every single line, and that is an expensive operation... Your code could become:
...
with open(in_file, 'r') as file1:
    row_count = 0
    # reader = csv.reader(file1)  # unused here
    out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
    for line in file1:
        # print line
        if row_count >= split_size:
            out_file.close()
            filenum = filenum + 1
            out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
            row_count = 0
        out_file.write(line)
        row_count = row_count + 1
        line_count = line_count + 1
...
Ideally, you should even initialize out_file = None before the try block, and make sure the cleanup does a clean if out_file is not None: out_file.close().
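For illustration, a minimal sketch of that cleanup pattern (the names reuse those of the code above; exactly where the try/finally sits in the full script is an assumption):

out_file = None
try:
    out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "a")
    # ... the splitting loop from above goes here ...
finally:
    # the guard protects against open() itself having failed
    if out_file is not None:
        out_file.close()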
Remark: this code only splits on line count (like yours does). That means it will give wrong output if the csv file can contain newlines inside quoted fields...
Oh.. in that case.. I need to check for newlines, right? – user2597209
@user2597209: if you want to allow newlines in quoted fields, you will have to parse the input file with a csv reader and write the rows with a csv writer, or do the parsing by hand, but that is complex, with many corner cases. –
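To make the csv reader/writer idea concrete, here is a minimal sketch of a row-based splitter (my own hedged illustration, not code from the thread; it reuses the OutPut-file_*.csv naming from the question and opens files in binary mode as the Python 2 csv module expects):

import csv

def split_csv(in_file, filepath, split_size):
    # csv.reader yields logical records, so newlines inside quoted fields are handled
    with open(in_file, 'rb') as file1:
        reader = csv.reader(file1)
        filenum = 1
        row_count = 0
        out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "wb")
        writer = csv.writer(out_file)
        for row in reader:
            if row_count >= split_size:
                out_file.close()
                filenum = filenum + 1
                out_file = open(filepath + "\\OutPut-file_" + str(filenum) + ".csv", "wb")
                writer = csv.writer(out_file)
                row_count = 0
            writer.writerow(row)
            row_count = row_count + 1
        out_file.close()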
You can definitely use python's multiprocessing module here.
These are the results I got with a csv file of 10 lakh (1,000,000) rows.
import time
from multiprocessing import Pool

def saving_csv_normally(start):
    out_file = open('out_normally/' + str(start / batch_size) + '.csv', 'w')
    for i in range(start, start + batch_size):
        out_file.write(arr[i])
    out_file.close()

def saving_csv_multi(start):
    out_file = open('out_multi/' + str(start / batch_size) + '.csv', 'w')
    for i in range(start, start + batch_size):
        out_file.write(arr[i])
    out_file.close()

def saving_csv_multi_async(start):
    out_file = open('out_multi_async/' + str(start / batch_size) + '.csv', 'w')
    for i in range(start, start + batch_size):
        out_file.write(arr[i])
    out_file.close()

with open('files/test.csv') as file:
    arr = file.readlines()
print "length of file : ", len(arr)

batch_size = 100  # split size in number of rows
start = time.time()
for i in range(0, len(arr), batch_size):
    saving_csv_normally(i)
print "time taken normally : ", time.time() - start

# multiprocessing
p = Pool()
start = time.time()
p.map(saving_csv_multi, range(0, len(arr), batch_size), chunksize=len(arr) / 4)  # chunksize can be tuned as you like
print "time taken for multiprocessing : ", time.time() - start

# the same thing, done asynchronously
start = time.time()
for i in p.imap_unordered(saving_csv_multi_async, range(0, len(arr), batch_size), chunksize=len(arr) / 4):
    continue
print "time taken for multiprocessing async : ", time.time() - start
The output below shows the time taken by each approach:
length of file : 1000000
time taken normally : 0.733881950378
time taken for multiprocessing : 0.508712053299
time taken for multiprocessing async : 0.471592903137
I defined three separate functions because a function passed to p.map can take only one argument, and I store the csv files in three different folders; that is why there are three functions.
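If the duplication bothers you, one possible workaround (a sketch, assuming the same arr, batch_size, and pool p as above) is to bundle the output folder and the start index into a single tuple, since the function given to p.map still receives exactly one argument per call:

def saving_csv(args):
    # unpack the single map() argument: (output folder, start index)
    folder, start = args
    out_file = open(folder + '/' + str(start / batch_size) + '.csv', 'w')
    for i in range(start, start + batch_size):
        out_file.write(arr[i])
    out_file.close()

p.map(saving_csv, [('out_multi', i) for i in range(0, len(arr), batch_size)])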
Care to use some more conventional units than lakhs? ;) – liborm
What about seeking to different points in the file with separate file pointers, and processing them all in parallel via coroutines/gevent? – SRC
I have not tried that yet.. could you please help with the same? Multithreading or multitasking would be helpful here. – user2597209
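Here is a rough sketch of what I read the seek idea to be (my own interpretation, not code from the thread): precompute byte offsets snapped to line boundaries, then let each worker seek to its own offset and copy its slice, so no single process has to stream the whole file. Note that this splits into a fixed number of byte-sized chunks rather than a fixed number of rows, and like the row-counting answers it assumes no newlines inside quoted fields:

import os
from multiprocessing import Pool

def find_offsets(path, num_chunks):
    # carve the file into num_chunks byte ranges, each moved forward to the next newline
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, 'rb') as f:
        for k in range(1, num_chunks):
            f.seek(size * k / num_chunks)
            f.readline()  # skip the partial line so every chunk starts on a row boundary
            offsets.append(f.tell())
    offsets.append(size)
    return offsets

def copy_slice(args):
    path, start, end, out_name = args
    with open(path, 'rb') as f, open(out_name, 'wb') as out:
        f.seek(start)
        out.write(f.read(end - start))

offsets = find_offsets('files/test.csv', 4)
jobs = [('files/test.csv', offsets[k], offsets[k + 1], 'part_%d.csv' % k)
        for k in range(len(offsets) - 1)]
Pool().map(copy_slice, jobs)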