Compare two CSV files and search for similar items
I have two CSV files that I am trying to compare in order to get the matching items. The first file, hosts.csv, looks like this:
Path Filename Size Signature
C:\ a.txt 14kb
D:\ b.txt 99kb 678910
C:\ c.txt 44kb 111213
The second file, masterlist.csv, looks like this:
Filename Signature
b.txt 678910
x.txt 111213
b.txt 777777
c.txt 999999
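As you can see, the rows do not line up, and masterlist.csv is always larger than hosts.csv. The only part I want to search on is the Signature column. I know the comparison would look something like this: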
hosts[3] == masterlist[1]
I am looking for a solution that gives me something like the following (basically the hosts.csv file with a new RESULTS column):
Path Filename Size Signature RESULTS
C:\ a.txt 14kb NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)
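I searched through the existing posts and found a similar question, but I don't quite understand it, since I am still learning Python.
Edit: I am using Python 2.6.
Edit: While my solution works correctly, see Martijn's answer below for a more efficient one.
You can find the documentation for Python's csv module in the official Python docs.
What you are looking for is something like this: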
    import csv

    f1 = file('hosts.csv', 'r')
    f2 = file('masterlist.csv', 'r')
    f3 = file('results.csv', 'w')

    c1 = csv.reader(f1)
    c2 = csv.reader(f2)
    c3 = csv.writer(f3)

    masterlist = list(c2)

    for hosts_row in c1:
        row = 1
        found = False
        for master_row in masterlist:
            results_row = hosts_row
            if hosts_row[3] == master_row[1]:
                results_row.append('FOUND in master list (row ' + str(row) + ')')
                found = True
                break
            row = row + 1
        if not found:
            results_row.append('NOT FOUND in master list')
        c3.writerow(results_row)

    f1.close()
    f2.close()
    f3.close()
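Python's csv and collections modules, specifically OrderedDict, are really useful here. OrderedDict preserves the order in which keys were inserted, so the output rows keep the order of hosts.csv. You don't have to use it, but it helps!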
    import csv
    from collections import OrderedDict

    signature_row_map = OrderedDict()

    with open('hosts.csv') as file_object:
        for line in csv.DictReader(file_object, delimiter='\t'):
            signature_row_map[line['Signature']] = {'line': line, 'found_at': None}

    with open('masterlist.csv') as file_object:
        for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
            if line['Signature'] in signature_row_map:
                signature_row_map[line['Signature']]['found_at'] = i

    with open('newhosts.csv', 'w') as file_object:
        fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
        writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
        writer.writer.writerow(fieldnames)
        for signature_info in signature_row_map.itervalues():
            result = '{0} FOUND in masterlist {1}'
            # explicit check for sentinel
            if signature_info['found_at'] is not None:
                result = result.format('', '(row %s)' % signature_info['found_at'])
            else:
                result = result.format('NOT', '')
            payload = signature_info['line']
            payload['RESULTS'] = result
            writer.writerow(payload)
Here is the output using the test CSV files:
Path Filename Size Signature RESULTS
C:\ a.txt 14kb NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)
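Please excuse the misalignment; the columns are tab-separated :)
I get an ImportError: cannot import name OrderedDict. I am using Python 2.6 and a portable version of Python 3. Is OrderedDict specific to 2.7 only? – serk 2011-03-11 05:17:15
Yes. You can change OrderedDict to dict() and it will work fine. – 2011-03-11 05:34:55
You can backport the 2.7 OrderedDict to 2.6. The module can be found here: http://hg.python.org/cpython/file/291bc0097cc1/Lib/collections/__init__.py – 2011-03-11 05:38:25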
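The dict() fallback suggested in these comments could be wired in with a guarded import. This is only a minimal sketch: the sample keys are taken from the question's data, and it assumes that losing insertion order on old interpreters is acceptable, in which case the output rows may come out in a different order than in hosts.csv.

```python
# Use OrderedDict where it exists; otherwise fall back to a plain dict.
try:
    from collections import OrderedDict
except ImportError:
    # Assumption: on an interpreter without OrderedDict (e.g. Python 2.6),
    # a plain dict is good enough, at the cost of row order in the output.
    OrderedDict = dict

signature_row_map = OrderedDict()
signature_row_map['678910'] = {'found_at': None}
signature_row_map['111213'] = {'found_at': None}
```

The rest of the answer's code can stay unchanged, since it only relies on the mapping interface.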
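The csv module comes in handy for parsing CSV files. But just for fun, I am simply splitting the input on whitespace to get the data.
Parse the data and build a dict from the contents of masterlist.csv, with the signature as the key and the row number as the value. Then, for each row of hosts.csv, we can query the dict to find out whether a corresponding entry exists in masterlist.csv and, if so, on which row.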
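    #! /usr/bin/env python

    def read_data(filename):
        # Skip the header line, then split every remaining line on whitespace.
        input_source = open(filename, 'r')
        try:
            input_source.readline()
            return [line.split() for line in input_source]
        finally:
            input_source.close()

    if __name__ == '__main__':
        hosts = read_data('hosts.csv')
        masterlist = read_data('masterlist.csv')
        # Map each signature (last column) to its 1-based data row number.
        master = dict()
        for index, data in enumerate(masterlist):
            master[data[-1]] = index + 1
        for row in hosts:
            try:
                found = "FOUND in masterlist (row %s)" % master[row[-1]]
            except KeyError:
                found = "NOT FOUND in masterlist"
            line = row + [found]
            # Join the fields instead of using a fixed five-field format:
            # a row without a signature splits into only three fields.
            print(" ".join(line))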
The answer by srgerg is very inefficient, because it runs in quadratic time; here is a linear-time solution instead, using Python 2.6-compatible syntax:
    import csv

    # Map each signature to its row number (the header is row 0).
    with open('masterlist.csv', 'rb') as master:
        master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

    with open('hosts.csv', 'rb') as hosts:
        with open('results.csv', 'wb') as results:
            reader = csv.reader(hosts)
            writer = csv.writer(results)

            # Copy the header and add the new column.
            writer.writerow(next(reader, []) + ['RESULTS'])

            for row in reader:
                index = master_indices.get(row[3])
                if index is not None:
                    # '{0}' rather than '{}': auto-numbered fields need Python 2.7+.
                    message = 'FOUND in master list (row {0})'.format(index)
                else:
                    message = 'NOT FOUND in master list'
                writer.writerow(row + [message])
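This first builds a dictionary from masterlist.csv that maps each signature to its row number. A dictionary lookup takes constant time, so the cost of the second loop over the rows of hosts.csv does not depend on the number of rows in masterlist.csv. Not to mention that the code is simpler.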
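To see the build-once, look-up-many idea in isolation, here is a small in-memory sketch. It is written for Python 3 and assumes comma-separated input; the `MASTER` and `HOSTS` strings, and the helper names `build_index` and `annotate`, are made up for illustration using the question's sample data.

```python
import csv
import io

# Sample data from the question, assumed here to be comma-separated.
MASTER = ("Filename,Signature\n"
          "b.txt,678910\nx.txt,111213\nb.txt,777777\nc.txt,999999\n")
HOSTS = ("Path,Filename,Size,Signature\n"
         "C:\\,a.txt,14kb,\nD:\\,b.txt,99kb,678910\nC:\\,c.txt,44kb,111213\n")

def build_index(master_text):
    """Map each signature to its row number (header is row 0)."""
    reader = csv.reader(io.StringIO(master_text))
    return dict((row[1], i) for i, row in enumerate(reader))

def annotate(hosts_text, index):
    """Return the hosts rows with a RESULTS column appended."""
    reader = csv.reader(io.StringIO(hosts_text))
    out = [next(reader, []) + ['RESULTS']]
    for row in reader:
        found = index.get(row[3])  # one constant-time lookup per hosts row
        if found is not None:
            out.append(row + ['FOUND in master list (row {0})'.format(found)])
        else:
            out.append(row + ['NOT FOUND in master list'])
    return out

results = annotate(HOSTS, build_index(MASTER))
for line in results:
    print(line)
```

The master list is scanned once to build the index and the hosts file is scanned once to annotate it, so the total work grows with the sum of the two file sizes rather than their product.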
This is nice. Using csv.DictReader might be clearer, since you could replace 'master_row[1]' with 'master_row['signature']'. – chmullig 2011-03-11 04:50:38
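A rough sketch of what the DictReader variant might look like, written for Python 3 with an in-memory, comma-separated sample (both are assumptions, as is the `MASTER` string). Note that the key has to match the header text exactly, which is 'Signature' with a capital S in the sample files.

```python
import csv
import io

# Hypothetical comma-separated excerpt of masterlist.csv.
MASTER = "Filename,Signature\nb.txt,678910\nx.txt,111213\n"

# DictReader consumes the header line itself, so start counting at 1
# to number the data rows.
reader = csv.DictReader(io.StringIO(MASTER))
master_indices = dict((row['Signature'], i) for i, row in enumerate(reader, 1))
```

Looking columns up by name means the code keeps working if the column order in the file changes.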
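This produces a blank line after every result. – serk 2011-03-11 05:17:36
The blank-line problem is system dependent. If you get a blank line after every result, replace the line 'f3 = file('results.csv', 'w')' with 'f3 = file('results.csv', 'wb')'. – srgerg 2011-03-11 05:36:04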
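For anyone reading this on Python 3 (the thread itself targets 2.6): the csv module works on text there, so the 'wb' fix does not carry over. The csv documentation instead recommends opening the file with newline=''. A minimal sketch, where the temporary output path is purely illustrative:

```python
import csv
import os
import tempfile

# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'results.csv')

# newline='' stops the text layer from translating the '\r\n' line
# endings that csv.writer emits, which is what causes the extra blank
# lines on Windows.
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Path', 'RESULTS'])
    writer.writerow(['D:\\', 'FOUND in master list (row 1)'])

# Read the raw bytes back to inspect the line endings that were written.
with open(path, 'rb') as f:
    raw = f.read()
```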