Convert a text file to CSV in Python

Problem description:

I have a text file made up of different dictionaries, and it looks like this:

{"destination.fqdn": "194-65-57-128.ctt.pt", "feed.provider": "MyFeed", "source.abuse_contact": "[email protected]", "raw": "bWFsd2FyZSwyMTAuMjguNTYuMSxodHRwOi8vd3d3LmN0dC5wdCAsMTk0LTY1LTU3LTEyOC5jdHQucHQsY29pc2FzQGZvby5jb20sMTk0LjIzOS4xNjcuNSx3d3cudmVyeWJhZC5jb20gLHZlcnkudmVyeWJhZC5jb20sLCwsMjAxMC0wMi0xOFQwMDowMDowMCswMDowMA0K", "feed.name": "FileCollector", "destination.geolocation.latitude": 32.2109, "destination.geolocation.cc": "CN", "source.geolocation.longitude": 12.069, "event_description.text": "ctt", "source.ip": "194.239.167.5", "source.geolocation.city": "Frederikssund", "destination.geolocation.city": "Zhenjiang", "destination.url": "http://www.ctt.pt", "classification.taxonomy": "malicious code", "source.url": "http://www.verybad.com", "source.fqdn": "very.verybad.com", "feed.url": "file://localhost/opt/intelmq/teste_ip_url_fqdn.csv", "feed.accuracy": 100.0, "time.observation": "2017-07-18T13:15:48+00:00", "destination.geolocation.longitude": 119.4551, "source.geolocation.latitude": 55.8396, "classification.type": "malware", "destination.ip": "210.28.56.1", "time.source": "2010-02-18T00:00:00+00:00", "source.geolocation.cc": "DK"} 
{"destination.url": "http://www2.ctt.pt", "classification.taxonomy": "malicious code", "source.url": "http://www.telecom.pt", "feed.provider": "MyFeed", "time.observation": "2017-07-18T13:15:48+00:00", "destination.fqdn": "ctt-pt.mail.protection.outlook.com", "source.abuse_contact": "[email protected]", "source.geolocation.cc": "TN", "feed.url": "file://localhost/opt/intelmq/teste_ip_url_fqdn.csv", "raw": "YyZjLDI1MS4xNTQuNjUuOSxodHRwOi8vd3d3Mi5jdHQucHQsY3R0LXB0Lm1haWwucHJvdGVjdGlvbi5vdXRsb29rLmNvbSxjb2lzYXM3QGZvby5jb20sMTk3LjEzLjEwNS44LHd3dy50ZWxlY29tLnB0LCwsLCwyMDEwLTAyLTE4VDAwOjAwOjAwKzAwOjAwDQo=", "feed.name": "FileCollector", "classification.type": "c&c", "source.geolocation.latitude": 34.0, "source.geolocation.longitude": 9.0, "destination.ip": "251.154.65.9", "event_description.text": "ctt", "source.ip": "197.13.105.8", "time.source": "2010-02-18T00:00:00+00:00", "feed.accuracy": 100.0} 

Each line is a dictionary, and some dictionaries have more keys than others. I would like to convert the text file into a CSV file.

I have the following code:

import json
import csv
import ast

def json_to_csv(txt_file, csv_file):

    lista = []
    with open(txt_file, 'rb') as fin:
        lines = fin.readlines()
        for line in lines:
            dict_line = ast.literal_eval(line)
            lista.append(line)
    list_json = json.dumps(lista)

    read_json = json.loads(list_json)

    header = ["feed.accuracy", "feed.url", "source.geolocation.longitude", "event_description.text", "raw", "destination.geolocation.city", "source.ip", "classification.taxonomy",
              "time.observation", "destination.geolocation.latitude", "destination.ip", "source.asn", "feed.name", "source.geolocation.latitude", "time.source", "feed.provider",
              "destination.geolocation.longitude", "destination.geolocation.cc", "destination.asn", "source.abuse_contact", "source.geolocation.cc", "classification.type"]
    with open(csv_file, 'wb+') as f:
        dict_writer = csv.DictWriter(f, header)
        dict_writer.writeheader()
        dict_writer.writerows(read_json)

First I read the text file, then I convert its contents to JSON, and then I try to write the converted data to the CSV file, but it returns the following error:

Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    json_to_csv('ctt.txt','ctt.csv')
  File "C:/Users/Marisa/Documents/json_to_csv.py", line 26, in json_to_csv
    dict_writer.writerows(read_json)
  File "C:\Python27\lib\csv.py", line 157, in writerows
    rows.append(self._dict_to_list(rowdict))
  File "C:\Python27\lib\csv.py", line 148, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: u'{', u'"', u'f', u'e', u'e', u'd', u'.', u'a', u'c', u'c', u'u', u'r', u'a', u'c', u'y', u'"', u':', u' ', u'1', u'0', u'0', u'.', u'0', u',', u' ', u'"', u'c', u'l', u'a', u's', u's', u'i', u'f', u'i', u'c', u'a', u't', u'i', u'o', u'n', u'.', u't', u'a', u'x',... 

What do you expect the csv file to look like when you're done? –


Why dump the list to a JSON string and then immediately load it back? Isn't that a no-op? Also, depending on where the input file comes from, it would be better to parse each line of the file with the json library rather than with ast. –


If you load it into pandas and then export that pandas frame to csv, you don't have to fix anything by hand, but you will end up with a sparse-matrix-style csv – Nullman
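A minimal sketch of that pandas route (an assumption on top of the comment above, not code from the thread), reusing the ctt.txt / ctt.csv names from the question:

import pandas as pd

# Each line of the text file is one JSON object; lines=True reads it line by line.
# Dictionaries that lack some keys simply get NaN in the missing columns.
df = pd.read_json('ctt.txt', lines=True)

# Missing values are written as empty cells, which gives the "sparse" layout mentioned above.
df.to_csv('ctt.csv', index=False)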

You're making this a little more complicated than it needs to be, and you're missing some fields from your own sample data above. We can drop the ast dependency and the back-and-forth JSON handling, add the missing fields, and the following will work with the sample data you provided:

import json
import csv

def json_to_csv(txt_file, csv_file):

    lista = []
    with open(txt_file, 'r') as in_file:
        lines = in_file.readlines()
        for line in lines:
            try:
                dict_line = json.loads(line)
                lista.append(dict_line)
            except Exception as err:
                print(err)

    header = [
        "feed.accuracy", "feed.url", "source.geolocation.longitude",
        "event_description.text", "raw", "destination.geolocation.city",
        "source.ip", "classification.taxonomy", "time.observation",
        "destination.geolocation.latitude", "destination.ip", "source.asn",
        "feed.name", "source.geolocation.latitude", "time.source",
        "feed.provider", "destination.geolocation.longitude",
        "destination.geolocation.cc", "destination.asn",
        "source.abuse_contact", "source.geolocation.cc", "classification.type",
        "destination.fqdn", "source.fqdn", "source.geolocation.city",
        "source.url", "destination.url"
    ]
    with open(csv_file, 'w+') as out_file:
        dict_writer = csv.DictWriter(out_file, header)
        dict_writer.writeheader()
        dict_writer.writerows(lista)

Note that if your real data has more fields that aren't included in the sample, you'll need to add those as well.
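If keeping that header list in sync by hand becomes tedious, a possible variation (a sketch, not part of the answer above) is to build the header from the union of all keys that actually appear in the data:

import json
import csv

def json_to_csv_dynamic(txt_file, csv_file):
    # Parse each non-empty line into a dictionary.
    rows = []
    with open(txt_file, 'r') as in_file:
        for line in in_file:
            if line.strip():
                rows.append(json.loads(line))

    # Collect every key seen in any row; sorting keeps the column order stable.
    header = sorted(set(key for row in rows for key in row))

    with open(csv_file, 'w+') as out_file:
        dict_writer = csv.DictWriter(out_file, header)
        dict_writer.writeheader()
        dict_writer.writerows(rows)

Because every field present in the data ends up in the header, DictWriter can never complain about unknown fieldnames.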

Also note that if your input data were a proper JSON array, like:

[{"destination.fqdn": "194-65-57-128.ctt.pt", "feed.provider": "MyFeed", "source.abuse_contact": "[email protected]", "raw": "bWFsd2FyZSwyMTAuMjguNTYuMSxodHRwOi8vd3d3LmN0dC5wdCAsMTk0LTY1LTU3LTEyOC5jdHQucHQsY29pc2FzQGZvby5jb20sMTk0LjIzOS4xNjcuNSx3d3cudmVyeWJhZC5jb20gLHZlcnkudmVyeWJhZC5jb20sLCwsMjAxMC0wMi0xOFQwMDowMDowMCswMDowMA0K", "feed.name": "FileCollector", "destination.geolocation.latitude": 32.2109, "destination.geolocation.cc": "CN", "source.geolocation.longitude": 12.069, "event_description.text": "ctt", "source.ip": "194.239.167.5", "source.geolocation.city": "Frederikssund", "destination.geolocation.city": "Zhenjiang", "destination.url": "http://www.ctt.pt", "classification.taxonomy": "malicious code", "source.url": "http://www.verybad.com", "source.fqdn": "very.verybad.com", "feed.url": "file://localhost/opt/intelmq/teste_ip_url_fqdn.csv", "feed.accuracy": 100.0, "time.observation": "2017-07-18T13:15:48+00:00", "destination.geolocation.longitude": 119.4551, "source.geolocation.latitude": 55.8396, "classification.type": "malware", "destination.ip": "210.28.56.1", "time.source": "2010-02-18T00:00:00+00:00", "source.geolocation.cc": "DK"}, 
{"destination.url": "http://www2.ctt.pt", "classification.taxonomy": "malicious code", "source.url": "http://www.telecom.pt", "feed.provider": "MyFeed", "time.observation": "2017-07-18T13:15:48+00:00", "destination.fqdn": "ctt-pt.mail.protection.outlook.com", "source.abuse_contact": "[email protected]", "source.geolocation.cc": "TN", "feed.url": "file://localhost/opt/intelmq/teste_ip_url_fqdn.csv", "raw": "YyZjLDI1MS4xNTQuNjUuOSxodHRwOi8vd3d3Mi5jdHQucHQsY3R0LXB0Lm1haWwucHJvdGVjdGlvbi5vdXRsb29rLmNvbSxjb2lzYXM3QGZvby5jb20sMTk3LjEzLjEwNS44LHd3dy50ZWxlY29tLnB0LCwsLCwyMDEwLTAyLTE4VDAwOjAwOjAwKzAwOjAwDQo=", "feed.name": "FileCollector", "classification.type": "c&c", "source.geolocation.latitude": 34.0, "source.geolocation.longitude": 9.0, "destination.ip": "251.154.65.9", "event_description.text": "ctt", "source.ip": "197.13.105.8", "time.source": "2010-02-18T00:00:00+00:00", "feed.accuracy": 100.0}] 

the solution would simplify quite a bit, and the whole initial with open block would just become:

with open(txt_file, 'r') as in_file: 
    lista = json.load(in_file) 

Thanks @Feneric! Yes, if the text file were a proper JSON array I wouldn't have much trouble. I tried your code, but it gives me the following error: 'UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 12'. Do you know what might be causing this? – mf370


Your full dataset probably contains non-ASCII characters. You'll need to determine the correct character set for your data and set it accordingly, or (if there are only a few non-ASCII characters) track down the ones causing trouble and remove them. I added the note about a proper JSON array because it's a very simple edit: put a comma at the end of each line, replace the last one with a closing bracket, and insert an opening bracket at the beginning. Sometimes the easiest thing is just to massage the data. – Feneric
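One way to handle the non-ASCII characters instead of removing them (a sketch only, assuming the input file is UTF-8 and targeting the Python 2.7 shown in the traceback; the function name is made up here) is to be explicit about the character set when reading and to encode the values before writing:

# -*- coding: utf-8 -*-
import io
import json
import csv

def json_to_csv_utf8(txt_file, csv_file, header):
    # Read the file as UTF-8 so non-ASCII characters arrive as unicode text.
    rows = []
    with io.open(txt_file, 'r', encoding='utf-8') as in_file:
        for line in in_file:
            if line.strip():
                rows.append(json.loads(line))

    # Python 2's csv module writes byte strings, so encode text values to UTF-8 first.
    encoded_rows = [
        {key: (value.encode('utf-8') if isinstance(value, unicode) else value)
         for key, value in row.items()}
        for row in rows
    ]

    with open(csv_file, 'wb') as out_file:
        dict_writer = csv.DictWriter(out_file, header)
        dict_writer.writeheader()
        dict_writer.writerows(encoded_rows)

On Python 3 the manual encoding step is unnecessary; opening both files with an explicit encoding='utf-8' (and newline='' for the CSV output) covers the same ground.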