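How can I speed up parsing Apache logs with RegEx into a Pandas DataFrame?

Question:

I am writing a script that parses large (400 MB) Apache log files into a pandas DataFrame.

My old laptop needs about 2 minutes to parse the Apache log file with this script. Can it be made faster?

The Apache log file is structured like this: IP - - [timestamp] "GET ... method" HTTP status code bytes "address" "user agent". For example: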

93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0" 

My code uses the regex findall function. I also tested the match and search methods, but they seemed to be slower.

import re
import pandas as pd

reg_dic = {
    "ip" : r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
    "timestamp" : r'\[\d+\/\w+\/\d+\:\d+\:\d+\:\d+\s\+\d+\]',
    "method" : r'"(.*?)"',
    "httpstatus" : r'\s\d{1,3}\s',
    "bytes_" : r'\s\d+\s\"',
    "adress" : r'\d\s\"(.*?)"',
    "useragent" : r'\"\s\"(.*?)"'
}

# Assumed setup: df starts out empty and file holds the path to the log file.
df = pd.DataFrame()

# Note: the log file is re-read once for every pattern in reg_dic.
for name, reg in reg_dic.items():
    item_list = []
    with open(file) as f_obj:
        for line in f_obj:
            item = re.findall(reg, line)
            item = item[0]
            if name == "bytes_":
                item = item.replace("\"", "")
            item = item.strip()
            item_list.append(item)
    df[name] = item_list
    del item_list

See this Python demo (https://ideone.com/LLW3Uf) and this regex demo (https://regex101.com/r/UOtsAL/1). If your log lines always have the same format, this should be fast and safe.
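The linked demo is not reproduced here. As a rough sketch of the general idea (compile one pattern once and read the data in a single pass, rather than once per field as in the question's loop), something like the following could work. The pattern, the `parse_lines` helper, and the sample data are assumptions made for this illustration, not the contents of the demo.

```python
import re

# Assumed pattern for the log layout shown in the question;
# one named group per field, compiled once up front.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[^"]*)" (?P<httpstatus>\d{3}) (?P<bytes_>\S+) '
    r'"(?P<adress>[^"]*)" "(?P<useragent>[^"]*)"'
)

def parse_lines(lines):
    """Return one dict per matching line; lines that do not match are skipped."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            rows.append(m.groupdict())
    return rows

# Two sample lines: the one from the question and one that should be skipped.
sample = [
    '93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 '
    '"http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) '
    'Gecko/20100101 Firefox/54.0"\n',
    'not a log line\n',
]
rows = parse_lines(sample)
print(rows[0]["ip"], rows[0]["httpstatus"], rows[0]["bytes_"])
```

With a real file, the open file object can be passed to `parse_lines` directly, and the resulting list of dicts can be handed to `pd.DataFrame(rows)` to build all columns in one step.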

You can use str.extract with the expand parameter set to True, so that it returns a DataFrame built from the extracted data. Hope it helps.

Example df:

import pandas as pd

# Three identical sample lines; each string literal must stay on one line.
df = pd.DataFrame({"log": [
    '93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"',
    '93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"',
    '93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"',
]})

This is based on an improved version of @Wiktor Stribiżew's regex:

ws = r'^(?P<ip>[\d.]+)(?:\s+\S+){2}\s+\[(?P<timestamp>[\w:/\s+]+)\]\s+"(?P<method>[^"]+)"\s+(?P<httpstatus>\d+)\s+(?P<bytes>\d+)\s+(?P<adress>"[^"]+")\s+(?P<useragent>"[^"]+")$'

new = df['log'].str.extract(ws,expand=True) 

Output:

 
             ip                   timestamp              method httpstatus  \
0  93.185.11.11  13/Aug/2016:05:34:12 +0200  GET /v1/con?from=…        200
1  93.185.11.11  13/Aug/2016:05:34:12 +0200  GET /v1/con?from=…        200
2  93.185.11.11  13/Aug/2016:05:34:12 +0200  GET /v1/con?from=…        200

  bytes               adress  \
0   575  "http://google.com"
1   575  "http://google.com"
2   575  "http://google.com"

                                           useragent
0  "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ...
1  "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ...
2  "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ...
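The adress and useragent columns in the output above keep their surrounding double quotes, because the quotes sit inside the capture groups, and every extracted column is a string. A possible variant, sketched under the assumption that neither is wanted: move the quotes outside the groups and convert the numeric fields afterwards. The names `logs`, `unquoted`, and `parsed` are made up for this example, and real logs may contain values (such as "-" for the byte count) that this pattern would not match.

```python
import pandas as pd

line = ('93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 '
        '"http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) '
        'Gecko/20100101 Firefox/54.0"')
logs = pd.DataFrame({"log": [line, line]})

# Same structure as the pattern in the answer, except that the quotes around
# the last two fields are matched outside the capture groups (assumed tweak).
unquoted = (r'^(?P<ip>[\d.]+)(?:\s+\S+){2}\s+\[(?P<timestamp>[\w:/\s+]+)\]\s+'
            r'"(?P<method>[^"]+)"\s+(?P<httpstatus>\d+)\s+(?P<bytes>\d+)\s+'
            r'"(?P<adress>[^"]+)"\s+"(?P<useragent>[^"]+)"$')

parsed = logs["log"].str.extract(unquoted, expand=True)

# str.extract yields strings; convert the numeric fields explicitly.
for col in ("httpstatus", "bytes"):
    parsed[col] = pd.to_numeric(parsed[col])

print(parsed[["adress", "httpstatus", "bytes"]])
```

Rows that do not match the pattern come back as missing values, so checking for them after the extract is a cheap way to spot lines in an unexpected format.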

I don't think we need that much RegEx for this simple task:

import pandas as pd

fn = r'D:\temp\.data\46620093.log'
cols = ['ip','l','userid','timestamp','tz','request','status','bytes','referer','useragent']

# Split on runs of whitespace; double-quoted fields stay in one piece.
df = pd.read_csv(fn, sep=r'\s+', names=cols).drop(columns='l')

This gives us:

In [179]: df
Out[179]:
             ip userid              timestamp      tz             request  \
0  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…
1  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…
2  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…

   status  bytes            referer  \
0     200    575  http://google.com
1     200    575  http://google.com
2     200    575  http://google.com

                                           useragent
0  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
1  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
2  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
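To see why ten column names are needed and why the bracketed timestamp ends up in two columns, the split can be reproduced on a single line with the standard library's csv module. This is only an illustration of the quoting behaviour, not what read_csv does internally: csv.reader splits on single spaces, while the whitespace separator in read_csv treats runs of whitespace as one delimiter. The `fields` and `record` names are made up for the example.

```python
import csv

cols = ['ip', 'l', 'userid', 'timestamp', 'tz', 'request',
        'status', 'bytes', 'referer', 'useragent']

line = ('93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 '
        '"http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) '
        'Gecko/20100101 Firefox/54.0"')

# Split on spaces, treating each double-quoted run as a single field.
fields = next(csv.reader([line], delimiter=' ', quotechar='"'))
record = dict(zip(cols, fields))

print(len(fields))                           # number of fields found
print(record['timestamp'], record['tz'])     # the two halves of the timestamp
print(record['request'])
```

The space inside the square brackets is not protected by quotes, which is why the date and the UTC offset land in separate timestamp and tz fields. A user agent or request containing an escaped double quote may break this kind of split, so it is worth validating the column count on real data.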

Now we only need to concatenate timestamp and tz into one column and get rid of the [ and ]:

df['timestamp'] = df['timestamp'].str.replace(r'\[(\d+/\w+/\d+):(\d+:\d+:\d+)', r'\1 \2', regex=True) \
        + ' ' + df.pop('tz').str.strip('[]')

Result:

In [181]: df
Out[181]:
             ip userid                   timestamp             request  \
0  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…
1  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…
2  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…

   status  bytes            referer  \
0     200    575  http://google.com
1     200    575  http://google.com
2     200    575  http://google.com

                                           useragent
0  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
1  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
2  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
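For readers who want to check what the substitution does to one value, here is the same transformation on plain strings with the standard library, as a small sketch. The variable names are made up for the example; the pattern and replacement are the ones used in the answer.

```python
import re

raw_timestamp = '[13/Aug/2016:05:34:12'   # one value of the timestamp column
raw_tz = '+0200]'                         # the matching value of the tz column

# Drop the leading "[" and turn the colon between date and time into a space.
date_and_time = re.sub(r'\[(\d+/\w+/\d+):(\d+:\d+:\d+)', r'\1 \2', raw_timestamp)

# Remove the closing bracket from the offset, then join both parts.
offset = raw_tz.strip('[]')
combined = date_and_time + ' ' + offset
print(combined)
```

The first capture group holds the date and the second the time of day, so only the colon separating them is replaced.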

Note: we can easily convert it to the datetime dtype (UTC time, without time zone):

In [182]: pd.to_datetime(df['timestamp']) 
Out[182]: 
0 2016-08-13 03:34:12 
1 2016-08-13 03:34:12 
2 2016-08-13 03:34:12 
Name: timestamp, dtype: datetime64[ns]
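The output above comes from the pandas version in use when the answer was written; how an offset such as +0200 is handled when the format is inferred may differ between versions. A sketch that states the format and the UTC conversion explicitly, which avoids relying on inference, could look like this. The `stamps`, `aware`, and `naive_utc` names are assumptions for the example.

```python
import pandas as pd

# Values shaped like the combined timestamp column from the previous step.
stamps = pd.Series(['13/Aug/2016 05:34:12 +0200'] * 3, name='timestamp')

# Parse with an explicit format; utc=True converts every value to UTC.
aware = pd.to_datetime(stamps, format='%d/%b/%Y %H:%M:%S %z', utc=True)

# Drop the time zone to get plain UTC wall-clock times.
naive_utc = aware.dt.tz_localize(None)
print(naive_utc)
```

Giving the format up front also tends to help on large inputs, since the parser does not have to guess the layout of each value.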