How can I speed up parsing an Apache log with RegEx to populate a Pandas DataFrame?
Question:
I am writing a script that parses a large (400 MB) Apache log file into a pandas DataFrame. On my old laptop the script takes about two minutes to parse the file. Now I am wondering: can't it go faster?
The Apache log file is structured like this:
ip - - [timestamp] "GET ...method" httpstatus bytes "address" "useragent"
For example:
93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…" 200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"
My code uses regex findall. I also tested the match and search methods, but they seemed to be even slower.
import re
import pandas as pd

reg_dic = {
    "ip"         : r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
    "timestamp"  : r'\[\d+\/\w+\/\d+\:\d+\:\d+\:\d+\s\+\d+\]',
    "method"     : r'"(.*?)"',
    "httpstatus" : r'\s\d{1,3}\s',
    "bytes_"     : r'\s\d+\s\"',
    "adress"     : r'\d\s\"(.*?)"',
    "useragent"  : r'\"\s\"(.*?)"'
}

df = pd.DataFrame()  # filled column by column below

for name, reg in reg_dic.items():
    item_list = []
    with open(file) as f_obj:          # file: path to the Apache log
        for line in f_obj:
            item = re.findall(reg, line)
            item = item[0]
            if name == "bytes_":
                item = item.replace("\"", "")
            item = item.strip()
            item_list.append(item)
    df[name] = item_list
    del item_list
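One cost worth noting in the loop above: the 400 MB file is re-read from disk once per field, seven times in total. A minimal single-pass sketch, assuming the same reg_dic and file as above (not part of the original question), would compile each pattern once and walk the file only once:

import re
import pandas as pd

# Sketch (assumes reg_dic and file from the question): precompile the
# patterns and apply all of them per line in a single pass over the file.
compiled = {name: re.compile(reg) for name, reg in reg_dic.items()}
columns = {name: [] for name in reg_dic}

with open(file) as f_obj:
    for line in f_obj:
        for name, rx in compiled.items():
            item = rx.findall(line)[0]
            if name == "bytes_":
                item = item.replace('"', '')
            columns[name].append(item.strip())

df = pd.DataFrame(columns)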
Answer:
You can use str.extract with the parameter expand=True, so that it returns a DataFrame built from the extracted data. Hope it helps.
Example df:
df = pd.DataFrame({"log":['93.185.11.11 - - [13/Aug/2016:05:34:12
+0200] "GET /v1/con?from=…" 200 575 "http://google.com" "Mozilla/5.0
(Windows NT 6.2; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"',
'93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…"
200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64;
rv:54.0) Gecko/20100101 Firefox/54.0"',
'93.185.11.11 - - [13/Aug/2016:05:34:12 +0200] "GET /v1/con?from=…"
200 575 "http://google.com" "Mozilla/5.0 (Windows NT 6.2; WOW64;
rv:54.0) Gecko/20100101 Firefox/54.0"']})
Here is an improved regex based on @Wiktor Stribiżew's:
ws = '^(?P<ip>[\d.]+)(?:\s+\S+){2}\s+\[(?P<timestamp>[\w:/\s+]+)\]\s+"(?P<method>[^"]+)"\s+(?P<httpstatus>\d+)\s+(?P<bytes>\d+)\s+(?P<adress>"[^"]+")\s+(?P<useragent>"[^"]+")$'
new = df['log'].str.extract(ws, expand=True)
Output:
             ip                   timestamp              method httpstatus  \
0  93.185.11.11  13/Aug/2016:05:34:12 +0200  GET /v1/con?from=…        200
1  93.185.11.11  13/Aug/2016:05:34:12 +0200  GET /v1/con?from=…        200
2  93.185.11.11  13/Aug/2016:05:34:12 +0200  GET /v1/con?from=…        200

  bytes               adress  \
0   575  "http://google.com"
1   575  "http://google.com"
2   575  "http://google.com"

                                           useragent
0  "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ...
1  "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ...
2  "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) ...
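The example df above is built by hand for demonstration; to run the same extraction over the real log file, one option (an assumption, not shown in the original answer) is to load the raw lines into a single-column frame first:

import pandas as pd

# Hypothetical: read the raw log lines into one 'log' column, then extract.
with open(file) as f_obj:                       # file: path to the Apache log
    df = pd.DataFrame({"log": f_obj.read().splitlines()})

new = df['log'].str.extract(ws, expand=True)    # ws: the pattern defined above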
Answer:
I don't think we need that many regexes for this simple task:
fn = r'D:\temp\.data\46620093.log'
cols = ['ip','l','userid','timestamp','tz','request','status','bytes','referer','useragent']
# delim_whitespace splits on runs of whitespace but still honors quoted fields,
# so request, referer and useragent survive as single columns.
df = pd.read_csv(fn, delim_whitespace=True, names=cols).drop('l', axis=1)
This gives us:
In [179]: df
Out[179]:
             ip userid              timestamp      tz             request  \
0  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…
1  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…
2  93.185.11.11      -  [13/Aug/2016:05:34:12  +0200]  GET /v1/con?from=…

   status  bytes            referer  \
0     200    575  http://google.com
1     200    575  http://google.com
2     200    575  http://google.com

                                           useragent
0  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
1  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
2  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
Now we just need to concatenate timestamp and tz into one column and get rid of the [ and ]:
# On newer pandas, regex replacement requires regex=True;
# str.strip takes a plain set of characters, so '[]' is enough.
df['timestamp'] = df['timestamp'].str.replace(r'\[(\d+/\w+/\d+):(\d+:\d+:\d+)',
                                              r'\1 \2', regex=True) \
                  + ' ' + df.pop('tz').str.strip('[]')
Result:
In [181]: df
Out[181]:
             ip userid                   timestamp             request  \
0  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…
1  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…
2  93.185.11.11      -  13/Aug/2016 05:34:12 +0200  GET /v1/con?from=…

   status  bytes            referer  \
0     200    575  http://google.com
1     200    575  http://google.com
2     200    575  http://google.com

                                           useragent
0  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
1  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
2  Mozilla/5.0 (Windows NT 6.2; WOW64; rv:54.0) G...
Note: we can easily convert it to a datetime dtype (UTC time, without a timezone):
In [182]: pd.to_datetime(df['timestamp'])
Out[182]:
0 2016-08-13 03:34:12
1 2016-08-13 03:34:12
2 2016-08-13 03:34:12
Name: timestamp, dtype: datetime64[ns]
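If you would rather keep the timezone information than drop it, passing utc=True (my addition; not in the original answer) makes pd.to_datetime return timezone-aware UTC values instead:

# utc=True localizes the parsed values, giving dtype datetime64[ns, UTC].
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)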
See this Python demo (https://ideone.com/LLW3Uf) and the regex demo (https://regex101.com/r/UOtsAL/1). If your log lines always have the same format, this should be fast and safe. –