Regular expression - deleting lines that match on multiple criteria

I have some sample lines of data below....
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 132189
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 103461
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 464787
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 463747
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 434485
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
I am trying to use a regular expression to find lines that match all of the following....

- duplicate IP
- duplicate time
- duplicate filename

I then want to delete those lines so that only one of each remains. The last field (the file size) differs slightly between them; otherwise I could simply remove fully duplicated lines.

I have managed to match the IP (http://regexr.com?2vv5c) but that is as far as I have got. Can anyone help?
UPDATE: In response to the comments, given the original data below....
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 132189
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 103461
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 464787
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 463747
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 434485
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
the following should remain....
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 103461
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 434485
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
Given your sample data, you can do the following for each line:
String subString = line.substring(0, line.lastIndexOf(" "));
Here 'line' is a single line read from your sample data, and 'subString' is the String left after removing the file size from the end of the line. Now you can easily compare all such 'subString' values and remove the duplicates.
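The approach described above could be sketched as follows. This is a minimal sketch, not the original poster's code: the class and method names are made up for illustration, and it keeps the first line seen for each key (the question's expected output sometimes keeps a later duplicate, so adjust to taste):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LogDedup {

    // Strip the trailing file size and use the rest of the line
    // (ip + timestamp + path + protocol + status) as the dedup key.
    public static List<String> dedup(List<String> lines) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String key = line.substring(0, line.lastIndexOf(" "));
            // Set.add returns false when the key was already present,
            // so only the first line with each key is kept.
            if (seen.add(key)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0\" 200 132189",
            "100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0\" 200 106866",
            "100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0\" 200 106866");
        // Prints the first and third lines: the second shares its key
        // with the first and is dropped.
        for (String line : dedup(sample)) {
            System.out.println(line);
        }
    }
}
```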
It looks like sort -u can get you the desired result:
$ sort -k1,6 -u < test.txt
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 132189
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 464787
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
The idea is to define fields 1-6 as the key and ask sort to return a unique list. It then returns only one entry per key (ip + time + file). Uniqueness is determined by the key definition, not by the whole line.
Here is a Perl one-liner that does the job:
perl -ane '($id)=$_=~/^(.*) HTTP/;print unless exists $seen{$id};$seen{$id}=1;' logfile.log
It takes the string from the beginning of the line (^) up to HTTP as the key of the hash %seen.
It prints the current line only if the key does not already exist in the hash.
Output:
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 132189
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 464787
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
This may be a silly question, but for this Perl version, how do I load the txt file of data in the first place? – fightstarr20 2012-02-09 13:50:15
@fightstarr20: It is just an argument; I named it 'logfile.log' in my answer. You have to run: 'perl -ane' all the perl code 'file_to_be_treated' – Toto 2012-02-09 13:53:37
Some examples of what should match and what should not match are always helpful. – aioobe 2012-02-09 09:57:22
Do you actually need the file size? If not, you could easily strip it off and then use 'sort' and 'uniq' to filter out the duplicates. – beerbajay 2012-02-09 10:21:31