Adding sequential numbering to matching text

Question:

I have a file that currently looks like this:

>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204 
>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204 
>ENSOFAS001264_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359 
>ENSOFAS001264_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359 
>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594 
>ENSOFAS004569_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341 

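Focusing on the text between the > and | symbols, I want to append a sequential number for each matching ENSOFAS numeric ID. In other words, I want to end up with this: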

>ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204 
>ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204 
>ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359 
>ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359 
>ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594 
>ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341 

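I can search for these lines in TextWrangler with grep (>ENSOFAS(\d+)_p(+)\r), but as far as I know a text editor cannot add the numbers after _p the way I need. I think the search part on macOS/Linux might be grep -E ">ENSOFAS[0-9]\{6\}_p\s|", but I don't know how to insert the number between _p and the whitespace before the |. Matching ENSOFAS IDs are not grouped together in the file, but I can sort it in some way if that is needed.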

An AWK approach:

awk '{ $1=$1""++a[$1] }1' file 

Output:

>ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204 
>ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204 
>ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359 
>ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359 
>ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594 
>ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341 

An alternative approach using AWK's sub() function:

awk '{ sub(/$/,++a[$1],$1) }1' file 
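To illustrate why the sub() variant works: the regex /$/ matches the empty string at the end of the target, so substituting there appends the replacement to field 1. The sketch below wraps the same command in a function and runs it on a shortened, made-up input. The function name number_with_sub and the "|x" tail are placeholders for illustration, not part of the original answer.

```shell
# Sketch only: number_with_sub is a hypothetical wrapper name.
number_with_sub() {
    awk '{
        # ++a[$1] uses the original field 1 as the key and yields the running count.
        # /$/ matches the empty string at the end of $1, so sub() appends that count.
        sub(/$/, ++a[$1], $1)
    } 1'
}

# Shortened, made-up input; "|x" stands in for the real header tail.
printf '%s\n' '>ENSOFAS001369_p |x' '>ENSOFAS001264_p |x' '>ENSOFAS001369_p |x' | number_with_sub
```

Because the counter is kept per ID in the array, the input does not need to be sorted first.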

The concise awk approach is perfect!

If awk is an option in your setup:

$ awk '{cnt[$1]++; $1=$1""cnt[$1]; print}' file 
>ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204 
>ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204 
>ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359 
>ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359 
>ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594 
>ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341 

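Explanation: $1 holds the first field of each line, e.g. >ENSOFAS001369_p. We use the associative array cnt to count the occurrences of each unique tag found in $1, and modify the field $1 (before it is printed) to include the current count for the record/line being processed.

The awk script could be shortened, but in this form it is probably easier to read and understand.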
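One caveat: the commands above number every line of the input. If the real file is a FASTA file with sequence lines between the headers (an assumption, since the question only shows header lines), a variant along these lines might be safer. It is a minimal sketch that only touches lines beginning with ">"; the function name number_headers and the sample input are hypothetical.

```shell
# Sketch only: number_headers is a hypothetical wrapper name.
number_headers() {
    awk '
        # Treat only lines starting with ">" as headers and number them per ID.
        /^>/ { cnt[$1]++; $1 = $1 "" cnt[$1] }
        # Print every line; non-header lines pass through unchanged.
        { print }
    '
}

# Made-up input: two headers with the same ID, each followed by a sequence line.
printf '%s\n' '>ENSOFAS001369_p |x' 'ACGT' '>ENSOFAS001369_p |x' 'ACGT' | number_headers
```

To apply this to a file, something like number_headers < file > file.numbered would write the result to a new file and leave the original untouched.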