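Extracting specific strings with a regular expression
Question:
I have a data frame that contains strings and their POS tags. I want to extract specific strings by filtering on specific POS tags.
As a simple example, I want to extract the strings whose tags are based on "NN-NN-NN" and "VB-JJ-NN".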
df <- data.frame(word = c("abrasion process management",
"slurries comprise abrasive",
"slurry compositions comprise ",
"keep high polishing",
"improved superabrasive grit",
"using ceriacoated silica",
"and grinding",
"for cmp",
"and grinding for"),
pos_tag = c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB-JJ-NN",
"VBN-JJ-NN", "VBG-JJ-NN", "CC-VBG", "IN-NN", "CC-VBG-IN"))
> df
word pos_tag
1 abrasion process management NN-NN-NN
2 slurries comprise abrasive NNS-NN-NN
3 slurry compositions comprise NN-NNS-NN
4 keep high polishing VB-JJ-NN
5 improved superabrasive grit VBN-JJ-NN
6 using ceriacoated silica VBG-JJ-NN
7 and grinding CC-VBG
8 for cmp IN-NN
9 and grinding for CC-VBG-IN
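I tried defining my patterns with a regular expression, but I don't think this is an efficient way to define them. Is there a more efficient approach?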
library(magrittr)  # provides the %>% pipe used below
pos <- c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB.-JJ-NN", "VB-JJ-NN")
pos2 <- paste0('^', pos , "\\w*$", collapse = '|')
sort_string <- df[grep(pos2, df$pos_tag),] %>%
  unique()
Here is what I want to get:
word pos_tag
1 abrasion process management NN-NN-NN
2 slurries comprise abrasive NNS-NN-NN
3 slurry compositions comprise NN-NNS-NN
4 keep high polishing VB-JJ-NN
5 improved superabrasive grit VBN-JJ-NN
6 using ceriacoated silica VBG-JJ-NN
Answer
You don't need a regular expression for this. One possibility is to use the amatch function from the stringdist package:
vec <- c("NN-NN-NN", "VB-JJ-NN")
library(stringdist)
df[!!amatch(df$pos_tag, vec, maxDist = 1, nomatch = 0),]
This gives:
word pos_tag
1 abrasion process management NN-NN-NN
2 slurries comprise abrasive NNS-NN-NN
3 slurry compositions comprise NN-NNS-NN
4 keep high polishing VB-JJ-NN
5 improved superabrasive grit VBN-JJ-NN
6 using ceriacoated silica VBG-JJ-NN
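To see why exactly these six rows come back with maxDist = 1, it can help to look at the edit distances directly. The sketch below is not part of the original answer: it uses base R's adist(), which computes plain Levenshtein distance, whereas amatch defaults to the "osa" method. The assumption here is that the two agree for these particular tags, since none of them differs from a pattern by a swap of adjacent characters.

```r
# Tags from the question and the two target patterns from the answer.
pos_tag <- c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB-JJ-NN",
             "VBN-JJ-NN", "VBG-JJ-NN", "CC-VBG", "IN-NN", "CC-VBG-IN")
vec <- c("NN-NN-NN", "VB-JJ-NN")

# adist() (from utils, attached by default) returns a matrix of
# Levenshtein distances: one row per tag, one column per pattern.
d <- adist(pos_tag, vec)
dimnames(d) <- list(pos_tag, vec)
d

# A tag is kept when it is within one edit of at least one pattern.
keep <- apply(d, 1, min) <= 1
pos_tag[keep]
```

Tags such as "NNS-NN-NN" or "VBG-JJ-NN" are one inserted character away from a pattern, while "CC-VBG", "IN-NN" and "CC-VBG-IN" are further away from both.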
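What this does:
- amatch(df$pos_tag, vec, maxDist = 1, nomatch = 0) checks whether the values in df$pos_tag match the values in vec within a specified tolerance for differences.
- In this case I used a maximum allowed edit distance of 1 character: maxDist = 1
- Negating twice with !! creates a logical vector that indicates whether a pos_tag (nearly) matches one of the values in vec. An alternative is:
df[amatch(df$pos_tag, vec, maxDist = 1, nomatch = 0) > 0,]
You can also do this in base R with agrep/agrepl in combination with sapply/lapply and rowSums/unlist:
# method 1:
df[rowSums(sapply(vec, function(x) agrepl(x, df$pos_tag, max.distance = 1))) > 0,]
# method 2:
df[unlist(lapply(vec, function(x) agrep(x, df$pos_tag, max.distance = 1))),]
Both will give you the same result.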
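The double negation and the > 0 comparison are interchangeable because both turn the integer positions returned by amatch into a logical vector. A minimal base R sketch, where idx is a hand-written, hypothetical stand-in for the amatch result on this data (position of the matched pattern in vec, or 0 for no match):

```r
# Hypothetical amatch()-style result for the nine rows of df:
# 1 or 2 = index of the matched pattern in vec, 0 = no match (nomatch = 0).
idx <- c(1L, 1L, 1L, 2L, 2L, 2L, 0L, 0L, 0L)

# Negating twice coerces integers to logical:
# 0 becomes FALSE, any other number becomes TRUE.
via_negation <- !!idx

# The explicit comparison yields the same logical vector.
via_compare <- idx > 0

identical(via_negation, via_compare)
```

This is also why nomatch = 0 matters: with the default of NA, the logical index would contain NA values, and subsetting a data frame with NA produces rows filled with NA.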
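In your expected output you have 'NNS-NN-NN'; the pattern is not clear. – akrun
The question is not very clear. Let me see if I understand: you want to take the "i"-th element from word and match it with the "i"-th element in pos_tag, writing the rows from 1 to "i" to a file or the console, where "i" is the loop index. You also want to print the row numbers. Is that what you want? – Heto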