Extracting specific strings using regular expressions

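Problem description:

I have a data frame containing strings and their POS tags. I want to extract specific strings by filtering on specific POS tags.

As a simple example, I want to extract the strings whose tags are based on "NN-NN-NN" and "VB-JJ-NN".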

df <- data.frame(word = c("abrasion process management", 
          "slurries comprise abrasive", 
          "slurry compositions comprise ", 
          "keep high polishing", 
          "improved superabrasive grit", 
          "using ceriacoated silica", 
          "and grinding", 
          "for cmp", 
          "and grinding for"), 
       pos_tag = c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB-JJ-NN", 
          "VBN-JJ-NN", "VBG-JJ-NN", "CC-VBG", "IN-NN", "CC-VBG-IN")) 

> df
                          word   pos_tag
1  abrasion process management  NN-NN-NN
2   slurries comprise abrasive NNS-NN-NN
3 slurry compositions comprise NN-NNS-NN
4          keep high polishing  VB-JJ-NN
5  improved superabrasive grit VBN-JJ-NN
6     using ceriacoated silica VBG-JJ-NN
7                 and grinding    CC-VBG
8                      for cmp     IN-NN
9             and grinding for CC-VBG-IN

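I tried defining my patterns with a regular expression, but I don't think this is an efficient way to define them. Is there a more efficient approach?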

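library(magrittr)  # provides the %>% pipe used below

pos <- c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB.-JJ-NN", "VB-JJ-NN") 
# anchor each alternative with ^ and $ and join them with "|"
pos2 <- paste0('^', pos , "\\w*$", collapse = '|') 
sort_string <- df[grep(pos2, df$pos_tag),] %>% 
       unique()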

Here is what I want to get:

                          word   pos_tag
1  abrasion process management  NN-NN-NN
2   slurries comprise abrasive NNS-NN-NN
3 slurry compositions comprise NN-NNS-NN
4          keep high polishing  VB-JJ-NN
5  improved superabrasive grit VBN-JJ-NN
6     using ceriacoated silica VBG-JJ-NN
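
For comparison, the five alternatives in the attempt above could probably be folded into a single anchored pattern. The sketch below assumes a rule the question does not state outright: that a noun tag may carry an optional plural "S" in the first two positions, and that a verb tag may carry one optional uppercase suffix letter (VBN, VBG, and so on). The names `pattern` and `result` are illustrative only.

```r
# Sample data from the question
df <- data.frame(
  word = c("abrasion process management", "slurries comprise abrasive",
           "slurry compositions comprise ", "keep high polishing",
           "improved superabrasive grit", "using ceriacoated silica",
           "and grinding", "for cmp", "and grinding for"),
  pos_tag = c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB-JJ-NN",
              "VBN-JJ-NN", "VBG-JJ-NN", "CC-VBG", "IN-NN", "CC-VBG-IN"),
  stringsAsFactors = FALSE
)

# Assumed rule: optional plural "S" on the first two noun tags,
# optional one-letter suffix on the verb tag. ^ and $ force a full match.
pattern <- "^(NNS?-NNS?-NN|VB[A-Z]?-JJ-NN)$"

# grepl() returns one TRUE/FALSE per row, so no unique() is needed
result <- df[grepl(pattern, df$pos_tag), ]
print(result)
```

Because the pattern is anchored and spells out which characters may vary, a tag such as "IN-NN-NN" is rejected even though it differs from "NN-NN-NN" by a single character.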

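In the expected output you have 'NNS-NN-NN'. The pattern is not clear. – akrun

The question is not very clear. Let me see if I understand: you want to take the "i"-th element of word and match it with the "i"-th element of pos_tag, writing to a file or the console the rows from 1 to "i", where "i" is the loop index. You also want to print the row numbers. Is that what you want? – Heto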

Answer

You don't need a regular expression for this. One possibility is to use the amatch function from the stringdist package:

vec <- c("NN-NN-NN", "VB-JJ-NN") 

library(stringdist) 
df[!!amatch(df$pos_tag, vec, maxDist = 1, nomatch = 0),]

This gives:

                          word   pos_tag
1  abrasion process management  NN-NN-NN
2   slurries comprise abrasive NNS-NN-NN
3 slurry compositions comprise NN-NNS-NN
4          keep high polishing  VB-JJ-NN
5  improved superabrasive grit VBN-JJ-NN
6     using ceriacoated silica VBG-JJ-NN

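What this does:

amatch(df$pos_tag, vec, maxDist = 1, nomatch = 0) checks whether each value in df$pos_tag matches one of the values in vec, within a specified tolerance for differences.
In this case I allowed a maximum edit distance of 1 character: maxDist = 1.
The double negation, !!, turns the result into a logical vector indicating whether pos_tag (nearly) matches one of the values in vec. An alternative that gives the same result is: df[amatch(df$pos_tag, vec, maxDist = 1, nomatch = 0) > 0,]

You can also do this in base R with agrep/agrepl in combination with sapply/lapply and rowSums/unlist:
```
# method 1: one logical column per pattern, keep rows where any column is TRUE
df[rowSums(sapply(vec, function(x) agrepl(x, df$pos_tag, max.distance = 1))) > 0,] 

# method 2: collect the indices of the matching rows for each pattern
df[unlist(lapply(vec, function(x) agrep(x, df$pos_tag, max.distance = 1))),] 
```
Both give the same result.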
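
To see why a tolerance of 1 keeps rows 1 to 6 and drops the rest, it can help to look at the edit distances directly. The sketch below uses base R's adist() as a stand-in for the distance that amatch computes. That is an assumption: adist() uses generalized Levenshtein distance, whereas amatch defaults to the "osa" method as far as I know. The two should agree here because none of these tags differ by a transposition. The names `d` and `within_one` are illustrative only.

```r
# POS tags from the question and the two target patterns from the answer
pos_tag <- c("NN-NN-NN", "NNS-NN-NN", "NN-NNS-NN", "VB-JJ-NN",
             "VBN-JJ-NN", "VBG-JJ-NN", "CC-VBG", "IN-NN", "CC-VBG-IN")
vec <- c("NN-NN-NN", "VB-JJ-NN")

# Distance matrix: one row per tag, one column per target pattern
d <- adist(pos_tag, vec)
dimnames(d) <- list(pos_tag, vec)
print(d)

# A tag is kept when its closest target is at most 1 edit away
within_one <- apply(d, 1, min) <= 1
print(pos_tag[within_one])
```

Each kept tag differs from its nearest target by at most one inserted letter (the extra "S", "N", or "G"), while the shorter or unrelated tags are several edits away.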

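Wow, this really is an efficient approach, and I am now using it in my workflow. But it exposes another problem: my data frame also contains POS tags such as "IN-NN-NN", "VB-RB-IN", and "NN-IN-NN". (Sorry, I didn't show all the POS tags in my example df.) So when I use amatch(), it also extracts other patterns that I don't want. – Eva

@Eva It depends on how many 'wrong' matches you have, but one possibility is to filter them out with `%in% c("IN-NN-NN", "VB-RB-IN", "NN-IN-NN")` – Jaap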
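
A minimal sketch of that kind of exclusion filter follows. The tag vector is a made-up sample that mixes tags from the question with the extra tags from the comment above, and base R's adist() again stands in for amatch so that the example runs without extra packages. Both are assumptions, as are the names `tags`, `exclude`, `near`, and `keep`.

```r
# Hypothetical sample: wanted tags mixed with the unwanted ones from the comment
tags <- c("NN-NN-NN", "NNS-NN-NN", "IN-NN-NN",
          "NN-IN-NN", "VB-RB-IN", "VBG-JJ-NN")
vec     <- c("NN-NN-NN", "VB-JJ-NN")
exclude <- c("IN-NN-NN", "VB-RB-IN", "NN-IN-NN")

# Fuzzy step: TRUE when a tag is within one edit of any target pattern
near <- apply(adist(tags, vec), 1, min) <= 1

# Exclusion step: drop tags that are explicitly listed as unwanted
keep <- near & !(tags %in% exclude)
print(tags[keep])
```

In this sample "IN-NN-NN" and "NN-IN-NN" each sit one substitution away from "NN-NN-NN", which is why a distance-only filter lets them through and the explicit exclusion is needed.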
