Double regular expression match across columns [R]

Problem description:

This is a follow-up to a question I asked yesterday: Partial string match two columns R

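The answer given there was great; however, I have found that many species are never mentioned directly. For example, tortoise is never named explicitly in the product data, but "exotic" is an acceptable match.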

dats<-data.frame(ID=c(1:4),species=c("dog","cat","rabbit","tortoise"), 
      species.descriptor=c("all animal dog","all animal cat","rabbit exotic","tortoise exotic"), 
      product=c(1,2,3,4),product.authorise=c("all animal dog cat rabbit","cat horse pig", 
      "dog cat","exotic")) 
dats 
    ID species species.descriptor product   product.authorise 
    1  dog  all animal dog  1 all animal dog cat rabbit 
    2  cat  all animal cat  2    cat horse pig 
    3 rabbit  rabbit exotic  3     dog cat 
    4 tortoise tortoise exotic  4     exotic 

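I came up with a solution that works by pasting $species.descriptor and $product.authorise together and then marking a row as "TRUE" if any given regular expression occurs two or more times in the combined field, like this: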

library(stringr) 
dats$bound<-paste(dats$product.authorise, dats$species.descriptor) 

species_descriptor<-c("all animal","dog","cat","rabbit","exotic","horse","pig","tortoise") 
species_descriptor<-setNames(nm=species_descriptor) 
result<-ifelse(sapply(species_descriptor, str_count, string=dats$bound)>=2,"TRUE","FALSE") 
result<-as.data.frame(result) 

result$AuthorisedCount<-apply(result[,1:ncol(result)],MARGIN=1,function(x){sum(x=="TRUE",na.rm=T)}) 
result$SpeciesAuthorised<-ifelse(result$AuthorisedCount>=1,"TRUE","FALSE") 

dats<-cbind(dats, result$SpeciesAuthorised) 
names(dats)[7]<-"SpeciesAuthorised" 
dats$bound<-NULL 

dats 
    ID species species.descriptor product   product.authorise SpeciesAuthorised 
    1  dog  all animal dog  1 all animal dog cat rabbit    TRUE 
    2  cat  all animal cat  2    cat horse pig    TRUE 
    3 rabbit  rabbit exotic  3     dog cat    FALSE 
    4 tortoise tortoise exotic  4     exotic    TRUE 

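This works well and runs quickly on a much larger data set; however, I realise there may be a more elegant way of doing it. Does anyone have any suggestions?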

A single sapply call on the bound variable produces the same result:

bound<-paste(dats$product.authorise, dats$species.descriptor) 
dats$SpeciesAuthorised <- as.logical(rowSums(sapply(species_descriptor, str_count, string=bound)>=2)) 
# ID species species.descriptor product   product.authorise SpeciesAuthorised 
# 1 1  dog  all animal dog  1 all animal dog cat rabbit    TRUE 
# 2 2  cat  all animal cat  2    cat horse pig    TRUE 
# 3 3 rabbit  rabbit exotic  3     dog cat    FALSE 
# 4 4 tortoise tortoise exotic  4     exotic    TRUE 
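
To see why the one-liner above works, it may help to look at the intermediate steps. This is only a sketch that rebuilds the two relevant columns from the question; the names counts, twice and authorised are introduced here for illustration and are not part of the original answer.

```r
library(stringr)

# Columns copied from the question's example data
dats <- data.frame(
  species.descriptor = c("all animal dog", "all animal cat",
                         "rabbit exotic", "tortoise exotic"),
  product.authorise  = c("all animal dog cat rabbit", "cat horse pig",
                         "dog cat", "exotic"),
  stringsAsFactors = FALSE
)
species_descriptor <- c("all animal", "dog", "cat", "rabbit",
                        "exotic", "horse", "pig", "tortoise")
species_descriptor <- setNames(nm = species_descriptor)

bound <- paste(dats$product.authorise, dats$species.descriptor)

# One row per record, one column per pattern: number of occurrences
counts <- sapply(species_descriptor, str_count, string = bound)

# Logical matrix: TRUE where a pattern occurs at least twice
twice <- counts >= 2

# rowSums() counts the TRUE values in each row; as.logical() turns
# 0 into FALSE and any positive count into TRUE
authorised <- as.logical(rowSums(twice))
```

Note that a pattern occurring twice within product.authorise alone would also be counted, so this relies on the descriptors not repeating inside a single field.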

Extending the post you mentioned, would this work?

dats$SpeciesAuthorised <- with(dats, 
           str_detect(species.descriptor, species) & 
            (str_detect(product.authorise, species) | str_detect(species.descriptor,product.authorise)) 
) 

I just added an OR operator to the expression so that it also detects the pattern from product.authorise within species.descriptor.

dats 
    ID species species.descriptor product   product.authorise SpeciesAuthorised 
1 1  dog  all animal dog  1 all animal dog cat rabbit    TRUE 
2 2  cat  all animal cat  2    cat horse pig    TRUE 
3 3 rabbit  rabbit exotic  3     dog cat    FALSE 
4 4 tortoise tortoise exotic  4     exotic    TRUE 
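
One thing worth keeping in mind with the str_detect(species.descriptor, product.authorise) call above: str_detect is vectorised over the pattern, so each product.authorise value is used, whole, as the pattern for the matching row. A small sketch of what that implies, where the second authorise value ("exotic bird") is a made-up example and not from the question's data:

```r
library(stringr)

descriptor <- c("tortoise exotic", "tortoise exotic")
authorise  <- c("exotic", "exotic bird")  # second value is hypothetical

# Element i of `authorise` is the pattern searched for in element i of
# `descriptor`; the whole string has to appear, not just one of its words
matched <- str_detect(descriptor, authorise)
matched
```

So the OR branch catches row 4 in the example because product.authorise is exactly "exotic"; a multi-word authorisation would presumably need to be split into separate terms first.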

You can use the any function to shorten the code:

bound <- paste(dats$product.authorise, dats$species.descriptor) 
result <- ifelse(sapply(species_descriptor, str_count, string=bound)>=2, TRUE, FALSE) 

dats$SpeciesAuthorised <- apply(result, 1, any) 

There is no need to store the result as the character strings "TRUE" and "FALSE"; use logical values instead.
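
As a side note on the logical values: the comparison >= 2 already returns a logical matrix, so the ifelse(..., TRUE, FALSE) wrapper should be optional. A minimal sketch, using a small hypothetical count matrix in place of the sapply/str_count result:

```r
# Hypothetical count matrix standing in for the sapply()/str_count() result
counts <- matrix(c(2, 0, 1,
                   1, 0, 2),
                 nrow = 3,
                 dimnames = list(NULL, c("dog", "exotic")))

flags   <- counts >= 2                       # already a logical matrix
wrapped <- ifelse(counts >= 2, TRUE, FALSE)  # same values, extra step

is.logical(flags)
all(flags == wrapped)

# any() applied per row: TRUE if at least one pattern occurs twice or more
row_any <- apply(flags, 1, any)
```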

Also, if you want to make the code cleaner and more readable, you can define your own function:

isSpeciesAuthorised = function(data, species_descriptor) { 
    bound <- paste(data$product.authorise, data$species.descriptor) 
    result <- ifelse(sapply(species_descriptor, str_count, string=bound)>=2, TRUE, FALSE) 

    return(apply(result, 1, any)) 
} 

and then use it:

dats$SpeciesAuthorised <- isSpeciesAuthorised(data=dats, species_descriptor)
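
If stringr is not available, roughly the same function can be sketched in base R. This is a variant, not the answer's code: count_fixed and is_species_authorised are hypothetical names, and the patterns are matched as literal strings (fixed = TRUE) rather than as regular expressions, which is equivalent for the plain-word descriptors used here.

```r
# Hypothetical helper: count literal, non-overlapping occurrences of
# `pattern` in each element of `string`
count_fixed <- function(pattern, string) {
  hits <- gregexpr(pattern, string, fixed = TRUE)
  # gregexpr() reports -1 for "no match", so count only positive positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Hypothetical base-R counterpart of isSpeciesAuthorised()
is_species_authorised <- function(data, species_descriptor) {
  bound <- paste(data$product.authorise, data$species.descriptor)
  counts <- vapply(species_descriptor, count_fixed,
                   integer(length(bound)), string = bound)
  # Force a matrix shape so a one-row data frame is handled as well
  dim(counts) <- c(length(bound), length(species_descriptor))
  rowSums(counts >= 2) > 0
}

# Example data copied from the question
dats <- data.frame(
  ID = 1:4,
  species = c("dog", "cat", "rabbit", "tortoise"),
  species.descriptor = c("all animal dog", "all animal cat",
                         "rabbit exotic", "tortoise exotic"),
  product = 1:4,
  product.authorise = c("all animal dog cat rabbit", "cat horse pig",
                        "dog cat", "exotic"),
  stringsAsFactors = FALSE
)
species_descriptor <- c("all animal", "dog", "cat", "rabbit",
                        "exotic", "horse", "pig", "tortoise")

dats$SpeciesAuthorised <- is_species_authorised(dats, species_descriptor)
```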