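Double regex match across columns [R]
Question:
This follows on from a question I asked yesterday: Partial string match two columns R
The answer given there was great; however, I found that many species are not mentioned directly. For example, tortoise is never named directly in the product data, but "exotic" is an acceptable match.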
dats<-data.frame(ID=c(1:4),species=c("dog","cat","rabbit","tortoise"),
species.descriptor=c("all animal dog","all animal cat","rabbit exotic","tortoise exotic"),
product=c(1,2,3,4),product.authorise=c("all animal dog cat rabbit","cat horse pig",
"dog cat","exotic"))
dats
  ID  species species.descriptor product         product.authorise
1  1      dog     all animal dog       1 all animal dog cat rabbit
2  2      cat     all animal cat       2             cat horse pig
3  3   rabbit      rabbit exotic       3                   dog cat
4  4 tortoise    tortoise exotic       4                    exotic
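I came up with a solution that works by pasting $species.descriptor and $product.authorise together, then marking a row as "TRUE" if any given regular expression appears two or more times in the combined field, like this: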
library(stringr)
dats$bound<-paste(dats$product.authorise, dats$species.descriptor)
species_descriptor<-c("all animal","dog","cat","rabbit","exotic","horse","pig","tortoise")
species_descriptor<-setNames(nm=species_descriptor)
result<-ifelse(sapply(species_descriptor, str_count, string=dats$bound)>=2,"TRUE","FALSE")
result<-as.data.frame(result)
result$AuthorisedCount<-apply(result[,1:ncol(result)],MARGIN=1,function(x){sum(x=="TRUE",na.rm=T)})
result$SpeciesAuthorised<-ifelse(result$AuthorisedCount>=1,"TRUE","FALSE")
dats<-cbind(dats, result$SpeciesAuthorised)
names(dats)[7]<-"SpeciesAuthorised"
dats$bound<-NULL
dats
  ID  species species.descriptor product         product.authorise SpeciesAuthorised
1  1      dog     all animal dog       1 all animal dog cat rabbit              TRUE
2  2      cat     all animal cat       2             cat horse pig              TRUE
3  3   rabbit      rabbit exotic       3                   dog cat             FALSE
4  4 tortoise    tortoise exotic       4                    exotic              TRUE
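This works well and runs quickly on a much larger dataset; however, I realise there may be a more elegant way of doing it. Does anyone have any suggestions?
Answer
Using your sapply function call and the bound variable, this produces the same result: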
bound<-paste(dats$product.authorise, dats$species.descriptor)
dats$SpeciesAuthorised <- as.logical(rowSums(sapply(species_descriptor, str_count, string=bound)>=2))
#   ID  species species.descriptor product         product.authorise SpeciesAuthorised
# 1  1      dog     all animal dog       1 all animal dog cat rabbit              TRUE
# 2  2      cat     all animal cat       2             cat horse pig              TRUE
# 3  3   rabbit      rabbit exotic       3                   dog cat             FALSE
# 4  4 tortoise    tortoise exotic       4                    exotic              TRUE
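To see why the `>= 2` threshold picks out the right rows, it can help to look at the intermediate count matrix before it is collapsed. The sketch below rebuilds the example data from the question (with `stringsAsFactors = FALSE` added as an assumption, so the columns are plain character vectors) and keeps the counts in a variable named `counts`, which is not part of the original answer.

```r
library(stringr)

# Example data from the question; stringsAsFactors = FALSE is an assumption
dats <- data.frame(
  ID = 1:4,
  species = c("dog", "cat", "rabbit", "tortoise"),
  species.descriptor = c("all animal dog", "all animal cat",
                         "rabbit exotic", "tortoise exotic"),
  product = 1:4,
  product.authorise = c("all animal dog cat rabbit", "cat horse pig",
                        "dog cat", "exotic"),
  stringsAsFactors = FALSE
)
species_descriptor <- setNames(nm = c("all animal", "dog", "cat", "rabbit",
                                      "exotic", "horse", "pig", "tortoise"))

bound <- paste(dats$product.authorise, dats$species.descriptor)

# One row per record, one column per term: how often each term occurs in bound
counts <- sapply(species_descriptor, str_count, string = bound)
print(counts)

# A record is authorised if at least one term occurs twice or more
authorised <- rowSums(counts >= 2) > 0
print(authorised)
```

In the tortoise row, the only term that reaches a count of 2 is "exotic", which is what makes that row match even though "tortoise" never appears in product.authorise.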
Answer
Extending the post you mentioned, would this work?
dats$SpeciesAuthorised <- with(dats,
str_detect(species.descriptor, species) &
(str_detect(product.authorise, species) | str_detect(species.descriptor,product.authorise))
)
I just added an OR operator to the function so that it also detects the product.authorise pattern inside species.descriptor.
dats
  ID  species species.descriptor product         product.authorise SpeciesAuthorised
1  1      dog     all animal dog       1 all animal dog cat rabbit              TRUE
2  2      cat     all animal cat       2             cat horse pig              TRUE
3  3   rabbit      rabbit exotic       3                   dog cat             FALSE
4  4 tortoise    tortoise exotic       4                    exotic              TRUE
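The str_detect approach above checks each column separately, whereas the pasted-string approach from the question pools both columns before counting. One consequence of pooling is that a term repeated within a single column also reaches a count of 2. The sketch below is one possible variant, not taken from the original posts: it keeps the term list from the question but tests each column on its own. The vector names (`terms`, `descriptor`, `authorise`) and the fifth row are hypothetical, added only to show the difference.

```r
library(stringr)

terms <- c("all animal", "dog", "cat", "rabbit", "exotic", "horse", "pig", "tortoise")

# The four rows from the question plus one hypothetical fifth row,
# where "cat" is repeated inside product.authorise only
descriptor <- c("all animal dog", "all animal cat", "rabbit exotic",
                "tortoise exotic", "dog")
authorise  <- c("all animal dog cat rabbit", "cat horse pig", "dog cat",
                "exotic", "cat cat")

# Logical matrices: does each term occur in each column? (fixed() = literal match)
in_descriptor <- sapply(terms, function(p) str_detect(descriptor, fixed(p)))
in_authorise  <- sapply(terms, function(p) str_detect(authorise, fixed(p)))

# Authorised only if some term is present in BOTH columns
per_column <- rowSums(in_descriptor & in_authorise) > 0

# Pooled counting, as in the question, for comparison
pooled <- rowSums(sapply(terms, str_count, string = paste(authorise, descriptor)) >= 2) > 0

print(data.frame(descriptor, authorise, per_column, pooled))
```

The two approaches agree on the four original rows and differ only on the hypothetical fifth one, where pooled counting reports a match and the per-column check does not. Whether that matters depends on whether repeated terms can occur in the real data.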
Answer
You can use the function any to shorten the code:
bound <- paste(dats$product.authorise, dats$species.descriptor)
result <- ifelse(sapply(species_descriptor, str_count, string=bound)>=2, TRUE, FALSE)
dats$SpeciesAuthorised <- apply(result, 1, any)
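This uses logical values directly, so there is no need to set the results to the character strings "TRUE" or "FALSE".
Also, if you want to make the code cleaner and more readable, you can define your own function: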
isSpeciesAuthorised = function(data, species_descriptor) {
  bound <- paste(data$product.authorise, data$species.descriptor)
  result <- ifelse(sapply(species_descriptor, str_count, string=bound)>=2, TRUE, FALSE)
  return(apply(result, 1, any))
}
and then use it:
dats$SpeciesAuthorised <- isSpeciesAuthorised(data=dats, species_descriptor)
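As a side note, the comparison `>= 2` already yields a logical matrix, so the `ifelse(..., TRUE, FALSE)` wrapper can probably be dropped without changing the result. A minimal sketch follows, using the hypothetical name `is_species_authorised` to keep it distinct from the function above; it assumes the data has more than one row, so that sapply returns a matrix.

```r
library(stringr)

# Hypothetical simplified variant: the comparison itself is already logical
is_species_authorised <- function(data, species_descriptor) {
  bound <- paste(data$product.authorise, data$species.descriptor)
  counts <- sapply(species_descriptor, str_count, string = bound)
  apply(counts >= 2, 1, any)
}

# Example data from the question; stringsAsFactors = FALSE is an assumption
dats <- data.frame(
  ID = 1:4,
  species = c("dog", "cat", "rabbit", "tortoise"),
  species.descriptor = c("all animal dog", "all animal cat",
                         "rabbit exotic", "tortoise exotic"),
  product = 1:4,
  product.authorise = c("all animal dog cat rabbit", "cat horse pig",
                        "dog cat", "exotic"),
  stringsAsFactors = FALSE
)
species_descriptor <- setNames(nm = c("all animal", "dog", "cat", "rabbit",
                                      "exotic", "horse", "pig", "tortoise"))

dats$SpeciesAuthorised <- is_species_authorised(dats, species_descriptor)
print(dats)
```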