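How to check whether a string contains a specific word in R
Problem description:
I have Simpsons data from kaggle.com that contains the title of each episode. I want to count how often each character's name is used in the titles. I can find exact words in the titles, but when I look for Homer, my code misses words such as Homers. Is there a way to do this?
Example data and my code: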
text <- 'title
Homer\'s Night Out
Krusty Gets Busted
Bart Gets an "F"
Two Cars in Every Garage and Three Eyes on Every Fish
Dead Putting Society
Bart the Daredevil
Bart Gets Hit by a Car
Homer vs. Lisa and the 8th Commandment
Oh Brother, Where Art Thou?
Old Money
Lisa\'s Substitute
Blood Feud
Mr. Lisa Goes to Washington
Bart the Murderer
Like Father, Like Clown
Saturdays of Thunder
Burns Verkaufen der Kraftwerk
Radio Bart
Bart the Lover
Separate Vocations
Colonel Homer'
simpsons <- read.csv(text = text, stringsAsFactors = FALSE)
library(stringr)
titlewords <- paste(simpsons$title, collapse = " ")
words <- c('Homer')
titlewords <- gsub("[[:punct:]]", "", titlewords)
HomerCount <- str_count(titlewords, paste(words, collapse=" "))
HomerCount
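Answer:
As an alternative to the good suggestions in the comments, you can also use the tidytext package. unnest_tokens() splits each title into one word per row and lower-cases it by default, which is why the pattern below is written as "homer"; grepl() then matches any token that contains it, possessive forms included.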
library(tidytext)
library(dplyr)
text <- 'title
Homer\'s Night Out
Krusty Gets Busted
Bart Gets an "F"
Two Cars in Every Garage and Three Eyes on Every Fish
Dead Putting Society
Bart the Daredevil
Bart Gets Hit by a Car
Homer vs. Lisa and the 8th Commandment
Oh Brother, Where Art Thou?
Old Money
Lisa\'s Substitute
Blood Feud
Mr. Lisa Goes to Washington
Bart the Murderer
Like Father, Like Clown
Saturdays of Thunder
Burns Verkaufen der Kraftwerk
Radio Bart
Bart the Lover
Separate Vocations
Colonel Homer'
simpsons <- read.csv(text = text, stringsAsFactors = FALSE)
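# Number of tokens that contain "homer"
simpsons %>%
  unnest_tokens(word, title) %>%
  summarize(count = sum(grepl("homer", word)))
# Rows of the original data whose title contains such a token
# (the row number is recorded before the titles are split into words)
simpsons %>%
  mutate(line = row_number()) %>%
  unnest_tokens(word, title) %>%
  filter(grepl("homer", word))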
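For readers who do not have tidytext installed, the same token-level idea can be approximated in base R. This is only a sketch under stated assumptions: `titles` is a small hypothetical vector taken from the data above, the names `token_table`, `homer_count` and `homer_lines` are made up for illustration, and words are split on whitespace only, which is cruder than what unnest_tokens() does.

```r
# Hypothetical subset of the episode titles shown above
titles <- c("Homer's Night Out", "Krusty Gets Busted",
            "Homer vs. Lisa and the 8th Commandment", "Colonel Homer")

# Lower-case every title and split it on runs of whitespace
tokens <- strsplit(tolower(titles), "\\s+")

# One row per token, remembering which title it came from
token_table <- data.frame(
  line = rep(seq_along(titles), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Flag tokens that contain "homer" (this also catches "homer's")
is_homer <- grepl("homer", token_table$word)

# Total number of matching tokens
homer_count <- sum(is_homer)

# Positions in `titles` of the titles that contain a matching token
homer_lines <- unique(token_table$line[is_homer])
```

Because the line number is stored next to each token, the count and the location come from the same table, which mirrors the two pipelines in the answer.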
Possible duplicate of [Selecting rows where a column has a string like 'hsa..' (partial string match)](http://stackoverflow.com/questions/13043928/selecting-rows-where-a-column-has-a-string-like-hsa-partial-string-match)
Don't you just want `sum(grepl('Homer', simpsons$title))`? – rawr
And for a count per string: `sapply(gregexpr("Homer", simpsons$title), function(x) sum(x > 0))`.
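The two base R suggestions in the comments answer slightly different questions, which a small sketch may make clearer. It assumes a hypothetical `titles` vector; the last entry is a made-up title, added only to show a string with two matches, and is not part of the Kaggle data.

```r
# Hypothetical titles; the last one is made up for illustration
titles <- c("Homer's Night Out", "Krusty Gets Busted",
            "Colonel Homer", "Homer and Homer")

# grepl() returns one TRUE/FALSE per title, so this counts the
# titles that contain "Homer" at least once
n_titles <- sum(grepl("Homer", titles))

# gregexpr() returns the start position of every match in a title,
# or -1 when there is none; sum(x > 0) turns that into a count
per_title <- sapply(gregexpr("Homer", titles), function(x) sum(x > 0))

# Total number of occurrences across all titles
n_occurrences <- sum(per_title)
```

Both patterns are case-sensitive substring matches, so they find "Homer" inside "Homer's" as well, but not a lower-case "homer".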