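Extract the lines before and after a keyword in a PDF, using R
I want to extract information related to the keyword "cancer" from a set of PDF articles that have been converted to text files. Specifically, I want to extract the lines (or the paragraph) before and after each line that contains the word "cancer".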
abstracts <- lapply(mytxtfiles, function(i) {
    j <- paste0(scan(i, what = character()), collapse = " ")
    regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\R+){4}[cancer][^\\r\\n]*\\R+(^[^\\r\\n]*\\R+){4}", j, perl = TRUE))
})
The regular expression above does not work.
Here is one approach:
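library(textreadr)
library(tidyverse)

# Return the indices of the elements of `var` that match `regex`, together
# with the `n` elements before and after each match, clipped to valid indices.
loc <- function(var, regex, n = 1, ignore.case = TRUE){
    locs <- grep(regex, var, ignore.case = ignore.case)
    # Expand every match position into the window (match - n):(match + n)
    out <- sort(unique(as.vector(outer(locs, -n:n, "+"))))
    out <- out[out > 0]
    out[out <= length(var)]
}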
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
read_pdf() %>%
slice(loc(text, 'cancer'))
doc
## page_id element_id text
## 1 24 28 Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2 24 29 partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3 24 30 stresses that, in order for them to work, they should be voluntary, and the government
## 4 25 8 the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5 25 9 while an average estimate of the value of drugs to treat the country's cancer patients is
## 6 25 10 $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7 25 12 because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8 25 13 excise exemptions for anti-cancer drugs.
## 9 25 14 Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10 32 19 Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11 32 20 anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12 32 21 December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
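Thanks for giving me a different approach. Can we do this for multiple PDFs stored in a particular location? Also, using this I am able to extract the lines containing the string "cancer", but not the lines before and after them. How do I extract the lines before and after a line containing the word 'cancer'?
Yes, you can do this for multiple PDFs. See the 'read_dir' function. I have already shown the lines before and after above, so I don't know what you mean by "before and after lines". For example, line 29 contains the word cancer, and I have included lines 28 and 30 as well.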
Can we split it by sentence instead? I am thinking of treating one complete sentence as a line.
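A minimal sketch of the sentence-wise idea, assuming the extracted lines are first joined back into one string and that a sentence ends at '.', '!' or '?' followed by whitespace. The sample lines are made up for illustration, and this simple split will break on abbreviations such as "e.g."; it uses base R only and is not part of textreadr.

```r
# Hypothetical sample: three extracted lines whose sentences span line breaks
lines <- c("The proposal covers several diseases. It targets",
           "cancer and other conditions. Participation should be",
           "voluntary. Costs remain high.")

# Join the lines, then split after sentence-ending punctuation
text <- paste(lines, collapse = " ")
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

# Keep each matching sentence plus n sentences on either side
n <- 1
hits <- grep("cancer", sentences, ignore.case = TRUE)
idx <- sort(unique(as.vector(outer(hits, -n:n, "+"))))
idx <- idx[idx > 0 & idx <= length(sentences)]
context <- sentences[idx]
print(context)
```

Here 'context' holds the sentence containing "cancer" together with its immediate neighbours; raising 'n' widens the window.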
'[cancer]' != 'cancer'! The first is a character class, the latter is a literal string. – Jan
If you use '\R', you must set 'perl = TRUE'.
Replace every '[^\r\n]*' with '.*', and '[cancer][^\\r\\n]*' with '.*cancer.*'. See ['(?m)(^.*\R+){4}.*cancer.*(\R+.*){4}'](https://regex101.com/r/Hbr9ep/1). If there are not enough lines, replace '{4}' with '{0,4}'.
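The regex fix from the comments can be sketched as follows. This is an illustration on a made-up five-line string, with the window reduced to '{0,1}' so the result is easy to read; inside an R string each backslash is doubled. Note also that the question's code joins the output of scan() with spaces, which discards the line breaks the pattern relies on, so this sketch joins lines with "\n" instead (for a real file, readLines() would be one way to get such lines).

```r
# Hypothetical sample text: five lines joined with newlines
txt <- paste(c("line one",
               "line two",
               "a cancer study",
               "line four",
               "line five"), collapse = "\n")

# Up to one line before and after the line containing the literal word
# "cancer"; \R needs perl = TRUE
pattern <- "(?m)(^.*\\R+){0,1}.*cancer.*(\\R+.*){0,1}"

hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
cat(hits, sep = "\n")
```

With '{0,4}' in place of '{0,1}', the same pattern would capture up to four lines on each side of the match.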