Extracting the lines before and after a keyword using R

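Question:

Based on a keyword search over the full text, I want to extract the lines before and after each match using R.

Specifically, I have PDF articles converted to text files, and I want to extract the lines or paragraphs before and after any line containing the word "cancer", so that I can collect the cancer-related information.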

abstracts <- lapply(mytxtfiles, function(i) { 
j <- paste0(scan(i, what = character()), collapse = " ") 
regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\R+){4}[cancer][^\\r\\n]*\\R+(^[^\\r\\n]*\\R+){4}", j, perl=TRUE))})

The regular expression above does not work.

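'[cancer]' != 'cancer'! The first is a character class, the second is a literal string. – Jan

If you use '\R', you must use 'perl = TRUE'. –

Replace every '[^\r\n]*' with '.*', and '[cancer][^\\r\\n]*' with '.*cancer.*'. See ['(?m)(^.*\R+){4}.*cancer.*(\R+.*){4}'](https://regex101.com/r/Hbr9ep/1). If there are not enough lines around a match, replace '{4}' with '{0,4}'. –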
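The suggestions in these comments can be put together roughly as follows. This is only a sketch under a few assumptions: the sample text is made up, and the lines are joined with "\n". The question's code collapses everything with spaces, which leaves no line breaks for '\R' to match. For a real file, `lines` would come from something like `readLines(path)`.

```r
# Made-up sample standing in for one converted PDF (an assumption).
lines <- c(paste("line", 1:5),
           "a line about cancer drugs",
           paste("line", 7:12))

# Keep the line breaks: the regex relies on them to count lines.
txt <- paste(lines, collapse = "\n")

# Up to 4 lines before, the line containing the literal word "cancer",
# and up to 4 lines after. \R needs perl = TRUE.
pattern <- "(?m)(^.*\\R+){0,4}.*cancer.*(\\R+.*){0,4}"
hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Each hit is one block of text; split it back into lines to inspect it.
context <- strsplit(hits, "\n", fixed = TRUE)
```

With '{0,4}' instead of '{4}', a match near the start or end of the file still returns whatever context is available.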

Answer

Here is one approach:

library(textreadr) 
library(tidyverse) 

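# Return the indices of the elements of `var` that match `regex`,
# together with the `n` elements before and after each match.
loc <- function(var, regex, n = 1, ignore.case = TRUE){ 
    locs <- grep(regex, var, ignore.case = ignore.case) 
    out <- sort(unique(rep(locs, each = 2 * n + 1) + (-n:n))) 
    out <- out[out > 0]          # drop indices before the first element
    out[out <= length(var)]      # drop indices past the last element
} 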

doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>% 
    read_pdf() %>% 
    slice(loc(text, 'cancer')) 

doc 

##    page_id element_id text
##  1      24         28 Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
##  2      24         29 partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
##  3      24         30 stresses that, in order for them to work, they should be voluntary, and the government
##  4      25          8 the availability of medicines to treat life-threatening diseases. It notes, for example, that
##  5      25          9 while an average estimate of the value of drugs to treat the country's cancer patients is
##  6      25         10 $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
##  7      25         12 because of the high cost of these medicines,” says the Policy, which also calls for tax and
##  8      25         13 excise exemptions for anti-cancer drugs.
##  9      25         14 Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10      32         19 Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11      32         20 anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12      32         21 December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1

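Thank you for showing me a different approach. Can we do this for multiple PDFs stored in a specific location? Also, using this I was able to extract the lines containing the word cancer, but not the lines before and after them. How can I extract the lines before and after the line containing the word 'cancer'? –

Yes, you can do this for multiple PDFs. See the 'read_dir' function. I am already showing the lines before and after in the output above, so I do not know what you mean by the lines before and after. For example, line 29 contains the word cancer, and I also included lines 28 and 30. –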
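For several files, the same idea can also be sketched in base R without textreadr, assuming the PDFs have already been converted to text files as in the question. The helper name `context_lines` and the temporary files below are hypothetical and exist only to make the example self-contained.

```r
# Hypothetical helper: lines matching `regex` plus `n` lines on either side.
context_lines <- function(x, regex, n = 1, ignore.case = TRUE) {
  hits <- grep(regex, x, ignore.case = ignore.case)
  keep <- rep(FALSE, length(x))
  for (h in hits) {
    # Clamp the window so it never runs past the start or end of the file.
    keep[max(1, h - n):min(length(x), h + n)] <- TRUE
  }
  x[keep]
}

# Made-up stand-in for a folder of converted PDFs: two small text files.
txt_dir <- tempfile("txt_")
dir.create(txt_dir)
writeLines(c("intro", "about cancer", "outro", "unrelated"),
           file.path(txt_dir, "a.txt"))
writeLines(c("nothing here", "still nothing"),
           file.path(txt_dir, "b.txt"))

# Apply the helper to every text file in the folder.
files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)
results <- lapply(files, function(f) context_lines(readLines(f), "cancer", n = 1))
names(results) <- basename(files)
```

Files without a match simply yield an empty character vector, so they can be filtered out afterwards with something like `results[lengths(results) > 0]`.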

Can we split the text into sentences? I am thinking of treating each complete sentence as one line. –
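One rough way to work with sentences instead of lines is to join the lines back into a single string and split after sentence-ending punctuation. This is only a sketch: the sample text is made up, and the split rule (punctuation followed by whitespace) is a simplifying assumption that will split wrongly after abbreviations such as "Dr.".

```r
# Made-up lines, wrapped mid-sentence the way PDF text often is.
lines <- c("PPPs are proposed to tackle diseases such as",
           "cancer and HIV/AIDS. They should be voluntary. The",
           "government should not interfere.")

# Undo the line wrapping, then split after ., ! or ? followed by whitespace.
txt <- paste(lines, collapse = " ")
sentences <- strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Sentences containing the keyword, plus one sentence of context on each side.
hits <- grep("cancer", sentences, ignore.case = TRUE)
idx <- sort(unique(c(hits - 1, hits, hits + 1)))
idx <- idx[idx >= 1 & idx <= length(sentences)]
sentences[idx]
```

Because the split requires whitespace after the punctuation, figures such as "$1.11" stay intact.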
