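Reading a CSV with both paired and unpaired quotation marks
Problem description:
I have a CSV file, generated from MS SQL Server, that I want to read into R. It contains data similar to the following: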
# reproduce file
possibilities <- c('this is good','"this has, a comma"','here is a " quotation','')
newstrings <- expand.grid(possibilities,possibilities,possibilities,stringsAsFactors = F)
xwrite <- apply(newstrings,1,paste,collapse = ",")
xwrite <- c('v1,v2,v3',xwrite)
writeLines(xwrite,con = 'test.csv')
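I would normally open this in Excel, which magically reads it and writes it back out in a cleaner format for R, but this file exceeds Excel's row limit. If I can't figure this out, I will have to go back and export the data in another format. I have tried many variations of what I've read.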
# a few things I've tried
(rl <- readLines('test.csv'))
read.csv('test.csv',header = T,quote = "",stringsAsFactors = F)
read.csv('test.csv',header = F,quote = "",stringsAsFactors = F,skip = 1)
read.csv('test.csv',header = T,stringsAsFactors = F)
read.csv('test.csv',header = F,stringsAsFactors = F,skip = 1)
read.table('test.csv',header = F)
read.table('test.csv',header = F,quote = "\"")
read.table('test.csv',header = T,sep = ",")
scan('test.csv',what = 'character')
scan('test.csv',what = 'character',sep = ",")
scan('test.csv',what = 'character',sep = ",",quote = "")
scan('test.csv',what = 'character',sep = ",",quote = "\"")
unlist(strsplit(rl,split = ','))
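The following seems to work for the data I actually have, but I'm not comfortable reusing it, because it does not handle the sixth row, which illustrates data that could occur in another file.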
# works if only comma OR unpaired quotation but not both
rl[grep('^[^\"]*\"[^\"]*$',rl)] <- sub('^([^\"]*)(\")([^\"]*)$','\\1\\3',rl[grep('^[^\"]*\"[^\"]*$',rl)])
writeLines(rl,'testfixed.csv')
read.csv('testfixed.csv')
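To make that limitation concrete, here is a minimal sketch on two made-up lines (the strings are illustrative and not taken from the real file). The pattern only matches lines containing exactly one quotation mark, so a line that mixes a properly quoted field with a stray quotation mark is left untouched.

```r
# Two illustrative lines: one with a single stray quote,
# one with a quoted field AND a stray quote (three quotes in total).
x <- c('here is a " quotation,plain',
       '"this has, a comma",here is a " quotation')

# TRUE only for lines that contain exactly one quotation mark
one_quote <- grepl('^[^"]*"[^"]*$', x)

# Remove the stray quote from those lines only
x[one_quote] <- sub('^([^"]*)"([^"]*)$', '\\1\\2', x[one_quote])

print(one_quote)
print(x)
```

The first line is cleaned, while the second keeps its stray quotation mark and would still confuse read.csv.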
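I found a similar problem, but my issue is that the quotation marks appear on their own inside the data, with no consistent format.
Is it possible to get a correct data.frame out of this?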
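Answer
I don't think there is a direct way to do this. Here I basically split each line on commas with strsplit. But first I handle the special separators ,\" and \", (a comma next to a quotation mark).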
lines <- readLines('test.csv')
## separate the quotation case
lines_spe <- strsplit(lines,',\"|\",')
nn <- sapply(lines_spe,length)==1
## the normal case
lines[nn] <- strsplit(lines[nn],',',perl=TRUE)
## aggregate the results
lines[!nn] <- lines_spe[!nn]
## bind to create a data.frame
dat <-
setNames(as.data.frame(do.call(rbind,lines[-1]),stringsAsFactors =F),
lines[[1]])
## treat the special case of strsplit('some text without second part,',',')
dat[dat$v1==dat$v2,"v2"] <- ""
dat
# v1 v2
# 1 this is good this is fine
# 2 this has no commas this has, a comma"
# 3 this has no quotations this has a " quotation
# 4 this field has something
# 5 now the other side does
# 6 "this has, a comma this has a " quotation
# 7 and a final line that should be fine
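The result is nearly good, except for lines that have no second part, where strsplit does not return the trailing empty string. In your data this happens with: "this field has something,". Here is an example that explains the problem: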
> strsplit('aaa,',',')
[[1]]
[1] "aaa"
> strsplit(',aaa',',')
[[1]]
[1] "" "aaa"
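One possible workaround, sketched here under the assumption that every line should keep its trailing empty field, is to append one extra separator before splitting. The helper name split_keep_trailing is hypothetical and not part of the answer's code; it relies on the documented behavior that strsplit drops only the empty string after the final match.

```r
# Hypothetical helper (not from the original answer):
# split on a separator but keep a trailing empty field.
split_keep_trailing <- function(x, sep = ",") {
  # strsplit() discards the empty string after a final separator,
  # so add one more separator; only that last one is swallowed.
  strsplit(paste0(x, sep), sep, fixed = TRUE)
}

print(split_keep_trailing("aaa,"))  # two fields: "aaa" and ""
print(split_keep_trailing(",aaa"))  # two fields: "" and "aaa"
```

With this, the manual fix-up of rows where v1 equals v2 may no longer be needed, although that has not been checked against the full file.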
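Answer
This is closer and may do. It will fail if a lone quotation mark sits next to a comma, because I assume that quotation marks next to a comma are the start or end of a string that actually needs quoting.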
rl <- readLines('test.csv')
rl <- gsub('([^,])(\")([^,])','\\1\\3',rl,perl = T)
writeLines(rl,'testfixed.csv')
read.csv('testfixed.csv')
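As a quick check of that idea without touching the file system, here is a small sketch on made-up lines, using read.csv(text = ...) instead of writing testfixed.csv. The sample strings are assumptions for illustration only.

```r
# Illustrative lines: a header, a properly quoted field,
# and a stray quote in the middle of a field.
x <- c('v1,v2',
       'this is good,"this has, a comma"',
       'here is a " quotation,')

# Drop quotation marks that have a non-comma character on both sides
pattern <- '([^,])(\")([^,])'
fixed <- gsub(pattern, '\\1\\3', x, perl = TRUE)

df <- read.csv(text = fixed, stringsAsFactors = FALSE)
print(df)

# Known failure mode: a lone quote directly after a comma is kept,
# because it looks like the start of a quoted field.
lone <- 'plain," starts with a stray quote'
print(gsub(pattern, '\\1\\3', lone, perl = TRUE))
```

Quotation marks at the very start or end of a line are also kept, since the pattern needs a character on each side of the quotation mark.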