How do I read a CSV file that contains multiple data sets?

Time: 2021-03-01 15:56:59

I did some research on this, but only found information on reading in multiple CSV files.

I'm trying to create a widget where I can read in a CSV file with data sets and print as many graphs as there are data sets.

I'm trying to brainstorm a way to read in a single CSV in which multiple data sets are stacked vertically. However, I won't know the length of each data set or how many data sets there will be.

Any ideas or concepts to consider would be appreciated.

2 solutions

#1


# Create sample data

unlink("so-data.csv") # remove it if it exists

set.seed(1492) # reproducible

# make 3 data frames of different lengths
frames <- lapply(c(3, 10, 5), function(n) {
  data.frame(X = runif(n), Y1 = runif(n), Y2= runif(n))
}) 

# write them to single file preserving the header
suppressWarnings(
  invisible(
    lapply(frames, write.table, file="so-data.csv", sep=",",
           append=TRUE, row.names=FALSE)
  )
)

That file looks like:

"X","Y1","Y2"
0.277646409813315,0.110495456494391,0.852662623859942
0.21606229362078,0.0521760624833405,0.510357670951635
0.184417578391731,0.00824321852996945,0.390395383816212
"X","Y1","Y2"
0.769067857181653,0.916519832098857,0.971386880846694
0.6415081594605,0.63678711745888,0.148033464793116
0.638599780155346,0.381162445060909,0.989824152784422
0.194932354846969,0.132614633999765,0.845784503268078
0.522090089507401,0.599085820373148,0.218151196138933
0.521618122234941,0.0903550288639963,0.983936473494396
0.792095972690731,0.932019826257601,0.703315682942048
0.12338977586478,0.584303047973663,0.421113619813696
0.343668724410236,0.561827397439629,0.111441049026325
0.660837838426232,0.345943035557866,0.0270762923173606
"X","Y1","Y2"
0.309987690066919,0.441982284653932,0.133840701542795
0.747786369873211,0.240106994053349,0.62044994905591
0.789473889162764,0.853503877297044,0.150850139558315
0.165826949058101,0.119402598123997,0.318282842403278
0.39083837531507,0.109747459646314,0.876092307968065

Now you can do:

# read in the data as lines

l <- readLines("so-data.csv")

# figure out where the individual data sets are

starts <- grep("X", l, fixed = TRUE)    # header lines mark the start of each data set
ends <- c(starts[-1] - 1, length(l))    # also correct when there is only one data set

# read them in

new_frames <- mapply(function(start, end) {
  read.csv(text=paste0(l[start:end], collapse="\n"), header=TRUE)
}, starts, ends, SIMPLIFY=FALSE)

str(new_frames)
## List of 3
##  $ :'data.frame':    3 obs. of  3 variables:
##   ..$ X : num [1:3] 0.278 0.216 0.184
##   ..$ Y1: num [1:3] 0.1105 0.05218 0.00824
##   ..$ Y2: num [1:3] 0.853 0.51 0.39
##  $ :'data.frame':    10 obs. of  3 variables:
##   ..$ X : num [1:10] 0.769 0.642 0.639 0.195 0.522 ...
##   ..$ Y1: num [1:10] 0.917 0.637 0.381 0.133 0.599 ...
##   ..$ Y2: num [1:10] 0.971 0.148 0.99 0.846 0.218 ...
##  $ :'data.frame':    5 obs. of  3 variables:
##   ..$ X : num [1:5] 0.31 0.748 0.789 0.166 0.391
##   ..$ Y1: num [1:5] 0.442 0.24 0.854 0.119 0.11
##   ..$ Y2: num [1:5] 0.134 0.62 0.151 0.318 0.876
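
The steps above can be wrapped in a small reusable function. The following is only a sketch under a few assumptions: the helper name `read_stacked_csv` is made up, every data set is assumed to repeat exactly the same header line as the first one, and blank lines are assumed to be ignorable. Instead of searching for "X", it treats every line identical to the first line as a header, which is a slightly different technique from the answer's.

```r
# Hypothetical helper: split a stacked CSV wherever the first (header) line repeats.
read_stacked_csv <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]       # drop blank lines
  starts <- which(lines == lines[1])          # each repeat of the header starts a data set
  ends <- c(starts[-1] - 1, length(lines))    # works with one data set or many
  Map(function(s, e) {
    read.csv(text = paste(lines[s:e], collapse = "\n"), header = TRUE)
  }, starts, ends)
}

# Usage on a tiny made-up file with two data sets of different lengths.
tmp <- tempfile(fileext = ".csv")
writeLines(c("X,Y", "1,2", "3,4", "X,Y", "5,6"), tmp)
frames <- read_stacked_csv(tmp)
length(frames)         # number of data sets found
sapply(frames, nrow)   # rows in each data set
```

Matching the whole header line avoids false hits when a data value happens to contain the letter "X", at the cost of requiring the headers to be byte-for-byte identical.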

#2


As @Oriol Mirosa mentioned in the comments, this is one way you can do it. You can first read the whole file:

df = read.csv("path", header = TRUE)

Assume the whole CSV file, once read in, is structured like this:

df = data.frame(X=c(1:10, "X", 1:20, "X", 1:30),
                Y=c(1:10, "Y", 1:20, "Y", 1:30),
                Z=c(1:10, "Z", 1:20, "Z", 1:30))

df$newset = ifelse(df$X == "X", 1, 0)
df$newset = as.factor(cumsum(df$newset))

dfs = split(df, df$newset)
dfs[-1] = lapply(dfs[-1], function(x) x[-1,-ncol(x)])
dfs[[1]] = dfs[[1]][,-ncol(dfs[[1]])]

I created a binary variable newset indicating whether a row is a repeated "header" row. cumsum then turns it into a running count, so every row of a given data set gets the same number. I then split() on newset to create a list with one data set per element. Finally, I removed the first row (the repeated header) from every data set after the first and dropped the newset column; the column names from the first header carry over. This works regardless of the length of each data set.
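
As a side illustration of the grouping step, here is the same idea on a tiny made-up data frame (not the df above). It also adds one step the answer does not cover: because the repeated header strings are read in as data, every column comes back as character (or factor in R before 4.0), so this sketch uses `type.convert()` to restore numeric columns. The variable names `toy`, `is_header`, `group`, and `parts` are chosen only for this example.

```r
# Toy stacked data: the row ("X", "Y") is a repeated header in the middle.
toy <- data.frame(X = c("1", "2", "X", "3"),
                  Y = c("10", "20", "Y", "30"),
                  stringsAsFactors = FALSE)

is_header <- toy$X == "X"     # TRUE on repeated header rows
group <- cumsum(is_header)    # running count of headers seen so far: one id per data set

# Drop the header rows, then split the remaining rows by data set id.
parts <- split(toy[!is_header, ], group[!is_header])

# Convert each character column back to its natural type.
parts <- lapply(parts, function(d) {
  d[] <- lapply(d, type.convert, as.is = TRUE)
  d
})
str(parts)
```

Filtering out the header rows before splitting means the first data set needs no special handling.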

Result (printing dfs):

# $`0`
#     X  Y  Z
# 1   1  1  1
# 2   2  2  2
# 3   3  3  3
# 4   4  4  4
# 5   5  5  5
# 6   6  6  6
# 7   7  7  7
# 8   8  8  8
# 9   9  9  9
# 10 10 10 10
# 
# $`1`
#     X  Y  Z
# 12  1  1  1
# 13  2  2  2
# 14  3  3  3
# 15  4  4  4
# 16  5  5  5
# 17  6  6  6
# 18  7  7  7
# 19  8  8  8
# 20  9  9  9
# 21 10 10 10
# 22 11 11 11
# 23 12 12 12
# 24 13 13 13
# 25 14 14 14
# 26 15 15 15
# 27 16 16 16
# 28 17 17 17
# 29 18 18 18
# 30 19 19 19
# 31 20 20 20
# 
# $`2`
#     X  Y  Z
# 33  1  1  1
# 34  2  2  2
# 35  3  3  3
# 36  4  4  4
# 37  5  5  5
# 38  6  6  6
# 39  7  7  7
# 40  8  8  8
# 41  9  9  9
# 42 10 10 10
# 43 11 11 11
# 44 12 12 12
# 45 13 13 13
# 46 14 14 14
# 47 15 15 15
# 48 16 16 16
# 49 17 17 17
# 50 18 18 18
# 51 19 19 19
# 52 20 20 20
# 53 21 21 21
# 54 22 22 22
# 55 23 23 23
# 56 24 24 24
# 57 25 25 25
# 58 26 26 26
# 59 27 27 27
# 60 28 28 28
# 61 29 29 29
# 62 30 30 30
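
Coming back to the original goal of printing one graph per data set: once the data sets are in a list, a loop over the list produces as many plots as there are elements. This is a minimal sketch, assuming columns named X and Y1 as in the first answer. The toy data, the helper name `plot_each`, and the temporary PDF output are made up for illustration; a widget framework would render the plots in its own way.

```r
# Hypothetical helper: draw one plot per data frame, one PDF page each.
plot_each <- function(frames, out) {
  pdf(out)                        # open a PDF device; each plot() starts a new page
  on.exit(dev.off())              # always close the device, even on error
  for (i in seq_along(frames)) {
    d <- frames[[i]]
    plot(d$X, d$Y1, type = "b", xlab = "X", ylab = "Y1",
         main = paste("Data set", i))
  }
  invisible(length(frames))       # number of plots drawn
}

# Toy list standing in for new_frames or dfs.
frames <- list(
  data.frame(X = 1:3, Y1 = c(2, 4, 8)),
  data.frame(X = 1:5, Y1 = c(5, 3, 4, 1, 2))
)
out <- tempfile(fileext = ".pdf")
n_plots <- plot_each(frames, out)
```

Because the loop runs over `seq_along(frames)`, it needs no advance knowledge of how many data sets the file contained.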
