From a messy list of characters to a matrix in R

Time: 2021-03-03 15:41:23

I would really appreciate your help. I have a large vector of 2000 character strings of different lengths, which I retrieved from Web of Science. My dataset can be downloaded here.

Data structure and Outcome.

Each row of this vector has a different "length" but the same pattern. The characters within the "[]" determine the number of rows and the characters outside determine the columns. I will make an example with these three rows:

[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy
[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands
[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium

The first row has 2 groups in "[]", each with 5 columns; the second row has 3 groups, with 3, 4 and 5 columns; the third row has 3 groups, with 4, 4 and 5 columns.
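
The group and column counts can be checked directly on the three example rows. This is only a minimal base R sketch; the names `rows`, `groups`, `n_groups` and `n_columns` are made up for the illustration, and it assumes that "; [" occurs only between groups.

```r
# The three example rows from the question
rows <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands",
  "[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium"
)

# Split each row into its "[authors] affiliation" groups
groups <- strsplit(rows, "; [", fixed = TRUE)
n_groups <- lengths(groups)

# Drop everything up to "]" and count the comma-separated fields that remain
n_columns <- lapply(groups, function(g) {
  lengths(strsplit(sub("^.*\\]", "", g), ",", fixed = TRUE))
})

n_groups   # 2 3 3
n_columns  # 5 5 / 3 4 5 / 4 4 5
```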

The outcome will be a matrix like this:

ID  Author                 Info01           Info02                     Info03                          Info04                 Info05
1   Sorce, A.              Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Greco, A.              Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Magistri, L.           Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Costamagna, P.         Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DICCA  I-16145 Genoa          Italy
2   Allema, Bas            Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Hemerik, Lia           Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Rossing, Walter A. H.  Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Allema, Bas            Wageningen Univ  Entomol Lab                NL-6700 AP Wageningen           Netherlands            N/A
2   van Lenteren, Joop C.  Wageningen Univ  Entomol Lab                NL-6700 AP Wageningen           Netherlands            N/A
2   van der Werf, Wopke    Wageningen Univ  Ctr Crop Syst Anal         Crop & Weed Ecol Grp            NL-6700 AP Wageningen  Netherlands
3   Abdissa, Ketema        Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Tadesse, Mulualem      Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Bezabih, Mesele        Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Bekele, Alemayehu      Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Abebe, Gemeda          Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Apers, Ludwig          Inst Trop Med    Dept Clin Sci              B-2000 Antwerp                  Belgium                N/A
3   Rigouts, Leen          Inst Trop Med    Dept Microbiol             Mycobacteriol Unit              B-2000 Antwerp         Belgium

My Approach

Separate the strings and convert the vector into a list using this command:

library(stringr)  # str_split() comes from stringr
CL1 <- str_split(CL, "\\[|\\]", n = Inf)

This generates a list of vectors with characters like this:

[[1999]]
[1] ""                                                                                               
[2] "Zhuo, Hongying; Li, Qingzhong; Li, Wenzuo; Cheng, Jianbo"                                       
[3] " Yantai Univ, Sch Chem & Chem Engn, Lab Theoret & Computat Chem, Yantai 264005, Peoples R China"

[[2000]]
[1] ""                                                                                                        
[2] "Zuo, Li; Meng, Qing-Hong; Chung, Peter Chee-Keung"                                                       
[3] " Guiyang Med Coll, Dept Immunol, Guiyang 550004, Guizhou Provinc, Peoples R China; "                     
[4] "Yuan, Kai-Tao"                                                                                           
[5] " Sun Yat Sen Univ, Affiliated Hosp 1, Dept Surg, Guangzhou 510080, Guangdong, Peoples R China; "         
[6] "Yu, Li"                                                                                                  
[7] " Guangzhou First Municipal Peoples Hosp, Dept Paediat, Guangzhou 510180, Guangdong, Peoples R China; "   
[8] "Yang, Ding-Hua"                                                                                          
[9] " Southern Med Univ, Nan Fang Hosp, Dept Hepatobiliary Surg, Guangzhou 510515, Guangdong, Peoples R China"

As you can see the first element of each vector in the list is blank. Each "even" element of the vectors contains the "groups" and each "odd" element contains the columns of that group.
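
To make the even/odd layout concrete, here is a small sketch that picks the two kinds of elements out of a single list element. The vector `x` is copied from the printed output for row 2000 and cut down to two groups; `even`, `authors` and `affils` are names invented for this example, and it assumes every row has at least one group.

```r
# One element of CL1, taken from the output above (first two groups of row 2000)
x <- c("",
       "Zuo, Li; Meng, Qing-Hong; Chung, Peter Chee-Keung",
       " Guiyang Med Coll, Dept Immunol, Guiyang 550004, Guizhou Provinc, Peoples R China; ",
       "Yuan, Kai-Tao",
       " Sun Yat Sen Univ, Affiliated Hosp 1, Dept Surg, Guangzhou 510080, Guangdong, Peoples R China; ")

even    <- seq(2, length(x), by = 2)  # positions of the author groups
authors <- x[even]
affils  <- x[even + 1]                # the affiliation follows each group

# Tidy the affiliations: drop the trailing ";" and the surrounding spaces
affils <- trimws(sub(";\\s*$", "", affils))
```

Because `even` is computed from `length(x)`, the same three lines work whether a row has one group or fifty.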

The next step is to separate the groups so that I can assemble a matrix. For this I'm using these two commands:

CL2 <- lapply(CL1,function(x)x[2])

AF1 <- lapply(CL1,function(x)x[3])

Since in some cases I have more than 50 groups in the same row, I basically have to repeat this process in a loop, but I don't know how; for now I'm doing it manually. Another problem is that I don't know how to create an ID or how to merge the lists into a matrix.

Any ideas or suggestions will be welcome.

2 answers

#1


The following should do what you want to achieve:

A <- read.csv("AU.csv", stringsAsFactors = FALSE)

## One vector with all of the data in square brackets
A1 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]))
LA1 <- lengths(A1)

A1 <- gsub("\\[|\\]", "", unlist(A1))

## One vector with all of the other data
A2 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]), invert = TRUE)
LA2 <- lengths(A2) - 1

A2 <- unlist(lapply(A2, function(x) gsub("^\\s+|\\s+$|;\\s+$", "", x[-1])))

## Checking for mistakes....
all.equal(LA1, LA2)
# [1] TRUE
all.equal(sum(LA1), length(A1))
# [1] TRUE
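
If it helps to see what the two regmatches() calls return, here is a small self-contained sketch on two of the example rows from the question instead of AU.csv. The names `rows`, `inside`, `outside`, `id` and `result` are made up for the illustration, and the ID is simply the row position, which is an assumption about what the first column of the file holds.

```r
rows <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium"
)

m <- gregexpr("\\[.*?\\]", rows)
inside  <- regmatches(rows, m)                 # per row: the "[...]" author groups
outside <- regmatches(rows, m, invert = TRUE)  # per row: the text around the groups

# The row ID is repeated once per group found in that row
n_groups <- lengths(inside)
id <- rep(seq_along(rows), n_groups)

authors <- gsub("\\[|\\]", "", unlist(inside))
# x[-1] drops the empty piece in front of the first "["
affils <- unlist(lapply(outside, function(x) trimws(sub(";\\s*$", "", x[-1]))))

result <- data.frame(id, authors, affils, stringsAsFactors = FALSE)
```

Each row of `result` is one author group with its affiliation string, which is the shape the splitting step needs.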

Now that we have the vectors, we can use cSplit from my "splitstackshape" package to get the output you want:

library(splitstackshape)
library(data.table)  ## provides data.table(); attached explicitly to be safe
library(magrittr)

## Make a data.table of the two vectors and the ID column
DT <- data.table(ID = rep(A[[1]], LA1), A1, A2)

## Here's the splitting....
final <- DT %>% 
  cSplit("A1", ";", "long") %>%  ## The first column is split and made long
  cSplit("A2", ",")              ## The second column is split and made wide
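
For readers who want to see what "long" and "wide" mean here, the following is a rough base R equivalent of the two splitting steps on a tiny input. It is only a sketch and not how splitstackshape implements cSplit(); the data frame `df` and the names `auth`, `long`, `fields`, `wide` and `final` are invented for the example, with values taken from the question's sample rows.

```r
df <- data.frame(
  ID = c(1, 3),
  A1 = c("Sorce, A.; Greco, A.; Magistri, L.", "Apers, Ludwig"),
  A2 = c("Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy",
         "Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium"),
  stringsAsFactors = FALSE
)

# "Long" split: one output row per author
auth <- strsplit(df$A1, ";\\s*")
long <- df[rep(seq_len(nrow(df)), lengths(auth)), ]
long$A1 <- unlist(auth)

# "Wide" split: one column per affiliation field, padded with NA on the right
fields <- strsplit(long$A2, ",\\s*")
width  <- max(lengths(fields))
wide   <- do.call(rbind, lapply(fields, function(f) c(f, rep(NA, width - length(f)))))
colnames(wide) <- sprintf("A2_%02d", seq_len(width))

final <- data.frame(long[c("ID", "A1")], wide,
                    row.names = NULL, stringsAsFactors = FALSE)
```

The group with only four affiliation fields ends up with NA in the last column, which mirrors the NA padding in the printed result.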

Here's the result:

final
#          ID                      A1                                  A2_01                            A2_02
#     1:    1         Aalten, Pauline                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     2:    1 Ramakers, Inez H. G. B.                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     3:    1         Rozendaal, Nico                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     4:    1     Verhey, Frans R. J.                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     5:    1     Biessels, Geert Jan                   Univ Med Ctr Utrecht                      Dept Neurol
#    ---                                                                                                     
# 13949: 2000         Meng, Qing-Hong                       Guiyang Med Coll                     Dept Immunol
# 13950: 2000 Chung, Peter Chee-Keung                       Guiyang Med Coll                     Dept Immunol
# 13951: 2000           Yuan, Kai-Tao                       Sun Yat Sen Univ                Affiliated Hosp 1
# 13952: 2000                  Yu, Li Guangzhou First Municipal Peoples Hosp                     Dept Paediat
# 13953: 2000          Yang, Ding-Hua                      Southern Med Univ                    Nan Fang Hosp
#                          A2_03                 A2_04           A2_05           A2_06 A2_07 A2_08 A2_09 A2_10
#     1:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     2:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     3:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     4:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     5:                 Utrecht           Netherlands              NA              NA    NA    NA    NA    NA
#    ---                                                                                                      
# 13949:          Guiyang 550004       Guizhou Provinc Peoples R China              NA    NA    NA    NA    NA
# 13950:          Guiyang 550004       Guizhou Provinc Peoples R China              NA    NA    NA    NA    NA
# 13951:               Dept Surg      Guangzhou 510080       Guangdong Peoples R China    NA    NA    NA    NA
# 13952:        Guangzhou 510180             Guangdong Peoples R China              NA    NA    NA    NA    NA
# 13953: Dept Hepatobiliary Surg      Guangzhou 510515       Guangdong Peoples R China    NA    NA    NA    NA

#2


You can do this with a few regular-expression manipulations and use functions from plyr and foreach to process everything. Here is an example for the first row:

library(foreach)
library(plyr)
str1 = '[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy'

##split the string into different parts
s1 = strsplit(str1,'; \\[')
s1. = llply(s1,strsplit,split = ']')[[1]]

##get list of authors
auths = llply(s1.,function(x) gsub('^ ','',strsplit(gsub('\\[','',x[1]),';')[[1]]))
##get all other attributes
other.stuff = llply(s1.,function(x) gsub('^ ','',strsplit(x[2],',')[[1]]))

results = foreach(auth = auths, other = other.stuff, .combine = 'rbind') %do%
 expand.grid(auth,other[1],other[2],other[3],other[4],other[5])

The output's column names need to be changed, and you need to iterate this for each line, but that should be easy.
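
One possible way to do that iteration is sketched below. To keep it self-contained it uses base R rather than plyr and foreach, so it is a variation on the code above and not the same implementation. `parse_row()` and its `n_info` argument are hypothetical names; `n_info = 5` matches the question's example, and any fields beyond that are dropped, so it would need to be raised for rows with more affiliation fields.

```r
# Turn one row into a character matrix: ID, Author, Info01..Info<n_info>
parse_row <- function(str, id, n_info = 5) {
  groups <- strsplit(str, "; [", fixed = TRUE)[[1]]
  rows <- lapply(groups, function(g) {
    parts   <- strsplit(sub("^\\[", "", g), "]", fixed = TRUE)[[1]]
    authors <- trimws(strsplit(parts[1], ";", fixed = TRUE)[[1]])
    info    <- trimws(strsplit(parts[2], ",", fixed = TRUE)[[1]])
    info    <- c(info, rep(NA, n_info))[seq_len(n_info)]  # pad or cut to n_info
    cbind(id, authors,
          matrix(info, nrow = length(authors), ncol = n_info, byrow = TRUE))
  })
  out <- do.call(rbind, rows)
  colnames(out) <- c("ID", "Author", sprintf("Info%02d", seq_len(n_info)))
  out
}

# Two of the example rows from the question
rows <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium"
)

# Apply it to every row, using the row position as the ID, and stack the pieces
mat <- do.call(rbind, lapply(seq_along(rows), function(i) parse_row(rows[i], i)))
```

The result is a character matrix with one line per author, and groups with fewer than `n_info` fields are padded with NA.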
