I would really appreciate your help. I have a large character vector of 2000 strings of varying length, which I retrieved from Web of Science. My dataset can be downloaded here.
Data structure and Outcome.
Each element ("row") of this vector has a different length but follows the same pattern. The names inside each "[]" determine the number of output rows, and the comma-separated fields outside it determine the columns. Here is an example with three rows:
[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy
[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands
[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium
The first row has 2 groups in "[]", each with 5 columns; the second row has 3 groups, with 3, 4 and 5 columns; the third row has 3 groups, with 4, 4 and 5 columns.
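The group and column counts can be checked directly rather than by eye. The following is a minimal base R sketch, assuming the three example rows are stored in a vector called `rows` (a name introduced here for illustration): it splits each row on the bracketed parts and counts what is left over.

```r
# The three example rows from the question (the name `rows` is an assumption).
rows <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands",
  "[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium"
)

# Split on every "[...]" block. The piece before the first "[" is an empty
# string, so it is dropped; what remains is one piece of text per group.
outside <- lapply(strsplit(rows, "\\[.*?\\]"), function(x) x[-1])

# Number of groups per row.
n_groups <- lengths(outside)

# Number of comma-separated columns in each group, per row.
n_cols <- lapply(outside, function(x) lengths(strsplit(x, ",")))

print(n_groups)
print(n_cols)
```

Printing `n_groups` and `n_cols` gives the number of groups in each row and the number of columns in each group, which is useful for deciding how many `Info` columns the final matrix needs.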
The outcome should be a matrix like this:
ID | Author | Info01 | Info02 | Info03 | Info04 | Info05
1 | Sorce, A. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Greco, A. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Magistri, L. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Costamagna, P. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DICCA | I-16145 Genoa | Italy
2 | Allema, Bas | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Hemerik, Lia | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Rossing, Walter A. H. | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Allema, Bas | Wageningen Univ | Entomol Lab | NL-6700 AP Wageningen | Netherlands | N/A
2 | van Lenteren, Joop C. | Wageningen Univ | Entomol Lab | NL-6700 AP Wageningen | Netherlands | N/A
2 | van der Werf, Wopke | Wageningen Univ | Ctr Crop Syst Anal | Crop & Weed Ecol Grp | NL-6700 AP Wageningen | Netherlands
3 | Abdissa, Ketema | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Tadesse, Mulualem | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Bezabih, Mesele | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Bekele, Alemayehu | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Abebe, Gemeda | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Apers, Ludwig | Inst Trop Med | Dept Clin Sci | B-2000 Antwerp | Belgium | N/A
3 | Rigouts, Leen | Inst Trop Med | Dept Microbiol | Mycobacteriol Unit | B-2000 Antwerp | Belgium
My Approach
Separate the strings and convert the vector into a list using this command:
library(stringr)  # provides str_split()
CL1 <- str_split(CL, "\\[|\\]", n = Inf)
This produces a list of character vectors like this:
[[1999]]
[1] ""
[2] "Zhuo, Hongying; Li, Qingzhong; Li, Wenzuo; Cheng, Jianbo"
[3] " Yantai Univ, Sch Chem & Chem Engn, Lab Theoret & Computat Chem, Yantai 264005, Peoples R China"
[[2000]]
[1] ""
[2] "Zuo, Li; Meng, Qing-Hong; Chung, Peter Chee-Keung"
[3] " Guiyang Med Coll, Dept Immunol, Guiyang 550004, Guizhou Provinc, Peoples R China; "
[4] "Yuan, Kai-Tao"
[5] " Sun Yat Sen Univ, Affiliated Hosp 1, Dept Surg, Guangzhou 510080, Guangdong, Peoples R China; "
[6] "Yu, Li"
[7] " Guangzhou First Municipal Peoples Hosp, Dept Paediat, Guangzhou 510180, Guangdong, Peoples R China; "
[8] "Yang, Ding-Hua"
[9] " Southern Med Univ, Nan Fang Hosp, Dept Hepatobiliary Surg, Guangzhou 510515, Guangdong, Peoples R China"
As you can see, the first element of each vector in the list is blank. Every even-numbered element holds a "group" of authors, and the odd-numbered element that follows it holds the columns for that group.
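The even/odd layout can be used to pair each author group with its columns by index. This is a minimal sketch for a single row, using base R `strsplit()` as a stand-in for `str_split()`; the variable names (`x`, `parts`, `authors`, `affil`) are introduced here for illustration only.

```r
# First example row from the question.
x <- "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy"

# Same split as in the question: element 1 is "", then groups and columns alternate.
parts <- strsplit(x, "\\[|\\]")[[1]]

# Even positions hold the author groups ...
idx <- seq(2, length(parts), by = 2)
authors <- parts[idx]

# ... and the element right after each one holds that group's columns.
# Strip surrounding whitespace and the trailing ";" that separates groups.
affil <- sub(";\\s*$", "", trimws(parts[idx + 1]))

print(authors)
print(affil)
```

Note that `seq(2, length(parts), by = 2)` assumes the row contains at least one "[...]" group; a row without brackets would need to be handled separately.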
The next step is to separate the groups so that I can assemble a matrix. For this I'm using these two commands:
CL2 <- lapply(CL1,function(x)x[2])
AF1 <- lapply(CL1,function(x)x[3])
In some cases I have more than 50 groups in the same row, so I would have to repeat this process in a loop, but I don't know how; for now I'm doing it manually. Another problem is that I don't know how to create an ID or how to merge the lists into a matrix.
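One possible way to put the three missing pieces together (the loop over groups, the ID, and the merge into one table) is sketched below in base R. It is not the method used in the answers that follow; it assumes the input vector is called `CL` as in the question, uses the element's position in `CL` as the ID, and pads shorter groups with `NA`. The names `records`, `n_info`, `info_mat` and `result` are hypothetical.

```r
# Three example rows standing in for the full 2000-element vector CL.
CL <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands",
  "[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium"
)

# Collect one record per author: the row ID, the author, and the group's columns.
records <- list()
for (id in seq_along(CL)) {
  parts <- strsplit(CL[id], "\\[|\\]")[[1]]
  for (i in seq(2, length(parts), by = 2)) {   # even positions = author groups
    authors <- trimws(strsplit(parts[i], ";")[[1]])
    info <- trimws(strsplit(sub(";\\s*$", "", parts[i + 1]), ",")[[1]])
    for (a in authors) {
      records[[length(records) + 1]] <- list(ID = id, Author = a, info = info)
    }
  }
}

# The widest group decides how many Info columns are needed.
n_info <- max(vapply(records, function(r) length(r$info), integer(1)))

# Indexing past the end of a vector yields NA, which pads the shorter groups.
info_mat <- t(vapply(records, function(r) r$info[seq_len(n_info)],
                     character(n_info)))
colnames(info_mat) <- sprintf("Info%02d", seq_len(n_info))

result <- data.frame(
  ID = vapply(records, function(r) r$ID, integer(1)),
  Author = vapply(records, function(r) r$Author, character(1)),
  info_mat,
  stringsAsFactors = FALSE
)
print(result)
```

Growing a list inside a loop is not the fastest approach, but for a few thousand rows it should be acceptable, and it keeps each step easy to inspect.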
Any ideas or suggestions will be welcome.
2 Solutions
#1
The following should do what you want to achieve:
A <- read.csv("AU.csv", stringsAsFactors = FALSE)
## One vector with all of the data in square brackets
A1 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]))
LA1 <- lengths(A1)
A1 <- gsub("\\[|\\]", "", unlist(A1))
## One vector with all of the other data
A2 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]), invert = TRUE)
LA2 <- lengths(A2) - 1
A2 <- unlist(lapply(A2, function(x) gsub("^\\s+|\\s+$|;\\s+$", "", x[-1])))
## Checking for mistakes....
all.equal(LA1, LA2)
# [1] TRUE
all.equal(sum(LA1), length(A1))
# [1] TRUE
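To see why the code above subtracts 1 from `lengths(A2)` and drops the first element with `x[-1]`, it may help to run the same two `regmatches()` calls on a single string. This is a small illustration on the first example row from the question; the variable names are introduced here and are not part of the answer's code.

```r
x <- "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy"

# Locate every non-greedy "[...]" match in the string.
m <- gregexpr("\\[.*?\\]", x)

# The matches themselves: one element per bracketed author group.
inside <- regmatches(x, m)[[1]]

# invert = TRUE returns the text between matches. Because the string starts
# with a match, the first piece is an empty string, so there is always one
# more piece than there are groups.
outside <- regmatches(x, m, invert = TRUE)[[1]]

print(inside)
print(outside)
```

With two bracketed groups, `inside` has two elements and `outside` has three, the first of which is empty; that extra leading piece is what `x[-1]` and the `- 1` account for.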
Now that we have the vectors, we can use cSplit from my "splitstackshape" package to get the output you want:
library(splitstackshape)
library(data.table)  ## provides data.table(); attached explicitly in case splitstackshape does not attach it
library(magrittr)
## Make a data.table of the two vectors and the ID column
DT <- data.table(ID = rep(A[[1]], LA1), A1, A2)
## Here's the splitting....
final <- DT %>%
  cSplit("A1", ";", "long") %>% ## The first column is split and made long
  cSplit("A2", ",")             ## The second column is split and made wide
Here's the result:
final
# ID A1 A2_01 A2_02
# 1: 1 Aalten, Pauline Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 2: 1 Ramakers, Inez H. G. B. Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 3: 1 Rozendaal, Nico Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 4: 1 Verhey, Frans R. J. Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 5: 1 Biessels, Geert Jan Univ Med Ctr Utrecht Dept Neurol
# ---
# 13949: 2000 Meng, Qing-Hong Guiyang Med Coll Dept Immunol
# 13950: 2000 Chung, Peter Chee-Keung Guiyang Med Coll Dept Immunol
# 13951: 2000 Yuan, Kai-Tao Sun Yat Sen Univ Affiliated Hosp 1
# 13952: 2000 Yu, Li Guangzhou First Municipal Peoples Hosp Dept Paediat
# 13953: 2000 Yang, Ding-Hua Southern Med Univ Nan Fang Hosp
# A2_03 A2_04 A2_05 A2_06 A2_07 A2_08 A2_09 A2_10
# 1: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 2: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 3: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 4: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 5: Utrecht Netherlands NA NA NA NA NA NA
# ---
# 13949: Guiyang 550004 Guizhou Provinc Peoples R China NA NA NA NA NA
# 13950: Guiyang 550004 Guizhou Provinc Peoples R China NA NA NA NA NA
# 13951: Dept Surg Guangzhou 510080 Guangdong Peoples R China NA NA NA NA
# 13952: Guangzhou 510180 Guangdong Peoples R China NA NA NA NA NA
# 13953: Dept Hepatobiliary Surg Guangzhou 510515 Guangdong Peoples R China NA NA NA NA
#2
You can do this with a few regular-expression manipulations, using functions from plyr and foreach to process everything. Here is an example for the first row:
library(foreach)
library(plyr)
str1 = '[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy'
##split the string into different parts
s1 = strsplit(str1,'; \\[')
s1. = llply(s1,strsplit,split = ']')[[1]]
##get list of authors
auths = llply(s1.,function(x) gsub('^ ','',strsplit(gsub('\\[','',x[1]),';')[[1]]))
##get all other attributes
other.stuff = llply(s1.,function(x) gsub('^ ','',strsplit(x[2],',')[[1]]))
results = foreach(auth = auths, other = other.stuff, .combine = 'rbind') %do%
  expand.grid(auth,other[1],other[2],other[3],other[4],other[5])
The output's column names need to be changed, and you need to iterate this for each line, but that should be easy.
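As a rough illustration of those two remaining steps, the sketch below wraps the same splitting logic in a function, applies it to every line, and sets the column names. It uses base R in place of plyr and foreach, and the names `parse_line`, `n_info` and `lines` are hypothetical; `n_info` (the maximum number of columns to keep) is assumed to be known in advance.

```r
# Hypothetical helper: parse one line into a data frame with ID, Author and
# n_info Info columns. Groups with fewer columns are padded with NA.
parse_line <- function(line, id, n_info) {
  # Split into groups on "; [", then split each group at "]".
  groups <- strsplit(strsplit(line, "; \\[")[[1]], "]", fixed = TRUE)
  per_group <- lapply(groups, function(g) {
    authors <- trimws(strsplit(sub("^\\[", "", g[1]), ";")[[1]])
    other <- trimws(strsplit(g[2], ",")[[1]])[seq_len(n_info)]
    # Repeat the group's columns once per author.
    info <- matrix(other, nrow = length(authors), ncol = n_info, byrow = TRUE)
    out <- data.frame(id, authors, info, stringsAsFactors = FALSE)
    names(out) <- c("ID", "Author", sprintf("Info%02d", seq_len(n_info)))
    out
  })
  do.call(rbind, per_group)
}

# Two of the example rows from the question.
lines <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands"
)

# Apply the helper to every line, using the line's position as its ID.
results <- do.call(rbind, lapply(seq_along(lines), function(i) {
  parse_line(lines[i], i, n_info = 5)
}))
rownames(results) <- NULL
print(results)
```

Unlike the hard-coded `other[1]` to `other[5]` above, indexing with `seq_len(n_info)` lets the same function handle groups with fewer columns, and a larger `n_info` can be passed if some rows have more than five.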