I want to take a CSV export of my BibTeX literature database and analyse the correlation between keywords and journals. I start off with a CSV file containing one row per piece of literature, each with a journal name and a keyword list, which is a slash-delimited list. I want to end up with a Journal-by-Keyword matrix of counts.
Currently I've written this code, but there must be a better way. Does anyone have any ideas?
sortframe<-function(df,...){df[do.call(order,list(...)),]}
library(ggplot2)
library(plyr)
bib<-read.csv("/home/paul/workspace/Test_R_statet/data/bib.csv") # read csv file
So, here's the structure of my data. I've taken 31 rows that seem to be representative of the three thousand I've got in total.
dput(bib)
structure(list(BibliographyType = c(7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), Author = structure(c(19L, 21L,
22L, 23L, 24L, 25L, 20L, 28L, 26L, 27L, 1L, 2L, 2L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 15L, 16L, 17L,
18L), .Label = c("Constantinos, Apostolou; Dotsikas, Yannis; Kousoulos, Constantinos & Loukas, Yannis L.",
"Constantinos, Apostolou; Kousoulos, Constantinos; Dotsikas, Yannis; Soumelas, Georgios Stefanos; Kolocouri, Filomila; Ziaka, Afroditi & Loukas, Yannis L.",
"Constantinos, Kousoulos; Tsatsou, Georgia; Apostolou, Constantinos; Dotsikas, Yannis & Loukas, Yannis",
"Corine, Ekhart; Gebretensae, Abadi; Rosing, Hilde; Rodenhuis, Sjoerd; Beijnen, Jos H. & Huitema, Alwin D. R.",
"Costa, Ferreira Sergio Luis; Bruns, Roy Edward; da Silva, Erik Galvpo Paranhos; dos Santos, Walter Nei Lopes; Quintella, Cristina Maria; David, Jorge Mauricio; de Andrade, Jailson Bittencourt; Breitkreitz, Marcia Cristina; Jardim, Isabel Cristina Sales Fontes & Neto, Benicio Barros",
"Costa, Queiroz Regina Helena; Bertucci, Carlo; Malfarb, Wilson Roberto; Dreossi, Sônia Aparecida Carvalho; Chaves, Andrqa Rodrigues; Valqrio, Daniel Augusto Rodrigues & Queiroz, Maria EugWnia Costa",
"Cui, Shuangjin; Fang, Feng; Han, Liu & Ming, Ma", "D., Blessborn; Neamin, G.; Bergqvist, Y. & Lindegsrdh, N.",
"D., Fraier; Frigerio, E.; Brianceschi, G. & James, C. A.", "D., Grotto; Santa Maria, L. D.; Boeira, S.; Valentini, J.; Charpo, M. F.; Moro, A. M.; Nascimento, P. C.; Pomblum, V. J. & Garcia, S. C.",
"D., Hawker Charles; Garr, Susan B.; Hamilton, Leslie T.; Penrose, John R.; Ashwood, Edward R. & Weiss, Ronald L.",
"D., Hawker Charles; Roberts, William L.; Garr, Susan B.; Hamilton, Leslie T.; Penrose, John R.; Ashwood, Edward R. & Weiss, Ronald L.",
"D., Heath Dennis; Pruitt, Milagros A.; Brenner, Dean E.; Begum, Aynun N.; Frautschy, Sally A. & Rock, Cheryl L.",
"D., Jovanovic & Vukovic, S.", "D., McCullough B.", "D., McCullough B. & Vinod, H. D.",
"D., McCullough B. & Wilson, B.", "D., Mendes Gustavo; Hamamoto, Daniele; Ilha, Jaime; Pereira, Alberto dos Santos & De Nucci, Gilberto",
"do, Borges Ney Carter; Mendes, Gustavo D.; Barrientos-Astigarraga, Rafael E.; Galvinas, Paulo; Oliveira, Celso H. & De Nucci, Gilberto",
"hui, Liu Chang; Huang, Xiao tao; Zhang, Rong; Yang, Lei; Huang, Tian lai; Wang, Ning sheng & Mi, Sui qing",
"jing, Chen Zhang; Zhang, Jing; Yu, Ji cheng; Cao, Guo ying; Wu, Xiao jie & Shi, Yao guo",
"jun, Dao Yi; Jiao, Zheng & Zhong, Ming kang", "lan, Feng Shi; Hu, Fang di; Zhao, Jian xiong; Liu, Xi & Li, Y.",
"ming, Huang Jian; Wang, Guo quan; Jin, Yu; Shen, Teng & Weng, Weiyu",
"nhaug, Halvorsen Trine Gr; Pedersen-Bjergaard, Stig & Rasmussen, Knut E.",
"qing, Liu Hua; Su, Meng xiang; Di, Bin; Hang, Tai jun; Hu, Ying; Tian, Xiao qin; Zhang, Yin di & Shen, Jian ping",
"qing, Liu Yun; Chen, Qi yuan; Chen, Ben Mei; Liu, Shao gang; Deng, Fu liang & Zhou, Ping",
"ying, Lee Chun & Lee, Yung-jin"), class = "factor"), Title = structure(c(29L,
23L, 24L, 9L, 10L, 15L, 8L, 18L, 11L, 21L, 20L, 3L, 3L, 3L, 12L,
25L, 26L, 19L, 16L, 2L, 14L, 22L, 6L, 7L, 27L, 13L, 5L, 4L, 28L,
17L, 1L), .Label = c("Anastrozole quantification in human plasma by high-performance liquid chromatography coupled to photospray tandem mass spectrometry applied to pharmacokinetic studies",
"A new approach to evaluate stability of amodiaquine and its metabolite in blood and plasma",
"An improved and fully validated LC-MS/MS method for the simultaneous quantification of simvastatin and simvastatin acid in human plasma",
"Assessing the reliability of statistical software: Part I",
"Assessing the reliability of statistical software: Part II",
"Automated Transport and Sorting System in a Large Reference Laboratory: Part 1. Evaluation of Needs and Alternatives and Development of a Plan",
"Automated Transport and Sorting System in a Large Reference Laboratory: Part 2. Implementation of the System and Performance Measures over Three Years",
"Determination of CQP propionic acid in rat plasma and study of pharmacokinetics of CQP propionic acid in rats by liquid chromatography",
"Determination of eleutheroside E and eleutheroside B in rat plasma and tissue by high-performance liquid chromatography using solid-phase extraction and photodiode array detection",
"Determination of palmatine in canine plasma by liquid chromatography-tandem mass spectrometry with solid-phase extraction",
"Development and validation of a liquid chromatography-tandem mass spectrometry method for the determination of xanthinol in human plasma and its application in a bioequivalence study of xanthinol nicotinate tablets",
"Development of a high-throughput method for the determination of itraconazole and its hydroxy metabolite in human plasma, employing automated liquid-liquid extraction based on 96-well format plates and LC/MS/MS",
"Generation of quasi-stationary magnetic fields in turbulent plasmas",
"LC-MS-MS determination of nemorubicin (methoxymorpholinyldoxorubicin, PNU-152243A) and its 13-OH metabolite (PNU-155051A) in human plasma",
"Liquid-phase microextraction and capillary electrophoresis of citalopram, an antidepressant drug",
"New method for high-performance liquid chromatographic determination of amantadine and its analogues in rat plasma",
"On the accuracy of statistical procedures in Microsoft Excel 97",
"PKfit - A Pharmacokinetic Data Analaysis Tool in R", "Quantification of carbamazepine, carbamazepine-10,11-epoxide, phenytoin and phenobarbital in plasma samples by stir bar-sorptive extraction and liquid chromatography",
"Quantitative determination of donepezil in human plasma by liquid chromatography/tandem mass spectrometry employing an automated liquid-liquid extraction based on 96-well format plates: Application to a bioequivalence study",
"Quantitative determination of erythromycylamine in human plasma by liquid chromatography-mass spectrometry and its application in a bioequivalence study of dirithromycin",
"Rapid quantification of malondialdehyde in plasma by high performance liquid chromatography-visible detection",
"Selective method for the determination of cefdinir in human plasma using liquid chromatography electrospray ionization tandam mass spectrometry",
"Simultaneous determination of aciclovir, ganciclovir, and penciclovir in human plasma by high-performance liquid chromatography with fluorescence detection",
"Simultaneous quantification of cyclophosphamide and its active metabolite 4-hydroxycyclophosphamide in human plasma by high-performance liquid chromatography coupled with electrospray ionization tandem mass spectrometry (LC-MS/MS)",
"Statistical designs and response surface techniques for the optimization of chromatographic systems",
"Tetrahydrocurcumin in plasma and urine: Quantitation by high performance liquid chromatography",
"The numerical reliability of econometric software", "Verapamil quantification in human plasma by liquid chromatography coupled to tandem mass spectrometry: An application for bioequivalence study"
), class = "factor"), Journal = structure(c(7L, 7L, 7L, 5L, 7L,
6L, 7L, 1L, 7L, 7L, 7L, 9L, 9L, 9L, 2L, 7L, 6L, 9L, 9L, 9L, 9L,
9L, 3L, 3L, 7L, 10L, 11L, 11L, 8L, 4L, 7L), .Label = c("", "Analytical and Bioanalytical Chemistry",
"Clinical Chemistry", "Computational Statistics and Data Analysis",
"European Journal of Pharmaceutics and Biopharmaceutics", "Journal of Chromatography A",
"Journal of Chromatography B", "Journal of Economic Literature",
"Journal of Pharmaceutical and Biomedical Analysis", "Physica B+C",
"The American Statistician"), class = "factor"), Custom3 = structure(c(8L,
9L, 11L, 17L, 25L, 19L, 24L, 27L, 12L, 22L, 2L, 5L, 5L, 6L, 3L,
1L, 20L, 23L, 13L, 14L, 4L, 7L, 21L, 16L, 15L, 26L, 28L, 28L,
28L, 10L, 18L), .Label = c("4-Hydroxycyclophosphamide/Accuracy/Active metabolite/Assay/Chromatography/Cyclophosphamide/Determination/Electrospray/Electrospray ionization/Electrospray ionization tandem mass spectrometry/High performance liquid chromatography/High-performance liquid chromatography/Human/Human plasma/Internal standard/LC-MS/MS/Liquid chromatography/Liquid chromatography tandem mass spectrometry/Mass spectrometry/Metabolite/Pharmacokinetic/Pharmacokinetics/Plasma/Precipitation/Precision/Protein precipitation/Quantification/Sample preparation/Tandem mass spectrometry",
"96-Well/96-Well format/Analytical/Automated liquid-liquid extraction/bioequivalence/Bioequivalence study/Determination/Donepezil/Electrospray/Electrospray ionization/Extraction/Freezing/High throughput/High-throughput/Human/Human plasma/Human-plasma/LC-MS/MS/Liquid chromatography/tandem mass spectrometry/Liquid-liquid extraction/Loratadine/Mass spectrometry/Plasma/Plasma samples/Quantitative/Sample preparation/Tablet/Validation",
"96-Well/96-Well format/Assay/bioequivalence/Bioequivalence study/Determination/Electrospray/Electrospray ionization/Extraction/Freezing/High throughput/High-throughput/Human/Human plasma/Human-plasma/Interface/Internal standard/LC/MS/MS/LLE/Mass spectrometry/Metabolite/Monitoring/MRM/Parallel sample processing/Plasma/Plasma sample/Plasma samples/Precision/Quality control/Quantification/Simultaneous quantification/Tablet",
"96-Well/96-well plates/Accuracy/Analysis/Determination/Doxorubicin/Doxorubicin derivative/Extraction/Human/Human plasma/Human-plasma/In vivo/Interface/Interference/Internal standard/Ionspray/LC-MS-MS/LC-MS-MS determination/Liquid chromatography tandem mass spectrometry/Liquid chromatography-tandem mass spectrometry/Mass spectrometry/Metabolite/Methoxymorpholinyldoxorubicin/Monitoring/Multiple reaction monitoring/Nemorubicin/Patients/Plasma/Plasma samples/Precision/Quantitative/Quantitative determination/Residue/Solid phase extraction/SPE",
"96-Well/Analysis/Analytical/APCI/Atmospheric pressure chemical ionization/bioequivalence/Bioequivalence study/Determination/Electrospray/ESI/Extraction/Fully automated/High throughput/High-throughput/Human/Human plasma/Human-plasma/Improved/Internal standard/LC-MS/MS/LC-MS/MS method/Linearity/Liquid chromatography/tandem mass spectrometry/Liquid-liquid extraction/LLE/Lovastatin/Mass spectrometry/Plasma/Plasma sample/Plasma samples/Polarity switch/Precipitation/Precision/Protein precipitation/Quantification/Sample preparation/Simultaneous determination/Simultaneous quantification/Simvastatin/Simvastatin acid/Specificity/Tablet/Two-step extraction",
"96-Well/Atmospheric pressure chemical ionization/bioequivalence/Electrospray/High throughput/High-throughput/Human plasma/LC-MS/MS/Liquid-liquid extraction/Plasma/Polarity switch/Protein precipitation/Sample preparation/Simvastatin/Two-step extraction",
"Accuracy/Alkaline hydrolysis/Analytical/Assay/Bias/Deproteinization/Derivatization/Determination/Extraction/HPLC-VIS/Human plasma/Malondialdehyde/MDA/n-Butanol extraction/Phosphate/Plasma/Quantification/Reproducibility/Stability",
"Accuracy/Analysis/Analytical/bioequivalence/Bioequivalence study/Chromatography/Determination/Electrospray/Electrospray ionization/ESI/Extraction/Formulation/Human/Human plasma/Human-plasma/Imprecision/Internal standard/LC-MS/MS/Liquid chromatography/Liquid-liquid extraction/Mass spectrometry/Metoprolol/Monitoring/MRM/Plasma/Plasma samples/Quantification/Tablet/Tandem mass spectrometry/Verapamil",
"Accuracy/Cefdinir/Chromatography/Determination/Electrospray/Electrospray ionization/Healthy volunteer/Human/Human plasma/Human-plasma/LC/LC-MS/MS/Liquid chromatography/Mass spectrometry/Method validation/Monitoring/MS/MS/Pharmacokinetic/Pharmacokinetic profile/Plasma/Precipitation/Protein precipitation/Quantification/Three/Triple quadrupole/Validation/Water/Waters",
"Accuracy/statistics/reliability/testing", "Aciclovir/Assay/Bias/Chromatography/Determination/Ganciclovir/High performance liquid chromatography/High-performance liquid chromatography/HPLC/HPLC method/Human/Human plasma/Liquid chromatography/Penciclovir/Pharmacokinetic/Pharmacokinetic study/Plasma/Precipitation/Protein precipitation/Three",
"Acyclovir/bioequivalence/Electrospray/Extraction/Human/Human plasma/Liquid chromatography-tandem mass spectrometry/Liquid chromatography/tandem mass spectrometry/Mass spectrometry/Plasma/Precipitation/Protein precipitation/Quantification/Validation/Xanthinol nicotinate",
"Amantadine/Anthraquinone-2-sulfonyl chloride/Derivatization/Determination/HPLC/Memantine/Pharmacokinetic/Pharmacokinetic studies/Pharmacokinetic study/Plasma/Quantification/Rat/Rat plasma/Rimantadine/Three/UV/UV detection",
"Amodiaquine/Analysis/Antimalarial/Bias/Blood/Chloroquine/Desethylamodiaquine/Liquid chromatography/Metabolite/Plasma/Simultaneous analysis/solid-phase extraction/Stability/Whole blood",
"Analysis/Analytical/Blood/Chromatography/Curcumin/High performance liquid chromatography/HPLC/Internal standard/Liquid chromatography/Metabolite/Metabolites/Methods/Plasma/Quantification/Quantitation/Tetrahydrocurcumin/Urine/UV/UV detection/UV-detection",
"Analysis/Automation/Linear/Methods/Three", "Analysis/Blood/Chromatography/Determination/Eleutherococcus injection/Eleutheroside B/Eleutheroside E/Extraction/High performance liquid chromatography/High-performance liquid chromatography/HPLC/HPLC method/Liquid chromatography/Model/Pharmacokinetic/Pharmacokinetic studies/Pharmacokinetic study/Pharmacokinetics/Plasma/Rat/Rat plasma/Rats/Sample preparation/Solid phase extraction/solid-phase extraction/Tissue distribution",
"Analytical/Anastrazole/Anastrozole/Chromatography/Extraction/Healthy volunteer/High performance liquid chromatography/High-performance liquid chromatography/HPLC-MS-MS/Human/Human plasma/Human-plasma/Internal standard/Liquid chromatography/Liquid-liquid extraction/Mass spectrometry/Pharmacokinetic/Pharmacokinetic studies/Pharmacokinetic study/Pharmacokinetics/Photospray/Plasma/Quantification/Tandem mass spectrometry",
"Antidepressant/Antidepressant drug/Basic drugs/Capillary electrophoresis/CE/Citalopram/Detection/Drugs/Extraction/Hollow fibre/HPLC/HPLC method/Human/Human plasma/Human-plasma/liquid phase microextraction/Liquid-phase microextraction/LPME/Metabolite/Methods/Microextraction/N-Desmethylcitalopram/Phosphate/Plasma/Plasma sample/Plasma samples/Proteins/Quantification",
"Applications/Box-Behnken design/Box-Benhken design/Central composite design/Chromatographic methods/Determination/DOE/Doehlert matrix/Extraction/Methodology/Methods/Model/Multivariate techniques/Optimization/paper/Review/Sample preparation/Validation",
"Automation/Improved/Methods", "bioequivalence/Bioequivalence study/Determination/Dirithromycin/Electrospray/Electrospray ionization/Erythromycylamine/Extraction/Human/Human plasma/LC-MS/Plasma/Precision/Quantification/Residue",
"Carbamazepine/Carbamazepine-10,11-epoxide/Extraction/High-performance liquid chromatography/Liquid chromatography/Optimization/Phenobarbital/Phenytoin/Plasma/Quantification/Stir bar-sorptive extraction/Therapeutic drug monitoring/Three",
"Chromatography/CQP propionic acid/Determination/Extraction/Liquid chromatography/Pharmacokinetic/Pharmacokinetic studies/Pharmacokinetic study/Pharmacokinetics/Plasma/Rat/Rat plasma/Solid phase extraction/solid-phase extraction/UV",
"Determination/Dog/Electrospray/Electrospray ionization/Extraction/High-performance liquid chromatography-tandem mass spectrometry/HPLC-MS-MS/Internal standard/Jatrorrhizine/LC-MS-MS/Liquid chromatography tandem mass spectrometry/Liquid chromatography-tandem mass spectrometry/Mass spectrometry/Oasis/Palmatine/Pharmacokinetic/Pharmacokinetic studies/Pharmacokinetic study/Pharmacokinetics/Plasma/Plasma samples/Quantification/Solid phase extraction/solid-phase extraction/SPE/Water/Waters",
"Interaction/Plasma/Turbulence/Turbulent plasmas", "Pharmacokinetic/R",
"Software/statistics/reliability/testing"), class = "factor")), .Names = c("BibliographyType",
"Author", "Title", "Journal", "Custom3"), row.names = c(1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 270L, 271L, 272L, 273L, 274L,
275L, 276L, 277L, 278L, 279L, 280L, 281L, 282L, 283L, 284L, 285L,
286L, 287L, 288L, 289L, 290L), class = "data.frame")
So that's my example data. Here I manually loop over the data and build a narrow result data frame, which I could turn into my desired result using melt/reshape.
resu <- data.frame(Journal = NA, Keyword = NA, Count = 0)   # narrow result data frame
l <- 1                                                      # index of the next row to fill in resu
for (n in 1:nrow(bib)) {                                    # loop over entries
  message(n)
  keyw <- strsplit(as.character(bib$Custom3[n]), "/")[[1]]
  if (length(keyw) > 0) {                                   # if there was a keyword...
    for (i in 1:length(keyw)) {                             # for each keyword, add a line with Journal, Keyword, 1
      message(paste("i is ", i, sep = ""))
      message(paste("l is ", l, sep = ""))
      message(paste("journal ", bib$Journal[n], sep = ""))
      resu[l, ] <- c(as.character(bib$Journal[n]), keyw[i], 1)
      message(paste("Keyword is ", keyw[i], sep = ""))
      resu$Count[l] <- 1
      l <- l + 1
    }
  }
}
#Now use ddply to summarise
keywordtable<-ddply(resu, c("Journal","Keyword"),function(df) {
result<-data.frame(Journal=df$Journal[1], Keyword=df$Keyword[1], Count=sum(as.numeric(df$Count)))
return(result)
})
Now I can take the highest scores and plot a 'heatmap' style graph.
trimkeyword <- subset(keywordtable, !(Journal == "") & Count > 5, drop = TRUE)
# factor() rebuilds the levels and, unlike droplevels(), also works if the column is character
trimkeyword$Journal <- factor(trimkeyword$Journal)
trimkeyword$Keyword <- factor(trimkeyword$Keyword)
qplot(data=trimkeyword, x=Journal, y=Count)
p <- ggplot(trimkeyword, aes(Journal, Keyword)) + geom_tile(aes(fill = Count), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue")
base_size <- 6
p <- p + theme_grey(base_size = base_size) + labs(x = "", y = "") +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  # opts(), theme_blank() and theme_text() were removed from ggplot2; theme() and element_*() replace them
  theme(legend.position = "none", axis.ticks = element_blank(),
        axis.text.x = element_text(size = base_size * 0.8,
                                   angle = 330, hjust = 0, colour = "grey50"))
print(p)
Does anyone else want to suggest anything?
Other things that come to mind are:
- Are there any good ways (in R) to weed out sets of keywords that have the same meaning (e.g. cat, cats, feline, pussy could all be replaced with cat)?
- Is there a way to build the table without looping?
EDIT: I've replaced the dummy data with something that's more representative.
3 solutions
#1
2
Here is a much shorter piece of code to get to keywordtable from the data frame bib:
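library(plyr)      # dlply(), ddply(), summarize()
library(reshape2)  # melt(); assumed, the answer does not say which package (reshape or reshape2) it loaded
# create list of keywords by journal
res = dlply(bib, .(Journal), summarize,
  keyw = strsplit(as.character(Custom3), "/"))
# convert into dataframe
res = melt(unlist(res))
res$journal = rownames(res)
names(res)[1] = 'keyword'
rownames(res) = NULL
# strip the ".keyw" suffix literally (fixed = TRUE stops "." matching any character)
res$journal = with(res, gsub('.keyw', "", journal, fixed = TRUE))
res$journal = with(res, gsub('[[:digit:]]', "", journal))
res$keyword = tolower(res$keyword)
keywordtable = ddply(res, .(journal, keyword), summarize,
  count = length(keyword))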
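The question asks for a Journal-by-Keyword matrix, while keywordtable is in long form (one row per journal/keyword pair). A minimal sketch of the last step, using base R's xtabs(); the toy data frame below is made up for illustration and only mimics the column names of keywordtable:

```r
# Hypothetical stand-in for keywordtable (same column names, invented values)
keywordtable_toy <- data.frame(
  journal = c("Journal A", "Journal A", "Journal B"),
  keyword = c("plasma", "hplc", "plasma"),
  count   = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are journals, columns are keywords, cells hold the summed counts.
# Journal/keyword pairs that never occur are filled with 0.
kw_matrix <- xtabs(count ~ journal + keyword, data = keywordtable_toy)
print(kw_matrix)
```

The result is a contingency table that can be indexed like a matrix, e.g. kw_matrix["Journal A", "plasma"].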
An alternate visualization would be to create a word cloud of keywords using the snippets package. Here is the code to do that:
library(snippets);
keywords = table(res$keyword);
cloud(keywords, col = col.br(keywords, fit=TRUE))
#2
2
Here's a slightly refactored version of your manipulations (I don't know much about ggplot2).
bib <- bib[bib$Journal != "",]
resu <- NULL
for(i in 1:nrow(bib)) {
resu <- rbind(resu,
data.frame( Journal=as.character(bib$Journal[i]),
Keyword=strsplit(as.character(bib$Custom3[i]),"/")[[1]],
Count=1, stringsAsFactors=FALSE ))
}
keywordtable <- aggregate(resu[,"Count",FALSE],
by=resu[,c("Journal","Keyword")], sum)
trimkeyword <- subset(keywordtable, Count > 0, drop=TRUE)
trimkeyword <- trimkeyword[order(trimkeyword$Journal,trimkeyword$Keyword),]
trimkeyword$Journal <- factor(trimkeyword$Journal)
trimkeyword$Keyword <- factor(trimkeyword$Keyword)
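On the "without looping" part of the question, the same long table can probably be built without an explicit for loop, because strsplit() is vectorised over its input. A rough sketch in base R, using a made-up bib_toy data frame that only borrows the Journal and Custom3 column names from the question:

```r
# Hypothetical stand-in for bib (invented rows, same column names as in the question)
bib_toy <- data.frame(
  Journal = c("Journal A", "Journal A", "Journal B"),
  Custom3 = c("Plasma/HPLC", "Plasma/Rat", "Plasma"),
  stringsAsFactors = FALSE
)

# Split all keyword strings at once: a list with one character vector per row
kw <- strsplit(as.character(bib_toy$Custom3), "/", fixed = TRUE)

# Repeat each journal once per keyword so the two columns line up.
# Rows with an empty keyword string contribute zero keywords and simply drop out.
long <- data.frame(
  Journal = rep(as.character(bib_toy$Journal), lengths(kw)),
  Keyword = unlist(kw),
  stringsAsFactors = FALSE
)

# Count every Journal/Keyword pair, then drop the zero-count combinations
# that table() fills in for pairs that never occur
keywordtable_toy <- as.data.frame(table(long), responseName = "Count",
                                  stringsAsFactors = FALSE)
keywordtable_toy <- keywordtable_toy[keywordtable_toy$Count > 0, ]
print(keywordtable_toy)
```

This avoids growing a data frame row by row with rbind(), which tends to get slow as the number of entries reaches the thousands.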
#3
0
Take a look at the text mining package - tm
http://cran.r-project.org/web/packages/tm/index.html
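For the narrower problem of merging keywords that mean the same thing, a hand-made lookup table in base R may already go a long way before a full text-mining package is needed. The sketch below is only an illustration: normalise_keywords() and the synonyms vector are hypothetical names, and the mapping itself would have to be curated by hand for the real keyword list.

```r
# A few keyword variants of the kind that appear in the example data
keywords <- c("Human plasma", "Human-plasma", "HPLC",
              "High performance liquid chromatography",
              "High-performance liquid chromatography", "Rat")

# Hypothetical, hand-curated lookup: names are variants (lower case),
# values are the canonical keyword they should be replaced with
synonyms <- c(
  "human-plasma"                           = "human plasma",
  "high performance liquid chromatography" = "hplc",
  "high-performance liquid chromatography" = "hplc"
)

# Lower-case and trim each keyword, then replace any variant found in the lookup
normalise_keywords <- function(x, lookup) {
  x <- tolower(trimws(x))
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

cleaned <- normalise_keywords(keywords, synonyms)
print(table(cleaned))
```

Keywords that are not in the lookup pass through unchanged apart from the lower-casing, so the table can be extended incrementally as new variants turn up.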