Selectively reading a txt file in R

Time: 2022-12-03 19:25:14

I'm looking for an easy way to read a txt file that looks like this when opened in Excel:

IDmaster    By_uspto    App_date    Grant_date  Applicant   Cited   
2   1   19671106    19700707    Motorola Inc    1052446 
2   1   19740909    19751028    Gen Motors Corp 1062884 
2   1   19800331    19820817    Amp Incorporated    1082369 
2   1   19910515    19940719    Dell Usa L.P.   389546  
2   1   19940210    19950912    Schueman Transfer    Inc.   1164239
2   1   19940217    19950912    Spacelabs Medical    Inc.   1164336

EDIT: Opened in Notepad, the txt file looks like this (comma-separated). The last two rows show the problem.

IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited
2,1,19671106,19700707,Motorola Inc,1052446
2,1,19740909,19751028,Gen Motors Corp,1062884
2,1,19800331,19820817,Amp Incorporated,1082369
2,1,19910515,19940719,Dell Usa L.P.,389546
2,1,19940210,19950912,Schueman Transfer, Inc.,1164239
2,1,19940217,19950912,Spacelabs Medical, Inc.,1164336

The problem is that some of the Applicant names contain commas so that they are read as if they belong in a different column, which they actually don't.
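
One way to see which rows are affected is to count the comma-separated fields on each line. The sketch below is only an illustration: it uses three lines copied from the sample above instead of the real file, and the variable names `lines`, `n_fields` and `bad` are made up for the example.

```r
# Sample lines copied from the question; with a real file, pass its
# path to count.fields() instead of a textConnection.
lines <- c(
  "IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited",
  "2,1,19910515,19940719,Dell Usa L.P.,389546",
  "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"
)

# count.fields() reports how many separated fields each line contains.
n_fields <- count.fields(textConnection(lines), sep = ",", quote = "")
n_fields

# Lines with more fields than the header contain at least one extra comma.
bad <- which(n_fields > n_fields[1])
lines[bad]
```

Any line listed in `lines[bad]` has a comma inside one of its values, which is what shifts the later values one column to the right.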

Is there a simple way to either a) "teach" R to keep string variables together, regardless of any commas inside them, or b) read in the first 4 columns, and then add an extra column for everything after the last comma?

Given the length of the data, I can't open it entirely in Excel, which would otherwise be a simple alternative.

2 solutions

#1


2  

If your example is saved in a file called "Test.csv", try:

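# Replace every ", " (comma + space) with a single space so that commas inside
# names no longer act as separators, then parse the cleaned text.
read.csv(text=gsub(', ', ' ', paste0(readLines("Test.csv"),collapse="\n")),
         quote="'",
         stringsAsFactors=FALSE)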

It returns:

#   IDmaster By_uspto App_date Grant_date              Applicant   Cited
# 1        2        1 19671106   19700707           Motorola Inc 1052446
# 2        2        1 19740909   19751028        Gen Motors Corp 1062884
# 3        2        1 19800331   19820817       Amp Incorporated 1082369
# 4        2        1 19910515   19940719          Dell Usa L.P.  389546
# 5        2        1 19940210   19950912 Schueman Transfer Inc. 1164239
# 6        2        1 19940217   19950912 Spacelabs Medical Inc. 1164336
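
The `gsub()` call above depends on the stray comma being followed by a space, and it drops the comma from the name. A possible variant, closer to option b) in the question, is to split each line with a regular expression: four comma-free fields, then everything up to the last comma, then the last field. This is a minimal sketch under the assumption that every line has at least six fields. The names `pattern`, `parts`, `fields`, `df` and `num_cols` are illustrative, and the sample lines stand in for `readLines("Test.csv")`.

```r
lines <- c(
  "IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited",
  "2,1,19671106,19700707,Motorola Inc,1052446",
  "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"
)
# With a real file: lines <- readLines("Test.csv")

# Four comma-free fields, then anything (commas allowed), then a final
# comma-free field. The greedy (.*) absorbs commas inside the name.
pattern <- "^([^,]*),([^,]*),([^,]*),([^,]*),(.*),([^,]*)$"
parts <- regmatches(lines, regexec(pattern, lines))

# Each element holds the full match followed by the six captured groups;
# drop the full match and stack the groups into a character matrix.
fields <- do.call(rbind, lapply(parts, function(p) p[-1]))

# First row is the header, the remaining rows are data.
df <- as.data.frame(fields[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- fields[1, ]

# Convert the numeric columns; Applicant stays a character column.
num_cols <- setdiff(names(df), "Applicant")
df[num_cols] <- lapply(df[num_cols], as.numeric)
df
```

Here the Applicant value keeps its comma ("Schueman Transfer, Inc."). Lines that do not match the pattern, such as blank lines, would need to be filtered out before the `rbind` step.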

#2


1  

This is a very silly workaround, but it does the trick for me (because I don't really care about the Applicant names at the moment). However, I'm hoping for a better solution.

Step 1: Open the .txt file in Notepad and append five extra column names, V1, V2, V3, V4, V5, to the header line (to be sure to capture names with multiple commas). Then read the file and clean it up:

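library(data.table)

# fill = TRUE pads short rows with NA so that every row has all 11 columns
bc <- read.table("data.txt", header = TRUE, fill = TRUE, sep = ",", stringsAsFactors = FALSE)

sapply(bc, class)
unique(bc$V5) # only NA, so the column can be dropped
setDT(bc)
bc <- bc[, 1:10, with = FALSE]

# Name fragments that spilled into these columns become NA under as.numeric()
bc$Cited <- as.numeric(bc$Cited)
bc$V1 <- as.numeric(bc$V1)
bc$V2 <- as.numeric(bc$V2)
bc$V3 <- as.numeric(bc$V3)
bc$V4 <- as.numeric(bc$V4)

# Replace the NAs with 0 so the columns can be added up
bc$Cited[is.na(bc$Cited)] <- 0
bc$V1[is.na(bc$V1)] <- 0
bc$V2[is.na(bc$V2)] <- 0
bc$V3[is.na(bc$V3)] <- 0
bc$V4[is.na(bc$V4)] <- 0

# Assuming name fragments are never numeric, only one of these columns is
# non-zero in each row, so the sum recovers the real Cited value
bc$Cited <- with(bc, Cited + V1 + V2 + V3 + V4)
head(bc, 10)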

It's a silly patch, but it does the trick in this particular context.
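
The manual Notepad step could probably be avoided by letting R work out how many extra columns are needed and passing the padded header through `col.names`. The following is a rough sketch of that idea, not a tested replacement for the code above. It runs on sample lines instead of "data.txt", the names `n_max`, `header`, `extra` and `tail_cols` are hypothetical, and it shares the assumption that spilled name fragments are never numeric.

```r
lines <- c(
  "IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited",
  "2,1,19671106,19700707,Motorola Inc,1052446",
  "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"
)

# The widest line determines how many columns are needed in total.
n_max <- max(count.fields(textConnection(lines), sep = ",", quote = ""))
header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
extra <- paste0("V", seq_len(n_max - length(header)))

# Skip the original header and supply the padded one via col.names;
# fill = TRUE pads short rows with NA.
bc <- read.table(text = lines, sep = ",", skip = 1, fill = TRUE, quote = "",
                 col.names = c(header, extra), stringsAsFactors = FALSE)

# Convert Cited and the overflow columns to numeric (text becomes NA),
# then take the row sum, ignoring NAs, to recover the real Cited value.
tail_cols <- lapply(bc[c("Cited", extra)],
                    function(x) suppressWarnings(as.numeric(x)))
bc$Cited <- rowSums(as.data.frame(tail_cols), na.rm = TRUE)

# Keep only the original columns.
bc <- bc[header]
bc
```

As with the original patch, the Applicant column keeps only the part of the name before the first comma, so this is only suitable when the names themselves do not matter.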
