Which function should I use to read an unstructured text file into R?

Time: 2022-09-10 22:40:05

This is my first ever question here and I'm new to R, trying to figure out my first step in how to do data processing, please keep it easy : )

I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no new line characters in it.

Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?

Thank you in advance.

PN

P.S. If I use "." as my delimiter, it would treat things like "Mr." as a separate sentence. While this is just an example and I'm not concerned about this flaw, just for educational purposes, I'd still be curious how you'd get around this problem.

1 solution

#1


read.delim() reads data in table format (with rows and columns, as in Excel). It is not very useful for reading running text.

To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as there are lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. A line of text is therefore not defined by the width of your software window; it can run over many visual rows. In effect, a line of text is what in a book would be a paragraph. So readLines() splits your text at the paragraphs (a file with no newlines at all comes back as a single element):

> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""
[2] "No answer."
[3] "\"TOM!\""
[4] "No answer."
[5] "\"What's gone with that boy,  I wonder? You TOM!\""
[6] "No answer."
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\""

Note that the seventh line is longer than this column is wide, so you may have to scroll sideways to read all of it.

As you can see, readLines() read that long seventh paragraph as one line. You can also see a backslash in front of each quotation mark that belongs to the text. readLines() did not add these to your data: R displays each string in quotation marks when it prints it, so it needs to distinguish those from the quotation marks that are part of the original text. Therefore, it "escapes" the original quotation marks in the printed output. The stored text is unchanged, and writeLines() or cat() will show it without the backslashes. Read about escaping on Wikipedia.
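
A small sketch of the difference between how a string is printed and what it actually contains. The variable name s is just for illustration:

```r
# A string whose text contains two literal double quotes
s <- "\"TOM!\""

print(s)       # console display: the inner quotes are shown escaped
writeLines(s)  # raw text, as it would appear in a file
nchar(s)       # counts the quotes and the four characters between them
```

nchar() counts six characters here, which suggests that the backslashes are only part of the display and not of the stored text.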

readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to: it is not an error, and suppressing the warning does nothing except hide the message.
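
To see this for yourself, here is a minimal sketch that writes a throwaway file without a trailing newline. The temporary file and the variable names are made up for the demonstration:

```r
# Hypothetical demo file, written without a newline after the last line
path <- tempfile(fileext = ".txt")
cat("first line\nlast line", file = path)

# Catch the warning and keep its message instead of printing it
msg <- tryCatch(readLines(path), warning = function(w) conditionMessage(w))

# With warn = FALSE the same file is read silently
lines <- readLines(path, warn = FALSE)
length(lines)  # both lines are there, including the "incomplete" one
```

The last line is read either way; the warning only tells you how the file ends.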

If you want to process your text further rather than just print it to the R console, create an object that holds the output of readLines():

mytext <- readLines("textfile.txt")

Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
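
For the case in the question, a book with no newline characters in it, here is a rough sketch of two ways to get the whole file into a single string. The temporary file stands in for the real book, and readChar() is only one of several possible choices:

```r
# Hypothetical stand-in for the book: one long line, no newline at all
path <- tempfile(fileext = ".txt")
cat("This is the whole book. It has no line breaks.", file = path)

# Option 1: readLines() returns one element per line, so here just one
book_lines <- readLines(path, warn = FALSE)

# Option 2: readChar() reads a given number of bytes into a single string
book <- readChar(path, nchars = file.info(path)$size, useBytes = TRUE)

length(book_lines)  # a character vector with a single element
```

Either way you end up with a character vector of length one, which you can then split into sentences or words.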

I would strongly advise you to write your text in a .txt file in a text editor like Vim, Notepad, TextWrangler etc., and not compose it in a word processor like MS Word. Word files contain more than the text you see on screen or in print, and R will read all of that too. You can try and see what you get, but for good results you should either save your file as a .txt file from Word or compose it in a text editor.

You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:

> myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."

Note how pressing Return does not cause R to execute the command until I have closed the string with "). R just replies with +, telling me that I can continue to edit. I did not type in those pluses. Try it. Note also that the newlines are now part of your string of text, where R represents each of them as \n.
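
If you would rather have one element per line, as readLines() gives you, a string like this can be split at the embedded newlines. A minimal sketch, with typed_lines as a made-up name:

```r
# The same text as above, written with explicit \n escapes
myothertext <- "What did you do?\nI wrote some text.\nAh, interesting."

# fixed = TRUE treats "\n" as a literal newline, not as a regular expression
typed_lines <- strsplit(myothertext, "\n", fixed = TRUE)[[1]]

writeLines(myothertext)  # shows the text with real line breaks
length(typed_lines)      # one element per line
```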

If you input your text manually, I would load the whole text as one string into a vector:

x <- c("The text of your book.")

You could load different chapters into different elements of this vector:

y <- c("Chapter 1", "Chapter 2")

For better reference, you can name the elements:

z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")
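
Naming the elements pays off when you look things up later. A short sketch of what the names allow, using the same z as above:

```r
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?",
       ch2 = "This is the text of the second chapter. It is even shorter.")

names(z)     # the chapter labels
z[["ch2"]]   # the text of one chapter, looked up by name
nchar(z)     # number of characters in each chapter, labelled by name
```

This is still a plain character vector, not a list, which is usually all you need to hold raw text.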

Now you can split the elements of any of these vectors:

sentences <- strsplit(z, "[.!?] *")

Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit() to split the elements of the vector at any of the three punctuation marks followed by optional spaces (if you leave the space out of the pattern, the resulting "sentences" will start with a space). Note that the punctuation marks themselves are removed from the result.
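
To illustrate the remark about the optional space, here is a small comparison of the two patterns on a made-up string (the variable names are only for this example):

```r
x <- "It is not long! Why was the author so lazy?"

# Without the space in the pattern, the second piece keeps its leading space
no_space <- strsplit(x, "[.!?]")[[1]]

# With " *", any spaces after the punctuation mark are consumed by the split
with_space <- strsplit(x, "[.!?] *")[[1]]

no_space
with_space
```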

sentences now contains:

> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"
[3] "Why was the author so lazy"

$ch2
[1] "This is the text of the second chapter" "It is even shorter"

You can access the individual sentences by indexing:

> sentences$ch1[2]
[1] "It is not long"
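
Note that strsplit() returns a list with one character vector per chapter. A short sketch of a few other ways to work with it; all_sentences is a name I made up here:

```r
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?",
       ch2 = "This is the text of the second chapter. It is even shorter.")
sentences <- strsplit(z, "[.!?] *")

class(sentences)        # a list, one element per chapter
sentences[["ch1"]][2]   # same sentence as above, using [[ ]] instead of $
lengths(sentences)      # number of sentences in each chapter

# Flatten the list if you no longer care which chapter a sentence came from
all_sentences <- unlist(sentences, use.names = FALSE)
```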

R has no way of knowing that it should not split after "Mr.". You must define such exceptions in your regular expression. Explaining this in full is beyond the scope of this question.
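
Just to give a rough idea, here is one possible sketch of such an exception. It assumes a short, hand-picked list of abbreviations (Mr, Mrs, Dr) and uses Perl-style negative lookbehind, so it is an illustration and not a complete sentence splitter:

```r
text <- "Mr. Smith met Dr. Jones. They talked! Was it useful? Yes."

# Split at . ! or ? followed by spaces or the end of the string,
# unless the punctuation mark comes directly after Mr, Mrs or Dr
pattern <- "(?<!Mr)(?<!Mrs)(?<!Dr)[.!?]( +|$)"

# perl = TRUE is needed for the (?<!...) lookbehind syntax
split_sentences <- strsplit(text, pattern, perl = TRUE)[[1]]
split_sentences
```

Every new abbreviation needs its own entry in the pattern, which is why real projects tend to use a dedicated text-processing package for this instead.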

How you would tell R how to recognize subjects or objects, I have no idea.
