Splitting text into words with R and cSplit()

Date: 2022-11-16 21:36:57

I'm trying to split a series of sentences into separate words, that is, to tokenize the text.

I have found an R package, splitstackshape, that does almost what I want, except that the printed output is truncated to the first and last 5 rows.

Anyway, this is what I need to do:

id text
1 Lorem ipsum dolor sit amet
2 consectetur adipiscing elit
3 Donec euismod enim quis 
4 nunc fringilla sodales
5 Etiam tempor ligula vitae 
6 pellentesque dictum
7 Quisque non justo scelerisque 
8 est facilisis congue quis vel
9 Phasellus ex lorem
10 eleifend at magna vel
11 egestas eleifend massa
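
For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: it assumes the input is a plain data frame called data with columns id and text, which is what the cSplit() call further down expects.

```r
# Assumed layout: one row per sentence, with an integer id
# and a character text column (names taken from the table header)
data <- data.frame(
  id = 1:11,
  text = c(
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
    "Donec euismod enim quis",
    "nunc fringilla sodales",
    "Etiam tempor ligula vitae",
    "pellentesque dictum",
    "Quisque non justo scelerisque",
    "est facilisis congue quis vel",
    "Phasellus ex lorem",
    "eleifend at magna vel",
    "egestas eleifend massa"
  ),
  stringsAsFactors = FALSE  # keep the sentences as character, not factor
)
```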

Output:

id word
1 Lorem
1 ipsum
1 dolor
1 sit
1 amet
2 consectetur
2 adipiscing
...

That is, I need each word in its own row, alongside the ID of the sentence it belongs to.

I tried cSplit(data, "text", " ", "long"), but the output is truncated.

Update. FYI, here is how to do the reverse

1 answer

#1

The cSplit function returns a data.table.

What you are describing is the default print behavior for data.tables. To see this in action, try the following:

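library(data.table)

# airquality has 153 rows, so both calls below print only the first and last 5 rows
as.data.table(airquality)
print(as.data.table(airquality))

# nrows = Inf lifts the row limit, so every row is printed
print(as.data.table(airquality), nrows = Inf)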

Thus, to get the full table displayed, you can try:

library(splitstackshape)
print(cSplit(data, "text", " ", "long"), nrows = Inf)
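
If the only goal is a long table of words, a similar result can also be sketched in base R with strsplit(). It returns a plain data.frame rather than a data.table, so the head-and-tail abbreviation used by data.table printing does not apply. This is a swapped-in alternative, not what cSplit() does internally; the names toy and long_words are made up for illustration, and it assumes words are separated by single spaces.

```r
# Hypothetical toy input: three of the sentences from the question
toy <- data.frame(
  id = 1:3,
  text = c(
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
    "Donec euismod enim quis"
  ),
  stringsAsFactors = FALSE
)

# Split each sentence on a literal single space; the result is a list
# with one character vector of words per sentence
words <- strsplit(toy$text, " ", fixed = TRUE)

# Repeat each id once per word in its sentence, then flatten the word list
long_words <- data.frame(
  id = rep(toy$id, lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)

print(long_words)
```

Text with repeated or trailing spaces would produce empty strings here, so real input may need trimming first.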
