Concatenating the rows of a data frame

Time: 2022-04-29 17:03:20

I would like to take a data frame with characters and numbers, and concatenate all of the elements of each row into a single string, which would be stored as a single element in a vector. As an example, I make a data frame of letters and numbers, and then I would like to concatenate the first row via the paste function, and hopefully return the value "A1".

df <- data.frame(letters = LETTERS[1:5], numbers = 1:5)
df

##   letters numbers
## 1       A       1
## 2       B       2
## 3       C       3
## 4       D       4
## 5       E       5

paste(df[1,], sep =".")
## [1] "1" "1"

So paste is converting each element of the row into an integer that corresponds to the index of the corresponding level, as if it were a factor, and it keeps the result as a vector of length two. (I know/believe that factors coerced to character behave this way, but since R is not storing df[1,] as a factor at all, as tested by is.factor(), I can't verify that it is actually a level index.)

is.factor(df[1,])
## [1] FALSE
is.vector(df[1,])
## [1] FALSE

So if it is not a vector, then it makes sense that it is behaving oddly, but I can't coerce it into a vector either:

is.vector(as.vector(df[1,]))
## [1] FALSE

Using as.character did not seem to help in my attempts.

Can anyone explain this behavior?

4 Solutions

#1


45  

While others have focused on why your code isn't working and how to improve it, I'm going to try and focus more on getting the result you want. From your description, it seems you can readily achieve what you want using paste:

df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=FALSE)
paste(df$letters, df$numbers, sep="")

## [1] "A1" "B2" "C3" "D4" "E5"

You can change df$letters to character using df$letters <- as.character(df$letters) if you don't want to use the stringsAsFactors argument.

But let's assume that's not what you want. Let's assume you have hundreds of columns and you want to paste them all together. We can do that with your minimal example too:

df_args <- c(df, sep="")
do.call(paste, df_args)

## [1] "A1" "B2" "C3" "D4" "E5"

EDIT: Alternative method and explanation:

I realised the problem you're having is a combination of two things: you're using a factor, and you're using the sep argument instead of collapse (as @adibender picked up). The difference is that sep is the separator placed between the separate vectors passed to paste, while collapse is the separator used to join the elements of the result into a single string. When you use df[1,], you supply a single argument to paste, and hence you must use the collapse argument. Using your idea of getting every row and concatenating its elements, the following line of code will do exactly what you want:

apply(df, 1, paste, collapse="")
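
To make the sep/collapse distinction concrete, here is a minimal sketch in base R. The vectors x and y are made up for illustration and are not part of the question.

```r
# Two illustrative vectors (hypothetical, not from the question)
x <- c("A", "B", "C")
y <- 1:3

# sep: separator between corresponding elements of separate vectors
by_sep <- paste(x, y, sep = "-")
print(by_sep)        # "A-1" "B-2" "C-3"

# collapse: separator used to join the elements of one result into one string
by_collapse <- paste(x, collapse = "-")
print(by_collapse)   # "A-B-C"

# both: paste element-wise first, then collapse the result
both <- paste(x, y, sep = "", collapse = "+")
print(both)          # "A1+B2+C3"
```

With a single argument there is nothing for sep to separate, which is why sep appeared to be ignored in the original attempt.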

Ok, now for the explanations:

Why won't as.list work?

as.list converts an object to a list, so it does work. It will convert your dataframe to a list, but it ignores the sep="" argument. c, on the other hand, combines objects together. Technically, a dataframe is just a list where every column is an element and all elements have to have the same length. So when I combine it with sep="", it becomes a regular list with the columns of the dataframe as elements, plus sep as one more element.
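
A quick way to see the difference is to inspect both results. This is only a sketch using the data frame from this answer; as_list_only is a name I made up for the comparison.

```r
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors = FALSE)

# as.list() has no use for a sep argument, so it is silently dropped
as_list_only <- as.list(df, sep = "")
print(names(as_list_only))   # "letters" "numbers"

# c() appends sep as a third, named element of a plain list
df_args <- c(df, sep = "")
print(names(df_args))        # "letters" "numbers" "sep"
print(class(df_args))        # "list"
```

The named sep element is what later becomes the sep argument of paste.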

Why use do.call?

为什么使用do.call ?

do.call lets you call a function with its arguments supplied as a list; named elements of the list become named arguments. You can't just pass the list straight to paste, because paste is designed for concatenating vectors and would treat the whole list as a single argument. So remember that df_args is a list containing a vector of letters, a vector of numbers, and sep, which is a length-1 vector containing only "". When I use do.call, the resulting call is essentially paste(letters, numbers, sep="").
But what if my original dataframe had columns "letters", "numbers", "squigs", "blargs", after which I added the separator as I did before? Then the call made through do.call would look like:

paste(letters, numbers, squigs, blargs, sep="")

So you see it works for any number of columns.
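
As a rough sketch of the four-column case: the values in squigs and blargs below are invented, and paste_rows is a hypothetical helper name, not something from base R.

```r
# Hypothetical four-column data frame; squigs and blargs values are made up
wide <- data.frame(
  letters = LETTERS[1:3],
  numbers = 1:3,
  squigs  = c("x", "y", "z"),
  blargs  = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Same pattern as above: columns become arguments, sep is appended
row_strings <- do.call(paste, c(wide, sep = ""))
print(row_strings)   # "A1x10" "B2y20" "C3z30"

# Hypothetical helper: unname() keeps a column that happens to be called
# "sep" or "collapse" from being matched as an argument of paste
paste_rows <- function(data, sep = "") {
  do.call(paste, c(unname(as.list(data)), sep = sep))
}

tricky <- data.frame(a = c("p", "q"), sep = c("r", "s"), stringsAsFactors = FALSE)
print(paste_rows(tricky))   # "pr" "qs"
```

Dropping the names is a design choice on my part; the plain c(df, sep="") form is fine as long as no column shares a name with one of paste's arguments.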

#2


3  

This is indeed a little weird, but it is also what is supposed to happen. When you create the data.frame as you did, the column letters is stored as a factor (this was the default before R 4.0.0, when stringsAsFactors defaulted to TRUE). A factor stores its values as integer codes that index its levels, so when as.numeric() is applied to a factor it returns those codes rather than the labels. For example:

> df[, 1]
[1] A B C D E
Levels: A B C D E
> as.numeric(df[, 1])
[1] 1 2 3 4 5

A is the first level of the factor df[, 1], so A gets converted to the value 1 when as.numeric is applied. The same thing happens when you call paste(df[1, ]). The row df[1, ] is a one-row data frame, i.e. a list whose elements have different classes, and when paste converts that list to character the factor element is represented by its integer code rather than its label, so both elements come out as "1".
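
The following sketch shows codes versus labels side by side. It sets stringsAsFactors = TRUE explicitly, which is my assumption for reproducing the question's behaviour on R 4.0.0 and later; the factor g is made up for illustration.

```r
# Explicit stringsAsFactors = TRUE to reproduce the pre-4.0.0 default
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors = TRUE)

f <- df$letters
codes  <- as.integer(f)     # positions in levels(f): 1 2 3 4 5
labels <- as.character(f)   # the labels themselves: "A" "B" "C" "D" "E"

# Illustrative factor whose level order differs from the data order
g <- factor(c("b", "a", "c"), levels = c("c", "b", "a"))
g_codes <- as.integer(g)    # 2 3 1

# Converting the one-row data frame (a list) to character uses the codes
row_as_char <- as.character(df[1, ])
print(row_as_char)          # "1" "1"

# Converting the column first gives the labels
df$letters <- as.character(df$letters)
fixed <- paste(df[1, ], collapse = "")
print(fixed)                # "A1"
```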

When you want to concatenate both columns, you first need to convert the first column to character:

df[, 1] <- as.character(df[, 1])
paste(df[1,], collapse = "")

As @sebastian-c pointed out, you can also use stringsAsFactors = FALSE in the creation of the data.frame, then you can omit the as.character() step.

#3


3  

For those using library(tidyverse), you can simply use the unite function from tidyr.

new.df <- df %>%
  unite(together, letters, numbers, sep="")

This will give you a new column called "together" containing A1, B2, etc. By default it replaces the letters and numbers columns.
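
If you would rather keep the original columns, unite has a remove argument. This is a small sketch that assumes tidyr is installed; it calls tidyr::unite directly instead of loading the whole tidyverse, and it is skipped when the package is missing.

```r
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors = FALSE)

# Only run when tidyr is available (assumption: it may not be installed)
if (requireNamespace("tidyr", quietly = TRUE)) {
  # remove = FALSE keeps letters and numbers next to the new column
  united <- tidyr::unite(df, "together", letters, numbers, sep = "", remove = FALSE)
  print(united)
}
```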

#4


0  

If you want to start with

df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=TRUE)

... then there is no general rule about how df$letters will be interpreted by any given function. It is treated as a factor by modelling functions, as character by some functions, and as integer by others. Even the same function, such as paste, may interpret it differently depending on how you use it:

paste(df[1,], collapse="") # "11"
apply(df, 1, paste, collapse="") # "A1" "B2" "C3" "D4" "E5"

There is no logic in it, except that it will probably make sense once you know the internals of every function.

The factors seem to be converted to integers when an argument is converted to a vector. As you know, data frames are lists of vectors of equal length, so the first row of a data frame is also a list, and when it is forced to be a vector, something like this happens:

df[1,]
#    letters numbers
# 1       A       1
unlist(df[1,])
# letters numbers 
#       1       1 

I don't know how apply achieves what it does (i.e., factors are represented by character values) -- if you're interested, look at its source code. It may be useful to know, though, that you can trust (in this specific sense) apply (in this specific occasion). More generally, it is useful to store every piece of data in a sensible format, that includes storing strings as strings, i.e., using stringsAsFactors=FALSE.
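
For what it's worth, the documentation of apply says it coerces a data frame with as.matrix before looping, and for a data frame with a non-numeric column that yields a character matrix in which factors appear as their labels. The sketch below illustrates this; df2 is a made-up data frame, and the padding it shows is a side effect of as.matrix formatting numeric columns to a common width.

```r
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors = TRUE)

# apply() works on as.matrix(df); the factor column shows up as labels
m <- as.matrix(df)
print(typeof(m))   # "character"
print(m[1, ])      # letters "A", numbers "1"

from_apply  <- apply(df, 1, paste, collapse = "")
from_docall <- do.call(paste, c(df, sep = ""))

# Hypothetical data frame with numbers of different widths
df2 <- data.frame(id = c("a", "b"), n = c(1, 10), stringsAsFactors = FALSE)
padded   <- apply(df2, 1, paste, collapse = "")   # "a 1" "b10"
unpadded <- do.call(paste, c(df2, sep = ""))      # "a1"  "b10"
print(padded)
print(unpadded)
```

So apply can be trusted for the factor labels here, but the do.call form avoids the padding when numeric columns have values of different widths.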

Btw, every introductory R book should have this idea in a subtitle. For example, my plan for retirement is to write "A (not so) gentle introduction to the zen of data fishery with R, the stringsAsFactors=FALSE way".
