The problem:
I often need to select a set of variables from a data.frame in R. My research is in the social and behavioural sciences, and it is quite common to have a data.frame with several hundreds of variables (e.g., there'll be item level information for a range of survey questions, demographic items, performance measures, etc., etc.).
As part of analyses, I'll often want to select a subset of variables. For example, I might want to get:
- descriptive statistics for a set of variables
- correlation matrix on a set of variables
- factor analysis on a set of variables
- predictors in a linear model
Now, I know that there are many ways to write the code to select a subset of variables. Quick-r has a nice overview of common ways of extracting variable subsets from a data.frame.
e.g.,
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
However, I'm interested in the efficiency of this process, particularly where you might need to extract 20 or so variables from a data.frame. The naming convention of variables is often not intuitive, especially where you've inherited a dataset from someone else, so you might be left wondering whether the variable was Gender, gender, sex, GENDER, gender1, etc. Multiply this by 20 variables that need to be extracted, and the task of memorising variable names becomes more complicated than it needs to be.
Concrete example
To make the following discussion concrete, I'll use the bfi data.frame in the psych package.
library(psych)
data(bfi)  # note: newer psych releases ship bfi in the separate psychTools package
df <- bfi
head(df, 1)
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
61617 2 4 3 4 4 2 3 3 4 4 3 3 3 4 4 3 4 2 2 3 3 6 3 4
O5 gender education age
61617 3 1 NA 16
- How can I efficiently select an arbitrary set of variables? For concreteness, I'll choose A1, A2, A3, A5, C2, C3, C5, E2, E3, gender, education, age.
My current strategy
I currently have a range of strategies that I use. Of course sometimes I can exploit things like the numeric position of the variables or the naming convention and use either grep to select or paste to construct. But sometimes I need a more general solution. I've used the following over time:
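As a rough sketch of the grep and paste shortcuts mentioned above, here is how they might look on a small made-up data frame. The object toy and its column names are invented for illustration; they only loosely mimic the bfi naming pattern.

```r
# Toy data frame; the column names are made up for illustration
toy <- data.frame(A1 = 1, A2 = 2, A3 = 3, C1 = 4, C2 = 5, gender = 1, age = 30)

# Select by naming convention: every column whose name starts with capital "A"
a_items <- grep("^A", names(toy), value = TRUE)

# Construct names from a stem and item numbers
c_items <- paste0("C", 1:2)

# Select by numeric position
first_three <- names(toy)[1:3]

# Combine the pieces and subset
sub <- toy[c(a_items, c_items, "age")]
```

These shortcuts only help when the names follow a pattern, which is why a more general approach is still needed for arbitrary sets.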
1. names(df)
In the early days, I used to call names(df), copy the quoted variable names and then edit until I had what I wanted.
2. Use a database
Sometimes I'll have a separate data.frame that stores each variable as a row, with columns for variable names and variable labels, plus a column that indicates whether the variable should be retained for a particular analysis. I can then filter on that include variable and extract a vector of variable names. I find this particularly useful when I'm developing a psychological test and for various iterations I want to include or exclude certain items.
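A minimal sketch of that idea, assuming a metadata table with columns called name, label and include. The table, the column names and the toy data are all hypothetical.

```r
# Hypothetical metadata table: one row per variable in the main data frame
meta <- data.frame(
  name    = c("q1", "q2", "q3", "gender", "age"),
  label   = c("Item 1", "Item 2", "Item 3", "Gender", "Age in years"),
  include = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Names of the variables flagged for this analysis
keep <- meta$name[meta$include]

# Toy stand-in for the real data
dat <- data.frame(q1 = 1:3, q2 = 4:6, q3 = 7:9,
                  gender = c(1, 2, 1), age = c(20, 31, 45))
analysis_data <- dat[keep]
```

Flipping a value in the include column is then enough to add or drop an item in the next iteration.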
3. dput(names(df))
As Hadley Wickham once pointed out to me, dput is a good option; e.g., dput(names(df)) is better than names(df) in that it outputs a character vector that is already in the c("var1", "var2", ...) format:
dput(names(df))
c("A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5",
"E1", "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1",
"O2", "O3", "O4", "O5", "gender", "education", "age")
This can then be copied into the script and edited.
But can it be more efficient?
I guess dput is a pretty good variable selection strategy. The efficiency of the process largely depends on how proficient you are in copying the text into your script and then editing the list of names down to those desired.
However, I still remember the efficiency of GUI based systems of variable selection. For example, in SPSS when you interact with a dialogue box you can point and click with the mouse on the variables you want from the dataset. You can shift-click to select a range of variables, you can hold shift and press the down key to select one or more variables, and so on. And then you can press Paste and the command with the extracted variable names is pasted into your script editor.
So, finally the core question
- Is there a simple no frills GUI device that permits the selection of variables from a data.frame (e.g., something like guiselect(df) opens a GUI window for variable selection), and returns a vector of the selected variable names, c("var1", "var2", ...)?
- Is dput the best general option for selecting a set of variable names in R? Or is there a better way?
Update (April 2017): I have posted my own understanding of a good strategy below.
5 Answers
#1
22
I'm personally a fan of the myvars <- c(...) and then using mydf[,myvars] from there on in.
However this still requires you to enter the initial variable names (even though just once), and as far as I read your question, it is this initial 'picking variable names' that is what you're asking about.
Re a simple no-frills GUI device -- I've recently been introduced to the menu function, which is exactly a simple no-frills GUI device for selecting one object out of a list of choices. Try menu(names(df),graphics=TRUE) to see what I mean (returns the column number). It even gives a nice text interface if for some reason your system can't do the graphics (try with graphics=FALSE to see what I mean).
However this is of limited use to you, as you can only select one column name. To select multiple, you can use select.list (mentioned in ?menu as the alternative to make multiple selections):
# example with iris data (I don't have 'psych' package):
vars <- select.list(names(iris),multiple=TRUE,
title='select your variable names',
graphics=TRUE)
This also takes a graphics=TRUE option (single click on all the items you want to select). It returns the names of the variables.
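For what it's worth, here is a rough sketch of how the guiselect(df) idea from the question could be layered on top of select.list. The function name comes from the question; the chooser argument is an assumption added here so the picker can be swapped out, since select.list itself cannot be used in a non-interactive session.

```r
# Hypothetical helper: pick columns, print a pasteable c(...) call,
# and return the chosen names invisibly.
guiselect <- function(df, chooser = utils::select.list) {
  picked <- chooser(names(df), multiple = TRUE,
                    title = "Select variables")
  dput(picked)        # prints c("a", "b", ...) ready to paste into a script
  invisible(picked)
}

# Interactive use (opens the picker):
#   myvars <- guiselect(iris)

# Non-interactive stand-in for the picker: always "clicks" columns 1 and 3
fake_chooser <- function(choices, ...) choices[c(1, 3)]
myvars <- guiselect(iris, chooser = fake_chooser)
```

The dput call means the selection ends up in the console in c("...") form, so it can be pasted into the script and the analysis stays reproducible without the GUI step.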
#2
10
You could use select.list(), like this:
DF <- data.frame(replicate(26,list(rnorm(5))))
names(DF) <- LETTERS
subDF <- DF[select.list(names(DF), multiple=TRUE)]
#3
4
I use the following strategy to make variable selection in R efficient.
Use metadata to store variable names
I have data frames with one row per variable for certain sets of variables. For example, I might have a 100-item personality test. The metadata includes the variable name in R along with all the scoring information (e.g., whether the item should be reversed, and so on). I can then extract variable names for the items and the scale names from this metadata.
Store variable sets in a named list
In every project, I have a list called v that stores named sets of variables. Then in any analysis that requires a set of variables, I can just refer to the named list. This also makes code more reliable, because if the variable names change, you update the list once and all the analyses that depend on it follow. It is also good for creating consistency in how variables are ordered.
Here's a simple example:
v <- list()
v$neo_items <- meta.neo$id
v$ds14_items <- meta.ds14$id
v$core_items <- c(v$neo_items, v$ds14_items)
v$typed_scales <- c("na", "si")
v$typed_all <- c("typed_continuous_sum", "na", "si")
v$neo_facets <- sort(unique(meta.neo$facet))
v$neo_factors <- c("agreeableness", "conscientiousness",
"extraversion", "neuroticism", "openness")
v$outcomes_scales <- c("healthbehavior", "socialsupport",
"physical_symptoms", "psychological_symptoms")
A few points can be seen from the above example:
- Often the variable lists will be generated from metadata that I have stored separately. So for example, I have the variable names for the 240 items of the neo personality test stored in meta.neo$id.
- In some cases, variable names can be derived from metadata. For example, one of the columns in my metadata for a personality test indicates which scale the item belongs to, and the variable names are derived from that column by taking the unique values of that column.
- In some cases, variable sets are the combination of smaller sets. So for example, you might have one set for predictors, one set for outcomes, and one set that combines predictors and outcomes. The division into predictors and outcomes might be useful for some regression models, and the combined set might be useful for a correlation matrix or a factor analysis.
- For more ad hoc lists of variables, I still use dput(names(df)), where df is my data.frame, to generate the vector of character names that is then stored in a variable list.
- These variable lists are generally placed after you load your data, but before you munge it. That way, they can be used for data preparation, and they are certainly available when you start running analyses (e.g., predictive models, correlations, descriptive statistics, etc.).
- The beauty of variable lists is that you can readily use auto-complete in RStudio. So you don't need to remember variable names or even the names of the variable lists. You just type v$ and press tab, or v$ and some part of the list name.
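To make these points easier to try out, here is a small self-contained sketch. The metadata table meta.test, its columns and the variable names are all made up; my real metadata objects are not shown here.

```r
# Hypothetical item-level metadata for a short personality test
meta.test <- data.frame(
  id    = c("p1", "p2", "p3", "p4"),
  scale = c("extraversion", "openness", "extraversion", "openness"),
  stringsAsFactors = FALSE
)

v <- list()
v$items      <- meta.test$id                      # item names straight from metadata
v$scales     <- sort(unique(meta.test$scale))     # scale names derived from a column
v$outcomes   <- c("wellbeing", "stress")          # ad hoc set, e.g. pasted from dput()
v$everything <- c(v$items, v$scales, v$outcomes)  # combination of smaller sets
```

Typing v$ and pressing tab then lists items, scales, outcomes and everything.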
Using variable lists
Using variable lists is fairly straightforward, but some functions in R specify variable names differently.
The simple and standard scenario involves supplying the vector of variable names when subsetting the data.frame. For example,
cor(data[,v$mylist])
cor(data[,v$predictors], data[,v$outcomes])
It is a little bit trickier for functions that require formulas. You may need to build the formula from the variable names. For example:
v <- list()
v$predictors <- c("cyl", "disp")
f <- as.formula(paste("mpg ~", paste(v$predictors, collapse = " + ")))
lm(f, mtcars)
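As an alternative to pasting the string together by hand, base R's reformulate function in the stats package builds the same kind of formula from a character vector. This is a sketch of an equivalent route rather than what I normally use; the predictors are the same mtcars columns as above.

```r
predictors <- c("cyl", "disp")

# Build mpg ~ cyl + disp from the character vector of predictor names
f <- reformulate(predictors, response = "mpg")

fit <- lm(f, data = mtcars)
```

Either approach lets the predictor set live in the variable list and keeps the model call free of hard-coded names.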
You can also use variable lists in functions like sapply and lapply (and presumably the tidyverse equivalents). For example, create a descriptive statistics table with:
sapply(mydata[, v$outcomes], function(X) c(mean = mean(X), sd = sd(X)))
dput is still useful
For ad hoc variables or even when you are just writing the code to create a variable list, dput is still very useful.
The standard code is dput(names(df)) where df is your data.frame. So for example:
dput(names(mtcars))
Produces
c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am",
"gear", "carb")
You can then edit this vector to keep only the variables you need. This has the additional benefit that it reduces typing errors in your code. And this is a really important point. You don't want to spend lots of time trying to debug a problem that was merely the result of a typo. Furthermore, R's error message when you mistype a variable name is awful. It just says "undefined columns selected". It doesn't tell you which variable names were wrong.
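One way to soften the unhelpful error message is a small check that names the offending variables before you subset. The helper check_vars below is hypothetical, just a sketch built on setdiff.

```r
# Hypothetical helper: fail with the offending names listed
check_vars <- function(df, vars) {
  unknown <- setdiff(vars, names(df))
  if (length(unknown) > 0) {
    stop("Not found in data frame: ", paste(unknown, collapse = ", "))
  }
  invisible(vars)
}

wanted <- c("mpg", "cyl", "dsip")   # "dsip" is a deliberate typo for "disp"

# Capture the error message instead of stopping the script
msg <- tryCatch(check_vars(mtcars, wanted),
                error = function(e) conditionMessage(e))
```

Calling check_vars on each variable list right after it is defined catches typos at that point rather than deep inside an analysis.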
If you have a large number of variables, you can also use a range of string search functions to extract a subset of the variable names. For example:
> library(psych)
> dput(names(bfi)) #all items
c("A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5",
"E1", "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1",
"O2", "O3", "O4", "O5", "gender", "education", "age")
> dput(grep("^..$", names(bfi), value = TRUE)) # two letter variable names
c("A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5",
"E1", "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1",
"O2", "O3", "O4", "O5")
> dput(grep("^E.$", names(bfi), value = TRUE)) # E items
c("E1", "E2", "E3", "E4", "E5")
> dput(grep(".5$", names(bfi), value = TRUE)) # 5th items
c("A5", "C5", "E5", "N5", "O5")
Clean existing variable names and use a naming convention
When I get a data file from someone else, the variable names often lack conventions or use conventions that make the variables harder to work with in R. A few rules that I use:
- Make all variables lower case (having to think about lower and upper case variables is just annoying).
- Make variable names intrinsically meaningful (some other software uses variable labels to store meaningful data; R doesn't really use labels).
- Keep variables to an appropriate length (i.e., not too long). Up to 10 characters is fine. More than 20 gets annoying.
All these steps generally make variable selection easier because there are fewer inconsistencies to remember.
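A minimal sketch of the mechanical part of that clean-up (lower-casing and replacing awkward characters), using an invented data frame. Making names meaningful and short still has to be done by hand.

```r
# Toy data frame with inconsistent names (invented for illustration)
raw <- data.frame(1, 2, 3, 4)
names(raw) <- c("GENDER", "Age In Years", "Item.01", "totalScore")

clean <- tolower(names(raw))              # lower case everything
clean <- gsub("[^a-z0-9]+", "_", clean)   # runs of other characters become "_"
names(raw) <- clean
```

After this the names are gender, age_in_years, item_01 and totalscore.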
Use tab completion for individual variable names
For individual variables, I generally use auto-completion from the data frame. E.g., type df$ and press tab.
I try to use a coding style that allows me to use auto-completion as much as possible. I don't like functions that require me to know the variable name without using auto-completion. For example, when subsetting a data.frame, I prefer
df[ df$sample == "control", ]
to
subset(df, sample == "control")
because I can autocomplete the variable name "sample" in the first example, but not in the second.
#4
3
If you want a method that ignores the case of variable names and perhaps picks out variables on the basis of their 'stems', then use the appropriate regex pattern with ignore.case=TRUE and value=TRUE in grep:
dfrm <- data.frame(var1=1, var2=2, var3=3, THIS=4, Dont=5, NOTthis=6, WANTthis=7)
# For each stem, collect the column names that start with it (case-insensitive)
desired <- unlist(sapply( c("Want", "these", "var"),
      function(x) grep(paste("^", x,sep=""), names(dfrm), ignore.case=TRUE, value=TRUE) ))
desired
#----------------
      Want       var1       var2       var3   # Names of the vector
"WANTthis"     "var1"     "var2"     "var3"   # Values matched
> dfrm[desired]
  WANTthis var1 var2 var3
1        7    1    2    3
#5
0
Do you mean select?
sub_df = subset(df, select=c("v1","v2","v3"))