Extract unique strings from a factor variable of strings

Date: 2022-01-26 07:10:38

I have a variable which contains the actor names.

(actor=structure(c(4L, 1L, 6L, 2L, 5L, 3L), .Label = c("Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman", 
"Jamie Foxx, Christoph Waltz, Leonardo DiCaprio, Kerry Washington", 
"Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Stanley Tucci", 
"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, Ken Watanabe", 
"Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley, Max von Sydow", 
"Robert Downey Jr., Chris Evans, Scarlett Johansson, Jeremy Renner"
), class = "factor"))
# [1] Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page, Ken Watanabe
# [2] Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman            
# [3] Robert Downey Jr., Chris Evans, Scarlett Johansson, Jeremy Renner
# [4] Jamie Foxx, Christoph Waltz, Leonardo DiCaprio, Kerry Washington 
# [5] Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley, Max von Sydow     
# [6] Jennifer Lawrence, Josh Hutcherson, Liam Hemsworth, Stanley Tucci
# 6 Levels: Christian Bale, Tom Hardy, Anne Hathaway, Gary Oldman ...

I want to extract all the complete actor names from it (name + surname) and make them columns in an output matrix.

1 solution

#1


If you want to extract the unique actor names, convert the factor to character strings with as.character, split each string on the commas with strsplit, flatten the resulting list into a single vector with unlist, and keep the distinct names with unique:

(all.actors <- unique(unlist(strsplit(as.character(actor), ", "))))
#  [1] "Leonardo DiCaprio"    "Joseph Gordon-Levitt" "Ellen Page"           "Ken Watanabe"        
#  [5] "Christian Bale"       "Tom Hardy"            "Anne Hathaway"        "Gary Oldman"         
#  [9] "Robert Downey Jr."    "Chris Evans"          "Scarlett Johansson"   "Jeremy Renner"       
# [13] "Jamie Foxx"           "Christoph Waltz"      "Kerry Washington"     "Mark Ruffalo"        
# [17] "Ben Kingsley"         "Max von Sydow"        "Jennifer Lawrence"    "Josh Hutcherson"     
# [21] "Liam Hemsworth"       "Stanley Tucci"    
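The nested call above can be easier to follow when unpacked one step at a time. The following is a minimal sketch on a two-element factor; `actor_small` and the intermediate variable names are made up for illustration and reuse a few names from the question.

```r
# Two-element factor reusing names from the question (shortened for illustration).
actor_small <- factor(c("Leonardo DiCaprio, Ellen Page",
                        "Jamie Foxx, Leonardo DiCaprio"))

chars  <- as.character(actor_small)  # factor -> character vector, one string per element
pieces <- strsplit(chars, ", ")      # list holding one character vector of names per element
flat   <- unlist(pieces)             # single character vector, duplicates still present
unique_names <- unique(flat)         # distinct names, in order of first appearance

print(unique_names)
```

Note that the split pattern is ", " (comma plus space), so the names come out without leading whitespace.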

Because it starts from as.character(actor), this code picks up only the actors that actually appear in the factor actor, even if the factor has additional levels that are unused. If you use levels(actor) instead, you get every actor in the factor's levels, regardless of whether those levels are used in actor. Use whichever you prefer when defining all.actors.
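To illustrate the difference, here is a small sketch in which a factor is subset so that one of its levels becomes unused. The objects `actor_small` and `actor_subset` are hypothetical and exist only for this example.

```r
actor_small <- factor(c("Leonardo DiCaprio, Ellen Page",
                        "Jamie Foxx, Leonardo DiCaprio",
                        "Christian Bale, Tom Hardy"))

# Subsetting keeps all three levels, so the third level is now unused.
actor_subset <- actor_small[1:2]

# Names taken from the values that are actually present.
used_only  <- unique(unlist(strsplit(as.character(actor_subset), ", ")))
# Names taken from every level, including the unused one.
all_levels <- unique(unlist(strsplit(levels(actor_subset), ", ")))

print(used_only)
print(all_levels)

# droplevels() discards unused levels; afterwards both approaches see the same names.
after_drop <- unique(unlist(strsplit(levels(droplevels(actor_subset)), ", ")))
```

The order of the names can also differ between the two approaches, because levels are stored sorted while as.character follows the order of the data.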

If you want a matrix indicating which actors appear in each element of actor, you can then do the following (the actors end up as rows; t(mat) would put them in columns):

mat <- sapply(strsplit(as.character(actor), ", "), function(x) all.actors %in% x)
row.names(mat) <- all.actors
mat
#                       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
# Leonardo DiCaprio     TRUE FALSE FALSE  TRUE  TRUE FALSE
# Joseph Gordon-Levitt  TRUE FALSE FALSE FALSE FALSE FALSE
# Ellen Page            TRUE FALSE FALSE FALSE FALSE FALSE
# Ken Watanabe          TRUE FALSE FALSE FALSE FALSE FALSE
# Christian Bale       FALSE  TRUE FALSE FALSE FALSE FALSE
# Tom Hardy            FALSE  TRUE FALSE FALSE FALSE FALSE
# Anne Hathaway        FALSE  TRUE FALSE FALSE FALSE FALSE
# Gary Oldman          FALSE  TRUE FALSE FALSE FALSE FALSE
# Robert Downey Jr.    FALSE FALSE  TRUE FALSE FALSE FALSE
# Chris Evans          FALSE FALSE  TRUE FALSE FALSE FALSE
# Scarlett Johansson   FALSE FALSE  TRUE FALSE FALSE FALSE
# Jeremy Renner        FALSE FALSE  TRUE FALSE FALSE FALSE
# Jamie Foxx           FALSE FALSE FALSE  TRUE FALSE FALSE
# Christoph Waltz      FALSE FALSE FALSE  TRUE FALSE FALSE
# Kerry Washington     FALSE FALSE FALSE  TRUE FALSE FALSE
# Mark Ruffalo         FALSE FALSE FALSE FALSE  TRUE FALSE
# Ben Kingsley         FALSE FALSE FALSE FALSE  TRUE FALSE
# Max von Sydow        FALSE FALSE FALSE FALSE  TRUE FALSE
# Jennifer Lawrence    FALSE FALSE FALSE FALSE FALSE  TRUE
# Josh Hutcherson      FALSE FALSE FALSE FALSE FALSE  TRUE
# Liam Hemsworth       FALSE FALSE FALSE FALSE FALSE  TRUE
# Stanley Tucci        FALSE FALSE FALSE FALSE FALSE  TRUE
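Since the question asked for the actors as columns, the matrix above can be transposed. The sketch below is one possible way to do that on a reduced example; `actor_small`, `mat_small`, and the other names are assumptions made for illustration, and vapply is swapped in for sapply only because it fixes the shape of the result.

```r
actor_small <- factor(c("Leonardo DiCaprio, Ellen Page",
                        "Jamie Foxx, Leonardo DiCaprio"))
split_names <- strsplit(as.character(actor_small), ", ")
all_names   <- unique(unlist(split_names))

# vapply is a stricter sapply: each call must return a logical vector of this length,
# so the result is always a matrix with one column per element of actor_small.
mat_small <- vapply(split_names, function(x) all_names %in% x,
                    logical(length(all_names)))
rownames(mat_small) <- all_names

# Transpose so that actors become columns and elements of actor_small become rows.
by_actor <- t(mat_small)

# Adding 0L turns TRUE/FALSE into 1/0 while keeping dimensions and names.
by_actor_int <- by_actor + 0L
print(by_actor_int)
```

Whether to keep the logical values or convert them to 0/1 depends on what the matrix is used for afterwards; both carry the same information.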
