Get columns with special characters and capital letters

Time: 2021-02-20 04:26:43

I have a data frame and I'm trying to loop through the data frame to identify those columns which contain a special character or which are all capital letters.

I have tried a few things, but nothing where I'm able to capture the column names within the loop.

data = data.frame(one=c(1,3,5,1,3,5,1,3,5,1,3,5), two=c(1,3,5,1,3,5,1,3,5,1,3,5), 
                thr=c("A","B","D","E","F","G","H","I","J","H","I","J"),
                fou=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                fiv=c(1,3,5,1,3,5,1,3,5,1,3,5), 
                six=c("A","B","D","E","F","G","H","I","J","H","I","J"),
                sev=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                eig=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                nin=c(1.24,3.52,5.33,1.44,3.11,5.33,1.55,3.66,5.33,1.32,3.54,5.77),
                ten=c(1:12),
                ele=rep(1,12),
                twe=c(1,2,1,2,1,2,1,2,1,2,1,2), 
                thir=c("THiS","THAT34","T(&*(", "!!!","@$#","$Q%J","who","THIS","this","this","this","this"),
                stringsAsFactors = FALSE)
data

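colls <- c()
spec <- c("$", "%", "&")

# Attempt 1 (does not work): `strings` is never defined, and the unescaped
# entries of `spec` are interpreted as regular expressions.
for (col in names(data)) {
  if (length(strings[stringr::str_detect(data[, col], spec)]) >= 1) {
    print("HORRAY")
    colls <- c(colls, col)
  } else {
    print("NOOOOOOOOOO")
  }
}

# Attempt 2 (does not work): any() is wrapped around the column itself
# rather than around the %in% test.
for (col in names(data)) {
  if (any(data[, col]) %in% spec) {
    print("HORRAY")
    colls <- c(colls, col)
  } else {
    print("NOOOOOOOOOO")
  }
}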

Can anyone shed light on a good way to tackle this problem?

EDIT:

The end goal is to have a vector of the column names that meet those criteria. Sorry for my poor SO question, but hopefully this helps explain what I'm trying to do.

2 Answers

#1


2  

I would use grep() to search for the pattern you are interested in. See here.

[:upper:] Matches any upper case letters.

Combining it with anchors (^ and $) and the "one or more times" quantifier (+) gives ^[[:upper:]]+$, which should match only entries written entirely in capitals.
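
As a quick check of the anchored pattern, the sketch below runs it against a few values taken from the thir column. The names x and all_caps are only illustrative.

```r
# A few values from the 'thir' column of the example data
x <- c("THiS", "THAT34", "THIS", "this")

# ^ and $ tie the match to the whole string; + requires at least one letter
all_caps <- grepl("^[[:upper:]]+$", x)
print(all_caps)
# [1] FALSE FALSE  TRUE FALSE
```

Only "THIS" matches: a single lower-case letter or digit anywhere in the string is enough to make the anchored pattern fail.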

The following would match the special characters in your toy data set (but is not guaranteed to match every special character in your real data set, e.g. form feeds or carriage returns):

[:punct:] #Matches punctuation - ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

Note that rather than use [:punct:] you could define your special characters manually.
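
A minimal sketch of the manual approach, assuming the only characters of interest are $, % and & (the three from the question's spec vector). The names special and has_special are hypothetical.

```r
# Hand-picked set of "special" characters; inside [...] the $ is literal
special <- "[$%&]"

x <- c("T(&*(", "!!!", "@$#", "$Q%J", "who", "1.24")
has_special <- grepl(special, x)
print(has_special)
# [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE
```

Because the period is not in the set, "1.24" is not flagged, but neither is "!!!", so the set has to list every character you care about.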

We can try the resultant code on the first row of your data set:

# Using grepl() rather than grep() so that we get a logical vector back.
grepl(x= data[1,], pattern = "^[[:upper:]]+$|[[:punct:]]")
[1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

This gives the expected result except for column nine, which has the value 1.24. Here the decimal point is recognised as punctuation and flagged as a match. We can add a "negative lookahead assertion", (?!\\.), to remove periods from consideration before they are even tested for being punctuation characters. Note that we use \\ to escape the period, and that lookaheads require perl = TRUE.

grepl(x= data[1,], perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
[1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

This returns a better result: decimal points are no longer matched. NOTE: This might not be what you want, as this pattern also won't match any full stops in character fields. You would need to refine the pattern further.
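
The effect of the lookahead, including the caveat about full stops in text, can be seen on a handful of values. This is only an illustration; the variable names and the value "end." are made up.

```r
pattern_plain     <- "^[[:upper:]]+$|[[:punct:]]"
pattern_lookahead <- "(?!\\.)(^[[:upper:]]+$|[[:punct:]])"

x <- c("1.24", "A", "T(&*(", "end.")

# Without the lookahead every period counts as punctuation
plain_hits <- grepl(pattern_plain, x)
# With the lookahead (PCRE, hence perl = TRUE) periods are skipped everywhere
lookahead_hits <- grepl(pattern_lookahead, x, perl = TRUE)

print(plain_hits)
# [1] TRUE TRUE TRUE TRUE
print(lookahead_hits)
# [1] FALSE  TRUE  TRUE FALSE
```

"1.24" is no longer flagged, but "end." is not flagged either, which is the side effect the note above warns about.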

Rather than using a for loop to repeat this code for every row in your data frame, I would use vectorization instead, which is more "R-like".

To do this we must convert our script into a function, which we will call with apply():

myFunction <- function(x){
  matches <- grepl(x = x, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
  # Is at least one value in the logical vector 'matches' TRUE? any() tells us.
  return(any(matches))
}

apply(X = data, 1, myFunction)

The 1 above instructs apply() to iterate over rows rather than columns.

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

In your example data set all rows have an entry containing a special character or a string of all capital letters. This is unsurprising as many columns in your example data set are a list of single capital letters.
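
The role of the second argument to apply() is easier to see on a small matrix. The sketch below uses made-up values and a hypothetical helper has_upper; it is not part of the original answer.

```r
# Small character matrix standing in for the data frame (illustrative values)
m <- matrix(c("a", "B", "c",
              "d", "e", "F"),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("x", "y", "z")))

# TRUE if any element of the vector contains an upper-case letter
has_upper <- function(v) any(grepl("[[:upper:]]", v))

by_row <- apply(m, 1, has_upper)   # one result per row
by_col <- apply(m, 2, has_upper)   # one result per column

print(by_row)
print(by_col)
```

Both rows contain a capital, so by_row is TRUE twice, while by_col is FALSE for column x and TRUE for y and z.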

If you are just interested in which values in column thirteen fit the stated criteria you can use:

matches <- grepl(x= data$thir, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
matches
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE

To subset your dataframe on matching rows:

data[matches,]
  one two thr fou fiv six sev eig  nin ten ele twe  thir
3   5   5   D   D   5   D   D   D 5.33   3   1   1 T(&*(
4   1   1   E   A   1   E   A   A 1.44   4   1   2   !!!
5   3   3   F   B   3   F   B   B 3.11   5   1   1   @$#
6   5   5   G   D   5   G   D   D 5.33   6   1   2  $Q%J
8   3   3   I   B   3   I   B   B 3.66   8   1   2  THIS

To subset your dataframe on non-matching rows:

data[!matches,]
   one two thr fou fiv six sev eig  nin ten ele twe   thir
1    1   1   A   A   1   A   A   A 1.24   1   1   1   THiS
2    3   3   B   B   3   B   B   B 3.52   2   1   2 THAT34
7    1   1   H   A   1   H   A   A 1.55   7   1   1    who
9    5   5   J   D   5   J   D   D 5.33   9   1   1   this
10   1   1   H   A   1   H   A   A 1.32  10   1   2   this
11   3   3   I   B   3   I   B   B 3.54  11   1   1   this
12   5   5   J   D   5   J   D   D 5.77  12   1   2   this

Note that the regular expression used doesn't match THAT34 as it isn't composed wholly of capitalised letters, having the number 34 at the end.

EDIT:

To get the names of the columns that fulfil the criteria in your edit, use myFunction as defined above with:

colnames(data)[apply(X = data, 2, myFunction)]
"thr"  "fou"  "six"  "sev"  "eig"  "thir"

The number in apply() changes from 1 to 2 to iterate over columns rather than rows. We use the output of apply(), a logical vector (TRUE or FALSE for each column), to subset colnames(data), which returns the names of the matching columns.
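
As a possible alternative (a sketch, not the method used above), the numeric columns can be skipped explicitly so that decimal points never reach the pattern and full stops in text are still detected. The data frame df is a reduced stand-in for the example data, and flag_column is a hypothetical helper.

```r
# Reduced stand-in for the example data: two of its columns plus 'low'
df <- data.frame(
  thr  = c("A", "B", "D"),
  nin  = c(1.24, 3.52, 5.33),
  low  = c("who", "this", "that"),
  thir = c("THiS", "THAT34", "T(&*("),
  stringsAsFactors = FALSE
)

# Only character columns are tested, so no lookahead is needed
flag_column <- function(col) {
  is.character(col) && any(grepl("^[[:upper:]]+$|[[:punct:]]", col))
}

# vapply() walks the columns of the data frame without converting it to a matrix
flagged <- names(df)[vapply(df, flag_column, logical(1))]
print(flagged)
# [1] "thr"  "thir"
```

Unlike apply(), vapply() over a data frame keeps each column's type, which is what makes the is.character() check possible.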

#2


1  

I would collapse the data into strings (one string per row):

strings = apply(data, 1, paste, collapse = "")
contains_only_caps = strings == toupper(strings)
strings[contains_only_caps]
# [1] "33BB3BBB3.52 212THAT34" "55DD5DDD5.33 311T(&*("  "11EA1EAA1.44 412!!!"   "33FB3FBB3.11 511@$#"   
# [5] "55GD5GDD5.33 612$Q%J"   "33IB3IBB3.66 812THIS"  


# escaping special characters
spec=c("\\$","%","\\&")
contains_spec = stringr::str_detect(strings, pattern = paste(spec, collapse = "|"))

strings[contains_spec]
# [1] "55DD5DDD5.33 311T(&*(" "33FB3FBB3.11 511@$#"   "55GD5GDD5.33 612$Q%J" 
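
One property of the toupper() comparison is worth keeping in mind: it is TRUE for any string without lower-case letters, including strings with no letters at all. The sketch below shows this on made-up values, along with a stricter variant that is an assumption about what might be wanted rather than part of the answer.

```r
x <- c("THIS", "THiS", "!!!", "123")

# TRUE whenever a string has no lower-case letters, even with no letters at all
no_lowercase <- x == toupper(x)

# Stricter variant: additionally require at least one upper-case letter
has_caps_only <- no_lowercase & grepl("[[:upper:]]", x)

print(no_lowercase)
# [1]  TRUE FALSE  TRUE  TRUE
print(has_caps_only)
# [1]  TRUE FALSE FALSE FALSE
```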

You could also use which() on contains_spec or contains_only_caps to get the corresponding row numbers in the original data frame. I think that using strings rather than row-wise data frame elements will be much faster, as long as you want to search the whole strings rather than certain columns for certain conditions.
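
A minimal sketch of the which() step, using three shortened strings in place of the collapsed rows and base R's grepl() instead of stringr::str_detect() so that no package is needed:

```r
# Three short strings standing in for the collapsed rows
strings <- c("11AA1.24THiS", "55DD5.33T(&*(", "33FB3.11@$#")

# Same alternation as in the answer, with $ escaped
contains_spec <- grepl("\\$|%|&", strings)

# Positions of the TRUE values, i.e. row numbers in the original data frame
rows <- which(contains_spec)
print(rows)
# [1] 2 3
```

The resulting indices can be used directly to subset the original data frame, for example data[rows, ].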
