How to check whether a string in R contains Roman numerals?

Time: 2021-01-12 14:58:03

I have a column of residential addresses in my dataset 'ad'. I want to find addresses that contain no numbers (including Roman numerals). I'm using

ad$check <- grepl("[[:digit:]]",ad$address)

to flag addresses that contain digits (FALSE marks the addresses with none). How do I do the same for addresses that contain Roman numerals?

E.g.: "floor X, DLF Building- III, ABC City"

1 solution

#1


You need to build a regular expression.

Edit (my first answer was nonsense):

x <- c("floor Imaginary,  building- Momentum, ABC City", "floor X, DLF Building- III, ABC City")
# here comes the regex
grepl("\\b[IVXLCDM]+\\b", x, ignore.case = FALSE)
[1] FALSE  TRUE

To break it down:

\\b is a word boundary. It means the Roman numeral must be preceded and followed by whitespace, punctuation, or the beginning/end of the string.

[IVXLCDM]+ means the "word" we are looking for may consist only of the symbols used for Roman numerals, repeated one or more times. Inside a character class no | separator is needed (it would be matched as a literal |), and without the + only single-letter numerals such as X would match, never III. As far as I know, these are all the symbols.

ignore.case = FALSE is the default, so omitting the option gives the same result. I find it safer, however, to state it explicitly when case sensitivity matters for the operation at hand.
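Applied to the data frame from the question, the digit test and a Roman-numeral test can be combined into a single flag. The following is a minimal sketch, assuming the pattern \\b[IVXLCDM]+\\b (a character class with a + quantifier, so that multi-letter numerals such as III match as whole words). The sample 'ad' data frame and the column name has_number are made up for illustration.

```r
# Hypothetical sample data standing in for the real 'ad' data frame
ad <- data.frame(
  address = c("floor X, DLF Building- III, ABC City",
              "12 Main Street, ABC City",
              "Rose Cottage, ABC City"),
  stringsAsFactors = FALSE
)

# TRUE if the address contains an Arabic digit
has_digit <- grepl("[[:digit:]]", ad$address)

# TRUE if the address contains a standalone word made up only of
# upper-case Roman-numeral letters
has_roman <- grepl("\\b[IVXLCDM]+\\b", ad$address, ignore.case = FALSE)

# Flag addresses that contain either kind of number
ad$has_number <- has_digit | has_roman

# Addresses with no number at all
ad$address[!ad$has_number]
```

Keeping the two tests in separate vectors makes it easy to inspect which rule fired for a given address before trusting the combined flag.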

Use with caution: a company named, e.g., "LCD Industries" would also be flagged as containing a Roman numeral, because LCD consists only of numeral symbols. You could combine my approach with this answer to further test whether the symbols are in the right order.
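One way to test the symbol order, which is not necessarily the method of the answer referred to above, is to split each address into words and match every word against a strict pattern for numerals from 1 to 3999 in standard subtractive notation. This is a sketch under that assumption; the names roman_pattern and has_valid_roman are hypothetical.

```r
# Strict pattern for a whole Roman numeral: thousands, hundreds, tens, units
roman_pattern <- "^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"

has_valid_roman <- function(x) {
  # Split each string into words at runs of non-alphanumeric characters
  words <- strsplit(x, "[^[:alnum:]]+")
  vapply(words, function(w) {
    # The pattern also matches the empty string, so require non-empty words
    any(nzchar(w) & grepl(roman_pattern, w))
  }, logical(1))
}

x <- c("LCD Industries, ABC City", "floor X, DLF Building- III, ABC City")
has_valid_roman(x)
```

Here "LCD" is rejected because L cannot precede C, while X and III are accepted. Ordinary words that happen to be well-formed numerals, such as "MIX" or a standalone "I", would still be flagged, so the result is worth checking by hand.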

Please test this on your data and report whether it works.
