
Time: 2021-01-12 14:58:03

I have a column of residential addresses in my dataset 'ad'. I want to find the addresses that contain no numbers (including roman numerals). I'm using


ad$check <- grepl("[[:digit:]]",ad$address)

to flag addresses with no digits present (those rows come out as FALSE). How do I do the same for addresses that contain roman numerals?


E.g.: "floor X, DLF Building- III, ABC City"

1 answer



You need to make a regex string.


Edit (my first answer was nonsense):


x <- c("floor Imaginary,  building- Momentum, ABC City", "floor X, DLF Building- III, ABC City")
# here comes the regex: one or more roman-numeral letters forming a whole word
grepl("\\b[IVXLCDM]+\\b", x, ignore.case = FALSE)

To break it down:


\\b are word boundaries. A roman numeral must therefore be preceded and followed by whitespace, punctuation, or the beginning/end of the string.

[IVXLCDM]+ means the "word" we are looking for can only consist of the symbols used for roman numerals, one or more of them. Inside square brackets no | is needed between the letters (there it would be matched as a literal character), and without the + only single-letter numerals such as X would match, not III. As far as I know, these are all the symbols.

ignore.case = FALSE is the default, so it also applies if you omit the option. I find it safer, however, to state it explicitly when it matters for the operation at hand.
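To see what each piece contributes, here is a small sketch that runs three variants of the pattern on made-up addresses. The third address and the variable names are my own additions for illustration, and perl = TRUE is my choice to get PCRE's handling of \\b.

```r
x <- c("floor X, DLF Building- III, ABC City",
       "floor Imaginary, building- Momentum, ABC City",
       "floor x, abc city")

# No word boundaries: the "I" of "Imaginary" and the "M" of "Momentum" also match.
no_boundary <- grepl("[IVXLCDM]+", x)

# Word boundaries: only tokens made up entirely of numeral letters match.
with_boundary <- grepl("\\b[IVXLCDM]+\\b", x, perl = TRUE)

# ignore.case = TRUE additionally accepts lower-case tokens such as "x".
any_case <- grepl("\\b[IVXLCDM]+\\b", x, ignore.case = TRUE, perl = TRUE)

print(data.frame(x, no_boundary, with_boundary, any_case))
```

Whether lower-case numerals should count depends on how the addresses were entered, so that flag is worth deciding per dataset.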

Use with caution: a company called, e.g., "LCD Industries" would also be flagged as containing a roman numeral. You could combine my approach with this answer to further test whether the symbols are in the right order.
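One way to do such an order check is sketched below. It is not necessarily what the linked answer does. It assumes the usual subtractive notation up to 3999. The helper names is_roman and has_roman are hypothetical. Candidate tokens are first pulled out with the loose pattern, then each token is validated against an anchored, stricter pattern.

```r
# Hypothetical helper: TRUE if a non-empty token is a well-formed roman numeral
# (thousands, then hundreds, then tens, then units, each in canonical form).
is_roman <- function(token) {
  grepl("^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", token)
}

# Hypothetical helper: TRUE if an address contains at least one well-formed numeral.
has_roman <- function(address) {
  # Step 1: extract whole words built only from numeral letters.
  tokens <- regmatches(address, gregexpr("\\b[IVXLCDM]+\\b", address, perl = TRUE))
  # Step 2: keep the address if any of its tokens passes the strict check.
  vapply(tokens, function(tok) any(is_roman(tok)), logical(1))
}

x <- c("floor X, DLF Building- III, ABC City",
       "LCD Industries, ABC City",
       "floor Imaginary, building- Momentum, ABC City")
print(has_roman(x))
```

"LCD" is rejected here because its letters are not in a valid order. Real words or initials that happen to be valid numerals (for example "I" or "MIX") would still be flagged, so some manual review may remain necessary.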

Please test on your data and report if it works.
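For testing on the data frame from the question, a minimal sketch could look like this. The toy rows, the column name no_number and the intermediate variable names are assumptions of mine; only ad and address come from the question.

```r
# Toy stand-in for the real data; replace with your own 'ad'.
ad <- data.frame(
  address = c("floor X, DLF Building- III, ABC City",
              "12 Main Street, ABC City",
              "Main Street, ABC City"),
  stringsAsFactors = FALSE
)

# Arabic digits, as in the question.
digit_flag <- grepl("[[:digit:]]", ad$address)
# Whole-word runs of roman-numeral letters (loose check, order not validated).
roman_flag <- grepl("\\b[IVXLCDM]+\\b", ad$address, perl = TRUE)

# TRUE marks addresses that contain no number of either kind.
ad$no_number <- !(digit_flag | roman_flag)
print(ad)
```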




