Extracting the county name from a string

Time: 2022-09-13 11:33:28

I'm trying to write a regex in R that extracts the county name from a string. I can't just grab the first word in front of the word "County", because some counties have two- or three-word names. This particular dataset also contains some other tricky expressions to work around. This is my first attempt:


library(data.table)

foo <- data.table(foo=c("Unemployment Rate in Southampton County, VA"
                        ,"Personal Income in Southampton County + Franklin City, VA"
                        ,"Mean Commuting Time for Workers in Southampton County, VA"
                        ,"Estimate of People Age 0-17 in Poverty for Southampton County, VA"))

foo[,county:=trimws(regmatches(foo,gregexpr("(?<=\\bfor|in\\b).*?(?=(City|Municipality|County|Borough|Census Area|Parish),)",foo,perl=T)),"both")]

Any help would be greatly appreciated!


1 answer

#1



Another strategy: use a list of possible county names:


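library(maps)
library(stringi)

# Names in the maps county database look like "state,county" (lower case); keep the county part
counties <- sapply(strsplit(map("county", plot = FALSE)$names, ",", fixed = TRUE), "[", 2)
# Drop any sub-region part after a colon, then de-duplicate
counties <- unique(sub("(.*?):.*", "\\1", counties))
# Allow an optional character (e.g. a period) after a leading "st", so "st clair" also matches "St. Clair"
counties <- sub("^st", "st.?", counties)

foo <- c("Unemployment Rate in Southampton County, VA",
         "Personal Income in Southampton County + Franklin City, VA",
         "Mean Commuting Time for Workers in Southampton County, VA",
         "Estimate of People Age 0-17 in Poverty for Southampton County, VA")

# Match any known county name as a whole word, unless it is followed by "city"
stri_extract_all_regex(
  foo, paste0("\\b(", paste(counties, collapse = "|"), ")\\b(?!\\s*city)"), case_insensitive = TRUE
)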
# [[1]]
# [1] "Southampton"
# 
# [[2]]
# [1] "Southampton"
# 
# [[3]]
# [1] "Southampton"
# 
# [[4]]
# [1] "Southampton"
