Extract house numbers from (address) strings using R

Time: 2022-09-13 16:02:02

I want to split addresses into HouseNumber and Streetname, so that I can later write the extracted values into new columns (shops$HouseNumber and shops$Streetname).

Say I have a data frame called "shops":

> shops
      Name                 city        street
 1    Something            Fakecity    New Street 3
 2    SomethingOther       Fakecity    Some-Complicated-Casestreet 1-3
 3    SomethingDifferent   Fakecity    Fake Street 14a

Is there a way to split the street column into two lists, one with the street names and one with the house numbers (including cases like "1-3" and "14a"), so that the result can be assigned back to the data frame and look like this:

 > shops
      Name                 city        Streetname                    HouseNumber
 1    Something            Fakecity    New Street                    3
 2    SomethingOther       Fakecity    Some-Complicated-Casestreet   1-3
 3    SomethingDifferent   Fakecity    Fake Street                   14a 

Example: Easyfakestreet 5 --> Easyfakestreet , 5

This is slightly complicated by the fact that some of my street strings are hyphenated and some house numbers have non-numeric components.

Examples:
New Street 3 --> ['New Street', '3']
Some-Complicated-Casestreet 1-3 --> ['Some-Complicated-Casestreet','1-3']
Fake Street 14a --> ['Fake Street', '14a']

I would appreciate some help!

3 solutions

#1


Here's a possible tidyr solution

library(tidyr)
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
#                 Name     city                   Streetname HouseNumber
# 1          Something Fakecity                  New Street            3
# 2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
# 3 SomethingDifferent Fakecity                 Fake Street          14a
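
One detail worth checking with this pattern: a space is itself a non-digit, so `\\D+` also captures the space that precedes the house number. The following is a minimal base R sketch (not part of the original answer) that applies the same regular expression with `sub` and then trims the street name with `trimws`; the variable names are illustrative only.

```r
# Same regular expression as in the tidyr call above, applied with base R.
pat <- "(\\D+)(\\d.*)"
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")

raw_name     <- sub(pat, "\\1", street)  # still ends with the space before the number
street_name  <- trimws(raw_name)         # drop leading/trailing whitespace
house_number <- sub(pat, "\\2", street)  # everything from the first digit onward
```

If the trailing space matters for later joins or comparisons, the same `trimws` call could be applied to the Streetname column produced by `extract`.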

#2


You can try:

shops$Streetname <- gsub("(.+)\\s[^ ]+$","\\1", shops$street)
shops$HouseNumber <- gsub(".+\\s([^ ]+)$","\\1", shops$street)

data

shops$street
#[1] "New Street 3"                    "Some-Complicated-Casestreet 1-3" "Fake Street 14a" 

results

shops$Streetname
#[1] "New Street"                  "Some-Complicated-Casestreet" "Fake Street" 

shops$HouseNumber
#[1] "3"   "1-3" "14a"
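
These two patterns split on the last space regardless of what follows it, so an address without a house number (for example "Main Street") would have its last word treated as the number, and a string with no space at all would be returned unchanged in both columns. Below is a hedged sketch of one way to guard against that, assuming a house number always starts with a digit; the input values and variable names are made up for illustration.

```r
# Hypothetical inputs: one with a house number, two without.
street <- c("New Street 3", "Main Street", "Broadway")

# TRUE where the last space-separated token starts with a digit
has_number <- grepl("\\s\\d[^ ]*$", street)

# Only split when a number is present; otherwise keep the full string
street_name  <- ifelse(has_number, sub("\\s[^ ]+$", "", street), street)
house_number <- ifelse(has_number,
                       sub(".+\\s([^ ]+)$", "\\1", street),
                       NA_character_)
```

Rows without a recognizable number end up with `NA` in `house_number`, which is usually easier to filter on than a wrongly split street name.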

#3


Create a pattern with two capture groups, one matching the street and one matching the number, and then use sub to replace the whole match with each backreference in turn. No packages are needed:

pat <- "(.*) (\\d.*)"
transform(shops,
   street = sub(pat, "\\1", street), 
   HouseNumber = sub(pat, "\\2", street)
)

giving:

                Name     city                      street  HouseNumber
1          Something Fakecity                  New Street            3
2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
3 SomethingDifferent Fakecity                 Fake Street          14a

Here is a visualization of pat:

(.*) (\d.*)

Debuggex Demo

Note:

1) We used this for shops:

shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3", 
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name", 
"city", "street"), class = "data.frame", row.names = c(NA, -3L))

2) David Arenburg's pattern could be used here instead; just set pat to it. The pattern above has the advantage that it allows street names with embedded numbers, whereas David's has the advantage that the space before the house number may be missing.
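
The trade-off between the two patterns can be sketched on two made-up inputs. This assumes that David Arenburg's pattern is the `(\\D+)(\\d.*)` expression from the tidyr answer, which matches the description given here; the inputs and variable names are hypothetical.

```r
pat_space   <- "(.*) (\\d.*)"    # pattern from this answer
pat_nodigit <- "(\\D+)(\\d.*)"   # assumed to be David Arenburg's pattern

# Hypothetical inputs: a street name containing digits, and a missing space
x <- c("Route 66 Road 12", "Mainstreet5")

# Splits at the last space that is followed by a digit;
# "Mainstreet5" has no space, so sub() returns it unchanged.
space_name   <- sub(pat_space, "\\1", x)
space_number <- sub(pat_space, "\\2", x)

# Splits at the first digit, so "66" ends the street name early,
# but "Mainstreet5" is split even without a space.
nodigit_name   <- sub(pat_nodigit, "\\1", x)
nodigit_number <- sub(pat_nodigit, "\\2", x)

data.frame(x, space_name, space_number, nodigit_name, nodigit_number)
```

Which pattern fits better depends on the data: embedded digits in street names favor the first, while inconsistent spacing favors the second.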
