I want to split addresses into HouseNumber and Streetname. I should later be able to write the extracted values into new columns (shops$HouseNumber and shops$Streetname).
So let's say I have a data frame called "shops":
> shops
Name city street
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Is there a way to split the street column into two lists, one with the street names and one with the house numbers, including cases like "1-3" and "14a", so that the result can be assigned back to the data frame and look like this?
> shops
Name city Streetname HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Example: Easyfakestreet 5 --> Easyfakestreet , 5
It gets slightly more complicated because some of my street strings have hyphenated street addresses and non-numeric components.
Examples:
New Street 3 --> ['New Street', '3']
Some-Complicated-Casestreet 1-3 --> ['Some-Complicated-Casestreet', '1-3']
Fake Street 14a --> ['Fake Street', '14a']
I would appreciate some help!
3 solutions
#1
Here's a possible tidyr solution:
library(tidyr)
# Group 1: leading non-digits (street name); group 2: first digit onward (house number)
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
# Name city Streetname HouseNumber
# 1 Something Fakecity New Street 3
# 2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
# 3 SomethingDifferent Fakecity Fake Street 14a
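For readers without tidyr, the same two capture groups can be pulled out in base R. This is only a minimal sketch and not part of the answer above: split_street is a hypothetical helper name, and the trimws() call is an added assumption, since (\\D+) also captures the space in front of the number.

```r
# Hypothetical helper: split street strings with the pattern (\D+)(\d.*)
split_street <- function(x, pattern = "(\\D+)(\\d.*)") {
  # regexec() locates the full match and each capture group;
  # regmatches() returns them as one character vector per input string
  m <- regmatches(x, regexec(pattern, x))
  # A matched element is c(full match, group 1, group 2);
  # an unmatched element is character(0), which we map to NA
  get_group <- function(i) {
    vapply(m, function(g) if (length(g) >= i) g[i] else NA_character_,
           character(1))
  }
  data.frame(
    Streetname  = trimws(get_group(2)),  # drop the trailing space kept by \D+
    HouseNumber = get_group(3),
    stringsAsFactors = FALSE
  )
}

street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")
parts <- split_street(street)
print(parts)
```

Strings that contain no digit at all do not match the pattern, so both columns come back as NA for them in this sketch.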
#2
You can try:
# Everything before the last whitespace is the street name
shops$Streetname <- gsub("(.+)\\s[^ ]+$", "\\1", shops$street)
# Everything after the last whitespace is the house number
shops$HouseNumber <- gsub(".+\\s([^ ]+)$", "\\1", shops$street)
Data:
shops$street
#[1] "New Street 3" "Some-Complicated-Casestreet 1-3" "Fake Street 14a"
Results:
shops$Streetname
#[1] "New Street" "Some-Complicated-Casestreet" "Fake Street"
shops$HouseNumber
#[1] "3" "1-3" "14a"
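Note that the two gsub patterns split on the last whitespace regardless of what follows it, so a street without a number, such as "Market Square", would come out as "Market" and "Square". One possible guard is sketched below. It assumes that a house number always starts with a digit, which the answer does not state; pat_strict and the "Market Square" input are introduced only for this illustration.

```r
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3",
            "Fake Street 14a", "Market Square")

# Assumption: the last whitespace-separated token is a house number
# only if it starts with a digit
pat_strict <- "^(.+)\\s(\\d[^ ]*)$"
has_number <- grepl(pat_strict, street)

# Keep the whole string as the street name when there is no number
Streetname  <- ifelse(has_number, sub(pat_strict, "\\1", street), street)
HouseNumber <- ifelse(has_number, sub(pat_strict, "\\2", street), NA_character_)

print(data.frame(Streetname, HouseNumber))
```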
#3
Create a pattern with capture groups that match the street and the number, then use sub to replace the whole match with each backreference in turn. No packages are needed:
pat <- "(.*) (\\d.*)"
transform(shops,
street = sub(pat, "\\1", street),
HouseNumber = sub(pat, "\\2", street)
)
giving:
Name city street HouseNumber
1 Something Fakecity New Street 3
2 SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity Fake Street 14a
Here is a visualization of pat:
(.*) (\d.*)
Note:
1) We used this for shops:
shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3",
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name",
"city", "street"), class = "data.frame", row.names = c(NA, -3L))
2) David Arenburg's pattern could alternatively be used here. Just set pat to it. The pattern above has the advantage that it allows street names with embedded numbers, whereas David's has the advantage that the space before the street number may be missing.
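To make the trade-off between the two patterns concrete, the sketch below runs both on two made-up inputs that are not from the question: one street name with an embedded number and one with no space before the house number. The variable names are chosen only for this illustration.

```r
pat_space   <- "(.*) (\\d.*)"    # pattern from this answer
pat_nospace <- "(\\D+)(\\d.*)"   # pattern from the tidyr answer

# Made-up inputs: an embedded number, and a missing space
x <- c("Road 66 Lane 4", "Mainstreet12")

result <- data.frame(
  input          = x,
  name_space     = sub(pat_space, "\\1", x),
  number_space   = sub(pat_space, "\\2", x),
  name_nospace   = sub(pat_nospace, "\\1", x),
  number_nospace = sub(pat_nospace, "\\2", x),
  stringsAsFactors = FALSE
)
print(result)
```

With the first pattern, "Road 66 Lane 4" splits at the last space that precedes a digit, while "Mainstreet12" does not match at all and is returned unchanged by sub. With the second pattern, "Mainstreet12" splits cleanly, but "Road 66 Lane 4" is cut at the first digit.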