I am looking to transfer data from a .php link into an R data frame, but am having trouble doing so.
我希望将数据从.php链接传输到R数据框架中,但是这样做有困难。
Attempts thus far:
尝试到目前为止:
require(XML)
data <- xmlParse("http://www.mahdial-husseini.com/xmlthing.php ")
xml_data <- xmlToList(data)
The error I am getting: Error: 1: failed to load HTTP resource
我得到的错误是:error: 1:未能加载HTTP资源
Additionally (and more conceptually), I don't quite understand the nature of the link. Is this XML data in a php file, and if so, when using R to gather data, do I treat it as XML or PHP? Thank you
此外(从概念上来说),我不太理解这种联系的本质。这是php文件中的XML数据吗?如果是,在使用R收集数据时,我是将它当作XML还是php ?谢谢你!
2 个解决方案
#1
2
Or, possibly something readable:
或者,可能是可读的东西:
library(xml2)
library(tidyverse)
This will help make better column names:
这将有助于写出更好的列名:
mcga <- function(tbl) {
x <- colnames(tbl)
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
x <- make.unique(x, sep = "_")
colnames(tbl) <- x
tbl
}
This gets figured out automagically but it's nice to define it after it figures it out since it help with data consistency:
这是自动计算出来的,但最好在计算出来后再定义它,因为它有助于数据一致性:
cols(
.default = col_integer(),
site = col_character(),
aod_47 = col_double(),
omi_aot = col_double(),
omi_no2 = col_double(),
fit = col_double(),
lng = col_double(),
lat = col_double()
) -> xdf_cols
Now the work:
现在的工作:
doc <- read_xml("http://www.mahdial-husseini.com/xmlthing.php")
xml_find_all(doc, ".//PPM1_0") %>%
map_df(~{
xml_attrs(.x) %>%
as.list()
}) %>%
mcga() %>%
type_convert(col_types = xdf_cols) -> xdf
The type_convert()
isn't fully necessary but it — with the column definitions — make for consistency in results.
type_convert()并不是完全必要的,但是它(使用列定义)可以保证结果的一致性。
And, the results:
的结果:
xdf
## # A tibble: 8 x 21
## sample site month day year hour jd doy pm25_hourly aod_47 omi_aot omi_no2 fit res
## <int> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0 duluth 0 0 0 0 0 0 0 0.000 0.000 0.000 0.00000 0
## 2 19 <NA> 12 0 2004 5 0 0 30 0.000 0.000 0.000 0.00000 0
## 3 4545 Sarasota 4 0 2017 0 0 0 0 0.000 0.000 0.000 0.00000 0
## 4 11111 Atlanta 10 1 2004 13 2453280 275 23 0.379 0.148 0.274 16.01850 NA
## 5 11112 Birmingham 10 2 2008 14 2453281 276 0 0.000 0.000 0.000 19.19440 0
## 6 11113 Savannah 10 3 2004 13 2453282 277 15 0.181 0.133 0.127 9.00433 NA
## 7 11114 Fort Knox 6 20 2017 21 0 301 18 0.000 0.000 0.000 0.00000 0
## 8 63738 Fort Rucker 1 0 2015 0 0 0 40 0.000 0.000 0.000 0.00000 0
## # ... with 7 more variables: lng <dbl>, lat <dbl>, rel_humid <int>, altitude <int>, pressure <int>,
## # signal_received <int>, temp_c <int>
Full structure:
完整的结构:
glimpse(xdf)
## Observations: 8
## Variables: 21
## $ sample <int> 0, 19, 4545, 11111, 11112, 11113, 11114, 63738
## $ site <chr> "duluth", NA, "Sarasota", "Atlanta", "Birmingham", "Savan...
## $ month <int> 0, 12, 4, 10, 10, 10, 6, 1
## $ day <int> 0, 0, 0, 1, 2, 3, 20, 0
## $ year <int> 0, 2004, 2017, 2004, 2008, 2004, 2017, 2015
## $ hour <int> 0, 5, 0, 13, 14, 13, 21, 0
## $ jd <int> 0, 0, 0, 2453280, 2453281, 2453282, 0, 0
## $ doy <int> 0, 0, 0, 275, 276, 277, 301, 0
## $ pm25_hourly <int> 0, 30, 0, 23, 0, 15, 18, 40
## $ aod_47 <dbl> 0.000, 0.000, 0.000, 0.379, 0.000, 0.181, 0.000, 0.000
## $ omi_aot <dbl> 0.000, 0.000, 0.000, 0.148, 0.000, 0.133, 0.000, 0.000
## $ omi_no2 <dbl> 0.000, 0.000, 0.000, 0.274, 0.000, 0.127, 0.000, 0.000
## $ fit <dbl> 0.00000, 0.00000, 0.00000, 16.01850, 19.19440, 9.00433, 0...
## $ res <int> 0, 0, 0, NA, 0, NA, 0, 0
## $ lng <dbl> 84.1000, 63.6167, -82.5300, -84.7000, -86.8000, -81.1000,...
## $ lat <dbl> 34.0000, 38.4161, 27.3300, 33.7500, 33.5200, 32.0800, 37....
## $ rel_humid <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ altitude <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ pressure <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ signal_received <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ temp_c <int> 0, 0, 0, 0, 0, 0, 0, 0
#2
2
you can use rvest package (and data.table for convenience)
您可以使用rvest包(和数据)。表为了方便)
library(data.table)
library(rvest)
a <- read_html("http://www.mahdial-husseini.com/xmlthing.php")
dt <- rbindlist(lapply(a %>% html_nodes(css = "body > ppm1_0 > ppm1_0") %>%
xml_attrs(),
function(x) as.data.table(t((x)))))
dt <- cbind(dt[,2, with = FALSE],
as.data.table(lapply(dt[,-2, with = FALSE], as.numeric)))
dt
site sample month day year hour jd doy pm25_hourly aod_47
1: duluth 0 0 0 0 0 0 0 0 0.000
2: 19 12 0 2004 5 0 0 30 0.000
3: Sarasota 4545 4 0 2017 0 0 0 0 0.000
4: Atlanta 11111 10 1 2004 13 2453280 275 23 0.379
5: Birmingham 11112 10 2 2008 14 2453281 276 0 0.000
6: Savannah 11113 10 3 2004 13 2453282 277 15 0.181
7: Fort Knox 11114 6 20 2017 21 0 301 18 0.000
8: Fort Rucker 63738 1 0 2015 0 0 0 40 0.000
omi_aot omi_no2 fit res lng lat rel_humid altitude pressure
1: 0.000 0.000 0.00000 0 84.1000 34.0000 0 0 0
2: 0.000 0.000 0.00000 0 63.6167 38.4161 0 0 0
3: 0.000 0.000 0.00000 0 -82.5300 27.3300 0 0 0
4: 0.148 0.274 16.01850 NA -84.7000 33.7500 0 0 0
5: 0.000 0.000 19.19440 0 -86.8000 33.5200 0 0 0
6: 0.133 0.127 9.00433 NA -81.1000 32.0800 0 0 0
7: 0.000 0.000 0.00000 0 -85.9500 37.9100 0 0 0
8: 0.000 0.000 0.00000 0 -85.7000 31.3400 0 0 0
signal_received temp_c
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
7: 0 0
8: 0 0
#1
2
Or, possibly something readable:
或者,可能是可读的东西:
library(xml2)
library(tidyverse)
This will help make better column names:
这将有助于写出更好的列名:
mcga <- function(tbl) {
x <- colnames(tbl)
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
x <- make.unique(x, sep = "_")
colnames(tbl) <- x
tbl
}
This gets figured out automagically but it's nice to define it after it figures it out since it help with data consistency:
这是自动计算出来的,但最好在计算出来后再定义它,因为它有助于数据一致性:
cols(
.default = col_integer(),
site = col_character(),
aod_47 = col_double(),
omi_aot = col_double(),
omi_no2 = col_double(),
fit = col_double(),
lng = col_double(),
lat = col_double()
) -> xdf_cols
Now the work:
现在的工作:
doc <- read_xml("http://www.mahdial-husseini.com/xmlthing.php")
xml_find_all(doc, ".//PPM1_0") %>%
map_df(~{
xml_attrs(.x) %>%
as.list()
}) %>%
mcga() %>%
type_convert(col_types = xdf_cols) -> xdf
The type_convert()
isn't fully necessary but it — with the column definitions — make for consistency in results.
type_convert()并不是完全必要的,但是它(使用列定义)可以保证结果的一致性。
And, the results:
的结果:
xdf
## # A tibble: 8 x 21
## sample site month day year hour jd doy pm25_hourly aod_47 omi_aot omi_no2 fit res
## <int> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0 duluth 0 0 0 0 0 0 0 0.000 0.000 0.000 0.00000 0
## 2 19 <NA> 12 0 2004 5 0 0 30 0.000 0.000 0.000 0.00000 0
## 3 4545 Sarasota 4 0 2017 0 0 0 0 0.000 0.000 0.000 0.00000 0
## 4 11111 Atlanta 10 1 2004 13 2453280 275 23 0.379 0.148 0.274 16.01850 NA
## 5 11112 Birmingham 10 2 2008 14 2453281 276 0 0.000 0.000 0.000 19.19440 0
## 6 11113 Savannah 10 3 2004 13 2453282 277 15 0.181 0.133 0.127 9.00433 NA
## 7 11114 Fort Knox 6 20 2017 21 0 301 18 0.000 0.000 0.000 0.00000 0
## 8 63738 Fort Rucker 1 0 2015 0 0 0 40 0.000 0.000 0.000 0.00000 0
## # ... with 7 more variables: lng <dbl>, lat <dbl>, rel_humid <int>, altitude <int>, pressure <int>,
## # signal_received <int>, temp_c <int>
Full structure:
完整的结构:
glimpse(xdf)
## Observations: 8
## Variables: 21
## $ sample <int> 0, 19, 4545, 11111, 11112, 11113, 11114, 63738
## $ site <chr> "duluth", NA, "Sarasota", "Atlanta", "Birmingham", "Savan...
## $ month <int> 0, 12, 4, 10, 10, 10, 6, 1
## $ day <int> 0, 0, 0, 1, 2, 3, 20, 0
## $ year <int> 0, 2004, 2017, 2004, 2008, 2004, 2017, 2015
## $ hour <int> 0, 5, 0, 13, 14, 13, 21, 0
## $ jd <int> 0, 0, 0, 2453280, 2453281, 2453282, 0, 0
## $ doy <int> 0, 0, 0, 275, 276, 277, 301, 0
## $ pm25_hourly <int> 0, 30, 0, 23, 0, 15, 18, 40
## $ aod_47 <dbl> 0.000, 0.000, 0.000, 0.379, 0.000, 0.181, 0.000, 0.000
## $ omi_aot <dbl> 0.000, 0.000, 0.000, 0.148, 0.000, 0.133, 0.000, 0.000
## $ omi_no2 <dbl> 0.000, 0.000, 0.000, 0.274, 0.000, 0.127, 0.000, 0.000
## $ fit <dbl> 0.00000, 0.00000, 0.00000, 16.01850, 19.19440, 9.00433, 0...
## $ res <int> 0, 0, 0, NA, 0, NA, 0, 0
## $ lng <dbl> 84.1000, 63.6167, -82.5300, -84.7000, -86.8000, -81.1000,...
## $ lat <dbl> 34.0000, 38.4161, 27.3300, 33.7500, 33.5200, 32.0800, 37....
## $ rel_humid <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ altitude <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ pressure <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ signal_received <int> 0, 0, 0, 0, 0, 0, 0, 0
## $ temp_c <int> 0, 0, 0, 0, 0, 0, 0, 0
#2
2
you can use rvest package (and data.table for convenience)
您可以使用rvest包(和数据)。表为了方便)
library(data.table)
library(rvest)
a <- read_html("http://www.mahdial-husseini.com/xmlthing.php")
dt <- rbindlist(lapply(a %>% html_nodes(css = "body > ppm1_0 > ppm1_0") %>%
xml_attrs(),
function(x) as.data.table(t((x)))))
dt <- cbind(dt[,2, with = FALSE],
as.data.table(lapply(dt[,-2, with = FALSE], as.numeric)))
dt
site sample month day year hour jd doy pm25_hourly aod_47
1: duluth 0 0 0 0 0 0 0 0 0.000
2: 19 12 0 2004 5 0 0 30 0.000
3: Sarasota 4545 4 0 2017 0 0 0 0 0.000
4: Atlanta 11111 10 1 2004 13 2453280 275 23 0.379
5: Birmingham 11112 10 2 2008 14 2453281 276 0 0.000
6: Savannah 11113 10 3 2004 13 2453282 277 15 0.181
7: Fort Knox 11114 6 20 2017 21 0 301 18 0.000
8: Fort Rucker 63738 1 0 2015 0 0 0 40 0.000
omi_aot omi_no2 fit res lng lat rel_humid altitude pressure
1: 0.000 0.000 0.00000 0 84.1000 34.0000 0 0 0
2: 0.000 0.000 0.00000 0 63.6167 38.4161 0 0 0
3: 0.000 0.000 0.00000 0 -82.5300 27.3300 0 0 0
4: 0.148 0.274 16.01850 NA -84.7000 33.7500 0 0 0
5: 0.000 0.000 19.19440 0 -86.8000 33.5200 0 0 0
6: 0.133 0.127 9.00433 NA -81.1000 32.0800 0 0 0
7: 0.000 0.000 0.00000 0 -85.9500 37.9100 0 0 0
8: 0.000 0.000 0.00000 0 -85.7000 31.3400 0 0 0
signal_received temp_c
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
6: 0 0
7: 0 0
8: 0 0