I have the JSON file in a .txt file which I'm trying to load into R but I'm getting the following error:
我在.txt文件中有JSON文件,我正在尝试加载到R中,但是我收到以下错误:
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: trailing garbage
" : "SUCCESS" } /* 1 */ { "_id" : "b736c374-b8ae-4e9
(right here) ------^
I'm assuming the error is because of multiple instances of /* (number) */ and I can't manually remove them all as my file has 10k instances of these. Is there a way to remove such instances before loading the data into R?
我假设错误是由于/ *(数字)* /的多个实例而我无法手动将它们全部删除,因为我的文件有10k这些实例。有没有办法在将数据加载到R之前删除此类实例?
My JSON file looks like below:
我的JSON文件如下所示:
/* 0 */
{
"_id" : "93ccbdb6-8947",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP1KKP",
"queryId" : "93ccbdb6-8947",
"subRequests" : [{
"origin" : "WAS",
"destination" : "LON",
"carrier" : "AA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 0,
"requestDate" : 20151205,
"totalRecords" : 0,
"status" : "SUCCESS"
}
/* 1 */
{
"_id" : "b736c374-b8ae",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP1KKP",
"queryId" : "b736c374-b8ae",
"subRequests" : [{
"origin" : "WAS",
"destination" : "LON",
"carrier" : "AA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 0,
"requestDate" : 20151205,
"totalRecords" : 0,
"status" : "SUCCESS"
}
/* 2 */
{
"_id" : "3312605f-8304",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP1SXE",
"queryId" : "3312605f-8304",
"subRequests" : [{
"origin" : "LON",
"destination" : "IAD",
"carrier" : "AA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 2,
"requestDate" : 20151205,
"totalRecords" : 0,
"status" : "SUCCESS"
}
/* 3 */
{
"_id" : "6b668cfa-9b79",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP1NXA",
"queryId" : "6b668cfa-9b79",
"subRequests" : [{
"origin" : "WAS",
"destination" : "LON",
"carrier" : "AA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 1,
"requestDate" : 20151205,
"totalRecords" : 1388,
"status" : "SUCCESS"
}
/* 4 */
{
"_id" : "41c373a1-e4cb",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP6CXS",
"queryId" : "41c373a1-e4cb",
"subRequests" : [{
"origin" : "WAS",
"destination" : "LON",
"carrier" : "AA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 0,
"requestDate" : 20151205,
"totalRecords" : 1388,
"status" : "SUCCESS"
}
/* 5 */
{
"_id" : "2c8331c4-21ca",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP1KKP",
"queryId" : "2c8331c4-21ca",
"subRequests" : [{
"origin" : "WAS",
"destination" : "LON",
"carrier" : "AA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 0,
"requestDate" : 20151205,
"totalRecords" : 1388,
"status" : "SUCCESS"
}
/* 6 */
{
"_id" : "71a09900-1c13",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP6CXS",
"queryId" : "71a09900-1c13",
"subRequests" : [{
"origin" : "WAS",
"destination" : "LON",
"carrier" : "AF",
"fareClasses" : "",
"owrt" : "1,2"
}, {
"origin" : "WAS",
"destination" : "LON",
"carrier" : "AA",
"fareClasses" : "",
"owrt" : "1,2"
}, {
"origin" : "WAS",
"destination" : "LON",
"carrier" : "DL",
"fareClasses" : "",
"owrt" : "1,2"
}, {
"origin" : "WAS",
"destination" : "LON",
"carrier" : "LH",
"fareClasses" : "",
"owrt" : "1,2"
}, {
"origin" : "WAS",
"destination" : "LON",
"carrier" : "BA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 0,
"requestDate" : 20151205,
"totalRecords" : 6941,
"status" : "SUCCESS"
}
/* 7 */
{
"_id" : "a036a42a-918b",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP1MMM",
"queryId" : "a036a42a-918b",
"subRequests" : [{
"origin" : "WAS",
"destination" : "LON",
"carrier" : "AA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 0,
"requestDate" : 20151205,
"totalRecords" : 1388,
"status" : "SUCCESS"
}
/* 8 */
{
"_id" : "c547be36-805c",
"uiSearchRequest" : {
"travelDate" : 20151206,
"travelDuration" : 7,
"shopperDuration" : 30,
"oneWay" : false,
"userId" : "ATP1SXB",
"queryId" : "c547be36-805c",
"subRequests" : [{
"origin" : "CHI",
"destination" : "LON",
"carrier" : "BA",
"fareClasses" : "",
"owrt" : "1,2"
}]
},
"downloadCount" : 2,
"requestDate" : 20151205,
"totalRecords" : 1072,
"status" : "SUCCESS"
}
My code is below (Although I haven't gotten much far):
我的代码如下(虽然我还没有走得太远):
library(jsonlite)
library(RJSONIO)
json_data_raw<-fromJSON("mydata.txt")
json_file <- lapply(json_data_raw, function(x) {
x[sapply(x, is.null)] <- NA
unlist(x)
})
output <-- do.call("rbind", json_file)
write.csv(a, file="json.csv",row.names = FALSE)
file.show("json.csv")
I'm trying to get my output into a CSV file like below
我正在尝试将输出转换为CSV文件,如下所示
1 个解决方案
#1
4
There are several issues with your text file. As you already noticed, you need to remove the lines of the form /* 0 */
. What results is still not valid json. If you want to have several json objects in a file, you need to store them in an array. The json objects are the parts that are closed in curly brakets, e.g.,
您的文本文件有几个问题。正如您已经注意到的,您需要删除格式为/ * 0 * /的行。什么结果仍然无效json。如果要在文件中包含多个json对象,则需要将它们存储在数组中。 json对象是在花括号中关闭的部分,例如,
{
"_id" : "93ccbdb6-8947-4687-8e12-edf4e40d6650",
...
"totalRecords" : 0,
"status" : "SUCCESS"
}
The structure of an array of objects is as follows:
对象数组的结构如下:
[
{
...
},
{
...
}
]
To get your file in shape, you need to add a comma between the objects and add the square brackets. You could do this as follows:
要使文件保持形状,您需要在对象之间添加逗号并添加方括号。你可以这样做:
raw <- readLines("mydata.txt")
# get rid of the "/* 0 */" lines
json <- grep("^/\\* [0-9]* \\*/", raw, value = TRUE, invert = TRUE)
# add missing comma after }
n <- length(json)
json[-n] <- gsub("^}$", "},", json[-n])
# add brakets at the beginning and end
json <- c("[", json, "]")
This can be read by fromJSON()
, so I assume it is valid json:
这可以通过fromJSON()来读取,所以我假设它是有效的json:
library(jsonlite)
table <- fromJSON(json)
The table is nested, that is, some of the tables cells contain a data frame or a list themselves. For example,
该表是嵌套的,也就是说,某些表单元格本身包含数据框或列表。例如,
table[1,2]
## travelDate travelDuration shopperDuration oneWay userId queryId
## 1 20151206 7 30 FALSE ATP1KKP 93ccbdb6-8947-4687-8e12-edf4e40d6650
## subRequests
## 1 WAS, LON, AA, , 1,2
You could use flatten()
from the jsonlite
package, to get a table with one level of nesting less
您可以使用jsonlite包中的flatten()来获得一个嵌套级别较低的表
flatten(table)[1:3, c(1, 6, 12)]
## _id uiSearchRequest.travelDate uiSearchRequest.subRequests
## 1 93ccbdb6-8947-4687-8e12-edf4e40d6650 20151206 WAS, LON, AA, , 1,2
## 2 b736c374-b8ae-4e99-8073-9c54517fecd5 20151206 WAS, LON, AA, , 1,2
## 3 3312605f-8304-4ab8-96d6-6e1a03cfbd9e 20151206 LON, IAD, AA, , 1,2
The last column is still a list. There are many ways you could handle this. One possibility, is to create a row per subrequest, where the contents of all the other columns (X_id
, downloadCount
, etc.) are repeated. (This is almost the form that you give in your question, with the only difference that you left cells empty in the reapeated columns, while I repeat the contents.) This is how it can be done:
最后一列仍然是一个列表。有很多方法可以解决这个问题。一种可能性是为每个子请求创建一行,其中重复所有其他列(X_id,downloadCount等)的内容。 (这几乎就是你在问题中给出的形式,唯一不同的是你在重新列出的列中留下了空格,而我重复了内容。)这是如何做到的:
table <- flatten(fromJSON(json))
tab_list <- lapply(1:nrow(table),
function(i) data.frame(table[i, -12], table[i, 12],
stringsAsFactors = FALSE))
library(dplyr)
flat_table <- bind_rows(tab_list)
The second line creates a list of data frames. These are combined into a single data frame using bind_rows()
from dpylr
. (To be more precise, flat_table
will be a tbl_df
, but the difference to a data.frame
is small.) This can then be written to a csv file in the usual way:
第二行创建数据框列表。使用dpylr中的bind_rows()将它们组合成单个数据帧。 (更确切地说,flat_table将是一个tbl_df,但与data.frame的差异很小。)然后可以通常的方式将其写入csv文件:
write.csv(flat_table, file = "mydata.csv")
#1
4
There are several issues with your text file. As you already noticed, you need to remove the lines of the form /* 0 */
. What results is still not valid json. If you want to have several json objects in a file, you need to store them in an array. The json objects are the parts that are closed in curly brakets, e.g.,
您的文本文件有几个问题。正如您已经注意到的,您需要删除格式为/ * 0 * /的行。什么结果仍然无效json。如果要在文件中包含多个json对象,则需要将它们存储在数组中。 json对象是在花括号中关闭的部分,例如,
{
"_id" : "93ccbdb6-8947-4687-8e12-edf4e40d6650",
...
"totalRecords" : 0,
"status" : "SUCCESS"
}
The structure of an array of objects is as follows:
对象数组的结构如下:
[
{
...
},
{
...
}
]
To get your file in shape, you need to add a comma between the objects and add the square brackets. You could do this as follows:
要使文件保持形状,您需要在对象之间添加逗号并添加方括号。你可以这样做:
raw <- readLines("mydata.txt")
# get rid of the "/* 0 */" lines
json <- grep("^/\\* [0-9]* \\*/", raw, value = TRUE, invert = TRUE)
# add missing comma after }
n <- length(json)
json[-n] <- gsub("^}$", "},", json[-n])
# add brakets at the beginning and end
json <- c("[", json, "]")
This can be read by fromJSON()
, so I assume it is valid json:
这可以通过fromJSON()来读取,所以我假设它是有效的json:
library(jsonlite)
table <- fromJSON(json)
The table is nested, that is, some of the tables cells contain a data frame or a list themselves. For example,
该表是嵌套的,也就是说,某些表单元格本身包含数据框或列表。例如,
table[1,2]
## travelDate travelDuration shopperDuration oneWay userId queryId
## 1 20151206 7 30 FALSE ATP1KKP 93ccbdb6-8947-4687-8e12-edf4e40d6650
## subRequests
## 1 WAS, LON, AA, , 1,2
You could use flatten()
from the jsonlite
package, to get a table with one level of nesting less
您可以使用jsonlite包中的flatten()来获得一个嵌套级别较低的表
flatten(table)[1:3, c(1, 6, 12)]
## _id uiSearchRequest.travelDate uiSearchRequest.subRequests
## 1 93ccbdb6-8947-4687-8e12-edf4e40d6650 20151206 WAS, LON, AA, , 1,2
## 2 b736c374-b8ae-4e99-8073-9c54517fecd5 20151206 WAS, LON, AA, , 1,2
## 3 3312605f-8304-4ab8-96d6-6e1a03cfbd9e 20151206 LON, IAD, AA, , 1,2
The last column is still a list. There are many ways you could handle this. One possibility, is to create a row per subrequest, where the contents of all the other columns (X_id
, downloadCount
, etc.) are repeated. (This is almost the form that you give in your question, with the only difference that you left cells empty in the reapeated columns, while I repeat the contents.) This is how it can be done:
最后一列仍然是一个列表。有很多方法可以解决这个问题。一种可能性是为每个子请求创建一行,其中重复所有其他列(X_id,downloadCount等)的内容。 (这几乎就是你在问题中给出的形式,唯一不同的是你在重新列出的列中留下了空格,而我重复了内容。)这是如何做到的:
table <- flatten(fromJSON(json))
tab_list <- lapply(1:nrow(table),
function(i) data.frame(table[i, -12], table[i, 12],
stringsAsFactors = FALSE))
library(dplyr)
flat_table <- bind_rows(tab_list)
The second line creates a list of data frames. These are combined into a single data frame using bind_rows()
from dpylr
. (To be more precise, flat_table
will be a tbl_df
, but the difference to a data.frame
is small.) This can then be written to a csv file in the usual way:
第二行创建数据框列表。使用dpylr中的bind_rows()将它们组合成单个数据帧。 (更确切地说,flat_table将是一个tbl_df,但与data.frame的差异很小。)然后可以通常的方式将其写入csv文件:
write.csv(flat_table, file = "mydata.csv")