I have a JSON file with a lot of double quotes inside the values. The file contains almost 27,000 records.
I want to remove or replace the double quotes inside the values, because otherwise the file is not valid JSON. How can I do that?
The problem is that some records have one double quote inside the value, while others have several.
Instead of replacing or removing the quotes, it would also be fine to remove the entire key and value, since I am not going to use it anyway. Is that any easier?
Here is a sample of one record in the JSON file:
{
    "adlibJSON": {
        "recordList": {
            "record": [
                {
                    "@attributes": {
                        "priref": "4372",
                        "created": "2011-12-09T23:09:57",
                        "modification": "2012-08-11T17:07:51",
                        "selected": "False"
                    },
                    "acquisition.date": [
                        "1954"
                    ],
                    "documentation.title": [
                        "A lot of text with a lot of extra double quotes like "this" and "this""
                    ] ... ...
The problem lies in the value of the key documentation.title. I have Sublime Text 2, which I use for find and replace.
3 solutions
#1
1
There is a way, but for it to work you must be able to make the following assumptions about your data:
- "documentation.title" appears only once in your data as a key.
- The array value that "documentation.title" refers to has only one element.
- The character "]" does not appear in the value.
Then you would follow these steps:
/* s holds the raw text of the file. Find the first "[" after "documentation.title". */
var n = s.indexOf("[", s.indexOf('"documentation.title"'));
/* Find the index of the closing "]". */
var n2 = s.indexOf("]", n);
/* Get the substring enclosed by these indexes. */
var x = s.substring(n + 1, n2);
/* Remove every double quote in this substring, trim the surrounding whitespace (a raw newline is not allowed inside a JSON string), and rebuild the original string with the corrected value. */
s = s.substring(0, n) + '["' + x.replace(/"/g, "").trim() + '"]' + s.substring(n2 + 1);
Edit: if you are not interested in keeping the corrected value, you can simply replace it with an empty string.
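With close to 27,000 records, the key presumably occurs once per record rather than once in the whole file. The following is a minimal sketch of the same index-based idea applied in a loop. The function name fixTitles and its parameters are hypothetical, and the other two assumptions (a single-element array, no "]" in the value) still have to hold.

```javascript
// Hypothetical helper: strips stray double quotes from every
// "documentation.title" array found in the raw (invalid) JSON text.
// Assumes each array holds a single string and never contains "]".
function fixTitles(s, key = '"documentation.title"') {
  let out = "";
  let pos = 0;
  for (;;) {
    const k = s.indexOf(key, pos);
    if (k === -1) break;
    const open = s.indexOf("[", k);
    const close = open === -1 ? -1 : s.indexOf("]", open);
    if (close === -1) break;
    // Text between the brackets, including outer quotes and whitespace.
    const inner = s.slice(open + 1, close);
    // Drop every quote, then trim so no raw newline ends up in the string.
    const cleaned = inner.replace(/"/g, "").trim();
    out += s.slice(pos, open) + '["' + cleaned + '"]';
    pos = close + 1;
  }
  // Append whatever follows the last corrected array.
  return out + s.slice(pos);
}
```

To discard the value instead of keeping it, use an empty string in place of cleaned.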
#2
0
I don't think you can do this with a regex, since JSON is not a regular language.
You'll probably run into trouble similar to what you get when parsing HTML with regex.
I think you'll have to write some kind of parser yourself (or find one, if you're super lucky)...
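As a rough illustration of what such a hand-written scanner might look like, here is a minimal sketch. It is a heuristic, not a real JSON parser: the function name escapeInnerQuotes is hypothetical, and it assumes that a quote inside a string is the closing quote only when the next non-whitespace character is one of , ] } : or the end of the input. A value such as "foo", bar" would defeat it.

```javascript
// Heuristic scanner (an assumption, not a full JSON parser): inside a string,
// a double quote counts as the closing quote only when the next
// non-whitespace character is one of , ] } : or the end of the input.
// Any other quote inside a string is backslash-escaped.
function escapeInnerQuotes(text) {
  let out = "";
  let inString = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (!inString) {
      if (ch === '"') inString = true; // opening quote of a key or value
      out += ch;
    } else if (ch === "\\") {
      // Copy an existing escape sequence through untouched.
      out += ch + text.charAt(i + 1);
      i++;
    } else if (ch === '"') {
      // Peek at the next non-whitespace character.
      let j = i + 1;
      while (j < text.length && /\s/.test(text[j])) j++;
      const next = text[j];
      if (next === undefined || ",]}:".includes(next)) {
        inString = false; // structural closing quote
        out += ch;
      } else {
        out += '\\"'; // stray quote inside the value
      }
    } else {
      out += ch;
    }
  }
  return out;
}
```

Running the result through JSON.parse afterwards is a cheap way to check whether the heuristic held for the whole file.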
#3
0
Try this:
json.replace(/(^\s*|:\s*)"/gm, '$1[sentinel]')
.replace(/"(,?\s*$|:)/gm, '[sentinel]$1')
.replace(/"/g, '\\"').replace(/\[sentinel\]/g, '"');
Demo here: http://jsfiddle.net/D83FD/
This isn't a perfect solution; the data could be formatted in a way that breaks the regular expressions. Try it and see whether it works on a larger data set.
Essentially, we find the opening quotes and replace them with a placeholder, find the closing quotes and replace them with the same placeholder, backslash-escape all remaining quotes, and then turn the placeholders back into quotes.
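To make the four passes easier to try on a small input, the chain can be wrapped in a function and checked with JSON.parse. This is only a sketch: the function name escapeWithSentinel is hypothetical, and it assumes one key or value per line (as in the sample record) and that the literal text [sentinel] never occurs in the data.

```javascript
// Sketch of the placeholder approach. Assumes one key/value per line and
// that the text "[sentinel]" does not already appear in the data.
function escapeWithSentinel(json) {
  return json
    // 1. Opening quotes: at the start of a line or right after a colon.
    .replace(/(^\s*|:\s*)"/gm, "$1[sentinel]")
    // 2. Closing quotes: before an optional comma at end of line, or before a colon.
    .replace(/"(,?\s*$|:)/gm, "[sentinel]$1")
    // 3. Every quote still left is a stray one inside a value: escape it.
    .replace(/"/g, '\\"')
    // 4. Turn the placeholders back into real quotes.
    .replace(/\[sentinel\]/g, '"');
}

// Small input modelled on the sample record.
const broken = [
  "{",
  '"priref": "4372",',
  '"documentation.title": [',
  '"A lot of text with "this" and "this""',
  "]",
  "}",
].join("\n");

// JSON.parse throws a SyntaxError if the repair did not produce valid JSON.
const record = JSON.parse(escapeWithSentinel(broken));
console.log(record["documentation.title"][0]);
```

Wrapping the JSON.parse call in try/catch when running against the full file shows whether some record still breaks the patterns.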