How do I extract JSON fields with a regular expression?

Time: 2021-07-16 13:05:30

Beginner regex question. I have lines of JSON in a text file, each with slightly different fields, but there are three fields I want to extract from each line if present, ignoring everything else. How would I use a regex (in EditPad or anywhere else) to do this?

Example:

"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24

I want to extract url, title, and tags.

4 solutions

#1


13  

/"(url|title|tags)":"((\\"|[^"])*)"/i

I think this is what you're asking for; an explanation follows. This regular expression (delimited by /, which you probably won't need to type in EditPad) matches:

"

A literal ".

(url|title|tags)

Any of the three literal strings "url", "title" or "tags". In regular expressions, parentheses create groups by default, and the pipe character alternates between options, like a logical 'or'. To match a literal parenthesis or pipe, you'd have to escape it.

":"

Another literal string.

(

The beginning of another group. (Group 2)

    (

Another group (3)

        \\"

The literal string \". You have to escape the backslash because otherwise it would be read as escaping the next character, so \" would simply match a bare double quote.

        |

or...

        [^"]

Any single character except a double quote. The brackets denote a character class (or set): a list of characters to match. Any given class matches exactly one character in the string. A caret (^) at the beginning of a class negates it, so it matches any character not listed in the class.

    )

End of group 3...

    *

The asterisk causes the preceding expression (in this case, group 3) to be repeated zero or more times, so the matcher accepts anything that could appear inside the double quotes of a JSON string.

)"

The end of group 2, and a literal ".

I've done a few non-obvious things here, that may come in handy:

  1. Group 2, when dereferenced with a backreference, is the actual string assigned to the field. This is useful for getting the value itself.

  2. The i at the end of the expression makes it case-insensitive.

  3. Group 1 contains the name of the captured field.
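
As a rough illustration of groups 1 and 2, here is a minimal Ruby sketch. The sample line is shortened from the question's example, and the variable names are made up for this sketch.

```ruby
# Hypothetical sample line, shortened from the question's example.
line = '"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays"'

pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i

# scan returns one array of captured groups per match: [group 1, group 2, group 3].
# Group 3 only holds the last repeated character, so it is discarded here.
fields = line.scan(pattern).map { |name, value, _last_char| [name, value] }

fields.each { |name, value| puts "#{name} = #{value}" }
```

The "domain" field is skipped because its name is not one of the three alternatives in group 1.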

EDIT: I see that the tags are an array, so the expression needs to handle that case too.

Your new Regex is:

/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i

All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)") with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, substituting the letter S for our string regex, we can rewrite it as:

\[(S(,S)*)?\]

This matches a literal opening bracket (hence the backslash), optionally followed by a comma-separated list of strings, and then a literal closing bracket. The only new concept introduced here is the question mark (?), which is itself a type of repetition. Commonly described as 'making the previous expression optional', it can also be thought of as matching exactly 0 or 1 times.

With our same S Notation, here's the whole dirty Regular Expression:

/"(url|title|tags)":(S|\[(S(,S)*)?\])/i
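
The S notation can be mirrored in code by building the pattern from a sub-pattern. Here is a minimal Ruby sketch, assuming the inner groups of S are made non-capturing (a change from the expression above, so that only the field name and the raw value are captured). The sample line is hypothetical.

```ruby
# S: a double-quoted JSON string, allowing \" inside it.
s = /"(?:\\"|[^"])*"/

# Either a single string S, or an array of strings [S,S,...] (possibly empty).
pattern = /"(url|title|tags)":(#{s.source}|\[(?:#{s.source}(?:,#{s.source})*)?\])/i

line = '"title":"Orwell Essays","tags":["orwell","writing"],"index":2931'

# Each match yields [field name, raw value text]; string values keep their quotes.
matches = line.scan(pattern)
matches.each { |name, raw| puts "#{name}: #{raw}" }
```

The raw value for tags is the whole bracketed text, so it still has to be split or parsed to get the individual tags.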

If it helps, here's a view of it in action.

#2


2  

This question is a bit old, but I was browsing around on my PC and found this expression. I posted it as a gist; it could be useful to others.

EDIT:

# Expression was tested with PHP and Ruby
# This regular expression finds a key-value pair in JSON formatted strings
# Match 1: Key
# Match 2: Value
# https://regex101.com/r/zR2vU9/4
# http://rubular.com/r/KpF3suIL10

(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)

# test document
[
  {
    "_id": "56af331efbeca6240c61b2ca",
    "index": 120000,
    "guid": "bedb2018-c017-429E-b520-696ea3666692",
    "isActive": false,
    "balance": "$2,202,350",
    "object": {
        "name": "am",
        "lastname": "lang"
    }
  }
]
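
To see what this expression captures, here is a small Ruby sketch that runs it over a trimmed copy of the test document above (the variable names are assumptions of this sketch). Because the pattern contains only named groups, scan returns [key, value] pairs.

```ruby
pattern = /(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)/

# Trimmed copy of the test document; the quoted heredoc disables interpolation.
doc = <<~'JSON'
  {
    "_id": "56af331efbeca6240c61b2ca",
    "index": 120000,
    "isActive": false,
    "balance": "$2,202,350"
  }
JSON

# One [key, value] pair per match.
pairs = doc.scan(pattern)
pairs.each { |key, value| puts "#{key} => #{value}" }
```

Note that the value character class includes a comma, so an unquoted number followed by a comma (the index value here) is captured together with that trailing comma and may need trimming.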

#3


1  

Why does it have to be a Regular Expression object?

Here we can just use a Hash object first and then search it.

mh = {"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],"index":2931,"time_created":1345419323,"num_saves":24}

The output of which would be

=> {:url=>"http://www.netcharles.com/orwell/essays.htm", :domain=>"netcharles.com", :title=>"Orwell Essays & Journalism Section - Charles' George Orwell Links", :tags=>["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"], :index=>2931, :time_created=>1345419323, :num_saves=>24}

Not that I want to avoid using Regexp, but don't you think it would be easier to take it a step at a time until you're getting the data you want to search further? Just MHO.

mh.values_at(:url, :title, :tags)

The output:

["http://www.netcharles.com/orwell/essays.htm", "Orwell Essays & Journalism Section - Charles' George Orwell Links", ["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"]]
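
If the lines come from a file rather than a hand-written hash, the same idea works by parsing each line. Here is a minimal sketch, assuming each line is a complete JSON object wrapped in braces (the question's example omits them) and using Ruby's standard json library. The sample line is shortened.

```ruby
require 'json'

line = '{"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays","tags":["orwell","writing"],"index":2931}'

# symbolize_names gives symbol keys, like the mh hash above.
record = JSON.parse(line, symbolize_names: true)

# values_at returns nil for any field the line does not have (num_saves here).
url, title, tags, missing = record.values_at(:url, :title, :tags, :num_saves)
puts url, title, tags.inspect
```

Since values_at yields nil for absent keys, lines with slightly different fields need no special handling.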

Taking the pattern that FrankieTheKneeman gave you:

pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i

we can search the mh hash by converting it to a JSON string (to_json needs require 'json').

/#{pattern}/.match(mh.to_json)

The output:

=> #<MatchData "\"url\":\"http://www.netcharles.com/orwell/essays.htm\"" 1:"url" 2:"http://www.netcharles.com/orwell/essays.htm" 3:"m">

Of course this is all done in Ruby, which is not one of your tags, but I hope it's related.

But oops! match only returns the first match, so we don't get all three at once this way. I will do them one at a time just for the sake of it.

pattern = /"(title)":"((\\"|[^"])*)"/i

/#{pattern}/.match(mh.to_json)

#<MatchData "\"title\":\"Orwell Essays & Journalism Section - Charles' George Orwell Links\"" 1:"title" 2:"Orwell Essays & Journalism Section - Charles' George Orwell Links" 3:"s">

pattern = /"(tags)":"((\\"|[^"])*)"/i

/#{pattern}/.match(mh.to_json)

=> nil

Sorry about that last one. It will have to be handled differently.
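
One possible way to handle tags differently is to let the regex locate the array text and then hand it to a JSON parser. This is a sketch under the assumption that no tag contains a ] character; the sample string is shortened and hypothetical.

```ruby
require 'json'

json = '{"title":"Orwell Essays","tags":["orwell","writing"],"index":2931}'

# Capture everything from [ up to the first ] after "tags":
m = json.match(/"tags":(\[[^\]]*\])/i)

# Parse the captured array text; fall back to an empty list if the field is absent.
tags = m ? JSON.parse(m[1]) : []
puts tags.inspect
```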

#4


0  

I adapted regex to work with JSON in my own library. I've detailed the algorithm's behavior below.

First, stringify the JSON object. Then, you need to store the starts and lengths of the matched substrings. For example:

"matched".search("ch") // yields 3

For a JSON string, this works exactly the same, unless you are searching explicitly for commas and curly brackets, in which case I'd recommend transforming your JSON object before running the regex (think :, {, }).
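
Recording the starts and lengths might look like the following Ruby sketch. The answer's own example uses JavaScript's search; the sample text and names here are assumptions of this sketch, not the library's code.

```ruby
# Ruby counterpart of "matched".search("ch"): the index of the first match.
first_index = "matched" =~ /ch/

text = '{"title":"Orwell Essays","tags":["orwell"]}'

# Collect [start, length] for every occurrence of one of the wanted keys.
spans = []
text.scan(/"(?:url|title|tags)"/) do
  m = Regexp.last_match
  spans << [m.begin(0), m[0].length]
end

puts first_index
puts spans.inspect
```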

Next, you need to reconstruct the JSON object. The algorithm I authored does this by detecting JSON syntax by recursively going backwards from the match index. For instance, the pseudo code might look as follows:

find the next key preceding the match index, call this theKey
then find the number of all occurrences of this key preceding theKey, call this theNumber
using the number of occurrences of all keys with same name as theKey up to position of theKey, traverse the object until keys named theKey has been discovered theNumber times
return this object called parentChain

With this information, it is possible to use regex to filter a JSON object to return the key, the value, and the parent object chain.

You can see the library and code I authored at http://json.spiritway.co/
