Search and replace with Ruby regex

I have a text blob field in a MySQL column that contains HTML. I have to change some of the markup, so I figured I'll do it in a ruby script. Ruby is irrelevant here, but it would be nice to see an answer with it. The markup looks like the following:

<h5>foo</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>bar</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>meow</h5>
  <table>
    <tbody>
    </tbody>
  </table>

I need to change just the first <h5>foo</h5> block of each text to <h2>something_else</h2> while leaving the rest of the string alone.

Can't seem to get the proper PCRE regex, using Ruby.

3 Solutions

#1

# The regex literal syntax using %r{...} allows / in your regex without escaping
new_str = my_str.sub( %r{<h5>[^<]+</h5>}, '<h2>something_else</h2>' )

Using String#sub instead of String#gsub causes only the first replacement to occur. If you need to dynamically choose what 'foo' is, you can use string interpolation in regex literals:

new_str = my_str.sub( %r{<h5>#{searchstr}</h5>}, "<h2>#{replacestr}</h2>" )
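
One caveat with interpolation: if the search text can contain regex metacharacters, it should be escaped first. The following is a minimal sketch, assuming a made-up heading text with parentheses and a dot; the sample values of my_str, searchstr and replacestr are hypothetical and not from the question.

```ruby
# Hypothetical inputs: the heading text contains regex metacharacters.
my_str     = "<h5>foo (v1.0)</h5>\n<h5>bar</h5>"
searchstr  = "foo (v1.0)"
replacestr = "something_else"

# Regexp.escape makes the interpolated text match literally.
# Without it, "(v1.0)" would be read as a capture group and the
# parentheses in the heading would not match.
pattern = %r{<h5>#{Regexp.escape(searchstr)}</h5>}
new_str = my_str.sub(pattern, "<h2>#{replacestr}</h2>")

puts new_str
```

Only the first heading changes, because String#sub stops after one replacement.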

Then again, if you know what 'foo' is, you don't need a regex:

new_str = my_str.sub( "<h5>searchstr</h5>", "<h2>#{replacestr}</h2>" )

or even:

my_str[ "<h5>searchstr</h5>" ] = "<h2>#{replacestr}</h2>"
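
Two things are worth knowing about the index-assignment form: it modifies the string in place, and it raises IndexError when the substring is not present. A small sketch with hypothetical sample data:

```ruby
# dup keeps the string mutable even if string literals are frozen.
my_str = "<h5>foo</h5>\n<h5>bar</h5>".dup

# Replaces the first occurrence of the substring, in place.
my_str["<h5>foo</h5>"] = "<h2>something_else</h2>"
puts my_str

# Assigning to a substring that does not occur raises IndexError.
not_found = false
begin
  my_str["<h5>missing</h5>"] = "<h2>x</h2>"
rescue IndexError
  not_found = true
end
puts "missing heading raised IndexError: #{not_found}"
```

If a missing heading is a normal case rather than an error, String#sub is the safer choice because it simply returns the string unchanged.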

If you need to run code to figure out the replacement, you can use the block form of sub:

new_str = my_str.sub %r{<h5>([^<]+)</h5>} do |full_match|
  # The expression returned from this block will be used as the replacement string
  # $1 will be the matched content between the h5 tags.
  "<h2>#{replacestr}</h2>"
end
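
As an illustration of the block form actually using the capture, here is a sketch that builds the replacement from the original heading text. The upcasing is an arbitrary transformation chosen for the example, and the sample string is hypothetical.

```ruby
my_str = "<h5>foo</h5>\n<h5>bar</h5>"

new_str = my_str.sub(%r{<h5>([^<]+)</h5>}) do
  # $1 holds the text captured between the h5 tags ("foo" here).
  "<h2>#{$1.upcase}</h2>"
end

puts new_str
```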

#2

Whenever I have to parse or modify HTML or XML I reach for a parser. I almost never bother with regex or instring unless it's absolutely a no-brainer.

Here's how to do it using Nokogiri, without any regex:

text = <<EOT
<h5>foo</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>bar</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>meow</h5>
  <table>
    <tbody>
    </tbody>
  </table>
EOT

require 'nokogiri'

fragment = Nokogiri::HTML::DocumentFragment.parse(text)
print fragment.to_html

fragment.css('h5').select{ |n| n.text == 'foo' }.each do |n|
  n.name = 'h2'
  n.content = 'something_else'
end

print fragment.to_html

After parsing, this is what Nokogiri has returned from the fragment:

# >> <h5>foo</h5>
# >>   <table><tbody></tbody></table><h5>bar</h5>
# >>   <table><tbody></tbody></table><h5>meow</h5>
# >>   <table><tbody></tbody></table>

This is the output after changing the node:

# >> <h2>something_else</h2>
# >>   <table><tbody></tbody></table><h5>bar</h5>
# >>   <table><tbody></tbody></table><h5>meow</h5>
# >>   <table><tbody></tbody></table>

#3

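Use String#gsub with the regular expression <h5>[^<]+<\/h5>. Note that gsub replaces every match, so the example below only behaves as asked because the sample string contains a single <h5>: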

>> current = "<h5>foo</h5>\n  <table>\n    <tbody>\n    </tbody>\n  </table>"
>> updated = current.gsub(/<h5>[^<]+<\/h5>/){"<h2>something_else</h2>"}
=> "<h2>something_else</h2>\n  <table>\n    <tbody>\n    </tbody>\n  </table>"
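
To see how sub and gsub differ on input shaped like the question's, here is a short sketch. The three-heading string is a trimmed-down, hypothetical stand-in for the real column contents.

```ruby
html    = "<h5>foo</h5>\n<h5>bar</h5>\n<h5>meow</h5>"
pattern = %r{<h5>[^<]+</h5>}

# sub replaces only the first match; gsub replaces every match.
first_only = html.sub(pattern, "<h2>something_else</h2>")
all        = html.gsub(pattern, "<h2>something_else</h2>")

puts first_only.scan("<h2>").size  # headings rewritten by sub
puts all.scan("<h2>").size         # headings rewritten by gsub
```

For the original requirement (change just the first block and leave the rest alone), sub is the one that fits.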

Note that you can conveniently test Ruby regular expressions in your browser.
