How do I combine multiple regular expressions into one line?

My script works fine doing this:

import re

images = re.findall(r"src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall(r"\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)

However, I believe it is inefficient to search through the whole document twice.

Here's a sample document if it helps: http://pastebin.com/5kRZXjij

I would expect the following output from the above:

images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl

Instead it would be better to do something like:

image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)

How can I combine the two re.findall lines into one?

I have tried using the | character, but I always fail to match anything, so I'm clearly confused about how to use it properly.

2 Solutions

#1

As mentioned in the comments, a pipe (|) should do the trick.

The regular expression

(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))

catches either of the two patterns.

Demo on Regex Tester
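
One thing to watch when this pattern is passed to re.findall: because it contains four capture groups, each match comes back as a 4-tuple, with empty strings for the alternative that did not match. Here is a minimal sketch of that behaviour. The doc string is a made-up sample assembled from the URLs in the question's expected output, not the real pastebin document.

```python
import re

# Hypothetical sample markup, built from the URLs in the question's expected output.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">\n'
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">\n'
)

# The combined pattern from this answer, split over two raw strings for readability.
combined = re.compile(
    r'(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))'
    r'|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))'
)

# With four groups, findall yields one 4-tuple per match; groups belonging to
# the alternative that did not match come back as empty strings.
matches = combined.findall(doc)

images = [m[1] for m in matches if m[1]]  # group 2 holds the image URL
videos = [m[3] for m in matches if m[3]]  # group 4 holds the video URL

print(images)
print(videos)
```

So the document is scanned once, and the two lists are separated afterwards by checking which group is non-empty.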

#2

If you really want efficient...

For starters, I would cut out the leading \S*? at the start of the second regex. It serves no purpose apart from creating an opportunity for lots of backtracking.

src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
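
As a rough sketch of how this two-group version could be used in Python, re.finditer lets you check which group took part in each match and sort the results in a single pass. The doc string and the variable names below are assumptions for illustration only.

```python
import re

# Hypothetical sample markup, built from the URLs in the question's expected output.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">\n'
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">\n'
)

pattern = re.compile(
    r'src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)'    # group 1: image URL
    r'|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'  # group 2: video URL
)

images, videos = [], []
for m in pattern.finditer(doc):
    # A group that did not take part in the match is None on the match object.
    if m.group(1) is not None:
        images.append(m.group(1))
    else:
        videos.append(m.group(2))

print(images)
print(videos)
```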

Other ideas

You can get rid of the capture groups by using a small lookbehind in the first alternative, which lets you drop all the parentheses and match exactly what you want. Not faster, but tidier:

(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
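
With no capture groups left, re.findall returns plain strings rather than tuples, so the result is already a flat list of links. A small sketch, again using a made-up doc built from the question's expected output; splitting on the substring video_file afterwards is just one possible way to tell the two kinds apart.

```python
import re

# Hypothetical sample markup, built from the URLs in the question's expected output.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">\n'
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">\n'
)

pattern = re.compile(
    r'(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg'  # image URL right after src="
    r'|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*'     # video URL anywhere
)

# No groups in the pattern, so findall returns the full matches as strings.
links = pattern.findall(doc)

# One way to separate the two kinds after the single pass.
images = [u for u in links if 'video_file' not in u]
videos = [u for u in links if 'video_file' in u]

print(links)
```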

Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
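
To make the difference concrete, here is a tiny sketch comparing the unescaped and escaped forms on two made-up strings:

```python
import re

# Unescaped: "." matches any single character, so "X" is accepted here.
loose = re.search(r'media.tumblr', 'mediaXtumblr')

# Escaped: "\." only matches a literal period, so the same input is rejected.
strict = re.search(r'media\.tumblr', 'mediaXtumblr')

# The escaped form still matches a real host name.
real = re.search(r'media\.tumblr', '37.media.tumblr.com')

print(loose is not None, strict is not None, real is not None)
```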

You can use the re.IGNORECASE option and get rid of some letters:

(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
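
Note that the shortened class [a-z0-9] only works if the flag is actually passed. A quick sketch on a made-up doc (built from the question's expected output) shows the video link being cut off at the first uppercase letter without it:

```python
import re

# Hypothetical sample markup, built from the URLs in the question's expected output.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">\n'
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">\n'
)

pattern_text = (
    r'(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg'
    r'|http\S*?video_file\S*?tumblr_[a-z0-9]*'
)

# Without the flag, [a-z0-9]* stops at the first uppercase letter of the video id.
case_sensitive = re.findall(pattern_text, doc)

# With re.IGNORECASE, [a-z0-9] also accepts uppercase letters.
case_insensitive = re.findall(pattern_text, doc, flags=re.IGNORECASE)

print(case_sensitive[1])
print(case_insensitive[1])
```

Keep in mind that the flag applies to the whole pattern, so literals such as src, media and jpg would also match in uppercase.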
