How can I combine multiple regular expressions into one line?

Date: 2023-01-11 15:45:59

My script works fine doing this:

images = re.findall(r"src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall(r"\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)

However, I believe it is inefficient to search through the whole document twice.

Here's a sample document if it helps: http://pastebin.com/5kRZXjij

I would expect the following output from the above:

images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl

Instead it would be better to do something like:

image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)

How can I combine the two re.findall lines into one?

I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.

2 Answers

#1


6  

As mentioned in the comments, a pipe (|) should do the trick.

The regular expression

(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))

catches either of the two patterns.
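
One thing to watch for: because this pattern contains four capture groups, re.findall returns a 4-tuple per match rather than a plain string, with empty strings for the branch that did not match. A minimal sketch of what that looks like, assuming a made-up two-line doc standing in for the pastebin sample (the HTML around the URLs is hypothetical):

```python
import re

# Hypothetical stand-in for the pastebin document; only the URLs come from the question.
doc = '''<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">
<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl" type="video/mp4">'''

combined = (r'(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))'
            r'|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))')

# With several capture groups, findall yields one tuple of all groups per match.
matches = re.findall(combined, doc)

# Groups 2 and 4 hold the bare URLs; only one of them is non-empty per match.
links = [image or video for _, image, _, video in matches]
print(links)
```

The outer groups (1 and 3) still include the leading src=" text, so the inner groups are the ones worth keeping.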

Demo on Regex Tester

#2


1  

If you really want efficient...

For starters, I would cut out the leading \S*? in the second regex. It serves no purpose apart from creating an opportunity for lots of backtracking.

src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
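
If you still want separate images and videos lists from the single pass, one possible approach is to iterate with finditer and check which group matched. This is only a sketch; the doc string is a hypothetical stand-in for the pastebin sample, and using lastindex to tell the branches apart is my choice, not something from the question:

```python
import re

# Hypothetical stand-in for the pastebin document.
doc = '''<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">
<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl" type="video/mp4">'''

pattern = re.compile(
    r'src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)'
    r'|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'
)

images, videos = [], []
for m in pattern.finditer(doc):
    # lastindex is 1 when the image branch matched, 2 when the video branch did.
    if m.lastindex == 1:
        images.append(m.group(1))
    else:
        videos.append(m.group(2))

print("images =", images)
print("videos =", videos)
```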

Other ideas

You can get rid of the capture groups by using a small lookbehind in the first alternative, which lets you drop all the capturing parentheses and directly match what you want. Not faster, but tidier:

(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
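
With no capture groups left, re.findall returns the matched strings themselves, so there are no tuples to unpack. A rough sketch, again assuming a made-up doc in place of the pastebin sample. Python's re module only accepts fixed-width lookbehinds, which src.\" satisfies (five characters):

```python
import re

# Hypothetical stand-in for the pastebin document.
doc = '''<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">
<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl" type="video/mp4">'''

pattern = (r'(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg'
           r'|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*')

# No groups in the pattern, so each item is the whole match as a string.
links = re.findall(pattern, doc)
print(links)
```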

Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
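
To see the difference, here is a small check; the test strings are invented purely for illustration:

```python
import re

loose = r'media.tumblr'    # "." matches any single character except a newline
strict = r'media\.tumblr'  # "\." matches only a literal period

loose_hit = bool(re.search(loose, 'mediaXtumblr'))        # unescaped dot accepts "X"
strict_miss = bool(re.search(strict, 'mediaXtumblr'))     # escaped dot rejects "X"
strict_hit = bool(re.search(strict, '37.media.tumblr.com'))

print(loose_hit, strict_miss, strict_hit)
```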

You can use the re.IGNORECASE option and drop the A-Z range from the character class:

(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
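
The flag has to be passed to the call, otherwise [a-z0-9]* stops at the first uppercase letter and the video URL gets cut short. A sketch under the assumption of a one-line hypothetical doc:

```python
import re

# Hypothetical snippet containing the video URL from the question.
doc = '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">'

pattern = (r'(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg'
           r'|http\S*?video_file\S*?tumblr_[a-z0-9]*')

with_flag = re.findall(pattern, doc, flags=re.IGNORECASE)
without_flag = re.findall(pattern, doc)  # truncated at the first uppercase letter

print(with_flag)
print(without_flag)
```

Keep in mind that re.IGNORECASE applies to the whole pattern, so SRC, HTTP and JPG in any casing will match too.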
