My script works fine doing this:
images = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall("\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character, but I always fail to match anything, so I'm clearly confused about how to use it properly.
2 Answers
#1
6
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
Demo on Regex Tester
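One thing to watch with this pattern in Python: it contains four capture groups, so re.findall returns a 4-tuple per match rather than a plain string. A minimal sketch of how the results could be split back into images and videos, assuming a made-up two-line sample document (the markup in doc is hypothetical; only the two URLs come from the question):

```python
import re

# Hypothetical sample markup; only the two URLs are taken from the question.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">\n'
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl" type="video/mp4">\n'
)

# The combined pattern from this answer, written as raw strings.
pattern = re.compile(
    r'(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))'
    r'|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))'
)

# Four groups means one 4-tuple per match; groups that did not
# take part in a match come back as empty strings.
matches = pattern.findall(doc)

images = [m[1] for m in matches if m[1]]  # group 2: image URL
videos = [m[3] for m in matches if m[3]]  # group 4: video URL

print(images)
print(videos)
```

The outer groups (indexes 0 and 2) only wrap each alternative, so they can be ignored or removed.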
在Regex Tester上演示
#2
1
If you really want efficient...
For starters, I would cut out the \S*? at the start of the second regex. It serves no purpose apart from creating an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
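A quick way to check that dropping the leading \S*? does not change what is captured, and to see what the two-group pattern returns. This is only a sketch: the markup in doc is a hypothetical stand-in for the real page, and it makes no claim about timing.

```python
import re

# Hypothetical sample markup; only the two URLs are taken from the question.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">\n'
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl" type="video/mp4">\n'
)

# The video regex with and without the leading \S*?
original = re.findall(r'\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)', doc)
trimmed = re.findall(r'(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)', doc)
print(original == trimmed)  # same captured URLs on this sample

# The combined pattern has two groups, so findall yields 2-tuples:
# (image_url, '') for an image match and ('', video_url) for a video match.
combined = re.compile(
    r'src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)'
    r'|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'
)
pairs = combined.findall(doc)
images = [img for img, vid in pairs if img]
videos = [vid for img, vid in pairs if vid]
```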
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first alternative, which lets you drop all the parentheses and match exactly what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
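With no capture groups left, re.findall returns the matched text itself, so the result is already a flat list of links. A small sketch, again assuming a hypothetical two-line doc built around the URLs from the question:

```python
import re

# Hypothetical sample markup; only the two URLs are taken from the question.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">\n'
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl" type="video/mp4">\n'
)

# The lookbehind is fixed-width (five characters), which Python's re requires.
pattern = re.compile(
    r'(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg'
    r'|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*'
)

# No groups, so each list item is the full match: one string per link.
links = pattern.findall(doc)
for link in links:
    print(link)
```

The trade-off is that images and videos arrive in one list, so telling them apart afterwards needs a check such as whether 'video_file' appears in the link.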
Do you intend the periods after src and media to mean "any character" or "a literal period"? If the latter, escape them: \.
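To illustrate the difference, here is a tiny sketch comparing an unescaped and an escaped period on the media.tumblr fragment. The test strings are invented for the example:

```python
import re

loose = re.compile(r'media.tumblr')    # "." matches any character except a newline
strict = re.compile(r'media\.tumblr')  # "\." matches only a literal period

# Both accept the intended text.
print(bool(loose.search('37.media.tumblr.com')))   # True
print(bool(strict.search('37.media.tumblr.com')))  # True

# Only the unescaped version accepts an unrelated character in that position.
print(bool(loose.search('mediaXtumblr')))   # True
print(bool(strict.search('mediaXtumblr')))  # False
```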
You can use the re.IGNORECASE option and drop the uppercase range from the character class:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
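Note that the shortened [a-z0-9] class only works together with the flag. A sketch of the difference, assuming the same kind of hypothetical sample doc as before (the video ID from the question contains uppercase letters):

```python
import re

# Hypothetical sample markup; only the two URLs are taken from the question.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">\n'
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl" type="video/mp4">\n'
)

pattern = (
    r'(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg'
    r'|http\S*?video_file\S*?tumblr_[a-z0-9]*'
)

# Without the flag, [a-z0-9]* stops at the first uppercase letter,
# so the video link is cut short.
case_sensitive = re.findall(pattern, doc)

# With re.IGNORECASE, [a-z0-9] also matches uppercase letters.
case_insensitive = re.findall(pattern, doc, flags=re.IGNORECASE)

print(case_sensitive[1])
print(case_insensitive[1])
```

The flag applies to the whole pattern, so literals such as src and jpg also become case-insensitive.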