I have the following code snippet from page source:
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder");
The string 'PDFObject(' is unique on the page. I want to retrieve the url value using a regex. In this case I need to get
http://www.site.com/doc55.pdf
Please help.
7 answers
#1
0
To find something that appears on a line after something else, the match has to cross the newline. For this you use the DOTALL modifier, a flag passed when the pattern is compiled, which lets the dot match newline characters too.
Thus the following code works:
import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder"); '''
print(r.findall(s))
Explanation:
r = re.compile(     compile the regular expression
r'                  start of a raw string, so backslashes are not treated as string escapes
(?<=PDFObject)      lookbehind: the match must start right after PDFObject
.*?                 then there may be some other characters...
url:                followed by the literal string url:
.*?                 then skip as little as possible (`?` makes the match non-greedy) up to the first
(http.*?)"          capture from http up to (but not including) the next "
',                  end of the regex string, but there's more...
re.DOTALL)          set the DOTALL flag, so the dot matches all characters
                    including newlines. This allows the match to continue from one line
                    to the next in the .*? right after the lookbehind
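The effect of the flag can be checked directly. The following is a minimal Python 3 sketch, assuming a shortened copy of the question's snippet (the variable names `page`, `without_flag` and `with_flag` are only illustrative):

```python
import re

# Shortened, hypothetical copy of the page source from the question.
page = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

pattern = r'(?<=PDFObject).*?url:.*?(http.*?)"'

# Without DOTALL the dot stops at the newline after "PDFObject({",
# so the pattern cannot reach the url line.
without_flag = re.findall(pattern, page)

# With DOTALL the dot also matches newlines, so the match spans lines.
with_flag = re.findall(pattern, page, re.DOTALL)

print(without_flag)
print(with_flag)
```

The first call should find nothing and the second should return the URL, which is why the flag matters for this kind of multi-line input.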
#2
3
Here is an alternative for solving your problem without using regex:
url, in_object = None, False
with open('input') as f:
    for line in f:
        # Start looking for "url:" once the PDFObject( marker has been seen
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            url = line.split('"')[1]
            break
print(url)
#3
0
Using a combination of look-behind and look-ahead assertions:
import re
re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'
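Note that the lookarounds on their own pick up the first `url:` anywhere in the text. One possible way to tie the search to the unique marker is to cut the text at 'PDFObject(' first. This is only a sketch: the extra `var other` line and the names `tail`, `first_url` and `pdf_url` are assumptions added for illustration.

```python
import re

# Hypothetical page source with an unrelated url: entry before the PDFObject call.
s = '''var other = { url: "http://www.site.com/other.pdf", };
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
}).embed("pdf_placeholder");'''

lookaround = r'(?<=url:).*?(?=",)'

# The bare lookaround pattern matches the first "url:" in the whole text.
first_url = re.search(lookaround, s).group().strip('" ')

# Slicing from the unique marker restricts the search to the PDFObject call.
# (If the marker were missing, find() would return -1; real code should check that.)
tail = s[s.find('PDFObject('):]
pdf_url = re.search(lookaround, tail).group().strip('" ')

print(first_url)
print(pdf_url)
```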
#4
0
This works:
import re
src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder"); '''
print([m.group(1).strip('"') for m in
       re.finditer(r'^url:\s*(.*)[\W]$',
                   re.search(r'PDFObject\(\{(.*)', src, re.M | re.S | re.I).group(1),
                   re.M | re.I)])
prints:
['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']
#6
0
If 'PDFObject(' is the unique identifier in the page, you only have to match the first quoted string that follows it.
Using the DOTALL flag (re.DOTALL or re.S) and the non-greedy star (*?), you can write:
import re
snippet = '''
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder");
'''
# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)
# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)
# The flag is already part of the compiled pattern; a second argument to search() would be a start position
RE_UNNAMED.search(snippet).group(1)
RE_NAMED.search(snippet).group('url')
# result for both: 'http://www.site.com/doc55.pdf'
If you don't want to compile your regex because it's used only once, simply use this syntax:
re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')
Four choices; one should match your needs and taste!
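All four variants call `.group(...)` directly on the result of the search, which raises an AttributeError when the page contains no 'PDFObject(' at all. A small wrapper can guard against that. This is a sketch only; the names `PDF_URL_RE` and `extract_pdf_url` are hypothetical and not part of the answer above.

```python
import re

# Same pattern as the named-group version, compiled once.
PDF_URL_RE = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)

def extract_pdf_url(source):
    """Return the first quoted string after 'PDFObject(', or None if absent."""
    match = PDF_URL_RE.search(source)
    return match.group('url') if match else None

sample = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

print(extract_pdf_url(sample))
print(extract_pdf_url('no embedded PDF here'))
```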
#7
0
Although the other answers may appear to work, most do not take into account that the only unique thing on the page is 'PDFObject('. A much better regular expression would be the following:
PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",
It anchors on the unique 'PDFObject(' and applies some basic URL validation. Note that it expects url: to follow PDFObject({ after exactly one whitespace character (here, the newline).
Below is an example of how this regex could be used in Python:
>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
... id: "pdfObjectContainer",
... width: "100%",
... height: "700px",
... pdfOpenParams: {
... navpanes: 0,
... statusbar: 1,
... toolbar: 1,
... view: "FitH"
... }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'
A pure Python (no regex) alternative would be:
>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'
A no-regex one-liner:
>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'
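Both no-regex versions rely on `str.find`, which returns -1 when the marker is missing and would then silently slice the wrong part of the string. A variant built on `str.partition` makes the missing case explicit. This is a minimal sketch; the function name `url_after_marker` is an assumption, and the default marker is simply the one used above.

```python
def url_after_marker(text, marker='PDFObject({\nurl: "'):
    """Return the text between the marker and the next double quote, or None."""
    # partition() returns (before, separator, after); the separator is ''
    # when the marker does not occur in the text.
    _, found, rest = text.partition(marker)
    if not found:
        return None
    url, closing_quote, _ = rest.partition('"')
    return url if closing_quote else None

strs = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

print(url_after_marker(strs))
print(url_after_marker('nothing to see here'))
```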