So I've decided to parse content from a website; for example, http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I want to parse the ingredients into a text file. The ingredients are located in:
<div class="ingredients" style="margin-top: 10px;">
and within this, each ingredient is stored between
<li class="plaincharacterwrap">
Someone was nice enough to provide code using regex, but it gets confusing when you're modifying it from site to site. So I wanted to use Beautiful Soup, since it has a lot of built-in features, except I'm confused about how to actually do it.
Code:
import re
import urllib2, sys
from BeautifulSoup import BeautifulSoup, NavigableString

try:
    html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
except IOError:
    # IOError comes from the fetch, not from soup.find()
    print 'IO error'
    sys.exit(1)

soup = BeautifulSoup(html)
# find the div that holds the ingredient list
ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
Is this roughly how you get started? I want to find the actual div class and then parse out all of the ingredients located within the li class.
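Something like this is what I'm imagining for the next step (an untested sketch; I'm guessing at the findAll call, and ingredients.txt is just a placeholder name):

# untested sketch: grab every <li class="plaincharacterwrap"> inside the div
items = ingrdiv.findAll('li', attrs={'class': 'plaincharacterwrap'})
with open('ingredients.txt', 'w') as outf:
    for li in items:
        outf.write(li.getText().strip() + '\n')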
Any help would be appreciated! Thanks!
2 Answers
#1 (4 votes)
import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()
results in
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
Follow-up response to @eyquem:
from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html
start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"
# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"
# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s - same =", (res2==res1)
# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s - same =", (res3==res1)
gives
Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s - same = True
lxml parse took 0.0100940499505 s - same = True
Regex is much faster (except when it's wrong), but if you consider loading the page and parsing it together, BeautifulSoup is still only about 20% of the total runtime. If you are terribly concerned about speed, I recommend lxml instead.
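To illustrate the "except when it's wrong" caveat with a made-up fragment: if the site switches quote style or drops the \r\n line breaks, the pattern silently returns nothing, while the parser-based versions keep working. A tiny sketch (the HTML string here is hypothetical):

import re
import BeautifulSoup

# hypothetical markup: same content, but single quotes and no \r\n
snippet = "<div class='ingredients'><li class='plaincharacterwrap'>1 cup chicken broth</li></div>"

pat = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
print pat.findall(snippet)        # [] -- the regex misses, silently

bs = BeautifulSoup.BeautifulSoup(snippet)
print [li.getText().strip() for li in bs.findAll('li')]   # [u'1 cup chicken broth']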
#2 (2 votes)
Yes, a special regex pattern must be written for every site.
But I think that:
1- the processing done with Beautiful Soup must be adapted to every site, too.
2- regexes are not so complicated to write, and with a little practice, it can be done quickly.
I am curious to see what kind of processing must be done with Beautiful Soup to obtain the same results that I obtained in a few minutes. Once upon a time, I tried to learn Beautiful Soup, but I couldn't make sense of the mess. I should try again; now I am a little more skilled in Python. But regexes have been OK and sufficient for me until now.
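To illustrate point 1 with a made-up example: on a site with different markup, the Beautiful Soup calls would have to change in just the same way the regex would (the ul/id structure below is hypothetical):

import BeautifulSoup

# hypothetical markup from some other site
page = '<ul id="ingredient-list"><li> 2 eggs </li><li> 1 cup flour </li></ul>'
bs = BeautifulSoup.BeautifulSoup(page)
ul = bs.find('ul', {'id': 'ingredient-list'})
print [li.getText().strip() for li in ul.findAll('li')]   # [u'2 eggs', u'1 cup flour']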
Here's the code for this new site:
import urllib
import re

url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

# jump to the ingredients section first, then scan with the regex from there
x = ch.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
print '\n'.join(patingr.findall(ch, x))
EDIT
I downloaded and installed BeautifulSoup and ran a comparison with regex.
I don't think I made any error in my comparison code:
import urllib
import re
from time import clock
import BeautifulSoup
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te
te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te
print res1
print
print res2
print
print 'res1==res2 is ',res1==res2
print '\nRegex :',t1
print '\nBeautifulSoup :',t2
print '\nBeautifulSoup execution time / Regex execution time ==',t2/t1
result
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
res1==res2 is True
Regex : 0.00210892725193
BeautifulSoup : 2.32453566026
BeautifulSoup execution time / Regex execution time == 1102.23605776
No comment!
EDIT 2
I realized that in my code I don't use a regex alone; I employ a method that uses a regex together with find().
It's the method I use when I resort to regexes, because in some cases it speeds up the processing, thanks to the function find(), which runs extremely rapidly.
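In general form, the method looks like this (a sketch; the helper name findall_after is mine):

import re

def findall_after(pat, text, marker):
    # str.find() runs extremely fast; use it to jump straight to the
    # region of interest, then let the regex scan only from that offset on
    pos = text.find(marker)
    if pos == -1:
        return []              # marker absent: no point scanning
    return pat.findall(text, pos)

# usage, with the pattern from the snippets below:
# patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
# print '\n'.join(findall_after(patingr, data, 'Ingredients</h3>'))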
To know what we are comparing, we need the following snippets.
In snippets 3 and 4, I took into account Achim's remarks from another thread of posts: using re.IGNORECASE and re.DOTALL, and ["\'] instead of ".
These snippets are kept separate because they must be executed in different files to obtain reliable results: I don't know why, but if all of them are executed in the same file, certain resulting times are strongly different (0.00075 instead of 0.0022, for example).
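For what it's worth, the stdlib timeit module is built for exactly this kind of micro-benchmark and sidesteps some of these effects; a sketch of how the first snippet could be timed with it (assuming Python 2.6+ and that data and patingr are defined at module level):

import timeit

# run the scan 100 times and average, instead of trusting a single clock() delta;
# the setup string pulls patingr and data in from the calling module
t = timeit.timeit("patingr.findall(data)",
                  setup="from __main__ import patingr, data",
                  number=100)
print "mean per run:", t / 100, "s"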
import urllib
import re
import BeautifulSoup
from time import clock
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
# Simple regex , without x
te = clock()
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res0 = '\n'.join(patingr.findall(data))
t0 = clock()-te
print '\nSimple regex , without x :',t0
and
# Simple regex , with x
te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te
print '\nSimple regex , with x :',t1
and
# Regex with flags , without x and y
te = clock()
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res10 = '\n'.join(patingr.findall(data))
t10 = clock()-te
print '\nRegex with flags , without x and y :',t10
and
# Regex with flags , with x and y
te = clock()
x = data.find('Ingredients</h3>')
y = data.find('h3>\r\n Footnotes</h3>\r\n')
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res11 = '\n'.join(patingr.findall(data,x,y))
t11 = clock()-te
print '\nRegex with flags , with x and y :',t11
and
# BeautifulSoup
te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te
print '\nBeautifulSoup :',t2
result
Simple regex , without x : 0.00230488284125
Simple regex , with x : 0.00229121279385
Regex with flags , without x and y : 0.00758719458758
Regex with flags , with x and y : 0.00183724493364
BeautifulSoup : 2.58728860791
The use of x has no influence on the speed for a simple regex.
The regex with flags, without x and y, takes longer to execute, and its result isn't the same as the others', because it catches a supplementary chunk of text. That's why, in a real application, it is the regex with flags and with x/y that should be used.
The more complicated regex with flags and with x and y runs about 20% faster.
Well, the results are not very much changed, with or without x/y.
So my conclusion is the same:
the use of a regex, resorting to find() or not, remains roughly 1000 times faster than BeautifulSoup, and I estimate it is 100 times faster than lxml (I didn't install lxml).
To what you wrote, Hugh, I would say:
When a regex is wrong, it is neither faster nor slower. It doesn't run.
When a regex is wrong, the coder makes it right, that's all.
I don't understand why 95% of the people on *.com want to persuade the other 5% that regexes must not be employed to analyse HTML or XML or anything else. I say "analyse", not "parse". As far as I understand it, a parser first analyses the WHOLE of a text and then displays the content of the elements we want. A regex, on the contrary, goes straight to what is searched for; it doesn't build a tree of the HTML/XML text, or do whatever else a parser does that I don't know very well.
So, I am very satisfied with regexes. I have no problem writing even very long REs, and regexes let me run programs that must react rapidly after analysing a text. BS or lxml would work, but that would be a hassle.
I would have other comments to make, but I have no time for a subject in which, in fact, I let others do as they prefer.