Parsing specific elements from a very large HTML file

Date: 2023-01-06 21:36:11

I have a very large HTML file (several megabytes). I know the data I want is under something like <div class=someName>here</div>

What is a good library to parse through the HTML page so I can loop through elements and grab each someName? I want to do this in either C#, Python or C++.

5 answers

#1


I would use Python and BeautifulSoup for the job. It is very solid at handling this kind of stuff.

For your case, you can use SoupStrainer to make BeautifulSoup parse only the DIVs in the document that have the class you want, so it doesn't have to hold a parse tree for the whole document in memory.

For example, say your document looks like this:

<div class="test">Hello World</div>
<div class="hello">Aloha World</div>
<div>Hey There</div>

You can write this:

>>> # BeautifulSoup 4 (package "bs4"); the old "BeautifulSoup" module is version 3 and Python 2 only
>>> from bs4 import BeautifulSoup, SoupStrainer
>>> doc = '''
...     <div class="test">Hello World</div>
...     <div class="hello">Aloha World</div>
...     <div>Hey There</div>
... '''
>>> findDivs = SoupStrainer('div', {'class': 'hello'})
>>> [tag for tag in BeautifulSoup(doc, 'html.parser', parse_only=findDivs)]
[<div class="hello">Aloha World</div>]
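If installing a third-party library is not an option, the same filtering idea can be sketched with Python's standard-library html.parser, which accepts input in pieces through feed(). This is only a rough sketch, not part of the original answer: the DivTextCollector class, the collect() helper, and the 16-character chunk size are all made up for illustration, and it keeps only the text of matching DIVs rather than full tag objects.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matching <div>
        self.parts = []     # text pieces of the match being read
        self.results = []   # text of every finished match

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested <div> inside a match
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # a new match starts here

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        # Text can arrive in several pieces, so it is buffered and joined later.
        if self.depth:
            self.parts.append(data)


def collect(chunks, target_class):
    """Feed an iterable of string chunks to the parser and return the matches."""
    parser = DivTextCollector(target_class)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.results


doc = '''
    <div class="test">Hello World</div>
    <div class="hello">Aloha World</div>
    <div>Hey There</div>
'''
# Split the sample into small pieces to mimic reading a large file chunk by chunk.
chunks = [doc[i:i + 16] for i in range(0, len(doc), 16)]
print(collect(chunks, "hello"))
```

For a real file, the chunks could come from repeated f.read(65536) calls on an open text file, so only the matched text and the current chunk need to stay in memory.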

#2


The Html Agility Pack is a stellar option if you want to use C#.

#3


Xerces is well documented, supported and tested. (C++)

http://xerces.apache.org/xerces-c/

(yes, it's an XML parser but it should do the trick)
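One caveat worth checking before going the XML-parser route: a strict XML parser accepts HTML only when the page happens to be well-formed XML. The sketch below uses Python's xml.etree.ElementTree purely as a stand-in for a strict XML parser (it is not the Xerces API), and both sample strings are invented for illustration. It suggests that the unquoted attribute from the question, class=someName, would be rejected.

```python
import xml.etree.ElementTree as ET

# Well-formed markup (quoted attributes, a single root element) parses fine.
well_formed = '<html><body><div class="someName">here</div></body></html>'
root = ET.fromstring(well_formed)
found = [d.text for d in root.iter("div") if d.get("class") == "someName"]
print(found)

# The unquoted attribute from the question is legal HTML but not legal XML.
tag_soup = '<html><body><div class=someName>here</div></body></html>'
try:
    ET.fromstring(tag_soup)
    parsed_ok = True
except ET.ParseError as exc:
    parsed_ok = False
    print("XML parser rejected the input:", exc)
```

If the real file contains unquoted attributes, unclosed tags, or similar tag soup, an HTML-aware parser is likely the safer choice.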

#4


Sounds like a case for good old regular expressions.

Input:

<div class="test">Hello World</div>
<div class="somename">Aloha World</div>
<div>Hey There</div>

RegEx:

\<div\sclass\=\"somename\"\>(?<Text>.*?)\<\/div\>

Yields:

Aloha World (note: In a single group named Text)

You will probably need to account for missing enclosing quotes, etc.

Although with regular expressions now you have two problems.
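For anyone trying the regex route from Python, here is a rough equivalent. Python's re module spells named groups as (?P<Text>...) and does not accept the (?<Text>...) form shown above. The sample input and the optional-quote handling are assumptions added for illustration, and the pattern still breaks on nested DIVs or extra attributes, which is the usual argument for a real parser.

```python
import re

# Hypothetical sample input: the answer's example plus an unquoted variant.
html = '''
<div class="test">Hello World</div>
<div class="somename">Aloha World</div>
<div class=somename>Unquoted World</div>
<div>Hey There</div>
'''

# (?P<Text>...) is Python's named-group syntax.
# The quotes around the class value are optional, to cover unquoted attributes.
pattern = re.compile(
    r'<div\s+class\s*=\s*["\']?somename["\']?\s*>(?P<Text>.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)

texts = [m.group("Text") for m in pattern.finditer(html)]
print(texts)
```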

#5


Give TinyXML a try. (C++ XML parser)
