I need to do some processing on fairly large XML files (large here meaning potentially upwards of a gigabyte) in C#, including some complex XPath queries. The problem is that the standard way I would normally do this, through the System.Xml libraries, loads the whole file into memory before doing anything with it, which can cause memory problems with files of this size.
I don't need to update the files at all, just read them and query the data they contain. Some of the XPath queries are quite involved and span several levels of parent-child relationships. I'm not sure whether this rules out a stream reader in favour of loading the data into memory as a block.
One way I can see of making it work is to perform the simple analysis using a stream-based approach and perhaps wrapping the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.
Alternatively, I know there are some elements that the XPath queries will never cross, so I guess I could break the document up into a series of smaller fragments based on its original tree structure, each perhaps small enough to process in memory without causing too much havoc.
I've tried to explain my objective here so if I'm barking up totally the wrong tree in terms of general approach I'm sure you folks can set me right...
10 solutions
#1
9
XPathReader is the answer. It isn't part of the C# runtime, but it is available for download from Microsoft. Here is an MSDN article.
If you construct an XPathReader with an XmlTextReader you get the efficiency of a streaming read with the convenience of XPath expressions.
I haven't used it on gigabyte-sized files, but I have used it on files of tens of megabytes, which is usually enough to slow down DOM-based solutions.
Quoting from the below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner".
Download from Microsoft
#2
2
Gigabyte XML files! I don't envy you this task.
Is there any way the files could be delivered in a better form? For example, if they are being sent to you over the net, a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea, but it could be very time-consuming.
I wouldn't try to do it all in memory by reading the entire file, unless you have a 64-bit OS and lots of memory. What if the file grows to 2, 3 or 4 GB?
One other approach could be to read in the XML file and use SAX to parse the file and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30MB files and it is very quick. I was originally going to use SAX but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250ms per query. I was using Java for my parsing but I suspect there would be little difference in .NET.
I did read that XML::Twig (a Perl CPAN module) was written explicitly to handle SAX-based XPath parsing. Can you use a different language?
This might also help https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044772.html
#3
2
http://msdn.microsoft.com/en-us/library/bb387013.aspx has a relevant example leveraging XStreamingElement.
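I have not reproduced the linked article's code here. What follows is only a rough sketch of the general pattern it is about: pull elements out of the input one at a time with an XmlReader, and let XStreamingElement write the result lazily so that neither the input tree nor the output tree is ever fully in memory. The element names "Item", "Summary" and "Name", the "name" attribute, and the StreamingTransform class are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class StreamingTransform
{
    // Lazily yields every element with the given local name as a small XElement.
    // Only one such element is materialised at a time.
    public static IEnumerable<XElement> StreamElements(TextReader input, string name)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == name)
                {
                    // ReadFrom consumes the whole element and leaves the reader on
                    // the node that follows it, so no extra Read() is needed here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Writes one <Name> per hypothetical <Item name="..."> element.
    // XStreamingElement enumerates the query only while Save is running.
    public static void Transform(TextReader input, TextWriter output)
    {
        var root = new XStreamingElement("Summary",
            from item in StreamElements(input, "Item")
            select new XElement("Name", (string)item.Attribute("name")));
        root.Save(output);
    }
}
```

For a real file you would pass something like File.OpenText("big.xml") and File.CreateText("summary.xml") (both file names are placeholders). Note that a matching element nested inside another matching element is returned as part of its parent, not separately.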
#4
1
You've outlined your choices already.
Either you need to abandon XPath and use XmlTextReader, or you need to break the document up into manageable chunks on which you can use XPath.
If you choose the latter, use XPathDocument; its read-only restriction allows better use of memory.
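To make the second option concrete, here is a minimal sketch, assuming the file consists of many repeated chunk elements and that no query needs to look outside a single chunk. The ChunkedXPath class, the chunk element name, and the query in the usage note are hypothetical. Each chunk is loaded into its own XPathDocument via ReadSubtree, queried, and then discarded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class ChunkedXPath
{
    // Runs the XPath expression against each chunk element in turn and
    // collects the string value of every node it selects.
    public static List<string> Query(TextReader input, string chunkName, string xpath)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // ReadToFollowing skips ahead to the next element named chunkName.
            while (reader.ReadToFollowing(chunkName))
            {
                // The subtree reader exposes just this one element as a document.
                using (XmlReader chunk = reader.ReadSubtree())
                {
                    var doc = new XPathDocument(chunk);
                    XPathNavigator nav = doc.CreateNavigator();
                    XPathNodeIterator it = nav.Select(xpath);
                    while (it.MoveNext())
                    {
                        results.Add(it.Current.Value);
                    }
                }
            }
        }
        return results;
    }
}
```

A call such as ChunkedXPath.Query(File.OpenText("orders.xml"), "order", "/order/item[@qty > 1]/@sku") would evaluate the query once per order element. Inside each chunk the chunk element is the document root, so paths have to be written relative to it rather than to the root of the original file.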
#5
1
To perform XPath queries with the standard .NET classes, the whole document tree needs to be loaded into memory, which might not be a good idea if it can take up to a gigabyte. IMHO XmlReader is a nice class for handling such tasks.
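As an illustration of the forward-only style, here is a small sketch that never builds a tree at all. It assumes hypothetical payment elements carrying an amount attribute; the ForwardOnlyScan class and those names are invented for the example and are not from the question.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ForwardOnlyScan
{
    // Sums the "amount" attribute of every <payment> element in one pass.
    // Memory use stays roughly constant regardless of file size.
    public static decimal SumAmounts(TextReader input)
    {
        decimal total = 0m;
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "payment")
                {
                    string raw = reader.GetAttribute("amount");
                    if (raw != null)
                    {
                        // XmlConvert parses XML Schema style decimals, independent of culture.
                        total += XmlConvert.ToDecimal(raw);
                    }
                }
            }
        }
        return total;
    }
}
```

The trade-off is that any parent-child logic an XPath expression would express for you has to be tracked by hand, for example with a stack of element names as the reader descends.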
#6
1
It seems that you have already tried XPathDocument and could not accommodate the parsed XML document in memory.
If this is the case, before starting to split the file (which is ultimately the right decision!) you may try the Saxon XSLT/XQuery processor. It has a very efficient in-memory representation of a loaded XML document (the "tinytree" model). In addition, Saxon SA (the schema-aware version, which isn't free) has some streaming extensions. Read more about this here.
#7
1
How about just reading the whole thing into a database and then working with the temp database? That might be better, because your queries could then be done more efficiently using T-SQL.
#8
1
I think the best solution is to write your own XML parser that reads small chunks rather than the whole file, or to split the large file into small files and use the .NET classes on those. The problem is that you cannot parse some of the data until all of it is available, so I recommend using your own parser rather than the .NET classes.
#9
0
Have you tried XPathDocument? This class is optimized for handling XPath queries efficiently.
If you cannot handle your input documents efficiently using XPathDocument you might consider preprocessing and/or splitting up your input documents using an XmlReader.
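One possible shape for the splitting step, as a sketch only: copy a fixed number of repeated elements into each output part with XmlReader and XmlWriter.WriteNode. The XmlSplitter class, the "part" wrapper element, and the openPart callback are all assumptions made for this example; a real split would follow whatever logical boundaries your queries respect.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlSplitter
{
    // Copies elements named elementName into successive outputs, perPart per output,
    // each wrapped in a hypothetical <part> root. Returns the number of parts written.
    public static int Split(TextReader input, string elementName, int perPart,
                            Func<int, TextWriter> openPart)
    {
        int parts = 0, inPart = 0;
        XmlWriter writer = null;
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true, CloseOutput = true };
        try
        {
            using (XmlReader reader = XmlReader.Create(input))
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
                    {
                        if (writer == null)
                        {
                            writer = XmlWriter.Create(openPart(parts++), settings);
                            writer.WriteStartElement("part");
                        }
                        // WriteNode copies the whole element and advances the reader past it.
                        writer.WriteNode(reader, false);
                        if (++inPart == perPart)
                        {
                            writer.WriteEndElement();
                            writer.Dispose();
                            writer = null;
                            inPart = 0;
                        }
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
            if (writer != null) writer.WriteEndElement(); // close the last, partly filled part
        }
        finally
        {
            writer?.Dispose();
        }
        return parts;
    }
}
```

With files, the callback could be something like i => File.CreateText("part" + i + ".xml") (a placeholder naming scheme), after which each part is small enough to load with XPathDocument.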
#10
0
Since the data size in your case can run to gigabytes, have you considered using ADO.NET with the XML as a database? The memory footprint would also not be huge.
Another approach would be to use LINQ to XML with elements like XElementStream. Hope this helps.