Best way to feed files into XPath

Time: 2021-01-09 20:09:14

I'm using XPath to read XML files. The file size is not known in advance (between 700 KB and 2 MB), and I have to read around 100 files per second. So I want a fast way to load the files and query them with XPath.

I tried Java NIO file channels and memory-mapped files, but they were hard to use with XPath. Can someone suggest a way to do it?

3 answers

#1


1  

A lot depends on what the XPath expressions are doing. There are four costs here: basic I/O to read the files, XML parsing, tree building, and XPath evaluation. (Plus a possible fifth, generating the output, but you haven't mentioned what the output might be.) From your description we have no way of knowing which factor is dominant. The first step in performance improvement is always measurement, and my first step would be to try to measure the contribution of each of these four factors.
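As a rough illustration of that measurement step, here is a minimal sketch using only the JAXP classes shipped with the JDK. The class name `XPathCostBreakdown` and the method `measure` are made up for this example. It assumes the file fits in memory, and that a SAX parse with a do-nothing handler is a fair stand-in for "parsing without tree building". A single run like this is noisy (JIT warm-up, disk cache), so treat the numbers as a first hint, not a benchmark.

```java
import java.io.ByteArrayInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.helpers.DefaultHandler;

public class XPathCostBreakdown {

    /**
     * Returns elapsed nanoseconds for, in order:
     * file I/O, parsing only, parsing plus DOM tree building, XPath evaluation.
     */
    static long[] measure(Path file, String expression) throws Exception {
        // Create the parsers up front so factory lookup is not counted in the timings.
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XPath xpath = XPathFactory.newInstance().newXPath();

        long t0 = System.nanoTime();
        byte[] bytes = Files.readAllBytes(file);                  // 1. basic I/O
        long t1 = System.nanoTime();

        // 2. parsing alone: DefaultHandler ignores every event, so no tree is built
        saxParser.parse(new ByteArrayInputStream(bytes), new DefaultHandler());
        long t2 = System.nanoTime();

        // 3. parsing again, this time building a DOM tree
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        long t3 = System.nanoTime();

        // 4. XPath evaluation against the in-memory tree
        xpath.evaluate(expression, doc, XPathConstants.NODESET);
        long t4 = System.nanoTime();

        return new long[] { t1 - t0, t2 - t1, t3 - t2, t4 - t3 };
    }

    public static void main(String[] args) throws Exception {
        // Usage: java XPathCostBreakdown <file.xml> <xpath-expression>
        long[] t = measure(Paths.get(args[0]), args[1]);
        System.out.printf("io=%dus parse=%dus parse+tree=%dus xpath=%dus%n",
                t[0] / 1000, t[1] / 1000, t[2] / 1000, t[3] / 1000);
    }
}
```

Subtracting the second figure from the third gives an estimate of the tree-building cost on its own.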

If you're in an environment with multiple processors (and who isn't?) then parallel execution would make sense. You may get this "for free" if you can organize the processing using the collection() function in Saxon-EE.

#2


0  

If I were you, I would probably drop Java altogether in this case, not because you can't do it in Java, but because a bash script (if you are on Unix) is going to be faster. At least that is what my experience dealing with lots of files tells me.

On *nix there is a utility called xpath for exactly that.

Since you are doing lots of I/O operations, having a decent SSD would help far more than doing the work in separate threads. You still need multiple threads, but no more than one per CPU.
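For readers who stay with Java, the "no more than one thread per CPU" advice could be sketched as below. This is only an illustration: `ParallelXPath`, `evaluateAll`, and `evaluateOne` are hypothetical names, and it uses the JDK's built-in JAXP classes. Each task creates its own `DocumentBuilder` and `XPath`, because those objects are not guaranteed to be thread-safe.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class ParallelXPath {

    /** Evaluates one XPath expression against every file, using one worker thread per CPU. */
    static Map<Path, String> evaluateAll(List<Path> files, String expression) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every file first so the pool stays busy, then collect results in order.
            Map<Path, Future<String>> pending = new LinkedHashMap<>();
            for (Path file : files) {
                pending.put(file, pool.submit(() -> evaluateOne(file, expression)));
            }
            Map<Path, String> results = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<String>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    /** Parses a single file into a DOM tree and returns the string value of the expression. */
    static String evaluateOne(Path file, String expression) throws Exception {
        // Fresh parser and XPath objects per task: they must not be shared across threads.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XPath xpath = XPathFactory.newInstance().newXPath();
        try (InputStream in = Files.newInputStream(file)) {
            Document doc = builder.parse(in);
            return xpath.evaluate(expression, doc);
        }
    }
}
```

Creating the factories for every file is itself not cheap. If measurement shows that it matters, keeping one builder per worker thread (for example in a `ThreadLocal`) is a common refinement.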

#3


-1  

If you want performance, I would simply drop XPath altogether and use a SAX parser to read the files. You can search this site for "SAX vs XPath vs DOM" questions to get more details. Here is one: "Is XPath much more efficient as compared to DOM and SAX?"
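To give an idea of what the SAX route looks like, here is a minimal sketch that collects the text of every element with a given name, roughly what an expression such as `//title` would select. The names `SaxExtract` and `textOf` are invented for this example. It assumes the target element is not nested inside another element of the same name, and it uses the JDK's default parser, which is not namespace-aware.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExtract {

    /** Streams through the document and returns the text content of each matching element. */
    static List<String> textOf(InputStream in, String elementName) throws Exception {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a matching element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals(elementName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node, so append instead of assigning.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && qName.equals(elementName)) {
                    found.add(current.toString());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return found;
    }
}
```

No tree is built, so memory use stays flat regardless of file size. The trade-off is that anything beyond simple matching (predicates, parent or sibling axes) has to be tracked by hand in the handler.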
