I have to implement a search feature that can quickly run arbitrarily complex queries against XML data. When the user makes a query, all XML files must be searched for possible matches. Users will have lots of XML files (tens of thousands or more), each typically a few kilobytes in size. All the XML files have almost the same structure.
I already benchmarked XPath; it is too slow for my needs.
How can this be done most efficiently? Is it possible to create indexes for the contents of the XML files (preserving content semantics, not just plain full-text search)?
Would it be useful to put the XML data into an (embedded) SQL database and run the queries with SQL?
What other possibilities do I have?
4 Solutions
#1
0
Don't try to reinvent the wheel!
I would import the XML into a database (e.g. SQLite), together with metadata about each XML file, and query that.
Edit 1:
You could implement a 'drop folder' whose contents are indexed/imported on first run. A folder watcher can then be implemented to re-import ONLY new or changed XML files. SQLite can be run in memory for the fastest I/O performance.
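As a rough illustration of the import-and-query idea, here is a minimal sketch in Python using the standard `sqlite3` and `xml.etree.ElementTree` modules. The `<order>` sample documents, the `node` table layout, and the helper names (`flatten`, `import_file`, `files_where`) are assumptions made up for this example, not part of the question. Each element's text is stored under its element path so that queries keep the XML semantics, and re-importing a file replaces only that file's rows, which is what a folder watcher would call for a changed file.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample documents; in practice these would be read from the drop folder.
SAMPLE_FILES = {
    "a.xml": "<order><customer>Alice</customer><total>120.50</total></order>",
    "b.xml": "<order><customer>Bob</customer><total>80.00</total></order>",
}


def flatten(elem, path=""):
    """Yield (element path, text) for every element that has non-empty text."""
    current = f"{path}/{elem.tag}"
    text = (elem.text or "").strip()
    if text:
        yield current, text
    for child in elem:
        yield from flatten(child, current)


def open_database():
    """Create an in-memory SQLite database with one row per text-bearing element."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (file TEXT, path TEXT, value TEXT)")
    # The index lets lookups by (path, value) avoid a full table scan.
    conn.execute("CREATE INDEX idx_node_path_value ON node (path, value)")
    return conn


def import_file(conn, name, xml_text):
    """(Re-)import a single file: drop its old rows, then insert the new ones."""
    root = ET.fromstring(xml_text)
    conn.execute("DELETE FROM node WHERE file = ?", (name,))
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?)",
        [(name, path, value) for path, value in flatten(root)],
    )
    conn.commit()


def files_where(conn, path, value):
    """Return the names of files that contain `value` at element path `path`."""
    rows = conn.execute(
        "SELECT DISTINCT file FROM node WHERE path = ? AND value = ? ORDER BY file",
        (path, value),
    )
    return [row[0] for row in rows]


db = open_database()
for file_name, content in SAMPLE_FILES.items():
    import_file(db, file_name, content)

print(files_where(db, "/order/customer", "Alice"))  # ['a.xml']
```

A generic path/value table like this works for any document shape. If the files really do share one structure, a table with one typed column per field would probably make numeric and range queries easier.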
#2
0
The fastest way is to build your own in-memory model of the data in the XML files: convert it to simple objects and simple types, and organize it in the structure that best suits your queries. Add indexes as appropriate for your problem (using Dictionary/SortedDictionary). This approach will be significantly faster than using an SQL database, and an SQL database will in turn be a lot faster than querying each XML file. Depending on the complexity of your queries, this can range from fairly simple to very hard, in which case you should definitely go for an embedded database.
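To make the in-memory approach more concrete, here is a small sketch in Python (the answer refers to .NET collections; a plain `dict` stands in for Dictionary, and a sorted list with `bisect` stands in for SortedDictionary). The `Order` type, its fields, and the sample documents are assumptions invented for illustration, so adapt them to the real structure of your files.

```python
import bisect
import xml.etree.ElementTree as ET
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Simple typed object extracted from one (hypothetical) XML file."""
    file: str
    customer: str
    total: float


# Hypothetical sample documents standing in for the files on disk.
SAMPLE_ORDERS = {
    "a.xml": "<order><customer>Alice</customer><total>120.50</total></order>",
    "b.xml": "<order><customer>Bob</customer><total>80.00</total></order>",
    "c.xml": "<order><customer>Alice</customer><total>15.25</total></order>",
}


def load_orders(files):
    """Parse each document once and convert it to a plain object."""
    orders = []
    for name, xml_text in files.items():
        root = ET.fromstring(xml_text)
        orders.append(
            Order(name, root.findtext("customer"), float(root.findtext("total")))
        )
    return orders


class OrderIndex:
    """Hash index for exact matches plus a sorted index for range queries."""

    def __init__(self, orders):
        # Exact-match index: customer name -> list of orders.
        self.by_customer = defaultdict(list)
        for order in orders:
            self.by_customer[order.customer].append(order)
        # Range index: orders sorted by total, with a parallel list of keys.
        self._by_total = sorted(orders, key=lambda o: o.total)
        self._totals = [o.total for o in self._by_total]

    def with_customer(self, customer):
        return self.by_customer.get(customer, [])

    def total_between(self, low, high):
        """Return orders with low <= total <= high, found by binary search."""
        start = bisect.bisect_left(self._totals, low)
        end = bisect.bisect_right(self._totals, high)
        return self._by_total[start:end]


order_index = OrderIndex(load_orders(SAMPLE_ORDERS))
print([o.file for o in order_index.with_customer("Alice")])  # ['a.xml', 'c.xml']
print([o.file for o in order_index.total_between(50, 150)])  # ['b.xml', 'a.xml']
```

Combined conditions can then be answered by intersecting the results of several index lookups. The XML is parsed only once, at load time.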
#3
0
SQL Server 2005 and later allow you to create XML indexes. Queries can be run on the server, without retrieving the XML data on the application side. This feature is available in the free Express edition.
#4
0
For indexing the contents of the XML: use Lucene (via a .NET-based implementation of it). This lets you quickly retrieve the XML documents that contain specific values; you can then examine those documents more closely.
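The following is not Lucene and does not use its API. It is only a toy inverted index in Python, meant to show the underlying idea of mapping (field, term) pairs to the documents that contain them. Using the element name as the field is an assumption chosen here to keep some of the XML semantics; the sample documents and function names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample documents.
SAMPLE_NOTES = {
    "a.xml": "<note><title>Quarterly report</title><body>Revenue grew in Q3</body></note>",
    "b.xml": "<note><title>Meeting minutes</title><body>Discussed the quarterly budget</body></note>",
}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each (element name, token) pair to the set of files containing it."""
    index = defaultdict(set)
    for name, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            for token in tokenize(elem.text or ""):
                index[(elem.tag, token)].add(name)
    return index


def search(index, field, *terms):
    """Return the files whose `field` element contains every one of `terms`."""
    postings = [index.get((field, term.lower()), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


note_index = build_index(SAMPLE_NOTES)
print(search(note_index, "title", "quarterly"))           # ['a.xml']
print(search(note_index, "body", "quarterly", "budget"))  # ['b.xml']
```

The lookup returns candidate files only. Those candidates can then be opened and checked with a more precise (and slower) query such as XPath.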