What is the fastest way to search through XML?

Time: 2021-09-13 19:13:32

Suppose I have an XML file that I use as a local database, like this:

<root>
 <address>
  <firstName></firstName>
  <lastName></lastName>
  <phone></phone>
 </address>
</root>

I have a couple of questions:
1. What is the fastest way to find the address (or addresses) in the XML where firstName contains 'er', for example?
2. Is it possible to do this without loading the whole XML file into memory?

P.S. I am not looking for alternatives to an XML file. Ideally I need a search whose cost does not depend on the number of addresses in the file. But I am a realist, and it seems to me that this is not possible.

Update: I am using .NET 4.
Thanks for the suggestions, but this is more of a scientific task than a practical one. I am probably looking for something faster than LINQ and XmlTextReader.

6 solutions

#1


7  

LINQ to XML works quite well:

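// Requires: using System.Linq; using System.Xml.Linq;
XDocument doc = XDocument.Load("myfile.xml");
var addresses = from address in doc.Root.Elements("address")
                // The cast returns null for a missing <firstName> instead of throwing.
                let firstName = (string)address.Element("firstName")
                where firstName != null && firstName.Contains("er")
                select address;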

UPDATE: Take a look at this question on Stack Overflow: "Best way to search data in xml files?"

Marc Gravell's accepted answer there suggests using SQL indexing:

First: how big are the xml files? XmlDocument doesn't scale to "huge"... but can handle "large" OK.

Second: can you perhaps put the data into a regular database structure (perhaps SQL Server Express Edition), index it, and access via regular TSQL? That will usually out-perform an xpath search. Equally, if it is structured, SQL Server 2005 and above supports the xml data-type, which shreds data - this allows you to index and query xml data in the database without having the entire DOM in memory (it translates xpath into relational queries).

UPDATE 2: See also another link from that question, which explains how the structure of the XML affects performance: http://www.15seconds.com/issue/010410.htm

#2


2  

If you have .NET 3.5+, consider using LINQ To XML.

Some sample code to give you an idea (lifted and liberally modified from the article):

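// `customer` is assumed to be an XElement or XDocument loaded elsewhere.
IEnumerable<string> addresses =
    from inv in customer.Descendants("Invoice")
    // XAttribute has no StartsWith; cast to string first (null if the attribute is missing).
    let productName = (string) inv.Attribute("ProductName")
    where productName != null && productName.StartsWith("er")
    select (string) inv.Attribute("StreetAddress");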

#3


1  

You can use XmlTextReader if you don't want to read the whole file into memory. Such a solution will probably run faster, but it will involve more coding.
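
As a rough illustration of the streaming idea, here is a minimal sketch that walks the file with XmlReader and materializes one <address> at a time as an XElement, so the whole document is never held in memory. The class and method names (AddressStream, ReadAddresses, FindByFirstName) are made up for this example, and the match is assumed to be a case-sensitive substring test on firstName. The scan is still linear in the number of addresses; only the memory use is bounded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class AddressStream
{
    // Yields each <address> element in turn; earlier ones can be garbage
    // collected, so memory use does not grow with the file size.
    public static IEnumerable<XElement> ReadAddresses(XmlReader reader)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "address")
            {
                // ReadFrom consumes the whole element and leaves the reader on
                // the node that follows it, so no extra Read() is needed here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }

    // Hypothetical helper: case-sensitive substring match on <firstName>.
    public static IEnumerable<XElement> FindByFirstName(XmlReader reader, string term)
    {
        return ReadAddresses(reader).Where(a =>
        {
            string name = (string)a.Element("firstName");
            return name != null && name.Contains(term);
        });
    }
}
```

It could be called with something like `using (var reader = XmlReader.Create("myfile.xml"))` and a foreach over `AddressStream.FindByFirstName(reader, "er")`. The results are produced lazily, so they have to be consumed before the reader is disposed.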

#4


1  

I'm worried you might want to optimize something that might not need it. How many email addresses are we talking about? Most of the time you would read in the input and build a structure that supports the kind of queries you will be running.

There are trees that can return the kind of results you are looking for in O(log n) time. And you can store a ton of addresses in even a small amount of memory.
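
To make "build a structure that supports the queries" concrete, here is one possible sketch. It is not the tree this answer has in mind: it swaps in a simple bigram (two-character) inverted index, which suits "contains" queries. The class name FirstNameIndex is hypothetical, matching is assumed to be case-sensitive, and the addresses are assumed to have been loaded once up front (so this does not help with question 2). A lookup then only inspects addresses whose firstName contains the first two characters of the search term.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public sealed class FirstNameIndex
{
    // Maps every two-character substring of firstName to the addresses containing it.
    private readonly Dictionary<string, List<XElement>> _byBigram =
        new Dictionary<string, List<XElement>>(StringComparer.Ordinal);
    private readonly List<XElement> _all = new List<XElement>();

    public FirstNameIndex(IEnumerable<XElement> addresses)
    {
        foreach (var address in addresses)
        {
            _all.Add(address);
            string name = (string)address.Element("firstName") ?? "";
            var seen = new HashSet<string>();
            for (int i = 0; i + 2 <= name.Length; i++)
            {
                string gram = name.Substring(i, 2);
                if (!seen.Add(gram)) continue; // list each address once per bigram
                List<XElement> bucket;
                if (!_byBigram.TryGetValue(gram, out bucket))
                    _byBigram[gram] = bucket = new List<XElement>();
                bucket.Add(address);
            }
        }
    }

    public IEnumerable<XElement> Search(string term)
    {
        // Terms shorter than a bigram cannot use the index: fall back to a full scan.
        if (term.Length < 2)
            return _all.Where(a => Matches(a, term));

        // Any name containing the term also contains the term's first bigram,
        // so only that bucket needs to be verified.
        List<XElement> bucket;
        if (!_byBigram.TryGetValue(term.Substring(0, 2), out bucket))
            return Enumerable.Empty<XElement>();
        return bucket.Where(a => Matches(a, term));
    }

    private static bool Matches(XElement address, string term)
    {
        string name = (string)address.Element("firstName") ?? "";
        return name.IndexOf(term, StringComparison.Ordinal) >= 0;
    }
}
```

Building the index is a one-time pass over all addresses. After that, the cost of a search depends on how common the term's first bigram is rather than on the total number of addresses, which is about as close as this approach gets to the "independent of count" goal in the question.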

#5


1  

If you really need to avoid doing this on the server side, you can do it with regular expressions. But I think loading the XML into memory would be faster.

#6


1  

What about XmlReader? I think it could be the fastest way.

I tried it on a file of approximately 110 MB and it took about 1.1 sec. The same file with LINQ to XML (above) takes about 3 sec.

// Requires: using System; using System.Diagnostics; using System.Xml;
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
XmlReader reader = XmlReader.Create("C:\\Temp\\items.xml", settings);

String firstName = "", lastName = "", phone = "";
String lastTagName = "";
Boolean bItemFound = false;
long nCounter = 0;

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();

reader.MoveToContent();
// Parse the file and display each of the nodes.
while (reader.Read())
{
    switch (reader.NodeType)
    {
        case XmlNodeType.Element:
            //Console.Write("<{0}>", reader.Name);

            lastTagName = reader.Name;

            if (lastTagName ==  "address")
                nCounter++;

            break;
        case XmlNodeType.Text:
            //Console.Write(reader.Value);
            switch (lastTagName)
            {
                case "firstName":
                    firstName = reader.Value.ToString();
                    bItemFound = firstName.Contains("97331");
                    break;
                case "lastName":
                    lastName = reader.Value.ToString();
                    break;
                case "phone":
                    phone = reader.Value.ToString();
                    break;
            }
            break;
        case XmlNodeType.CDATA:
            //Console.Write("<![CDATA[{0}]]>", reader.Value);
            break;
        case XmlNodeType.ProcessingInstruction:
            //Console.Write("<?{0} {1}?>", reader.Name, reader.Value);
            break;
        case XmlNodeType.Comment:
            //Console.Write("<!--{0}-->", reader.Value);
            break;
        case XmlNodeType.XmlDeclaration:
            //Console.Write("<?xml version='1.0'?>");
            break;
        case XmlNodeType.Document:
        case XmlNodeType.DocumentType:
            //Console.Write("<!DOCTYPE {0} [{1}]", reader.Name, reader.Value);
            break;
        case XmlNodeType.EntityReference:
            //Console.Write(reader.Name);
            break;
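        case XmlNodeType.EndElement:
            //Console.Write("</{0}>", reader.Name);
            // Report at the end of each address, once all of its fields have
            // been read; otherwise lastName and phone would still hold the
            // values of the previous address.
            if (reader.Name == "address")
            {
                if (bItemFound)
                    Console.Write("{0}\n{1}\n{2}\n", firstName, lastName, phone);
                firstName = lastName = phone = "";
                bItemFound = false;
            }
            break;
    }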
}

stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
    ts.Hours, ts.Minutes, ts.Seconds,
    ts.Milliseconds / 10);
Console.WriteLine("RunTime " + elapsedTime);
Console.WriteLine("Searched items: {0}", nCounter);

Console.ReadKey();
