How can I improve the performance of DataSet.ReadXml if I'm using a schema?

I have an ADO DataSet that I'm loading from its XML file via ReadXml. The data and the schema are in separate files.

Right now, it takes close to 13 seconds to load this DataSet. I can cut this to 700 milliseconds if I don't read the DataSet's schema and just let ReadXml infer the schema, but then the resulting DataSet doesn't contain any constraints.

I've tried doing this:

// Needs System, System.Data and System.Diagnostics; xsdPath and xmlPath are defined elsewhere.
DataSet ds = new DataSet();
Stopwatch sw = Stopwatch.StartNew();
Console.WriteLine("Reading dataset with external schema.");
ds.ReadXmlSchema(xsdPath);
Console.WriteLine("Reading the schema took {0} milliseconds.", sw.ElapsedMilliseconds);
// Suspend constraints, notifications and index maintenance on every table.
foreach (DataTable dt in ds.Tables)
{
   dt.BeginLoadData();
}
ds.ReadXml(xmlPath);
Console.WriteLine("ReadXml completed after {0} milliseconds.", sw.ElapsedMilliseconds);
foreach (DataTable dt in ds.Tables)
{
   dt.EndLoadData();
}
Console.WriteLine("Process complete at {0} milliseconds.", sw.ElapsedMilliseconds);

When I do this, reading the schema takes 27ms, and reading the DataSet takes 12000+ milliseconds. And that's the time reported before I call EndLoadData on all the DataTables.

This is not an enormous amount of data: it's about 1.5 MB, there are no nested relations, and all of the tables contain two or three columns of 6-30 characters. The only difference I can think of when I read the schema up front is that the schema includes all of the unique constraints. But BeginLoadData is supposed to turn constraints off (as well as change notification, etc.), so that shouldn't apply here. (And yes, I've tried just setting EnforceConstraints to false.)
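
For reference, here is a minimal sketch of the EnforceConstraints variant mentioned above. The class and method names are hypothetical, and it assumes the same xsdPath and xmlPath as the snippet in the question. It only shows the shape of the attempt; it is not a fix.

```csharp
using System.Data;

static class ConstraintToggleSketch
{
    // Hypothetical helper: loads the schema, switches constraint
    // enforcement off for the duration of ReadXml, then switches it back on.
    public static DataSet LoadWithConstraintsOff(string xsdPath, string xmlPath)
    {
        var ds = new DataSet();
        ds.ReadXmlSchema(xsdPath);

        // Unique and foreign-key constraints stay defined but are not checked.
        ds.EnforceConstraints = false;
        ds.ReadXml(xmlPath);

        // Re-enabling validates all rows at once and throws a
        // ConstraintException if the loaded data violates a constraint.
        ds.EnforceConstraints = true;
        return ds;
    }
}
```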

I've read many reports of people improving the load time of DataSets by reading the schema first instead of having the object infer the schema. In my case, inferring the schema makes for a process that's about 20 times faster than having the schema provided explicitly.

This is making me a little crazy. This DataSet's schema is generated from metainformation, and I'm tempted to write a method that creates it programmatically and just deserializes it with an XmlReader. But I'd much prefer not to.

What am I missing? What else can I do to improve the speed here?

3 solutions

#1

I will try to give you a performance comparison between storing data in plain text files and in XML files.

The first function creates two files: one with 1000000 records in plain text and one with the same 1000000 records in XML. The first thing to notice is the difference in file size: ~64 MB (plain text) vs ~102 MB (XML).

void create_files()
    {
        //create text file with data
        StreamWriter sr = new StreamWriter("plain_text.txt");

        for(int i=0;i<1000000;i++)
        {
            sr.WriteLine(i.ToString() + "<SEP>" + "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbb" + i.ToString());
        }

        sr.Flush();
        sr.Close();

        //create xml file with data
        DataSet ds = new DataSet("DS1");

        DataTable dt = new DataTable("T1");

        DataColumn c1 = new DataColumn("c1", typeof(int));
        DataColumn c2 = new DataColumn("c2", typeof(string));

        dt.Columns.Add(c1);
        dt.Columns.Add(c2);

        ds.Tables.Add(dt);

        DataRow dr;

        for(int j=0; j< 1000000; j++)
        {
            dr = dt.NewRow();
            dr[0]=j;
            dr[1] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbb" + j.ToString();
            dt.Rows.Add(dr);
        }

        ds.WriteXml("xml_text.xml");

    }

The second function reads these two files: first it reads the plain text into a dictionary (just to simulate real-world use), and then it reads the XML file. Both steps are timed in milliseconds, and the results are written to the debug output:

Start read Text file into memory
Text file loaded into memory in 7628 milliseconds
Start read XML file into memory
XML file loaded into memory in 21018 milliseconds

void read_files()
    {

        //timers
        Stopwatch stw = new Stopwatch();
        long milliseconds;

        //read text file in a dictionary

        Debug.WriteLine("Start read Text file into memory");

        stw.Start();
        milliseconds = 0;

        StreamReader sr = new StreamReader("plain_text.txt");
        Dictionary<int, string> dict = new Dictionary<int, string>(1000000);
        string line;
        string[] sep = new string[]{"<SEP>"};
        string [] arValues;
        while (sr.EndOfStream!=true) 
        {
            line = sr.ReadLine();
            arValues = line.Split(sep,StringSplitOptions.None);
            dict.Add(Convert.ToInt32(arValues[0]),arValues[1]);
        }

        stw.Stop();
        milliseconds = stw.ElapsedMilliseconds;

        Debug.WriteLine("Text file loaded into memory in " + milliseconds.ToString() + " milliseconds" );



        //create xml structure
        DataSet ds = new DataSet("DS1");

        DataTable dt = new DataTable("T1");

        DataColumn c1 = new DataColumn("c1", typeof(int));
        DataColumn c2 = new DataColumn("c2", typeof(string));

        dt.Columns.Add(c1);
        dt.Columns.Add(c2);

        ds.Tables.Add(dt);

        //read xml file

        Debug.WriteLine("Start read XML file into memory");

        stw.Restart();
        milliseconds = 0;

        ds.ReadXml("xml_text.xml");

        stw.Stop();
        milliseconds = stw.ElapsedMilliseconds;

        Debug.WriteLine("XML file loaded into memory in " + milliseconds.ToString() + " milliseconds");

    }

Conclusion: the XML file is about 1.6 times the size of the text file (~102 MB vs ~64 MB) and takes nearly three times as long to load (21018 ms vs 7628 ms).

XML handling is more convenient than plain text (because of the higher level of abstraction), but it costs more CPU time and disk I/O.

So, if your files are small and the performance is acceptable, XML DataSets are more than OK. But if you need performance, I don't know whether an XML DataSet (loaded by any of the available methods) can be faster than plain text files. It basically comes down to the first point: the XML file is bigger because it carries more tags.

#2

It's not an answer, exactly (though it's better than nothing, which is what I've gotten so far), but after a long time struggling with this problem I discovered that it's completely absent when my program's not running inside Visual Studio.

Something I didn't mention before, which makes this even more mystifying, is that when I loaded a different (but comparably large) XML document into the DataSet, the program performed just fine. I'm now wondering if one of my DataSets has some kind of metainformation attached to it that Visual Studio is checking at runtime while the other one doesn't. I dunno.
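
One way to make the inside/outside Visual Studio difference visible is to record whether a debugger is attached alongside each timing. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes the xsdPath and xmlPath from the question.

```csharp
using System;
using System.Data;
using System.Diagnostics;

static class LoadTimer
{
    // Hypothetical helper: times ReadXml against a preloaded schema and
    // reports whether the process is currently running under a debugger.
    public static long TimeLoad(string xsdPath, string xmlPath)
    {
        var ds = new DataSet();
        ds.ReadXmlSchema(xsdPath);

        var sw = Stopwatch.StartNew();
        ds.ReadXml(xmlPath);
        sw.Stop();

        Console.WriteLine(
            "Debugger attached: {0}; ReadXml took {1} milliseconds.",
            Debugger.IsAttached, sw.ElapsedMilliseconds);
        return sw.ElapsedMilliseconds;
    }
}
```

Running the same build once with F5 and once from a command prompt, then comparing the two lines of output, should show whether the slowdown follows the debugger.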

#3

Another approach to try is to read the dataset without the schema and then Merge it into a typed DataSet that has the constraints enabled. That way it has all of the data on hand as it builds the indexes used to enforce constraints, so maybe it would be more efficient?
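
A minimal sketch of that idea, under stated assumptions: the class and method names are hypothetical, xsdPath and xmlPath are the ones from the question, and an untyped DataSet loaded from the XSD stands in for the typed one. Note that an inferred schema types every column as string, so this should only work as written when the target columns are strings too (as they appear to be in the question); a data type mismatch will likely make Merge fail.

```csharp
using System.Data;

static class MergeLoadSketch
{
    // Hypothetical helper: loads the data with an inferred schema, then
    // merges it into a DataSet that carries the real schema and constraints.
    public static DataSet LoadViaMerge(string xsdPath, string xmlPath)
    {
        // Target: full schema, including the unique constraints.
        var target = new DataSet();
        target.ReadXmlSchema(xsdPath);

        // Staging: no schema up front, so ReadXml infers one (the fast path).
        var staging = new DataSet();
        staging.ReadXml(xmlPath, XmlReadMode.InferSchema);

        // Tables are matched by name; anything in staging that the target
        // schema does not define is ignored rather than added.
        target.Merge(staging, false, MissingSchemaAction.Ignore);
        return target;
    }
}
```

Whether this is actually faster than reading straight into the constrained DataSet is something to measure rather than assume.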

From MSDN:

The Merge method is typically called at the end of a series of procedures that involve validating changes, reconciling errors, updating the data source with the changes, and finally refreshing the existing DataSet
