How can I improve the performance of retrieving values from the SharedStringTable in the OpenXml Excel spreadsheet tools?

I'm using DocumentFormat.OpenXml to read an Excel spreadsheet. I have a performance bottleneck with the code used to look up the cell value from the SharedStringTable object (it seems to be some sort of lookup table for cell values):

var returnValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;

I've created a dictionary to ensure I only retrieve a value once:

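// Return the cached string if this index has already been resolved.
if (dictionary.TryGetValue(parsedValue, out var cachedValue))
{
    return cachedValue;
}

// Otherwise fetch it from the shared string table and cache it.
var fetchedValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;
dictionary.Add(parsedValue, fetchedValue);
return fetchedValue;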

This has cut the execution time by almost 50%. However, my metrics indicate that the line of code fetching the value from the SharedStringTable object still takes 208 seconds to execute 123,951 times. Is there any other way of optimising this operation?

1 solution

#1

I would read the whole shared string table into your dictionary in one go rather than looking up each value as required. This will allow you to move through the file in order and stash the values ready for a hashed lookup which will be more efficient than scanning the SST for each value you require.

Running something like the following at the start of your process will allow you to access each value using dictionary[parsedValue].

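// Assumes sharedStringTablePart and dictionary (a Dictionary<int, string>) are static fields of the enclosing class.
private static void LoadDictionary()
{
    int i = 0;

    // Walk the shared string table once, in order, so each item's position becomes its key.
    foreach (var ss in sharedStringTablePart.SharedStringTable.ChildElements)
    {
        dictionary.Add(i++, ss.InnerText);
    }
}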
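
Once the dictionary is loaded, resolving a cell's text could look something like the sketch below. GetCellText is a hypothetical helper name, not part of the OpenXml SDK, and the sketch assumes the dictionary is keyed by the shared string index as in LoadDictionary above. Cells that are not shared strings simply return their raw value.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Spreadsheet;

internal static class CellText
{
    // Hypothetical helper: returns the text of a cell, resolving shared strings
    // through the preloaded dictionary instead of scanning the SharedStringTable.
    public static string GetCellText(Cell cell, IReadOnlyDictionary<int, string> dictionary)
    {
        // Raw content of the cell's value element; null for empty cells.
        string raw = cell.CellValue?.Text;

        // Only cells typed as SharedString store an index into the table.
        bool isSharedString = cell.DataType != null
            && cell.DataType.Value == CellValues.SharedString;

        if (isSharedString
            && int.TryParse(raw, out int index)
            && dictionary.TryGetValue(index, out string text))
        {
            return text;
        }

        // Numbers, inline values and empty cells fall through unchanged.
        return raw ?? string.Empty;
    }
}
```

Calling CellText.GetCellText(cell, dictionary) for each cell keeps every lookup a single hash probe, which is where the gain over repeated GetItem calls comes from.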

If your file is very large, you might see some gains using a SAX approach to read the file rather than the DOM approach above:

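private static void LoadDictionarySax()
{
    // Stream the shared string part element by element instead of loading the whole DOM.
    using (OpenXmlReader reader = OpenXmlReader.Create(sharedStringTablePart))
    {
        int i = 0;
        while (reader.Read())
        {
            if (reader.ElementType == typeof(SharedStringItem))
            {
                // Materialise only the current shared string item.
                SharedStringItem ssi = (SharedStringItem)reader.LoadCurrentElement();
                // ssi.Text is the item's direct text child; items without one are stored as empty strings.
                dictionary.Add(i++, ssi.Text != null ? ssi.Text.Text : string.Empty);
            }
        }
    }
}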

On my machine, with a file of 60,000 rows and 2 columns, the LoadDictionary method above was around 300 times quicker than the GetValue method from your question. The LoadDictionarySax method gave similar performance on that file, but on a larger file (100,000 rows with 10 columns) the SAX approach was around 25% faster than LoadDictionary. On an even larger file (100,000 rows, 26 columns), LoadDictionary threw an out-of-memory exception, while LoadDictionarySax worked without issue.
