I have a very large xlsx file (6 sheets, 1 million rows on each sheet, 19 columns each). The idea is to read all rows in all sheets and populate a database using Entity Framework.
I have tried the MS Open XML SDK, but it is really slow. Just iterating over all rows and cells (SAX-style) would probably keep my program running for a month or more.
I have also tried this library: https://github.com/ExcelDataReader/ExcelDataReader . Initially, reading with it is much faster (around 100 rows in 2 seconds, including db inserts), but then it gets really slow (after 50,000 rows, a minute or more). Maybe Entity Framework causes the slowdown. I did not test that (I only tested perhaps the first 1,000 rows without db access), but even if it does, reading alone is just too slow.
Does anyone have an idea how I can speed up reading the xlsx? Also, should I abandon Entity Framework and do the inserts on my own?
Currently my code looks something like this (it contains calls such as columnMapper.Populate(tmpColumns, columns); and columnMapper.GetColumnId because different sheets have a different column order and even a different number of columns):
using (FileStream stream = File.Open(inputFilePath, FileMode.Open, FileAccess.Read))
{
    IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(stream);
    // AsDataSet() loads every sheet into memory before any row is processed.
    DataSet result = excelReader.AsDataSet();
    int tableNo = 0;
    string[] tmpColumns = new string[20];
    Console.WriteLine("Start processing...");
    foreach (DataTable table in result.Tables)
    {
        // skip bad table
        if (++tableNo > 5)
        {
            continue;
        }
        for (int i = 0; i < tmpColumns.Length; ++i)
        {
            tmpColumns[i] = string.Empty;
        }
        int columns = 0;
        var rowEnumerator = table.Rows.GetEnumerator();
        // The first row holds the column names; skip sheets that have no rows at all.
        if (!rowEnumerator.MoveNext())
        {
            continue;
        }
        // Iterate as object: a header cell is not guaranteed to be a string.
        foreach (object item in ((DataRow)rowEnumerator.Current).ItemArray)
        {
            tmpColumns[columns++] = item.ToString();
        }
        columnMapper.Populate(tmpColumns, columns);
        int rowNumber = 0;
        while (rowEnumerator.MoveNext())
        {
            var row = (DataRow)rowEnumerator.Current;
            int col = 0;
            // Move each cell to its normalized position for this sheet.
            foreach (object item in row.ItemArray)
            {
                tmpColumns[columnMapper.GetColumnId(col++)] = item.ToString();
            }
            var newBoxData = new BoxData()
            {
                ...
                Year = tmpColumns[4],
                RetentionPeriod = tmpColumns[6],
                ContractNumber = tmpColumns[7],
                Mbr = tmpColumns[8],
                CardId = tmpColumns[10],
                Package = tmpColumns[12],
                UnitType = tmpColumns[13],
                MFCBox = tmpColumns[14],
                DescriptionNameAndSurname = tmpColumns[15],
                DateFromTo = tmpColumns[16],
                OrderNumberRange = tmpColumns[17],
                PartnerCode = tmpColumns[18],
            };
            db.BoxesDatas.Add(newBoxData);
            // Flush to the database every SaveAfterCount rows.
            if (++rowNumber % SaveAfterCount == 0)
            {
                db.SaveChanges();
                Console.WriteLine(rowNumber);
            }
        }
    }
    // Save whatever is left after the last full batch.
    db.SaveChanges();
}
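For comparison: AsDataSet() in the code above materializes every sheet in memory before the loop starts. A minimal sketch of reading row by row instead is shown below. It is written against System.Data.IDataReader, which IExcelDataReader extends, so the reader returned by ExcelReaderFactory.CreateOpenXmlReader(stream) can be passed in directly. SheetStreamer and StreamRows are hypothetical names, not part of the library, and the header row of each sheet comes back as row 0 on the assumption that the reader does not treat it specially.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Globalization;

public static class SheetStreamer
{
    // Yields one tuple per row: sheet index, row index within the sheet, cell texts.
    // Only the current row is held in memory; nothing is buffered per sheet.
    public static IEnumerable<(int Sheet, int Row, string[] Cells)> StreamRows(IDataReader reader)
    {
        int sheet = 0;
        do
        {
            int row = 0;
            while (reader.Read())
            {
                var cells = new string[reader.FieldCount];
                for (int i = 0; i < cells.Length; i++)
                {
                    object value = reader.GetValue(i);
                    // Empty cells may surface as null or DBNull depending on the reader.
                    cells[i] = (value == null || value is DBNull)
                        ? string.Empty
                        : Convert.ToString(value, CultureInfo.InvariantCulture);
                }
                yield return (sheet, row++, cells);
            }
            sheet++;
        } while (reader.NextResult()); // advance to the next sheet, if any
    }
}
```

With this shape, the caller would run columnMapper.Populate on every tuple whose Row is 0 and build a BoxData from the others, without ever holding a whole sheet in memory.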
EDIT: The solution was to remove Entity Framework and insert the data with plain SQL commands. After dropping the MS libraries (Entity Framework, Open XML SDK) and replacing them with ExcelDataReader and plain SQL inserts, everything ran much faster: ~2,000,000 rows were extracted and inserted into the db in less than 20 minutes.
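A minimal sketch of what "plain SQL inserts" could look like, since the post does not show the actual code or name the database. It assumes an already opened ADO.NET connection whose provider accepts @name parameter markers (SQL Server's does). PlainSqlInserter and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Data.Common;
using System.Linq;

public static class PlainSqlInserter
{
    // Builds e.g. "INSERT INTO T (A, B) VALUES (@p0, @p1)".
    // Table and column names are concatenated as-is, so they must come
    // from trusted code, never from the spreadsheet or user input.
    public static string BuildInsertSql(string table, IReadOnlyList<string> columns)
    {
        string names = string.Join(", ", columns);
        string placeholders = string.Join(", ", columns.Select((_, i) => "@p" + i));
        return $"INSERT INTO {table} ({names}) VALUES ({placeholders})";
    }

    // Inserts all rows in one transaction, reusing a single parameterized command.
    // The connection must already be open. Returns the number of rows written.
    public static int InsertRows(DbConnection connection, string table,
        IReadOnlyList<string> columns, IEnumerable<string[]> rows)
    {
        int written = 0;
        using (DbTransaction tx = connection.BeginTransaction())
        using (DbCommand cmd = connection.CreateCommand())
        {
            cmd.Transaction = tx;
            cmd.CommandText = BuildInsertSql(table, columns);
            for (int i = 0; i < columns.Count; i++)
            {
                DbParameter p = cmd.CreateParameter();
                p.ParameterName = "@p" + i;
                cmd.Parameters.Add(p);
            }
            foreach (string[] row in rows)
            {
                // Only the parameter values change from row to row.
                for (int i = 0; i < columns.Count; i++)
                {
                    cmd.Parameters[i].Value = (object)row[i] ?? DBNull.Value;
                }
                written += cmd.ExecuteNonQuery();
            }
            tx.Commit();
        }
        return written;
    }
}
```

A call might look like PlainSqlInserter.InsertRows(connection, "BoxesDatas", new[] { "Year", "Mbr" }, rows); where the table and column names are guesses based on the entity shown in the question. For a million rows per sheet it may be worth committing every few thousand rows rather than once at the end.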
2 solutions
#1
2
Refer to these: "Read Excel Sheet Data into DataTable: Code Project" and "Best/Fastest way to read an Excel Sheet into a DataTable".
#2
1
After solving the Excel performance problem you will also have EF performance issues.
The first step is to create a new DbContext from time to time (e.g. after each SaveChanges).
The second step is to use a different ORM.
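The first step could be sketched roughly as follows. The helper is deliberately generic so it does not depend on a particular EF version: the caller supplies a factory for the context plus the add and save actions. BatchedSaver and SaveInBatches are hypothetical names, and BoxContext in the usage note is an assumed name for the question's context class.

```csharp
using System;
using System.Collections.Generic;

public static class BatchedSaver
{
    // Adds rows to a context, saves every batchSize rows, then disposes the
    // context and starts a fresh one so tracked entities do not pile up.
    // Returns how many contexts were created.
    public static int SaveInBatches<TContext, TRow>(
        IEnumerable<TRow> rows, int batchSize,
        Func<TContext> createContext,
        Action<TContext, TRow> add,
        Action<TContext> save)
        where TContext : class, IDisposable
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        int contextsUsed = 0;
        int pending = 0;
        TContext context = null;
        try
        {
            foreach (TRow row in rows)
            {
                if (context == null)
                {
                    context = createContext();
                    contextsUsed++;
                }
                add(context, row);
                if (++pending == batchSize)
                {
                    save(context);
                    context.Dispose();
                    context = null; // next row gets a fresh context
                    pending = 0;
                }
            }
            if (pending > 0) save(context); // flush the last partial batch
        }
        finally
        {
            if (context != null) context.Dispose();
        }
        return contextsUsed;
    }
}
```

With the entities from the question, a call might look like BatchedSaver.SaveInBatches(boxes, 1000, () => new BoxContext(), (db, b) => db.BoxesDatas.Add(b), db => db.SaveChanges()); where the batch size of 1000 is an arbitrary starting point to tune.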