Which data-validation approach is best suited to large datasets?

Time: 2022-05-30 16:58:11

I have a large database and want to implement a feature which would allow a user to do a bulk update of information. The user downloads an Excel file, makes the changes, and the system accepts the Excel file back.

  1. The user uses a web interface (ASP.NET) to download the data from the database into Excel.

  2. The user modifies the Excel file. Only certain data is allowed to be modified, since the other columns map back into the DB.

  3. Once the user is happy with their changes, they upload the changed Excel file through the ASP.NET interface.

  4. Now it's the server's job to pull the data out of the Excel file (using GemBox) and validate it against the database (this is where I'm having trouble); a rough sketch of this step follows the list.

  5. Validation results are shown on another ASP.NET page once validation is complete. Validation is soft, so hard failures only occur when, say, an index mapping into the DB is missing. (Missing data just causes the row to be ignored, etc.)

  6. The user can decide whether the actions that will be taken are appropriate; on accepting them, the system applies the changes. (Add, Modify, or Ignore)
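
To make step 4 a bit more concrete, here is a rough sketch of pulling the uploaded workbook into a DataTable with GemBox.Spreadsheet before any validation happens. The license call, the sheet index, and the column layout (RecordId, StartDate, Description) are assumptions for illustration only, not my real schema, and the exact GemBox calls may differ between versions (older releases use instance LoadXls/LoadXlsx methods rather than the static Load):

```csharp
using System;
using System.Data;
using GemBox.Spreadsheet;

public static class ExcelUploadReader
{
    // Reads the uploaded workbook into a DataTable so the rows can be
    // validated and bulk-loaded later. Column names/positions are hypothetical.
    public static DataTable Read(string path)
    {
        // GemBox needs its license key set once per application.
        SpreadsheetInfo.SetLicense("FREE-LIMITED-KEY");

        ExcelFile workbook = ExcelFile.Load(path);
        ExcelWorksheet sheet = workbook.Worksheets[0];

        var table = new DataTable("BulkUpload");
        table.Columns.Add("RecordId", typeof(int));
        table.Columns.Add("StartDate", typeof(string));    // format checked later
        table.Columns.Add("Description", typeof(string));

        // Row 0 is assumed to be the header row.
        for (int r = 1; r < sheet.Rows.Count; r++)
        {
            ExcelRow row = sheet.Rows[r];
            object id = row.Cells[0].Value;
            if (id == null)
                continue;   // soft validation: blank rows are simply ignored

            table.Rows.Add(
                Convert.ToInt32(id),
                Convert.ToString(row.Cells[1].Value),
                Convert.ToString(row.Cells[2].Value));
        }

        return table;
    }
}
```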

Before applying the changes and/or additions the user has made, the data must be validated to avoid mistakes by the user (e.g. dates they accidentally deleted and didn't mean to).

It's not far-fetched for the number of rows that need updating to reach over 65k.

The question is: What is the best way to parse the data to do validation and to build up the change and addition sets?

If I load all the data that the Excel data must be validated against into memory, I might unnecessarily burden an already memory-hungry application. If I do a database hit for every tuple in the Excel file, I'm looking at over 65k database hits.

Help?

4 solutions

#1


The approach I've seen used in the past is:

  1. Bulk-load the user's data into a 'scratch' table in the database.

  2. Validate data in the scratch table via a single stored procedure (executing a series of queries), marking rows that fail validation, require update etc.

  3. Action the marked rows as appropriate.

This works well for validating missing columns, valid key values, etc. It's not so good for checking the format of individual fields (don't make SQL pull strings apart).

As we know, some folk feel uncomfortable putting business logic in the database, but this approach does limit the number of database hits your application makes, and avoids holding all the data in memory at once.

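A minimal sketch of the first two steps, assuming SQL Server and the standard ADO.NET SqlBulkCopy class; the scratch-table name StagingBulkUpdate and the procedure name usp_ValidateBulkUpdate are placeholders, and the set-based checks live inside that procedure:

```csharp
using System.Data;
using System.Data.SqlClient;

public static class ScratchTableLoader
{
    // Pushes the parsed Excel rows into a scratch table in one bulk operation,
    // then runs a single stored procedure that marks each row
    // (e.g. OK / needs update / failed validation).
    public static void LoadAndValidate(DataTable upload, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // 1. One round trip to load all ~65k rows into the scratch table.
            //    (Column order is assumed to match; otherwise add ColumnMappings.)
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.StagingBulkUpdate";
                bulkCopy.BatchSize = 5000;
                bulkCopy.WriteToServer(upload);
            }

            // 2. One more round trip for the set-based validation on the server.
            using (var validate = new SqlCommand("dbo.usp_ValidateBulkUpdate", connection))
            {
                validate.CommandType = CommandType.StoredProcedure;
                validate.CommandTimeout = 300;   // allow time for the bulk checks
                validate.ExecuteNonQuery();
            }
        }
    }
}
```

The results page then just SELECTs the marked rows back out of the scratch table, so the whole cycle costs a handful of round trips rather than one per row.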

#2


Your problem is very common in data warehouse systems, where bulk uploads and data cleansing are a core part of the (regular) work to be done. I suggest you google around ETL (Extract, Transform, Load) and staging tables and you'll find a wealth of good material.

In broad answer to your problem, if you do 'load the data into memory' for checking, you're effectively re-implementing a part of the DB engine in your own code. Now that could be a good thing if it's faster and simpler to do so. For instance, you may only have a small range of valid dates for your Excel extract, so you don't need to join to a table to check that the dates are in range. However, for other data like foreign keys etc., let the DB do what it's good at.

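For example, a simple field-level check like that can sit in the parsing code; the date range and column meaning below are purely illustrative:

```csharp
using System;
using System.Globalization;

public static class FieldChecks
{
    // Hypothetical rule: dates in the upload must parse and fall inside a
    // known window. Failures become per-row messages, not database round trips.
    static readonly DateTime MinDate = new DateTime(2000, 1, 1);
    static readonly DateTime MaxDate = new DateTime(2030, 12, 31);

    // Returns null when the value passes, otherwise a validation message.
    public static string CheckStartDate(string raw)
    {
        DateTime parsed;
        if (!DateTime.TryParse(raw, CultureInfo.InvariantCulture,
                               DateTimeStyles.None, out parsed))
            return "Start date is not a valid date.";

        if (parsed < MinDate || parsed > MaxDate)
            return "Start date is outside the allowed range.";

        return null;   // field is fine; foreign-key checks stay in the database
    }
}
```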

Using a staging table/database/server is a common solution as the data volumes get large. BTW, allowing users to clean data in Excel is a really good idea; allowing them to 'accidentally' remove crucial data is a really bad idea. Can you lock cells/columns to prevent this, and/or put some basic validation into Excel? If a field should be filled in and should be a date, you can check that with a few lines of Excel validation. Your users will be happy, as they don't have to upload before finding problems.

#3


To answer this properly, the following information would be useful:

  1. How are you going to notify the user of failures?

  2. Will one validation failure result in loading 64,999 records, or none?

#4


First, store the text-file data in a temp table using a bulk upload. Then retrieve it and validate it through the interface you've built, and after validation store it in the main table or DB.
