I have two DataTables, A and B, produced from CSV files. I need to be able to check which rows exist in B that do not exist in A.
Is there a way to do some sort of query to show the different rows or would I have to iterate through each row on each DataTable to check if they are the same? The latter option seems to be very intensive if the tables become large.
13 solutions
#1
9
would I have to iterate through each row on each DataTable to check if they are the same.
Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.
Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:
1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:
- Sort both tables by their ID (using some useful thing like a quicksort). If they're already sorted then you win big.
- Step through both tables at once, skipping over any gaps in IDs in either table. Matching IDs mean duplicated records.
This allows you to do it in (sort time * 2) + one pass, so if my big-O notation is correct, it'd be (whatever-sort-time) + O(m+n), which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes.)
2: An alternative approach, which may be more or less efficient depending on how big your data is:
- Run through table 1, and for each row, stick its ID (or computed hashcode, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
- Run through table 2, and for each row, see if the ID (or hashcode etc.) is present in the dictionary. You're exploiting the fact that dictionaries have really fast lookup (O(1) on average, I think). This step will be really fast, but you'll have paid the price doing all those dictionary inserts.
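The dictionary approach might be sketched in C# roughly as follows. This is only an illustration under assumptions: the names `KeyLookupDiff` and `RowsOnlyInB` are made up, the key column name is whatever identifies a row in your data, and a `HashSet<string>` stands in for the dictionary because only the keys are needed.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class KeyLookupDiff
{
    // Returns the rows of b whose key value does not occur in a.
    // keyColumn is a hypothetical column name supplied by the caller.
    public static List<DataRow> RowsOnlyInB(DataTable a, DataTable b, string keyColumn)
    {
        // One pass over A to collect its keys (the "dictionary inserts").
        var keysInA = new HashSet<string>(StringComparer.Ordinal);
        foreach (DataRow row in a.Rows)
        {
            keysInA.Add(Convert.ToString(row[keyColumn]));
        }

        // One pass over B; each Contains call is O(1) on average.
        var onlyInB = new List<DataRow>();
        foreach (DataRow row in b.Rows)
        {
            if (!keysInA.Contains(Convert.ToString(row[keyColumn])))
            {
                onlyInB.Add(row);
            }
        }
        return onlyInB;
    }
}
```

Both passes are linear, so the total cost is roughly proportional to the number of rows in A plus the number of rows in B.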
I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)
#2
19
Assuming you have an ID column which is of an appropriate type (i.e. gives a hashcode and implements equality) - string in this example, which is slightly pseudocode because I'm not that familiar with DataTables and don't have time to look it all up just now :)
IEnumerable<string> idsInA = tableA.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> idsInB = tableB.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> bNotA = idsInB.Except(idsInA);
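If there is no usable ID column, the same `Except` idea can be applied to whole rows. The sketch below assumes both tables have the same columns in the same order and uses `DataRowComparer.Default`, which compares rows value by value; the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class WholeRowDiff
{
    // Rows of b that have no value-for-value equal row in a.
    // Assumes both tables share the same column layout.
    public static List<DataRow> RowsOnlyInB(DataTable a, DataTable b)
    {
        return b.AsEnumerable()
                .Except(a.AsEnumerable(), DataRowComparer.Default)
                .ToList();
    }
}
```

Note that `Except` has set semantics, so rows that are repeated within B are returned only once.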
#3
7
You can use the Merge and GetChanges methods on the DataTable to do this:
A.Merge(B); // this will add to A any records that are in B but not A
return A.GetChanges(); // returns records originally only in B
#4
4
The answers so far assume that you're simply looking for duplicate primary keys. That's a pretty easy problem - you can use the Merge() method, for instance.
But I understand your question to mean that you're looking for duplicate DataRows. (From your description of the problem, with both tables being imported from CSV files, I'd even assume that the original rows didn't have primary key values, and that any primary keys are being assigned via AutoNumber during the import.)
The naive implementation (for each row in A, compare its ItemArray with that of each row in B) is indeed going to be computationally expensive.
A much less expensive way to do this is with a hashing algorithm. For each DataRow, concatenate the string values of its columns into a single string, and then call GetHashCode() on that string to get an int value. Create a Dictionary<int, DataRow> that contains an entry, keyed on the hash code, for each DataRow in DataTable B. Then, for each DataRow in DataTable A, calculate the hash code, and see if it's contained in the dictionary. If it's not, you know that the DataRow doesn't exist in DataTable B.
This approach has two weaknesses that both emerge from the fact that two strings can be unequal but produce the same hash code. If you find a row in A whose hash is in the dictionary, you then need to check the DataRow in the dictionary to verify that the two rows are really equal.
The second weakness is more serious: it's unlikely, but possible, that two different DataRows in B could hash to the same key value. For this reason, the dictionary should really be a Dictionary<int, List<DataRow>>, and you should perform the check described in the previous paragraph against each DataRow in the list.
It takes a fair amount of work to get this working, but it's an O(m+n) algorithm, which I think is going to be as good as it gets.
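A minimal sketch of the bucketed-hash idea, under a few assumptions: the roles of the tables are swapped relative to the description (the dictionary is built from A and B is scanned) so that the result matches the original question, the names `HashBucketDiff`, `RowKey` and `RowsOnlyInB` are invented here, and the unit-separator character is assumed not to occur in the data.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class HashBucketDiff
{
    // Joins the column values with a separator assumed not to appear in the
    // data, so ("ab","c") and ("a","bc") do not produce the same string.
    private static string RowKey(DataRow row) =>
        string.Join("\u001F", row.ItemArray.Select(v => Convert.ToString(v)));

    public static List<DataRow> RowsOnlyInB(DataTable a, DataTable b)
    {
        // Bucket A's rows by hash code; a bucket holds every row sharing that hash.
        var buckets = new Dictionary<int, List<DataRow>>();
        foreach (DataRow row in a.Rows)
        {
            int hash = RowKey(row).GetHashCode();
            if (!buckets.TryGetValue(hash, out List<DataRow> bucket))
            {
                bucket = new List<DataRow>();
                buckets[hash] = bucket;
            }
            bucket.Add(row);
        }

        var onlyInB = new List<DataRow>();
        foreach (DataRow row in b.Rows)
        {
            string key = RowKey(row);
            // A hash hit is only a candidate; confirm with a real comparison.
            bool found = buckets.TryGetValue(key.GetHashCode(), out List<DataRow> candidates)
                         && candidates.Any(c => RowKey(c) == key);
            if (!found)
            {
                onlyInB.Add(row);
            }
        }
        return onlyInB;
    }
}
```

The string hash codes are only compared within a single run, which matters because they should not be expected to stay the same between processes.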
#5
1
Just FYI:
Generally speaking about algorithms, comparing two sets of sortable items (as IDs typically are) is not an O(M*N/2) operation, but O(M+N) if the two sets are ordered. So you scan one table while keeping a pointer into the other, starting at its first item:
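other_item = A.first()
only_in_B = empty_list()
for item in B:
    # advance A past every entry that sorts before the current B item
    while other_item < item:
        other_item = A.next()
        if A.eof():
            only_in_B.add(item and all the remaining B items)
            return only_in_B
    # A's current entry is now >= item; no match means item is only in B
    if item < other_item:
        only_in_B.append(item)
return only_in_B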
The code above is obviously pseudocode, but should give you the general gist if you decide to code it yourself.
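For comparison, the same single-pass scan might look like this in C#. It is a sketch that assumes the keys are strings and that both lists are already sorted ascending with ordinal comparison; `SortedDiff` and `OnlyInB` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SortedDiff
{
    // Both lists must already be sorted ascending with ordinal comparison.
    public static List<string> OnlyInB(IReadOnlyList<string> sortedA, IReadOnlyList<string> sortedB)
    {
        var onlyInB = new List<string>();
        int i = 0; // cursor into sortedA
        foreach (string item in sortedB)
        {
            // Skip A entries that sort before the current B item.
            while (i < sortedA.Count && string.CompareOrdinal(sortedA[i], item) < 0)
            {
                i++;
            }
            // Either A is exhausted or A's current entry is >= item.
            if (i == sortedA.Count || string.CompareOrdinal(sortedA[i], item) != 0)
            {
                onlyInB.Add(item);
            }
        }
        return onlyInB;
    }
}
```

The cursor into A never moves backwards, which is what keeps the scan at O(M+N) once the sorting has been paid for.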
#6
1
Thanks for all the feedback.
Unfortunately, I do not have any indexes. I will give a little more information about my situation.
We have a reporting program (it replaced Crystal Reports) that is installed on 7 servers across the EU. These servers have many reports on them (not all the same for each country). They are invoked by a command-line application that uses XML files for their configuration, so one XML file can call multiple reports.
The command-line application is scheduled and controlled by our overnight process, so the XML file could be called from multiple places.
The goal of the CSV is to produce a list of all the reports that are being used and where they are being called from.
I am going through the XML files for all references, querying the scheduling program and producing a list of all the reports (this is not too bad).
The problem I have is that I have to keep a list of all the reports that might have been removed from production, so I need to compare the old CSV with the new data. For this I thought it best to put it into DataTables and compare the information. (This could be the wrong approach. I suppose I could create an object that holds each record, compare the two sets, and then iterate through the differences.)
The data I have about each report is as follows:
String - Task Name
String - Action Name
Int - ActionID (the Action ID can be in multiple records, as a single action can call many reports, i.e. an XML file)
String - XML File called
String - Report Name
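The "object that holds it" idea could be sketched as below. This is an assumption-laden illustration rather than the actual program: the type and member names are invented from the field list, and it relies on C# 9 records, whose equality compares all fields by value.

```csharp
using System.Collections.Generic;
using System.Linq;

// Field names follow the list above; the type name is made up for this sketch.
public record ReportEntry(string TaskName, string ActionName, int ActionId,
                          string XmlFile, string ReportName);

public static class ReportDiff
{
    // Entries present in the old list but missing from the new one,
    // i.e. reports that may have been removed from production.
    public static List<ReportEntry> Removed(IEnumerable<ReportEntry> oldEntries,
                                            IEnumerable<ReportEntry> newEntries)
    {
        var current = new HashSet<ReportEntry>(newEntries);
        return oldEntries.Where(e => !current.Contains(e)).ToList();
    }
}
```

Because the ActionID alone can repeat across records, the whole record acts as the key here.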
I will try the Merge idea given by MusiGenesis (thanks). (Rereading some of the posts, I am not sure if the Merge will work, but it is worth trying, as I have not heard about it before, so it is something new to learn.)
The HashCode idea sounds interesting as well.
Thanks for all the advice.
#7
1
I found an easy way to solve this. Unlike the previous Except answers, I use the Except method twice. This tells you not only which rows were deleted but also which rows were added. If you use Except only once, it will tell you only one of the two differences. This code is tested and works. See below:
//Pass in your two datatables into your method
//build the queries based on id.
var qry1 = datatable1.AsEnumerable().Select(a => new { ID = a["ID"].ToString() });
var qry2 = datatable2.AsEnumerable().Select(b => new { ID = b["ID"].ToString() });
//detect row deletes - a row is in datatable1 except missing from datatable2
var exceptAB = qry1.Except(qry2);
//detect row inserts - a row is in datatable2 except missing from datatable1
var exceptAB2 = qry2.Except(qry1);
Then execute your code against the results:
if (exceptAB.Any())
{
foreach (var id in exceptAB)
{
//execute code here
}
}
if (exceptAB2.Any())
{
foreach (var id in exceptAB2)
{
//execute code here
}
}
#8
1
public DataTable compareDataTables(DataTable First, DataTable Second)
{
First.TableName = "FirstTable";
Second.TableName = "SecondTable";
//Create Empty Table
DataTable table = new DataTable("Difference");
DataTable table1 = new DataTable();
try
{
//Must use a Dataset to make use of a DataRelation object
using (DataSet ds4 = new DataSet())
{
//Add tables
ds4.Tables.AddRange(new DataTable[] { First.Copy(), Second.Copy() });
//Get Columns for DataRelation
DataColumn[] firstcolumns = new DataColumn[ds4.Tables[0].Columns.Count];
for (int i = 0; i < firstcolumns.Length; i++)
{
firstcolumns[i] = ds4.Tables[0].Columns[i];
}
DataColumn[] secondcolumns = new DataColumn[ds4.Tables[1].Columns.Count];
for (int i = 0; i < secondcolumns.Length; i++)
{
secondcolumns[i] = ds4.Tables[1].Columns[i];
}
//Create DataRelation
DataRelation r = new DataRelation(string.Empty, firstcolumns, secondcolumns, false);
ds4.Relations.Add(r);
//Create columns for return table
for (int i = 0; i < First.Columns.Count; i++)
{
table.Columns.Add(First.Columns[i].ColumnName, First.Columns[i].DataType);
}
//If First Row not in Second, Add to return table.
table.BeginLoadData();
foreach (DataRow parentrow in ds4.Tables[0].Rows)
{
DataRow[] childrows = parentrow.GetChildRows(r);
// No matching row in Second: this row exists only in First.
if (childrows == null || childrows.Length == 0)
{
table.LoadDataRow(parentrow.ItemArray, true);
}
}
table.EndLoadData();
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
return table;
}
#9
0
try
{
if (ds.Tables[0].Columns.Count == ds1.Tables[0].Columns.Count)
{
for (int i = 0; i < ds.Tables[0].Rows.Count; i++)
{
for (int j = 0; j < ds.Tables[0].Columns.Count; j++)
{
if (ds.Tables[0].Rows[i][j].ToString() == ds1.Tables[0].Rows[i][j].ToString())
{
}
else
{
MessageBox.Show(i.ToString() + "," + j.ToString());
}
}
}
}
else
{
MessageBox.Show("Table has different columns ");
}
}
catch (Exception)
{
MessageBox.Show("Please select The Table");
}
#10
0
I'm continuing tzot's idea ...
If you have two sortable sets, then you can just use:
List<string> diffList = new List<string>(sortedListA.Except(sortedListB));
If you need more complicated objects, you can define a comparer yourself (an IEqualityComparer<T> passed to Except) and still use it.
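As a rough sketch of such a comparer, assume each record is a string array holding the parsed fields of one CSV line. The names `FieldsComparer`, `ComparerDiff` and `OnlyInA` are hypothetical, and fields are compared ordinally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Treats two string arrays as equal when they have the same length
// and the same values in the same order.
public sealed class FieldsComparer : IEqualityComparer<string[]>
{
    public bool Equals(string[] x, string[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.SequenceEqual(y, StringComparer.Ordinal);
    }

    public int GetHashCode(string[] fields)
    {
        // Combine the element hashes; equal arrays yield equal hash codes.
        unchecked
        {
            int hash = 17;
            foreach (string field in fields)
            {
                hash = hash * 31 + (field == null ? 0 : StringComparer.Ordinal.GetHashCode(field));
            }
            return hash;
        }
    }
}

public static class ComparerDiff
{
    // Records that occur in a but not in b, compared field by field.
    public static List<string[]> OnlyInA(IEnumerable<string[]> a, IEnumerable<string[]> b)
    {
        return a.Except(b, new FieldsComparer()).ToList();
    }
}
```

`Except` does not need its inputs to be sorted; it only needs `Equals` and `GetHashCode` to agree with each other.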
#11
0
The usual usage scenario considers a user who has a DataTable in hand and changes it by Adding, Deleting or Modifying some of the DataRows.
After the changes are performed, the DataTable is aware of the proper DataRowState for each row, and also keeps track of the Original DataRowVersion for any rows that were changed.
In this usual scenario, one can Merge the changes back into a source table (in which all rows are Unchanged). After merging, one can get a nice summary of only the changed rows with a call to GetChanges().
In a more unusual scenario, a user has two DataTables with the same schema (or perhaps only the same columns and lacking primary keys). These two DataTables consist of only Unchanged rows. The user may want to find out what changes need to be applied to one of the two tables in order to get to the other one. That is, which rows need to be Added, Deleted, or Modified.
We define here a function called GetDelta() which does the job:
using System;
using System.Data;
using System.Xml;
using System.Linq;
using System.Collections.Generic;
public class Program
{
private static DataTable GetDelta(DataTable table1, DataTable table2)
{
// Modified2 : row1 keys match rowOther keys AND row1 does not match row2:
IEnumerable<DataRow> modified2 = (
from row1 in table1.AsEnumerable()
from row2 in table2.AsEnumerable()
where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
&& !row1.ItemArray.SequenceEqual(row2.ItemArray)
select row2);
// Modified1 :
IEnumerable<DataRow> modified1 = (
from row1 in table1.AsEnumerable()
from row2 in table2.AsEnumerable()
where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
&& !row1.ItemArray.SequenceEqual(row2.ItemArray)
select row1);
// Added : row2 not in table1 AND row2 not in modified2
IEnumerable<DataRow> added = table2.AsEnumerable().Except(modified2, DataRowComparer.Default).Except(table1.AsEnumerable(), DataRowComparer.Default);
// Deleted : row1 not in row2 AND row1 not in modified1
IEnumerable<DataRow> deleted = table1.AsEnumerable().Except(modified1, DataRowComparer.Default).Except(table2.AsEnumerable(), DataRowComparer.Default);
Console.WriteLine();
Console.WriteLine("modified count =" + modified1.Count());
Console.WriteLine("added count =" + added.Count());
Console.WriteLine("deleted count =" + deleted.Count());
DataTable deltas = table1.Clone();
foreach (DataRow row in modified2)
{
// Match the unmodified version of the row via the PrimaryKey
DataRow matchIn1 = modified1.Where(row1 => table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row[keycol.Ordinal]))).First();
DataRow newRow = deltas.NewRow();
// Set the row with the original values
foreach(DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = matchIn1[dc.ColumnName];
deltas.Rows.Add(newRow);
newRow.AcceptChanges();
// Set the modified values
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
// At this point newRow.DataRowState should be : Modified
}
foreach (DataRow row in added)
{
DataRow newRow = deltas.NewRow();
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
deltas.Rows.Add(newRow);
// At this point newRow.DataRowState should be : Added
}
foreach (DataRow row in deleted)
{
DataRow newRow = deltas.NewRow();
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
deltas.Rows.Add(newRow);
newRow.AcceptChanges();
newRow.Delete();
// At this point newRow.DataRowState should be : Deleted
}
return deltas;
}
private static void DemonstrateGetDelta()
{
DataTable table1 = new DataTable("Items");
// Add columns
DataColumn column1 = new DataColumn("id1", typeof(System.Int32));
DataColumn column2 = new DataColumn("id2", typeof(System.Int32));
DataColumn column3 = new DataColumn("item", typeof(System.Int32));
table1.Columns.Add(column1);
table1.Columns.Add(column2);
table1.Columns.Add(column3);
// Set the primary key column.
table1.PrimaryKey = new DataColumn[] { column1, column2 };
// Add some rows.
DataRow row;
for (int i = 0; i <= 4; i++)
{
row = table1.NewRow();
row["id1"] = i;
row["id2"] = i*i;
row["item"] = i;
table1.Rows.Add(row);
}
// Accept changes.
table1.AcceptChanges();
PrintValues(table1, "table1:");
// Create a second DataTable identical to the first.
DataTable table2 = table1.Clone();
// Add a row that exists in table1:
row = table2.NewRow();
row["id1"] = 0;
row["id2"] = 0;
row["item"] = 0;
table2.Rows.Add(row);
// Modify the values of a row that exists in table1:
row = table2.NewRow();
row["id1"] = 1;
row["id2"] = 1;
row["item"] = 455;
table2.Rows.Add(row);
// Modify the values of a row that exists in table1:
row = table2.NewRow();
row["id1"] = 2;
row["id2"] = 4;
row["item"] = 555;
table2.Rows.Add(row);
// Add a row that does not exist in table1:
row = table2.NewRow();
row["id1"] = 13;
row["id2"] = 169;
row["item"] = 655;
table2.Rows.Add(row);
table2.AcceptChanges();
Console.WriteLine();
PrintValues(table2, "table2:");
DataTable delta = GetDelta(table1,table2);
Console.WriteLine();
PrintValues(delta,"delta:");
// Verify that the deltas DataTable contains the adequate Original DataRowVersions:
DataTable originals = table1.Clone();
foreach (DataRow drow in delta.Rows)
{
if (drow.RowState != DataRowState.Added)
{
DataRow originalRow = originals.NewRow();
foreach (DataColumn dc in originals.Columns)
originalRow[dc.ColumnName] = drow[dc.ColumnName, DataRowVersion.Original];
originals.Rows.Add(originalRow);
}
}
originals.AcceptChanges();
Console.WriteLine();
PrintValues(originals,"delta original values:");
}
private static void Row_Changed(object sender,
DataRowChangeEventArgs e)
{
Console.WriteLine("Row changed {0}\t{1}",
e.Action, e.Row.ItemArray[0]);
}
private static void PrintValues(DataTable table, string label)
{
// Display the values in the supplied DataTable:
Console.WriteLine(label);
foreach (DataRow row in table.Rows)
{
foreach (DataColumn col in table.Columns)
{
Console.Write("\t " + row[col, row.RowState == DataRowState.Deleted ? DataRowVersion.Original : DataRowVersion.Current].ToString());
}
Console.Write("\t DataRowState =" + row.RowState);
Console.WriteLine();
}
}
public static void Main()
{
DemonstrateGetDelta();
}
}
The code above can be tested in https://dotnetfiddle.net/. The resulting output is shown below:
table1:
0 0 0 DataRowState =Unchanged
1 1 1 DataRowState =Unchanged
2 4 2 DataRowState =Unchanged
3 9 3 DataRowState =Unchanged
4 16 4 DataRowState =Unchanged
table2:
0 0 0 DataRowState =Unchanged
1 1 455 DataRowState =Unchanged
2 4 555 DataRowState =Unchanged
13 169 655 DataRowState =Unchanged
modified count =2
added count =1
deleted count =2
delta:
1 1 455 DataRowState =Modified
2 4 555 DataRowState =Modified
13 169 655 DataRowState =Added
3 9 3 DataRowState =Deleted
4 16 4 DataRowState =Deleted
delta original values:
1 1 1 DataRowState =Unchanged
2 4 2 DataRowState =Unchanged
3 9 3 DataRowState =Unchanged
4 16 4 DataRowState =Unchanged
Note that if your tables don't have a PrimaryKey, the where clause in the LINQ queries gets simplified a little bit. I'll let you figure that out on your own.
#12
0
You can achieve this simply by using LINQ:
private DataTable CompareDT(DataTable TableA, DataTable TableB)
{
DataTable TableC = new DataTable();
try
{
var idsNotInB = TableA.AsEnumerable().Select(r => r.Field<string>(Keyfield))
.Except(TableB.AsEnumerable().Select(r => r.Field<string>(Keyfield)));
TableC = (from row in TableA.AsEnumerable()
join id in idsNotInB
on row.Field<string>(ddlColumn.SelectedItem.ToString()) equals id
select row).CopyToDataTable();
}
catch (Exception ex)
{
lblresult.Text = ex.Message;
ex = null;
}
return TableC;
}
#13
0
Could you not simply compare the CSV files before loading them into DataTables?
string[] a = System.IO.File.ReadAllLines(@"csv_a.txt");
string[] b = System.IO.File.ReadAllLines(@"csv_b.txt");
// get the lines from b that are not in a
IEnumerable<string> diff = b.Except(a);
//... parse b into DataTable ...
#1
9
would I have to iterate through each row on each DataTable to check if they are the same.
我是否必须遍历每个DataTable上的每一行以检查它们是否相同。
Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.
当你从CSV文件加载数据时,你不会有任何索引或任何东西,所以在某些时候,某些东西必须遍历每一行,无论是你的代码还是库, 管他呢。
Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:
无论如何,这是一个算法问题,这不是我的专长,但我天真的方法如下:
1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:
1:您可以利用数据的任何属性吗?每个表中的所有行都是唯一的,您可以按相同的标准对它们进行排序吗?如果是这样,你可以这样做:
- Sort both tables by their ID (using some useful thing like a quicksort). If they're already sorted then you win big.
- Step through both tables at once, skipping over any gaps in ID's in either table. Matched ID's mean duplicated records.
按照ID对两个表进行排序(使用一些有用的东西,比如quicksort)。如果它们已经排序,那么你就赢了。
一次跳过两个表,跳过任一表中ID的任何间隙。匹配ID的平均重复记录。
This allows you to do it in (sort time * 2 ) + one pass, so if my big-O-notation is correct, it'd be (whatever-sort-time) + O(m+n) which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes )
这允许你在(排序时间* 2)+一次通过中进行,所以如果我的big-O-notation是正确的,那就是(无论什么时候排序)+ O(m + n)这是非常好的。 (修订版:这是ΤΖΩΤΖΙΟΥ描述的方法)
2: An alternative approach, which may be more or less efficient depending on how big your data is:
2:另一种方法,根据数据的大小,可能会有效或多或少:
- Run through table 1, and for each row, stick it's ID (or computed hashcode, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
- Run through table 2, and for each row, see if the ID (or hashcode etc) is present in the dictionary. You're exploiting the fact that dictionaries have really fast - O(1) I think? lookup. This step will be really fast, but you'll have paid the price doing all those dictionary inserts.
运行表1,对于每一行,将它的ID(或计算的哈希码,或该行的一些其他唯一ID)粘贴到字典中(如果您更喜欢将其称为哈希表)。
运行表2,对于每一行,查看字典中是否存在ID(或哈希码等)。你正在利用字典真的很快的事实 - 我认为O(1)?抬头。这一步将非常快,但您已经付出了所有这些字典插入的价格。
I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)
我真的很想知道那些比我更熟悉算法的人会想出这个:-)
#2
19
Assuming you have an ID column which is of an appropriate type (i.e. gives a hashcode and implements equality) - string in this example, which is slightly pseudocode because I'm not that familiar with DataTables and don't have time to look it all up just now :)
假设你有一个适当类型的ID列(即给出一个哈希码并实现相等) - 在这个例子中的字符串,这是一个伪代码,因为我不熟悉DataTables并且没有时间查看所有内容刚才:)
IEnumerable<string> idsInA = tableA.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> idsInB = tableB.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> bNotA = idsInB.Except(idsInA);
#3
7
You can use the Merge and GetChanges methods on the DataTable to do this:
您可以在DataTable上使用Merge和GetChanges方法来执行此操作:
A.Merge(B); // this will add to A any records that are in B but not A
return A.GetChanges(); // returns records originally only in B
#4
4
The answers so far assume that you're simply looking for duplicate primary keys. That's a pretty easy problem - you can use the Merge() method, for instance.
到目前为止的答案假设您只是在寻找重复的主键。这是一个非常简单的问题 - 例如,你可以使用Merge()方法。
But I understand your question to mean that you're looking for duplicate DataRows. (From your description of the problem, with both tables being imported from CSV files, I'd even assume that the original rows didn't have primary key values, and that any primary keys are being assigned via AutoNumber during the import.)
但我理解你的问题意味着你正在寻找重复的DataRows。 (根据您对问题的描述,两个表都是从CSV文件导入的,我甚至假设原始行没有主键值,并且在导入过程中通过AutoNumber分配了所有主键。)
The naive implementation (for each row in A, compare its ItemArray with that of each row in B) is indeed going to be computationally expensive.
天真的实现(对于A中的每一行,将其ItemArray与B中的每一行进行比较)确实在计算上是昂贵的。
A much less expensive way to do this is with a hashing algorithm. For each DataRow, concatenate the string values of its columns into a single string, and then call GetHashCode() on that string to get an int value. Create a Dictionary<int, DataRow>
that contains an entry, keyed on the hash code, for each DataRow in DataTable B. Then, for each DataRow in DataTable A, calculate the hash code, and see if it's contained in the dictionary. If it's not, you know that the DataRow doesn't exist in DataTable B.
一种便宜得多的方法是使用散列算法。对于每个DataRow,将其列的字符串值连接成单个字符串,然后对该字符串调用GetHashCode()以获取int值。为DataTable B中的每个DataRow创建一个Dictionary
This approach has two weaknesses that both emerge from the fact that two strings can be unequal but produce the same hash code. If you find a row in A whose hash is in the dictionary, you then need to check the DataRow in the dictionary to verify that the two rows are really equal.
这种方法有两个缺点,这两个缺点都来自两个字符串可能不相等但产生相同的哈希码的事实。如果在A中找到一个哈希在字典中的行,则需要检查字典中的DataRow以验证这两行是否真的相等。
The second weakness is more serious: it's unlikely, but possible, that two different DataRows in B could hash to the same key value. For this reason, the dictionary should really be a Dictionary<int, List<DataRow>>
, and you should perform the check described in the previous paragraph against each DataRow in the list.
第二个弱点更严重:B中的两个不同的DataRows可能会散列到相同的键值,但不太可能,但可能。因此,字典应该是Dictionary
It takes a fair amount of work to get this working, but it's an O(m+n) algorithm, which I think is going to be as good as it gets.
要使这个工作需要相当多的工作,但它是一个O(m + n)算法,我认为它会得到很好的效果。
#5
1
Just FYI:
Generally speaking about algorithms, comparing two sets of sortable (as ids typically are) is not an O(M*N/2) operation, but O(M+N) if the two sets are ordered. So you scan one table with a pointer to the start of the other, and:
一般而言,关于算法,比较两组可排序(通常为ids)不是O(M * N / 2)操作,而是如果两组是有序的则为O(M + N)。因此,您使用指向另一个表的开头扫描一个表,并且:
other_item= A.first()
only_in_B= empty_list()
for item in B:
while other_item > item:
other_item= A.next()
if A.eof():
only_in_B.add( all the remaining B items)
return only_in_B
if item < other_item:
empty_list.append(item)
return only_in_B
The code above is obviously pseudocode, but should give you the general gist if you decide to code it yourself.
上面的代码显然是伪代码,但如果您决定自己编写代码,应该给出一般要点。
#6
1
Thanks for all the feedback.
感谢所有的反馈。
I do not have any index's unfortunately. I will give a little more information about my situation.
不幸的是,我没有任何索引。我将提供有关我的情况的更多信息。
We have a reporting program (replaced Crystal reports) that is installed in 7 Servers across EU. These servers have many reports on them (not all the same for each country). They are invoked by a commandline application that uses XML files for their configuration. So One XML file can call multiple reports.
我们有一个报告程序(取代了Crystal报告),该程序安装在整个欧盟的7个服务器中。这些服务器上有很多报告(每个国家都不一样)。它们由命令行应用程序调用,该应用程序使用XML文件进行配置。因此,一个XML文件可以调用多个报告。
The commandline application is scheduled and controlled by our overnight process. So the XML file could be called from multiple places.
命令行应用程序由我们的隔夜进程安排和控制。因此可以从多个位置调用XML文件。
The goal of the CSV is to produce a list of all the reports that are being used and where they are being called from.
CSV的目标是生成所有正在使用的报告列表以及从中调用它们的位置。
I am going through the XML files for all references, querying the scheduling program and producing a list of all the reports. (this is not too bad).
我将浏览所有引用的XML文件,查询调度程序并生成所有报告的列表。 (这不是太糟糕)。
The problem I have is I have to keep a list of all the reports that might have been removed from production. So I need to compare the old CSV with the new data. For this I thought it best to put it into DataTables and compare the information, (this could be the wrong approach. I suppose I could create an object that holds it and compares the difference then create iterate through them).
我遇到的问题是我必须保留可能已从生产中删除的所有报告的列表。所以我需要将旧CSV与新数据进行比较。为此,我认为最好将它放入DataTables并比较信息(这可能是错误的方法。我想我可以创建一个保存它的对象并比较差异然后通过它们创建迭代)。
The data I have about each report is as follows:
我对每份报告的数据如下:
String - Task Name String - Action Name Int - ActionID (the Action ID can be in multiple records as a single action can call many reports, i.e. an XML file). String - XML File called String - Report Name
字符串 - 任务名称字符串 - 操作名称Int - ActionID(操作ID可以位于多个记录中,因为单个操作可以调用许多报告,即XML文件)。 String - 名为String的XML文件 - 报告名称
I will try the Merge idea given by MusiGenesis (thanks). (rereading some of the posts not sure if the Merge will work, but worth trying as I have not heard about it before so something new to learn).
我将尝试MusiGenesis给出的Merge想法(谢谢)。 (重读一些帖子不确定Merge是否会起作用,但值得尝试,因为我之前没有听说过这样的新东西要学习)。
The HashCode Idea sounds interesting as well.
HashCode Idea听起来也很有趣。
Thanks for all the advice.
感谢所有的建议。
#7
1
I found an easy way to solve this. Unlike previous "except method" answers, I use the except method twice. This not only tells you what rows were deleted but what rows were added. If you only use one except method - it will only tell you one difference and not both. This code is tested and works. See below
我找到了解决这个问题的简单方法。与之前的“除方法”答案不同,我使用except方法两次。这不仅告诉您删除了哪些行,还添加了哪些行。如果你只使用一个除了方法 - 它只会告诉你一个区别,而不是两个。此代码经过测试并可以使用。见下文
//Pass in your two datatables into your method
//build the queries based on id.
var qry1 = datatable1.AsEnumerable().Select(a => new { ID = a["ID"].ToString() });
var qry2 = datatable2.AsEnumerable().Select(b => new { ID = b["ID"].ToString() });
//detect row deletes - a row is in datatable1 except missing from datatable2
var exceptAB = qry1.Except(qry2);
//detect row inserts - a row is in datatable2 except missing from datatable1
var exceptAB2 = qry2.Except(qry1);
then execute your code against the results
然后针对结果执行代码
if (exceptAB.Any())
{
foreach (var id in exceptAB)
{
//execute code here
}
}
if (exceptAB2.Any())
{
foreach (var id in exceptAB2)
{
//execute code here
}
}
#8
1
public DataTable compareDataTables(DataTable First, DataTable Second)
{
First.TableName = "FirstTable";
Second.TableName = "SecondTable";
//Create Empty Table
DataTable table = new DataTable("Difference");
DataTable table1 = new DataTable();
try
{
//Must use a Dataset to make use of a DataRelation object
using (DataSet ds4 = new DataSet())
{
//Add tables
ds4.Tables.AddRange(new DataTable[] { First.Copy(), Second.Copy() });
//Get Columns for DataRelation
DataColumn[] firstcolumns = new DataColumn[ds4.Tables[0].Columns.Count];
for (int i = 0; i < firstcolumns.Length; i++)
{
firstcolumns[i] = ds4.Tables[0].Columns[i];
}
DataColumn[] secondcolumns = new DataColumn[ds4.Tables[1].Columns.Count];
for (int i = 0; i < secondcolumns.Length; i++)
{
secondcolumns[i] = ds4.Tables[1].Columns[i];
}
//Create DataRelation
DataRelation r = new DataRelation(string.Empty, firstcolumns, secondcolumns, false);
ds4.Relations.Add(r);
//Create columns for return table
for (int i = 0; i < First.Columns.Count; i++)
{
table.Columns.Add(First.Columns[i].ColumnName, First.Columns[i].DataType);
}
//If First Row not in Second, Add to return table.
table.BeginLoadData();
foreach (DataRow parentrow in ds4.Tables[0].Rows)
{
DataRow[] childrows = parentrow.GetChildRows(r);
// A parent row with no matching child row exists only in First.
if (childrows == null || childrows.Length == 0)
{
table.LoadDataRow(parentrow.ItemArray, true);
}
}
table.EndLoadData();
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
return table;
}
#9
0
try
{
if (ds.Tables[0].Columns.Count == ds1.Tables[0].Columns.Count)
{
for (int i = 0; i < ds.Tables[0].Rows.Count; i++)
{
for (int j = 0; j < ds.Tables[0].Columns.Count; j++)
{
// Report the row and column index of every cell that differs.
if (ds.Tables[0].Rows[i][j].ToString() != ds1.Tables[0].Rows[i][j].ToString())
{
MessageBox.Show(i.ToString() + "," + j.ToString());
}
}
}
}
else
{
MessageBox.Show("Table has different columns ");
}
}
catch (Exception)
{
MessageBox.Show("Please select The Table");
}
#10
0
I'm continuing tzot's idea ...
If you have two sortable sets, then you can just use:
List<string> diffList = new List<string>(sortedListA.Except(sortedListB));
If you need to compare more complicated objects, you can define an IEqualityComparer<T> yourself and pass it as the second argument to Except.
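As a rough illustration of that idea, here is a sketch of a custom comparer passed to Except. The Person type, the PersonIdComparer class, and the choice of Id as the identity are assumptions made only for this example; they do not come from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type used only for this illustration.
public sealed class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Treats two Person objects as equal when their Id values match.
public sealed class PersonIdComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.Id == y.Id;
    }

    // Must agree with Equals: equal Ids produce equal hash codes.
    public int GetHashCode(Person p)
    {
        return p is null ? 0 : p.Id.GetHashCode();
    }
}

public static class ExceptDemo
{
    // Returns the items of `a` that have no counterpart in `b` according to Id.
    public static List<Person> OnlyInFirst(IEnumerable<Person> a, IEnumerable<Person> b)
    {
        return a.Except(b, new PersonIdComparer()).ToList();
    }
}
```

Note that Except is a set operation, so duplicates within the first sequence (as judged by the comparer) are returned only once.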
#11
0
The usual usage scenario considers a user that has a DataTable in hand and changes it by Adding, Deleting or Modifying some of the DataRows.
After the changes are performed, the DataTable is aware of the proper DataRowState for each row, and also keeps track of the Original DataRowVersion for any rows that were changed.
In this usual scenario, one can Merge the changes back into a source table (in which all rows are Unchanged). After merging, one can get a nice summary of only the changed rows with a call to GetChanges().
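The usual scenario can be sketched roughly as follows. The table layout, the class name MergeDemo, and its method names are hypothetical and exist only for this illustration; the sketch assumes a single-column primary key so that Merge can match rows.

```csharp
using System;
using System.Data;

public static class MergeDemo
{
    // Builds a small source table in which every row is Unchanged.
    public static DataTable BuildSource()
    {
        var t = new DataTable("Items");
        t.Columns.Add("id", typeof(int));
        t.Columns.Add("item", typeof(string));
        t.PrimaryKey = new[] { t.Columns["id"] };
        t.Rows.Add(1, "a");
        t.Rows.Add(2, "b");
        t.Rows.Add(3, "c");
        t.AcceptChanges();
        return t;
    }

    // Copies the source and edits the copy: one modify, one delete, one add.
    public static DataTable EditCopy(DataTable source)
    {
        DataTable working = source.Copy();
        working.Rows.Find(1)["item"] = "changed"; // RowState becomes Modified
        working.Rows.Find(2).Delete();            // RowState becomes Deleted
        working.Rows.Add(4, "d");                 // RowState is Added
        return working;
    }

    // Merges the pending changes into the source and returns its changed rows.
    public static DataTable MergeAndSummarize(DataTable source, DataTable working)
    {
        DataTable changes = working.GetChanges(); // only the edited rows
        if (changes != null)
        {
            source.Merge(changes);
        }
        return source.GetChanges();
    }
}
```

Here working.GetChanges() contains just the three edited rows, and MergeAndSummarize is meant to show the Merge followed by GetChanges() step described above.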
In a more unusual scenario, a user has two DataTables with the same schema (or perhaps only the same columns and lacking primary keys). These two DataTables consist of only Unchanged rows. The user may want to find out which changes they need to apply to one of the two tables in order to get to the other one. That is, which rows need to be Added, Deleted, or Modified.
We define here a function called GetDelta() which does the job:
using System;
using System.Data;
using System.Xml;
using System.Linq;
using System.Collections.Generic;
public class Program
{
private static DataTable GetDelta(DataTable table1, DataTable table2)
{
// Modified2 : rows of table2 whose keys match a row1 AND whose values differ from that row1:
IEnumerable<DataRow> modified2 = (
from row1 in table1.AsEnumerable()
from row2 in table2.AsEnumerable()
where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
&& !row1.ItemArray.SequenceEqual(row2.ItemArray)
select row2);
// Modified1 : rows of table1 whose keys match a row2 AND whose values differ from that row2:
IEnumerable<DataRow> modified1 = (
from row1 in table1.AsEnumerable()
from row2 in table2.AsEnumerable()
where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
&& !row1.ItemArray.SequenceEqual(row2.ItemArray)
select row1);
// Added : row2 not in table1 AND row2 not in modified2
IEnumerable<DataRow> added = table2.AsEnumerable().Except(modified2, DataRowComparer.Default).Except(table1.AsEnumerable(), DataRowComparer.Default);
// Deleted : row1 not in table2 AND row1 not in modified1
IEnumerable<DataRow> deleted = table1.AsEnumerable().Except(modified1, DataRowComparer.Default).Except(table2.AsEnumerable(), DataRowComparer.Default);
Console.WriteLine();
Console.WriteLine("modified count =" + modified1.Count());
Console.WriteLine("added count =" + added.Count());
Console.WriteLine("deleted count =" + deleted.Count());
DataTable deltas = table1.Clone();
foreach (DataRow row in modified2)
{
// Match the unmodified version of the row via the PrimaryKey
DataRow matchIn1 = modified1.Where(row1 => table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row[keycol.Ordinal]))).First();
DataRow newRow = deltas.NewRow();
// Set the row with the original values
foreach(DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = matchIn1[dc.ColumnName];
deltas.Rows.Add(newRow);
newRow.AcceptChanges();
// Set the modified values
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
// At this point newRow.DataRowState should be : Modified
}
foreach (DataRow row in added)
{
DataRow newRow = deltas.NewRow();
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
deltas.Rows.Add(newRow);
// At this point newRow.DataRowState should be : Added
}
foreach (DataRow row in deleted)
{
DataRow newRow = deltas.NewRow();
foreach (DataColumn dc in deltas.Columns)
newRow[dc.ColumnName] = row[dc.ColumnName];
deltas.Rows.Add(newRow);
newRow.AcceptChanges();
newRow.Delete();
// At this point newRow.DataRowState should be : Deleted
}
return deltas;
}
private static void DemonstrateGetDelta()
{
DataTable table1 = new DataTable("Items");
// Add columns
DataColumn column1 = new DataColumn("id1", typeof(System.Int32));
DataColumn column2 = new DataColumn("id2", typeof(System.Int32));
DataColumn column3 = new DataColumn("item", typeof(System.Int32));
table1.Columns.Add(column1);
table1.Columns.Add(column2);
table1.Columns.Add(column3);
// Set the primary key column.
table1.PrimaryKey = new DataColumn[] { column1, column2 };
// Add some rows.
DataRow row;
for (int i = 0; i <= 4; i++)
{
row = table1.NewRow();
row["id1"] = i;
row["id2"] = i*i;
row["item"] = i;
table1.Rows.Add(row);
}
// Accept changes.
table1.AcceptChanges();
PrintValues(table1, "table1:");
// Create a second DataTable identical to the first.
DataTable table2 = table1.Clone();
// Add a row that exists in table1:
row = table2.NewRow();
row["id1"] = 0;
row["id2"] = 0;
row["item"] = 0;
table2.Rows.Add(row);
// Modify the values of a row that exists in table1:
row = table2.NewRow();
row["id1"] = 1;
row["id2"] = 1;
row["item"] = 455;
table2.Rows.Add(row);
// Modify the values of a row that exists in table1:
row = table2.NewRow();
row["id1"] = 2;
row["id2"] = 4;
row["item"] = 555;
table2.Rows.Add(row);
// Add a row that does not exist in table1:
row = table2.NewRow();
row["id1"] = 13;
row["id2"] = 169;
row["item"] = 655;
table2.Rows.Add(row);
table2.AcceptChanges();
Console.WriteLine();
PrintValues(table2, "table2:");
DataTable delta = GetDelta(table1,table2);
Console.WriteLine();
PrintValues(delta,"delta:");
// Verify that the deltas DataTable contains the adequate Original DataRowVersions:
DataTable originals = table1.Clone();
foreach (DataRow drow in delta.Rows)
{
if (drow.RowState != DataRowState.Added)
{
DataRow originalRow = originals.NewRow();
foreach (DataColumn dc in originals.Columns)
originalRow[dc.ColumnName] = drow[dc.ColumnName, DataRowVersion.Original];
originals.Rows.Add(originalRow);
}
}
originals.AcceptChanges();
Console.WriteLine();
PrintValues(originals,"delta original values:");
}
private static void Row_Changed(object sender,
DataRowChangeEventArgs e)
{
Console.WriteLine("Row changed {0}\t{1}",
e.Action, e.Row.ItemArray[0]);
}
private static void PrintValues(DataTable table, string label)
{
// Display the values in the supplied DataTable:
Console.WriteLine(label);
foreach (DataRow row in table.Rows)
{
foreach (DataColumn col in table.Columns)
{
Console.Write("\t " + row[col, row.RowState == DataRowState.Deleted ? DataRowVersion.Original : DataRowVersion.Current].ToString());
}
Console.Write("\t DataRowState =" + row.RowState);
Console.WriteLine();
}
}
public static void Main()
{
DemonstrateGetDelta();
}
}
The code above can be tested in https://dotnetfiddle.net/. The resulting output is shown below:
table1:
0 0 0 DataRowState =Unchanged
1 1 1 DataRowState =Unchanged
2 4 2 DataRowState =Unchanged
3 9 3 DataRowState =Unchanged
4 16 4 DataRowState =Unchanged
table2:
0 0 0 DataRowState =Unchanged
1 1 455 DataRowState =Unchanged
2 4 555 DataRowState =Unchanged
13 169 655 DataRowState =Unchanged
modified count =2
added count =1
deleted count =2
delta:
1 1 455 DataRowState =Modified
2 4 555 DataRowState =Modified
13 169 655 DataRowState =Added
3 9 3 DataRowState =Deleted
4 16 4 DataRowState =Deleted
delta original values:
1 1 1 DataRowState =Unchanged
2 4 2 DataRowState =Unchanged
3 9 3 DataRowState =Unchanged
4 16 4 DataRowState =Unchanged
Note that if your tables don't have a PrimaryKey, the where clause in the LINQ queries gets simplified a little bit. I'll let you figure that out on your own.
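One possible reading of the keyless case, offered as an assumption and not as the answer's own method: without a PrimaryKey there is no way to pair a row with its edited counterpart, so a changed row shows up as one deleted row plus one added row, and only those two sets need computing. The class name KeylessDelta and its method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class KeylessDelta
{
    // Rows present in table2 with no value-identical row in table1.
    public static List<DataRow> Added(DataTable table1, DataTable table2)
    {
        return table2.AsEnumerable()
                     .Except(table1.AsEnumerable(), DataRowComparer.Default)
                     .ToList();
    }

    // Rows present in table1 with no value-identical row in table2.
    public static List<DataRow> Deleted(DataTable table1, DataTable table2)
    {
        return table1.AsEnumerable()
                     .Except(table2.AsEnumerable(), DataRowComparer.Default)
                     .ToList();
    }
}
```

DataRowComparer.Default compares rows by the values of all their columns, so both tables need the same column layout for this to be meaningful.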
#12
0
You can achieve this simply using LINQ.
private DataTable CompareDT(DataTable TableA, DataTable TableB)
{
DataTable TableC = new DataTable();
try
{
var idsNotInB = TableA.AsEnumerable().Select(r => r.Field<string>(Keyfield))
.Except(TableB.AsEnumerable().Select(r => r.Field<string>(Keyfield)));
TableC = (from row in TableA.AsEnumerable()
join id in idsNotInB
on row.Field<string>(Keyfield) equals id
select row).CopyToDataTable();
}
catch (Exception ex)
{
lblresult.Text = ex.Message;
ex = null;
}
return TableC;
}
#13
0
Could you not simply compare the CSV files before loading them into DataTables?
string[] a = System.IO.File.ReadAllLines(@"csv_a.txt");
string[] b = System.IO.File.ReadAllLines(@"csv_b.txt");
// get the lines from b that are not in a
IEnumerable<string> diff = b.Except(a);
//... parse b into DataTable ...