We are currently scraping structured data from a variety of different sources. Before ingesting new data into our table, we check to see if the data_id exists already.
IF NOT EXISTS (SELECT TOP 1 * FROM TABLE_NAME WHERE DATA_ID=@P0)
We have no indexes; however, we do have a PK set on our id column, which seems unnecessary. Should we remove it to improve insert speed?
Our server is currently at full load checking through roughly 3 million rows of data to make sure we are not inserting duplicates. We have tried upgrading our SQL Server tier for more DTUs, but that doesn't seem to help at all.
When we have multiple jobs running at the same time checking for unique data, SQL Server slows to a crawl and inserts take forever.
Should we get rid of this unique-data check, create a new table for every scraping job, and then use a SQL query to compare the differences, such as new data or data that was removed?
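For context, the staging-table alternative amounts to a set-difference query. The sketch below uses Python's built-in sqlite3 purely for illustration (SQL Server supports the same EXCEPT operator); the table and column names are made up.

```python
import sqlite3

# Illustrative only: "staging" stands in for a per-job scrape table,
# "main_table" for the accumulated data. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_table (data_id INTEGER)")
conn.execute("CREATE TABLE staging (data_id INTEGER)")
conn.executemany("INSERT INTO main_table VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO staging VALUES (?)", [(2,), (3,), (4,)])

# Rows in the new scrape that are not yet in the main table:
new_rows = conn.execute(
    "SELECT data_id FROM staging EXCEPT SELECT data_id FROM main_table"
).fetchall()
# Rows that disappeared from the source since the last scrape:
removed = conn.execute(
    "SELECT data_id FROM main_table EXCEPT SELECT data_id FROM staging"
).fetchall()
print(new_rows, removed)  # [(4,)] [(1,)]
```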
Query used for conditional insertion:
String sql = "IF NOT EXISTS (SELECT TOP 1 * FROM A_PROV_CVV_LDG_1 WHERE DATA_ID=?) " +
"INSERT INTO A_PROV_CVV_LDG_1 (DATA_ID, SourceID, BASE_ID, BIN, BANK, CARD_TYPE, CARD_CLASS," +
" CARD_LEVEL, CARD_EXP, COUNTRY, STATE, CITY, ZIP, DOB, SSN, EMAIL, PHONE, GENDER, ADDR_LINE_1, ADDR_LINE_2," +
" FIRST_NAME, LAST_NAME, DateAddedToMarket, PRICE) " +
"VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
This is the entire table definition. There are no indexes; the only PK is on 'id', which seems unnecessary.
+-------------------+--------------+-----------+
| (PK) id           | int          | Unchecked |
| DATA_ID           | int          | Checked   |
| SourceID          | int          | Checked   |
| BASE_ID           | varchar(255) | Checked   |
| BIN               | varchar(255) | Checked   |
| BANK              | varchar(255) | Checked   |
| CARD_TYPE         | varchar(255) | Checked   |
| CARD_CLASS        | varchar(255) | Checked   |
| CARD_LEVEL        | varchar(255) | Checked   |
| CARD_EXP          | varchar(255) | Checked   |
| COUNTRY           | varchar(255) | Checked   |
| STATE             | varchar(255) | Checked   |
| CITY              | varchar(255) | Checked   |
| ZIP               | varchar(255) | Checked   |
| DOB               | varchar(255) | Checked   |
| SSN               | varchar(255) | Checked   |
| EMAIL             | varchar(255) | Checked   |
| PHONE             | varchar(255) | Checked   |
| GENDER            | varchar(255) | Checked   |
| ADDR_LINE_1       | varchar(255) | Checked   |
| ADDR_LINE_2       | varchar(255) | Checked   |
| FIRST_NAME        | varchar(255) | Checked   |
| LAST_NAME         | varchar(255) | Checked   |
| PRICE             | varchar(255) | Checked   |
| DateAddedToMarket | varchar(255) | Checked   |
| DateAdded         | datetime     | Unchecked |
+-------------------+--------------+-----------+
3 Answers
#1
0
You absolutely need a unique index on DATA_ID for your query, indeed for any deduplication attempt on DATA_ID, to work efficiently. Without it, every attempted insert does a full table scan.
Yes, indexes slow down insertion a little bit. But an index on an integer column isn't very expensive. Certainly not compared to the mess you're in now with a table scan for every insertion. Create that index.
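A minimal sketch of the effect, using Python's built-in sqlite3 for illustration (the T-SQL equivalent would be something like CREATE UNIQUE INDEX UX_TableName_DATA_ID ON TABLE_NAME (DATA_ID); all names here are hypothetical): with the unique index in place, the existence check becomes an index seek rather than a table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraped (id INTEGER PRIMARY KEY, data_id INTEGER)")
# The unique index both enforces deduplication and makes the lookup a seek.
conn.execute("CREATE UNIQUE INDEX ux_scraped_data_id ON scraped (data_id)")
conn.execute("INSERT INTO scraped (data_id) VALUES (?)", (42,))

exists = conn.execute(
    "SELECT 1 FROM scraped WHERE data_id = ?", (42,)
).fetchone() is not None

# The query plan now searches the index instead of scanning the table.
plan = str(conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM scraped WHERE data_id = ?", (42,)
).fetchall())
print(exists)                         # True
print("ux_scraped_data_id" in plan)   # True: the index is used
```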
#2
1
If the server is busy, the statement IF NOT EXISTS (SELECT TOP 1 * FROM TABLE_NAME WHERE DATA_ID=@P0) might be blocked, or disk requests may be queued. Run sp_who2 to check whether there is blocking. If this is the only routine that puts data into the table, add WITH (NOLOCK) and select NULL instead of anything unnecessary:
IF NOT EXISTS (SELECT null FROM TABLE_NAME WITH (NOLOCK) WHERE DATA_ID=@P0)
#3
0
This construct:
IF NOT EXISTS (SELECT TOP 1 * FROM A_PROV_CVV_LDG_1 WHERE DATA_ID=?)
INSERT INTO A_PROV_CVV_LDG_1 . . .
is an anti-pattern. It attempts to prevent duplicates in application code; however, it suffers from race conditions. You should let the database enforce data integrity checks where it can.
Instead, implement a unique constraint/index to prevent duplicates:
alter table A_PROV_CVV_LDG_1 add constraint unq_A_PROV_CVV_LDG_1_data_id
unique (data_id);
This does mean that you need to catch an error if you try to insert a duplicate value. That is easy enough in SQL Server using TRY/CATCH blocks.
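The insert-and-catch pattern can be sketched as follows, using Python's built-in sqlite3 for illustration (in SQL Server the duplicate raises error 2627/2601, which you would handle in a TRY/CATCH block, or catch as a SQLException from Java via JDBC):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (data_id INTEGER, CONSTRAINT unq_data_id UNIQUE (data_id))"
)

def insert_if_new(data_id):
    """Insert a row; return False if the unique constraint rejects a duplicate."""
    try:
        conn.execute("INSERT INTO t (data_id) VALUES (?)", (data_id,))
        return True
    except sqlite3.IntegrityError:
        return False

print(insert_if_new(1))  # True: new row inserted
print(insert_if_new(1))  # False: duplicate rejected by the constraint
```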