I will describe my situation first in order to make the following question as clear as possible.
For simplicity, let's say I have a table in a MySQL database (InnoDB) with records about dogs, structured as follows:
dog_id (PK) | dog_name
There are 10,000,000 rows in the table (each represents a unique dog), and there is an index built on the dog_name column.
My program searches through vet records that I need to process. Each record is connected to a dog, and there are about 100 records for each dog. I want to find the dogs that have not been inserted into the database yet.
That means that 100 times in a row the record being processed can be about a dog that is already in the database, so the dog doesn't have to be added. But sometimes (at the 1:100 ratio mentioned above) I need to add a dog to the database because it is the first time the program has come across a record about that dog. (I hope this example makes my situation clear.)
My question is: what is the most efficient way to verify that the dog has not been inserted into the database yet?
- Load all the dog names (suppose all the dogs in the world have unique names) into the program's memory as a set and check whether the dog is in the set. If it is in the set, I skip the record; if it is not, I insert the dog.
- Define the column as UNIQUE and try to insert every record. When the database raises an error because of the uniqueness constraint, I just skip the dog and continue.
- Query the database every time I process a record to find out whether the dog is already there. If it is, I skip the record; if it is not, I insert the dog into the table.
To give you as much information as I can: I use Python, SQLAlchemy, MySQL and InnoDB.
2 Answers
#1
1
You should use dog_name as the primary key, and then use
INSERT INTO dogs (dog_name) VALUES ('[NAME HERE]') ON DUPLICATE KEY UPDATE dog_name='[NAME HERE]';
This will only insert unique dog names. If you still want a numerical ID for each dog, you can set that column to AUTO_INCREMENT (MySQL requires an AUTO_INCREMENT column to be indexed, so give it its own UNIQUE key), but the primary key should be the dog name (assuming all names are unique).
The SQLAlchemy ORM has no direct equivalent of this statement, but session.merge() gets a similar result: it looks the row up by primary key and then issues either an INSERT or an UPDATE.
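To see what an insert-or-update statement does with repeated names, here is a minimal sketch. It rests on assumptions: Python's built-in sqlite3 module with an in-memory database stands in for MySQL/InnoDB so the code runs anywhere, SQLite's ON CONFLICT clause stands in for MySQL's ON DUPLICATE KEY UPDATE, and the helper name ensure_dog is made up for the example. The name is passed as a bound parameter instead of being pasted into the SQL text.

```python
import sqlite3

# Assumption: sqlite3 stands in for MySQL; the upsert syntax differs between the two.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (dog_name TEXT PRIMARY KEY)")

# SQLite's counterpart of MySQL's ON DUPLICATE KEY UPDATE (needs SQLite 3.24+).
UPSERT = (
    "INSERT INTO dogs (dog_name) VALUES (?) "
    "ON CONFLICT (dog_name) DO UPDATE SET dog_name = excluded.dog_name"
)


def ensure_dog(name):
    """Insert the dog if it is new; otherwise rewrite the same value."""
    conn.execute(UPSERT, (name,))


# Four records, but only two distinct dogs.
for name in ["Rex", "Bella", "Rex", "Rex"]:
    ensure_dog(name)

names = [row[0] for row in conn.execute("SELECT dog_name FROM dogs ORDER BY dog_name")]
```

After the loop the table holds one row per distinct name, and no exception handling was needed. With MySQL and SQLAlchemy 1.2 or later, the dialect-specific insert() construct in sqlalchemy.dialects.mysql offers an on_duplicate_key_update() method that emits the MySQL form of this statement.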
#2
1
Something like option 2 or option 3 will work best; they should take similar amounts of time, and which one wins will depend on exactly how MySQL/InnoDB decides that a collision has occurred. I don't actually know; it's possible that insert with a UNIQUE key triggers the same operation as a SELECT. Prototype both and profile performance.
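A prototype for that comparison could look like the sketch below. It is only a harness under assumptions: sqlite3 with an in-memory database replaces MySQL/InnoDB, so the timings say nothing about MySQL itself, and the function names and the tiny data set are invented for the example. To get meaningful numbers, swap in a real connection and realistic data.

```python
import sqlite3
import time


def make_db(existing):
    """Create a throwaway in-memory table pre-filled with the given names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dogs (dog_id INTEGER PRIMARY KEY, dog_name TEXT UNIQUE)")
    conn.executemany("INSERT INTO dogs (dog_name) VALUES (?)", ((n,) for n in existing))
    return conn


def insert_and_catch(conn, name):
    """Option 2: always insert and rely on the UNIQUE constraint."""
    try:
        conn.execute("INSERT INTO dogs (dog_name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        pass  # the dog is already stored


def select_then_insert(conn, name):
    """Option 3: look the name up first, insert only when it is missing."""
    cur = conn.execute("SELECT 1 FROM dogs WHERE dog_name = ? LIMIT 1", (name,))
    if cur.fetchone() is None:
        conn.execute("INSERT INTO dogs (dog_name) VALUES (?)", (name,))


def profile(strategy, existing, records):
    """Return (seconds, final row count) for one strategy on a fresh table."""
    conn = make_db(existing)
    start = time.perf_counter()
    for name in records:
        strategy(conn, name)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM dogs").fetchone()[0]
    conn.close()
    return elapsed, count


existing = [f"dog{i}" for i in range(1000)]
# 20 dogs with 100 records each; 10 of them are already stored, 10 are new.
records = [f"dog{i}" for i in range(990, 1010) for _ in range(100)]

results = {
    fn.__name__: profile(fn, existing, records)
    for fn in (insert_and_catch, select_then_insert)
}
for label, (seconds, count) in results.items():
    print(f"{label}: {seconds:.4f}s, {count} rows")
```

Both strategies must end with the same row count; only the elapsed time is expected to differ, and which one is faster depends on the database engine and the data.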
If performance is an issue, you can always hand-code the SELECT statement, since it is relatively simple. This cuts out the Python overhead of constructing the SQL; that is normally not a huge issue, but SQLAlchemy can add dozens of layers of function calls to support its ability to construct arbitrary queries. You can short-circuit those calls by writing the SQL string yourself.
Assuming that 's' is your SQLAlchemy Session object:
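from sqlalchemy import text

def dog_in_db(dog_name):
    # Bind the name as a parameter instead of formatting it into the SQL string.
    q = text('SELECT COUNT(*) FROM dogs WHERE dog_name = :name')
    res = s.execute(q, {'name': dog_name})
    return res.first()[0] > 0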
You could also try SELECTing the row itself and checking whether anything is returned:
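from sqlalchemy import text

def dog_in_db(dog_name):
    q = text('SELECT dog_id FROM dogs WHERE dog_name = :name LIMIT 1')
    res = s.execute(q, {'name': dog_name})
    return res.first() is not None  # None means no matching row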
Assuming that your option 1 means loading all of the names from the database, it will be slow. MySQL will always perform any single operation it supports faster than Python can; and what you're doing here is exactly the same single operation (finding a member in a list).
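For reference, option 1 as described in the question amounts to something like the following sketch. The assumptions are the same kind as before: sqlite3 in memory stands in for MySQL, and the sample names are invented. The single SELECT that fills the set is the step that has to transfer every stored name, which is where the cost lies with 10,000,000 rows.

```python
import sqlite3

# Assumption: sqlite3 stands in for MySQL so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (dog_id INTEGER PRIMARY KEY, dog_name TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO dogs (dog_name) VALUES (?)", [("Rex",), ("Bella",)]
)

# One up-front query that pulls every stored name into memory.
known = {name for (name,) in conn.execute("SELECT dog_name FROM dogs")}

inserted = []
for name in ["Rex", "Max", "Bella", "Max"]:
    if name in known:
        continue  # already stored, skip the record
    conn.execute("INSERT INTO dogs (dog_name) VALUES (?)", (name,))
    known.add(name)  # keep the set in step with the table
    inserted.append(name)
```

The set is only trustworthy while this program is the sole writer to the table; if anything else inserts dogs in the meantime, the UNIQUE constraint is still needed as a safety net.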