I have a question pertaining to SQLAlchemy, database sharding, and UUIDs for you fine folks.
I'm currently using MySQL in which I have a table of the form:
CREATE TABLE foo (
    added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    id BINARY(16) NOT NULL,
    ... other stuff ...
    UNIQUE KEY(id)
);
A little background on this table. I never care about 'added_id' itself; I'm only using it to ensure that inserted items are clustered together on disk (since the B-tree used to index the table in MySQL uses the primary key as the clustered index). The 'id' column contains the binary representation of a UUID -- this is the column I actually care about, and all other things reference this ID. Again, I don't want the UUID to be the primary key, since the UUID is random, which would give the B-tree built to index the table horrible I/O characteristics (at least, that is what has been said elsewhere). Also, although UUID1 includes a timestamp to ensure that IDs are generated in "sequential" order, the inclusion of the MAC address in the ID makes it something I'd rather avoid. Thus, I'd like to use UUID4s.
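(For concreteness, the 16-byte value stored in the 'id' column is just the raw bytes of a Python UUID4 -- a minimal standard-library sketch:)

```python
import uuid

# A random (version 4) UUID; its .bytes attribute is the 16-byte raw
# form that the BINARY(16) 'id' column stores.
new_id = uuid.uuid4()
id_bytes = new_id.bytes
assert len(id_bytes) == 16

# Round-trip the raw bytes back into a UUID when reading a row.
assert uuid.UUID(bytes=id_bytes) == new_id
```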
OK, now moving on to the SQLAlchemy part. In SQLAlchemy, one can define a model for the above table using its ORM by doing something like:
# The SQLAlchemy ORM base class
Base = declarative_base()

# The model for table 'foo'
class Foo(Base):
    __tablename__ = 'foo'
    added_id = Column(Integer, primary_key=True, nullable=False)
    id = Column(LargeBinary(16), index=True, unique=True, nullable=False)
    ...
Again, this is basically the same as the SQL above.
And now to the question. Let's say that this database is going to be sharded (horizontally partitioned) into two (or more) separate databases. Now (assuming no deletions), each of these databases will have records in table foo with added_id of 1, 2, 3, etc. Since SQLAlchemy uses a session to manage the objects being worked on, such that each object is identified only by its primary key, it seems like I could end up trying to access two Foo objects from two different shards with the same added_id, resulting in some conflict in the managed session.
Has anyone run into this issue? What have you done to solve it? Or, more than likely, am I missing something from the SQLAlchemy documentation that ensures this cannot happen? However, looking at the sharding example provided with the SQLAlchemy download (examples/sharding/attribute_shard.py), they seem to side-step this issue by designating one of the database shards as an ID generator... creating an implicit bottleneck, as all INSERTs have to go against that single database to get an ID. (They also mention using UUIDs, but apparently those cause the performance issue for the indexes.)
Alternatively, is there a way to set the UUID as the primary key and still have the data clustered on disk by added_id? If it's not possible in MySQL, is it possible in another DB like Postgres?
Thanks in advance for any and all input!
--- UPDATE --- I just want to add an out-of-band answer that I received to this question. The following text isn't something I wrote; I'm including it here in case someone finds it useful.
The easiest way to avoid that situation with MySQL and auto-increment keys is to use a different auto-increment offset for each database, e.g.:
ALTER TABLE foo AUTO_INCREMENT=100000;
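Relatedly, MySQL's auto_increment_increment and auto_increment_offset server variables let each shard generate a disjoint, interleaved ID sequence. The arithmetic looks like this (illustrative Python sketch, not from the original answer):

```python
# With auto_increment_increment set to the shard count and
# auto_increment_offset set per shard, every shard produces a
# non-overlapping, interleaved sequence of IDs.
NUM_SHARDS = 2  # plays the role of auto_increment_increment

def shard_ids(offset, count):
    """First `count` IDs a shard would generate when configured with
    auto_increment_offset=offset, auto_increment_increment=NUM_SHARDS."""
    return [offset + i * NUM_SHARDS for i in range(count)]

print(shard_ids(1, 4))  # shard 1: [1, 3, 5, 7]
print(shard_ids(2, 4))  # shard 2: [2, 4, 6, 8]
```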
The downside is that you need to take care in how you configure each shard, and you need to plan a bit with respect to the total number of shards you will use.
There isn't any way to convince MySQL to use a non-primary key for the clustered index. If you don't care about using SQLAlchemy to manage your database schema (although you probably should), you can simply set the UUID as the primary key in the SQLAlchemy schema and leave added_id as the primary key in the actual table.
I've also seen alternate solutions that simply use an external server (e.g. Redis) to maintain the row ID.
1 Answer
Yes, you can specify any of the table's columns as the primary key for the purposes of the mapping, using the "primary_key" mapper argument, which accepts a list of Column objects or a single Column:
Base = declarative_base()

# The model for table 'foo'
class Foo(Base):
    __tablename__ = 'foo'
    added_id = Column(Integer, primary_key=True, nullable=False)
    id = Column(LargeBinary(16), index=True, unique=True, nullable=False)
    __mapper_args__ = {'primary_key': id}
Above, while the SQLAlchemy Core will treat "added_id" as the "autoincrement" column, the mapper will be mostly uninterested in it, instead using "id" as the column it cares about when considering the "identity" of the object.
See the documentation for mapper() for more description.
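Putting the pieces together, here is an end-to-end sketch of the mapper-level primary key override (my own illustration, not the answerer's code; it uses an in-memory SQLite database and SQLAlchemy 1.4+ names such as LargeBinary and Session.get):

```python
import uuid

from sqlalchemy import Column, Integer, LargeBinary, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Foo(Base):
    __tablename__ = 'foo'
    # Core-level primary key: keeps inserts clustered by insertion order.
    added_id = Column(Integer, primary_key=True, nullable=False)
    id = Column(LargeBinary(16), index=True, unique=True, nullable=False)
    # Mapper-level primary key: object identity is the UUID column.
    __mapper_args__ = {'primary_key': [id]}

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

id_bytes = uuid.uuid4().bytes
session.add(Foo(id=id_bytes))
session.commit()

# Lookups by identity now use the UUID bytes, not added_id.
foo = session.get(Foo, id_bytes)
assert foo is not None and foo.id == id_bytes
```

With this mapping, objects loaded from different shards no longer collide in the session's identity map on added_id, since the identity is the (globally unique) UUID.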