Does Cassandra recommend user-defined data types in terms of performance?

Date: 2021-08-25 16:32:35

I have a Cassandra Customers table which is going to keep a list of customers. Every customer has an address which is a list of standard fields:

{
   CustomerName: "",
   etc...,
   Address: {
              street: "",
              city: "",
              province: "",
              etc...
            }
}

My question is: if I have a million customers in this table and I use a user-defined data type Address to keep the address information for each customer in the Customers table, what are the implications of such a model, especially in terms of disk space? Is this going to be very expensive? Should I use the Address user-defined data type, flatten the address information, or even use a separate table?

1 answer

#1


3  

Essentially, Cassandra serializes each address instance into a blob and stores it as a single column of your customer table. I don't have numbers at hand for how much overhead the serialization adds in disk space or CPU usage, but it probably won't make a big difference for your use case. You should test both models to be sure.
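To get a rough feel for the serialization overhead, here is a minimal Python sketch. It assumes the UDT value is laid out the way the CQL native protocol describes it: each field is written as a 4-byte big-endian length followed by the field's bytes, with a length of -1 for null. The function name and the sample address are made up for illustration, and the sketch only models the value payload; it says nothing about per-cell metadata or compression on disk, so it does not replace testing both models.

```python
import struct


def serialize_frozen_udt(fields):
    """Encode text fields as length-prefixed values (assumed UDT layout).

    Each field becomes a 4-byte big-endian signed length followed by its
    UTF-8 bytes; a null field is written as length -1 with no payload.
    """
    out = bytearray()
    for value in fields:
        if value is None:
            out += struct.pack(">i", -1)
        else:
            raw = value.encode("utf-8")
            out += struct.pack(">i", len(raw)) + raw
    return bytes(out)


# Hypothetical sample address: street, city, province.
address = ["1 Main Street", "Toronto", "Ontario"]

blob = serialize_frozen_udt(address)
raw_bytes = sum(len(v.encode("utf-8")) for v in address)
overhead = len(blob) - raw_bytes  # 4 bytes of length prefix per field

print(f"payload={raw_bytes} bytes, serialized={len(blob)} bytes, overhead={overhead} bytes")
```

Under this assumption the overhead is a fixed 4 bytes per field, so for a handful of address fields it stays small next to the text itself.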

Edit: Another aspect I should have mentioned: because a UDT is handled as a single blob, any update has to replace the complete UDT. This is less efficient than updating individual columns and is a potential cause of inconsistencies: with concurrent updates, one write can overwrite the other's changes. See CASSANDRA-7423.
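The lost-update problem can be illustrated with a small in-memory model. This is plain Python, not Cassandra or driver code: the dictionaries stand in for a stored address, and the two helper functions are hypothetical stand-ins for "write the whole UDT" versus "write one flattened column".

```python
def replace_whole(stored, new_value):
    """Whole-value write: the stored address is replaced entirely."""
    return dict(new_value)


def update_field(stored, field, value):
    """Per-column write: only the named field changes."""
    merged = dict(stored)
    merged[field] = value
    return merged


stored = {"street": "1 Main Street", "city": "Toronto", "province": "Ontario"}

# Two clients read the same starting state and each change one field.
client_a = dict(stored, street="9 King Street")
client_b = dict(stored, city="Ottawa")

# Whole-value writes: the later write wins and A's street change is lost.
after_whole = replace_whole(replace_whole(stored, client_a), client_b)

# Per-column writes: both changes survive because they touch different fields.
after_fields = update_field(
    update_field(stored, "street", "9 King Street"), "city", "Ottawa"
)

print(after_whole)
print(after_fields)
```

In the whole-value case the street silently reverts to its old value, which is the kind of inconsistency described above; flattening the address into separate columns avoids it for updates that touch different fields.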
