What is the standard way to synchronize (copy) data between tables in Cassandra?

Time: 2021-08-02 07:45:13

I'm new to Cassandra. I recently watched some very good DataStax tutorials about data modeling.

As I understand it, in Cassandra we always need a separate table for each query we want to run, even for a simple query that sorts data by time or id.

That means each entity needs several tables, each holding its data organized for a query we want to run later. Say we have videos: we may then have several tables for the video entity.

First question: as I said, each query must have its own table, and we are going to sort videos in different ways by different columns.

The columns of our video table are:

video_id  |  video_title  |  video_create_year  |  director  |  timestamp

Should we now create more tables for the other sort orders we may need?

We may need to sort the table by director name (ASC | DESC), video_create_year (ASC | DESC), or video_title (ASC | DESC).

I am not sure: should we create a different table for each sort order?

Such as:

videos_by_diractor_asc
videos_by_diractor_desc
videos_by_title_asc
videos_by_title_desc

And so on...

Did I understand it correctly?

Second question: if I understood correctly, suppose I forgot to create a table that our website later needs (say one day I realize I forgot videos_by_title_asc). What should I do then? Should I write a program that copies all the data from the video table? Or does Cassandra offer some way to copy all the data when that is necessary?

I hope the question was not confusing.

1 solution

#1


2  

Okay, you understand Cassandra partially right.

I hope I understand you correctly. The primary keys of these tables would look like this:

videos_by_diractor_asc PRIMARY KEY(director)
videos_by_title_asc PRIMARY KEY(title)

But in this case you forgot one thing: the partition key. The partition key is the first part of the primary key. In your case I think the year makes sense. All rows with the same partition key are always stored on the same node (and its replicas), because Cassandra splits your rows by partition key. The primary key columns after the partition key, called clustering columns (column keys), determine the sort order of the rows inside a partition. The partition keys themselves are not sorted. This means node1 can hold the years 2015, 1998 and 1950 while node2 holds 2010, 1990 and 1577. Cassandra distributes the data evenly between the nodes.

When modelling, you have to think about one important thing: what is the expected size of the table inside one partition? In the video case, how many rows do you expect in one year? 2 million? 1 billion? If you end up with more than 2 billion cells (rows x columns) in one partition, you will have a huge problem, because 2 billion is the maximum number of cells per partition. But remember that this is the maximum. I recommend no more than 500 million, and I calculate with 500 million for the worst case.
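To make the partitioning and sizing argument concrete, here is a minimal Python sketch. It does not use Cassandra at all: `node_for` is a hypothetical stand-in for the real partitioner (it simply hashes the key), and the five-column row width is an assumption taken from the video table in the question. It only illustrates that placement depends on a hash of the partition key, not on its order, and how a rows x columns estimate compares with the 2 billion limit mentioned above.

```python
import hashlib

# Limit quoted in the answer: at most 2 billion cells (rows x columns) per partition.
MAX_CELLS_PER_PARTITION = 2_000_000_000


def node_for(partition_key, node_count):
    """Toy partitioner: hash the partition key and map it to a node index."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def cells_in_partition(rows, columns_per_row):
    """Rough size of one partition, counted in cells."""
    return rows * columns_per_row


def fits(rows, columns_per_row, budget=MAX_CELLS_PER_PARTITION):
    """True if the estimated partition stays within the cell budget."""
    return cells_in_partition(rows, columns_per_row) <= budget


if __name__ == "__main__":
    # Years land on nodes by hash, so neighbouring years need not share a node.
    for year in (2015, 1998, 1950, 2010, 1990):
        print(year, "-> node", node_for(year, 2))
    # Assumed 5 columns per video row, as in the question's table.
    print(fits(100_000_000, 5))   # 500 million cells
    print(fits(500_000_000, 5))   # 2.5 billion cells
```

With five columns per row, 100 million videos in one year give 500 million cells and stay within the budget, while 500 million videos give 2.5 billion cells and exceed the limit, so the year alone would then be too coarse a partition key.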

So now we can talk about the clustering columns (column keys). Yes, every sort order needs a new table. You also need a new table if you want to restrict the columns in your WHERE conditions in a different order. One example: you have this primary key: PRIMARY KEY(year, director, title)

The first column is the partition key, which means you always need the year in your WHERE condition. Within one partition the rows are sorted by director (ASC by default) and then by title. In this case you can't use the following WHERE condition, because it restricts title while skipping director: WHERE year = 2016 AND title = 'whatever'
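The rule described above can be sketched as a small check in Python. This is a simplified model, not Cassandra's actual query validation: the function `check_where` is hypothetical, it only considers equality restrictions, and it ignores ALLOW FILTERING and secondary indexes. It shows that the partition key must be present and that the columns after it may only be restricted from left to right without gaps.

```python
def check_where(partition_key, clustering_columns, restricted):
    """Return None if the equality restrictions are allowed, else a reason string.

    Simplified model: equality only, no ALLOW FILTERING, no secondary indexes.
    """
    restricted = set(restricted)
    unknown = restricted - set(partition_key) - set(clustering_columns)
    if unknown:
        return "not a primary key column: " + ", ".join(sorted(unknown))
    missing = [c for c in partition_key if c not in restricted]
    if missing:
        return "partition key column missing: " + ", ".join(missing)
    gap = None  # first clustering column that was left unrestricted
    for column in clustering_columns:
        if column in restricted:
            if gap is not None:
                return f"{column} is restricted but preceding column {gap} is not"
        elif gap is None:
            gap = column
    return None


if __name__ == "__main__":
    # PRIMARY KEY(year, director, title) from the example above.
    pk, clustering = ["year"], ["director", "title"]
    print(check_where(pk, clustering, ["year", "director"]))
    print(check_where(pk, clustering, ["year", "title"]))
    print(check_where(pk, clustering, ["director"]))
```

For the example key, restricting year and director passes, restricting year and title is rejected because director is skipped, and restricting only director is rejected because the partition key is missing.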

Okay, now I will answer your main question :) about the duplicated data. Since Cassandra 3.0 you can use materialized views. It's a nice feature, but it has its overhead. The best solution is to write a wrapper around Cassandra. This wrapper does only one thing: it handles all the duplicated data. It knows the best way to access the data when you need it sorted by title and then by director rather than by director and then by title. Also, don't worry about writing the same data 5 or more times. Cassandra is optimized for writes, so writing data is fine. But don't forget one thing: Cassandra is a database for known queries. If you know that you will need the data in this sort order really often, create a table for it. But if you don't know that and would create the table only just in case, don't create it. For such occasional queries you can use Spark or another solution.
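A wrapper of the kind described above might look roughly like the following Python sketch. It is an in-memory model only and makes several assumptions that are not in the answer: the class name `VideoStore`, its methods, the use of `year` as the partition key for every table, and unique `video_id` values are all hypothetical. A real wrapper would issue one write per table through a Cassandra driver instead of updating Python lists. The sketch also shows one way to handle a table that was forgotten at first: register it later and backfill it from the rows already written.

```python
from bisect import insort


class VideoStore:
    """Toy model: one 'table' per query, each kept sorted by its own key."""

    def __init__(self):
        self.base = []    # every video as written, the source for backfills
        self.tables = {}  # table name -> (key function, {year: sorted rows})

    def add_table(self, name, key):
        """Register a query table and backfill it from the existing rows."""
        self.tables[name] = (key, {})
        for video in self.base:
            self._insert(name, video)

    def write(self, video):
        """Write one video to the base list and to every query table."""
        self.base.append(video)
        for name in self.tables:
            self._insert(name, video)

    def _insert(self, name, video):
        key, partitions = self.tables[name]
        rows = partitions.setdefault(video["year"], [])
        # video_id (assumed unique) breaks ties so dicts are never compared.
        insort(rows, (key(video), video["video_id"], video))

    def read(self, name, year):
        """Return the videos of one partition in that table's sort order."""
        _, partitions = self.tables[name]
        return [video for _, _, video in partitions.get(year, [])]


if __name__ == "__main__":
    store = VideoStore()
    store.add_table("videos_by_director", lambda v: (v["director"], v["title"]))
    store.write({"video_id": 1, "year": 2016, "director": "Baker", "title": "Alpha"})
    store.write({"video_id": 2, "year": 2016, "director": "Adams", "title": "Beta"})
    # A table that was forgotten at first: added later and backfilled.
    store.add_table("videos_by_title", lambda v: (v["title"], v["director"]))
    print([v["video_id"] for v in store.read("videos_by_director", 2016)])
    print([v["video_id"] for v in store.read("videos_by_title", 2016)])
```

The callers only ever use `write` and `read`, so the knowledge of which tables exist and how each one is sorted stays in one place, which is the point of the wrapper.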

And one more thing: if you only need to query data by a single thing, such as only by title or only by director, don't use Cassandra for it. That is the main feature of a key-value store.
