How to model a migration from MySQL to Cassandra

Date: 2021-07-20 16:53:08

I am trying to move from MySQL to Cassandra for a music service application I am building.

I have read the following Stack Exchange question: MySQL Data Model to Cassandra Help?

and checked out https://wiki.apache.org/cassandra/DataModel, as well as the DataStax Cassandra modeling example built around a music service. However, the documentation I have found so far is so brief and narrow that I can't yet let go of MySQL-style queries, so I need some help.

This is my albums table, which works so far in MySQL:

CREATE TABLE `albums` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(150) NOT NULL,
  `description` varchar(300) NOT NULL,
  `release_date` int(10) unsigned NOT NULL,
  `status` enum('active','inactive','pending') NOT NULL,
  `licensor_id` int(11) NOT NULL,
  `score` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `status` (`status`),
  KEY `licensor_id` (`licensor_id`)
  -- KEY `batch_id` (`batch_id`) omitted here: no batch_id column is defined in this excerpt
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1720100 ;

I also have one-to-many relationships with the following tables: artists (many artists to one album), genres (many genres to one album), and songs (one album contains many songs).

I have many pivot tables to tie these together.

Because Cassandra doesn't allow joins, I figured that using set, list, and map columns would help me get to the proper dataset.

At first, my thought was to solve the mapping by just reusing the same table:

CREATE TABLE `albums` (
  `id` int(10) ,
  `title` varchar(150) ,
  `description` varchar(300) ,
  `release_date` date ,
  `status` enum('active','inactive','pending') ,
  `licensor_id` int(11) ,
  `data_source_provider_id` int(10) ,
  `score` int(10)
  `genre` <set>
  `artist` <set>
  PRIMARY KEY (`id`),
) ;

(Apologies if the above is not correct Cassandra syntax; I've only just begun installing it on a dev system.)

My queries are the following:

  1. Give me all albums sorted by score (descending)
  2. Give me all albums from a particular genre, sorted by score
  3. Give me all albums from a particular artist, sorted by score
  4. Give me all albums sorted by release date, then by score

In SQL, all four are easy with joins. However, since Cassandra doesn't allow joins, I figured my model was adequate, except that #4 cannot be satisfied (there is no double ORDER BY as far as I can tell).

Multiple indexes are slow, especially on a large dataset (there are 1.8M records for now, and I plan to load at least triple that amount, which is why Cassandra would be useful).

My questions are:

1) Is my approach to moving from MySQL to Cassandra correct, despite being stuck on my 4 queries, or am I doing it wrong? (I have done some Active Record work with MongoDB before, where a document can contain a sub-entity, but Cassandra only has set, list, and map.)

2) Suppose I want to expand my model to: "I want to create a list X that contains a predefined number of elements from the albums table." Would tagging each album with a new field "tag" that holds X be the smart way to filter, or would it be better to create a new table that contains all the elements I need and just query that?

1 Answer

#1

The general advice for Cassandra is to design your tables based on your queries. Don't be shy about writing the same data to multiple tables if some of those queries are not compatible with each other. (Twitter, for example, would write each tweet to a table of all the followers of that user.)
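
To make the "same data in multiple tables" idea concrete, here is a minimal Python sketch of the write-side fan-out. It does not talk to Cassandra at all; the table names albums_by_id and albums_by_artist, the Album class, and the fan_out helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Album:
    id: str
    title: str
    score: int
    artists: FrozenSet[str]


def fan_out(album: Album) -> Dict[str, List[dict]]:
    """Return the rows to write, grouped by (hypothetical) table name."""
    base = {"id": album.id, "title": album.title, "score": album.score}
    return {
        # One row in the lookup-by-id table.
        "albums_by_id": [dict(base, artists=sorted(album.artists))],
        # One duplicated row per artist, so every artist partition is complete.
        "albums_by_artist": [
            dict(base, artist=name) for name in sorted(album.artists)
        ],
    }


album = Album("a1", "Duets", 87, frozenset({"Alice", "Bob"}))
writes = fan_out(album)
```

Each query gets its own table, and the application pays for that with extra writes instead of a join at read time.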

That said, looking at your queries, your challenge will be that Cassandra does not inherently have a way of handling some of your sorting needs. You will need to add an analytics engine like Spark or Hadoop's M/R to sort on a non-unique (constantly changing?) field like score.

Let's look at some table definitions that will be a good start. Then you can determine if you need a full blown distributed analytics engine or whether locally sorting the results of the query will be enough.

CREATE TABLE albums(
  id uuid,
  title text,
  description text,
  releasedate timestamp,
  status text,
  licensor_id varint,
  data_source_provider_id varint,
  score int,  -- not a counter: counter columns cannot share a table with regular columns
  genre set<text>,
  artist set<text>,
  PRIMARY KEY (id)
);

This table stores all your albums by id. For your use case, selecting all the albums and sorting them by score is definitely out of the question. You could potentially do something clever, like modulo-ing the score and putting the albums in buckets, but I'm not convinced that would scale. Any of your queries could be answered using this table plus analytics, but in the interest of completeness, let's look at some other options for putting your data in Cassandra. Each of the following tables could readily reduce the load of any analytics jobs you run with additional parameters (like a range of dates or a set of genres).
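
For what the bucket idea might look like, here is a rough Python sketch. It is only one reading of the suggestion: it assumes the bucket is derived by integer division of the score (so nearby scores land together) rather than a literal modulo, and BUCKET_WIDTH, bucket_for, and top_albums are made-up names. In Cassandra the bucket would presumably become the partition key; here a plain dict stands in for the partitions.

```python
from collections import defaultdict

BUCKET_WIDTH = 100  # assumed width; would need tuning against real score ranges


def bucket_for(score: int) -> int:
    """Map a score to its bucket, e.g. 950 -> 9 when the width is 100."""
    return score // BUCKET_WIDTH


def top_albums(albums, limit):
    """Read buckets from highest to lowest and sort only inside each bucket."""
    buckets = defaultdict(list)
    for album in albums:
        buckets[bucket_for(album["score"])].append(album)

    result = []
    for bucket in sorted(buckets, reverse=True):
        result.extend(
            sorted(buckets[bucket], key=lambda a: a["score"], reverse=True)
        )
        if len(result) >= limit:
            break  # lower buckets cannot contain higher scores
    return result[:limit]


albums = [
    {"title": "A", "score": 950},
    {"title": "B", "score": 120},
    {"title": "C", "score": 990},
    {"title": "D", "score": 480},
]
best = top_albums(albums, 2)
```

Note that a changing score would move an album between buckets, which means a delete plus an insert, and that is part of why this may not scale.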

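CREATE TABLE albums_by_artist(  -- renamed: two tables cannot both be called albums
  id uuid,
  title text,
  description text,
  releasedate timestamp,
  status text,
  licensor_id varint,
  data_source_provider_id varint,
  score int,  -- not a counter: counter columns cannot share a table with regular columns
  genre set<text>,
  artist text,
  PRIMARY KEY (artist, releasedate, title)
);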

Cassandra can automatically sort rows on clustering columns, which are part of the primary key and therefore cannot be updated in place. The table above will store each artist's albums in a separate partition (each partition is colocated on the same nodes in your cluster and replicated according to your replication factor). If an album has multiple artists, its record is duplicated under each artist's entry, and that's OKAY. The second and third keys (releasedate and title) are the clustering columns. Cassandra will sort the albums first by releasedate and then by title (for the other priority, reverse their order above). Each combination of artist, releasedate, and title is logically one row (although on disk, the rows are stored together as one wide row per artist). For a single artist, you can probably sort the entries by score locally, without direct intervention from the database.
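
Sorting locally for one artist can be as small as the following Python sketch. The sample rows are invented and stand in for whatever your driver returns for a single artist partition; sort_by_score_desc is a hypothetical helper name.

```python
from operator import itemgetter

# Rows as they might come back for one artist: ordered by releasedate,
# then title (the clustering order), but not by score.
rows = [
    {"releasedate": "2009-03-01", "title": "First", "score": 40},
    {"releasedate": "2012-06-15", "title": "Second", "score": 95},
    {"releasedate": "2015-11-20", "title": "Third", "score": 70},
]


def sort_by_score_desc(rows):
    """Return a new list ordered by score, highest first."""
    return sorted(rows, key=itemgetter("score"), reverse=True)


ranked = sort_by_score_desc(rows)
```

Since one artist rarely has more than a few hundred albums, sorting in application memory like this should be cheap.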

Sorting by release date can be accomplished with a similar table by changing the PRIMARY KEY to: PRIMARY KEY (releasedate, ..?). In this case, however, you will probably face a challenge sorting locally if you need to cover a substantial range of release dates.
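
As a sketch of how query #4 could be assembled on the client, assume one partition per release date and walk the dates in order, sorting each partition by score. The dict below merely imitates those partitions, newest-first ordering is my assumption, and albums_newest_first is a hypothetical name.

```python
from datetime import date

# One entry per release date, imitating one partition per date.
partitions = {
    date(2020, 5, 1): [
        {"title": "A", "score": 10},
        {"title": "D", "score": 30},
    ],
    date(2021, 1, 9): [
        {"title": "B", "score": 55},
        {"title": "C", "score": 80},
    ],
}


def albums_newest_first(partitions):
    """Yield albums by release date (newest first), then by score (highest first)."""
    for day in sorted(partitions, reverse=True):
        # Only one date's worth of albums is sorted at a time.
        yield from sorted(partitions[day], key=lambda a: a["score"], reverse=True)


ordered = [album["title"] for album in albums_newest_first(partitions)]
```

The cost grows with the number of dates you have to visit, which is the challenge mentioned above for wide date ranges.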

Finally, don't try something similar for genre. A genre contains too many albums to fit comfortably in a single partition. Hypothetically, if you had a secondary way of splitting that set up, you could use PRIMARY KEY ((genre, artist)) (double parentheses intentional, making both columns the partition key), but I don't think this fits your particular use case well, since both keys would be required to look up an entry.
