
Time: 2021-07-20 16:53:08

I am trying to move from MySQL to Cassandra for a music service application I am building.


I have read the following stackexchange: MySQL Data Model to Cassandra Help?


I also checked out https://wiki.apache.org/cassandra/DataModel and the DataStax Cassandra modeling example built around a music service, but the documentation so far is so small and narrow that I can't get away from thinking in MySQL-style queries, so I need some help.


This is my albums table, which works so far in MySQL:


CREATE TABLE `albums` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(150) NOT NULL,
  `description` varchar(300) NOT NULL,
  `release_date` int(10) unsigned NOT NULL,
  `status` enum('active','inactive','pending') NOT NULL,
  `licensor_id` int(11) NOT NULL,
  `score` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `status` (`status`),
  KEY `licensor_id` (`licensor_id`),
  KEY `batch_id` (`batch_id`)  -- the batch_id column itself is not shown in this excerpt
);

I also have one-to-many relationships with the following tables: artist (many artists to one album), genre (many genres to one album), and songs (one album contains many songs).


I have many pivot tables to couple these together.


Because Cassandra doesn't allow joins, I figured that using set, list, and map columns would help me get to the proper dataset.


At first, my thought was to solve the mapping by just reusing the same table:


CREATE TABLE albums (
  id int,
  title text,
  description text,
  release_date date,
  status text,  -- CQL has no enum type; one of 'active', 'inactive', 'pending'
  licensor_id int,
  data_source_provider_id int,
  score int,
  genre set<text>,
  artist set<text>,
  PRIMARY KEY (id)
);

(Apologies if the above is not correct Cassandra syntax; I've only just begun installing it on a dev system.)


My queries are of the following:


  1. Give me all albums sorted by Score (Descending)
  2. Give me all albums from a particular genre, sorted by score
  3. Give me all albums from a particular artist, sorted by score
  4. Give me all albums sorted by release date, then by score.

In SQL, all four are easy with joins. Since Cassandra doesn't allow joins, I figured my modelling was adequate, but the last query (sorting by release date, then by score) cannot be satisfied: as far as I can tell, there is no double ORDER BY.


Multiple indexes are slow, which matters because this is a large dataset (1.8M records for now, and I'm planning on loading at least triple that amount, which is why Cassandra would be useful).


My questions are:


1) Is my path from MySQL to Cassandra correct despite being stuck on my 4 queries, or did I do it wrong? (I've done some Active Record work before with MongoDB, where you can have a sub-entity within the document, but Cassandra only has set, list, and map.)


2) Suppose I want to expand my model to: "I want to create a list X that contains a predefined number of elements from the albums table". Would tagging each album with a new field "tag" that contains X be the smart way to filter things, or would it be better to create a new table that contains all the elements I need and just query that?


1 Answer



The general advice for Cassandra is to write your tables based on your queries. Don't be shy about writing the same data to multiple tables if some of those queries are not compatible with each other. (Twitter, for example, would write each tweet to a table for every follower of that user.)
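
As a rough illustration of writing the same data to several query-specific tables, here is a minimal Python sketch that uses in-memory dicts as stand-ins for Cassandra tables. The names `albums_by_id`, `albums_by_artist`, and `save_album`, as well as the sample albums, are hypothetical and not part of any Cassandra API.

```python
from collections import defaultdict

# In-memory stand-ins for two Cassandra tables (hypothetical names).
albums_by_id = {}
albums_by_artist = defaultdict(list)


def save_album(album):
    """Write one album to every table that serves a query."""
    albums_by_id[album["id"]] = album
    # Duplicate the row under each artist, one copy per artist partition.
    for artist in album["artists"]:
        albums_by_artist[artist].append(album)


# Made-up sample data.
save_album({"id": 1, "title": "Duets", "artists": ["Ann", "Bob"], "score": 7})
save_album({"id": 2, "title": "Solo", "artists": ["Ann"], "score": 9})

ann_titles = [a["title"] for a in albums_by_artist["Ann"]]
bob_titles = [a["title"] for a in albums_by_artist["Bob"]]
```

The write path does more work so that each read touches exactly one table, which is the trade-off this advice accepts.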


That said, looking at your queries, your challenge will be that Cassandra has no built-in way of handling some of your sorting needs. To sort the whole dataset on a non-unique (constantly changing?) field like score, you will need to add an analytics engine like Spark or Hadoop's M/R.


Let's look at some table definitions that will be a good start. Then you can determine if you need a full blown distributed analytics engine or whether locally sorting the results of the query will be enough.


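CREATE TABLE albums (
  id uuid,
  title text,
  description text,
  releasedate timestamp,
  status text,
  license_id varint,
  data_source_provider_id varint,
  score int,  -- not counter: a counter column cannot share a table with regular columns
  genre set<text>,
  artist set<text>,
  PRIMARY KEY (id)
);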

This table will store all your albums by id. Based on your use case, selecting all the albums and sorting them by score would definitely be out of the question. You could, potentially, do something clever like modulo-ing the score and putting the albums in buckets, but I'm not convinced that would scale. Any of your queries could be answered using this table plus analytics, but in the interest of completeness, let's look at some other options for putting your data in Cassandra. Each of the following tables could readily reduce the load from any analytics investigations you run that have additional parameters (like a range of dates or set of genres).


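CREATE TABLE albums_by_artist (  -- table name is illustrative
  id uuid,
  title text,
  description text,
  releasedate timestamp,
  status text,
  license_id varint,
  data_source_provider_id varint,
  score int,  -- not counter: a counter column cannot share a table with regular columns
  genre set<text>,
  artist text,
  PRIMARY KEY (artist, releasedate, title)
);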

Cassandra can automatically sort rows by their clustering columns, which are part of the primary key and therefore immutable. The table above will store each artist's albums in a separate partition (each partition is colocated in your cluster and replicated based on your replication factor). If an album has multiple artists, this record would be duplicated under each artist's entry, and that's OKAY. The second and third keys (releasedate and title) are the sorting (clustering) keys. Cassandra will sort the albums first by releasedate and second by title (for the other priority, reverse their order above). Each combination of artist, releasedate, and title is logically one row (although on disk, they will be stored as one wide row per artist). For one artist, you can probably sort the entries by score locally, without direct intervention from the database.
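
To make the two sort orders concrete, here is a small Python sketch. It assumes the rows of one artist's partition have already been fetched into plain dicts; the sample albums and the variable names are made up for illustration.

```python
from datetime import date

# Rows of one artist's partition as plain dicts (sample data is made up).
partition = [
    {"title": "B-Sides", "releasedate": date(2012, 5, 1), "score": 40},
    {"title": "Debut", "releasedate": date(2010, 3, 9), "score": 75},
    {"title": "Acoustic", "releasedate": date(2012, 5, 1), "score": 90},
]

# The order the clustering columns provide: releasedate first, then title.
clustered = sorted(partition, key=lambda r: (r["releasedate"], r["title"]))

# The local step on the client: re-sort the fetched rows by score, descending.
by_score = sorted(clustered, key=lambda r: r["score"], reverse=True)

clustered_titles = [r["title"] for r in clustered]
by_score_titles = [r["title"] for r in by_score]
```

The second sort is cheap here because it only runs over one artist's albums, not over the whole dataset.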


Sorting by release date can easily be accomplished by a similar table, but changing the PRIMARY KEY to: PRIMARY KEY (releasedate, ..?). In this case, however, you probably will face a challenge in sorting (locally) if you have a substantial range of release dates.
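
For the "release date, then score" query, the local sort could look like the following Python sketch. It assumes the rows for the requested dates are already in memory, and it assumes ascending release date with descending score, since the question does not state the directions; `sort_by_release_then_score` and the sample rows are hypothetical.

```python
from datetime import date


def sort_by_release_then_score(rows):
    """Order rows by release date ascending, then by score descending."""
    return sorted(rows, key=lambda r: (r["releasedate"], -r["score"]))


# Made-up sample rows.
rows = [
    {"title": "A", "releasedate": date(2020, 1, 1), "score": 5},
    {"title": "B", "releasedate": date(2019, 6, 1), "score": 3},
    {"title": "C", "releasedate": date(2020, 1, 1), "score": 8},
]

ordered_titles = [r["title"] for r in sort_by_release_then_score(rows)]
```

This works as long as the fetched range fits comfortably in memory; a wide range of release dates is exactly where it stops being practical.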


Finally, don't try something similar for genre. A genre covers too many albums to be contained in a single partition. Hypothetically, if you had a secondary way of splitting that set up, you could do PRIMARY KEY ((genre, artist)) (double parens intentional), but I don't think this fits well with your particular use case, because both keys are required to look up an entry.
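
The limitation of a composite partition key can be sketched in Python with a dict keyed by a (genre, artist) tuple. This is only an analogy, not how Cassandra stores data, and every name and album below is made up.

```python
# Stand-in for a table with PRIMARY KEY ((genre, artist)): the partition
# key is the pair, so a direct lookup needs both values.
albums_by_genre_artist = {
    ("jazz", "Ann"): ["Blue Hour"],
    ("jazz", "Bob"): ["Late Set"],
    ("rock", "Ann"): ["Loud"],
}


def lookup(genre, artist):
    """Direct lookup: works only when both parts of the key are known."""
    return albums_by_genre_artist.get((genre, artist), [])


def all_in_genre(genre):
    """Without the artist, every partition has to be scanned."""
    return sorted(
        title
        for (g, _artist), titles in albums_by_genre_artist.items()
        if g == genre
        for title in titles
    )


jazz_ann = lookup("jazz", "Ann")
jazz_all = all_in_genre("jazz")
```

The "all albums in a genre" query falls into the second, scanning case, which is why this key layout does not help with it.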




