Let's say I have this table:
CREATE TABLE device_data_by_year (
    year int,
    device_id uuid,
    sensor_id uuid,
    nano_since_epoch bigint,
    unit text,
    value double,
    source text,
    username text,
    PRIMARY KEY (year, device_id, nano_since_epoch, sensor_id)
) WITH CLUSTERING ORDER BY (device_id desc, nano_since_epoch desc);
I need to query data for a particular device and sensor over a period spanning 2017 and 2018. In this case, two queries will be issued:
select * from device_data_by_year where year = 2017 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
select * from device_data_by_year where year = 2018 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
Currently I iterate over the result sets and build a List with all the results. I am aware that this could (and will) run into OOM problems some day. Is there a better way to handle or merge the query results into one set?
Thanks
2 answers
#1
2
You can use IN to specify a list of years, but this is not an optimal solution. Because the year field is the partition key, the data for different years will most probably live on different machines, so one node will act as the "coordinator": it has to ask the other machines for their results and aggregate the data. From a performance point of view, two async requests issued in parallel, with the merge done on the client side, could be faster.
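To illustrate a client-side merge that never builds one big List, here is a minimal Python sketch. It assumes each per-year query hands back rows lazily (page by page) and already sorted by nano_since_epoch descending, as the clustering order suggests. fetch_year and its sample data are hypothetical stand-ins for the real driver call, not part of any Cassandra API. Note that this version reads the years one after another; running them in parallel would need prefetching on top.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    year: int
    nano_since_epoch: int
    value: float


# Hypothetical sample data standing in for two year partitions,
# each already sorted by nano_since_epoch descending.
_SAMPLE = {
    2018: [(300, 1.0), (200, 2.0)],
    2017: [(150, 3.0), (100, 4.0)],
}


def fetch_year(year: int, start: int, end: int) -> Iterator[Reading]:
    """Stand-in for one per-year query; yields rows lazily, newest first."""
    for ts, value in _SAMPLE.get(year, []):
        if start <= ts <= end:
            yield Reading(year, ts, value)


def stream_years(
    fetch: Callable[[int, int, int], Iterable[Reading]],
    years: Iterable[int],
    start: int,
    end: int,
) -> Iterator[Reading]:
    """Chain per-year results, newest year first, without materializing them."""
    newest_first = sorted(years, reverse=True)
    # The generator expression delays each fetch until the previous
    # year's rows have been consumed.
    return chain.from_iterable(fetch(y, start, end) for y in newest_first)


if __name__ == "__main__":
    for row in stream_years(fetch_year, [2017, 2018], 100, 300):
        print(row)
```

Because the years do not overlap in time, concatenating them newest-first keeps the overall descending order, so the caller can process rows one at a time instead of holding them all in memory.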
P.S. Your data model has quite serious problems. You partition by year, which means:
- Data isn't distributed well across the cluster: only N = RF machines will hold each partition;
- These partitions will be huge, even if you have only a hundred devices, each reporting one measurement per minute;
- Only one partition will be "hot": it will receive all the data during the year, while the other partitions won't be used very often.
You can use months, or even days, as the partition key to decrease the partition size, but that still won't solve the problem of "hot" partitions.
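To put rough numbers on the partition sizes mentioned above, here is a small back-of-the-envelope calculation in Python. It takes the figures from the list (a hundred devices, one measurement per device per minute) and assumes a 365-day year and a 30-day month; the function name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def rows_per_partition(devices: int, days: int, per_device_per_minute: int = 1) -> int:
    """Rows written into one time-bucket partition shared by all devices."""
    return devices * per_device_per_minute * MINUTES_PER_DAY * days


per_year = rows_per_partition(devices=100, days=365)
per_month = rows_per_partition(devices=100, days=30)
per_day = rows_per_partition(devices=100, days=1)

print(f"year partition:  {per_year:,} rows")   # 52,560,000
print(f"month partition: {per_month:,} rows")  # 4,320,000
print(f"day partition:   {per_day:,} rows")    # 144,000
```

Smaller buckets shrink each partition, but all writes still land in whichever bucket is current, which is the "hot" partition issue described above.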
If I remember correctly, the Data Modelling course at DataStax Academy has an example of a data model for a sensor network.
#2
0
Changed the table structure to:
CREATE TABLE device_data (
    week_first_day timestamp,
    device_id uuid,
    sensor_id uuid,
    nano_since_epoch bigint,
    unit text,
    value double,
    source text,
    username text,
    PRIMARY KEY ((week_first_day, device_id), nano_since_epoch, sensor_id)
) WITH CLUSTERING ORDER BY (nano_since_epoch desc, sensor_id desc);
according to @AlexOtt's proposal. Some changes to the application logic are required; for example, findAllByYear now needs to iterate over the individual weeks.
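As a rough sketch of that iteration, the Python snippet below lists the week_first_day values that a date range touches. It assumes weeks start on Monday and that the bucket is the calendar date of that Monday; the post does not say which convention the table actually uses, and week_first_days is a made-up helper name.

```python
from datetime import date, timedelta
from typing import Iterator


def week_first_days(start: date, end: date) -> Iterator[date]:
    """Yield the Monday of every week that overlaps [start, end]."""
    # date.weekday() is 0 for Monday, so this steps back to the week's Monday.
    first = start - timedelta(days=start.weekday())
    while first <= end:
        yield first
        first += timedelta(days=7)


if __name__ == "__main__":
    buckets = list(week_first_days(date(2018, 1, 1), date(2018, 12, 31)))
    print(len(buckets), buckets[0], buckets[-1])
```

Under these assumptions, 2018 touches 53 weekly buckets rather than 52, because both 1 January and 31 December 2018 fall on a Monday.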
Coming back to the original question: would you rather send 52 queries (getDataByYear, one query per week), or would you use the IN operator here?
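For the one-query-per-week option, one possible shape is a bounded window of requests in flight, with results handed out in bucket order. The Python sketch below only illustrates that idea and is not a recommendation for either option. The fetch argument is a hypothetical function that returns the rows of one weekly bucket, and the window size of 4 is an arbitrary assumption.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

B = TypeVar("B")
R = TypeVar("R")


def fetch_in_order(
    fetch: Callable[[B], List[R]],
    buckets: Iterable[B],
    window: int = 4,
) -> Iterator[R]:
    """Run at most `window` bucket queries at once; yield rows in bucket order."""
    it = iter(buckets)
    with ThreadPoolExecutor(max_workers=window) as pool:
        # Start the first `window` requests.
        pending = deque(pool.submit(fetch, b) for b in islice(it, window))
        while pending:
            rows = pending.popleft().result()
            # Top the window back up with the next bucket, if any is left.
            for b in islice(it, 1):
                pending.append(pool.submit(fetch, b))
            yield from rows


if __name__ == "__main__":
    def fake_fetch(week: int) -> List[str]:
        # Stand-in for one per-week query.
        return [f"week {week} row {i}" for i in range(2)]

    for row in fetch_in_order(fake_fetch, range(5), window=2):
        print(row)
```

With this shape, only the rows of a few weekly buckets are held in memory at any time, rather than the whole year.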