AWS Hive + Kinesis on EMR = Understanding checkpoints

Time: 2022-02-03 00:53:03

I have an AWS Kinesis stream, and I created an external table in Hive pointing at it. I then created a DynamoDB table for the checkpoints, and in my Hive query I set the following properties, as described here:

set kinesis.checkpoint.enabled=true;
set kinesis.checkpoint.metastore.table.name=my_dynamodb_table;
set kinesis.checkpoint.metastore.hash.key.name=HashKey;
set kinesis.checkpoint.metastore.range.key.name=RangeKey;
set kinesis.checkpoint.logical.name=my_logical_name;
set kinesis.checkpoint.iteration.no=0;

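If the script is going to be scheduled, the same six properties could be generated rather than hard-coded. The following is a minimal Python sketch, assuming a hypothetical helper named `checkpoint_settings`; only the property names and sample values come from the settings above, and how the resulting text is handed to Hive is deliberately left out.

```python
def checkpoint_settings(table, logical_name, iteration_no,
                        hash_key="HashKey", range_key="RangeKey"):
    """Return the Hive `set` statements for one checkpointed run.

    Hypothetical helper: the property names are the ones listed in the
    question; everything else is an illustration.
    """
    if iteration_no < 0:
        raise ValueError("iteration_no must be >= 0")
    settings = {
        "kinesis.checkpoint.enabled": "true",
        "kinesis.checkpoint.metastore.table.name": table,
        "kinesis.checkpoint.metastore.hash.key.name": hash_key,
        "kinesis.checkpoint.metastore.range.key.name": range_key,
        "kinesis.checkpoint.logical.name": logical_name,
        "kinesis.checkpoint.iteration.no": str(iteration_no),
    }
    # dicts keep insertion order, so the statements come out in the order above
    return "\n".join(f"set {key}={value};" for key, value in settings.items())


print(checkpoint_settings("my_dynamodb_table", "my_logical_name", 0))
```

Called with iteration number 0, this prints the same six statements shown above, so a scheduler would only need to vary the last argument.
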
I have the following questions:

  • Do I always have to start with iteration.no set to 0?

  • Does this always start from the beginning of the stream (the oldest Kinesis record, the one about to be evicted)?

  • Suppose I set up a cron job to schedule the execution of the script. How do I retrieve the 'next' iteration number?

  • To re-execute the script on the same data, is it enough to rerun the query with the same iteration number?

  • If I execute select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?

Given the DynamoDB checkpoint entry:

{"startSeqNo":"1234",
 "endSeqNo":"5678",
 "closed":false}

  • What's the meaning of the closed field?

  • Are sequence numbers incremental, and is there a relation between the start and the end (e.g. end - start = number of records read)?

  • I noticed that sometimes there is only the endSeqNo (no startSeqNo). How should I interpret that?

I know that's a lot of questions, but I could not find the answers in the documentation.

1 Answer

#1


Check out the Kinesis documentation and the Kinesis Storage Handler README, which contain answers to many of your questions.

Do I always have to start with iteration.no set to 0?

Yes, unless you are implementing more advanced logic that requires skipping a known or already processed part of the stream.

Does this always start from the beginning of the stream (the oldest Kinesis record, the one about to be evicted)?

Yes

Suppose I set up a cron job to schedule the execution of the script. How do I retrieve the 'next' iteration number?

This is handled by the Hive script, since it queries all the data in the Kinesis stream on each run.

To re-execute the script on the same data, is it enough to rerun the query with the same iteration number?

Because a Kinesis stream only retains data for a 24-hour window (by default), the data has possibly changed since your last query, so you would probably want to query all records again in the Hive job.

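To make the effect of that window concrete, here is a small Python sketch of the age check. It assumes the 24-hour figure mentioned above; `RETENTION` and `still_in_stream` are hypothetical names used only for illustration and are not part of any Kinesis or Hive API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 24-hour window mentioned in the answer above.
RETENTION = timedelta(hours=24)


def still_in_stream(arrival, now, retention=RETENTION):
    """Return True if a record that arrived at `arrival` has not aged out yet."""
    return now - arrival < retention


now = datetime(2022, 2, 3, 0, 0, tzinfo=timezone.utc)
print(still_in_stream(now - timedelta(hours=23), now))  # record is 23 hours old
print(still_in_stream(now - timedelta(hours=25), now))  # record is 25 hours old
```

A record that was readable on one run can therefore be gone on the next, which is why two runs over "the whole stream" need not see the same rows.
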
If I execute select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?

Yes, you would expect the results to change as the stream changes.

Given the DynamoDB checkpoint entry: What's the meaning of the closed field?

Although this is an internal detail of the Kinesis Storage Handler, I believe it indicates whether the shard is a parent shard, that is, whether the shard is open and accepting new data or closed and no longer accepting it. If you have scaled your stream up or down, the parent shards continue to exist for 24 hours and contain all the data written before you scaled; however, no new data will be inserted into them.

Are sequence numbers incremental, and is there a relation between the start and the end (e.g. end - start = number of records read)?

The only guidance Amazon provides on this is that new sequence numbers generally increase over time.

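In practice this means two sequence numbers can be ordered, but nothing in that guidance says they are consecutive, so the difference should not be read as a record count. A short Python sketch, assuming the sequence numbers are decimal strings as in the checkpoint entry from the question (`is_later` is a hypothetical helper):

```python
def is_later(seq_a, seq_b):
    """Return True if sequence number seq_a is numerically greater than seq_b.

    Sequence numbers are compared as integers, not as strings: Python ints
    have arbitrary precision, so very long numbers are not a problem.
    """
    return int(seq_a) > int(seq_b)


print(is_later("5678", "1234"))  # numeric comparison of the sample values
print("5678" > "999")            # plain string comparison gets this one wrong
print(is_later("5678", "999"))   # numeric comparison gets it right
```

The string comparison in the second line is the pitfall to avoid: it compares character by character, so a shorter number can wrongly appear "later" than a longer one.
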
I noticed that sometimes there is only the endSeqNo (no startSeqNo). How should I interpret that?

This means the shard is open and still accepting new data (it is not a parent shard).

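When inspecting checkpoint entries by hand, it can help to summarize which fields are present. The following Python sketch assumes the entry is available as a JSON string shaped like the one in the question; `describe_checkpoint` is a hypothetical helper, and how the fields should be interpreted is the reading offered in this answer, not documented behavior.

```python
import json


def describe_checkpoint(raw):
    """Summarize which fields a checkpoint entry carries.

    Assumes `raw` is a JSON object like the sample in the question; the
    field names startSeqNo, endSeqNo and closed are taken from that sample.
    """
    entry = json.loads(raw)
    return {
        "has_start": "startSeqNo" in entry,
        "has_end": "endSeqNo" in entry,
        "closed": bool(entry.get("closed", False)),
    }


sample = '{"startSeqNo":"1234", "endSeqNo":"5678", "closed":false}'
print(describe_checkpoint(sample))
print(describe_checkpoint('{"endSeqNo":"5678", "closed":false}'))
```

The second call corresponds to the case asked about last, where only the end sequence number is present.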