I want to know how many items are in my DynamoDB table. According to the API guide, one way to do it is with a scan, as follows:
<?php
$dynamodb = new AmazonDynamoDB();
$scan_response = $dynamodb->scan(array(
'TableName' => 'ProductCatalog'
));
echo "Total number of items: ".count($scan_response->body->Items)."\n";
However, this has to fetch all items and store them in an array in memory, which I presume isn't feasible in most cases. Is there a way to get the total item count more efficiently?
I have already checked, and this data is not available in the AWS DynamoDB web console. (At first it looks like it is shown alongside the pagination buttons, but the figure turns out to grow as you go to the next page of items.)
7 solutions
#1
19
I can think of three options to get the total number of items in a DynamoDB table.
- The first option is a scan, but a scan is inefficient and generally bad practice, especially on production tables or tables with heavy read traffic.
- The second option is what Atharva mentioned:
A better solution that comes to my mind is to maintain the item counts for such tables in a separate table, where each item has the table name as its hash key and the total number of items in that table as a non-key attribute. You can then keep this table, possibly named "TotalNumberOfItemsPerTable", updated by making atomic update operations that increment or decrement the item count for a particular table.
The only problem with this is that increment operations are not idempotent. If a write fails, or you write more than once, the error is reflected in the count. If you need pinpoint accuracy, use a conditional update instead.
- The simplest solution is DescribeTable, which returns ItemCount. The only issue is that the count isn't up to date: it is refreshed roughly every six hours.
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DescribeTable.html
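For the DescribeTable option, a minimal Python sketch might look like the following. It assumes `client` behaves like a boto3 DynamoDB client (for example `boto3.client("dynamodb")`); the function name `approximate_item_count` is made up for illustration.

```python
def approximate_item_count(client, table_name):
    """Return the ItemCount that DescribeTable reports for a table.

    `client` is assumed to behave like a boto3 DynamoDB client.
    """
    response = client.describe_table(TableName=table_name)
    # ItemCount is refreshed only periodically, so treat it as approximate.
    return response["Table"]["ItemCount"]
```

This call reads table metadata only, so it does not consume the table's read throughput the way a scan does.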
#2
10
The Count option is definitely what you want, but you also have to take into account that your Scan result may span more than one "page". A Scan operation reads at most 1 MB of data from your table per request, so the value of Count in a single response only reflects the items in that 1 MB. To count the whole table, you need to make subsequent requests, passing the LastEvaluatedKey from each result (if it is there) as the ExclusiveStartKey of the next. Here is some sample code for doing something like that:
<?php
// Uses the legacy AWS SDK for PHP 1.x client, as in the question.
$dynamo_db = new AmazonDynamoDB();
$total = 0;
$start_key = null;
$params = array(
    'TableName' => 'my-table',
    'Count'     => true // return only the item count, not the items
);
do {
    // Resume from where the previous page stopped.
    if ($start_key) {
        $params['ExclusiveStartKey'] = $start_key->getArrayCopy();
    }
    $response = $dynamo_db->scan($params);
    if ($response->isOK()) {
        $total += (int) $response->body->Count;
        // LastEvaluatedKey is present only when more pages remain.
        if ($response->body->LastEvaluatedKey) {
            $start_key = $response->body->LastEvaluatedKey->to_array();
        } else {
            $start_key = null;
        }
    }
} while ($start_key);
echo "Count: {$total}";
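The same paging loop can be sketched against the current Scan API, where the count-only mode is requested with `Select="COUNT"` rather than the legacy `Count` flag. This is an illustration in Python, not a port of the answer's code: `client` is assumed to behave like a boto3 DynamoDB client, and `count_items` is a made-up name.

```python
def count_items(client, table_name):
    """Count every item in a table by paging through a count-only Scan.

    `client` is assumed to behave like a boto3 DynamoDB client.
    """
    total = 0
    params = {"TableName": table_name, "Select": "COUNT"}
    while True:
        response = client.scan(**params)
        # Each response counts only the items read in that page.
        total += response["Count"]
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            # No LastEvaluatedKey means the scan reached the end of the table.
            return total
        params["ExclusiveStartKey"] = last_key
```

A count-only scan still reads every item, so it consumes the same read capacity as a full scan; it only saves the cost of transferring the items.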
#3
5
Aha, there is a Count option in the scan API; see http://docs.amazonwebservices.com/AWSSDKforPHP/latest/#m=AmazonDynamoDB/scan
<?php
$dynamodb = new AmazonDynamoDB();
$scan_response = $dynamodb->scan(array(
    'TableName' => 'ProductCatalog',
    'Count'     => true, // return only the item count, not the items
));
echo "Count: ".$scan_response->body->Count."\n";
#4
3
If you want to use the total number of items in a table in your application's logic, you are going to query for the count fairly frequently. One way to get it is a scan operation. But remember that a scan reads through the whole table and therefore consumes a lot of read throughput, so other read operations on the table may be throttled while it runs. In addition, because each scan request covers at most 1 MB of data, you have to issue repeated scan requests to get the actual number of items if the table is large. That means writing custom paging logic and handling the throttling that comes with it.
A better solution that comes to my mind is to maintain the item counts for such tables in a separate table, where each item has the table name as its hash key and the total number of items in that table as a non-key attribute. You can then keep this table, possibly named "TotalNumberOfItemsPerTable", updated by making atomic update operations that increment or decrement the item count for a particular table.
This avoids both the throttling and the 1 MB limit.
Furthermore, you can extend this concept to finer granularity, for example to maintain the number of items matching some hash key, or any arbitrary criteria that you can encode as a string, with an entry in a table named something like "TotalNumberOfItemsInSomeCollection" or "TotalNumberOfItemsMatchingSomeCriteria". These tables can then contain entries for the number of items per table, per collection, or matching some criteria.
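The atomic counter update described above could be sketched as follows in Python. It assumes `client` behaves like a boto3 DynamoDB client; the hash key name `TableName`, the attribute name `ItemCount`, and the function name `adjust_item_count` are hypothetical choices for illustration.

```python
def adjust_item_count(client, counter_table, counted_table, delta):
    """Atomically add `delta` (for example +1 or -1) to a stored item count.

    `client` is assumed to behave like a boto3 DynamoDB client, and
    `counter_table` is assumed to have a string hash key named "TableName".
    """
    response = client.update_item(
        TableName=counter_table,
        Key={"TableName": {"S": counted_table}},
        # ADD creates the attribute on first use and increments it afterwards.
        UpdateExpression="ADD #count :delta",
        ExpressionAttributeNames={"#count": "ItemCount"},  # hypothetical attribute
        ExpressionAttributeValues={":delta": {"N": str(delta)}},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["ItemCount"]["N"])
```

The application would call this with +1 after each successful put of a new item and -1 after each delete. Because the increment is not idempotent, a retried call changes the counter twice, so the stored value can drift from the true count.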
#5
2
An approximate item count (supposedly updated every six hours) is available in the AWS console for DynamoDB. Just select the table and look under the Details tab; the last entry is Item Count. If this works for you, you can avoid consuming your table's throughput to do the count.
#6
0
This is now available in the AWS table overview screen, under the section 'Table details', in the field 'Item count'. It appears to be just a dump of DescribeTable, and it notes that the value is updated roughly every six hours.
#7
0
Here's how I get the exact item count on my billion-record DynamoDB table:
hive>
set dynamodb.throughput.write.percent = 1;
set dynamodb.throughput.read.percent = 1;
set hive.execution.engine = mr;
set mapreduce.reduce.speculative=false;
set mapreduce.map.speculative=false;
CREATE EXTERNAL TABLE dynamodb_table (
  `ID` STRING, `DateTime` STRING, `ReportedbyName` STRING, `ReportedbySurName` STRING,
  `Company` STRING, `Position` STRING, `Country` STRING, `MailDomain` STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "BillionData",
  "dynamodb.column.mapping" = "ID:ID,DateTime:DateTime,ReportedbyName:ReportedbyName,ReportedbySurName:ReportedbySurName,Company:Company,Position:Position,Country:Country,MailDomain:MailDomain");
SELECT count(*) FROM dynamodb_table;
* You need an EMR cluster, which comes with Hive and the DynamoDB storage handler installed.
* With this command, the DynamoDB handler in Hive issues parallel scans, with multiple MapReduce mappers (workers) working on different partitions to get the count. This is much faster and more efficient than a normal scan.
* You must be willing to raise the table's read capacity very high for a period of time.
* On a decently sized (20-node) cluster with 10,000 RCU, the count over a billion records took approximately 15 minutes.
* New writes to the DynamoDB table during this period make the count inconsistent.
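The parallel scan idea can also be illustrated without EMR, using the `Segment` and `TotalSegments` parameters of the Scan API. The Python sketch below is not the Hive handler's implementation; it assumes `client` behaves like a boto3 DynamoDB client that can be shared across threads, and the function names and the default of four segments are arbitrary choices.

```python
from concurrent.futures import ThreadPoolExecutor


def count_segment(client, table_name, segment, total_segments):
    """Count the items in one segment of a parallel Scan, page by page."""
    total = 0
    params = {
        "TableName": table_name,
        "Select": "COUNT",
        "Segment": segment,
        "TotalSegments": total_segments,
    }
    while True:
        response = client.scan(**params)
        total += response["Count"]
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            return total
        params["ExclusiveStartKey"] = last_key


def parallel_count(client, table_name, total_segments=4):
    """Sum the per-segment counts, scanning all segments concurrently."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(count_segment, client, table_name, segment, total_segments)
            for segment in range(total_segments)
        ]
        return sum(future.result() for future in futures)
```

As with the Hive approach, this reads every item once, so it needs enough read capacity to be worthwhile, and writes made while it runs can leave the result inconsistent.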