Optimized way to fetch column values from HBase?

I have a situation where I know only the column family and column name in HBase, and I want to retrieve all the unique values for that column and populate them in my web application's GUI. Response time is of the utmost importance.

One way is to run a scan restricted to the column family and column name, but that takes time and makes the end user wait too long.

Is there any other way of doing it effectively and efficiently?

It would be great if you could help. Thanks.

1 solution

#1

There is no magic way to make scanning this data fast enough for a user interface. HBase has to rip through all the data in the column family from disk to get the information you want. Pretty much the only things you will get from HBase in any sort of interactive way are a get on a specific rowkey or a very small range scan.

Here are a couple of high-level approaches:

Do you care about latency/updates? If some staleness is acceptable, recalculate the unique list every 20 minutes with a MapReduce job or a scan and store the results in a text file somewhere.
Use coprocessors to determine the unique list per region, and then aggregate the per-region lists into one unique list in the client. This will likely still be too slow, but it will speed up your scan if you have lots of duplicates and your network is being saturated.
Rethink how you are storing your data in HBase. Unlike in an RDBMS, you can't just arbitrarily add indexes to columns. In schema design you have to think about how you will access your data and then base the schema on that. Are you trying to get your unique list fast? Maybe you should build a second table with the original values as keys and pointers back to the original rowkeys.
Can you keep track of the unique values in a separate system where you can fetch that information quickly?
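
To make the "second table" idea concrete, here is a minimal in-memory sketch in plain Java. It is not HBase client code: the class `ValueIndexSketch` and its methods are hypothetical, and two `TreeMap`s stand in for the main table and the index table (a `TreeMap` keeps its keys sorted, loosely like HBase rowkeys). The point it tries to show is that if every write also updates an index keyed by the column value, the unique list becomes the set of keys in the index rather than the result of a full scan of the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for two HBase tables; every name here is hypothetical.
class ValueIndexSketch {
    // "Main table": rowkey -> value of the one column we care about.
    private final Map<String, String> mainTable = new TreeMap<>();
    // "Index table": column value -> rowkeys that currently hold that value.
    private final TreeMap<String, TreeSet<String>> indexTable = new TreeMap<>();

    // Write path: update both tables together so the index never needs a rebuild scan.
    void put(String rowKey, String value) {
        String old = mainTable.put(rowKey, value);
        if (old != null && !old.equals(value)) {
            // The row changed value: drop its pointer from the old index entry.
            TreeSet<String> rows = indexTable.get(old);
            if (rows != null) {
                rows.remove(rowKey);
                if (rows.isEmpty()) {
                    indexTable.remove(old);
                }
            }
        }
        indexTable.computeIfAbsent(value, v -> new TreeSet<>()).add(rowKey);
    }

    // Read path for the GUI: the unique values are just the index keys.
    List<String> uniqueValues() {
        return new ArrayList<>(indexTable.keySet());
    }

    // Pointers back to the original rowkeys for one value.
    List<String> rowKeysFor(String value) {
        TreeSet<String> rows = indexTable.get(value);
        return rows == null ? new ArrayList<>() : new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        ValueIndexSketch t = new ValueIndexSketch();
        t.put("row1", "red");
        t.put("row2", "blue");
        t.put("row3", "red");
        System.out.println(t.uniqueValues());    // [blue, red]
        System.out.println(t.rowKeysFor("red")); // [row1, row3]
    }
}
```

In a real deployment the two maps would be two HBase tables and `put` would issue a write to each. HBase does not make writes to two tables atomic, so the application would have to tolerate or repair an index that briefly disagrees with the main table.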
