Efficient algorithm for computing quantiles over a terabyte-scale dataset

Date: 2021-11-25 20:48:05

I am trying to compute quantiles (they can be approximate, with some accuracy guarantees or error bounds) for a huge dataset (terabytes of data). How can I compute quantiles efficiently? The requirements are:

1) Can be computed efficiently (one-pass) or in a distributed way (merging)
2) High accuracy (or at least controllable accuracy)
3) Can be re-computed or reproduced in multiple languages (Java and Python)
4) Can be updated incrementally (not a requirement, but good to have)

The few approaches I am looking at are:

1) The naive solution: reservoir sampling (I am not sure how to do it in a
distributed map-reduce way, especially how to merge different reservoir
samples for the same data or for two different distributions; are there any
good implementations? A rough sketch of the merge step I have in mind is
shown after this list.)

2) t-digest

3) Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. "Approximate
medians and other quantiles in one pass and with limited memory." (The reason
being that, AFAIK, some map-reduce frameworks like Dataflow and BigQuery
already implement a variation of this.)
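
To make item 1 above concrete, this is the rough merge step I have in mind (my own illustration, not taken from any particular library; the function name is made up):

    import random

    def merge_reservoirs(r1, n1, r2, n2, k):
        """Merge two reservoir samples drawn over n1 and n2 items into one of size k.

        Each output slot is filled from r1 with probability n1 / (n1 + n2) and
        from r2 otherwise, drawing without replacement. This is only an
        approximate merge; an exact one would adjust the probabilities as
        items are drawn.
        """
        a, b = list(r1), list(r2)
        random.shuffle(a)
        random.shuffle(b)
        merged = []
        for _ in range(min(k, len(a) + len(b))):
            take_from_a = bool(a) and (not b or random.random() < n1 / (n1 + n2))
            merged.append(a.pop() if take_from_a else b.pop())
        return merged, n1 + n2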

Can someone with prior experience working with these algorithms and techniques give me some pointers on the caveats, pros, and cons of each? When should I use which method? Is one approach arguably better than another when the requirements are efficient computation and good accuracy?

I have not used a digest-based approach in particular, and I would like to understand better why and when I would prefer something like t-digest over something simple like reservoir sampling to compute approximate quantiles.

1 Answer

#1

UPDATE: a new and very good algorithm has appeared, called KLL. See the paper. It has implementations in Python and in Go.
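
For illustration, here is a minimal sketch of how a KLL sketch can be updated, merged, and queried from Python. I am assuming the Apache DataSketches Python bindings here; the package and method names are my assumption and are not necessarily the implementation referenced above:

    # Minimal sketch, assuming the Apache DataSketches Python package
    # (pip install datasketches); not necessarily the implementation
    # referenced above.
    from datasketches import kll_floats_sketch

    # One sketch per worker / data shard; k controls the size/accuracy trade-off.
    shard_a = kll_floats_sketch(200)
    shard_b = kll_floats_sketch(200)

    for x in range(0, 500_000):
        shard_a.update(float(x))
    for x in range(500_000, 1_000_000):
        shard_b.update(float(x))

    # Merge the per-shard sketches (what a reducer would do), then query
    # quantiles from the combined sketch.
    combined = kll_floats_sketch(200)
    combined.merge(shard_a)
    combined.merge(shard_b)

    print(combined.get_quantile(0.5))   # approximate median
    print(combined.get_quantile(0.99))  # approximate 99th percentile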

t-digest has implementations in several languages and satisfies all of your requirements. See the paper, which makes comparisons to some other algorithms, e.g. Q-Digest. You can find more comparisons in the Q-Digest paper.
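
As a rough illustration of the incremental-update and merge workflow, here is a minimal sketch using the tdigest package from PyPI; the package choice and its exact API are my assumption, not something prescribed above:

    # Minimal sketch using the `tdigest` PyPI package (pip install tdigest);
    # other t-digest implementations expose a similar update/merge interface.
    from tdigest import TDigest

    # Build one digest per shard, updating incrementally as data streams in.
    digest_a = TDigest()
    digest_b = TDigest()

    for x in range(0, 500_000):
        digest_a.update(x)
    for x in range(500_000, 1_000_000):
        digest_b.update(x)

    # Digests can be merged, which is what makes a distributed
    # (map-reduce style) computation possible.
    combined = digest_a + digest_b

    print(combined.percentile(50))  # approximate median
    print(combined.percentile(99))  # approximate 99th percentile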

Generally, both of these algorithms are far superior to sampling-based algorithms for estimating quantiles, in the sense of giving much better accuracy for the same amount of storage. You can find a discussion of many more approximate algorithms in the excellent book Data Streams: Algorithms and Applications (it does not discuss t-digest, which was created after the book was published).

There might be other, better algorithms that I'm not familiar with.

There is currently no Beam wrapper for the t-digest library, but it should not be difficult to develop one using a custom CombineFn. See, for example, a current pending PR adding support for a different approximate algorithm using a CombineFn.
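
A minimal sketch of what such a custom CombineFn might look like, assuming the Beam Python SDK and the tdigest package from the sketch above (illustrative only, not the pending PR mentioned above):

    # Sketch of a custom Beam CombineFn wrapping t-digest (Python SDK assumed);
    # illustrative only, not the pending PR referred to above.
    import apache_beam as beam
    from tdigest import TDigest

    class TDigestQuantilesFn(beam.CombineFn):
        """Combines numeric values into a t-digest and emits selected percentiles."""

        def __init__(self, percentiles=(50, 95, 99)):
            self._percentiles = percentiles

        def create_accumulator(self):
            return TDigest()

        def add_input(self, accumulator, element):
            accumulator.update(element)
            return accumulator

        def merge_accumulators(self, accumulators):
            merged = TDigest()
            for acc in accumulators:
                merged = merged + acc
            return merged

        def extract_output(self, accumulator):
            return {p: accumulator.percentile(p) for p in self._percentiles}

    # Usage: quantiles = values | beam.CombineGlobally(TDigestQuantilesFn())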
