什么可以在实践中使用'TreeDict'(或树图)?

时间:2022-08-12 18:05:12

I'm developing a 'TreeDict' class in Python. This is a basically a dict that allows you to retrieve its key-value pairs in sorted order, just like the Treemap collection class in Java.

我正在用Python开发一个'TreeDict'类。这基本上是一个dict,允许您按排序顺序检索其键值对,就像Java中的Treemap集合类一样。

I've implemented some functionality based on the way unique indexes in relational databases can be used, e.g. functions to let you retrieve values corresponding to a range of keys, keys greater than, less than or equal to a particular value in sorted order, strings or tuples that have a specific prefix in sorted order, etc.

我已经基于关系数据库中的唯一索引的使用方式实现了一些功能,例如,函数,用于检索对应于一系列键,大于,小于或等于排序顺序的特定值的键,具有排序顺序的特定前缀的字符串或元组等。

Unfortunately, I can't think of any real life problem that will require a class like this. I suspect that the reason we don't have sorted dicts in Python is that in practice they aren't required often enough to be worth it, but I want to be proved wrong.

不幸的是,我想不出任何需要像这样的课程的现实生活问题。我怀疑我们在Python中没有排序的原因是,在实践中它们并不经常被要求得到它,但我想被证明是错误的。

Can you think of any specific applications of a 'TreeDict'? Any real life problem that would be best solved by this data structure? I just want to know for sure whether this is worth it.

你能想到'TreeDict'的任何具体应用吗?这个数据结构最能解决的任何现实问题?我只是想知道这是否值得。

7 个解决方案

#1


It's useful when you need to go through a Dictionary in order of the keys; which comes up on occasion. I've actually found its infinitely more common in certain programming contests then anything else (think ACM, etc).

当你需要按键的顺序浏览字典时,它很有用;偶尔会出现。实际上我发现它在某些编程竞赛中更为常见,然后是其他任何东西(想想ACM等)。

The most useful feature of a TreeMap is when you want to quickly find the min or max key; using a sorted dictionary this is often a single method call; and algorithmically can be done in O(log(n)) time, as opposed to iterating over each key looking for a min/max if the collection is unsorted. Basically, a much friendlier interface.

TreeMap最有用的功能是当你想快速找到最小或最大键时;使用排序字典这通常是单个方法调用;并且在算法上可以在O(log(n))时间内完成,而不是如果集合未排序则迭代每个键寻找最小值/最大值。基本上,一个更友好的界面。

One of the more common times I run into it is when objects are identified by a specific name, and you want to print out the objects ordered according to the name; say a mapping from directory name to number of files in a directory.

我遇到的一个比较常见的事情是当对象被特定名称标识,并且您想要打印出根据名称排序的对象;说一个从目录名到目录中文件数的映射。

One other place I've used it is in an excel spreadsheet wrapper; mapping from row number to row object. This lets you quickly find the last row index, without looping through each row.

我用过它的另一个地方是excel电子表格包装器;从行号映射到行对象。这使您可以快速查找最后一行索引,而无需遍历每一行。

Also, it's useful when you can easily define a comparison relation on keys, but not necessarily a hashing function, as needed for HashMaps. The best (though weak) example I can think of is case insensitive string keys.

此外,当您可以根据HashMaps的需要轻松定义键上的比较关系,但不一定是散列函数时,它非常有用。我能想到的最好(尽管很弱)的例子是不区分大小写的字符串键。

#2


I've seen several answers pointing to the "walk in ordered sequence" feature, which is indeed important, but none highlighting the other big feature, which is "find first entry with a key >= this". This has many uses even when there's no real need to "walk" from there.

我已经看到几个答案指向“走进有序序列”功能,这确实很重要,但没有突出显示另一个重要功能,即“找到第一个带有键> =这个”。即使没有真正需要从那里“行走”,这也有很多用途。

For example (this came up in a recent SO answer), say you want to generate pseudo-random values with given relative frequencies -- i.e, you're given, say, a dict d:

例如(这在最近的SO答案中出现),假设您想要生成具有给定相对频率的伪随机值 - 即,您给出了一个dict d:

{'wolf': 42, 'sheep': 15, 'dog': 23, 'goat': 15, 'cat': 5}

and need a way to generate 'wolf' with a probability of 42 out of 100 (since 100 is the total of the relative frequencies given), 'sheep' 15 out of 100, and so on; and the number of distinct values can be quite large, as can the relative frequencies.

并且需要一种方法来产生'狼',概率为百分之42(因为100是给定的相对频率的总和),'羊'中有15分,等等;并且不同值的数量可以非常大,相对频率也可以。

Then, store the given values (in whatever order) as the values in a tree map, with the corresponding keys being the "total cumulative frequency" up to that point. I.e.:

然后,将给定值(以任何顺序)存储为树图中的值,相应的键是直到该点的“总累积频率”。即:

def preprocess(d):
    tot = 0
    for v in d:
        tot += d[v]
        treemap.insert(key=tot, value=v)
    return tot, treemap

Now, generating a value can be pretty fast (O(log(len(d)))), as follows:

现在,生成一个值可以非常快(O(log(len(d)))),如下所示:

def generate(tot, treemap, r=random):
    n = r.randrange(tot)
    return treemap.firstGTkey(n).value

where firstGTKey is a method that returns the first entry (with .key and .value attributes, in this hypothetical example) with a key > the given argument. I've used this approach with large files stored as B-Trees, for example (using e.g. bsddb.bt_open and the set_location method).

其中firstGTKey是一个方法,它返回第一个条目(在此假设示例中具有.key和.value属性),其中包含一个键>给定的参数。例如,我使用这种方法将大文件存储为B树(使用例如bsddb.bt_open和set_location方法)。

#3


The reason for keeping the elements in sorted order is for faster retrieval. Say I wanted all of the values in the dictionary in a sorted range. This is much faster with a TreeDict then with the regular hashmap. It basically allows you to keep everything in the dictionary in sorted order. I know in the application I'm currently working on uses a class like this to basically query the data structure.

将元素保持在排序顺序的原因是为了更快地检索。假设我希望字典中的所有值都在排序范围内。使用TreeDict然后使用常规hashmap,这要快得多。它基本上允许您按排序顺序保存字典中的所有内容。我知道在我正在使用的应用程序中使用这样的类来基本查询数据结构。

#4


I often use Dict<DateTime, someClassOrValue> when working with industrial process data-- Valve open/close, machinery start/stop, etc.

在处理工业过程数据时,我经常使用Dict - 阀门开/关,机械启动/停止等。 ,someclassorvalue>

Having the keys sorted is especially useful when I need to compare time intervals between start/stop or open/close events in a decent amount of time.

当我需要在相当长的时间内比较开始/停止或开/关事件之间的时间间隔时,对键进行排序特别有用。

However, since I've been able to use linq in C# I've found that it's often easier to just work with IEnumerables and use the IQueryable extension methods to get the information I need.

但是,由于我已经能够在C#中使用linq,我发现使用IEnumerables并使用IQueryable扩展方法来获取我需要的信息通常更容易。

#5


Almost all "GROUP BY" reporting requires a sorted dictionary.

几乎所有“GROUP BY”报告都需要一个排序字典。

summary = sortedDefaultDict()
for row in somePileOfData:
    summary[row.group_by] += row.balance
for k in sorted(summary.keys()):
    print k, summary[k]

This is done so often in data warehousing applications, that it's difficult to express how central this is.

这在数据仓库应用程序中经常这样做,因此很难表达它的核心。

If the sorted function call does no work, it saves a ton of time in the long run.

如果已排序的函数调用不起作用,则从长远来看会节省大量时间。

#6


Have you seen that: http://code.activestate.com/recipes/576998/ ?

你见过这个:http://code.activestate.com/recipes/576998/?

zuo

#7


They can make various algorithms easier to implement.

它们可以使各种算法更容易实现。

#1


It's useful when you need to go through a Dictionary in order of the keys; which comes up on occasion. I've actually found its infinitely more common in certain programming contests then anything else (think ACM, etc).

当你需要按键的顺序浏览字典时,它很有用;偶尔会出现。实际上我发现它在某些编程竞赛中更为常见,然后是其他任何东西(想想ACM等)。

The most useful feature of a TreeMap is when you want to quickly find the min or max key; using a sorted dictionary this is often a single method call; and algorithmically can be done in O(log(n)) time, as opposed to iterating over each key looking for a min/max if the collection is unsorted. Basically, a much friendlier interface.

TreeMap最有用的功能是当你想快速找到最小或最大键时;使用排序字典这通常是单个方法调用;并且在算法上可以在O(log(n))时间内完成,而不是如果集合未排序则迭代每个键寻找最小值/最大值。基本上,一个更友好的界面。

One of the more common times I run into it is when objects are identified by a specific name, and you want to print out the objects ordered according to the name; say a mapping from directory name to number of files in a directory.

我遇到的一个比较常见的事情是当对象被特定名称标识,并且您想要打印出根据名称排序的对象;说一个从目录名到目录中文件数的映射。

One other place I've used it is in an excel spreadsheet wrapper; mapping from row number to row object. This lets you quickly find the last row index, without looping through each row.

我用过它的另一个地方是excel电子表格包装器;从行号映射到行对象。这使您可以快速查找最后一行索引,而无需遍历每一行。

Also, it's useful when you can easily define a comparison relation on keys, but not necessarily a hashing function, as needed for HashMaps. The best (though weak) example I can think of is case insensitive string keys.

此外,当您可以根据HashMaps的需要轻松定义键上的比较关系,但不一定是散列函数时,它非常有用。我能想到的最好(尽管很弱)的例子是不区分大小写的字符串键。

#2


I've seen several answers pointing to the "walk in ordered sequence" feature, which is indeed important, but none highlighting the other big feature, which is "find first entry with a key >= this". This has many uses even when there's no real need to "walk" from there.

我已经看到几个答案指向“走进有序序列”功能,这确实很重要,但没有突出显示另一个重要功能,即“找到第一个带有键> =这个”。即使没有真正需要从那里“行走”,这也有很多用途。

For example (this came up in a recent SO answer), say you want to generate pseudo-random values with given relative frequencies -- i.e, you're given, say, a dict d:

例如(这在最近的SO答案中出现),假设您想要生成具有给定相对频率的伪随机值 - 即,您给出了一个dict d:

{'wolf': 42, 'sheep': 15, 'dog': 23, 'goat': 15, 'cat': 5}

and need a way to generate 'wolf' with a probability of 42 out of 100 (since 100 is the total of the relative frequencies given), 'sheep' 15 out of 100, and so on; and the number of distinct values can be quite large, as can the relative frequencies.

并且需要一种方法来产生'狼',概率为百分之42(因为100是给定的相对频率的总和),'羊'中有15分,等等;并且不同值的数量可以非常大,相对频率也可以。

Then, store the given values (in whatever order) as the values in a tree map, with the corresponding keys being the "total cumulative frequency" up to that point. I.e.:

然后,将给定值(以任何顺序)存储为树图中的值,相应的键是直到该点的“总累积频率”。即:

def preprocess(d):
    tot = 0
    for v in d:
        tot += d[v]
        treemap.insert(key=tot, value=v)
    return tot, treemap

Now, generating a value can be pretty fast (O(log(len(d)))), as follows:

现在,生成一个值可以非常快(O(log(len(d)))),如下所示:

def generate(tot, treemap, r=random):
    n = r.randrange(tot)
    return treemap.firstGTkey(n).value

where firstGTKey is a method that returns the first entry (with .key and .value attributes, in this hypothetical example) with a key > the given argument. I've used this approach with large files stored as B-Trees, for example (using e.g. bsddb.bt_open and the set_location method).

其中firstGTKey是一个方法,它返回第一个条目(在此假设示例中具有.key和.value属性),其中包含一个键>给定的参数。例如,我使用这种方法将大文件存储为B树(使用例如bsddb.bt_open和set_location方法)。

#3


The reason for keeping the elements in sorted order is for faster retrieval. Say I wanted all of the values in the dictionary in a sorted range. This is much faster with a TreeDict then with the regular hashmap. It basically allows you to keep everything in the dictionary in sorted order. I know in the application I'm currently working on uses a class like this to basically query the data structure.

将元素保持在排序顺序的原因是为了更快地检索。假设我希望字典中的所有值都在排序范围内。使用TreeDict然后使用常规hashmap,这要快得多。它基本上允许您按排序顺序保存字典中的所有内容。我知道在我正在使用的应用程序中使用这样的类来基本查询数据结构。

#4


I often use Dict<DateTime, someClassOrValue> when working with industrial process data-- Valve open/close, machinery start/stop, etc.

在处理工业过程数据时,我经常使用Dict - 阀门开/关,机械启动/停止等。 ,someclassorvalue>

Having the keys sorted is especially useful when I need to compare time intervals between start/stop or open/close events in a decent amount of time.

当我需要在相当长的时间内比较开始/停止或开/关事件之间的时间间隔时,对键进行排序特别有用。

However, since I've been able to use linq in C# I've found that it's often easier to just work with IEnumerables and use the IQueryable extension methods to get the information I need.

但是,由于我已经能够在C#中使用linq,我发现使用IEnumerables并使用IQueryable扩展方法来获取我需要的信息通常更容易。

#5


Almost all "GROUP BY" reporting requires a sorted dictionary.

几乎所有“GROUP BY”报告都需要一个排序字典。

summary = sortedDefaultDict()
for row in somePileOfData:
    summary[row.group_by] += row.balance
for k in sorted(summary.keys()):
    print k, summary[k]

This is done so often in data warehousing applications, that it's difficult to express how central this is.

这在数据仓库应用程序中经常这样做,因此很难表达它的核心。

If the sorted function call does no work, it saves a ton of time in the long run.

如果已排序的函数调用不起作用,则从长远来看会节省大量时间。

#6


Have you seen that: http://code.activestate.com/recipes/576998/ ?

你见过这个:http://code.activestate.com/recipes/576998/?

zuo

#7


They can make various algorithms easier to implement.

它们可以使各种算法更容易实现。