Trie vs.后缀树vs.后缀数组

时间:2022-11-24 11:12:20

Which structure provides the best performance results; trie (prefix tree), suffix tree or suffix array? Are there other similar structures? What are good Java implementations of these structures?

该结构提供了最佳的性能结果;trie(前缀树),后缀树还是后缀数组?还有其他类似的结构吗?这些结构的好的Java实现是什么?

Edit: in this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.

编辑:在这种情况下,我想要在一个大的名称字典和大量的自然语言文本之间进行字符串匹配,以便在文本上识别字典的名称。

6 个解决方案

#1


53  

The trie was the first data structure of this kind discovered.

trie是这种发现的第一个数据结构。

The suffix tree is an improvement over the trie (it has suffix links which allow linear error search, the suffix tree trims unnecessary branches of the trie therefore it does not require as much space).

后缀树是对trie的改进(它有允许线性错误搜索的后缀链接,后缀树修剪trie不必要的分支,因此它不需要太多的空间)。

The suffix array is a stripped down data structure based on the suffix tree (no suffix links (slow error matches), yet pattern matching is very fast).

后缀数组是一个基于后缀树的精简数据结构(没有后缀链接(慢错误匹配),但是模式匹配非常快)。

The trie is not for real world use because it consumes too much space.

trie不能用于现实世界,因为它消耗了太多的空间。

The suffix tree is lighter and faster than the trie and is used to index DNA or optimize some large web search engines.

后缀树比trie更轻、更快,用于索引DNA或优化一些大型网络搜索引擎。

The suffix array is slower in some pattern searches than the suffix tree but uses less space, and is more widely used than the Suffix tree.

在某些模式搜索中,后缀数组比后缀树要慢,但是使用的空间更少,并且比后缀树使用得更广泛。

In the same family of data structures:

在同一系列的数据结构中:

There are other implementations, the CST is an implementation of the suffix tree using a suffix array and some additional data structures to get some of the suffix tree search capabilities.

还有其他实现,CST是后缀树的实现,使用后缀数组和一些附加的数据结构来获得一些后缀树搜索功能。

The FCST takes it further, it implements a sampled suffix tree with a suffix array.

FCST更进一步,它实现一个带有后缀数组的抽样后缀树。

The DFCST is a dynamic version of the FCST.

DFCST是FCST的动态版本。

Expanding:

扩展:

The two important factors are space use and operation execution time. You might think that with modern day machines this is not relevant but to index the DNA of a single human being would require 40 gigabytes of memory (using an uncompressed and unoptimized suffix tree). And to build one of this indexes over this much data can take days. Imagine Google, it has lots of searchable data, they need a large index over all web data and they do not change it every time someone builds a web page. They have some form of caching for that. However the main index is probably static. And every couple of weeks or so they gather all new web sites and data and build a new index, replacing the old when the new is finished. I do not know which algorithm they use to index, but it is probably a suffix array with suffix tree properties over a partitioned database.

两个重要的因素是空间使用和运行执行时间。您可能会认为,对于现代机器来说,这是无关的,但是索引一个人的DNA需要40g的内存(使用一个未压缩的、未优化的后缀树)。要在这么多数据上建立一个这样的索引,可能需要几天的时间。想象一下谷歌,它有很多可搜索的数据,他们需要一个很大的索引覆盖所有的web数据,并且他们不会在每次有人创建一个web页面时改变它。它们有某种形式的缓存。然而,主要指数可能是静态的。每隔几周左右,他们就会收集所有新的网站和数据,建立一个新的索引,取代旧的索引。我不知道他们使用哪种算法来索引,但它可能是一个在分区数据库上带有后缀树属性的后缀数组。

The CST uses 8 gigabytes, however the suffix tree operations speed are heavily reduced.

CST使用8g,但是后缀树操作速度大大降低。

The suffix array can do the same in some 700 megas to 2 Gigas. However you will not find genetic errors in the DNA with a suffix array (meaning: searching for a pattern with a wildcard is much much slower).

后缀数组可以在700兆到2兆的范围内做同样的事情。但是,您不会在带有后缀数组的DNA中发现遗传错误(这意味着:用通配符搜索模式要慢得多)。

The FCST (fully compressed suffix tree) can create a suffix tree in 800 to 1.5 gigas. With a rather small speed deterioration towards the CST.

FCST(完全压缩的后缀树)可以在800到1.5的gigas中创建一个后缀树。以相当小的速度下降到CST。

The DFCST uses 20% more space than the FCST, and loses speed to the static implementation of the FCST (however a dynamic index is very important).

DFCST比FCST多占用20%的空间,并且由于FCST的静态实现而失去速度(但是动态索引非常重要)。

There are not many viable (space wise) implementations of the suffix tree because it is very hard to make the operations speed boost compensate the data structures RAM space cost.

后缀树没有很多可行的(空间方面的)实现,因为很难使操作速度提高来补偿数据结构RAM空间成本。

This said, the suffix tree has very interesting search results for pattern matching with errors. The aho corasick is not as fast (though nearly as fast for some operations, not error matching) and the boyer moore is left in the dust.

也就是说,后缀树有非常有趣的搜索结果来匹配错误的模式。阿霍·科拉西克的速度并没有那么快(尽管对于某些操作来说,速度快得多,而不是错误匹配),而博伊尔·摩尔(boyer moore)则留在了尘埃中。

#2


4  

What operations do you plan on doing? libdivsufsort was at one time the best suffix array implementation in C.

你打算做什么手术?libdivsort曾经是C语言中最好的后缀数组实现。

#3


2  

Using Suffix Trees you can write something that will match your dictionary to your text in O(n+m+k) time where n is letters in your dictionary, m is letters in your text, and k is the number of matches. Tries are much slower for this. I'm not sure what a Suffix Array is, so I can't comment on that.

使用后缀树你可以写一些东西来匹配你的字典和你的文本在O(n+m+k)时间里n是你字典里的字母,m是你文本里的字母,k是匹配的数目。尝试要慢得多。我不确定后缀数组是什么,所以我不能评论它。

That said, it's non-trivial to code and I don't happen to know of any Java libraries that provide the necessary functions.

也就是说,编写代码是很重要的,我碰巧不知道有任何Java库提供必要的函数。

#4


1  

EDIT: In this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.

编辑:在这种情况下,我想要在一个大的名称字典和大量的自然语言文本之间进行字符串匹配,以便在文本上识别字典的名称。

This sounds like an application for the Aho-Corasick algorithm: construct an automaton from the dictionary (in linear time), which can then be used to find all the occurrences of any of the dictionary words in multiple texts (also in linear time).

这听起来像是Aho-Corasick算法的一个应用:从字典中构造一个自动机(以线性时间),然后可以使用它在多个文本(也以线性时间)中查找所有字典词的出现。

(The description in these lecture notes, linked from the "External links" section of the Wikipedia page, is a lot easier to read than the description on the page itself.)

(这些课堂笔记中的描述,来自*页面的“外部链接”部分,比页面上的描述更容易阅读。)

#5


1  

Trie vs Suffix tree

Trie vs后缀树

both data structure ensure a very fast look up, the time of search is proportional to the lenght of the query word, complexity time O(m) where m is the lenght of the query word.

两种数据结构都保证了快速查找,搜索时间与查询词的长度成正比,复杂度O(m),其中m是查询词的长度。

it's mean if we have query word that have 10 chars, so we need at most 10 steps to find it.

这意味着如果查询词有10个字符,那么我们最多需要10步才能找到它。

Trie : A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.

Trie:用于存储字符串的树,其中每个公共前缀都有一个节点。字符串存储在额外的叶节点中。

suffix tree: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.

后缀树:与给定字符串的后缀相对应的trie的紧凑表示,其中具有一个子节点的所有节点都与它们的父节点合并。

def are from : Dictionary of Algorithms and Data Structures

def来自:算法和数据结构字典

generally Trie used to index dictionary words (lexicon) or any sets of strings example D={abcd, abcdd, bxcdf,.....,zzzz }

一般用于索引字典词(lexicon)或任何一组字符串示例D={abcd, abcdd, bxcdf,…,睡眠}

a suffix tree used to index text by using the same data structure "Trie" on all suffixes of our text T=abcdabcg all suffixes of T = {abcdabcg , abcdabc , abcdab, abcda, abcd, abc , ab, a}

在我们的文本T=abcdabcg的所有后缀上使用相同的数据结构“Trie”来索引文本的后缀树。所有的T= {abcdababcg, abcdabc, abcdab, abcda, abcd, abc, ab, a}

now it look like a groups of strings. we build a Trie over over this groups of strings (all suffixes of T).

现在它看起来像一组字符串。我们在这组字符串上构建一个Trie(所有后缀T)。

the construction of both data structure is in linear, it take O(n) in time and space.

这两个数据结构的构造都是线性的,在时间和空间上都需要O(n)。

in case of dicionary (a set of strings): n = the sum of the characters of all the words. in text : n = length of text.

如果是dici(一组字符串):n =所有单词字符的和。文本:n =文本长度。

suffix array : is a technic to represent a suffix tree in compressed sapce, it's an array of all starting positions of suffixes of a string.

后缀数组:是在压缩的sapce中表示后缀树的一种技术,它是字符串后缀的所有起始位置的数组。

it's slower than suffix tree in search time.

搜索时间比后缀树慢。

for more information go to wikipedia , there is a good article talking on this topic.

要了解更多的信息,请访问*,有一篇关于这个话题的好文章。

#6


0  

This implementation of the induced sorting algorithm (called sais) has a Java version for constructing suffix arrays.

这个诱导排序算法的实现(称为sais)有一个用于构造后缀数组的Java版本。

#1


53  

The trie was the first data structure of this kind discovered.

trie是这种发现的第一个数据结构。

The suffix tree is an improvement over the trie (it has suffix links which allow linear error search, the suffix tree trims unnecessary branches of the trie therefore it does not require as much space).

后缀树是对trie的改进(它有允许线性错误搜索的后缀链接,后缀树修剪trie不必要的分支,因此它不需要太多的空间)。

The suffix array is a stripped down data structure based on the suffix tree (no suffix links (slow error matches), yet pattern matching is very fast).

后缀数组是一个基于后缀树的精简数据结构(没有后缀链接(慢错误匹配),但是模式匹配非常快)。

The trie is not for real world use because it consumes too much space.

trie不能用于现实世界,因为它消耗了太多的空间。

The suffix tree is lighter and faster than the trie and is used to index DNA or optimize some large web search engines.

后缀树比trie更轻、更快,用于索引DNA或优化一些大型网络搜索引擎。

The suffix array is slower in some pattern searches than the suffix tree but uses less space, and is more widely used than the Suffix tree.

在某些模式搜索中,后缀数组比后缀树要慢,但是使用的空间更少,并且比后缀树使用得更广泛。

In the same family of data structures:

在同一系列的数据结构中:

There are other implementations, the CST is an implementation of the suffix tree using a suffix array and some additional data structures to get some of the suffix tree search capabilities.

还有其他实现,CST是后缀树的实现,使用后缀数组和一些附加的数据结构来获得一些后缀树搜索功能。

The FCST takes it further, it implements a sampled suffix tree with a suffix array.

FCST更进一步,它实现一个带有后缀数组的抽样后缀树。

The DFCST is a dynamic version of the FCST.

DFCST是FCST的动态版本。

Expanding:

扩展:

The two important factors are space use and operation execution time. You might think that with modern day machines this is not relevant but to index the DNA of a single human being would require 40 gigabytes of memory (using an uncompressed and unoptimized suffix tree). And to build one of this indexes over this much data can take days. Imagine Google, it has lots of searchable data, they need a large index over all web data and they do not change it every time someone builds a web page. They have some form of caching for that. However the main index is probably static. And every couple of weeks or so they gather all new web sites and data and build a new index, replacing the old when the new is finished. I do not know which algorithm they use to index, but it is probably a suffix array with suffix tree properties over a partitioned database.

两个重要的因素是空间使用和运行执行时间。您可能会认为,对于现代机器来说,这是无关的,但是索引一个人的DNA需要40g的内存(使用一个未压缩的、未优化的后缀树)。要在这么多数据上建立一个这样的索引,可能需要几天的时间。想象一下谷歌,它有很多可搜索的数据,他们需要一个很大的索引覆盖所有的web数据,并且他们不会在每次有人创建一个web页面时改变它。它们有某种形式的缓存。然而,主要指数可能是静态的。每隔几周左右,他们就会收集所有新的网站和数据,建立一个新的索引,取代旧的索引。我不知道他们使用哪种算法来索引,但它可能是一个在分区数据库上带有后缀树属性的后缀数组。

The CST uses 8 gigabytes, however the suffix tree operations speed are heavily reduced.

CST使用8g,但是后缀树操作速度大大降低。

The suffix array can do the same in some 700 megas to 2 Gigas. However you will not find genetic errors in the DNA with a suffix array (meaning: searching for a pattern with a wildcard is much much slower).

后缀数组可以在700兆到2兆的范围内做同样的事情。但是,您不会在带有后缀数组的DNA中发现遗传错误(这意味着:用通配符搜索模式要慢得多)。

The FCST (fully compressed suffix tree) can create a suffix tree in 800 to 1.5 gigas. With a rather small speed deterioration towards the CST.

FCST(完全压缩的后缀树)可以在800到1.5的gigas中创建一个后缀树。以相当小的速度下降到CST。

The DFCST uses 20% more space than the FCST, and loses speed to the static implementation of the FCST (however a dynamic index is very important).

DFCST比FCST多占用20%的空间,并且由于FCST的静态实现而失去速度(但是动态索引非常重要)。

There are not many viable (space wise) implementations of the suffix tree because it is very hard to make the operations speed boost compensate the data structures RAM space cost.

后缀树没有很多可行的(空间方面的)实现,因为很难使操作速度提高来补偿数据结构RAM空间成本。

This said, the suffix tree has very interesting search results for pattern matching with errors. The aho corasick is not as fast (though nearly as fast for some operations, not error matching) and the boyer moore is left in the dust.

也就是说,后缀树有非常有趣的搜索结果来匹配错误的模式。阿霍·科拉西克的速度并没有那么快(尽管对于某些操作来说,速度快得多,而不是错误匹配),而博伊尔·摩尔(boyer moore)则留在了尘埃中。

#2


4  

What operations do you plan on doing? libdivsufsort was at one time the best suffix array implementation in C.

你打算做什么手术?libdivsort曾经是C语言中最好的后缀数组实现。

#3


2  

Using Suffix Trees you can write something that will match your dictionary to your text in O(n+m+k) time where n is letters in your dictionary, m is letters in your text, and k is the number of matches. Tries are much slower for this. I'm not sure what a Suffix Array is, so I can't comment on that.

使用后缀树你可以写一些东西来匹配你的字典和你的文本在O(n+m+k)时间里n是你字典里的字母,m是你文本里的字母,k是匹配的数目。尝试要慢得多。我不确定后缀数组是什么,所以我不能评论它。

That said, it's non-trivial to code and I don't happen to know of any Java libraries that provide the necessary functions.

也就是说,编写代码是很重要的,我碰巧不知道有任何Java库提供必要的函数。

#4


1  

EDIT: In this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.

编辑:在这种情况下,我想要在一个大的名称字典和大量的自然语言文本之间进行字符串匹配,以便在文本上识别字典的名称。

This sounds like an application for the Aho-Corasick algorithm: construct an automaton from the dictionary (in linear time), which can then be used to find all the occurrences of any of the dictionary words in multiple texts (also in linear time).

这听起来像是Aho-Corasick算法的一个应用:从字典中构造一个自动机(以线性时间),然后可以使用它在多个文本(也以线性时间)中查找所有字典词的出现。

(The description in these lecture notes, linked from the "External links" section of the Wikipedia page, is a lot easier to read than the description on the page itself.)

(这些课堂笔记中的描述,来自*页面的“外部链接”部分,比页面上的描述更容易阅读。)

#5


1  

Trie vs Suffix tree

Trie vs后缀树

both data structure ensure a very fast look up, the time of search is proportional to the lenght of the query word, complexity time O(m) where m is the lenght of the query word.

两种数据结构都保证了快速查找,搜索时间与查询词的长度成正比,复杂度O(m),其中m是查询词的长度。

it's mean if we have query word that have 10 chars, so we need at most 10 steps to find it.

这意味着如果查询词有10个字符,那么我们最多需要10步才能找到它。

Trie : A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.

Trie:用于存储字符串的树,其中每个公共前缀都有一个节点。字符串存储在额外的叶节点中。

suffix tree: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.

后缀树:与给定字符串的后缀相对应的trie的紧凑表示,其中具有一个子节点的所有节点都与它们的父节点合并。

def are from : Dictionary of Algorithms and Data Structures

def来自:算法和数据结构字典

generally Trie used to index dictionary words (lexicon) or any sets of strings example D={abcd, abcdd, bxcdf,.....,zzzz }

一般用于索引字典词(lexicon)或任何一组字符串示例D={abcd, abcdd, bxcdf,…,睡眠}

a suffix tree used to index text by using the same data structure "Trie" on all suffixes of our text T=abcdabcg all suffixes of T = {abcdabcg , abcdabc , abcdab, abcda, abcd, abc , ab, a}

在我们的文本T=abcdabcg的所有后缀上使用相同的数据结构“Trie”来索引文本的后缀树。所有的T= {abcdababcg, abcdabc, abcdab, abcda, abcd, abc, ab, a}

now it look like a groups of strings. we build a Trie over over this groups of strings (all suffixes of T).

现在它看起来像一组字符串。我们在这组字符串上构建一个Trie(所有后缀T)。

the construction of both data structure is in linear, it take O(n) in time and space.

这两个数据结构的构造都是线性的,在时间和空间上都需要O(n)。

in case of dicionary (a set of strings): n = the sum of the characters of all the words. in text : n = length of text.

如果是dici(一组字符串):n =所有单词字符的和。文本:n =文本长度。

suffix array : is a technic to represent a suffix tree in compressed sapce, it's an array of all starting positions of suffixes of a string.

后缀数组:是在压缩的sapce中表示后缀树的一种技术,它是字符串后缀的所有起始位置的数组。

it's slower than suffix tree in search time.

搜索时间比后缀树慢。

for more information go to wikipedia , there is a good article talking on this topic.

要了解更多的信息,请访问*,有一篇关于这个话题的好文章。

#6


0  

This implementation of the induced sorting algorithm (called sais) has a Java version for constructing suffix arrays.

这个诱导排序算法的实现(称为sais)有一个用于构造后缀数组的Java版本。