放牧代码和思想
专注自然语言处理、机器学习算法
    This thing called love. Know I would've. Thrown it all away. Wouldn't hesitate.

双数组Trie树(DoubleArrayTrie)Java实现

An Implementation of Double-Array Trie

双数组Trie的一种实现

原文http://linux.thai.net/~thep/datrie/datrie.html

引文http://quweiprotoss.blog.163.com/blog/static/4088288320091120112155178/

Contents

  1. What is Trie?

  2. What Does It Take to Implement a Trie?

  3. Tripple-Array Trie

  4. Double-Array Trie

  5. Suffix Compression

  6. Key Insertion

  7. Key Deletion

  8. Double-Array Pool Allocation

  9. An Implementation

  10. Download

  11. References

What is Trie?

Trie is a kind of digital search tree. (See [Knuth1972] for the detail of digital search tree.)[Fredkin1960] introduced the trie terminology, which is abbreviated from "Retrieval".

Trie是一种数字搜索树,[Fredkin]引入了trie这个术语,它是Retrieval的缩写。

       双数组Trie的一种实现 - quweiprotoss - Koala++s blog

Trie is an efficient indexing method. It is indeed also a kind of deterministic finite automaton (DFA) (See[Cohen1990], for example, for the definition of DFA). Within the tree structure, each node corresponds to a DFA state, each (directed) labeled edge from a parent node to a child node corresponds to a DFA transition. The traversal starts at the root node. Then, from head to tail, one by one character in the key string is taken to determine the next state to go. The edge labeled with the same character is chosen to walk. Notice that each step of such walking consumes one character from the key and descends one step down the tree. If the key is exhausted and a leaf node is reached, then we arrive at the exit for that key. If we get stuck at some node, either because there is no branch labeled with the current character we have or because the key is exhausted at an internal node, then it simply implies that the key is not recognized by the trie.

Trie是一种高效的索引方法,它实际上是一种确定有限自动机(DFA),在树的结构中,每一个结点对应一个DFA状态,每一个从父结点指向子结点(有向)标记的边对应一个DFA转换。遍历从根结点开始,然后从headtail,由关键词(本想译成键字符串,感太别扭)的每个字符来决定下一个状态,标记有相同字符的边被选中做移动。注意每次这种移动会从关键词中消耗一个字符并走向树的下一层,如果这个关键字符串空了,并且走到了叶子结点,那么我们达到了这个关键词的出口。如果我们被困在了一点结点,比如因为没有分枝被标记为当前我们有的字符,或是因为关键字符串在中间结点就空了,这表示关键字符串没有被trie认出来。

Notice that the time needed to traverse from the root to the leaf is not dependent on the size of the database, but is proportional to the length of the key. Therefore, it is usually much faster than B-tree or any comparison-based indexing method in general cases. Its time complexity is comparable with hashing techniques.

注意从根遍历到叶子的时间不依赖于数据量的大小,而与关键词的长度成比例。也就是它通常比B-tree要快的多,或是比任何基于比较的索引方法在一般情况下都快的多。它的时间复杂度可以与hashing技术相比。

In addition to the efficiency, trie also provides flexibility in searching for the closest path in case that the key is misspelled. For example, by skipping a certain character in the key while walking, we can fix the insertion kind of typo. By walking toward all the immediate children of one node without consuming a character from the key, we can fix the deletion typo, or even substitution typo if we just drop the key character that has no branch to go and descend to all the immediate children of the current node.

除了效率,trie可以当关键词被拼错了时,提供灵活的搜索。比如在遍历时忽略一个特定的字符,我们可以修正多输入的这种输入错误。直接遍历所有直接的子结点而不消耗一个字符,这样我可以修正少输的输入错误,或者甚至替换输入的错误,如果我们没有分支可以向下移动,就向当前结点的所有子结点直接向下移动。

What Does It Take to Implement a Trie?

In general, a DFA is represented with a transition table, in which the rows correspond to the states, and the columns correspond to the transition labels. The data kept in each cell is then the next state to go for a given state when the input is equal to the label.

通常,一个DFA是用一个transition table表示,它的行对应状态s,它的列对应转换标签s。每一个单元中的数据当输入与标记相同时,给定一个状态后要到达的状态。

This is an efficient method for the traversal, because every transition can be calculated by two-dimensional array indexing. However, in term of space usage, this is rather extravagant, because, in the case of trie, most nodes have only a few branches, leaving the majority of the table cells blanks.

这是一个的遍历高效方法,因为每次转换可以通过计算二维的数组索引得到。但是,从空间使用的角度看,这是相当浪费的,因为,在trie这种情况中,大多数结点只有少量的分枝,表中大多数单元是空白的。

Meanwhile, a more compact scheme is to use a linked list to store the transitions out of each state. But this results in slower access, due to the linear search.

同时,一个更紧凑的方式是用链表保存每个状态的转移,但这种方式会比较慢,因为要进行线性搜索

Hence, table compression techniques which still allows fast access have been devised to solve the problem.

所以,建议使用可以快速访问的表压缩技术。

  1. [Johnson1975] (Also explained in [Aho+1985] pp. 144-146) represented DFA with four arrays, which can be simplified to three in case of trie. The transition table rows are allocated in overlapping manner, allowing the free cells to be used by other rows.

  2. [Aoe1989] proposed an improvement from the three-array structure by reducing the arrays to two.

Tripple-Array Trie

As explained in [Aho+1985] pp. 144-146, a DFA compression could be done using four linear arrays, namely defaultbasenext, and check. However, in a case simpler than the lexical analyzer, such as the mere trie for information retrieval, the default array could be omitted. Thus, a trie can be implemented using three arrays according to this scheme.

[Aho+1985]中解释到,一个DFA压缩可以用四个数据来完成,分别是default, base, next, check。但是与语法分析不同,比如用于信息检索的triedefault可以被忽略。所以,一个trie可以用三个数组来实现。

Structure

The tripple-array structure is composed of:

  1. base. Each element in base corresponds to a node of the trie. For a trie nodesbase[s] is the starting index within the next and check pool (to be explained later) for the row of the node s in the transition table.

  2. next. This array, in coordination with check, provides a pool for the allocation of the sparse vectors for the rows in the trie transition table. The vector data, that is, the vector of transitions from every node, would be stored in this array.

  3. check. This array works in parallel to next. It marks the owner of every cell in next. This allows the cells next to one another to be allocated to different trie nodes. That means the sparse vectors of transitions from more than one node are allowed to be overlapped.

三数组结构包括:

1.  base.base中的每个元素对应trie中的一个结点,对于一个trie结点sbase[s]nextcheck pool的起始索引,它们是在转移表中结点s的行。

2.  next. 这个数组,与check配合,提供一个pool来分配trie转移表中稀疏向量。这个向量数据是从每一个结点的转移向量,会被保存在这个数组中。

3.  check. 这个数组与next平行地工作。它标记在next单元的owner。它允许单元s转移到别一个单元可以分配到不同的trie结点s。这意味着转移的稀疏向量s可以从不同的结点转移到。

Definition 1. For a transition from state s to t which takes character c as the input, the condition maintained in the tripple-array trie is:

check[base[s] + c] = s
next[base[s] + c] = t

 

定义 1. 对于从状态sc做为输入,转移到t,保存在三数组中trie条件是:

check[base[s] + c] = s
next[base[s] + c] = t

         双数组Trie的一种实现 - quweiprotoss - Koala++s blog

Walking

According to definition 1, the walking algorithm for a given state s and the input character c is:

根据定义1,对于一个给定状态s和一个输入字符c,移动算法如下:

  t := base[s] + c;

  if check[t] = s then

      next state := next[t]

  else

      fail

  endif

 

Construction

To insert a transition that takes character c to traverse from a state s to another state t, the cellnext[base[s] + c]] must be managed to be available. If it is already vacant, we are lucky. Otherwise, either the entire transition vector for the current owner of the cell or that of the state s itself must be relocated. The estimated cost for each case could determine which one to move. After finding the free slots to place the vector, the transition vector must be recalculated as follows. Assuming the new place begins at b, the procedure for the relocation is:

插入一个从状态s到另一个状态t的转移,单元next[base[s]+c]必须是空闲的。如果它就是空的,那么我们很幸运。不然,或是将当前 owner的整个转移向量重新分配,或是状态s自己要重新分配。这两种情况的估计代价会决定采用哪用方式。在找到空闲的位置来放这个向量,转移向量必须重新如下计算,假设新的位置开始于b,重新分配的过程如下:

Procedure Relocate(s : stateb : base_index)

{ Move base for state s to a new place beginning at b }

begin

    foreach input character c for the state s

    { i.e. foreach c such that check[base[s] + c]] = s }

    begin

        check[b + c] := s;     { mark owner }

        next[b + c] := next[base[s] + c];     { copy data }

        check[base[s] + c] := none     { free the cell }

    end;

    base[s] := b

end

      双数组Trie的一种实现 - quweiprotoss - Koala++s blog

Double-Array Trie

The tripple-array structure for implementing trie appears to be well defined, but is still not practical to keep in a single file. The next/check pool may be able to keep in a single array of integer couples, but thebase array does not grow in parallel to the pool, and is therefore usually split.

三数组结构来实现trie看真似为是一个很好的方式,但是将它保存到一个单一文件还是不现实的,next/check pool也许可以保存在整型对的一个单个数组,但是base数组不与pool平行地增长,所以通常分离。

To solve this problem, [Aoe1989] reduced the structure into two parallel arrays. In the double-array structure, the base and next are merged, resulting in only two parallel arrays, namely, base and check.

解决这个问题,[Aoe1989]减少这个结构到二个平行的数组,在这个双数组结构中,将basenext合并,它只使用两个平行的数组为,basecheck

Structure

Instead of indirectly referencing through state numbers as in tripple-array trie, nodes in double-array trie are linked directly within the base/check pool.

不同于三数组trie中通过状态号间接引用,在双数组trie是直接在base / check pool中链接。

Definition 2. For a transition from state s to t which takes character c as the input, the condition maintained in the double-array trie is:

check[base[s] + c] = s
base[s] + c = t

 

定义2. 对于一个接收字符c从状态s移动到t的转移,在双数组中保存的条件是:

check[base[s] + c] = s
base[s] + c = t

           双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

Walking

According to definition 2, the walking algorithm for a given state s and the input character c is:

根据定义2,对于从状态s接收输入c的移动算法如下:

  t := base[s] + c;

  if check[t] = s then

      next state := t

  else

      fail

  endif

 

Construction

The construction of double-array trie is in principle the same as that of tripple-array trie. The difference is the base relocation:

构造双数组trie在原理上是与三数组trie相同的,不同点在于base的重新定位。

Procedure Relocate(s : stateb : base_index)

{ Move base for state s to a new place beginning at b }

begin

    foreach input character c for the state s

    { i.e. foreach c such that check[base[s] + c]] = s }

    begin

        check[b + c] := s;     { mark owner }

        base[b + c] := base[base[s] + c];     { copy data }

        { the node base[s] + c is to be moved to b + c;

          Hence, for any i for which check[i] = base[s] + c, update check[i] to bc }

        foreach input character d for the node base[s] + c

        begin

            check[base[base[s] + c] + d] := b + c

        end;

        check[base[s] + c] := none     { free the cell }

    end;

    base[s] := b

end

 

               双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

Suffix Compression

[Aoe1989] also suggested a storage compression strategy, by splitting non-branching suffixes into single string storages, called tail, so that the rest non-branching steps are reduced into mere string comparison.

[Aoe1989]也建议了一种压缩策略,通过拆分non-branching后缀到单一字符串保存,称为tail,这样剩余的non-branching步骤就减为字符串压缩了。

With the two separate data structures, double-array branches and suffix-spool tail, key insertion and deletion algorithms must be modified accordingly.

使用两种不同的数据结构,双数组分枝和suffix-spool tail,关键词插入和删除算法也必须相应的修改。

Key Insertion

To insert a new key, the branching position can be found by traversing the trie with the key one by one character until it gets stuck. The state where there is no branch to go is the very place to insert a new edge, labeled by the failing character. However, with the branch-tail structure, the insertion point can be either in the branch or in the tail.

插入一个新的关键词,分枝位置可以通过关键词中的字符遍历trie,直到它困住。这种状态就是没有分枝可以走,要插入一个新的边的位置,这个边标记为那个失败的字符。然而,用branch-tail结构,插入点可以在branch或是tail

1. When the branching point is in the double-array structure

Suppose that the new key is a string a1a2…ah-1ahah+1…an, where a1a2…ah-1 traverses the trie from the root to a node sr in the double-array structure, and there is no edge labeled ah that goes out of sr. The algorithm called A_INSERT in [Aoe1989] does as follows:

假设一个新的key是一个字符串a1a2…ah-1ahah+1…an,其中a1a2…ah-1trie的根遍历到结点sr,并且没有边标记为ah可以走出sr。这个算法称为A_INSERT[Aoe1989]中如下操作:

From sr, insert edge labeled ah to new node st;

Let st be a separate node poining to a string ah+1…an in tail pool.

 

                双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

2. When the branching point is in the tail pool

Since the path through a tail string has no branch, and therefore corresponds to exactly one key, suppose that the key corresponding to the tail is

a1a2…ah-1ah…ah+k-1b1…bm,

where a1a2…ah-1 is in double-array structure, and ah…ah+k-1b1…bm is in tail. Suppose that the substring a1a2…ah-1 traverses the trie from the root to a node sr.

And suppose that the new key is in the form

a1a2…ah-1ah…ah+k-1ah+k…an,

where ah+k <> b1. The algorithm called B_INSERT in 
[Aoe1989] does as follows:

当路径经过tail字符串没有branch,所以对应一个关键词,假设在tail这个关键词对应的是:

a1a2…ah-1ah…ah+k-1b1…bm,

其中a1a2…ah-1在双数组结构中,而ah…ah+k-1b1…bmtail。假设子串a1a2…ah-1遍历trie从根到sr

并假设这个新的关键词是如下形式:

a1a2…ah-1ah…ah+k-1ah+k…an,
其中ah+k <> b1。这个算法在[Aoe1989]中被称为B_INSERT如下操作:

From sr, insert straight path with ah…ah+k-1, ending at a new node st;

From st, insert edge labeled b1 to new node su;

Let su be separate node pointing to a string b2…bm in tail pool;

From st, insert edge labeled ah+k to new node sv;

Let sv be separate node pointing to a string ah+k+1…an in tail pool.

 

              双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

Key Deletion

To delete a key from the trie, all we need to do is delete the tail block occupied by the key, and all double-array nodes belonging exclusively to the key, without touching any node belonging to other keys.

trie中删除一个关键词,我们仅需要的是删除被这个关键词占有的tail块,和所有只属于这个关键词的双数组结点,不改变其它关键词的结点。

Consider a trie which accepts a language K = {pool#, prepare#, preview#, prize#, produce#, producer#, progress#} :

考虑一个trie,它接收语言K = {pool#, prepare#, preview#, prize#, produce#, producer#, progress#} :

      双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

The key "pool#" can be deleted by removing the tail string "ol#" from the tail pool, and node 3 from the double-array structure. This is the simplest case.

关键词"pool#"可以通过从tail pool移除tail字符串"ol#",即双数组结构中的结点3,这是最简单的情况。

To remove the key "produce#", it is sufficient to delete node 14 from the double-array structure. But the resulting trie will not obay the convention that every node in the double-array structure, except the separate nodes which point to tail blocks, must belong to more than one key. The path from node 10 on will belong solely to the key "producer#".

移除关键词"produce#",从双数组结构中删除结点14就可以了,但是产生的trie不再遵循trie的定义,即除指向tail块的分离的结点外,必须属于多个关键词。从结点10的路径仅属于关键词"producer#"

But there is no harm violating this rule. The only drawback is the uncompactnesss of the trie. Traversal, insertion and deletion algoritms are intact. Therefore, this should be relaxed, for the sake of simplicity and efficiency of the deletion algorithm. Otherwise, there must be extra steps to examine other keys in the same subtree ("producer#" for the deletion of "produce#") if any node needs to be moved from the double-array structure to tail pool.

但是违反这个规则没有什么坏处。唯一的缺点是trie不紧凑。遍历,插入,删除算法都是不改变的。所以,出于删除算法简单和高效,这是可以放宽的。不然,必须有额外的步骤来检查其它在相同子树的关键词,看是否其它结点需要从双数组结构删除到 tail pool

Suppose further that having removed "produce#" as such (by removing only node 14), we also need to remove "producer#" from the trie. What we have to do is remove string "#" from tail, and remove nodes 15, 13, 12, 11, 10 (which now belong solely to the key "producer#") from the double-array structure.

进一步假设删除的除了"produce#"(只删除结点14),我们还需要从trie中删除"producer#"。我们需要做的是从tail中删除"#",并删除15, 13, 12, 11, 10(它们仅属于关键词"producer#")

We can thus summarize the algorithm to delete a key k = a1a2…ah-1ah…an, where a1a2…ah-1 is in double-array structure, and ah…an is in tail pool, as follows :

我们可以总结这个算法,其中删除了一个关键词k = a1a2…ah-1ah…an,其中a1a2…ah-1是在双数组结构中,且ah…antail pool中, 如下:

Let sr := the node reached by a1a2…ah-1;

Delete ah…an from tail;

s := sr;

repeat

    p := parent of s;

    Delete node s from double-array structure;

    s := p

until s = root or outdegree(s) > 0.

 

Where outdegree(s) is the number of children nodes of s.

其中outdegree(s)s的子结点数。

Double-Array Pool Allocation

When inserting a new branch for a node, it is possible that the array element for the new branch has already been allocated to another node. In that case, relocation is needed. The efficiency-critical part then turns out to be the search for a new place. A brute force algorithm iterates along the check array to find an empty cell to place the first branch, and then assure that there are empty cells for all other branches as well. The time used is therefore proportional to the size of the double-array pool and the size of the alphabet.

当对一个结点插入一个新的分枝,有可能数组中为新的分枝的位置已经分配给了其它结点,在这种情况下,需要重新定位,影响效率的部分就是寻找一个新的位置,一种brute force(蛮力)算法迭代check数组找出一个空的单元来放置第一个分枝,然后假设还有空的单元来放其它的分枝。所以时间复杂性与双数组的大小和字母集的大小成正比。

Suppose that there are n nodes in the trie, and the alphabet is of size m. The size of the double-array structure would be n + cm, where c is a coefficient which is dependent on the characteristic of the trie. And the time complexity of the brute force algorithm would be O(nm + cm2).

假设在trie中有n个结点,字母集的大小为m。双数组结构的大小为n+cm,其中c是一个依赖trie字符形式的一个系数。Brute force算法的时间复杂性是O(nm + cm2)

[Aoe1989] proposed a free-space list in the double-array structure to make the time complexity independent of the size of the trie, but dependent on the number of the free cells only. The check array for the free cells are redefined to keep a pointer to the next free cell (called G-link) :

[Aoe1989]提出了在双数组结构中一个free-space list使时间复杂性与trie的大小无关,但仅与空闲单元的方法。Check数组的空闲单元被重定义为保存一下指向下一个空闲单元的指针(称为G-link)

Definition 3. Let r1, r2, … , rcm be the free cells in the double-array structure, ordered by position. G-link is defined as follows :

check[0] = -r1
check[ri] = -ri+1 ; 1 <= i <= cm-1
check[rcm] = -1

 

定义3. r1, r2, … , rcm为双数组结构中的空闲单元,以位置排序。G-link定义如下:

check[0] = -r1
check[ri] = -ri+1 ; 1 <= i <= cm-1
check[rcm] = -1

By this definition, negative check means unoccupied in the same sense as that for "none" check in the ordinary algorithm. This encoding scheme forms a singly-linked list of free cells. When searching for an empty cell, only cm free cells are visited, instead of all n + cm cells as in the brute force algorithm.

根据这个定义,负check意味着空闲与在普通算法中的"none" check是同一个意思,这种编码方式形成了一个空闲单无单一链接的链表。当搜索一个空闲单元,仅cm空闲单元被访问,而不是在brute force算法中的所有n+cm

This, however, can still be improved. Notice that for those cells with negative check, the correspondingbase's are not given any definition. Therefore, in our implementation, Aoe's G-link is modified to be doubly-linked list by letting base of every free cell points to a previous free cell. This can speed up the insertion and deletion processes. And, for convenience in referencing the list head and tail, we let the list be circular. The zeroth node is dedicated to be the entry point of the list. And the root node of the trie will begin with cell number one.

这仍然是可以提高的,注意那些是负check的单元,对应的base单元没有任何给定的定义,所以,在我们的实现中,Aoe’s G-link被修改为双链接的列表,使base中的每一个空闲单元指到前一个空闲单元。这可以提高插入和删除的速度。并且为了方便引用列表的头和尾,我们使用了循环链表,第0个结点被指定为列表的入口。并且trie的根结点开始于第1个单元。

Definition 4. Let r1, r2, … , rcm be the free cells in the double-array structure, ordered by position. G-link is defined as follows :

check[0] = -r1
check[ri] = -ri+1 ; 1 <= i <= cm-1
check[rcm] = 0
base[0] = -rcm
base[r1] = 0
base[ri+1] = -ri ; 1 <= i <= cm-1

 

定义4. r1, r2, … , rcm为双数组结构中的空闲单元,以位置排序,G-link的定义如下:

 

check[0] = -r1
check[ri] = -ri+1 ; 1 <= i <= cm-1
check[rcm] = 0
base[0] = -rcm
base[r1] = 0
base[ri+1] = -ri ; 1 <= i <= cm-1

 

Then, the searching for the slots for a node with input symbol set P = {c1, c2, …, cp} needs to iterate only the cells with negative check :

那么,为输入符号集P = {c1, c2, …, cp}寻找一个结点的单元,只需要在有负check的单元中迭代。

{find least free cell s such that s > c1}

s := –check[0];

while s <> 0 and s <= c1 do

    s := –check[s]

end;

if s = 0 then return FAIL;  {or reserve some additional space}

 

{continue searching for the row, given that s matches c1}

while s <> 0 do

    i := 2;

    while i <= p and check[s + ci – c1] < 0 do

        i := i + 1

    end;

    if i = p + 1 then return s – c1;  {all cells required are free, so return it}

    s := –check[-s]

end;

return FAIL;  {or reserve some additional space}

 

The time complexity for free slot searching is reduced to O(cm2). The relocation stage takes O(m2). The total time complexity is therefore O(cm2 + m2) = O(cm2).

搜索slot的时间复杂度下降到O(cm2)。重定位步骤占O(m2)。所以全部的时间复杂度是O(cm2 + m2) = O(cm2)

It is useful to keep the free list ordered by position, so that the access through the array becomes more sequential. This would be beneficial when the trie is stored in a disk file or virtual memory, because the disk caching or page swapping would be used more efficiently. So, the free cell reusing should maintain this strategy :

free list以位置排序是有好处的,这样访问位置可以是顺序访问,当trie保存在磁盘文件或是虚拟内存中,这会是有好处的,因为磁盘缓存和交换页的使用会更有效。所以,空闲单元重用应保持这个策略。

t := –check[0];

while check[t] <> 0 and t < s do

    t := –check[t]

end;

{t now points to the cell after s' place}

check[s] := -t;

check[-base[t]] := -s;

base[s] := base[t];

base[t] := -s;

 

Time complexity of freeing a cell is thus O(cm).

释放一个单元的时间复杂度是O(cm)

An Implementation

In my implementation, I exploit the concept of virtual memory to manage large and permenent trie. A base class called VirtualMem divides the data file into pages, and swaps them in as needed. Memory economy and disk caching is thus achieved. Different memory accessing methods are provided so that the page swapping mechanism is hidden from the class users. Meanwhile, byte order is internally managed on-the-fly inside the methods to achieve data portability.

Double-array structure and tail pool are then built upon the VirtualMem base class. Trie class is then created to contain the two data structures, with basic interfaces for trie manipulation.

Download

Update: The double-array trie implementation has been simplified and rewritten from scratch in C, and is now named libdatrie. It is now available under the terms of GNU Lesser General Public License (LGPL):

SVN: svn co http://linux.thai.net/svn/software/datrie

The old C++ source code below is under the terms of GNU Lesser General Public License (LGPL):

References

  1. [Knuth1972] Knuth, D. E. The Art of Computer Programming Vol. 3, Sorting and Searching. Addison-Wesley. 1972.

  2. [Fredkin1960] Fredkin, E. Trie Memory. Communication of the ACM. Vol. 3:9 (Sep 1960). pp. 490-499.

  3. [Cohen1990] Cohen, D. Introduction to Theory of Computing. John Wiley & Sons. 1990.

  4. [Johnson1975] Johnson, S. C. YACC-Yet another compiler-compiler. Bell Lab. NJ. Computing Science Technical Report 32. pp.1-34. 1975.

  5. [Aho+1985] Aho, A. V., Sethi, R., Ullman, J. D. Compilers : Principles, Techniques, and Tools. Addison-Wesley. 1985.

  6. [Aoe1989] Aoe, J. An Efficient Digital Search Algorithm by Using a Double-Array Structure. IEEE Transactions on Software Engineering. Vol. 15, 9 (Sep 1989). pp. 1066-1077.

  7. [Virach+1993] Virach Sornlertlamvanich, Apichit Pittayaratsophon, Kriangchai Chansaenwilai. Thai Dictionary Data Base Manipulation using Multi-indexed Double Array Trie. 5th Annual Conference. National Electronics and Computer Technology Center. Bangkok. 1993. pp 197-206. (in Thai)


Theppitak Karoonboonyanan
Created: 1999-06-13
Last Updated 2009-07-12

 

 

koala++ 附注:《双数组Trie树算法优化及其应用研究》这篇论文是同事推荐我看的,并且我看到也有人实现时采用了他的方法。因为一眼注意到文中英文摘要把processed写成了pracessed,因为中文论文一般用word排版,所以感觉这是不可原谅的,英文文法也有些问题。文中还有多处输入的错误,并且题目中的优化也太夸张了,trie本来是支持动态的插入,删除的,用他的方法虽然也可以动态插入,删除,但总感觉不是那么回事。但是文中解释的还是比较清楚,不同于一般的垃圾论文,也可以看看。

知识共享许可协议 知识共享署名-非商业性使用-相同方式共享码农场 » 双数组Trie树(DoubleArrayTrie)Java实现

评论 65

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
  1. #22

    我发现有个问题熬,对于大词库 难免会有点欸冲突

    蒋武2年前 (2021-12-24)回复
  2. #21

    这个begin的选取是基于什么原则呢?如果只想想满足check[begin + a1…an] == 0,为什么举的base会是20032呢,明明前面有很多选择, begin的值为什么要初始化为pos – siblings.get(0).code

    森中人3年前 (2021-01-24)回复
  3. #20

    你这个需要词库事先排序

    影法师4年前 (2020-02-15)回复
  4. #19

    这个是直接用转移基数加上编码来做的转移吗?这样会很稀疏吧

    outsider6年前 (2018-09-12)回复
  5. #18

    这里,我想指出一点,双数组的实现中全程没有用到check数组,也就是说不用check数组依然可以完成trie数构建。
    我觉得check数据计算应该是check[base + c] = s这样。而不是楼主写的check[base +c]=base 。用我提出的那种写法,可以根据子节点回溯到父节点的位置,而楼主这种实现,显然做不到。

    david8年前 (2016-07-13)回复
  6. #17

    你好,请教一个问题:在构造DoubleArrayTrie的时候,实际上先是把字典填充到TreeMap的。我想问的是,向TreeMap;里面灌词的时候,需要安全字典序的先后顺序进行PUT不?

    LMSS_878年前 (2016-04-22)回复
    • 你好,请教一个问题:在构造DoubleArrayTrie的时候,实际上先是把字典填充到TreeMap的。我想问的是,向TreeMap里面灌词的时候,需要按照字典序的先后顺序进行PUT不?

      LMSS_878年前 (2016-04-22)回复
      • 不需要

        hankcs8年前 (2016-04-22)回复
        • 哦,谢谢!

          飞星一念8年前 (2016-04-25)回复
  7. #16

    我仔细看了一下代码,发现博文中这句是不对的:“如果base[p] == check[p] && base[p] < 0 则查到一个词;” 所以,后面的解释,博主也写错了。

    源代码是这样的:

    for (int i = pos; i < len; i++) {
    p = b;
    n = base[p];

    if (b == check[p] && n < 0) {
    result.add(-n – 1);
    }

    p = b + (int) (keyChars[i]) + 1;
    if (b == check[p])
    b = base[p];
    else
    return result;
    }

    除第一轮循环外,在判断是否匹配成功时,b 的值 为前一次循环的 base[p],而不是当前循环的。

    另外,“一举成名天下知” 的匹配是在这个 for 循环退出之后进行的,博文的最后一张图中,在 i,b,p 变量那里,再加上下面这一列,可能会更清楚一些。

    i: 循环退出
    b:30695

    p:30695
    base[p]:-4
    check[p]:30695

    滑车9年前 (2015-11-29)回复
  8. #15

    “而 base[base[20033]] = base[20032] = -1 && base[20032] == check[20032] 说明遇到一个词尾”

    如果我没理解错的话,应该是 “base[base[20033]] = base[20032] = -1 && base[20033] == check[20032] 说明遇到一个词尾”吧~

    滑车9年前 (2015-11-28)回复
  9. #14

    代码有问题,commonPrefixSearch中最后的验证条件
    p = b;
    n = base ;

    if (b == check && n < 0) {
    result.add(-n – 1);
    }
    应该判断的是当前结束状态,也就是b的值,而不是n

    pz9年前 (2015-10-10)回复
  10. #13

    Segment seg = HanLP.newSegment().enableAllNamedEntityRecognize(true);List<Term> termList = seg.seg("奥巴马挑战荒野求生");
    单步跟踪到 if (config.organizationRecognize)
    {
    // 层叠隐马模型——生成输出作为下一级隐马输入
    vertexList = viterbi(wordNetOptimum);/////单步跟踪试一下
    //for(){}
    wordNetOptimum.clear();
    wordNetOptimum.addAll(vertexList);
    preSize = wordNetOptimum.size();
    OrganizationRecognition.Recognition(vertexList, wordNetOptimum, wordNetAll);
    }
    vertexList = viterbi(wordNetOptimum);wordNetOptimum在没进函数时是
    0:[ ]
    1:[奥巴马]
    2:[]
    3:[]
    4:[挑战]
    5:[]
    6:[荒野求生]
    7:[]
    8:[]
    9:[]
    10:[ ]
    从函数出来vertexList 就变成
    [ , 奥, 巴马, 挑战, 荒野, 求生, ]
    非常不能理解,为什么把奥巴马切开了,不是已经变为一个实体了吗?

    small bird9年前 (2015-09-19)回复
    • 我并没有复现你说的问题,在com/hankcs/hanlp/seg/Viterbi/ViterbiSegment.java第94行下断点,vertexList依然是[ , 奥巴马, 挑战, 荒野, 求生, ],Viterbi最短路并不会修改任何节点,它只是找一条最短路径而已。

      hankcs9年前 (2015-09-19)回复
      • 1、博主,你在用户词典中添加“荒野求生”,再识别一下,我这边确实是将“荒野求生”分开了,而且是在执行
        vertexList = viterbi(wordNetOptimum);时分开的。“奥巴马”没切开是因为你在二元词典中添加了“未##人@挑战 100”。
        2、我不是很清楚vertexList = viterbi(wordNetOptimum);这句话有什么用?我调试的时候发现形参wordNetOptimum已经是我想要的了,已经识别了人名、地名了。

        small bird9年前 (2015-09-19)回复
        • 好的,我已经复现了问题,并且找到了原因。等我吃完饭再解决它。

          hankcs9年前 (2015-09-19)回复
        • 1、现在正常了
          CustomDictionary.add("荒野求生");
          Segment seg = HanLP.newSegment().enableAllNamedEntityRecognize(true);
          List<Term> termList = seg.seg("奥巴马挑战荒野求生");
          System.out.println(termList);
          2、这句话是层叠隐马模型必需的。
          问题出在再次Viterbi之前没有清空最后一个节点的前驱节点的引用,不过我通过其他方法解决了。

          hankcs9年前 (2015-09-19)回复
          • public static int getBiFrequency(int idA, int idB)
            {
            if (idA == -1)
            {
            return 1000;
            }
            if (idB == -1)
            {
            return 1000;
            }
            int index = binarySearch(pair, start[idA], start[idA + 1] – start[idA], idB);
            if (index < 0) return 0;
            index <<= 1;
            return pair[index + 1];
            }但是这样那二元词典的意义就不是很大啦!比如进行分词时,
            int nTwoWordsFreq = CoreBiGramTableDictionary.getBiFrequency(from.wordID, to.wordID);如果二元词典没出现的词就默认1000了?或者是我理解错了

            small bird9年前 (2015-09-19)
          • 不是二元词典没出现的词,而是非核心词典的词,也就是用户词典的词。

            hankcs9年前 (2015-09-19)
          • 那可能需要核心词典要相当全,比如“钱啟挑战”,‘啟’在核心词典没有出现,则“啟@挑战”为1000,如果“未##人@挑战”<1000,那么钱啟会识别不了。感觉可以把参数设置小一点,1000有点大

            small bird9年前 (2015-09-19)
          • 这个倒没问题,[钱啟/nr, 挑战/vn],常见单个汉字基本都收录了。其实也可以在Viterbi方法之前添加一个

            wordNetOptimum.getVertexes()[sentence.length + 1].getFirst().from = null; // 再次维特比前清空引用

            不过我嫌它性能低就没这么干。

            hankcs9年前 (2015-09-19)
          • 我理解错了O(∩_∩)O哈哈~,和1000没有关系,不过“钱啟挑战”钱啟是识别不了

            small bird9年前 (2015-09-19)
          • [钱啟/nr, 挑战/vn]

            hankcs9年前 (2015-09-19)
          • 已经发现问题了,O(∩_∩)O谢谢

            small bird9年前 (2015-09-19)
  11. #12

    Segment seg = HanLP.newSegment().enableAllNamedEntityRecognize(true);List<Term> termList = seg.seg("奥巴马挑战荒野求生");
    粗分结果[奥/b, 巴马/ns, 挑战/vn, 荒野求生/nz],可是我明明在customdictionary中添加了“奥巴马”,单步跟踪到 public V output(int state)
    {
    if (state < 0) return null;
    int n = base[state];
    if (state == check[state] && n < 0)
    {
    return v[-n – 1];
    }
    return null;
    }
    return null了!!

    small bird9年前 (2015-09-19)回复
    • 感谢反馈。这个问题跟我的双数组无关。
      首先,核心词典优先级高于用户词典,核心词典中已经有奥巴马了。(优先级可能会在后续版本改进,符合大家的习惯)
      然后,问题在刚刚得到了解决,请clone Github上的最新版。

      hankcs9年前 (2015-09-19)回复
  12. #11

    博主你好!我有2个小问题请教
    1、我在用户词典中添加“荒野求生”,为什么在机构名识别时又将“荒野求生”分开了?
    2、“奥巴马挑战荒野求生”这一句,“奥巴马”识别为“奥/b,巴马/ns”,然后我想在用户词典中添加“奥巴马”来识别奥巴马,可是CoreDictionary.Attribute output = dat.output(state);output输出为null,我困惑了好几天了

    small low bird9年前 (2015-09-19)回复
    • 反馈问题的时候最好附上版本号、触发代码,节省彼此的时间。

      hankcs9年前 (2015-09-19)回复
  13. #10

    博主,按照你的例子试了下,用DoubleArrayTrie竟然没有结果,为何 ??

    荆棘9年前 (2015-09-09)回复
    • 请单步

      hankcs9年前 (2015-09-09)回复
      • 您试过没问题吗?

        荆棘9年前 (2015-09-09)回复
        • 没有

          hankcs9年前 (2015-09-09)回复
          • 博主,我用这段代码测试的:
            final String text = "奇妙";
            List<String> words = Arrays.asList(
            "一举", "一举一动", "一举成名", "一举成名天下知",
            "万能", "万能胶", "奇怪", "奇妙");
            DoubleArrayTrie dat = new DoubleArrayTrie();
            dat.build(words);
            List<Integer> integerList = dat.commonPrefixSearch(text);
            for (int index : integerList) {
            System.out.println(words.get(index));
            }
            结果如下:
            Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -7
            at DoubleArrayTrie.commonPrefixSearch(DoubleArrayTrie.java:373)
            at DoubleArrayTrie.commonPrefixSearch(DoubleArrayTrie.java:353)
            at Test.main(Test.java:7)

            荆棘9年前 (2015-09-09)
          • final String text = "奇妙";
            List<String> words = Arrays.asList(
            "一举", "一举一动", "一举成名", "一举成名天下知",
            "万能", "万能胶", "奇怪", "奇妙");
            Collections.sort(words);
            DoubleArrayTrie dat = new DoubleArrayTrie();
            dat.build(words);
            List<Integer> integerList = dat.commonPrefixSearch(text);
            for (int index : integerList) {
            System.out.println(words.get(index));
            }

            hankcs9年前 (2015-09-09)
  14. #9

    2天了 没看懂 能请教下吗

    林墓荷9年前 (2015-06-26)回复
  15. #8

    请问下,“我实现了通用的Trie树,支持泛型、遍历、储存、载入。”这个通用的代码在hanlp中,是哪个文件,我去学习下。谢谢!

    weyn9年前 (2015-05-19)回复
  16. #7

    看了两天这个算法没看懂 [汗] ,文中
    “假定有字符串状态s,当前字符串状态为t,假定t加了一个字符c就等于状态tc,加了一个字符x等于状态tx,那么有

    base[t ] + c = base[tc]

    base[t ] + x = base[tx]

    check[tc] = check[tx]
    ”我看了英文那个文章,不应该是base[t ] + c = tc,base[t ] + x = tx 么?
    这个“多说”在我这不是很稳定啊,不知道怎么了,这个评论有时候会丢失。

    lzru--9年前 (2015-02-05)回复
    • darts-java作者参考的是一篇日本论文,不是这篇英文论文,所以代码实现和英文论文不一致。我介绍的内容是从代码里总结出来的,具体是exactMatchSearch这个方法:
      int b = base[nodePos];
      int p;

      for (int i = pos; i < len; i++)
      {
      p = b + (int) (keyChars [i] ) + 1;
      if (b == check [p] )
      b = base [p] ;
      else
      return result;
      }
      你所说的base[t ] + c = tc,是将状态等同于下标,其实对应p = b + (int) (keyChars [i] ) + 1,我所说的base[t ] + c = base[tc],是将状态等同于base的值,其实对应b = base [p] 。在darts-java中,不是下标在转移,而是b值(也就是base [i] )在转移。原理是一样的,我这种状态的描述可能不适合你,有误导作用。

      hankcs9年前 (2015-02-05)回复
  17. #6

    楼主博文中的图作图用的什么软件?

    lzru--9年前 (2015-02-03)回复
  18. #5

    怎么样给定前缀,把所有是这个前缀的词都找出来,类似于搜素引擎的自动提示?

    郭锐9年前 (2015-01-15)回复
  19. #4

    博主,因为写论文需要,所以搜索双数组trie,看了你的博客,真的很佩服,很惭愧。要好好向你学习。惭愧。会继续来向你好好学习的

    web菜鸟9年前 (2015-01-06)回复
  20. #3

    我到github上下载了darts-java,也写了一个单元测试,甚至直接用了博主的测试代码,但是发现原程序有bug,无法得到博主的结果。
    请问您用的是那份代码?请多多指教。

    爱晓10年前 (2014-12-11)回复
    • 就是这份代码,你得自己准备一个./data/small.dic,里面的单词必须是字典序

      hankcs10年前 (2014-12-11)回复
  21. #2

    怎么用这个查找敏感词呢?

    郭锐10年前 (2014-11-21)回复
  22. #1

    看代码,在插入过程中,对于第2点,对于每一群兄弟节点,应该是寻找一个begin值使得check[begin + a1…an] == 0

    邓世龙10年前 (2014-07-04)回复

我的作品

HanLP自然语言处理包《自然语言处理入门》