2015年1月13日 星期二

[String Matching] Suffix Array and Suffix Tree

[Suffix Array]


Suffix Array(a.k.a. SA) is a sorted array of all suffixes of a string.
Suffix Array(簡稱SA)是一個字串的所有suffix排序後存起來的陣列,例如:

String S = "banana$"

All of its suffixes are listed below
S的所有suffix都列在下方:

row isuffix
1 banana$
2 anana$
3 nana$
4 ana$
5 na$
6 a$
7 $

After sorting in lexicographical order will looked like this
以字母順序排序後看起來會像是這樣:

row isuffix
7 $
6 a$
4 ana$
2 anana$
1 banana$
5 na$
3 nana$

So, the SA of S is 
所以S的SA是
{ 7, 6, 4, 2, 1, 5, 3 }

[Suffix Tree]

We still use S = "banana$" as our example
我們仍然用S = "banana$"做為我們的範例

The root is empty and the smallest of its child should be on the left subtree and the larger should be on the right subtree.
Root必須是空的,且ROOT最小的child要被放在左邊,最大的要在右邊

First, we put the root and its smallest child on the graph
首先,我們把ROOT和其最小的child放在圖上


The next one we should put on is "a$", but "ana$" and "anana$" both start with "a", so we draw like this:
我們下個要放的應該是"a$",但是"ana$" 和 "anana$都是以"a"開頭,所以我們要這樣畫:
Clearly, we can see that if 2 or more suffixes have the same prefix, then the prefix will be their parent.
我們可以看出來兩個suffix如果有相同的prefix,那那串相同的prefix就是他們的parent。

Finally, out suffix tree with contents of SA on it will look like this:
最後,加上SA的內容後,我們的suffix tree會長成這樣: