Suffix array
The suffix array is a memory-efficient alternative to the suffix tree which provides a sorted list of string indices indicating the string’s suffixes.
Creation
Provided that the sentinel character $
has the smallest value the suffix array for the string barbapapa
is determined as follows:
-
determine all possible suffixes and the corresponding start indices (the index one indicates the first position).
suffix start position in string barbapapa$ 1 arbapapa$ 2 rbapapa$ 3 bapapa$ 4 apapa$ 5 papa$ 6 apa$ 7 pa$ 8 a$ 9 $ 10 -
sort all suffixes
suffix start position in string $ 10 a$ 9 apa$ 7 apapa$ 5 arbapapa$ 2 barbapapa$ 1 bapapa$ 4 pa$ 8 papa$ 6 rbapapa$ 3 -
the suffix array only records the start positions, i.e. contains the values
[10, 9, 7, 5, 2, 1, 4, 8, 6, 3]
in the example above.
Suffix arrays can be derived from suffix trees in \(\mathcal{O}(n^2 log\quad n)\) although more efficient algorithms that are able to build the array in \(\mathcal{O}(n)\) exist.
Applications:
- searching: e.g. does the suffix
papa
occur in the string? Do a binary search in the array and lookup the corresponding string value to determine whether a suffix is present. - counting: e.g. how often does the suffix
apa
occur?. Do a binary search and count the neighboring suffixes starting withapa
. - K-mer counting: compute all substrings of up to size
k
and the number of times they occur in the string.