The suffix array is a memory-efficient alternative to the suffix tree which provides a sorted list of string indices indicating the string’s suffixes.
Provided that the sentinel character
$ has the smallest value the suffix array for the string
barbapapa is determined as follows:
determine all possible suffixes and the corresponding start indices (the index one indicates the first position).
suffix start position in string barbapapa$ 1 arbapapa$ 2 rbapapa$ 3 bapapa$ 4 apapa$ 5 papa$ 6 apa$ 7 pa$ 8 a$ 9 $ 10
sort all suffixes
suffix start position in string $ 10 a$ 9 apa$ 7 apapa$ 5 arbapapa$ 2 barbapapa$ 1 bapapa$ 4 pa$ 8 papa$ 6 rbapapa$ 3
the suffix array only records the start positions, i.e. contains the values
[10, 9, 7, 5, 2, 1, 4, 8, 6, 3]in the example above.
Suffix arrays can be derived from suffix trees in although more efficient algorithms that are able to build the array in exist.
- searching: e.g. does the suffix
papaoccur in the string? Do a binary search in the array and lookup the corresponding string value to determine whether a suffix is present.
- counting: e.g. how often does the suffix
apaoccur?. Do a binary search and count the neighboring suffixes starting with
- K-mer counting: compute all substrings of up to size
kand the number of times they occur in the string.