Pattern matching is one of the core operations in computer science, and solutions based on the renowned Burrows-Wheeler Transform (BWT) are especially prominent. The success of the BWT lies in its pattern matching algorithm, known as backward search, which is not only near-optimal in the RAM model but also runs directly on a compressed representation of the input string.
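To make backward search concrete, here is a minimal Python sketch on a plain string (not on an automaton). It is an illustration under simplifying assumptions, not the implementation used in the work summarized here: the BWT is built by naively sorting suffixes, `rank` is a linear scan rather than a compressed rank structure, and the function names `build_bwt` and `backward_search` are made up for this example.

```python
from collections import Counter


def build_bwt(text):
    """Return the BWT of text + '$' via naive suffix sorting (fine for small inputs)."""
    text += "$"  # sentinel, assumed to be smaller than every character of the input
    suffix_order = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character of a suffix is the character that precedes it in the text.
    return "".join(text[i - 1] for i in suffix_order)


def backward_search(bwt, pattern):
    """Count the occurrences of pattern in the text that bwt was built from."""
    # first_row[c] = number of characters in bwt that are strictly smaller than c.
    counts = Counter(bwt)
    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def rank(c, i):
        # Occurrences of c in bwt[0:i]. A real index answers this in (near-)constant
        # time with a succinct structure; the linear scan here is only for clarity.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)  # half-open range of suffixes matching the part read so far
    for c in reversed(pattern):  # the pattern is consumed right to left
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(c, lo)
        hi = first_row[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt = build_bwt("banana")
print(backward_search(bwt, "ana"))  # 2
```

Note that each pattern character triggers two `rank` queries at positions that can be far apart in the BWT, so consecutive steps of the loop tend to touch unrelated memory regions.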
Recently, backward search has been generalized to Wheeler Deterministic Finite Automata (Wheeler DFAs), a subclass of standard DFAs, with applications in bioinformatics. Researchers have shown that specific pangenome graphs of human chromosomes can be transformed into Wheeler DFAs and indexed using this strategy.
However, this BWT-based index on Wheeler DFAs inherits a significant drawback from the original backward search: it triggers a high number of I/O operations during execution, and in the worst case that number is lower-bounded by the length of the pattern. To address this limitation, we propose the first cache-friendly algorithm specifically designed for Wheeler DFAs. Our new data structure reduces the number of I/O operations by employing a strategy analogous to searching a suffix array: it interleaves binary search with fast sequential scans of the automaton.
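The suffix array strategy that the new data structure is compared to can be sketched on a plain string as follows. This is only the classical string analogue, written under the assumption of a naively built suffix array; it is not the Wheeler DFA data structure itself, and the names `build_suffix_array` and `count_occurrences` are hypothetical. The point to notice is the access pattern: a logarithmic number of binary search steps, each followed by a comparison that reads contiguous characters.

```python
def build_suffix_array(text):
    """Naive suffix array: starting positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of pattern using two binary searches over the suffix array."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix: one contiguous read of text.
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


text = "banana"
suffix_array = build_suffix_array(text)
print(count_occurrences(text, suffix_array, "ana"))  # 2
```

Each binary search step jumps to a new position, but the comparison that follows scans adjacent characters, so the number of non-contiguous accesses grows with the logarithm of the text length rather than with the pattern length.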
We empirically validate this new indexing strategy by running our algorithm on real-world Wheeler pangenome graphs. Our results show that, while our data structure can use up to 15 times the space required by backward search, it can also be 500 times faster, processing a single character of the pattern in less than 3 ns.
Blogger's Review: The proposed algorithm significantly improves the performance of pattern matching on Wheeler DFAs by reducing I/O operations, offering a new solution to the pattern matching problem in bioinformatics. The trade-off in space usage is noteworthy, and future research could further optimize the algorithm for broader applications.