We study the Internal Dictionary Matching (IDM) problem, where a dictionary $\mathcal{D}$ containing $d$ substrings and a text $T$ are given, and each query concerns the occurrences of patterns in $\mathcal{D}$ within another substring of $T$. We propose a novel $O(n)$-sized data structure named Basic Substring Structure (BASS), where $n$ is the length of the text $T$. With BASS, we can handle all types of queries in the IDM problem with nearly optimal query and preprocessing time. Specifically, our results include:
- The first algorithm that answers the CountDistinct query in $\tilde{O}(1)$ time with $\tilde{O}(n+d)$ preprocessing, computing the number of distinct patterns in $T[l,r]$. Previously, the best result was $\tilde{O}(m)$ time per query after $\tilde{O}(n^2/m+d)$ or $\tilde{O}(nd/m+d)$ preprocessing, where $m$ is a chosen parameter.
- Faster algorithms for two other types of internal queries. We improved the runtime for (1) Occurrence counting (Count) queries to $O(\log n/\log\log n)$ time per query with $O(n+d\sqrt{\log n})$ preprocessing from $O(\log^2 n/\log\log n)$ time per query with $O(n\log n/\log \log n+d\log^{3/2} n)$ preprocessing. (2) Distinct pattern reporting (ReportDistinct) queries to $O(1+|\text{output}|)$ time per query from $O(\log n+|\text{output}|)$ per query. Additionally, we match the optimal runtime for the remaining two types of queries, pattern existence (Exists) and occurrence reporting (Report). We also demonstrate that BASS is more generally applicable to other internal query problems.
Blogger's Review: The introduction of the BASS data structure marks a significant breakthrough in the Internal Dictionary Matching problem, offering nearly optimal time complexity both theoretically and practically. This presents a new approach to handling large-scale text data, particularly in applications requiring frequent queries. Its efficient preprocessing and query times will greatly enhance the performance of text retrieval operations.