Core Logic and Mathematical Principles
The primary task of the KMP algorithm is to achieve pattern matching of a single pattern string within a text string in $O(N + M)$ time complexity, breaking the inherent limit of brute-force matching at $O(NM)$. The underlying mathematical foundation is the Prefix Function, often defined as the $next$ array in competitive programming.
1. Mathematical Definition of the Prefix Function
For a string $P$ of length $M$, its prefix function $next[i]$ represents the length of the longest equal proper prefix and proper suffix of the substring $P[0 \dots i]$. The set representation is as follows:
$$next[i] = \max \big\{ k \big| 0 \le k < i+1 \text{ and } P[0 \dots k-1] == P[i-k+1 \dots i] \big\}$$
If this set is empty, then $next[i] = 0$.
2. Proof of Convergence for Mismatch State Transition
When the character $S[i]$ in the text string mismatches with the character $P[j]$ in the pattern string, traditional algorithms backtrack $i$. KMP keeps $i$ stationary and transitions $j$ to $next[j-1]$.
Proof: Given that $P[0 \dots j-1] == S[i-j \dots i-1]$. If there exists a shorter matching block length $k$ such that the new prefix can connect to the current suffix, it must satisfy $P[0 \dots k-1] == P[j-k \dots j-1]$. According to the definition of the prefix function, the maximum $k$ is strictly equal to $next[j-1]$. Since $next[j-1] < j$, the transition state is monotonically decreasing, and iteration must converge to $0$ in a finite number of steps.
State Design and Algorithm Derivation
1. Recursion of the Prefix Function $next$ Array
Assuming $next[0 \dots i-1]$ is known, we now derive the value of $next[i]$. Let $j = next[i-1]$.
- Case 1: If $P[i] == P[j]$, the current match can extend one position to the right, leading to a direct state transition:
$$next[i] = j + 1$$
- Case 2: If $P[i] \neq P[j]$, it indicates that the current length of the proper prefix cannot extend. In this case, we must shorten the search range and jump to a shorter equal proper prefix-suffix position, letting $j = next[j-1]$, and repeat the check until $P[i] == P[j]$.
- Boundary: If $j$ backtracks to $0$ and still $P[i] \neq P[0]$, then $next[i] = 0$.
2. Pattern Matching State Machine
Consider the pattern string $P$ as a finite state automaton, where the state value is the number of characters successfully matched, denoted as $j$. For each character $S[i]$ in the text string:
- If $S[i] \neq P[j]$ and $j > 0$, the state backtracks continuously along the mismatch pointer:
j = next[j - 1]. - If $S[i] == P[j]$, the state advances:
j++. - If $j == M$, it indicates a complete match of the pattern string, record the answer position $i - M + 1$, and then forcefully transition the state to
j = next[j - 1]to continue matching subsequent overlapping substrings.
C++ Standard Source Code (NOIP Style)
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;
// Efficient input function to handle regular string input
void fast_io() {
ios::sync_with_stdio(false);
cin.tie(nullptr);
}
const int MAXM = 1000005;
int nxt[MAXM]; // Avoid global keyword conflicts with C++ standard libraries by renaming to nxt
// Strictly compute the prefix function based on 0-based indexing
void get_next(const string& p, int m) {
nxt[0] = 0; // The substring of length 1 has no proper prefix or suffix, hence must be 0
int j = 0; // j always represents the current prefix character index to compare, and also the nxt value of the previous position
for (int i = 1; i < m; ++i) {
while (j > 0 && p[i] != p[j]) {
j = nxt[j - 1]; // Critical point: on mismatch, must backtrack to nxt[j-1] instead of nxt[j]
}
if (p[i] == p[j]) {
j++;
}
nxt[i] = j;
}
}
// KMP main matching process
void kmp_search(const string& s, const string& p) {
int n = s.length();
int m = p.length();
get_next(p, m);
int j = 0; // Pointer for the pattern string
for (int i = 0; i < n; ++i) {
while (j > 0 && s[i] != p[j]) {
j = nxt[j - 1]; // On mismatch, adaptively transition state; i pointer never backtracks
}
if (s[i] == p[j]) {
j++;
}
if (j == m) {
// Successfully matched the pattern in the text string, output the starting index (1-based index common in NOIP evaluations)
cout << i - m + 2 << "\n";
j = nxt[j - 1]; // Critical point: after a successful match, to find the next overlapping target, must forcibly transition state
}
}
}
int main() {
fast_io();
string s, p;
if (cin >> s >> p) {
kmp_search(s, p);
int m = p.length();
for (int i = 0; i < m; ++i) {
cout << nxt[i] << (i == m - 1 ? "" : " ");
}
cout << "\n";
}
return 0;
}
NOIP Practical Avoidance Guide
1. Global Variable Naming Conflicts and Compilation Failures
When compiling with g++ -O2 in an offline Linux environment, if certain standard libraries are included, a global declaration int next[MAXN]; may conflict with the standard iterator operator std::next in <algorithm> or built-in namespaces. This will lead to cryptic symbol overloading errors from the compiler. In competitions, the array must be strictly named nxt or nxt_arr.
2. Mismatch Index Out of Bounds and Infinite Loops
Inside the while (j > 0 && p[i] != p[j]) loop, if the jump statement is written as j = nxt[j];, it will directly lead to logical disasters. Because when the character at $j$ mismatches, it should query the longest common prefix-suffix of the successfully matched substring $P[0 \dots j-1]$. If written as nxt[j], it not only loses mathematical correctness but may also lead to an infinite loop due to malicious data when nxt[j] == j, or trigger RE when $j = M$ due to out-of-bounds.
3. Mixing 0-based and 1-based Indexing Leading to State Gaps
The implementation of the KMP algorithm is highly dependent on index boundaries. If a 0-based index is used when reading the string, while the transition logic of the next array borrows from some textbooks' 1-based algorithms (e.g., nxt[j] = nxt[nxt[j]] and initially setting nxt[0] = -1), it will lead to physical deviations of $\pm 1$ in boundary indices, causing the state machine to completely lose its high-level interception capability, resulting in widespread WA in evaluations.
Classic NOIP/Luogu Problems
1. Luogu P3375 [Template] KMP String Matching
- Problem Description: Given two strings $S_1$ and $S_2$, find all occurrences of $S_2$ in $S_1$ and output the prefix function array of $S_2$.
- Core Essence: A pure industrial-grade implementation of the standard KMP algorithm.
- Core Problem-Solving Idea: As shown in the standard source code, first use the $O(M)$ complexity of single-string self-reduction to derive the entire
nxtstate machine, and then use this state machine to match against the text string $S_1$. The entire algorithm consists solely of linear single-directional pointer movements, with a time complexity strictly locked at $O(N + M)$ and a space complexity of only $O(M)$.
2. Luogu P2375 [NOI2014] Zoo
- Problem Description: For a string $S$, define $num[i]$ as the number of equal proper prefixes and suffixes of $S[0 \dots i]$, with lengths not exceeding half of the total length of the substring. Calculate the cumulative product of $(num[i] + 1)$ for all positions in the string modulo $10^9 + 7$.
- Core Essence: Exploring the topological depth of the prefix function tree (the $next$ tree).
- Core Problem-Solving Idea: If we directly use the
nxtarray to brute force backtrack to find positions with lengths less than or equal to half, the time complexity will degrade to $O(N^2)$, leading toTLE. The core breakthrough point is that thenxtpointer of KMP actually forms a tree-like topological structure converging to the root node (each node $i$'s parent is $nxt[i-1]$). The essence of $num[i]$ is to find the number of nodes in the ancestor chain of the current node on this tree with values less than or equal to $\lfloor \frac{i+1}{2} \rfloor$. We can maintain a double pointernum_jthat stays at the highest valid position not exceeding half while advancingi, leveraging the non-backtracking nature of double pointers to harden the time complexity to $O(N)$.