Efficient String Matching: A Comprehensive Guide to the K...

Core Logic and Mathematical Principles

The primary task of the KMP algorithm is to achieve pattern matching of a single pattern string within a text string in $O(N + M)$ time complexity, breaking the inherent limit of brute-force matching at $O(NM)$. The underlying mathematical foundation is the Prefix Function, often defined as the $next$ array in competitive programming.

1. Mathematical Definition of the Prefix Function

For a string $P$ of length $M$, its prefix function $next[i]$ represents the length of the longest equal proper prefix and proper suffix of the substring $P[0 \dots i]$. The set representation is as follows:

$$next[i] = \max \big\{ k \big| 0 \le k < i+1 \text{ and } P[0 \dots k-1] == P[i-k+1 \dots i] \big\}$$

If this set is empty, then $next[i] = 0$.

2. Proof of Convergence for Mismatch State Transition

When the character $S[i]$ in the text string mismatches with the character $P[j]$ in the pattern string, traditional algorithms backtrack $i$. KMP keeps $i$ stationary and transitions $j$ to $next[j-1]$.

Proof: Given that $P[0 \dots j-1] == S[i-j \dots i-1]$. If there exists a shorter matching block length $k$ such that the new prefix can connect to the current suffix, it must satisfy $P[0 \dots k-1] == P[j-k \dots j-1]$. According to the definition of the prefix function, the maximum $k$ is strictly equal to $next[j-1]$. Since $next[j-1] < j$, the transition state is monotonically decreasing, and iteration must converge to $0$ in a finite number of steps.

State Design and Algorithm Derivation

1. Recursion of the Prefix Function $next$ Array

Assuming $next[0 \dots i-1]$ is known, we now derive the value of $next[i]$. Let $j = next[i-1]$.

Case 1: If $P[i] == P[j]$, the current match can extend one position to the right, leading to a direct state transition:

$$next[i] = j + 1$$

Case 2: If $P[i] \neq P[j]$, it indicates that the current length of the proper prefix cannot extend. In this case, we must shorten the search range and jump to a shorter equal proper prefix-suffix position, letting $j = next[j-1]$, and repeat the check until $P[i] == P[j]$.
Boundary: If $j$ backtracks to $0$ and still $P[i] \neq P[0]$, then $next[i] = 0$.

2. Pattern Matching State Machine

Consider the pattern string $P$ as a finite state automaton, where the state value is the number of characters successfully matched, denoted as $j$. For each character $S[i]$ in the text string:

If $S[i] \neq P[j]$ and $j > 0$, the state backtracks continuously along the mismatch pointer: j = next[j - 1].
If $S[i] == P[j]$, the state advances: j++.
If $j == M$, it indicates a complete match of the pattern string, record the answer position $i - M + 1$, and then forcefully transition the state to j = next[j - 1] to continue matching subsequent overlapping substrings.

C++ Standard Source Code (NOIP Style)

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>

using namespace std;

// Efficient input function to handle regular string input
void fast_io() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
}

const int MAXM = 1000005;
int nxt[MAXM]; // Avoid global keyword conflicts with C++ standard libraries by renaming to nxt

// Strictly compute the prefix function based on 0-based indexing
void get_next(const string& p, int m) {
    nxt[0] = 0; // The substring of length 1 has no proper prefix or suffix, hence must be 0
    int j = 0;  // j always represents the current prefix character index to compare, and also the nxt value of the previous position
    for (int i = 1; i < m; ++i) {
        while (j > 0 && p[i] != p[j]) {
            j = nxt[j - 1]; // Critical point: on mismatch, must backtrack to nxt[j-1] instead of nxt[j]
        }
        if (p[i] == p[j]) {
            j++;
        }
        nxt[i] = j;
    }
}

// KMP main matching process
void kmp_search(const string& s, const string& p) {
    int n = s.length();
    int m = p.length();

    get_next(p, m);

    int j = 0; // Pointer for the pattern string
    for (int i = 0; i < n; ++i) {
        while (j > 0 && s[i] != p[j]) {
            j = nxt[j - 1]; // On mismatch, adaptively transition state; i pointer never backtracks
        }
        if (s[i] == p[j]) {
            j++;
        }
        if (j == m) {
            // Successfully matched the pattern in the text string, output the starting index (1-based index common in NOIP evaluations)
            cout << i - m + 1 << "\n"; 
            j = nxt[j - 1]; // Critical point: after a successful match, to find the next overlapping target, must forcibly transition state
        }
    }
}

int main() {
    fast_io();
    string s, p;
    if (cin >> s >> p) {
        kmp_search(s, p);
        int m = p.length();
        for (int i = 0; i < m; ++i) {
            cout << nxt[i] << (i == m - 1 ? "" : " ");
        }
        cout << "\n";
    }
    return 0;
}

NOIP Practical Avoidance Guide

1. Global Variable Naming Conflicts and Compilation Failures

When compiling with g++ -O2 in an offline Linux environment, if certain standard libraries are included, a global declaration int next[MAXN]; may conflict with the standard iterator operator std::next in <algorithm> or built-in namespaces. This will lead to cryptic symbol overloading errors from the compiler. In competitions, the array must be strictly named nxt or nxt_arr.

2. Mismatch Index Out of Bounds and Infinite Loops

Inside the while (j > 0 && p[i] != p[j]) loop, if the jump statement is written as j = nxt[j];, it will directly lead to logical disasters. Because when the character at $j$ mismatches, it should query the longest common prefix-suffix of the successfully matched substring $P[0 \dots j-1]$. If written as nxt[j], it not only loses mathematical correctness but may also lead to an infinite loop due to malicious data when nxt[j] == j, or trigger RE when $j = M$ due to out-of-bounds.

3. Mixing 0-based and 1-based Indexing Leading to State Gaps

The implementation of the KMP algorithm is highly dependent on index boundaries. If a 0-based index is used when reading the string, while the transition logic of the next array borrows from some textbooks' 1-based algorithms (e.g., nxt[j] = nxt[nxt[j]] and initially setting nxt[0] = -1), it will lead to physical deviations of $\pm 1$ in boundary indices, causing the state machine to completely lose its high-level interception capability, resulting in widespread WA in evaluations.

Classic NOIP/Luogu Problems

1. Luogu P3375 [Template] KMP String Matching

Problem Description: Given two strings $S_1$ and $S_2$, find all occurrences of $S_2$ in $S_1$ and output the prefix function array of $S_2$.
Core Essence: A pure industrial-grade implementation of the standard KMP algorithm.
Core Problem-Solving Idea: As shown in the standard source code, first use the $O(M)$ complexity of single-string self-reduction to derive the entire nxt state machine, and then use this state machine to match against the text string $S_1$. The entire algorithm consists solely of linear single-directional pointer movements, with a time complexity strictly locked at $O(N + M)$ and a space complexity of only $O(M)$.

2. Luogu P2375 [NOI2014] Zoo

Problem Description: For a string $S$, define $num[i]$ as the number of equal proper prefixes and suffixes of $S[0 \dots i]$, with lengths not exceeding half of the total length of the substring. Calculate the cumulative product of $(num[i] + 1)$ for all positions in the string modulo $10^9 + 7$.
Core Essence: Exploring the topological depth of the prefix function tree (the $next$ tree).
Core Problem-Solving Idea: If we directly use the nxt array to brute force backtrack to find positions with lengths less than or equal to half, the time complexity will degrade to $O(N^2)$, leading to TLE. The core breakthrough point is that the nxt pointer of KMP actually forms a tree-like topological structure converging to the root node (each node $i$'s parent is $nxt[i-1]$). The essence of $num[i]$ is to find the number of nodes in the ancestor chain of the current node on this tree with values less than or equal to $\lfloor \frac{i+1}{2} \rfloor$. We can maintain a double pointer num_j that stays at the highest valid position not exceeding half while advancing i, leveraging the non-backtracking nature of double pointers to harden the time complexity to $O(N)$.

Efficient String Matching: A Comprehensive Guide to the KMP Algorithm