NeFut Logo NeFut
Admin Login

Efficient Applications of String Hashing and KMP Periodicity Theorem

Published at: 2026-05-27 09:17 Last updated: 2026-06-10 09:18
#C++ #Tutorial

Core Logic and Mathematical Principles

In string tests at the NOIP level, when faced with extremely large data scales or multi-dimensional substring detections, relying solely on KMP often restricts itself due to the topological singularity of single-pointer matching. This section will break down two competitive-level breakthroughs: Collision-resistant design of multi-base large prime Hash and Periodicity theorem of KMP prefix function.

1. Collision-resistant Design of String Hash with Double Modulus

The essence of string Hash is to map a string of arbitrary length to a fixed-size integer. Let the string be $S = s_1s_2\dots s_n$, and choose a fixed base $Base$ (usually a large prime, such as $131$ or $13331$) and a modulus $M$. The single hash value recursive formula is:

$$H[i] = (H[i-1] \times Base + s_i) \pmod M$$

Thus, the hash value of any substring $S[l \dots r]$ can be extracted in $O(1)$ time using the prefix sum difference formula:

$$Hash(S[l \dots r]) = (H[r] - H[l-1] \times Base^{r-l+1}) \pmod M$$

Birthday Attack and Hash Collision Theorem: According to the birthday paradox, among $O(\sqrt{M})$ different strings, the probability of a hash collision (i.e., two different strings having the same hash value) will rapidly approach $1$. If a single unsigned long long is used and causes natural overflow (equivalent to $M = 2^{64}$), or if a single prime $10^9+7$ is used, since the competition test cases often contain deliberately constructed "natural overflow" malicious data (such as overflow-resistant hacker slider mechanisms), it is easy to cause data collisions leading to WA.

Double Hash Design: Choose two pairs of completely independent bases and large primes $(Base_1, M_1)$ and $(Base_2, M_2)$, uniquely marking each string with a tuple $(Hash_1, Hash_2)$. Effective collision-resistant moduli are typically selected from the following industrial-grade large primes:

2. Periodicity Theorem of KMP Prefix Function (Minimum Cycle Length Theorem)

The prefix function $next$ (or prefix function $\pi$) can not only be used for matching but also encapsulates strong local topological periodicity characteristics.

Periodicity Theorem: Let the length of a string be $N$, and the value at the end position of its prefix function be $next[N-1]$ (0-based corresponding to the longest equal prefix and suffix length of the entire string). Let $L = N - next[N-1]$.

Proof: It is known that the prefix and suffix of length $next[N-1]$ are completely identical. This means that after shifting the original string to the right by $L = N - next[N-1]$ positions, the overlapping area of the shifted part aligns perfectly. If $N$ is divisible by $L$, it indicates that the shifted slider seamlessly stitches the entire sequence, and the theorem holds.


State Design and Algorithm Derivation

1. Double Hash Prefix Sum and Differential State

2. Derivation of Any Substring Hash Extraction

Extract the double hash feature values of $S[l \dots r]$ (with indices starting from 1):

$$ans_1 = (H1[r] - H1[l - 1] \times P1[r-l+1]) \pmod {M_1}$$

$$ans_2 = (H2[r] - H2[l - 1] \times P2[r-l+1]) \pmod {M_2}$$

If the result is negative, a two's complement correction must be forcibly executed: $ans = (ans + M) \pmod M$.


C++ Standard Source Code (NOIP Style)

Below is a set of advanced industrial source code that integrates "double hash substring precise determination" and "KMP periodicity annihilation".

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>

using namespace std;

const int MAXN = 1000005;

// Double Hash Parameter Design: Strictly select two different large primes and bases
const long long BASE1 = 131;
const long long MOD1 = 1000000007;
const long long BASE2 = 13331;
const long long MOD2 = 1000000009;

long long h1[MAXN], p1[MAXN];
long long h2[MAXN], p2[MAXN];
int nxt[MAXN];

// ================== CORE 1: Double Hash System Preprocessing and Local Extraction ==================
void init_hash(const string& s, int n) {
    p1[0] = 1; p2[0] = 1;
    h1[0] = 0; h2[0] = 0;

    for (int i = 1; i <= n; ++i) {
        p1[i] = (p1[i - 1] * BASE1) % MOD1;
        p2[i] = (p2[i - 1] * BASE2) % MOD2;

        // s[i-1] is due to C++ string 0-based indexing
        h1[i] = (h1[i - 1] * BASE1 + s[i - 1]) % MOD1;
        h2[i] = (h2[i - 1] * BASE2 + s[i - 1]) % MOD2;
    }
}

// Extract the double hash feature value of the 1-based index interval [l, r], returning as a pair
pair<long long, long long> get_sub_hash(int l, int r) {
    long long res1 = (h1[r] - h1[l - 1] * p1[r - l + 1]) % MOD1;
    if (res1 < 0) res1 += MOD1; // Fatal pitfall: C++ modulo negative numbers must be manually corrected

    long long res2 = (h2[r] - h2[l - 1] * p2[r - l + 1]) % MOD2;
    if (res2 < 0) res2 += MOD2;

    return make_pair(res1, res2);
}

// ================== CORE 2: KMP Periodicity Annihilation Operator ==================
int get_min_period(const string& s, int n) {
    nxt[0] = 0;
    int j = 0;
    for (int i = 1; i < n; ++i) {
        while (j > 0 && s[i] != s[j]) j = nxt[j - 1];
        if (s[i] == s[j]) j++;
        nxt[i] = j;
    }

    int max_len = nxt[n - 1]; // The longest common prefix and suffix length of the entire string
    int l = n - max_len;      // Potential minimum positive cycle length

    if (n % l == 0) {
        return l; // Fatal pitfall: Only when the divisibility condition holds is l a legitimate cycle
    }
    return n; // Otherwise, it indicates no complete periodic cycle, and itself is the only minimum period
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string s;
    if (cin >> s) {
        int n = s.length();

        // Test 1: KMP Minimum Period Annihilation
        int period = get_min_period(s, n);
        cout << "Min Period Length: " << period << " (Repeated " << n / period << " times)\n";

        // Test 2: Double Hash System Initialization
        init_hash(s, n);
        if (n >= 4) {
            // Extract substrings composed of the first two characters and the subsequent two characters for hash collision verification
            pair<long long, long long> sub1 = get_sub_hash(1, 2);
            pair<long long, long long> sub2 = get_sub_hash(3, 4);
            if (sub1 == sub2) {
                cout << "Substrings [1,2] and [3,4] are strictly equal.\n";
            } else {
                cout << "Substrings [1,2] and [3,4] are different.\n";
            }
        }
    }
    return 0;
}

NOIP Practical Pitfall Guide

1. Directly Returning Negative Numbers During Differential Modulo Causes WA

When executing (h1[r] - h1[l - 1] * p1[r - l + 1]) % MOD1, the subtraction operation under modulo may lead to the first half being smaller than the second half, resulting in a negative intermediate result. According to the C++ addressing and arithmetic standard, the modulo result of a negative number is still negative. If the strong correction if (res < 0) res += MOD is not performed, this negative hash value will cause irreversible misjudgment when colliding with a positive hash value elsewhere, leading to widespread WA failures in the entire code.

2. Ignoring "Divisibility" Constraints in KMP Periodicity Determination Leads to Logical Fallacies

Many contestants naively believe that $L = N - next[N-1]$ is the answer when using the $next$ array to find the minimum cycle length of a string. Warning: If the original string is abaab, with $N = 5$, the longest common prefix and suffix is ab (length 2), the calculated $L = 5 - 2 = 3$. However, abaab cannot evidently be formed by continuously repeating aba. It is essential to strictly execute the precondition check if (n % l == 0). If not divisible, the minimum cycle length can only be the total length of the original string $N$. Ignoring this step will directly expose one to specially constructed test cases in periodic counting problems.


Classic NOIP/Luogu Problems

1. Luogu P3370 [Template] String Hash

2. Luogu P4391 [BJOI2006] Radio Transmission

Original Source: local://8.4

[h] Back to Home