Core Logic and Mathematical Principles
In string tests at the NOIP level, when faced with extremely large data scales or multi-dimensional substring detections, relying solely on KMP often restricts itself due to the topological singularity of single-pointer matching. This section will break down two competitive-level breakthroughs: Collision-resistant design of multi-base large prime Hash and Periodicity theorem of KMP prefix function.
1. Collision-resistant Design of String Hash with Double Modulus
The essence of string Hash is to map a string of arbitrary length to a fixed-size integer. Let the string be $S = s_1s_2\dots s_n$, and choose a fixed base $Base$ (usually a large prime, such as $131$ or $13331$) and a modulus $M$. The single hash value recursive formula is:
$$H[i] = (H[i-1] \times Base + s_i) \pmod M$$
Thus, the hash value of any substring $S[l \dots r]$ can be extracted in $O(1)$ time using the prefix sum difference formula:
$$Hash(S[l \dots r]) = (H[r] - H[l-1] \times Base^{r-l+1}) \pmod M$$
Birthday Attack and Hash Collision Theorem: According to the birthday paradox, among $O(\sqrt{M})$ different strings, the probability of a hash collision (i.e., two different strings having the same hash value) will rapidly approach $1$. If a single unsigned long long is used and causes natural overflow (equivalent to $M = 2^{64}$), or if a single prime $10^9+7$ is used, since the competition test cases often contain deliberately constructed "natural overflow" malicious data (such as overflow-resistant hacker slider mechanisms), it is easy to cause data collisions leading to WA.
Double Hash Design: Choose two pairs of completely independent bases and large primes $(Base_1, M_1)$ and $(Base_2, M_2)$, uniquely marking each string with a tuple $(Hash_1, Hash_2)$. Effective collision-resistant moduli are typically selected from the following industrial-grade large primes:
- $M_1 = 1000000007 \quad (10^9+7)$
- $M_2 = 1000000009 \quad (10^9+9)$
- $M_3 = 998244353$
2. Periodicity Theorem of KMP Prefix Function (Minimum Cycle Length Theorem)
The prefix function $next$ (or prefix function $\pi$) can not only be used for matching but also encapsulates strong local topological periodicity characteristics.
Periodicity Theorem: Let the length of a string be $N$, and the value at the end position of its prefix function be $next[N-1]$ (0-based corresponding to the longest equal prefix and suffix length of the entire string). Let $L = N - next[N-1]$.
- If $N \pmod L == 0$, then $L$ strictly represents the minimum positive period (minimum cycle length) of the string, which is composed of the substring of length $L$ repeated $\frac{N}{L}$ times.
- If $N \pmod L \neq 0$, then the string does not possess a perfect global periodicity, but $L$ still represents its minimum potential period after excluding extra incomplete suffixes.
Proof: It is known that the prefix and suffix of length $next[N-1]$ are completely identical. This means that after shifting the original string to the right by $L = N - next[N-1]$ positions, the overlapping area of the shifted part aligns perfectly. If $N$ is divisible by $L$, it indicates that the shifted slider seamlessly stitches the entire sequence, and the theorem holds.
State Design and Algorithm Derivation
1. Double Hash Prefix Sum and Differential State
H1[i],H2[i]: Record the hash prefix sums of the first $i$ characters of the string in the double modulus system, respectively.P1[i],P2[i]: Preprocessed power tables of $Base_1^i \pmod {M_1}$ and $Base_2^i \pmod {M_2}$, used as multiplicative operators during differential alignment.
2. Derivation of Any Substring Hash Extraction
Extract the double hash feature values of $S[l \dots r]$ (with indices starting from 1):
$$ans_1 = (H1[r] - H1[l - 1] \times P1[r-l+1]) \pmod {M_1}$$
$$ans_2 = (H2[r] - H2[l - 1] \times P2[r-l+1]) \pmod {M_2}$$
If the result is negative, a two's complement correction must be forcibly executed: $ans = (ans + M) \pmod M$.
C++ Standard Source Code (NOIP Style)
Below is a set of advanced industrial source code that integrates "double hash substring precise determination" and "KMP periodicity annihilation".
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;
const int MAXN = 1000005;
// Double Hash Parameter Design: Strictly select two different large primes and bases
const long long BASE1 = 131;
const long long MOD1 = 1000000007;
const long long BASE2 = 13331;
const long long MOD2 = 1000000009;
long long h1[MAXN], p1[MAXN];
long long h2[MAXN], p2[MAXN];
int nxt[MAXN];
// ================== CORE 1: Double Hash System Preprocessing and Local Extraction ==================
void init_hash(const string& s, int n) {
p1[0] = 1; p2[0] = 1;
h1[0] = 0; h2[0] = 0;
for (int i = 1; i <= n; ++i) {
p1[i] = (p1[i - 1] * BASE1) % MOD1;
p2[i] = (p2[i - 1] * BASE2) % MOD2;
// s[i-1] is due to C++ string 0-based indexing
h1[i] = (h1[i - 1] * BASE1 + s[i - 1]) % MOD1;
h2[i] = (h2[i - 1] * BASE2 + s[i - 1]) % MOD2;
}
}
// Extract the double hash feature value of the 1-based index interval [l, r], returning as a pair
pair<long long, long long> get_sub_hash(int l, int r) {
long long res1 = (h1[r] - h1[l - 1] * p1[r - l + 1]) % MOD1;
if (res1 < 0) res1 += MOD1; // Fatal pitfall: C++ modulo negative numbers must be manually corrected
long long res2 = (h2[r] - h2[l - 1] * p2[r - l + 1]) % MOD2;
if (res2 < 0) res2 += MOD2;
return make_pair(res1, res2);
}
// ================== CORE 2: KMP Periodicity Annihilation Operator ==================
int get_min_period(const string& s, int n) {
nxt[0] = 0;
int j = 0;
for (int i = 1; i < n; ++i) {
while (j > 0 && s[i] != s[j]) j = nxt[j - 1];
if (s[i] == s[j]) j++;
nxt[i] = j;
}
int max_len = nxt[n - 1]; // The longest common prefix and suffix length of the entire string
int l = n - max_len; // Potential minimum positive cycle length
if (n % l == 0) {
return l; // Fatal pitfall: Only when the divisibility condition holds is l a legitimate cycle
}
return n; // Otherwise, it indicates no complete periodic cycle, and itself is the only minimum period
}
int main() {
ios::sync_with_stdio(false);
cin.tie(nullptr);
string s;
if (cin >> s) {
int n = s.length();
// Test 1: KMP Minimum Period Annihilation
int period = get_min_period(s, n);
cout << "Min Period Length: " << period << " (Repeated " << n / period << " times)\n";
// Test 2: Double Hash System Initialization
init_hash(s, n);
if (n >= 4) {
// Extract substrings composed of the first two characters and the subsequent two characters for hash collision verification
pair<long long, long long> sub1 = get_sub_hash(1, 2);
pair<long long, long long> sub2 = get_sub_hash(3, 4);
if (sub1 == sub2) {
cout << "Substrings [1,2] and [3,4] are strictly equal.\n";
} else {
cout << "Substrings [1,2] and [3,4] are different.\n";
}
}
}
return 0;
}
NOIP Practical Pitfall Guide
1. Directly Returning Negative Numbers During Differential Modulo Causes WA
When executing (h1[r] - h1[l - 1] * p1[r - l + 1]) % MOD1, the subtraction operation under modulo may lead to the first half being smaller than the second half, resulting in a negative intermediate result. According to the C++ addressing and arithmetic standard, the modulo result of a negative number is still negative. If the strong correction if (res < 0) res += MOD is not performed, this negative hash value will cause irreversible misjudgment when colliding with a positive hash value elsewhere, leading to widespread WA failures in the entire code.
2. Ignoring "Divisibility" Constraints in KMP Periodicity Determination Leads to Logical Fallacies
Many contestants naively believe that $L = N - next[N-1]$ is the answer when using the $next$ array to find the minimum cycle length of a string. Warning: If the original string is abaab, with $N = 5$, the longest common prefix and suffix is ab (length 2), the calculated $L = 5 - 2 = 3$. However, abaab cannot evidently be formed by continuously repeating aba. It is essential to strictly execute the precondition check if (n % l == 0). If not divisible, the minimum cycle length can only be the total length of the original string $N$. Ignoring this step will directly expose one to specially constructed test cases in periodic counting problems.
Classic NOIP/Luogu Problems
1. Luogu P3370 [Template] String Hash
- Problem Description: Given $N$ strings, each with a length not exceeding $M$, find how many unique strings there are among them.
- Essence of the Problem: Deduplication of a large range of strings.
- Core Solution Idea: If
std::set<string>is used directly, the time complexity involves massive string copies and internal red-black tree shifts, which can easily lead to timeouts. The hardcore approach is to use the double hash algorithm from this section to calculate the correspondingpair<long long, long long>feature code for each of the $N$ strings. Insert all feature codes into astd::vector, and finally perform a standardsortand use theuniqueoperator for deduplication. Since the time complexity for comparing numbers is $O(1)$, the overall efficiency is firmly constrained within the extremely high $O(N \log N + NM)$, easily handling any malicious collision test cases online.
2. Luogu P4391 [BJOI2006] Radio Transmission
- Problem Description: Given a string of length $N$, it is generated by continuously self-replicating and truncating a shorter string $S$. Find the minimum possible length of the source string $S$.
- Essence of the Problem: Periodic boundary segmentation using the prefix function.
- Core Solution Idea: This problem is a super variant of the KMP periodicity theorem. The problem requires the "potential" minimum period, meaning that the ending can be incomplete. From the derivation process of the periodicity theorem, it can be seen that regardless of whether the end can be divisible, shifting the original string to the right by $L = N - next[N-1]$ allows the overlapping part's common misaligned structure to align perfectly. This $L$ precisely represents the minimum sliding block length that can tile and derive the current mother string. Therefore, no
n % l == 0check is needed; simply outputting $N - next[N-1]$ can annihilate this problem in $O(N)$ complexity.