Core Logic and Mathematical Principles
Discretization is an order-preserving spatial compression technique. Its physical essence is to map an infinite or extremely large sparse value range to a finite and continuous compact value range while maintaining the original data's partial order relationship.
In most algorithmic competition problems, the space complexity or auxiliary data structures (such as Fenwick trees, segment trees, or buckets) directly depend on the size of the value range $U$. If $U \le 10^9$ but the actual number of effective points $N \le 10^5$, directly allocating space can lead to memory limit exceeded (MLE). It is observed that the core logic of the algorithm often focuses only on the relative sizes between elements (i.e., the partial order relationships of greater than, less than, and equal), rather than their absolute values.
By establishing a bijective mapping $f(x) \to k$, where $x$ is an element from the original sparse value range, and $k \in [1, M] \, (M \le N)$ is a continuous positive integer. This mapping must satisfy:
$$\forall x_i, x_j, \quad x_i < x_j \iff f(x_i) < f(x_j)$$
Using this technique, the space complexity can be forcibly reduced from $O(U)$ to $O(N)$, while the time complexity typically introduces a sorting constant of $O(N \log N)$.
Algorithm Derivation and Mapping Construction
The core of discretization lies in deduplication and binary search. The overall construction is divided into three strict topological steps:
1. Collection of Discretization Set
All absolute values that may affect the answer (including query boundaries, modification points, etc.) are pushed into a dynamic array alls. The size of the set is $N$.
2. Sorting and Deduplication (Unique)
Sort alls in ascending order to ensure data monotonicity. Then, use the two-pointer technique to eliminate duplicate elements, ensuring the uniqueness of the mapping. The sorting complexity is $O(N \log N)$, and the deduplication complexity is $O(N)$. After deduplication, the actual size of the set is $M \, (M \le N)$. At this point, the array index $i \in [0, M-1]$ is the compressed new coordinate.
3. Binary Search for Mapping
For any original large value $x$, perform a binary search (using std::lower_bound) in the constructed non-decreasing sequence alls to find the position of the first element that is greater than or equal to $x$. The mapping function is defined as:
$$f(x) = \text{lower\_bound}(\text{alls.begin}(), \text{alls.end}(), x) - \text{alls.begin}() + 1$$
The addition of 1 is to normalize the data into a 1-indexed coordinate system, which is fully compatible with subsequent boundary handling of data structures. The time complexity for a single query mapping is $O(\log N)$.
C++ Standard Source Code
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
// Encapsulating the discretization component
struct Discretizer {
vector<int> alls;
// Add raw values to be discretized
void add(int x) {
alls.push_back(x);
}
// Execute absolute compression, establishing mapping track
void init() {
sort(alls.begin(), alls.end());
// unique returns the tail iterator after deduplication, erase thoroughly cuts off redundant memory at the tail
alls.erase(unique(alls.begin(), alls.end()), alls.end());
}
// Mapping query: maps large value ranges to [1, alls.size()]
int query(int x) {
// Key point: lower_bound relies on the monotonicity from init(), otherwise it may lead to infinite loops or return incorrect iterators
auto it = lower_bound(alls.begin(), alls.end(), x);
return (it - alls.begin()) + 1; // Convert to 1-indexed coordinates
}
// Get the upper limit of the compact value range after discretization
int size() const {
return alls.size();
}
};
int main() {
// Disable synchronization to cut off IO constants
ios_base::sync_with_stdio(false);
cin.tie(NULL);
int n, m;
if (!(cin >> n >> m)) return 0;
// Discretization not only stores the original array, but also includes the boundary coordinates of modification and query operations
Discretizer d;
vector<pair<int, int>> adds(n);
for (int i = 0; i < n; ++i) {
cin >> adds[i].first >> adds[i].second;
d.add(adds[i].first);
}
vector<pair<int, int>> queries(m);
for (int i = 0; i < m; ++i) {
cin >> queries[i].first >> queries[i].second;
d.add(queries[i].first);
d.add(queries[i].second); // Critical pitfall: the right boundary of the interval must also be discretized, otherwise it cannot accurately locate during queries
}
// Activate the compressor
d.init();
// Establish a Fenwick tree or difference array for the compact value range
vector<long long> slots(d.size() + 2, 0);
// Simulated application: point addition
for (const auto& op : adds) {
int pos = d.query(op.first);
slots[pos] += op.second;
}
// Prefix sum preprocessing
vector<long long> sum(d.size() + 2, 0);
for (int i = 1; i <= d.size(); ++i) {
sum[i] = sum[i - 1] + slots[i];
}
// Simulated application: interval query
for (const auto& q : queries) {
int l = d.query(q.first);
int r = d.query(q.second);
cout << sum[r] - sum[l - 1] << "\n";
}
return 0;
}
NOIP Practical Pitfall Guide
- Omitting query/modification boundaries leading to binary search out-of-bounds or logical breakage
The most fatal error in discretization is only discretizing the coordinates of the initial array while ignoring the subsequent online modification (Modify) point coordinates or the left boundary $L$ and right boundary $R$ of interval queries (Query). If $L$ and $R$ are not included in theallsarray during theinit()phase, then when executingquery(R),lower_boundwill return a position that does not belong to the original partial order logic, or even returnalls.end(), causing the calculated coordinates to exceed the capacity of the auxiliary data structure, directly triggering a Runtime Error (RE) or a complete logical breakdown. - Pseudo-duplication deadlock caused by unsorted data before deduplication (Unique)
The underlying implementation mechanism ofstd::uniqueis a two-pointer scan, which only removes adjacent duplicate elements. Ifstd::sortis not called beforeuniqueto ensure absolute monotonicity, identical elements in a disordered sequence will not be removed. This will leave many duplicates in theallsarray, destroying the uniqueness of the bijection. In subsequent binary searches, the compact coordinates calculated bylower_boundwill be severely misaligned, leading to extremely subtle logical errors.
Classic NOIP/Luogu Problems
1. Luogu P1496 Fire at Chibi
- Problem Description: Given $N$ closed intervals $[A_i, B_i]$, find the total length of the union of these intervals. The range of interval endpoints is $[-2^{31} \le A_i, B_i \le 2^{31}-1]$, $N \le 2 \times 10^4$.
- Core Problem Essence: Counting the total length of coverage of large value range one-dimensional intervals.
- Core Solution Idea:
The absolute coordinate value range reaches $4 \times 10^9$, making it impossible to directly use a boolean array for marking. However, since $N$ is very small, there are at most $2N = 4 \times 10^4$ interval endpoints. Collect all $A_i$ and $B_i$ for discretization to establish a compact coordinate axis. Use a difference array to mark each covered grid on the compact axis. Finally, scan the compact axis from left to right; if the current grid is covered (the prefix sum of the difference is greater than 0), add the difference of the original absolute coordinates (i.e.,alls[i] - alls[i-1]) to the answer. This successfully transforms the large value range interval problem into an $O(N \log N)$ sweep line benchmark model.
2. Luogu P1908 Inversion Pairs
- Problem Description: Given a sequence $A$ of length $N$, find the total number of pairs $(i, j)$ such that $i < j$ and $A[i] > A[j]$. $A[i] \le 10^9, N \le 5 \times 10^5$.
- Core Problem Essence: Dynamic value range prefix sum counting.
- Core Solution Idea:
The standard solution is to use a Fenwick tree to maintain the frequency of each value, traversing elements from left to right, querying the number of elements greater than the current one while inserting. However, since $A[i]$ can reach $10^9$, the Fenwick tree cannot be allocated to such a large size.
Since the inversion pairs only depend on the relative partial order relationships between elements, not their absolute values, make a copy of the entire sequence $A$ for discretization preprocessing, replacing each $A[i]$ with a positive integer between $[1, N]$. Then, directly establish a Fenwick tree of size $N$, traverse the discretized new sequence, usequery(N) - query(new_A[i])to obtain the current contribution of inversions and accumulate it, then executeupdate(new_A[i], 1). The time complexity is $O(N \log N)$, and the space complexity is $O(N)$.