Discretization Techniques: Core Algorithms and Applications for Efficient Sparse Data Handling

Core Logic and Mathematical Principles

Discretization is an order-preserving coordinate compression technique. It maps a very large or unbounded sparse value range onto a small contiguous range of integers while preserving the relative order of the original values.

In many competitive programming problems, the memory used by auxiliary data structures (such as Fenwick trees, segment trees, or buckets) depends directly on the size of the value range $U$. If $U$ can reach $10^9$ while the number of values that actually occur is only $N \le 10^5$, allocating an array over the whole range exceeds the memory limit (MLE). The key observation is that such algorithms often depend only on the relative order of elements (greater than, less than, or equal), not on their absolute values.
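
As a rough worked estimate of the gap, assuming 4-byte integers (an assumption, since the element type depends on the problem), an array indexed by the raw value range compares with one indexed by the compressed range as follows:

$$10^9 \times 4\ \text{B} = 4 \times 10^9\ \text{B} \approx 3.7\ \text{GiB}, \qquad 10^5 \times 4\ \text{B} = 4 \times 10^5\ \text{B} \approx 391\ \text{KiB}$$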

Discretization therefore builds a bijective mapping $f: x \mapsto k$ between the distinct collected values and the integers $[1, M]$, where $x$ is a value from the original sparse range, $k \in [1, M]$ is its compressed coordinate, and $M \le N$ is the number of distinct values. The mapping must satisfy:

$$\forall x_i, x_j, \quad x_i < x_j \iff f(x_i) < f(x_j)$$

With this technique, the space complexity drops from $O(U)$ to $O(N)$, at the price of an additional $O(N \log N)$ sorting step.

Algorithm Derivation and Mapping Construction

The construction rests on sorting, deduplication, and binary search. It consists of three steps that must be carried out in this order:

1. Collection of Discretization Set

Push every absolute value that can affect the answer (array values, modification points, query boundaries, and so on) into a dynamic array alls. Let $N$ be the number of collected values.

2. Sorting and Deduplication (Unique)

Sort alls in ascending order, then remove duplicates with a two-pointer scan (std::unique in C++) so that each value maps to exactly one coordinate. Sorting costs $O(N \log N)$ and deduplication costs $O(N)$. After deduplication the array holds $M \le N$ values, and the 0-based index $i \in [0, M-1]$ of each value is its compressed coordinate (step 3 shifts it to 1-based).

3. Binary Search for Mapping

For any original value $x$, run a binary search (std::lower_bound) on the sorted, duplicate-free array alls to find the position of the first element that is greater than or equal to $x$. If $x$ was collected in step 1, this is exactly the position of $x$. The mapping function is defined as:

$$f(x) = \text{lower\_bound}(\text{alls.begin}(), \text{alls.end}(), x) - \text{alls.begin}() + 1$$

The $+1$ shifts the result to 1-indexed coordinates, which suits structures such as Fenwick trees and prefix sums that reserve index 0 as a sentinel. A single lookup costs $O(\log M)$, which is at most $O(\log N)$.
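
Steps 2 and 3 can be sketched by hand as follows. This is a minimal illustration rather than a definitive implementation: the helper names buildCoords and rankOf are hypothetical, and the two-pointer loop only spells out what std::unique followed by erase would do.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper for step 2: sorts, then removes duplicates with two pointers.
// `write` marks the end of the deduplicated prefix; `read` scans every element.
std::vector<int> buildCoords(std::vector<int> alls) {
    std::sort(alls.begin(), alls.end());
    std::size_t write = 0;
    for (std::size_t read = 0; read < alls.size(); ++read) {
        // Keep an element only if it differs from the last one kept.
        if (write == 0 || alls[read] != alls[write - 1]) {
            alls[write++] = alls[read];
        }
    }
    alls.resize(write); // drop the leftover tail
    return alls;
}

// Hypothetical helper for step 3: 1-indexed rank of x.
// Assumes x was collected in step 1, so lower_bound lands exactly on x.
int rankOf(const std::vector<int>& alls, int x) {
    auto it = std::lower_bound(alls.begin(), alls.end(), x);
    return static_cast<int>(it - alls.begin()) + 1;
}
```

For the raw values 1000000000, 5, 70, 5, buildCoords yields the array {5, 70, 1000000000}, so $M = 3$ and the ranks are $f(5) = 1$, $f(70) = 2$, $f(10^9) = 3$.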

C++ Standard Source Code

#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

// Encapsulating the discretization component
struct Discretizer {
    vector<int> alls;

    // Add raw values to be discretized
    void add(int x) {
        alls.push_back(x);
    }

    // Sort and deduplicate; call once, after all add() calls and before any query()
    void init() {
        sort(alls.begin(), alls.end());
        // unique moves the distinct elements to the front and returns the new logical end; erase drops the leftover tail
        alls.erase(unique(alls.begin(), alls.end()), alls.end());
    }

    // Mapping query: maps an original value to its rank in [1, alls.size()]
    int query(int x) const {
        // lower_bound requires the sorted range produced by init(); on an unsorted range its result is meaningless
        auto it = lower_bound(alls.begin(), alls.end(), x);
        return static_cast<int>(it - alls.begin()) + 1; // Convert to 1-indexed coordinates
    }

    // Number of distinct values, i.e. the upper bound of the compact value range
    int size() const {
        return static_cast<int>(alls.size());
    }
};

int main() {
    // Decouple C++ streams from C stdio and untie cin from cout to speed up I/O
    ios_base::sync_with_stdio(false);
    cin.tie(nullptr);

    int n, m;
    if (!(cin >> n >> m)) return 0;

    // Collect not only the update coordinates but also both boundaries of every query
    Discretizer d;
    vector<pair<int, int>> adds(n);
    for (int i = 0; i < n; ++i) {
        cin >> adds[i].first >> adds[i].second;
        d.add(adds[i].first);
    }

    vector<pair<int, int>> queries(m);
    for (int i = 0; i < m; ++i) {
        cin >> queries[i].first >> queries[i].second;
        d.add(queries[i].first);
        d.add(queries[i].second); // Pitfall: the right boundary must be collected too, or query() cannot map it to its own rank
    }

    // Build the mapping (sort + deduplicate)
    d.init();

    // Value array over the compact range; a Fenwick tree or difference array would be sized the same way
    vector<long long> slots(d.size() + 2, 0);

    // Simulated application: point addition
    for (const auto& op : adds) {
        int pos = d.query(op.first);
        slots[pos] += op.second;
    }

    // Prefix sum preprocessing
    vector<long long> sum(d.size() + 2, 0);
    for (int i = 1; i <= d.size(); ++i) {
        sum[i] = sum[i - 1] + slots[i];
    }

    // Simulated application: interval query
    for (const auto& q : queries) {
        int l = d.query(q.first);
        int r = d.query(q.second);
        cout << sum[r] - sum[l - 1] << "\n";
    }

    return 0;
}

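NOIP Practical Pitfall Guide

Missing query or modification boundaries, leading to wrong positions or out-of-range indices
The most damaging discretization mistake is to discretize only the coordinates of the initial array and to forget the coordinates of later modification points or the left and right boundaries $L$ and $R$ of interval queries. If $L$ and $R$ were not pushed into alls before init(), then query(R) makes lower_bound return the position of the next larger collected value, which is not the rank of $R$, or even alls.end(). In the latter case the computed coordinate is $M + 1$, which can index past the end of the auxiliary data structure and cause a Runtime Error (RE) or silently wrong answers.
Calling unique before sorting, so that duplicates survive
std::unique is implemented as a two-pointer scan and removes only adjacent duplicates. If std::sort has not been called first, equal elements that are not adjacent are not removed. The alls array then still contains duplicates and is not sorted, so the mapping is no longer one-to-one. The following binary search is also invalid, because lower_bound requires a sorted range, and the compact coordinates it returns are shifted or meaningless. The resulting logic errors are very hard to spot.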
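
Both pitfalls can be reproduced in a few lines. The sketch below is illustrative only; the helper names uniqueWithoutSort, sortThenUnique, and lowerBoundRank are hypothetical and do not come from the listing above.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: the buggy variant that deduplicates WITHOUT sorting first.
std::vector<int> uniqueWithoutSort(std::vector<int> v) {
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Hypothetical helper: the correct sort-then-unique variant.
std::vector<int> sortThenUnique(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Hypothetical helper: 1-indexed position reported by lower_bound.
// It is the true rank of x only when x is actually stored in alls.
int lowerBoundRank(const std::vector<int>& alls, int x) {
    return static_cast<int>(std::lower_bound(alls.begin(), alls.end(), x) - alls.begin()) + 1;
}
```

With the input {5, 70, 5, 70}, uniqueWithoutSort keeps all four elements because no two equal values are adjacent, while sortThenUnique returns {5, 70}. With alls = {10, 20, 40}, a boundary 30 that was never collected is mapped to position 3, the rank of 40, and a boundary 50 is mapped to 4, which lies outside $[1, M]$.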

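Classic NOIP / Luogu Problems

1. Luogu P1496 火烧赤壁 (Burning of Red Cliffs)

Problem statement: Given $N$ closed intervals $[A_i, B_i]$, find the total length of their union. The endpoints satisfy $-2^{31} \le A_i, B_i \le 2^{31}-1$, and $N \le 2 \times 10^4$.
Essence: total covered length of one-dimensional intervals over a large value range.
Key idea:
The absolute coordinates span about $4 \times 10^9$ values, so a boolean array indexed by coordinate cannot be allocated. But $N$ is small, and there are at most $2N = 4 \times 10^4$ interval endpoints. Collect all $A_i$ and $B_i$ and discretize them to build a compact axis. On the compact axis, use a difference array to mark the covered cells, where each cell is the gap between two adjacent compressed coordinates. Finally, scan the compact axis from left to right: if the current cell is covered (the prefix sum of the difference array is greater than 0), add the difference of the original coordinates, alls[i] - alls[i-1], to the answer. This turns the large-range interval union problem into a basic $O(N \log N)$ sweep-line model.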
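
The idea above can be sketched as follows. This is one possible reading, not a reference solution: the function name unionLength is hypothetical, cells are 0-indexed so that cell i stands for the stretch from alls[i] to alls[i + 1], and endpoints are stored as long long on the assumption that coordinate differences may not fit in 32 bits.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: total length of the union of intervals [a, b] with a <= b.
long long unionLength(const std::vector<std::pair<long long, long long>>& segs) {
    // Steps 1 and 2: collect all endpoints, sort, deduplicate.
    std::vector<long long> alls;
    for (const auto& s : segs) {
        alls.push_back(s.first);
        alls.push_back(s.second);
    }
    std::sort(alls.begin(), alls.end());
    alls.erase(std::unique(alls.begin(), alls.end()), alls.end());

    // 0-indexed position of x; every x passed here was collected above.
    auto pos = [&](long long x) {
        return std::lower_bound(alls.begin(), alls.end(), x) - alls.begin();
    };

    // Difference array over cells; cell i stands for [alls[i], alls[i + 1]).
    std::vector<int> diff(alls.size() + 1, 0);
    for (const auto& s : segs) {
        ++diff[pos(s.first)];   // coverage starts at the cell beginning at a
        --diff[pos(s.second)];  // and stops before the cell beginning at b
    }

    long long total = 0;
    int cover = 0; // running prefix sum = number of intervals covering cell i
    for (std::size_t i = 0; i + 1 < alls.size(); ++i) {
        cover += diff[i];
        if (cover > 0) total += alls[i + 1] - alls[i];
    }
    return total;
}
```

For the three intervals $[-1, 1]$, $[5, 11]$, $[2, 9]$, the compressed axis is {-1, 1, 2, 5, 9, 11} and the covered cells add up to $2 + 3 + 4 + 2 = 11$.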

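2. Luogu P1908 逆序对 (Inversion Pairs)

Problem statement: Given a sequence $A$ of length $N$, count the pairs $(i, j)$ that satisfy $i < j$ and $A[i] > A[j]$. $A[i] \le 10^9$, $N \le 5 \times 10^5$.
Essence: dynamic prefix counting over the value range.
Key idea:
The standard solution uses a Fenwick tree that stores how often each value has occurred. Traverse the elements from left to right and, for each one, first query how many earlier elements are greater than it, then insert it. But since $A[i]$ can reach $10^9$, a Fenwick tree indexed by raw value cannot be allocated.
Inversions depend only on the relative order of the elements, not on their absolute values. Copy the whole sequence $A$, discretize the copy, and replace each $A[i]$ by its rank, a positive integer in $[1, N]$ (equal values receive the same rank). Then build a Fenwick tree of size $N$ and traverse the discretized sequence new_A: the Fenwick prefix-sum difference query(N) - query(new_A[i]) gives the inversion contribution of the current element, which is added to the answer before executing the point update update(new_A[i], 1). The time complexity is $O(N \log N)$ and the space complexity is $O(N)$.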
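
A compact sketch of this approach follows. The function name countInversions and the lambdas update and query are hypothetical names for the Fenwick tree operations mentioned above; the tree here is sized by the number of distinct values $M \le N$ rather than $N$, which is an implementation choice of this sketch.

```cpp
#include <algorithm>
#include <vector>

// Sketch: counts pairs (i, j) with i < j and a[i] > a[j].
long long countInversions(const std::vector<int>& a) {
    // Discretize: sorted, duplicate-free copy of the values.
    std::vector<int> alls(a);
    std::sort(alls.begin(), alls.end());
    alls.erase(std::unique(alls.begin(), alls.end()), alls.end());
    const int m = static_cast<int>(alls.size());

    // Fenwick tree over ranks 1..m holding occurrence counts.
    std::vector<int> tree(m + 1, 0);
    auto update = [&](int k) { // record one occurrence of rank k
        for (; k <= m; k += k & -k) ++tree[k];
    };
    auto query = [&](int k) { // number of recorded values with rank <= k
        int s = 0;
        for (; k > 0; k -= k & -k) s += tree[k];
        return s;
    };

    long long inversions = 0;
    for (int x : a) {
        int r = static_cast<int>(std::lower_bound(alls.begin(), alls.end(), x) - alls.begin()) + 1;
        inversions += query(m) - query(r); // earlier elements strictly greater than x
        update(r);
    }
    return inversions;
}
```

For the sequence 5, 4, 2, 6, 3, 1 the per-element contributions are 0, 1, 2, 0, 3, 5, giving 11 inversions in total. Equal values share a rank, so they are never counted as inversions of each other.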
