NeFut Logo NeFut
Admin Login

Discretization Techniques: Core Algorithms and Applications for Efficient Sparse Data Handling

Published at: 2026-05-27 09:17 Last updated: 2026-06-06 13:04
#algorithm #Data Structure #Discretization

Core Logic and Mathematical Principles

Discretization is an order-preserving spatial compression technique. Its physical essence is to map an infinite or extremely large sparse value range to a finite and continuous compact value range while maintaining the original data's partial order relationship.

In most algorithmic competition problems, the space complexity or auxiliary data structures (such as Fenwick trees, segment trees, or buckets) directly depend on the size of the value range $U$. If $U \le 10^9$ but the actual number of effective points $N \le 10^5$, directly allocating space can lead to memory limit exceeded (MLE). It is observed that the core logic of the algorithm often focuses only on the relative sizes between elements (i.e., the partial order relationships of greater than, less than, and equal), rather than their absolute values.

By establishing a bijective mapping $f(x) \to k$, where $x$ is an element from the original sparse value range, and $k \in [1, M] \, (M \le N)$ is a continuous positive integer. This mapping must satisfy:

$$\forall x_i, x_j, \quad x_i < x_j \iff f(x_i) < f(x_j)$$

Using this technique, the space complexity can be forcibly reduced from $O(U)$ to $O(N)$, while the time complexity typically introduces a sorting constant of $O(N \log N)$.


Algorithm Derivation and Mapping Construction

The core of discretization lies in deduplication and binary search. The overall construction is divided into three strict topological steps:

1. Collection of Discretization Set

All absolute values that may affect the answer (including query boundaries, modification points, etc.) are pushed into a dynamic array alls. The size of the set is $N$.

2. Sorting and Deduplication (Unique)

Sort alls in ascending order to ensure data monotonicity. Then, use the two-pointer technique to eliminate duplicate elements, ensuring the uniqueness of the mapping. The sorting complexity is $O(N \log N)$, and the deduplication complexity is $O(N)$. After deduplication, the actual size of the set is $M \, (M \le N)$. At this point, the array index $i \in [0, M-1]$ is the compressed new coordinate.

3. Binary Search for Mapping

For any original large value $x$, perform a binary search (using std::lower_bound) in the constructed non-decreasing sequence alls to find the position of the first element that is greater than or equal to $x$. The mapping function is defined as:

$$f(x) = \text{lower\_bound}(\text{alls.begin}(), \text{alls.end}(), x) - \text{alls.begin}() + 1$$

The addition of 1 is to normalize the data into a 1-indexed coordinate system, which is fully compatible with subsequent boundary handling of data structures. The time complexity for a single query mapping is $O(\log N)$.


C++ Standard Source Code

#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

// Encapsulating the discretization component
struct Discretizer {
    vector<int> alls;

    // Add raw values to be discretized
    void add(int x) {
        alls.push_back(x);
    }

    // Execute absolute compression, establishing mapping track
    void init() {
        sort(alls.begin(), alls.end());
        // unique returns the tail iterator after deduplication, erase thoroughly cuts off redundant memory at the tail
        alls.erase(unique(alls.begin(), alls.end()), alls.end());
    }

    // Mapping query: maps large value ranges to [1, alls.size()]
    int query(int x) {
        // Key point: lower_bound relies on the monotonicity from init(), otherwise it may lead to infinite loops or return incorrect iterators
        auto it = lower_bound(alls.begin(), alls.end(), x);
        return (it - alls.begin()) + 1; // Convert to 1-indexed coordinates
    }

    // Get the upper limit of the compact value range after discretization
    int size() const {
        return alls.size();
    }
};

int main() {
    // Disable synchronization to cut off IO constants
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);

    int n, m;
    if (!(cin >> n >> m)) return 0;

    // Discretization not only stores the original array, but also includes the boundary coordinates of modification and query operations
    Discretizer d;
    vector<pair<int, int>> adds(n);
    for (int i = 0; i < n; ++i) {
        cin >> adds[i].first >> adds[i].second;
        d.add(adds[i].first);
    }

    vector<pair<int, int>> queries(m);
    for (int i = 0; i < m; ++i) {
        cin >> queries[i].first >> queries[i].second;
        d.add(queries[i].first);
        d.add(queries[i].second); // Critical pitfall: the right boundary of the interval must also be discretized, otherwise it cannot accurately locate during queries
    }

    // Activate the compressor
    d.init();

    // Establish a Fenwick tree or difference array for the compact value range
    vector<long long> slots(d.size() + 2, 0);

    // Simulated application: point addition
    for (const auto& op : adds) {
        int pos = d.query(op.first);
        slots[pos] += op.second;
    }

    // Prefix sum preprocessing
    vector<long long> sum(d.size() + 2, 0);
    for (int i = 1; i <= d.size(); ++i) {
        sum[i] = sum[i - 1] + slots[i];
    }

    // Simulated application: interval query
    for (const auto& q : queries) {
        int l = d.query(q.first);
        int r = d.query(q.second);
        cout << sum[r] - sum[l - 1] << "\n";
    }

    return 0;
}

NOIP Practical Pitfall Guide

  1. Omitting query/modification boundaries leading to binary search out-of-bounds or logical breakage
    The most fatal error in discretization is only discretizing the coordinates of the initial array while ignoring the subsequent online modification (Modify) point coordinates or the left boundary $L$ and right boundary $R$ of interval queries (Query). If $L$ and $R$ are not included in the alls array during the init() phase, then when executing query(R), lower_bound will return a position that does not belong to the original partial order logic, or even return alls.end(), causing the calculated coordinates to exceed the capacity of the auxiliary data structure, directly triggering a Runtime Error (RE) or a complete logical breakdown.
  2. Pseudo-duplication deadlock caused by unsorted data before deduplication (Unique)
    The underlying implementation mechanism of std::unique is a two-pointer scan, which only removes adjacent duplicate elements. If std::sort is not called before unique to ensure absolute monotonicity, identical elements in a disordered sequence will not be removed. This will leave many duplicates in the alls array, destroying the uniqueness of the bijection. In subsequent binary searches, the compact coordinates calculated by lower_bound will be severely misaligned, leading to extremely subtle logical errors.

Classic NOIP/Luogu Problems

1. Luogu P1496 Fire at Chibi

2. Luogu P1908 Inversion Pairs

Original Source: local://2.3

[h] Back to Home