NeFut Logo NeFut
Admin Login

Efficient Automated Differential Testing: A New Architecture and Implementation Based on Monte Carlo Methods

Published at: 2026-05-27 09:17 Last updated: 2026-06-06 13:04
#C++ #Tutorial

Core Logic and Mathematical Principles

In the NOIP examination, passing the sample tests only indicates that the program has basic correctness of $O(1)$. When facing large datasets, potential corner cases, degradation of time complexity, and constant overflows can collectively cause failures. The core logic of Automated Differential Testing is to utilize a random data generator to produce a large number of test samples, while simultaneously running the code under test (My Code) and the standard correct code (Std Code). By frequently comparing the standard outputs of both, comprehensive error detection is achieved based on probability theory.

The mathematical reliability of differential testing is established on the foundation of Monte Carlo Testing and probability convergence. Let the error probability of the program under test across the entire dataset be $p$ (i.e., $p = \frac{|S_{\text{bad}}|}{|S_{\text{all}}|}$). When the differential tester randomly generates $K$ groups of independent and identically distributed (i.i.d.) test data, the probability $P_{\text{pass}}$ that the program does not produce an error in $K$ consecutive runs is given by:

$$P_{\text{pass}} = (1 - p)^K$$

According to the properties of limits, as $K \to \infty$:

$$\lim_{K \to \infty} (1 - p)^K = 0 \quad (\text{for } 0 < p \le 1)$$

This implies that as long as the number of differential tests $K$ reaches the million level (such as in unattended differential testing), the escape probability of even a boundary vulnerability with a triggering probability of $0.001\%$ will converge to $0$.

In a Linux evaluation environment, frequently calling system-level Shell commands (like system("diff ...")) incurs significant overhead from context switching between user mode and kernel mode, limiting the number of tests per second to just a few dozen. An efficient differential tester should directly utilize C++ native file streams or implement memory-level/high-speed file-level comparisons via Linux redirection within the process, thereby increasing the testing efficiency to several thousand tests per second, truly achieving "a million tests".


Automated Differential Testing Architecture and Process Derivation

The standard architecture of the differential tester consists of four core components:

  1. gen (Generator): A random data generator that outputs standard test input data.in based on specific mathematical distributions (e.g., uniform distribution, tree structures, or graph topologies).
  2. std (Standard/Brute Force): An absolutely correct brute-force alignment code (such as $O(2^N)$ search or $O(N^3)$ naive DP). Although it has high time complexity, its logic is simple and bug-free, responsible for outputting the authoritative answer data.ans.
  3. my (Target Code): The efficient algorithm code to be tested (such as $O(N \log N)$ segment trees), responsible for outputting data.out.
  4. check (Control Engine): The scheduling engine for differential testing, responsible for cyclically calling the first three components and performing physical comparisons.

Process Lifecycle and Return Value Verification

In a Linux environment, when a process ends, it returns a status code (Exit Code) to its parent process.

A robust C++ differential testing script must not only compare the contents of output files (using diff), but also monitor the return values of the program under test in real-time. If the return value is non-zero before the content comparison, it indicates that the program has encountered a hidden memory crash (RE) when faced with that group of data, and the differential tester must immediately intercept the situation.


C++ Standard Source Code

Here is a hardcore C++ architecture for an automated differential testing machine that can be compiled and run directly in a Linux environment (NOIP examination system) using g++ -O2. This script abandons inefficient Shell command concatenation in favor of lower-level, more efficient system calls and stream control.

#include <iostream>
#include <cstdlib>
#include <fstream>
#include <string>
#include <chrono>

using std::cin;
using std::cout;
using std::string;

// Check if two files' contents are absolutely identical
bool compare_files(const string& file1, const string& file2) {
    std::ifstream f1(file1, std::ios::binary);
    std::ifstream f2(file2, std::ios::binary);

    if (!f1.is_open() || !f2.is_open()) return false;

    char ch1, ch2;
    while (f1.get(ch1)) {
        if (!f2.get(ch2)) return false; // file2 ends early
        if (ch1 != ch2) return false;    // bytes do not match
    }
    if (f2.get(ch2)) return false; // file1 ends early but file2 still has content

    return true;
}

int main() {
    // Improve I/O efficiency of the differential testing console itself
    std::ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int total_tests = 1000000; // Set for one million unattended tests
    cout << "[System] Initializing Differential Testing Engine...\n";

    // Critical pitfall: When executing compile commands in Linux, ensure that the generated executable has executable permissions (chmod +x)
    // Here we assume that the three programs have been compiled in advance in the examination directory

    auto start_time = std::chrono::high_resolution_clock::now();

    for (int t = 1; t <= total_tests; ++t) {
        // 1. Run the data generator, directing random data output to data.in
        int gen_status = std::system("./gen > data.in");
        if (gen_status != 0) {
            cout << "\n[CRITICAL ERROR] Generator crashed at test #" << t << "\n";
            break;
        }

        // 2. Run the brute force/standard solution, reading data.in and outputting to data.ans
        int std_status = std::system("./std < data.in > data.ans");
        if (std_status != 0) {
            cout << "\n[CRITICAL ERROR] Standard code crashed at test #" << t << "\n";
            break;
        }

        // 3. Run the high-performance code under test, reading data.in and outputting to data.out
        // Use std::chrono to record the time constant for a single test
        auto t_start = std::chrono::high_resolution_clock::now();
        int my_status = std::system("./my < data.in > data.out");
        auto t_end = std::chrono::high_resolution_clock::now();

        // Critical pitfall: If my_status is non-zero, it indicates that the code under test has crashed (e.g., segmentation fault), and must be intercepted before diff
        if (my_status != 0) {
            cout << "\n[RUNTIME ERROR] Your code crashed (RE) at test #" << t << " with exit code " << my_status << "\n";
            cout << "[System] Please inspect 'data.in' immediately.\n";
            break;
        }

        // 4. Compare the physical file contents
        if (!compare_files("data.out", "data.ans")) {
            auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count();
            cout << "\n[WRONG ANSWER] Mismatch detected at test #" << t << " (" << duration << " ms)\n";
            cout << "[System] Scurry to check 'data.in', 'data.out', and 'data.ans'.\n";
            return 1; // Caught in the act, exit the differential tester
        }

        // Dynamically refresh the testing progress to avoid console buffering issues
        if (t % 100 == 0) {
            cout << "\r[Running] Passed " << t << " tests successfully...";
            cout.flush(); // Force flush the buffer to ensure progress visibility
        }
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto total_duration = std::chrono::duration_cast<std::chrono::seconds>(end_time - start_time).count();
    cout << "\n[Success] All " << total_tests << " tests passed! Total time: " << total_duration << "s.\n";

    return 0;
}

NOIP Practical Pitfall Guide

1. Unclosed File Descriptors and Disk IO Locking

2. Non-Degrading Random Seed Leading to Generator Outputting "Identical Data"


Classic NOIP/Luogu Problems

1. Luogu P1082 [NOIP2012 Improvement Group] Congruence Equation

When contestants write this problem, they are prone to errors when handling $x$ as a negative number, or when $a, b$ are large (close to $2 \times 10^9$), leading to overflow in intermediate multiplication results of long long. By constructing a differential tester, the brute force std only needs to use $O(b)$ time complexity to brute-force enumerate $x$ from $1$ to $b$ to verify whether the condition is satisfied, while the generator gen only needs to randomly generate coprime pairs of $a$ and $b$. Under a million high-frequency differential tests, any logical loopholes caused by negative modulus will be thoroughly exposed within microseconds.

2. Luogu P1004 [NOIP2000 Improvement Group] Grid Number Picking

Original Source: local://23.4

[h] Back to Home