harness — Experiment Harness

Experiment harness for the MoS verification protocol.

Runs instrumented experiments that produce the data needed for dissertation figures and analysis, with optional process-level parallelism via ProcessPoolExecutor.

Covers seven experimental directions from Caro et al. [ITCS2024]:

  1. Scaling (\(n = 4 \to 16{+}\)): Sweep \(n\) with Goldreich–Levin extraction, recording copy complexity, post-selection rate, completeness probability, and wall-clock time.

  2. Bent function worst-case: Bent functions have maximally flat Fourier spectra (\(|\hat{\tilde\phi}(s)| = 2^{-n/2}\) for all \(s\)), the hardest case for heavy coefficient extraction (Corollary 5).

  3. Verifier truncation tradeoffs: Vary the verifier’s classical sample budget \(m_V\) and accuracy parameter \(\varepsilon\) to map the completeness/soundness tradeoff surface (Theorems 8, 12). Uses a fixed \(n\) (via --n) because the sweep is already 2-D; adding an \(n\) axis would produce a prohibitively large 3-D trial grid.

  4. Noise sweep (\(n \times \eta\)): Random \(\varphi\) functions drawn from the noisy parity ensemble at varying label-flip rate \(\eta\), testing the effective coefficient regime \(\hat{\tilde\phi}_{\mathrm{eff}}(s) = (1-2\eta)\,\hat{\tilde\phi}(s)\) (Definition 5(iii), §6.2). Sweeps both \(n\) and \(\eta\), producing a 2-D grid.

  5. Soundness verification (\(n \times \text{strategy}\)): Inject dishonest provers with adversarial strategies and measure empirical rejection rates against the information-theoretic soundness guarantee (Definition 7). Sweeps \(n\) against four fixed adversarial strategies.

  6. Average-case performance (\(n \times \text{family}\)): Test the protocol on diverse function families beyond single parities: \(k\)-Fourier-sparse (Dirichlet coefficients), random Boolean (uniform truth tables), and sparse-plus-noise (dominant parity with secondary coefficients). Probes the regime of Corollary 7 (2-agnostic Fourier-sparse learning).

  7. Gate-level noise (\(n \times p\)): Apply depolarising noise channels to H, X, and CX gates in the QFS circuit, sweeping the error rate \(p\). Goes beyond the label-flip noise model of Definition 5(iii); results are inherently empirical with no theoretical prediction.

All experiments write results to Protocol Buffer binary files with per-experiment schemas (see experiments/proto/).

Usage

Run individual experiments:

python -m experiments.harness scaling --n-min 4 --n-max 12 --workers 8

Run all experiments:

python -m experiments.harness all --n-min 4 --n-max 12 --workers 4

--n-min / --n-max control the \(n\) range for all experiments. --n overrides the fixed dimension used by the truncation experiment (which sweeps other axes instead of \(n\)); it defaults to --n-min when omitted.

Programmatic use:

from experiments.harness import run_scaling_experiment
results = run_scaling_experiment(
    n_range=range(4, 13), num_trials=20, max_workers=8
)
results.save("scaling_results.pb")
[ITCS2024]

M. C. Caro, M. Hinsche, M. Ioannou, A. Nietner, and R. Sweke, “Classical Verification of Quantum Learning,” ITCS 2024, doi:10.4230/LIPIcs.ITCS.2024.24.

class experiments.harness.ExperimentResult(experiment_name: str, timestamp: str, wall_clock_s: float = 0.0, max_workers: int = 1, trials: list[TrialResult] = <factory>, parameters: dict = <factory>)[source]

Bases: object

Aggregated results from an experiment sweep.

Collects TrialResult instances from all trials in an experiment, together with the sweep parameters and timing metadata. Supports serialisation to Protocol Buffer files and tabular summary output.

experiment_name: str

Identifier for this experiment (e.g. "scaling").

timestamp: str

ISO 8601 timestamp of when the experiment was run.

wall_clock_s: float = 0.0

Total wall-clock time for the experiment (seconds).

max_workers: int = 1

Number of parallel worker processes used.

trials: list[TrialResult]

Individual trial results.

parameters: dict

Experiment-level configuration (sweep ranges, shot counts, etc.).

save(path: str)[source]

Serialise the experiment results to a Protocol Buffer file.

Parameters:

path (str) – Output file path (conventionally *.pb). Parent directories are created automatically.

_to_proto()[source]

Convert to the experiment-specific protobuf message.

summary_table() → str[source]

Produce a human-readable summary table grouped by \(n\).

Columns: number of trials, acceptance rate, correctness rate, median \(|L|\), median total copies, median prover time, and median verifier time.

Returns:

Formatted ASCII table.

Return type:

str

class experiments.harness.TrialResult(n: int, seed: int, prover_time_s: float, qfs_shots: int, qfs_postselected: int, postselection_rate: float, list_size: int, prover_found_target: bool, verifier_time_s: float, verifier_samples: int, outcome: str, accepted: bool, accumulated_weight: float, acceptance_threshold: float, hypothesis_s: int | None, hypothesis_correct: bool, total_copies: int, total_time_s: float, epsilon: float, theta: float, delta: float, a_sq: float, b_sq: float, phi_description: str, k: int | None = None, hypothesis_coefficients: dict[int, float] | None = None, misclassification_rate: float | None = None)[source]

Bases: object

Result of a single prover–verifier trial.

Captures every observable quantity from one execution of the interactive verification protocol (§6 of [ITCS2024]): prover-side QFS, verifier-side classical estimation, the accept/reject outcome, and the output hypothesis.

n: int

Number of input bits (dimension of \(\mathcal{X}_n = \{0,1\}^n\)).

seed: int

Random seed governing this trial’s RNG chain.

prover_time_s: float

Wall-clock time for the prover’s computation (seconds).

qfs_shots: int

Number of MoS copies consumed by Quantum Fourier Sampling.

qfs_postselected: int

Number of QFS shots that survived post-selection on the label qubit \(b = 1\) (Theorem 5(i)).

postselection_rate: float

Fraction of QFS shots surviving post-selection; should concentrate around \(1/2\).

list_size: int

\(|L|\), the number of candidate heavy Fourier coefficient indices sent by the prover.

prover_found_target: bool

Whether the target parity \(s^*\) appears in \(L\).

verifier_time_s: float

Wall-clock time for the verifier’s computation (seconds).

verifier_samples: int

Number of classical random examples consumed by the verifier for independent coefficient estimation (Lemma 1).

outcome: str

"accept", "reject_list_too_large", or "reject_insufficient_weight".

Type:

Verification outcome

accepted: bool

Whether the verifier accepted the interaction.

accumulated_weight: float

\(\sum_{s \in L} \hat{\xi}(s)^2\), the Fourier weight accumulated by the verifier’s independent estimates.

acceptance_threshold: float

The threshold \(\tau\) against which accumulated_weight was compared. For parity (Theorem 12): \(\tau = a^2 - \varepsilon^2/8\).

hypothesis_s: int | None

The parity index \(s_{\mathrm{out}}\) of the output hypothesis, or None if rejected.

hypothesis_correct: bool

Whether \(s_{\mathrm{out}} = s^*\).

total_copies: int

Total MoS copies consumed (QFS + prover classical + verifier classical).

total_time_s: float

Total wall-clock time (prover + verifier).

epsilon: float

Accuracy parameter \(\varepsilon\).

theta: float

Fourier resolution threshold \(\vartheta\).

delta: float

Confidence parameter \(\delta\).

a_sq: float

Lower bound \(a^2\) on \(\mathbb{E}_{x \sim U_n}[\tilde\phi(x)^2]\) (Definition 14).

b_sq: float

Upper bound \(b^2\) on \(\mathbb{E}_{x \sim U_n}[\tilde\phi(x)^2]\) (Definition 14).

phi_description: str

Human-readable label for the distribution under test.

k: int | None = None

Fourier sparsity parameter (None for parity experiments).

hypothesis_coefficients: dict[int, float] | None = None

For k-sparse hypotheses, maps each selected parity index to its estimated Fourier coefficient. None for parity experiments.

misclassification_rate: float | None = None

Empirical misclassification rate \(\hat{P}[h(x) \neq y]\) on fresh samples. None for parity experiments.

class experiments.harness.TrialSpec(n: int, phi: list[float], noise_rate: float, target_s: int, epsilon: float, delta: float, theta: float, a_sq: float, b_sq: float, qfs_shots: int, classical_samples_prover: int, classical_samples_verifier: int, seed: int, phi_description: str, dishonest_strategy: str | None = None, gate_noise_rate: float | None = None, qfs_mode: str = 'statevector', k: int | None = None, misclassification_samples: int | None = None)[source]

Bases: object

Fully serialisable specification for one trial.

ProcessPoolExecutor requires picklable arguments. Since MoSState contains Qiskit objects that do not always pickle cleanly, we instead pass this plain dataclass to worker processes. Each worker reconstructs the MoSState from these fields, runs the protocol, and returns a TrialResult.

n: int

Number of input bits.

phi: list[float]

Label-bias function \(\varphi(x) = \Pr[y{=}1 \mid x]\), stored as a plain list (not ndarray) for pickle safety. Length \(2^n\).

noise_rate: float

Label-flip noise rate \(\eta \in [0, 0.5)\).

target_s: int

The ground-truth heavy parity index \(s^* \in \{0,\ldots,2^n{-}1\}\).

epsilon: float

Accuracy parameter \(\varepsilon\).

delta: float

Confidence parameter \(\delta\).

theta: float

Fourier resolution threshold \(\vartheta\).

a_sq: float

Lower bound \(a^2\) on \(\mathbb{E}[\tilde\phi(x)^2]\) (Definition 14).

b_sq: float

Upper bound \(b^2\) on \(\mathbb{E}[\tilde\phi(x)^2]\) (Definition 14).

qfs_shots: int

Number of MoS copies for QFS (prover).

classical_samples_prover: int

Number of classical samples for the prover’s coefficient estimation.

classical_samples_verifier: int

Number of classical samples for the verifier’s independent estimation.

seed: int

Per-trial random seed.

phi_description: str

Human-readable label for the distribution.

dishonest_strategy: str | None = None

If not None, the worker runs a dishonest prover with this strategy instead of the honest protocol. Single-element strategies (used by run_soundness_experiment()): "random_list", "wrong_parity", "partial_list", "inflated_list". Multi-element strategies (used by run_soundness_multi_experiment()): "partial_real", "diluted_list", "shifted_coefficients", "subset_plus_noise".

gate_noise_rate: float | None = None

Gate-level depolarising noise rate \(p\). When set, the worker constructs a Qiskit NoiseModel with depolarising channels on H, X (1-qubit) and CX (2-qubit) gates.

qfs_mode: str = 'statevector'

QFS simulation mode passed to MoSProver.run_protocol.

k: int | None = None

Fourier sparsity parameter \(k\). When set and > 1, the worker calls verify_fourier_sparse() instead of verify_parity().

misclassification_samples: int | None = None

Number of fresh samples for misclassification rate estimation. When None, defaults to 1000.
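The picklable-spec pattern described above can be sketched in miniature (Spec and run_trial are hypothetical stand-ins for TrialSpec and the harness's worker function, not the actual code):

```python
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass


@dataclass
class Spec:
    """Stand-in for TrialSpec: plain picklable fields only, no Qiskit objects."""
    n: int
    seed: int


def run_trial(spec: Spec) -> dict:
    # A worker would reconstruct the heavy state (MoSState) from these
    # fields, run the protocol, and return a plain result object.
    return {"n": spec.n, "seed": spec.seed}


if __name__ == "__main__":
    # Specs pickle cleanly, so they can cross the process boundary.
    specs = [Spec(n=4, seed=s) for s in range(4)]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(run_trial, specs))
```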

experiments.harness.merge_shard_files(shard_paths: list[Path], output_path: Path) → None[source]

Merge sharded .pb experiment results into a single file.

Reads each shard, concatenates all trial results, aggregates metadata (summed wall-clock time, total trial count), and writes the merged protobuf to output_path.

Parameters:
  • shard_paths (list[Path]) – Paths to shard .pb files. All must be from the same experiment type.

  • output_path (Path) – Destination for the merged .pb file.

experiments.harness.shard_specs(specs: list[TrialSpec], shard_index: int, num_shards: int) → list[TrialSpec][source]

Return the contiguous slice of specs assigned to this shard.

Uses divmod chunking so the first remainder shards each get one extra spec, ensuring no spec is missed or duplicated.

Parameters:
  • specs (list[TrialSpec]) – The full, deterministically generated spec list.

  • shard_index (int) – 0-based index of this shard.

  • num_shards (int) – Total number of shards.

Returns:

The slice of specs for this shard (may be empty if num_shards > len(specs)).

Return type:

list[TrialSpec]
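The divmod chunking rule can be sketched as follows (hypothetical shard_slice, operating on indices rather than TrialSpec objects):

```python
def shard_slice(num_items: int, shard_index: int, num_shards: int) -> range:
    # divmod chunking: the first `rem` shards each take one extra item,
    # so every index is covered exactly once across all shards.
    base, rem = divmod(num_items, num_shards)
    start = shard_index * base + min(shard_index, rem)
    end = start + base + (1 if shard_index < rem else 0)
    return range(start, end)
```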

experiments.harness.make_bent_function(n: int) → list[float][source]

Construct \(\varphi\) for a Maiorana–McFarland bent function.

For even \(n\), defines \(f(x, y) = \langle x, y \rangle \bmod 2\) where \(x, y \in \{0,1\}^{n/2}\). The resulting \(g = (-1)^f\) has all Fourier coefficients equal in magnitude:

\[|\hat{g}(s)| = 2^{-n/2} \quad \forall\, s \in \{0,1\}^n\]

This is the worst case for heavy coefficient extraction because no coefficient dominates: Parseval gives \(\sum_s \hat{g}(s)^2 = 1\) spread uniformly over all \(2^n\) frequencies.

Parameters:

n (int) – Number of input bits (must be even).

Returns:

\(\varphi(x)\) for \(x = 0, \ldots, 2^n - 1\).

Return type:

list[float]

Raises:

ValueError – If n is odd.
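A standalone sketch of the construction, with a naive Walsh–Hadamard check of the flat spectrum (bent_phi and fourier are hypothetical names, not the harness's code; the bit-half split is one arbitrary choice of the \((x, y)\) packing):

```python
import numpy as np


def bent_phi(n: int) -> list[float]:
    # Maiorana-McFarland: f(x, y) = <x, y> mod 2 with x = high half,
    # y = low half of the n-bit input z.
    if n % 2:
        raise ValueError("n must be even")
    half, mask = n // 2, (1 << (n // 2)) - 1
    return [float(bin((z >> half) & (z & mask)).count("1") % 2)
            for z in range(2 ** n)]


def fourier(phi: list[float]) -> np.ndarray:
    # Naive Walsh-Hadamard transform of phi_tilde = (-1)^f = 1 - 2*phi.
    N = len(phi)
    g = 1.0 - 2.0 * np.asarray(phi)
    H = np.array([[(-1) ** bin(s & z).count("1") for z in range(N)]
                  for s in range(N)])
    return H @ g / N
```

For n = 4 every coefficient has magnitude 2^{-2} = 0.25, and Parseval's identity holds exactly.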

experiments.harness.make_k_sparse(n: int, k: int, rng: numpy.random.Generator) → tuple[list[float], int, float][source]

Construct \(\varphi\) for a random \(k\)-Fourier-sparse function.

Draws \(k\) distinct nonzero parity indices and random coefficients from the Dirichlet(1, …, 1) distribution on the \(k\)-simplex, so that \(\sum_i c_i = 1\). The resulting \(\tilde\phi\) satisfies \(|\tilde\phi(x)| \le 1\) by the triangle inequality, keeping \(\varphi(x) \in [0, 1]\).

This goes beyond single parities (Exp 1) by testing the protocol on functions with multiple non-zero Fourier coefficients, probing the regime of Corollary 7 (2-agnostic Fourier-sparse learning).

Parameters:
  • n (int) – Number of input bits.

  • k (int) – Fourier sparsity.

  • rng (numpy.random.Generator) – Random number generator.

Returns:

  • phi (list[float]) – Label-bias function.

  • target_s (int) – The index of the heaviest Fourier coefficient.

  • parseval_weight (float) – \(\sum_i c_i^2 = \mathbb{E}[\tilde\phi(x)^2]\), used as \(a^2 = b^2\) for the distribution class promise.
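A minimal sketch of the sampling recipe (hypothetical k_sparse_phi; uses the \(\varphi = (1 - \tilde\phi)/2\) label-bias convention consistent with make_single_parity()):

```python
import numpy as np


def k_sparse_phi(n: int, k: int, rng: np.random.Generator):
    # k distinct nonzero parity indices with Dirichlet(1,...,1) coefficients:
    # c_i > 0 and sum c_i = 1, so |phi_tilde| <= 1 pointwise.
    idx = rng.choice(np.arange(1, 2 ** n), size=k, replace=False)
    coeffs = rng.dirichlet(np.ones(k))
    chi = np.array([[(-1) ** bin(int(s) & z).count("1") for z in range(2 ** n)]
                    for s in idx])
    phi_tilde = coeffs @ chi                 # sum_i c_i * chi_{s_i}(z)
    phi = (1.0 - phi_tilde) / 2.0            # label bias stays in [0, 1]
    return phi.tolist(), int(idx[np.argmax(coeffs)]), float(np.sum(coeffs ** 2))
```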

experiments.harness.make_random_boolean(n: int, rng: numpy.random.Generator) → tuple[list[float], int][source]

Construct \(\varphi\) from a uniform random truth table.

Each \(\varphi(x)\) is drawn independently from \(\{0, 1\}\), producing a maximally Fourier-dense function where every coefficient is generically nonzero. Since \(\tilde\phi \in \{-1, +1\}\), we have \(\mathbb{E}[\tilde\phi(x)^2] = 1\) exactly, so \(a^2 = b^2 = 1\).

The heaviest Fourier coefficient is identified via the Walsh–Hadamard transform.

Parameters:
  • n (int) – Number of input bits.

  • rng (numpy.random.Generator) – Random number generator.

Returns:

  • phi (list[float]) – Label-bias function.

  • target_s (int) – Index of the largest \(|\hat{\tilde\phi}(s)|\).
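The truth-table draw and the Walsh–Hadamard identification of the heaviest coefficient can be sketched as follows (hypothetical random_boolean_phi, using an in-place fast WHT rather than whatever transform the harness actually uses):

```python
import numpy as np


def random_boolean_phi(n: int, rng: np.random.Generator):
    phi = rng.integers(0, 2, size=2 ** n).astype(float)   # uniform truth table
    coeffs = 1.0 - 2.0 * phi                              # phi_tilde in {-1, +1}
    h = 1
    while h < len(coeffs):
        # In-place fast Walsh-Hadamard butterfly: coeffs[s] ends up as
        # sum_z phi_tilde(z) * (-1)^{<s, z>}.
        for i in range(0, len(coeffs), 2 * h):
            for j in range(i, i + h):
                a, b = coeffs[j], coeffs[j + h]
                coeffs[j], coeffs[j + h] = a + b, a - b
        h *= 2
    coeffs /= 2 ** n
    return phi.tolist(), int(np.argmax(np.abs(coeffs)))
```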

experiments.harness.make_random_parity(n: int, rng: numpy.random.Generator) → tuple[list[float], int][source]

Construct \(\varphi\) for a uniformly random nonzero parity.

Draws \(s^* \sim \mathrm{Uniform}(\{1, \ldots, 2^n - 1\})\) and returns the corresponding \(\varphi\) via make_single_parity().

Parameters:
  • n (int) – Number of input bits.

  • rng (numpy.random.Generator) – Random number generator.

Returns:

  • phi (list[float]) – Label-bias function.

  • target_s (int) – The sampled parity index.

experiments.harness.make_single_parity(n: int, target_s: int) → list[float][source]

Construct \(\varphi\) for a pure parity function.

\[\varphi(x) = s^* \cdot x \bmod 2\]

so that \(\tilde\phi = \chi_{s^*}\) and the Fourier spectrum has a single nonzero coefficient \(\hat{\tilde\phi}(s^*) = 1\).

Parameters:
  • n (int) – Number of input bits.

  • target_s (int) – Parity index \(s^* \in \{0, \ldots, 2^n - 1\}\).

Returns:

\(\varphi(x)\) for \(x = 0, \ldots, 2^n - 1\).

Return type:

list[float]
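A one-line sketch of the construction (hypothetical single_parity_phi):

```python
def single_parity_phi(n: int, target_s: int) -> list[float]:
    # phi(x) = <s*, x> mod 2: the label bias is deterministic, equal to
    # the parity of the bits of x selected by target_s.
    return [float(bin(target_s & x).count("1") % 2) for x in range(2 ** n)]
```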

experiments.harness.make_sparse_plus_noise(n: int, rng: numpy.random.Generator) → tuple[list[float], int, float][source]

Construct \(\varphi\) with one dominant parity plus secondary coefficients.

The function \(\tilde\phi\) has a dominant Fourier coefficient \(c_{\mathrm{dom}} = 0.7\) at a random parity \(s^*\) and three secondary coefficients \(c_{\mathrm{sec}} = 0.1\) each, giving \(\sum |c_i| = 1\). This models the realistic case where a clear signal coexists with structured Fourier noise.

Parameters:
  • n (int) – Number of input bits.

  • rng (numpy.random.Generator) – Random number generator.

Returns:

  • phi (list[float]) – Label-bias function.

  • target_s (int) – The dominant parity index \(s^*\).

  • parseval_weight (float) – \(\sum c_i^2\), used as \(a^2 = b^2\).
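A minimal sketch (hypothetical sparse_plus_noise_phi; the returned Parseval weight is 0.7² + 3·0.1² = 0.52, the pw value quoted by run_ab_regime_experiment()):

```python
import numpy as np


def sparse_plus_noise_phi(n: int, rng: np.random.Generator):
    # One dominant parity (c = 0.7) plus three secondary parities (c = 0.1
    # each); sum |c_i| = 1 keeps phi_tilde in [-1, 1].
    idx = rng.choice(np.arange(1, 2 ** n), size=4, replace=False)
    coeffs = np.array([0.7, 0.1, 0.1, 0.1])
    chi = np.array([[(-1) ** bin(int(s) & z).count("1") for z in range(2 ** n)]
                    for s in idx])
    phi = (1.0 - coeffs @ chi) / 2.0         # label bias stays in [0, 1]
    return phi.tolist(), int(idx[0]), float(np.sum(coeffs ** 2))
```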

experiments.harness.run_ab_regime_experiment(n_range: range = range(4, 7), gaps: list[float] | None = None, num_trials: int = 20, epsilon: float = 0.3, qfs_shots: int = 2000, classical_samples_prover: int = 1000, classical_samples_verifier: int = 3000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) → ExperimentResult[source]

Distributional regime sweep: verification with \(a^2 \neq b^2\).

For each \(n\) in n_range and each gap value, the experiment constructs a sparse-plus-noise \(\varphi\) whose true Parseval weight is \(pw \approx 0.52\), then sets

\[a^2 = pw - \tfrac{\text{gap}}{2}, \qquad b^2 = pw + \tfrac{\text{gap}}{2}\]

with \(a^2\) clamped to a minimum of 0.01. When gap = 0 this recovers the standard \(a = b\) regime used by all other experiments; positive gaps exercise Definition 14 of [ITCS2024] where the verifier is told that \(a < b\).

Note

The Theorem 12 completeness precondition is satisfied only at gap = 0. Theorem 12 requires \(\varepsilon \ge 2\sqrt{b^2 - a^2}\); with \(\varepsilon = 0.3\) this means \(\mathrm{gap} \le (\varepsilon/2)^2 = 0.0225\), so only the gap = 0 cell of the default sweep formally satisfies the precondition. For gap \(> 0.0225\) the experiment intentionally runs outside Theorem 12’s completeness guarantee on a benign sparse_plus_noise \(\varphi\) whose true Parseval mass \(\|\tilde\phi\|_2^2 = 0.52\) sits at the centre of \([a^2, b^2]\), so honest interactions still produce \(\sum_{s \in L} \hat{\xi}(s)^2 \ge a^2 - \varepsilon^2/8\).

This is not evidence that Theorem 13 (a worst-case sample-complexity lower bound for distinguishing random parities from \(U_{n+1}\)) is loose; Theorem 13 does not upper-bound the acceptance probability of any specific honest run on benign inputs. The previous figure-script interpretation was corrected per audit/ab_regime.md (M1).

The “ab regime” sweep is structurally a 1-D \(a^2\)-sweep: \(b^2 \le 0.72\) and the list-size cap \(64 b^2/\vartheta^2 \le 512\) never binds the maximum honest list of 4 (M2 in the audit). To probe both completeness and soundness of Definition 14, the centre \(pw\) would need to vary independently — see audit/FOLLOW_UPS.md.
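The bound construction and the Theorem 12 precondition check can be sketched as follows (hypothetical helper names; the clamp value 0.01 is the minimum stated above):

```python
def ab_bounds(pw: float, gap: float, a_min: float = 0.01):
    # Centre the promise interval [a^2, b^2] on the true Parseval weight pw,
    # clamping a^2 away from zero.
    a_sq = max(pw - gap / 2.0, a_min)
    b_sq = pw + gap / 2.0
    return a_sq, b_sq


def thm12_precondition(eps: float, a_sq: float, b_sq: float) -> bool:
    # Theorem 12 completeness requires eps >= 2 * sqrt(b^2 - a^2),
    # i.e. gap <= (eps / 2)^2.
    return eps >= 2.0 * (b_sq - a_sq) ** 0.5
```

With \(\varepsilon = 0.3\), only gap = 0 in the default sweep passes the check, matching the note above.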

Parameters:
  • n_range (range) – Range of \(n\) values to sweep.

  • gaps (list[float] or None) – Values of \(b^2 - a^2\) to sweep. Default: [0.0, 0.05, 0.1, 0.2, 0.3, 0.4].

  • num_trials (int) – Trials per \((n, \text{gap})\) cell.

  • epsilon (float) – Accuracy parameter \(\varepsilon\).

  • qfs_shots (int) – QFS copy budget per trial.

  • classical_samples_prover (int) – Classical samples for the prover.

  • classical_samples_verifier (int) – Classical samples for the verifier.

  • base_seed (int) – Base random seed.

  • max_workers (int) – Number of parallel worker processes.

  • shard_index (int or None) – 0-based shard index for distributed execution.

  • num_shards (int or None) – Total number of shards for distributed execution.

Return type:

ExperimentResult

experiments.harness.run_average_case_experiment(n_range: range = range(4, 11), families: list[str] | None = None, num_trials: int = 20, epsilon: float = 0.3, delta: float = 0.1, qfs_shots: int = 2000, classical_samples_prover: int = 1000, classical_samples_verifier: int = 3000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) → ExperimentResult[source]

Average-case experiment: protocol performance on diverse function families.

For each \(n\) in n_range and each function family, samples num_trials random functions from the family, runs the full honest prover \(\to\) verifier protocol, and records acceptance rate and hypothesis quality.

The experiment goes beyond the single-parity regime of Exp 1 by testing three Fourier-sparse function families:

k_sparse_2, k_sparse_4

\(k\)-Fourier-sparse functions with random Dirichlet coefficients on the \(k\)-simplex (Corollary 7). \(\vartheta\) is adapted per k so the Goldreich–Levin threshold remains below the expected heaviest coefficient. Each trial sets TrialSpec.k = k so the worker dispatches to verify_fourier_sparse() (Theorems 9/10/15) with the k-sparse acceptance threshold \(a^2 - \varepsilon^2/(128 k^2)\).

sparse_plus_noise

One dominant parity (\(c_{\mathrm{dom}} = 0.7\)) plus three secondary coefficients (\(c_{\mathrm{sec}} = 0.1\) each), giving 4 heavy coefficients in total. Sets TrialSpec.k = 4 so the verifier runs the Fourier-sparse path matching the actual sparsity of the target.

Note

History. Prior to the audit fix in audit/average_case.md (M1), all three families dispatched to verify_parity because TrialSpec.k was never set. The audit identified this as the highest-value single fix; as a result the previous on-disk results/average_case_4_16_100.pb was generated against the wrong verifier path and is invalid until re-generated. See audit/FOLLOW_UPS.md.

The fourth family random_boolean (uniform truth tables) was dropped: at \(n \ge 6\) it violates Definition 11 by construction, so the verifier correctly rejects it (M2 in the audit). Including it as an “average case” representative conflated promise-class violation with average-case behaviour.

\(\vartheta = \min(\varepsilon,\, 0.9/k)\) is a heuristic (m4 in the audit), not derived from a theorem; for Dirichlet draws a non-trivial fraction of trials will still have \(c_{\min} < \vartheta\).

Parameters:
  • n_range (range) – Range of \(n\) values to sweep.

  • families (list[str] or None) – Function family names. Default: ["k_sparse_2", "k_sparse_4", "sparse_plus_noise"] (random_boolean was dropped; see the note above).

  • num_trials (int) – Trials per \((n, \text{family})\) cell.

  • epsilon (float) – Accuracy parameter \(\varepsilon\).

  • delta (float) – Confidence parameter \(\delta\).

  • qfs_shots (int) – QFS copy budget per trial.

  • classical_samples_prover (int) – Classical samples for the prover.

  • classical_samples_verifier (int) – Classical samples for the verifier.

  • base_seed (int) – Base random seed.

  • max_workers (int) – Number of parallel worker processes.

Return type:

ExperimentResult

experiments.harness.run_bent_experiment(n_range: range = range(4, 13, 2), num_trials: int = 10, epsilon: float = 0.3, theta: float = 0.3, qfs_shots: int = 3000, classical_samples_prover: int = 2000, classical_samples_verifier: int = 5000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) → ExperimentResult[source]

Bent function worst-case experiment.

Bent functions (even \(n\) only) have all Fourier coefficients at \(|\hat{g}(s)| = 2^{-n/2}\). As \(n\) grows, this magnitude shrinks below the resolution threshold \(\vartheta\), and the prover’s heavy coefficient list collapses.

The crossover occurs at \(2^{-n/2} \approx \vartheta\):

  • Below the crossover (\(n\) small), all \(2^n\) entries appear in \(L\) and the verifier accepts.

  • Above the crossover (\(n\) large), the prover finds few or no entries above threshold, and the verifier rejects.

This is the theoretically predicted worst case for the QFS-based spectrum approximation of Corollary 5: the DKW-based extraction cannot distinguish signal from the uniform floor \((1 - \mathbb{E}[\tilde\phi^2]) / 2^n\) when all coefficients are equally small.

Note

Caveat on the n=4 case (Corollary 5 uncertain band). Corollary 5 guarantees inclusion only when \(|\hat g(s)| \ge \vartheta\), while requiring returned coefficients to satisfy \(|\hat g(s)| \ge \vartheta/2\). The interval \([\vartheta/2,\, \vartheta)\) is therefore an uncertain band in which inclusion is permitted but not required. For the default \(\vartheta = 0.3\), the bent point at \(n = 4\) has every coefficient at \(2^{-n/2} = 0.25 \in [0.15, 0.3)\), so it lies inside this uncertain band rather than strictly “below the crossover” at \(2^{-n/2} = \vartheta/2\). The \(n = 4\) acceptance is therefore Corollary 5 extracting from the uncertain band where inclusion is allowed; the prover’s Parseval cap \(16/\vartheta^2\) does the heavy lifting. See audit/bent.md (m2).
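The three regimes can be sketched as a small classifier (hypothetical bent_band; thresholds as in the Corollary 5 discussion above):

```python
def bent_band(n: int, theta: float) -> str:
    # Classify the flat bent magnitude 2^(-n/2) against Corollary 5's
    # two thresholds: theta (inclusion guaranteed) and theta/2 (inclusion
    # permitted but not required).
    c = 2.0 ** (-n / 2)
    if c >= theta:
        return "must_include"
    if c >= theta / 2:
        return "uncertain_band"
    return "below_floor"
```

For the default \(\vartheta = 0.3\): n = 4 (magnitude 0.25) sits in the uncertain band, while n ≥ 6 falls below the floor.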

Parameters:
  • n_range (range) – Range of even \(n\) values to sweep.

  • num_trials (int) – Number of independent trials per \(n\).

  • epsilon (float) – Accuracy parameter \(\varepsilon\).

  • theta (float) – Fourier resolution threshold \(\vartheta\).

  • qfs_shots (int) – QFS copy budget per trial.

  • classical_samples_prover (int) – Classical samples for the prover.

  • classical_samples_verifier (int) – Classical samples for the verifier.

  • base_seed (int) – Base random seed.

  • max_workers (int) – Number of parallel worker processes.

Return type:

ExperimentResult

experiments.harness.run_gate_noise_experiment(n_range: range = range(4, 7), gate_noise_rates: list[float] | None = None, num_trials: int = 24, epsilon: float = 0.3, qfs_shots: int = 2000, classical_samples_prover: int = 1000, classical_samples_verifier: int = 3000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) → ExperimentResult[source]

Gate-level noise sweep: verification under depolarising gate errors.

Unlike the label-flip noise experiment (Exp 2 / noise.py), which applies noise at the distribution level via \(\varphi_{\mathrm{eff}}(x) = (1 - 2\eta)\,\varphi(x) + \eta\) (Definition 5(iii)), this experiment applies depolarising noise channels directly to the quantum gates in the QFS circuit:

  • 1-qubit depolarising error (rate p) on H and X gates

  • 2-qubit depolarising error (rate p) on CX gates

Multi-controlled X gates (used in the oracle \(U_f\)) decompose into CX and single-qubit gates during transpilation, so gate noise propagates through the full circuit.

This goes beyond the noise models analysed in Caro et al. (Lemmas 4–6, §4.2) and produces inherently empirical results with no theoretical prediction to compare against.

The experiment uses qfs_mode="circuit" (required for gate-level noise) and noise_rate=0 (no label-flip noise), with \(a^2 = b^2 = 1\) (noiseless distribution class promise, since the noise is purely at the gate level rather than the label level).

Warning

Exploratory — the paper makes no predictions about gate noise. Caro et al. only analyse classical label-flip noise on the data oracle (Definition 5(iii) / Definition 12). This experiment is NOT a validation of any theorem.

Three substantive caveats from audit/gate_noise.md:

  1. The observed “threshold” at small \(n\) is dominated by truth-table oracle synthesis cost in mos.MoSState._circuit_oracle_f(), which emits up to \(2^n\) multi-controlled-X gates per shot. Each MCX decomposes into \(O(n)\) CX gates after transpilation, giving expected errors per shot \(\sim p \cdot n \cdot 2^n\) — exponential in \(n\). The paper’s Theorem 12 only requires \(O(n \log(\dots))\) single-qubit prover gates, so the experiment is hitting a synthesis blow-up that does not exist in the theory.

  2. The high-\(p\) end of the sweep (\(p \in \{0.2, 0.5\}\)) is unphysical: real superconducting/trapped-ion 2q gate error rates are \(\sim 10^{-4}\) to \(10^{-2}\). At \(p = 0.5\) the depolarising channel maps any state to within \(1/4\) of the maximally mixed state in one application.

  3. The noise model is attached only to h, x, cx. Other transpiled basis gates (sx, rz, u3, t, tdg) carry no error, so the effective per-logical-op error is smaller than nominal \(p\).

The headline finding “the verifier fails sharply under gate noise” is NOT supported without isolating circuit-implementation cost from protocol robustness. See audit/FOLLOW_UPS.md for the proposed structured-oracle fix that would actually probe the protocol.
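Caveat 1's error-budget estimate can be reproduced directly (hypothetical mcx_error_budget; a back-of-envelope count, not a simulation):

```python
def mcx_error_budget(p: float, n: int) -> float:
    # Audit estimate: up to 2^n multi-controlled-X gates per shot, each
    # decomposing into O(n) CX gates after transpilation, giving roughly
    # p * n * 2^n expected gate errors per shot -- exponential in n.
    return p * n * 2 ** n
```

At p = 0.001 the budget already exceeds one error every three shots by n = 6, which is why the observed threshold tracks oracle synthesis cost rather than protocol robustness.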

Parameters:
  • n_range (range) – Range of \(n\) values to sweep.

  • gate_noise_rates (list[float] or None) – Depolarising error rates p to sweep. Default: [0.0, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1].

  • num_trials (int) – Trials per \((n, p)\) cell.

  • epsilon (float) – Accuracy parameter \(\varepsilon\).

  • qfs_shots (int) – QFS copy budget per trial.

  • classical_samples_prover (int) – Classical samples for the prover.

  • classical_samples_verifier (int) – Classical samples for the verifier.

  • base_seed (int) – Base random seed.

  • max_workers (int) – Number of parallel worker processes.

Return type:

ExperimentResult

experiments.harness.run_k_sparse_experiment(n_range: range = range(4, 11, 2), k_values: list[int] | None = None, num_trials: int = 20, epsilon: float = 0.3, delta: float = 0.1, qfs_shots: int = 2000, classical_samples_prover: int = 1000, classical_samples_verifier: int = 3000, misclassification_samples: int = 1000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) → ExperimentResult[source]

k-Sparse verification experiment.

Exercises verify_fourier_sparse() (Theorems 9/10/14/15) by sweeping Fourier sparsity k and dimension n. For k = 1, uses verify_parity as a comparison baseline.

Measures both index-correctness (heaviest coefficient match) and empirical misclassification rate for k-sparse hypotheses.

experiments.harness.run_theta_sensitivity_experiment(n_range: range = range(4, 9, 2), theta_values: list[float] | None = None, num_trials: int = 20, epsilon: float = 0.3, delta: float = 0.1, qfs_shots: int = 2000, classical_samples_prover: int = 1000, classical_samples_verifier: int = 3000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) → ExperimentResult[source]

Theta sensitivity experiment.

Sweeps the Fourier resolution threshold \(\vartheta\) against make_sparse_plus_noise() functions (dominant coefficient 0.7, three secondary coefficients 0.1 each).

At \(\vartheta \approx 0.20\), the secondary coefficients sit at the extraction boundary \(|\hat{\tilde\phi}(s)| = 0.1\) vs threshold \(\vartheta/2 = 0.10\), probing where finite-sample noise determines inclusion in \(L\).

Records \(|L|\), acceptance outcome, and accumulated weight per trial to map the extraction frontier (Corollary 5).

Warning

Maps the acceptance boundary; does NOT validate the \(1/\vartheta^4\) sample-complexity scaling. Two MAJOR caveats from audit/theta_sensitivity.md:

  • M1. The hard-coded qfs_shots=2000, classical_samples_prover=1000 and classical_samples_verifier=3000 override the \(\vartheta\)-dependent formulas in ql.prover.MoSProver.run_protocol() (lines 316-321) and ql.verifier.MoSVerifier._verify_core() (lines 487-497). For \(\vartheta = 0.05\) the analytic prover budget is \(\sim 1.9 \times 10^8\) shots vs 2000 used (about 5 orders short). The analytic verifier budget at the same \(\vartheta\) ranges from \(\sim 2.3 \times 10^5\) (\(|L|=1\)) to \(\sim 1 \times 10^8\) (the empirical median \(|L| \approx 484\) at \(n=16\)) to \(\sim 1.5 \times 10^{14}\) (the worst-case enforced Theorem 12 list bound \(64 b^2/\vartheta^2 = 13312\)), versus 3000 used — between 2 and 11 orders of magnitude short depending on which \(|L|\) you plug in. The experiment maps where the verifier accepts/rejects; it does NOT empirically test the theoretical \(1/\vartheta^4\) scaling.

  • M2. make_sparse_plus_noise() has nonzero Fourier coefficients of magnitude exactly 0.1, so for any \(\vartheta > 0.1\) the function lies outside \(\mathfrak{D}^{\mathrm{func}}_{U_n;\ge\vartheta}\) (Definition 11). This means Theorems 8/12 do not formally apply at \(\vartheta \in \{0.12, 0.15, 0.20, 0.30, 0.50\}\) — the experiment is intentionally probing the out-of-promise regime.

To actually validate the \(1/\vartheta^4\) scaling, the sweep would need to be re-run with qfs_shots=None, classical_samples_prover=None and classical_samples_verifier=None so the analytic formulas drive the per-trial budgets. See audit/FOLLOW_UPS.md.

experiments.harness.run_noise_sweep_experiment(n_range: range = range(4, 7), noise_rates: list[float] | None = None, num_trials: int = 20, epsilon: float = 0.3, qfs_shots: int = 2000, classical_samples_prover: int = 1000, classical_samples_verifier: int = 3000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) ExperimentResult[source]

Noise sweep: verification under increasing label-flip noise.

For each \(n\) in n_range and each noise rate \(\eta\), the MoS state is constructed from the effective label probabilities (Definition 5(iii)):

\[\varphi_{\mathrm{eff}}(x) = (1 - 2\eta)\,\varphi(x) + \eta\]

The effective Fourier coefficient becomes \(\hat{\tilde\phi}_{\mathrm{eff}}(s) = (1 - 2\eta)\, \hat{\tilde\phi}(s)\), and the distribution class promise is \(a^2 = b^2 = (1 - 2\eta)^2\).

As \(\eta \to 0.5\), the signal \((1 - 2\eta) \to 0\) and the protocol should eventually fail. The experiment measures the empirical acceptance and correctness rates as functions of \(\eta\), testing the noise-robust verification results of §6.2.
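For concreteness, the attenuation and the breakdown rate can be computed directly. This is a minimal sketch; the helper names are illustrative, not part of the harness:

```python
import math

def effective_coefficient(coeff: float, eta: float) -> float:
    """Lemma 6 attenuation: a Fourier coefficient shrinks by (1 - 2*eta)."""
    return (1.0 - 2.0 * eta) * coeff

def breakdown_noise_rate(epsilon: float) -> float:
    """Largest eta with (1 - 2*eta)^2 >= epsilon^2 / 8, i.e. where the
    attenuated Fourier weight still clears the acceptance slack."""
    return (1.0 - epsilon / math.sqrt(8.0)) / 2.0
```

At the default \(\varepsilon = 0.3\) this gives \(\eta_{\max} \approx 0.4470\), matching the MAJOR-1 audit note below.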

The Fourier resolution threshold \(\vartheta\) is held fixed at \(\vartheta = \varepsilon\) across the entire sweep (audit fix MAJOR-3 in audit/noise_sweep.md). Previously \(\vartheta\) was adapted per noise level (\(\vartheta = \min(\varepsilon,\, 0.9 \cdot (1 - 2\eta))\)), which silently varied two parameters along the same axis and confounded the interpretation of any non-monotonicity in the acceptance curve.

Note

Audit fixes (audit/noise_sweep.md):

  • MAJOR-1: the \(\eta\) range now extends to \(\{0.42, 0.44, 0.46, 0.48\}\) so the sweep crosses the theoretical breakdown \(\eta_{\max} \approx 0.4470\) (set by \((1 - 2\eta)^2 = \varepsilon^2/8\)). Previously the sweep stopped at \(\eta = 0.40\), never entering the failure regime.

  • MAJOR-2 (not fixed in this revision): the headline acceptance dip in \(\eta \in [0.05, 0.30]\) is dominated by squared-estimator variance against the slack \(\varepsilon^2/8 = 0.01125\), not Lemma 6 attenuation. For a sharper acceptance figure classical_samples_verifier could be bumped to \(\sim 30000\); the cleanest empirical confirmation of Lemma 6 is already fourier_weight_attenuation.png.

  • MAJOR-3: \(\vartheta\) is now held fixed across the sweep (see above).

  • The hard-coded qfs_shots=2000, classical_samples_prover=1000, classical_samples_verifier=3000 are still below the analytic Hoeffding budget; these are documented limitations tracked in audit/FOLLOW_UPS.md.

The on-disk results/noise_sweep_*.pb files were generated under the old configuration and are invalid until re-run.

Parameters:
  • n_range (range) – Range of \(n\) values to sweep.

  • noise_rates (list[float] or None) – Values of \(\eta\) to sweep. Default: [0.0, 0.05, 0.1, ..., 0.4].

  • num_trials (int) – Trials per \((n, \eta)\) cell.

  • epsilon (float) – Accuracy parameter \(\varepsilon\).

  • qfs_shots (int) – QFS copy budget per trial.

  • classical_samples_prover (int) – Classical samples for the prover.

  • classical_samples_verifier (int) – Classical samples for the verifier.

  • base_seed (int) – Base random seed.

  • max_workers (int) – Number of parallel worker processes.

Return type:

ExperimentResult

experiments.harness.run_scaling_experiment(n_range: range = range(4, 13), num_trials: int = 20, epsilon: float = 0.3, delta: float = 0.1, qfs_shots: int = 2000, classical_samples_prover: int = 1000, classical_samples_verifier: int = 3000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) ExperimentResult[source]

Scaling experiment: completeness vs \(n\) for random parities.

For each \(n\) in n_range, generates num_trials random nonzero parities \(\varphi(x) = s^* \cdot x\), runs the full honest prover \(\to\) verifier protocol, and records whether the correct parity was recovered.

This is the key experiment distinguishing the project from a reproduction study: at \(n = 4\) it matches the paper’s baseline, at \(n = 16{+}\) it demonstrates GL extraction at scale.

The prover uses \(O(\log(1/\delta) / \varepsilon^4)\) QFS copies (Corollary 5) and the verifier uses \(O\big(b^4 \log(1/(\delta\vartheta^2)) / (\varepsilon^4\vartheta^4)\big)\) classical examples (Theorem 12). The distribution class promise is \(a^2 = b^2 = 1\) (functional case, Theorem 8).

Note

Hard-coded sample budgets bypass the analytic formulas. The defaults qfs_shots=2000, classical_samples_prover=1000 and classical_samples_verifier=3000 are constants, not values derived from the Hoeffding budgets in ql.verifier.MoSVerifier._verify_core() (lines 487-497) or the Corollary 5 formula in ql.prover.MoSProver.run_protocol() (lines 316-321). Consequently total_copies = qfs_shots + classical_samples_prover + classical_samples_verifier = 6000 is constant across every \(n\) in the sweep by construction, and the experiment tests “this fixed 6000-copy budget suffices on single-parity targets up to \(n=16\)” rather than the n-independence of Theorem 12’s analytic verifier sample formula.

To actually validate the n-independence claim of Theorem 12, the sweep would need to be re-run with qfs_shots=None, classical_samples_prover=None and classical_samples_verifier=None so the worker falls through to the per-trial analytic budgets. See audit/scaling.md (M1, M2) and audit/FOLLOW_UPS.md.

Parameters:
  • n_range (range) – Range of \(n\) values to sweep.

  • num_trials (int) – Number of independent trials per \(n\).

  • epsilon (float) – Accuracy parameter \(\varepsilon\).

  • delta (float) – Confidence parameter \(\delta\).

  • qfs_shots (int) – QFS copy budget per trial.

  • classical_samples_prover (int) – Classical samples for the prover’s coefficient estimation.

  • classical_samples_verifier (int) – Classical samples for the verifier’s independent estimation.

  • base_seed (int) – Base random seed for reproducibility.

  • max_workers (int) – Number of parallel worker processes.

Return type:

ExperimentResult

experiments.harness.run_soundness_experiment(n_range: range = range(4, 7), num_trials: int = 50, epsilon: float = 0.3, classical_samples_verifier: int = 3000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) ExperimentResult[source]

Empirical soundness against dishonest provers.

For each \(n\) in n_range, tests four adversarial prover strategies against the verifier’s information-theoretic soundness guarantee (Definition 7):

"random_list"

Prover sends 5 uniformly random frequency indices. Expected acceptance rate \(\approx 5/2^n\) (the probability of accidentally including \(s^*\)).

"wrong_parity"

Prover sends a single incorrect index \(s \neq s^*\). The verifier’s independent estimate will be \(\hat\xi(s) \approx 0\), causing rejection.

"partial_list"

Prover sends an empty list. Accumulated weight is 0, well below the threshold \(1 - \varepsilon^2/8\).

"inflated_list"

Prover sends 10 wrong indices with fabricated estimates. The verifier’s independent estimates expose all of them as having negligible true Fourier weight.

The empirical rejection rate should be \(\geq 1 - \delta\) for all strategies (excluding the combinatorial chance of "random_list" hitting \(s^*\)).
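The \(\approx 5/2^n\) baseline for "random_list" can be sanity-checked with a quick Monte Carlo. This is an illustrative sketch that assumes the adversary draws distinct indices; the function name is hypothetical:

```python
import random

def random_list_hit_rate(n: int, list_size: int = 5,
                         trials: int = 100_000, seed: int = 0) -> float:
    # Fraction of trials in which a random list of distinct frequency
    # indices happens to contain the true parity s* (any fixed index
    # works, by symmetry of the uniform draw).
    rng = random.Random(seed)
    s_star = 1
    hits = sum(s_star in rng.sample(range(2 ** n), list_size)
               for _ in range(trials))
    return hits / trials
```

For distinct draws the hit probability is exactly \(5/2^n\); with replacement it is \(1 - (1 - 2^{-n})^5\), which agrees to first order.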

Parameters:
  • n_range (range) – Range of \(n\) values to sweep.

  • num_trials (int) – Number of trials per \((n, \text{strategy})\) cell.

  • epsilon (float) – Accuracy parameter \(\varepsilon\).

  • classical_samples_verifier (int) – Classical samples for the verifier.

  • base_seed (int) – Base random seed.

  • max_workers (int) – Number of parallel worker processes.

Return type:

ExperimentResult

experiments.harness.run_soundness_multi_experiment(n_range: range = range(4, 7), k_range: list[int] | None = None, num_trials: int = 50, epsilon: float = 0.3, classical_samples_verifier: int = 30000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) ExperimentResult[source]

Empirical soundness against dishonest provers with k-sparse targets.

Extends the single-parity soundness experiment by testing four adversarial strategies against target functions with multiple nonzero Fourier coefficients (k-sparse, Dirichlet-drawn).

This closes Gap 3 in the experiment gap analysis: the accumulated weight check is validated against partially-correct adversarial lists where the adversary may include some genuine heavy coefficients alongside fabricated ones.

Strategies

"partial_real"

Prover includes the weaker half of real heavy coefficients plus a few fake indices. Tests whether partial real weight passes the threshold.

"diluted_list"

Prover includes the weakest \(\max(1, k/2)\) real heavy coefficients (audit fix m4 in audit/soundness_multi.md: previously max(1, k/4), which was always 1 for \(k \le 4\)) plus up to 20 random padding indices. Simulates an adversary that only knows part of the true signal and tries to dilute it with noise.

"shifted_coefficients"

Prover sends entirely wrong indices with fabricated large coefficient claims. Tests that independent estimation exposes zero true weight. Note (m2 in the audit): this strategy is structurally trivial (accumulated true weight is identically 0) and largely overlaps with the simpler "wrong_parity" strategy in run_soundness_experiment(); it is retained for backward compatibility with the existing test suite.

"subset_plus_noise"

Prover includes only the single heaviest real coefficient plus several near-threshold fake indices. Tests the marginal case where one real coefficient’s weight alone is insufficient.

Note

Sample budget raised from 3000 to 30000 (audit fix M1). The previous default classical_samples_verifier=3000 was three to four orders of magnitude below the Hoeffding-derived budget required for the k-sparse path’s tolerance \(\varepsilon^2/(256 k^2 |L|)\). At \(k = 2\) the subset_plus_noise strategy was falsely accepting at up to 18%, exceeding the stated \(\delta = 0.1\). Raising the budget to 30000 brings the verifier closer to the analytic budget; the existing results/soundness_multi_4_16_100.pb was generated with the old value and is invalid until re-run. See audit/soundness_multi.md and audit/FOLLOW_UPS.md.

\(\vartheta = \min(\varepsilon,\, 0.9/k)\) is an undocumented heuristic (m1 in the audit); the paper requires \(\vartheta \in (2^{-(n/2 - 3)},\, 1)\).
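The mismatch flagged in m1 can be checked mechanically. A sketch, with an illustrative helper name:

```python
def theta_in_paper_range(n: int, k: int, epsilon: float = 0.3) -> bool:
    # Heuristic actually used by the harness (undocumented, m1 in the audit).
    theta = min(epsilon, 0.9 / k)
    # Paper's admissible interval: theta in (2^{-(n/2 - 3)}, 1).
    lower = 2.0 ** -(n / 2 - 3)
    return lower < theta < 1.0
```

Note that for the default n_range = range(4, 7) the lower endpoint \(2^{-(n/2-3)}\) is at least 1, so the paper's interval is empty and the heuristic necessarily falls outside it.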

Parameters:
  • n_range (range) – Range of \(n\) values to sweep.

  • k_range (list[int] or None) – Fourier sparsity values to test. Default: [2, 4].

  • num_trials (int) – Number of trials per \((n, k, \text{strategy})\) cell.

  • epsilon (float) – Accuracy parameter \(\varepsilon\).

  • classical_samples_verifier (int) – Classical samples for the verifier.

  • base_seed (int) – Base random seed.

  • max_workers (int) – Number of parallel worker processes.

Return type:

ExperimentResult

experiments.harness.run_trials_parallel(specs: list[TrialSpec], max_workers: int | None = None, label: str = '', shard_index: int | None = None, num_shards: int | None = None) list[TrialResult][source]

Dispatch a batch of trials across worker processes.

When max_workers > 1, uses ProcessPoolExecutor with as_completed() for progress reporting. Results are stored in an index-mapped list so the output preserves the original spec ordering regardless of completion order.

When max_workers == 1, falls back to sequential execution in the main process (useful for debugging — parallel tracebacks are less readable).

When shard_index and num_shards are both set, only the contiguous slice of specs assigned to this shard is executed. This enables SLURM Job Array distribution where each array task regenerates the full spec list but runs only its portion.
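The contiguous-slice assignment described above can be sketched as follows (illustrative only; the harness's actual partition logic may differ in remainder handling):

```python
def shard_bounds(num_specs: int, shard_index: int, num_shards: int) -> tuple[int, int]:
    # Contiguous split: each shard runs one consecutive block of specs,
    # with earlier shards absorbing the remainder one spec at a time.
    base, rem = divmod(num_specs, num_shards)
    start = shard_index * base + min(shard_index, rem)
    stop = start + base + (1 if shard_index < rem else 0)
    return start, stop
```

Every spec lands in exactly one shard, so each SLURM array task can regenerate the full spec list and execute only specs[start:stop].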

Parameters:
  • specs (list[TrialSpec]) – Trial specifications to execute.

  • max_workers (int or None) – Number of worker processes. None defaults to os.cpu_count().

  • label (str) – Short label printed in progress output (e.g. "scaling").

  • shard_index (int or None) – 0-based index of this shard (requires num_shards).

  • num_shards (int or None) – Total number of shards (requires shard_index).

Returns:

Results in the same order as the (possibly sharded) specs.

Return type:

list[TrialResult]

experiments.harness.run_truncation_experiment(n: int = 6, epsilon_range: list[float] | None = None, verifier_sample_range: list[int] | None = None, num_trials: int = 20, qfs_shots: int = 2000, classical_samples_prover: int = 1000, base_seed: int = 42, max_workers: int = 1, shard_index: int | None = None, num_shards: int | None = None) ExperimentResult[source]

Verifier truncation tradeoff experiment.

Fixes \(n\) and the prover’s resources, then sweeps the verifier’s classical sample budget \(m_V\) and accuracy parameter \(\varepsilon\) to map the completeness frontier.

Uses noisy parity with \(\eta = 0.15\) (effective coefficient \((1 - 2\eta) = 0.7\), so \(a^2 = b^2 = 0.49\)), which gives a tight Fourier weight margin:

\[\text{weight} \approx 0.49, \quad \tau = a^2 - \varepsilon^2 / 8\]

At small \(m_V\), the Hoeffding estimation noise in the verifier’s coefficient estimates pushes the accumulated weight below \(\tau\), causing rejection even for honest provers. At large \(\varepsilon\), \(\tau\) drops and the check becomes lenient.

Warning

Sub-Hoeffding regime — figures are NOT Theorem 12 boundary measurements. The default grid \(m_V \in \{50, \dots, 3000\}\) lies 3-4 orders of magnitude below the Theorem 12 verifier sample-complexity prescription \(m \ge (2/\mathrm{tol}^2)\,\log(4|L|/\delta)\) with \(\mathrm{tol} = \varepsilon^2/(16|L|)\) (see ql/verifier.py:493). For \(|L|=1\) and \(\delta=0.1\) the analytic minimum is roughly \(1.9 \times 10^7\) at \(\varepsilon=0.1\) and \(3.0 \times 10^4\) at \(\varepsilon=0.5\), so the grid never reaches the asymptotic regime that the threshold \(\tau = a^2 - \varepsilon^2/8\) was designed for.

Consequently the visible “knees” for \(\varepsilon \le 0.3\) are squaring-bias artefacts: the coefficient estimator \(\widehat\xi(s)\) is unbiased, but its square is not, since \(\mathbb{E}[\widehat\xi(s)^2] = \widehat{\varphi}(s)^2 + \mathrm{Var}(\widehat\xi(s))\), and this bias \(\approx 0.51/m_V\) is comparable to or exceeds the acceptance margin \(\varepsilon^2/8\). Some rows in truncation_summary.csv therefore show accept@50 > accept@3000 (acceptance decreasing with budget), which would be impossible in the true Hoeffding regime.

Additionally, the prover budgets qfs_shots=2000 and classical_samples_prover=1000 sit far below the Corollary 5 prescription, so the figures partly reflect prover-side QFS failure modes (the prover may fail to place the target string into \(L\) at small \(\varepsilon\) and large \(n\)) rather than pure verifier-side truncation.

The 2D grid \((\varepsilon \times m_V)\) is therefore best interpreted as a non-asymptotic / sub-Hoeffding feasibility sweep at fixed prover and verifier budgets, not as a measurement of the tradeoff surface predicted by Theorem 12. See audit/truncation.md (M1, M2) and audit/FOLLOW_UPS.md for the rerun specifications that would actually cross the Theorem 12 boundary.
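The analytic minima quoted in the warning can be reproduced directly from the stated Theorem 12 prescription (a sketch; the function name is hypothetical):

```python
import math

def theorem12_min_verifier_samples(epsilon: float, list_size: int = 1,
                                   delta: float = 0.1) -> float:
    # m >= (2 / tol^2) * log(4*|L| / delta) with tol = epsilon^2 / (16*|L|),
    # as quoted in the warning above (cf. ql/verifier.py:493).
    tol = epsilon ** 2 / (16 * list_size)
    return (2.0 / tol ** 2) * math.log(4 * list_size / delta)
```

At \(|L| = 1\), \(\delta = 0.1\) this yields roughly \(1.9 \times 10^7\) samples for \(\varepsilon = 0.1\) and \(3.0 \times 10^4\) for \(\varepsilon = 0.5\), confirming that the default grid's maximum of 3000 sits far below the analytic regime.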

Parameters:
  • n (int) – Number of input bits (fixed across the sweep).

  • epsilon_range (list[float] or None) – Values of \(\varepsilon\) to test. Default: [0.1, 0.2, 0.3, 0.4, 0.5].

  • verifier_sample_range (list[int] or None) – Verifier classical sample budgets \(m_V\) to test. Default: [50, 100, 200, 500, 1000, 3000].

  • num_trials (int) – Number of independent trials per \((\varepsilon, m_V)\) cell.

  • qfs_shots (int) – QFS copy budget per trial (prover).

  • classical_samples_prover (int) – Classical samples for the prover.

  • base_seed (int) – Base random seed.

  • max_workers (int) – Number of parallel worker processes.

Return type:

ExperimentResult