Benchmark API

lrdbenchmark provides a comprehensive benchmarking framework for evaluating and comparing all 20 long-range dependence estimators (13 classical, 3 machine-learning, 4 neural), plus optional entropy-based estimators in the classical set.

Comprehensive benchmark engine

The primary entry point for publication-style runs (runtime profiles, stratified metrics, significance tests, optional JSON export) is ComprehensiveBenchmark.

class lrdbenchmark.analysis.benchmark.ComprehensiveBenchmark(output_dir: str | None = None, runtime_profile: str = 'auto')

Bases: object

Comprehensive benchmark class for testing all estimators and data models.

__init__(output_dir: str | None = None, runtime_profile: str = 'auto')

Initialize the benchmark system.

Parameters:
  • output_dir (str, optional) – Directory to save benchmark results

  • runtime_profile (str, optional) – Runtime profile controlling computational intensity. Options:
      - “auto”: determine automatically (default)
      - “quick”: minimise expensive diagnostics (useful for tests)
      - “full”: enable all diagnostics and resampling routines

_resolve_runtime_profile(runtime_profile: str | None) → str

Determine the runtime profile controlling benchmark intensity.

_load_protocol_config(path: Path) → Dict[str, Any]

Load benchmark protocol configuration from YAML/JSON file.
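
The loader is private, so its exact behaviour is not documented here. As a rough sketch of how a loader can dispatch on the file extension, the standalone function below reads JSON with the standard library and YAML with PyYAML. The function name, the accepted extensions, and the error handling are assumptions for illustration, not the library's implementation.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_protocol_config(path: Path) -> Dict[str, Any]:
    """Illustrative loader: return the top-level mapping of a JSON or YAML file."""
    suffix = path.suffix.lower()
    text = path.read_text(encoding="utf-8")
    if suffix == ".json":
        loaded = json.loads(text)
    elif suffix in {".yaml", ".yml"}:
        # PyYAML is imported lazily so JSON-only use needs no extra dependency.
        import yaml
        loaded = yaml.safe_load(text)
    else:
        raise ValueError(f"Unsupported config extension: {suffix!r}")
    if loaded is None:  # an empty YAML document parses to None
        return {}
    if not isinstance(loaded, dict):
        raise ValueError("Protocol config must be a mapping at the top level")
    return loaded
```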

_deep_merge_dicts(base: Dict[str, Any], updates: Dict[str, Any]) → Dict[str, Any]

Recursively merge dictionaries without mutating inputs.
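
A non-mutating recursive merge can be sketched as follows. This is a minimal standalone version written for illustration. The configuration keys in the example ("estimators", "dfa", "min_scale", "order", "seed") are hypothetical, and the library's own helper may treat lists or other edge cases differently.

```python
import copy
from typing import Any, Dict


def deep_merge_dicts(base: Dict[str, Any], updates: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which nested mappings from `updates` override `base`."""
    merged = copy.deepcopy(base)  # work on a copy so `base` is never mutated
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge_dicts(merged[key], value)
        else:
            # Scalars, lists, and type mismatches: the update replaces the base value.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults and overrides, for illustration only.
defaults = {"estimators": {"dfa": {"min_scale": 4, "order": 1}}, "seed": 0}
overrides = {"estimators": {"dfa": {"order": 2}}}
merged = deep_merge_dicts(defaults, overrides)
print(merged)
```

Only the overridden leaf changes; sibling keys such as "min_scale" survive, and both input dictionaries are left untouched.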

_initialize_all_estimators() → Dict[str, Dict[str, Any]]

Initialize all available estimators organized by category.

_apply_estimator_overrides(estimators: Dict[str, Dict[str, Any]], overrides: Dict[str, Dict[str, Any]]) → Dict[str, Dict[str, Any]]

Apply protocol-defined parameter overrides to initialized estimators.

_initialize_data_models() → Dict[str, Any]

Initialize all available data models.

_initialize_contamination_models() → Dict[str, Any]

Initialize all available contamination models.

get_estimators_by_type(benchmark_type: str = 'comprehensive', data_length: int = 1000) → Dict[str, Any]

Get estimators based on the specified benchmark type.

Parameters:
  • benchmark_type (str) – Type of benchmark to run:
      - ‘comprehensive’: All estimators (default)
      - ‘classical’: Only classical statistical estimators
      - ‘ML’: Only machine learning estimators (non-neural)
      - ‘neural’: Only neural network estimators

  • data_length (int) – Length of data to be tested (used for adaptive wavelet estimators)

Returns:

Dictionary of estimators for the specified type

Return type:

dict

generate_test_data(model_name: str, data_length: int = 1000, **kwargs) → Tuple[ndarray, Dict[str, Any]]

Generate test data using specified model.

Parameters:
  • model_name (str) – Name of the data model to use

  • data_length (int) – Length of data to generate

  • **kwargs (dict) – Additional parameters for the data model

Returns:

(data, parameters)

Return type:

tuple

apply_contamination(data: ndarray, contamination_type: str, contamination_level: float = 0.1, **kwargs) → Tuple[ndarray, Dict[str, Any]]

Apply contamination to the data.

Parameters:
  • data (np.ndarray) – Original clean data

  • contamination_type (str) – Type of contamination to apply

  • contamination_level (float) – Level/intensity of contamination (0.0 to 1.0)

  • **kwargs (dict) – Additional parameters for specific contamination types

Returns:

(contaminated_data, contamination_info)

Return type:

tuple

run_single_estimator_test(estimator_name: str, data: ndarray, true_params: Dict[str, Any]) → Dict[str, Any]

Run a single estimator test.

Parameters:
  • estimator_name (str) – Name of the estimator to test

  • data (np.ndarray) – Test data

  • true_params (dict) – True parameters of the data

Returns:

Test results

Return type:

dict

_calculate_monte_carlo_mse(estimator, data: ndarray, true_value: float, n_simulations: int = 50) → Dict[str, Any]

Calculate mean signed error using Monte Carlo simulations.

Parameters:
  • estimator (BaseEstimator) – Estimator instance

  • data (np.ndarray) – Original dataset

  • true_value (float) – True parameter value

  • n_simulations (int) – Number of Monte Carlo simulations

Returns:

Mean signed error analysis results

Return type:

dict
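
The idea of a Monte Carlo mean signed error can be sketched without the library: draw repeated series, estimate the parameter on each, and average the signed differences from the true value. Everything below is a toy stand-in under stated assumptions. The `simulate` and `estimate` callables, the returned keys, and the Gaussian toy data are hypothetical, and how the engine actually resamples from `data` is not specified here.

```python
import random
import statistics
from typing import Any, Callable, Dict, List


def monte_carlo_signed_error(
    estimate: Callable[[List[float]], float],
    simulate: Callable[[random.Random], List[float]],
    true_value: float,
    n_simulations: int = 50,
    seed: int = 0,
) -> Dict[str, Any]:
    """Average (estimate - true_value) over repeated simulated series."""
    rng = random.Random(seed)  # seeded for reproducibility
    errors = [estimate(simulate(rng)) - true_value for _ in range(n_simulations)]
    return {
        "mean_signed_error": statistics.fmean(errors),  # positive means over-estimation
        "error_std": statistics.stdev(errors) if len(errors) > 1 else 0.0,
        "n_simulations": n_simulations,
    }


# Toy stand-ins: the "parameter" is the mean of Gaussian draws, not a Hurst exponent.
TRUE_VALUE = 0.7
result = monte_carlo_signed_error(
    estimate=statistics.fmean,
    simulate=lambda rng: [rng.gauss(TRUE_VALUE, 0.1) for _ in range(200)],
    true_value=TRUE_VALUE,
)
print(result)
```

A signed error keeps the direction of the bias, which a squared error would discard.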

_compute_significance_tests(results: Dict[str, Any], alpha: float = 0.05) → Dict[str, Any]

Compute omnibus and post-hoc significance tests across estimators.

Parameters:
  • results (Dict[str, Any]) – Raw benchmark results grouped by data model.

  • alpha (float) – Significance level for hypothesis testing.

Returns:

Significance testing outcomes including Friedman statistics and Holm-adjusted pairwise Wilcoxon tests.

Return type:

Dict[str, Any]
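
The Holm step-down adjustment applied to the pairwise p-values can be written in a few lines. The sketch below is a generic pure-Python version of the standard procedure, not the engine's code. The estimator pair labels and the raw p-values are made up for illustration.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def holm_adjust(p_values: Dict[Pair, float]) -> Dict[Pair, float]:
    """Holm step-down adjustment: scale sorted p-values by (m - rank), keep them monotone."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    adjusted: Dict[Pair, float] = {}
    running_max = 0.0
    for rank, (pair, p) in enumerate(ordered):
        scaled = min(1.0, (m - rank) * p)  # smallest p is multiplied by m, next by m - 1, ...
        running_max = max(running_max, scaled)  # adjusted values never decrease
        adjusted[pair] = running_max
    return adjusted


# Illustrative raw p-values from hypothetical pairwise comparisons.
raw = {("RS", "DFA"): 0.01, ("RS", "GPH"): 0.04, ("DFA", "GPH"): 0.03}
adjusted = holm_adjust(raw)
alpha = 0.05
significant = {pair: p_adj <= alpha for pair, p_adj in adjusted.items()}
print(adjusted)
print(significant)
```

With these inputs only the smallest raw p-value stays below alpha = 0.05 after adjustment, which shows how the correction guards against spurious pairwise differences when many estimators are compared.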

_compute_stratified_metrics(results: Dict[str, Any], data_length: int, contamination_type: str | None, contamination_level: float) → Dict[str, Any]

Produce stratified summaries across H bands, tail classes, data length, and contamination regime.

_categorise_hurst_band(hurst_value: float | None) → str

Assign H estimates to qualitative persistence bands.
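
The band boundaries used by the engine are not listed on this page. The sketch below shows one plausible banding based on the conventional reading of the Hurst exponent (H < 0.5 anti-persistent, H near 0.5 uncorrelated, H > 0.5 persistent). The band labels and the `tolerance` parameter are assumptions, not the library's values.

```python
from typing import Optional


def categorise_hurst_band(hurst_value: Optional[float], tolerance: float = 0.05) -> str:
    """Map an H estimate to a qualitative band (hypothetical labels and cut-offs)."""
    if hurst_value is None or hurst_value != hurst_value:  # None, or NaN (NaN != NaN)
        return "unknown"
    if hurst_value < 0.5 - tolerance:
        return "anti-persistent"
    if hurst_value <= 0.5 + tolerance:
        return "near-uncorrelated"
    return "persistent"


for h in (0.3, 0.5, 0.8, None):
    print(h, "->", categorise_hurst_band(h))
```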

_categorise_length_band(data_length: int | None) → str

Bucket data length into interpretable regimes.

_extract_scale_data(result: Dict[str, Any], estimator: Any) → Tuple[ndarray | None, ndarray | None]

Extract scale and statistics data from estimator result for diagnostics.

Parameters:
  • result (dict) – Estimator result dictionary

  • estimator (BaseEstimator) – Estimator instance

Returns:

(scales, statistics) arrays or (None, None) if unavailable

Return type:

tuple

_infer_estimator_family(estimator_name: str) → str

Infer the family (classical, ML, neural) from estimator name.

Parameters:

estimator_name (str) – Name of the estimator

Returns:

Estimator family

Return type:

str

_infer_tail_class(model_name: str | None, data_params: Dict[str, Any] | None = None) → str

Infer a qualitative tail/heaviness class based on the data model.

_build_provenance_bundle(summary: Dict[str, Any]) → Dict[str, Any]

Construct a comprehensive provenance bundle using ProvenanceTracker.

This bundle includes all settings needed to reproduce the experiment:

  • Data generation parameters

  • Estimator configuration

  • Preprocessing settings

  • Scale selection parameters

  • Analytics configuration

  • Environment information

_attach_uncertainty_calibration_summary(summary: Dict[str, Any], lookback_days: int = 90) → None

Augment benchmark summaries with uncertainty calibration diagnostics.

_build_result_row_provenance(result: Dict[str, Any], data_params: Dict[str, Any]) → Dict[str, Any]

Build provenance bundle for a single result row.

This creates a lightweight provenance artifact per result that includes:

  • Experiment-level provenance (reference)

  • Row-specific parameters (data model, estimator, etc.)

  • Result metadata

_record_uncertainty_event(estimator_name: str, data_model: str | None, uncertainty: Any, estimate: float | None, true_value: float | None, data_length: int, estimator_family: str | None) → None

Persist uncertainty calibration data via the error analyzer.

run_comprehensive_benchmark(data_length: int = 1000, benchmark_type: str = 'comprehensive', contamination_type: str | None = None, contamination_level: float = 0.1, save_results: bool = True) → Dict[str, Any]

Run comprehensive benchmark across all estimators and data models.

Parameters:
  • data_length (int) – Length of test data to generate

  • benchmark_type (str) – Type of benchmark to run:
      - ‘comprehensive’: All estimators (default)
      - ‘classical’: Only classical statistical estimators
      - ‘ML’: Only machine learning estimators (non-neural)
      - ‘neural’: Only neural network estimators

  • contamination_type (str, optional) – Type of contamination to apply to the data

  • contamination_level (float) – Level/intensity of contamination (0.0 to 1.0)

  • save_results (bool) – Whether to save results to file

Returns:

Comprehensive benchmark results

Return type:

dict

run_classical_benchmark(data_length: int = 1000, contamination_type: str | None = None, contamination_level: float = 0.1, save_results: bool = True) → Dict[str, Any]

Run benchmark with only classical statistical estimators.

run_ml_benchmark(data_length: int = 1000, contamination_type: str | None = None, contamination_level: float = 0.1, save_results: bool = True) → Dict[str, Any]

Run benchmark with only machine learning estimators (non-neural).

run_neural_benchmark(data_length: int = 1000, contamination_type: str | None = None, contamination_level: float = 0.1, save_results: bool = True) → Dict[str, Any]

Run benchmark with only neural network estimators.

run_classical_estimators(data_models: list | None = None, n_samples: int = 1000, n_trials: int = 10, save_results: bool = True) → Dict[str, Any]

Backward-compatible alias for run_classical_benchmark.

This method maintains the old API for compatibility with existing code.

run_advanced_metrics_benchmark(data_length: int = 1000, benchmark_type: str = 'comprehensive', n_monte_carlo: int = 100, convergence_threshold: float = 1e-06, save_results: bool = True) → Dict[str, Any]

Run advanced metrics benchmark focusing on convergence and bias analysis.

Parameters:
  • data_length (int) – Length of test data to generate

  • benchmark_type (str) – Type of benchmark to run

  • n_monte_carlo (int) – Number of Monte Carlo simulations for bias analysis

  • convergence_threshold (float) – Threshold for convergence detection

  • save_results (bool) – Whether to save results to file

Returns:

Advanced metrics benchmark results

Return type:

dict

save_advanced_results(results: Dict[str, Any]) → None

Save advanced benchmark results to files.

print_advanced_summary(summary: Dict[str, Any]) → None

Print advanced benchmark summary.

save_results(results: Dict[str, Any]) → None

Save benchmark results to files.

print_summary(summary: Dict[str, Any]) → None

Print benchmark summary.

export_results(results: Dict[str, Any], output_path: str) → None

Export benchmark results to a file.

Parameters:
  • results (dict) – Benchmark results dictionary

  • output_path (str) – Path to save the results (JSON format)
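
For readers who want to post-process or re-export a results dictionary themselves, a JSON export can be sketched as below. This is a generic standalone helper, not the method's implementation. The function name and the `default=str` fallback for values that JSON cannot encode natively are assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict


def export_results_json(results: Dict[str, Any], output_path: str) -> None:
    """Write a nested results dict to a JSON file, creating parent folders as needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        # default=str stringifies anything json cannot encode (e.g. Path objects).
        json.dump(results, handle, indent=2, default=str)
```

The stringifying fallback keeps the export from failing on unusual values, at the cost of losing their original type on reload.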

Public package import

from lrdbenchmark import ComprehensiveBenchmark resolves to the same class documented above.

Multi-category sweep benchmark

For lighter-weight sweeps that delegate to the classical, ML, and NN benchmark runners (list-of-row results, separate from the engine’s summary dict), use:

class lrdbenchmark.benchmarks.MultiCategoryBenchmark(output_dir: str | None = None, seed: int | None = None)

Bases: BaseBenchmark

Run classical, ML, and NN sweep benchmarks behind one entry point.

This coordinates ClassicalBenchmark, MLBenchmark, and NNBenchmark. For the full diagnostic engine (runtime profiles, stratified metrics, significance tests), use ComprehensiveBenchmark.

__init__(output_dir: str | None = None, seed: int | None = None)

run(models: List[str] = None, lengths: List[int] = None, num_realizations: int = 10, params: Dict[str, Any] = None, run_classical: bool = True, run_ml: bool = True, run_nn: bool = True)

Run selected benchmark categories and aggregate row results.

Usage examples

Basic run (returns a summary dict)

from lrdbenchmark import ComprehensiveBenchmark

benchmark = ComprehensiveBenchmark(runtime_profile="quick")
summary = benchmark.run_comprehensive_benchmark(
    data_length=256,
    benchmark_type="classical",
    save_results=False,
)

print(summary["random_state"])
print(summary.get("stratified_metrics", {}))

Classical-only and profiles

from lrdbenchmark import ComprehensiveBenchmark

# Quick profile: skips heavy diagnostics (see engine docstring)
quick = ComprehensiveBenchmark(runtime_profile="quick")
out_quick = quick.run_classical_benchmark(data_length=512, save_results=False)

# Default profile is "auto" (resolved from the environment and heuristics)
auto = ComprehensiveBenchmark()
out_auto = auto.run_comprehensive_benchmark(
    data_length=1000,
    benchmark_type="comprehensive",
    save_results=True,
)

Inspecting per-model results

run_comprehensive_benchmark returns a dictionary. Per–data-model outcomes live under summary["results"] (keys are model names; values contain estimator_results lists with success flags, estimates, and errors).

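from lrdbenchmark import ComprehensiveBenchmark

benchmark = ComprehensiveBenchmark(runtime_profile="quick")
summary = benchmark.run_comprehensive_benchmark(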
    data_length=512,
    benchmark_type="classical",
    save_results=False,
)
for model_name, block in summary["results"].items():
    if block.get("error"):
        print(model_name, "failed:", block["error"])
        continue
    n_ok = sum(1 for r in block["estimator_results"] if r.get("success"))
    print(f"{model_name}: {n_ok}/{len(block['estimator_results'])} estimators OK")
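
The same traversal can be wrapped in a small helper and tried out on a mock summary without running a benchmark. The helper name and the mock data below are hypothetical. Only the keys described above ("results", "estimator_results", "success", "error") are taken from this page.

```python
from typing import Any, Dict


def success_rates(summary: Dict[str, Any]) -> Dict[str, float]:
    """Fraction of successful estimator rows per data model (0.0 if the model failed)."""
    rates: Dict[str, float] = {}
    for model_name, block in summary.get("results", {}).items():
        rows = block.get("estimator_results", [])
        if block.get("error") or not rows:
            rates[model_name] = 0.0  # data generation failed or nothing was run
            continue
        rates[model_name] = sum(1 for r in rows if r.get("success")) / len(rows)
    return rates


# Mock summary shaped like the structure described above (values are made up).
mock_summary = {
    "results": {
        "fbm": {"estimator_results": [{"success": True}, {"success": False}]},
        "fgn": {"error": "generation failed", "estimator_results": []},
    }
}
print(success_rates(mock_summary))
```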

Multi-category sweep (optional)

from lrdbenchmark.benchmarks import MultiCategoryBenchmark

runner = MultiCategoryBenchmark(output_dir="sweep_results", seed=42)
rows = runner.run(
    models=["fbm", "fgn"],
    lengths=[512],
    num_realizations=3,
    run_classical=True,
    run_ml=True,
    run_nn=False,
)

Best practices

  1. Use data_length ≥ 512 for stable wavelet and spectral estimates when comparing families.

  2. Use runtime_profile="quick" in CI or smoke tests; use "full" for exhaustive diagnostics, or keep the default "auto" to let the engine choose.

  3. Set LRDBENCHMARK_AUTO_CPU=1 before importing lrdbenchmark to restrict JAX/CUDA device visibility to CPU only when you need a deterministic, GPU-free environment.

  4. Handle failed estimator rows via the success flag on each result entry.
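
The environment variable from item 3 can also be set from Python, as long as this happens before the first import of the package in the process. The snippet below is a minimal sketch. The try/except guard is only there so that it also runs where lrdbenchmark is not installed.

```python
import os

# Must run before the first `import lrdbenchmark` in this process;
# setting it after the import would come too late to have an effect.
os.environ["LRDBENCHMARK_AUTO_CPU"] = "1"

try:
    from lrdbenchmark import ComprehensiveBenchmark
except ImportError:
    ComprehensiveBenchmark = None  # package not installed in this environment

print("LRDBENCHMARK_AUTO_CPU =", os.environ["LRDBENCHMARK_AUTO_CPU"])
```

In shell-driven workflows the equivalent is to export the variable before launching Python, which avoids any dependence on import order.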

Note

Earlier documentation referred to BenchmarkResult, EstimatorResult, and BenchmarkConfig helpers; the current engine returns structured dict summaries. Prefer the keys documented on run_comprehensive_benchmark().