Benchmark API

lrdbenchmark provides a comprehensive benchmarking framework for evaluating and comparing all 20 long-range dependence estimators (13 classical, 3 machine-learning, 4 neural), plus optional entropy-based estimators in the classical set.

Comprehensive benchmark engine

The primary entry point for publication-style runs (runtime profiles, stratified metrics, significance tests, optional JSON export) is ComprehensiveBenchmark.

class lrdbenchmark.analysis.benchmark.ComprehensiveBenchmark(output_dir: str | None = None, runtime_profile: str = 'auto')[source]

Bases: object

Comprehensive benchmark class for testing all estimators and data models.

__init__(output_dir: str | None = None, runtime_profile: str = 'auto')[source]

Initialize the benchmark system.

Parameters:

output_dir (str, optional) – Directory to save benchmark results
runtime_profile (str, optional) – Runtime profile to control computational intensity. Options:
- "auto": determine automatically (default)
- "quick": minimise expensive diagnostics (useful for tests)
- "full": enable all diagnostics and resampling routines

_resolve_runtime_profile(runtime_profile: str | None) → str[source]: Determine the runtime profile controlling benchmark intensity.

_load_protocol_config(path: Path) → Dict[str, Any][source]: Load benchmark protocol configuration from YAML/JSON file.
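
As a rough illustration of what loading a YAML/JSON protocol file can involve, the sketch below dispatches on the file extension. It is not the engine's implementation: the function name, the extension-based dispatch, the top-level mapping check, and the example keys (runtime_profile, n_trials) are assumptions made for this example. YAML support here relies on PyYAML's yaml.safe_load, imported lazily.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_protocol_config(path: Path) -> Dict[str, Any]:
    """Hypothetical loader: parse a protocol file as YAML or JSON by extension."""
    text = path.read_text(encoding="utf-8")
    if path.suffix.lower() in {".yaml", ".yml"}:
        import yaml  # PyYAML; imported lazily so JSON-only setups need no extra dependency
        loaded = yaml.safe_load(text)
    else:
        loaded = json.loads(text)
    if not isinstance(loaded, dict):
        # Assumption: a protocol file is a mapping at the top level.
        raise ValueError(f"{path} must contain a mapping at the top level")
    return loaded


# Usage: write a small JSON protocol to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "protocol.json"
    config_path.write_text('{"runtime_profile": "quick", "n_trials": 3}', encoding="utf-8")
    config = load_protocol_config(config_path)
```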

_deep_merge_dicts(base: Dict[str, Any], updates: Dict[str, Any]) → Dict[str, Any][source]: Recursively merge dictionaries without mutating inputs.
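
A minimal sketch of a recursive, non-mutating merge, to show the behaviour the description implies. The function body and the configuration keys (estimators, dfa, min_scale, max_scale, seed) are illustrative assumptions, not taken from the engine.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge_dicts(base: Dict[str, Any], updates: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with `updates` merged recursively into `base`."""
    merged = deepcopy(base)  # copy so the caller's `base` is never mutated
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge their contents key by key.
            merged[key] = deep_merge_dicts(merged[key], value)
        else:
            # Otherwise the update wins; copy it so `updates` stays untouched.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults and protocol overrides.
defaults = {"estimators": {"dfa": {"min_scale": 4, "max_scale": 64}}, "seed": 0}
protocol = {"estimators": {"dfa": {"max_scale": 128}}}
merged = deep_merge_dicts(defaults, protocol)
```

Nested keys missing from the overrides (min_scale, seed) survive the merge, and both input dictionaries are left unchanged.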

_initialize_all_estimators() → Dict[str, Dict[str, Any]][source]: Initialize all available estimators organized by category.

_apply_estimator_overrides(estimators: Dict[str, Dict[str, Any]], overrides: Dict[str, Dict[str, Any]]) → Dict[str, Dict[str, Any]][source]: Apply protocol-defined parameter overrides to initialized estimators.

_initialize_data_models() → Dict[str, Any][source]: Initialize all available data models.

_initialize_contamination_models() → Dict[str, Any][source]: Initialize all available contamination models.

get_estimators_by_type(benchmark_type: str = 'comprehensive', data_length: int = 1000) → Dict[str, Any][source]

Get estimators based on the specified benchmark type.

Parameters:

benchmark_type (str) – Type of benchmark to run:
- 'comprehensive': All estimators (default)
- 'classical': Only classical statistical estimators
- 'ML': Only machine learning estimators (non-neural)
- 'neural': Only neural network estimators
data_length (int) – Length of data to be tested (used for adaptive wavelet estimators)

Returns:

Dictionary of estimators for the specified type

Return type:

dict

generate_test_data(model_name: str, data_length: int = 1000, **kwargs) → Tuple[ndarray, Dict[str, Any]][source]

Generate test data using specified model.

Parameters:

model_name (str) – Name of the data model to use
data_length (int) – Length of data to generate
**kwargs (dict) – Additional parameters for the data model

Returns:

(data, parameters)

Return type:

tuple

apply_contamination(data: ndarray, contamination_type: str, contamination_level: float = 0.1, **kwargs) → Tuple[ndarray, Dict[str, Any]][source]

Apply contamination to the data.

Parameters:

data (np.ndarray) – Original clean data
contamination_type (str) – Type of contamination to apply
contamination_level (float) – Level/intensity of contamination (0.0 to 1.0)
**kwargs (dict) – Additional parameters for specific contamination types

Returns:

(contaminated_data, contamination_info)

Return type:

tuple

run_single_estimator_test(estimator_name: str, data: ndarray, true_params: Dict[str, Any]) → Dict[str, Any][source]

Run a single estimator test.

Parameters:

estimator_name (str) – Name of the estimator to test
data (np.ndarray) – Test data
true_params (dict) – True parameters of the data

Returns:

Test results

Return type:

dict

_calculate_monte_carlo_mse(estimator, data: ndarray, true_value: float, n_simulations: int = 50) → Dict[str, Any][source]

Calculate mean signed error using Monte Carlo simulations.

Parameters:

estimator (BaseEstimator) – Estimator instance
data (np.ndarray) – Original dataset
true_value (float) – True parameter value
n_simulations (int) – Number of Monte Carlo simulations

Returns:

Mean signed error analysis results

Return type:

dict
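
The method name says mse while the docstring describes a mean signed error; the sketch below follows the docstring. It shows one generic way to estimate signed error (bias) by Monte Carlo simulation and is not the engine's procedure: the simulate and estimate callables, the seed argument, and the returned keys are hypothetical, and a sample mean on Gaussian noise stands in for a real Hurst estimator.

```python
import random
import statistics
from typing import Callable, Dict, List


def monte_carlo_signed_error(
    estimate: Callable[[List[float]], float],
    simulate: Callable[[random.Random], List[float]],
    true_value: float,
    n_simulations: int = 50,
    seed: int = 0,
) -> Dict[str, float]:
    """Average (estimate - true_value) over independently simulated series."""
    rng = random.Random(seed)  # seeded so repeated runs give the same report
    errors = [estimate(simulate(rng)) - true_value for _ in range(n_simulations)]
    return {
        "mean_signed_error": statistics.fmean(errors),  # sign shows over/underestimation
        "std_signed_error": statistics.stdev(errors),   # spread across simulations
        "n_simulations": n_simulations,
    }


# Toy stand-ins: the "estimator" is a sample mean, the "model" is Gaussian noise around 0.5.
report = monte_carlo_signed_error(
    estimate=statistics.fmean,
    simulate=lambda rng: [rng.gauss(0.5, 0.1) for _ in range(200)],
    true_value=0.5,
)
```

A mean signed error near zero indicates an unbiased estimator on that model; a consistently positive or negative value indicates systematic over- or underestimation.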

_compute_significance_tests(results: Dict[str, Any], alpha: float = 0.05) → Dict[str, Any][source]

Compute omnibus and post-hoc significance tests across estimators.

Parameters:

results (Dict[str, Any]) – Raw benchmark results grouped by data model.
alpha (float) – Significance level for hypothesis testing.

Returns:

Significance testing outcomes including Friedman statistics and Holm-adjusted pairwise Wilcoxon tests.

Return type:

Dict[str, Any]
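
The Friedman and Wilcoxon statistics are typically obtained from a statistics library (for example scipy.stats.friedmanchisquare and scipy.stats.wilcoxon). The Holm step-down adjustment applied to the pairwise p-values is simple enough to sketch in pure Python. The function below is an illustration under that assumption, not the engine's code; the estimator labels and raw p-values are made up.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def holm_adjust(p_values: Dict[Pair, float]) -> Dict[Pair, float]:
    """Holm step-down adjustment: multiply the k-th smallest p-value by (m - k)."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    adjusted: Dict[Pair, float] = {}
    running_max = 0.0
    for rank, (pair, p) in enumerate(ordered):
        candidate = min(1.0, (m - rank) * p)
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[pair] = running_max
    return adjusted


# Made-up raw p-values for three pairwise comparisons.
raw = {("DFA", "RS"): 0.01, ("DFA", "GPH"): 0.04, ("RS", "GPH"): 0.03}
adjusted = holm_adjust(raw)
significant = {pair: p <= 0.05 for pair, p in adjusted.items()}
```

With these inputs only the DFA versus RS comparison stays below alpha = 0.05 after adjustment.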

_compute_stratified_metrics(results: Dict[str, Any], data_length: int, contamination_type: str | None, contamination_level: float) → Dict[str, Any][source]: Produce stratified summaries across H bands, tail classes, data length, and contamination regime.

_categorise_hurst_band(hurst_value: float | None) → str[source]: Assign H estimates to qualitative persistence bands.
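
For orientation, H < 0.5 indicates anti-persistence, H = 0.5 uncorrelated increments, and H > 0.5 persistence. The sketch below bands values on that basis. The band labels, the tolerance around 0.5, and the handling of missing values are assumptions for illustration; the engine's actual thresholds are not documented here.

```python
from typing import Optional


def categorise_hurst_band(hurst_value: Optional[float], tolerance: float = 0.05) -> str:
    """Map a Hurst value to a qualitative band (hypothetical labels and cut-offs)."""
    if hurst_value is None or hurst_value != hurst_value:  # None or NaN
        return "unknown"
    if hurst_value < 0.5 - tolerance:
        return "anti-persistent"
    if hurst_value <= 0.5 + tolerance:
        return "near-random"
    return "persistent"


bands = [categorise_hurst_band(h) for h in (0.3, 0.5, 0.8, None)]
```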

_categorise_length_band(data_length: int | None) → str[source]: Bucket data length into interpretable regimes.

_extract_scale_data(result: Dict[str, Any], estimator: Any) → Tuple[ndarray | None, ndarray | None][source]

Extract scale and statistics data from estimator result for diagnostics.

Parameters:

result (dict) – Estimator result dictionary
estimator (BaseEstimator) – Estimator instance

Returns:

(scales, statistics) arrays or (None, None) if unavailable

Return type:

tuple

_infer_estimator_family(estimator_name: str) → str[source]

Infer the family (classical, ML, neural) from estimator name.

Parameters:

estimator_name (str) – Name of the estimator

Returns:

Estimator family

Return type:

str

_infer_tail_class(model_name: str | None, data_params: Dict[str, Any] | None = None) → str[source]: Infer a qualitative tail/heaviness class based on the data model.

_build_provenance_bundle(summary: Dict[str, Any]) → Dict[str, Any][source]

Construct a comprehensive provenance bundle using ProvenanceTracker.

This bundle includes all settings needed to reproduce the experiment:
- Data generation parameters
- Estimator configuration
- Preprocessing settings
- Scale selection parameters
- Analytics configuration
- Environment information

_attach_uncertainty_calibration_summary(summary: Dict[str, Any], lookback_days: int = 90) → None[source]: Augment benchmark summaries with uncertainty calibration diagnostics.

_build_result_row_provenance(result: Dict[str, Any], data_params: Dict[str, Any]) → Dict[str, Any][source]

Build provenance bundle for a single result row.

This creates a lightweight provenance artifact per result that includes:
- Experiment-level provenance (reference)
- Row-specific parameters (data model, estimator, etc.)
- Result metadata

_record_uncertainty_event(estimator_name: str, data_model: str | None, uncertainty: Any, estimate: float | None, true_value: float | None, data_length: int, estimator_family: str | None) → None[source]: Persist uncertainty calibration data via the error analyzer.

run_comprehensive_benchmark(data_length: int = 1000, benchmark_type: str = 'comprehensive', contamination_type: str | None = None, contamination_level: float = 0.1, save_results: bool = True) → Dict[str, Any][source]

Run comprehensive benchmark across all estimators and data models.

Parameters:

data_length (int) – Length of test data to generate
benchmark_type (str) – Type of benchmark to run:
- 'comprehensive': All estimators (default)
- 'classical': Only classical statistical estimators
- 'ML': Only machine learning estimators (non-neural)
- 'neural': Only neural network estimators
contamination_type (str, optional) – Type of contamination to apply to the data
contamination_level (float) – Level/intensity of contamination (0.0 to 1.0)
save_results (bool) – Whether to save results to file

Returns:

Comprehensive benchmark results

Return type:

dict

run_classical_benchmark(data_length: int = 1000, contamination_type: str | None = None, contamination_level: float = 0.1, save_results: bool = True) → Dict[str, Any][source]: Run benchmark with only classical statistical estimators.

run_ml_benchmark(data_length: int = 1000, contamination_type: str | None = None, contamination_level: float = 0.1, save_results: bool = True) → Dict[str, Any][source]: Run benchmark with only machine learning estimators (non-neural).

run_neural_benchmark(data_length: int = 1000, contamination_type: str | None = None, contamination_level: float = 0.1, save_results: bool = True) → Dict[str, Any][source]: Run benchmark with only neural network estimators.

run_classical_estimators(data_models: list | None = None, n_samples: int = 1000, n_trials: int = 10, save_results: bool = True) → Dict[str, Any][source]

Backward-compatible alias for run_classical_benchmark.

This method maintains the old API for compatibility with existing code.

run_advanced_metrics_benchmark(data_length: int = 1000, benchmark_type: str = 'comprehensive', n_monte_carlo: int = 100, convergence_threshold: float = 1e-06, save_results: bool = True) → Dict[str, Any][source]

Run advanced metrics benchmark focusing on convergence and bias analysis.

Parameters:

data_length (int) – Length of test data to generate
benchmark_type (str) – Type of benchmark to run
n_monte_carlo (int) – Number of Monte Carlo simulations for bias analysis
convergence_threshold (float) – Threshold for convergence detection
save_results (bool) – Whether to save results to file

Returns:

Advanced metrics benchmark results

Return type:

dict

save_advanced_results(results: Dict[str, Any]) → None[source]: Save advanced benchmark results to files.

print_advanced_summary(summary: Dict[str, Any]) → None[source]: Print advanced benchmark summary.

save_results(results: Dict[str, Any]) → None[source]: Save benchmark results to files.

print_summary(summary: Dict[str, Any]) → None[source]: Print benchmark summary.

export_results(results: Dict[str, Any], output_path: str) → None[source]

Export benchmark results to a file.

Parameters:

results (dict) – Benchmark results dictionary
output_path (str) – Path to save the results (JSON format)
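
Result dictionaries often contain values that the json module cannot encode directly (paths, timestamps, arrays). The sketch below shows one way to write such a dictionary to JSON with the standard library; it is an assumption about what a JSON export needs, not the engine's export_results implementation, and the example keys are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def export_results_json(results: Dict[str, Any], output_path: str) -> None:
    """Write `results` as indented JSON, creating parent directories as needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        # default=str converts values json cannot encode (e.g. Path objects) to strings.
        json.dump(results, handle, indent=2, default=str)


# Usage: export to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "benchmark.json"
    export_results_json({"data_length": 256, "output_dir": Path("results")}, str(target))
    reloaded = json.loads(target.read_text(encoding="utf-8"))
```

With default=str, NumPy arrays would be written as their string representation; converting them with .tolist() before export keeps them as JSON lists.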

Public package import

from lrdbenchmark import ComprehensiveBenchmark resolves to the same class documented above.

Multi-category sweep benchmark

For lighter-weight sweeps that delegate to the classical, ML, and NN benchmark runners (list-of-row results, separate from the engine’s summary dict), use:

class lrdbenchmark.benchmarks.MultiCategoryBenchmark(output_dir: str | None = None, seed: int | None = None)[source]

Bases: BaseBenchmark

Run classical, ML, and NN sweep benchmarks behind one entry point.

This coordinates ClassicalBenchmark, MLBenchmark, and NNBenchmark. For the full diagnostic engine (runtime profiles, stratified metrics, significance tests), use ComprehensiveBenchmark.

__init__(output_dir: str | None = None, seed: int | None = None)[source]

run(models: List[str] = None, lengths: List[int] = None, num_realizations: int = 10, params: Dict[str, Any] = None, run_classical: bool = True, run_ml: bool = True, run_nn: bool = True)[source]: Run selected benchmark categories and aggregate row results.

Usage examples

Basic run (returns a summary `dict`)

from lrdbenchmark import ComprehensiveBenchmark

benchmark = ComprehensiveBenchmark(runtime_profile="quick")
summary = benchmark.run_comprehensive_benchmark(
    data_length=256,
    benchmark_type="classical",
    save_results=False,
)

print(summary["random_state"])
print(summary.get("stratified_metrics", {}))

Classical-only and profiles

from lrdbenchmark import ComprehensiveBenchmark

# Quick profile: skips heavy diagnostics (see engine docstring)
quick = ComprehensiveBenchmark(runtime_profile="quick")
out_quick = quick.run_classical_benchmark(data_length=512, save_results=False)

# Default engine profile is "auto" (defers to environment / heuristics)
auto = ComprehensiveBenchmark()
out_auto = auto.run_comprehensive_benchmark(
    data_length=1000,
    benchmark_type="comprehensive",
    save_results=True,
)

Inspecting per-model results

run_comprehensive_benchmark returns a dictionary. Per-model outcomes live under summary["results"]: keys are data model names, and each value holds an estimator_results list whose entries carry success flags, estimates, and errors.

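from lrdbenchmark import ComprehensiveBenchmark

benchmark = ComprehensiveBenchmark(runtime_profile="quick")
summary = benchmark.run_comprehensive_benchmark(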
    data_length=512,
    benchmark_type="classical",
    save_results=False,
)
for model_name, block in summary["results"].items():
    if block.get("error"):
        print(model_name, "failed:", block["error"])
        continue
    n_ok = sum(1 for r in block["estimator_results"] if r.get("success"))
    print(f"{model_name}: {n_ok}/{len(block['estimator_results'])} estimators OK")
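
The same traversal can be turned into a per-estimator success rate across data models. The sketch below runs on a hand-built dictionary that mimics the shape described above, so it needs no installed package. The results, estimator_results, success, and block-level error keys follow this page; the per-row "estimator" key, the estimator labels, and the "arfima" model name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict


def success_rate_by_estimator(summary: Dict[str, Any]) -> Dict[str, float]:
    """Fraction of successful runs per estimator, pooled over data models."""
    counts = defaultdict(lambda: [0, 0])  # name -> [successes, attempts]
    for block in summary.get("results", {}).values():
        if block.get("error"):
            continue  # the whole data model failed; no estimator rows to count
        for row in block.get("estimator_results", []):
            name = row.get("estimator", "unknown")  # assumed key for the estimator name
            counts[name][1] += 1
            if row.get("success"):
                counts[name][0] += 1
    return {name: ok / total for name, (ok, total) in counts.items()}


# Hand-built stand-in for a benchmark summary.
mock_summary = {
    "results": {
        "fbm": {"estimator_results": [
            {"estimator": "DFA", "success": True},
            {"estimator": "RS", "success": False},
        ]},
        "fgn": {"estimator_results": [
            {"estimator": "DFA", "success": True},
            {"estimator": "RS", "success": True},
        ]},
        "arfima": {"error": "generation failed"},
    }
}
rates = success_rate_by_estimator(mock_summary)
```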

Multi-category sweep (optional)

from lrdbenchmark.benchmarks import MultiCategoryBenchmark

runner = MultiCategoryBenchmark(output_dir="sweep_results", seed=42)
rows = runner.run(
    models=["fbm", "fgn"],
    lengths=[512],
    num_realizations=3,
    run_classical=True,
    run_ml=True,
    run_nn=False,
)

Best practices

Use data_length ≥ 512 for stable wavelet and spectral estimates when comparing families.
Use runtime_profile="quick" in CI or smoke tests; use "full" for exhaustive diagnostics, or keep the default "auto" to let the engine choose.
Set LRDBENCHMARK_AUTO_CPU=1 before importing the package to restrict JAX/CUDA device visibility to the CPU when you need a deterministic, GPU-free environment.
Handle failed estimator rows via the success flag on each result entry.
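
The ordering requirement for LRDBENCHMARK_AUTO_CPU can be sketched as follows: the variable is set in the process environment before the first import of the package. The guarded import is only there so the snippet also runs where lrdbenchmark is not installed; how the package reads the variable internally is not shown here.

```python
import os

# Must run before the first `import lrdbenchmark` anywhere in the process.
os.environ["LRDBENCHMARK_AUTO_CPU"] = "1"

try:
    from lrdbenchmark import ComprehensiveBenchmark
except ImportError:
    # Package not installed in this environment; the ordering above is the point.
    ComprehensiveBenchmark = None

cpu_only_requested = os.environ.get("LRDBENCHMARK_AUTO_CPU") == "1"
```

Setting the variable in the shell or CI configuration (rather than in Python) avoids any risk of an earlier import having already initialised the backend.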

Note

Earlier documentation referred to BenchmarkResult, EstimatorResult, and BenchmarkConfig helpers; the current engine returns structured dict summaries. Prefer the keys documented on run_comprehensive_benchmark().
