.. _leaderboard_significance: Leaderboard Significance Analysis ================================= This example explains how the leaderboard workflow now embeds non-parametric significance testing in line with Brigato et al.'s call for statistically defensible comparisons. Workflow Summary ---------------- * Run the benchmark sweeps with :class:`lrdbenchmark.analysis.benchmark.ComprehensiveBenchmark`. * Collect leaderboard tables from the generated CSV or directly inside the ``notebooks/markdown/05_leaderboard_generation.md`` notebook (or the companion tutorial :doc:`/tutorials/tutorial_05_leaderboards`). * Inspect the ``significance_analysis`` field returned with each benchmarking summary to obtain omnibus and post-hoc test statistics. * Adjust the protocol in ``config/benchmark_protocol.yaml`` to standardise preprocessing, scale-selection, and estimator overrides across runs. Code Snippet ------------ .. code-block:: python from lrdbenchmark.analysis.benchmark import ComprehensiveBenchmark import pandas as pd benchmark = ComprehensiveBenchmark() results = benchmark.run_comprehensive_benchmark( data_length=1000, benchmark_type="comprehensive", save_results=False, ) significance = results.get("significance_analysis", {}) if significance.get("status") == "ok": friedman = significance["friedman"] print( f"Friedman χ²={friedman['statistic']:.4f} " f"(p={friedman['p_value']:.4f}) across " f"{friedman['n_data_models']} data models and " f"{friedman['n_estimators']} estimators" ) mean_ranks = ( pd.DataFrame( list(significance["mean_ranks"].items()), columns=["Estimator", "Mean Rank"], ).sort_values("Mean Rank") ) print(mean_ranks.to_string(index=False)) pairwise = pd.DataFrame( [ { "Estimator A": res["pair"][0], "Estimator B": res["pair"][1], "Holm p-value": res.get("holm_p_value"), "Significant": res.get("significant"), } for res in significance["post_hoc"] if res.get("p_value") is not None ] ) if not pairwise.empty: print("\nHolm-corrected Wilcoxon comparisons:") print(pairwise.sort_values("Holm p-value").to_string(index=False)) else: print(significance.get("reason", "No significance analysis available.")) Output Interpretation --------------------- * **Friedman χ², p-value** – ombuds the null hypothesis that all estimators perform equally across the assessed data models. * **Mean Rank Table** – lower mean ranks indicate better aggregate performance. * **Holm-corrected Wilcoxon** – highlights pairwise differences that remain statistically significant after controlling the family-wise error rate. * **Coverage / CI width** – the benchmark automatically reports empirical coverage rates and mean 95% interval widths so calibration quality can be assessed alongside raw error. * **Stratified metrics** – the ``stratified_metrics`` payload provides error, coverage, and confidence-width summaries across H bands, tail classes, data lengths, and contamination regimes to prevent regime averaging. * **Robustness panels** – advanced benchmark runs embed scaling influence diagnostics and stress tests for missingness, regime shifts, seasonal drift, and burst noise so protocol choices are stress-tested alongside leaderboard scores. Additional Stratified Reporting ------------------------------- The JSON artefact saved by :class:`~lrdbenchmark.analysis.benchmark.ComprehensiveBenchmark` now includes a ``stratified_metrics`` section. To produce a publishable summary: .. code-block:: python from lrdbenchmark.analytics.dashboard import AnalyticsDashboard dashboard = AnalyticsDashboard() print(dashboard.generate_stratified_report("benchmark_results/comprehensive_benchmark_latest.json")) The dashboard can also render dedicated figures showcasing scaling slopes and robustness stress responses captured during advanced benchmarks: .. code-block:: python dashboard.create_advanced_diagnostics_visuals( "benchmark_results/advanced_benchmark_latest.json", output_dir="benchmark_results/figures", ) Best Practices -------------- * Ensure each estimator succeeds on all benchmarked data models; otherwise, the significance module drops incomplete rows to maintain valid paired tests. * Record the provenance bundle saved alongside the JSON benchmark artefact so that statistical claims remain reproducible. * Present the mean-rank and Holm-adjusted tables alongside error metrics in manuscripts to prevent "champion" narratives that rely on marginal, non-significant improvements.