Leaderboard Significance Analysis
This example explains how the leaderboard workflow now embeds non-parametric significance testing in line with Brigato et al.’s call for statistically defensible comparisons.
Workflow Summary
Run the benchmark sweeps with lrdbenchmark.analysis.benchmark.ComprehensiveBenchmark.
Collect leaderboard tables from the generated CSV or directly inside the notebooks/markdown/05_leaderboard_generation.md notebook (or the companion tutorial Leaderboard Generation).
Inspect the significance_analysis field returned with each benchmarking summary to obtain omnibus and post-hoc test statistics.
Adjust the protocol in config/benchmark_protocol.yaml to standardise preprocessing, scale-selection, and estimator overrides across runs.
Code Snippet
from lrdbenchmark.analysis.benchmark import ComprehensiveBenchmark
import pandas as pd

benchmark = ComprehensiveBenchmark()
results = benchmark.run_comprehensive_benchmark(
    data_length=1000,
    benchmark_type="comprehensive",
    save_results=False,
)

significance = results.get("significance_analysis", {})
if significance.get("status") == "ok":
    friedman = significance["friedman"]
    print(
        f"Friedman χ²={friedman['statistic']:.4f} "
        f"(p={friedman['p_value']:.4f}) across "
        f"{friedman['n_data_models']} data models and "
        f"{friedman['n_estimators']} estimators"
    )

    mean_ranks = (
        pd.DataFrame(
            list(significance["mean_ranks"].items()),
            columns=["Estimator", "Mean Rank"],
        ).sort_values("Mean Rank")
    )
    print(mean_ranks.to_string(index=False))

    pairwise = pd.DataFrame(
        [
            {
                "Estimator A": res["pair"][0],
                "Estimator B": res["pair"][1],
                "Holm p-value": res.get("holm_p_value"),
                "Significant": res.get("significant"),
            }
            for res in significance["post_hoc"]
            if res.get("p_value") is not None
        ]
    )
    if not pairwise.empty:
        print("\nHolm-corrected Wilcoxon comparisons:")
        print(pairwise.sort_values("Holm p-value").to_string(index=False))
else:
    print(significance.get("reason", "No significance analysis available."))
Output Interpretation
Friedman χ², p-value – omnibus test of the null hypothesis that all estimators perform equally across the assessed data models.
Mean Rank Table – lower mean ranks indicate better aggregate performance.
Holm-corrected Wilcoxon – highlights pairwise differences that remain statistically significant after controlling the family-wise error rate.
Coverage / CI width – the benchmark automatically reports empirical coverage rates and mean 95% interval widths so calibration quality can be assessed alongside raw error.
Stratified metrics – the stratified_metrics payload provides error, coverage, and confidence-width summaries across H bands, tail classes, data lengths, and contamination regimes, so that pooled averages do not mask regime-specific behaviour.
Robustness panels – advanced benchmark runs embed scaling influence diagnostics and stress tests for missingness, regime shifts, seasonal drift, and burst noise so protocol choices are stress-tested alongside leaderboard scores.
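To make the omnibus and post-hoc entries above concrete, the following is a minimal sketch of the same family of tests using SciPy directly. It is not the library's internal implementation: the synthetic error matrix, the estimator names (est_A, est_B, est_C), and the hand-written Holm step-down adjustment are all illustrative assumptions.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical error matrix: rows are data models, columns are estimators.
rng = np.random.default_rng(0)
estimators = ["est_A", "est_B", "est_C"]  # placeholder names
base = rng.uniform(0.05, 0.15, size=(12, 1))   # per-data-model difficulty
offsets = np.array([0.00, 0.02, 0.08])         # assumed estimator-specific bias
errors = base + offsets + rng.normal(0.0, 0.005, size=(12, 3))

# Omnibus Friedman test: one sample per estimator, paired by data model.
statistic, p_value = stats.friedmanchisquare(*errors.T)

# Mean ranks: rank estimators within each data model (rank 1 = lowest error).
ranks = stats.rankdata(errors, axis=1)
mean_ranks = dict(zip(estimators, ranks.mean(axis=0)))

# Post-hoc pairwise Wilcoxon signed-rank tests on the paired errors.
pairs = list(combinations(range(len(estimators)), 2))
raw_p = np.array(
    [stats.wilcoxon(errors[:, i], errors[:, j]).pvalue for i, j in pairs]
)

# Holm step-down adjustment: scale the k-th smallest p-value by (m - k + 1),
# enforce monotonicity with a running maximum, and cap at 1.
m = len(raw_p)
order = np.argsort(raw_p)
scaled = (m - np.arange(m)) * raw_p[order]
holm_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
holm_p = np.empty(m)
holm_p[order] = holm_sorted

print(f"Friedman statistic={statistic:.4f}, p={p_value:.4f}")
print("Mean ranks:", mean_ranks)
for (i, j), p_adj in zip(pairs, holm_p):
    print(f"{estimators[i]} vs {estimators[j]}: Holm p={p_adj:.4f}")
```

The pairwise tests are only meaningful when the omnibus test rejects the null, which is why the snippet in the Code Snippet section prints the Friedman result first.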
Additional Stratified Reporting
The JSON artefact saved by ComprehensiveBenchmark now includes a stratified_metrics section. To produce a publishable summary:
from lrdbenchmark.analytics.dashboard import AnalyticsDashboard
dashboard = AnalyticsDashboard()
print(dashboard.generate_stratified_report("benchmark_results/comprehensive_benchmark_latest.json"))
The dashboard can also render dedicated figures showcasing scaling slopes and robustness stress responses captured during advanced benchmarks:
dashboard.create_advanced_diagnostics_visuals(
    "benchmark_results/advanced_benchmark_latest.json",
    output_dir="benchmark_results/figures",
)
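As a rough illustration of why stratified coverage and interval-width summaries matter, the sketch below computes them per H band from a handful of made-up records. The record fields (true_H, estimate, ci_low, ci_high), the band edges, and the numbers themselves are hypothetical and do not reflect the actual layout of the stratified_metrics section.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-run records: true H, point estimate, and 95% interval bounds.
records = [
    {"true_H": 0.3, "estimate": 0.32, "ci_low": 0.25, "ci_high": 0.39},
    {"true_H": 0.4, "estimate": 0.47, "ci_low": 0.42, "ci_high": 0.52},
    {"true_H": 0.6, "estimate": 0.58, "ci_low": 0.50, "ci_high": 0.66},
    {"true_H": 0.7, "estimate": 0.71, "ci_low": 0.66, "ci_high": 0.76},
    {"true_H": 0.9, "estimate": 0.80, "ci_low": 0.74, "ci_high": 0.86},
]


def h_band(h):
    """Assign an illustrative H band; the benchmark's own band edges may differ."""
    if h < 0.5:
        return "H < 0.5"
    if h < 0.8:
        return "0.5 <= H < 0.8"
    return "H >= 0.8"


def covered(rec):
    """Return 1.0 if the interval contains the true H, else 0.0."""
    return 1.0 if rec["ci_low"] <= rec["true_H"] <= rec["ci_high"] else 0.0


# Group records by band, then summarise each group separately.
groups = defaultdict(list)
for rec in records:
    groups[h_band(rec["true_H"])].append(rec)

summary = {
    band: {
        "n": len(recs),
        "mae": mean(abs(r["estimate"] - r["true_H"]) for r in recs),
        "coverage": mean(covered(r) for r in recs),
        "mean_ci_width": mean(r["ci_high"] - r["ci_low"] for r in recs),
    }
    for band, recs in groups.items()
}

pooled_coverage = mean(covered(r) for r in records)

for band, row in summary.items():
    print(band, row)
print("pooled coverage:", pooled_coverage)
```

In this toy data the pooled coverage looks moderate while one band is never covered, which is the kind of discrepancy a stratified report is meant to surface.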
Best Practices
Ensure each estimator succeeds on all benchmarked data models; otherwise, the significance module drops incomplete rows to maintain valid paired tests.
Record the provenance bundle saved alongside the JSON benchmark artefact so that statistical claims remain reproducible.
Present the mean-rank and Holm-adjusted tables alongside error metrics in manuscripts to prevent “champion” narratives that rely on marginal, non-significant improvements.
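The first practice above, keeping only data models on which every estimator succeeded, can be checked by hand before trusting a paired test. The sketch below uses pandas on a hypothetical long-format results table; the column names, the estimator and data-model labels, and the use of a missing error to mark a failed run are assumptions, and the library's own filtering may be implemented differently.

```python
import pandas as pd

# Hypothetical long-format results; a missing error marks a failed run.
runs = pd.DataFrame(
    {
        "data_model": ["fGn"] * 3 + ["fBm"] * 3 + ["ARFIMA"] * 3,
        "estimator": ["R/S", "DFA", "GPH"] * 3,
        "error": [0.08, 0.05, 0.06, 0.10, None, 0.07, 0.09, 0.04, 0.05],
    }
)

# One row per data model, one column per estimator.
wide = runs.pivot(index="data_model", columns="estimator", values="error")

# Keep only data models where every estimator produced a result, so each
# remaining row is a complete block for paired (Friedman / Wilcoxon) tests.
complete = wide.dropna(axis=0, how="any")
dropped = wide.index.difference(complete.index).tolist()

print(f"Kept {len(complete)} of {len(wide)} data models; dropped: {dropped}")
```

Reporting how many data models were dropped alongside the test statistics makes it clear what the comparison actually covers.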