Leaderboard Generation ====================== This notebook demonstrates how to create comprehensive performance leaderboards from benchmark results, showing how to rank estimators, surface statistical significance, and generate stratified and robustness-aware comparisons. Overview -------- The leaderboard generation system allows you to: 1. **Load Benchmark Results**: Import results from multiple benchmark runs 2. **Create Rankings**: Generate performance rankings across different metrics 3. **Composite Scoring**: Combine multiple metrics into overall scores 4. **Visualization**: Create publication-ready plots, significance overlays, and stratified tables 5. **Stratified Reporting**: Slice results by H regime, tail class, data length, and contamination 6. **Export Results**: Save leaderboards in various formats (CSV/JSON/LaTeX) with provenance metadata Table of Contents ----------------- 1. `Setup and Imports <#tut05-setup>`__ 2. `Loading Benchmark Results <#loading>`__ 3. `Creating Performance Rankings <#rankings>`__ 4. `Composite Scoring System <#scoring>`__ 5. `Visualization and Export <#visualization>`__ 6. `Summary and Next Steps <#tut05-summary>`__ .. _tut05-setup: 1. Setup and Imports -------------------- First, let’s import all necessary libraries and set up the leaderboard generation system. .. code:: ipython3 # Standard scientific computing imports import numpy as np # LRDBenchmark imports - using simplified API from lrdbenchmark import ( # Data models FBMModel, FGNModel, ARFIMAModel, MRWModel, AlphaStableModel, # Classical estimators RSEstimator, DFAEstimator, GPHEstimator, WhittleEstimator, # Machine Learning estimators RandomForestEstimator, SVREstimator, GradientBoostingEstimator, # Neural Network estimators CNNEstimator, LSTMEstimator, GRUEstimator, TransformerEstimator, # GPU utilities gpu_is_available, get_device_info, clear_gpu_cache, monitor_gpu_memory ) import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from scipy import stats import time import warnings import subprocess import gc warnings.filterwarnings('ignore') from lrdbenchmark.random_manager import initialise_global_rng initialise_global_rng(1729) # GPU Memory Management Functions .. parsed-literal:: πŸ” Checking GPU memory status... πŸ–₯️ GPU Memory: 13MB / 8151MB (0.2%) βœ… All imports successful! πŸ† Ready to generate performance leaderboards .. _loading: 2. Loading Benchmark Results ---------------------------- Let’s run comprehensive benchmarks to generate data for our leaderboard, then load and process the results. .. code:: ipython3 # Initialize benchmark system print("πŸ”§ Initializing Benchmark System for Leaderboard Generation...") print("=" * 70) benchmark = ComprehensiveBenchmark(output_dir="leaderboard_results") print(f"Protocol configuration loaded from: {benchmark.protocol_config_path}") # Run comprehensive benchmarks print("\nπŸš€ Running Comprehensive Benchmarks...") print("=" * 70) # Run classical benchmark print("πŸ“Š Running Classical Estimator Benchmark...") classical_results = benchmark.run_classical_benchmark( data_length=1000, save_results=True ) print(f"βœ… Classical benchmark completed!") print(f"Success rate: {classical_results['success_rate']:.1%}") print(f"Total tests: {classical_results['total_tests']}") # Run ML benchmark print("\nπŸ“Š Running ML Estimator Benchmark...") ml_results = benchmark.run_ml_benchmark( data_length=1000, save_results=True ) print(f"βœ… ML benchmark completed!") print(f"Success rate: {ml_results['success_rate']:.1%}") print(f"Total tests: {ml_results['total_tests']}") # Run neural benchmark print("\nπŸ“Š Running Neural Network Benchmark...") neural_results = benchmark.run_neural_benchmark( data_length=1000, save_results=True ) print(f"βœ… Neural benchmark completed!") print(f"Success rate: {neural_results['success_rate']:.1%}") print(f"Total tests: {neural_results['total_tests']}") # Run comprehensive benchmark print("\nπŸ“Š Running Comprehensive Benchmark...") comprehensive_results = benchmark.run_comprehensive_benchmark( data_length=1000, save_results=True ) print(f"βœ… Comprehensive benchmark completed!") print(f"Success rate: {comprehensive_results['success_rate']:.1%}") print(f"Total tests: {comprehensive_results['total_tests']}") print("\n🎯 All benchmarks completed successfully!") .. parsed-literal:: πŸ”§ Initializing Benchmark System for Leaderboard Generation... ====================================================================== βœ… LSTM model initialized with reasonable weights βœ… GRU model initialized with reasonable weights πŸš€ Running Comprehensive Benchmarks... ====================================================================== πŸ“Š Running Classical Estimator Benchmark... πŸš€ Starting LRDBench Benchmark ============================================================ Benchmark Type: CLASSICAL ============================================================ Testing 13 estimators... πŸ“Š Testing with fBm data model... Generated 1000 clean data points πŸ” Testing R/S... βœ… πŸ” Testing DFA... βœ… πŸ” Testing DMA... βœ… πŸ” Testing Higuchi... βœ… πŸ” Testing GPH... βœ… πŸ” Testing Whittle... βœ… πŸ” Testing Periodogram... βœ… πŸ” Testing CWT... βœ… πŸ” Testing WaveletVar... βœ… πŸ” Testing WaveletLogVar... βœ… πŸ” Testing WaveletWhittle... βœ… πŸ” Testing MFDFA... βœ… πŸ” Testing WaveletLeaders... βœ… πŸ“Š Testing with fGn data model... Generated 1000 clean data points πŸ” Testing R/S... βœ… πŸ” Testing DFA... βœ… πŸ” Testing DMA... βœ… πŸ” Testing Higuchi... βœ… πŸ” Testing GPH... βœ… πŸ” Testing Whittle... βœ… πŸ” Testing Periodogram... βœ… πŸ” Testing CWT... βœ… πŸ” Testing WaveletVar... βœ… πŸ” Testing WaveletLogVar... βœ… πŸ” Testing WaveletWhittle... βœ… πŸ” Testing MFDFA... βœ… πŸ” Testing WaveletLeaders... βœ… πŸ“Š Testing with ARFIMAModel data model... Generated 1000 clean data points πŸ” Testing R/S... βœ… πŸ” Testing DFA... βœ… πŸ” Testing DMA... βœ… πŸ” Testing Higuchi... βœ… πŸ” Testing GPH... βœ… πŸ” Testing Whittle... βœ… πŸ” Testing Periodogram... βœ… πŸ” Testing CWT... βœ… πŸ” Testing WaveletVar... βœ… πŸ” Testing WaveletLogVar... βœ… πŸ” Testing WaveletWhittle... βœ… πŸ” Testing MFDFA... βœ… πŸ” Testing WaveletLeaders... βœ… πŸ“Š Testing with MRW data model... Generated 1000 clean data points πŸ” Testing R/S... βœ… πŸ” Testing DFA... βœ… πŸ” Testing DMA... βœ… πŸ” Testing Higuchi... βœ… πŸ” Testing GPH... βœ… πŸ” Testing Whittle... βœ… πŸ” Testing Periodogram... βœ… πŸ” Testing CWT... βœ… πŸ” Testing WaveletVar... βœ… πŸ” Testing WaveletLogVar... βœ… πŸ” Testing WaveletWhittle... βœ… πŸ” Testing MFDFA... βœ… πŸ” Testing WaveletLeaders... βœ… πŸ’Ύ Results saved to: JSON: leaderboard_results/comprehensive_benchmark_20251016_100856.json CSV: leaderboard_results/benchmark_summary_20251016_100856.csv ============================================================ πŸ“Š BENCHMARK SUMMARY ============================================================ Benchmark Type: CLASSICAL Total Tests: 52 Successful: 52 Success Rate: 100.0% Data Models: 4 Estimators: 13 πŸ† TOP PERFORMING ESTIMATORS (Average across all data models): 1. Whittle Avg Error: 0.1000 (Range: 0.0000-0.4000) Avg Time: 0.001s | Data Models: 4 Mean Signed Error: 0.1000 Bias: 33.33% Stability: 0.0000 Estimated H values: fBm: H_est=0.7000, H_true=0.7000 fGn: H_est=0.7000, H_true=0.7000 ARFIMAModel: H_est=0.7000, H_true=0.3000 MRW: H_est=0.7000, H_true=0.7000 2. Periodogram Avg Error: 0.1287 (Range: 0.0080-0.3676) Avg Time: 0.001s | Data Models: 4 Convergence Rate: -0.2191 Mean Signed Error: 0.0899 Bias: 30.35% Stability: 0.2886 Estimated H values: fBm: H_est=0.7080, H_true=0.7000 fGn: H_est=0.7618, H_true=0.7000 ARFIMAModel: H_est=0.6676, H_true=0.3000 MRW: H_est=0.6226, H_true=0.7000 3. R/S Avg Error: 0.1777 (Range: 0.0062-0.4919) Avg Time: 0.797s | Data Models: 4 Convergence Rate: -0.3563 Mean Signed Error: 0.1777 Bias: 48.81% Stability: 0.0641 Estimated H values: fBm: H_est=0.7820, H_true=0.7000 fGn: H_est=0.8305, H_true=0.7000 ARFIMAModel: H_est=0.7919, H_true=0.3000 MRW: H_est=0.7062, H_true=0.7000 4. Higuchi Avg Error: 0.1819 (Range: 0.0373-0.4902) Avg Time: 0.002s | Data Models: 4 Convergence Rate: -0.7034 Mean Signed Error: 0.1818 Bias: 49.32% Stability: 0.1105 Estimated H values: fBm: H_est=0.8073, H_true=0.7000 fGn: H_est=0.7927, H_true=0.7000 ARFIMAModel: H_est=0.7902, H_true=0.3000 MRW: H_est=0.7373, H_true=0.7000 5. DMA Avg Error: 0.1829 (Range: 0.0479-0.4514) Avg Time: 0.001s | Data Models: 4 Convergence Rate: -0.1672 Mean Signed Error: 0.1589 Bias: 44.20% Stability: 0.1522 Estimated H values: fBm: H_est=0.8685, H_true=0.7000 fGn: H_est=0.7639, H_true=0.7000 ARFIMAModel: H_est=0.7514, H_true=0.3000 MRW: H_est=0.6521, H_true=0.7000 πŸ“Š DETAILED PERFORMANCE BY DATA MODEL: fBm: 1. Whittle: Error 0.0000, Time 0.001s 2. Periodogram: Error 0.0080, Time 0.001s 3. R/S: Error 0.0820, Time 2.960s fGn: 1. Whittle: Error 0.0000, Time 0.000s 2. Periodogram: Error 0.0618, Time 0.001s 3. DMA: Error 0.0639, Time 0.001s ARFIMAModel: 1. WaveletLeaders: Error 0.0535, Time 0.013s 2. MFDFA: Error 0.0817, Time 0.103s 3. WaveletWhittle: Error 0.2900, Time 0.007s MRW: 1. Whittle: Error 0.0000, Time 0.000s 2. R/S: Error 0.0062, Time 0.075s 3. Higuchi: Error 0.0373, Time 0.002s 🎯 Benchmark completed successfully! βœ… Classical benchmark completed! Success rate: 100.0% Total tests: 52 πŸ“Š Running ML Estimator Benchmark... πŸš€ Starting LRDBench Benchmark ============================================================ Benchmark Type: ML ============================================================ Testing 3 estimators... πŸ“Š Testing with fBm data model... Generated 1000 clean data points πŸ” Testing RandomForest... βœ… πŸ” Testing GradientBoosting... βœ… πŸ” Testing SVR... βœ… πŸ“Š Testing with fGn data model... Generated 1000 clean data points πŸ” Testing RandomForest... βœ… πŸ” Testing GradientBoosting... βœ… πŸ” Testing SVR... βœ… πŸ“Š Testing with ARFIMAModel data model... Generated 1000 clean data points πŸ” Testing RandomForest... βœ… πŸ” Testing GradientBoosting... βœ… πŸ” Testing SVR... βœ… πŸ“Š Testing with MRW data model... Generated 1000 clean data points πŸ” Testing RandomForest... βœ… πŸ” Testing GradientBoosting... βœ… πŸ” Testing SVR... βœ… πŸ’Ύ Results saved to: JSON: leaderboard_results/comprehensive_benchmark_20251016_100856.json CSV: leaderboard_results/benchmark_summary_20251016_100856.csv ============================================================ πŸ“Š BENCHMARK SUMMARY ============================================================ Benchmark Type: ML Total Tests: 12 Successful: 12 Success Rate: 100.0% Data Models: 4 Estimators: 3 πŸ† TOP PERFORMING ESTIMATORS (Average across all data models): 1. SVR Avg Error: 0.1364 (Range: 0.0147-0.4680) Avg Time: 0.000s | Data Models: 4 Convergence Rate: -0.0095 Mean Signed Error: 0.1291 Bias: 40.73% Stability: 0.0114 Estimated H values: fBm: H_est=0.7387, H_true=0.7000 fGn: H_est=0.7244, H_true=0.7000 ARFIMAModel: H_est=0.7680, H_true=0.3000 MRW: H_est=0.6853, H_true=0.7000 2. GradientBoosting Avg Error: 0.4308 (Range: 0.1471-0.5783) Avg Time: 0.000s | Data Models: 4 Convergence Rate: -0.8088 Mean Signed Error: -0.4308 Bias: -68.54% Stability: 0.1323 Estimated H values: fBm: H_est=0.1418, H_true=0.7000 fGn: H_est=0.1217, H_true=0.7000 ARFIMAModel: H_est=0.1529, H_true=0.3000 MRW: H_est=0.2604, H_true=0.7000 3. RandomForest Avg Error: 0.5000 (Range: 0.2000-0.6000) Avg Time: 0.000s | Data Models: 4 Convergence Rate: -0.3469 Mean Signed Error: -0.5000 Bias: -80.95% Stability: 0.1436 Estimated H values: fBm: H_est=0.1000, H_true=0.7000 fGn: H_est=0.1000, H_true=0.7000 ARFIMAModel: H_est=0.1000, H_true=0.3000 MRW: H_est=0.1000, H_true=0.7000 πŸ“Š DETAILED PERFORMANCE BY DATA MODEL: fBm: 1. SVR: Error 0.0387, Time 0.000s 2. GradientBoosting: Error 0.5582, Time 0.000s 3. RandomForest: Error 0.6000, Time 0.000s fGn: 1. SVR: Error 0.0244, Time 0.000s 2. GradientBoosting: Error 0.5783, Time 0.000s 3. RandomForest: Error 0.6000, Time 0.000s ARFIMAModel: 1. GradientBoosting: Error 0.1471, Time 0.000s 2. RandomForest: Error 0.2000, Time 0.000s 3. SVR: Error 0.4680, Time 0.000s MRW: 1. SVR: Error 0.0147, Time 0.000s 2. GradientBoosting: Error 0.4396, Time 0.000s 3. RandomForest: Error 0.6000, Time 0.000s 🎯 Benchmark completed successfully! βœ… ML benchmark completed! Success rate: 100.0% Total tests: 12 πŸ“Š Running Neural Network Benchmark... πŸš€ Starting LRDBench Benchmark ============================================================ Benchmark Type: NEURAL ============================================================ Testing 4 estimators... πŸ“Š Testing with fBm data model... Generated 1000 clean data points πŸ” Testing CNN... βœ… πŸ” Testing LSTM... βœ… πŸ” Testing GRU... βœ… πŸ” Testing Transformer... ❌ (Expected all tensors to be on the same device, but got mat1 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_addmm)) πŸ“Š Testing with fGn data model... Generated 1000 clean data points πŸ” Testing CNN... βœ… πŸ” Testing LSTM... βœ… πŸ” Testing GRU... βœ… πŸ” Testing Transformer... ❌ (Expected all tensors to be on the same device, but got mat1 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_addmm)) πŸ“Š Testing with ARFIMAModel data model... Generated 1000 clean data points πŸ” Testing CNN... βœ… πŸ” Testing LSTM... βœ… πŸ” Testing GRU... βœ… πŸ” Testing Transformer... ❌ (Expected all tensors to be on the same device, but got mat1 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_addmm)) πŸ“Š Testing with MRW data model... Generated 1000 clean data points πŸ” Testing CNN... βœ… πŸ” Testing LSTM... βœ… πŸ” Testing GRU... βœ… πŸ” Testing Transformer... ❌ (Expected all tensors to be on the same device, but got mat1 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_addmm)) πŸ’Ύ Results saved to: JSON: leaderboard_results/comprehensive_benchmark_20251016_100857.json CSV: leaderboard_results/benchmark_summary_20251016_100857.csv ============================================================ πŸ“Š BENCHMARK SUMMARY ============================================================ Benchmark Type: NEURAL Total Tests: 16 Successful: 12 Success Rate: 75.0% Data Models: 4 Estimators: 4 πŸ† TOP PERFORMING ESTIMATORS (Average across all data models): 1. CNN Avg Error: 0.1975 (Range: 0.1937-0.2049) Avg Time: 0.001s | Data Models: 4 Convergence Rate: 0.0045 Mean Signed Error: -0.0951 Bias: -3.82% Stability: 0.0017 Estimated H values: fBm: H_est=0.5048, H_true=0.7000 fGn: H_est=0.5063, H_true=0.7000 ARFIMAModel: H_est=0.5049, H_true=0.3000 MRW: H_est=0.5037, H_true=0.7000 2. GRU Avg Error: 0.2049 (Range: 0.2031-0.2070) Avg Time: 0.002s | Data Models: 4 Convergence Rate: 4.5144 Mean Signed Error: -0.1026 Bias: -4.93% Stability: 0.0042 Estimated H values: fBm: H_est=0.4952, H_true=0.7000 fGn: H_est=0.4930, H_true=0.7000 ARFIMAModel: H_est=0.5044, H_true=0.3000 MRW: H_est=0.4969, H_true=0.7000 3. LSTM Avg Error: 0.2080 (Range: 0.2041-0.2126) Avg Time: 0.032s | Data Models: 4 Convergence Rate: 5.5809 Mean Signed Error: -0.1017 Bias: -4.41% Stability: 0.0073 Estimated H values: fBm: H_est=0.4928, H_true=0.7000 fGn: H_est=0.4919, H_true=0.7000 ARFIMAModel: H_est=0.5126, H_true=0.3000 MRW: H_est=0.4959, H_true=0.7000 πŸ“Š DETAILED PERFORMANCE BY DATA MODEL: fBm: 1. CNN: Error 0.1952, Time 0.003s 2. GRU: Error 0.2048, Time 0.005s 3. LSTM: Error 0.2072, Time 0.124s fGn: 1. CNN: Error 0.1937, Time 0.001s 2. GRU: Error 0.2070, Time 0.001s 3. LSTM: Error 0.2081, Time 0.001s ARFIMAModel: 1. GRU: Error 0.2044, Time 0.001s 2. CNN: Error 0.2049, Time 0.001s 3. LSTM: Error 0.2126, Time 0.001s MRW: 1. CNN: Error 0.1963, Time 0.001s 2. GRU: Error 0.2031, Time 0.001s 3. LSTM: Error 0.2041, Time 0.001s 🎯 Benchmark completed successfully! βœ… Neural benchmark completed! Success rate: 75.0% Total tests: 16 πŸ“Š Running Comprehensive Benchmark... πŸš€ Starting LRDBench Benchmark ============================================================ Benchmark Type: COMPREHENSIVE ============================================================ Testing 20 estimators... πŸ“Š Testing with fBm data model... Generated 1000 clean data points πŸ” Testing R/S... βœ… πŸ” Testing DFA... βœ… πŸ” Testing DMA... βœ… πŸ” Testing Higuchi... βœ… πŸ” Testing GPH... βœ… πŸ” Testing Whittle... βœ… πŸ” Testing Periodogram... βœ… πŸ” Testing CWT... βœ… πŸ” Testing WaveletVar... βœ… πŸ” Testing WaveletLogVar... βœ… πŸ” Testing WaveletWhittle... βœ… πŸ” Testing MFDFA... βœ… πŸ” Testing WaveletLeaders... βœ… πŸ” Testing RandomForest... βœ… πŸ” Testing GradientBoosting... βœ… πŸ” Testing SVR... βœ… πŸ” Testing CNN... βœ… πŸ” Testing LSTM... βœ… πŸ” Testing GRU... βœ… πŸ” Testing Transformer... ❌ (Expected all tensors to be on the same device, but got mat1 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_addmm)) πŸ“Š Testing with fGn data model... Generated 1000 clean data points πŸ” Testing R/S... βœ… πŸ” Testing DFA... βœ… πŸ” Testing DMA... βœ… πŸ” Testing Higuchi... βœ… πŸ” Testing GPH... βœ… πŸ” Testing Whittle... βœ… πŸ” Testing Periodogram... βœ… πŸ” Testing CWT... βœ… πŸ” Testing WaveletVar... βœ… πŸ” Testing WaveletLogVar... βœ… πŸ” Testing WaveletWhittle... βœ… πŸ” Testing MFDFA... βœ… πŸ” Testing WaveletLeaders... βœ… πŸ” Testing RandomForest... βœ… πŸ” Testing GradientBoosting... βœ… πŸ” Testing SVR... βœ… πŸ” Testing CNN... βœ… πŸ” Testing LSTM... βœ… πŸ” Testing GRU... βœ… πŸ” Testing Transformer... ❌ (Expected all tensors to be on the same device, but got mat1 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_addmm)) πŸ“Š Testing with ARFIMAModel data model... Generated 1000 clean data points πŸ” Testing R/S... βœ… πŸ” Testing DFA... βœ… πŸ” Testing DMA... βœ… πŸ” Testing Higuchi... βœ… πŸ” Testing GPH... βœ… πŸ” Testing Whittle... βœ… πŸ” Testing Periodogram... βœ… πŸ” Testing CWT... βœ… πŸ” Testing WaveletVar... βœ… πŸ” Testing WaveletLogVar... βœ… πŸ” Testing WaveletWhittle... βœ… πŸ” Testing MFDFA... βœ… πŸ” Testing WaveletLeaders... βœ… πŸ” Testing RandomForest... βœ… πŸ” Testing GradientBoosting... βœ… πŸ” Testing SVR... βœ… πŸ” Testing CNN... βœ… πŸ” Testing LSTM... βœ… πŸ” Testing GRU... βœ… πŸ” Testing Transformer... ❌ (Expected all tensors to be on the same device, but got mat1 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_addmm)) πŸ“Š Testing with MRW data model... Generated 1000 clean data points πŸ” Testing R/S... βœ… πŸ” Testing DFA... βœ… πŸ” Testing DMA... βœ… πŸ” Testing Higuchi... βœ… πŸ” Testing GPH... βœ… πŸ” Testing Whittle... βœ… πŸ” Testing Periodogram... βœ… πŸ” Testing CWT... βœ… πŸ” Testing WaveletVar... βœ… πŸ” Testing WaveletLogVar... βœ… πŸ” Testing WaveletWhittle... βœ… πŸ” Testing MFDFA... βœ… πŸ” Testing WaveletLeaders... βœ… πŸ” Testing RandomForest... βœ… πŸ” Testing GradientBoosting... βœ… πŸ” Testing SVR... βœ… πŸ” Testing CNN... βœ… πŸ” Testing LSTM... βœ… πŸ” Testing GRU... βœ… πŸ” Testing Transformer... ❌ (Expected all tensors to be on the same device, but got mat1 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_addmm)) πŸ’Ύ Results saved to: JSON: leaderboard_results/comprehensive_benchmark_20251016_101004.json CSV: leaderboard_results/benchmark_summary_20251016_101004.csv ============================================================ πŸ“Š BENCHMARK SUMMARY ============================================================ Benchmark Type: COMPREHENSIVE Total Tests: 80 Successful: 76 Success Rate: 95.0% Data Models: 4 Estimators: 20 πŸ† TOP PERFORMING ESTIMATORS (Average across all data models): 1. Whittle Avg Error: 0.1000 (Range: 0.0000-0.4000) Avg Time: 0.001s | Data Models: 4 Mean Signed Error: 0.1000 Bias: 33.33% Stability: 0.0000 Estimated H values: fBm: H_est=0.7000, H_true=0.7000 fGn: H_est=0.7000, H_true=0.7000 ARFIMAModel: H_est=0.7000, H_true=0.3000 MRW: H_est=0.7000, H_true=0.7000 2. Periodogram Avg Error: 0.1287 (Range: 0.0080-0.3676) Avg Time: 0.001s | Data Models: 4 Convergence Rate: -0.2191 Mean Signed Error: 0.0899 Bias: 30.35% Stability: 0.2886 Estimated H values: fBm: H_est=0.7080, H_true=0.7000 fGn: H_est=0.7618, H_true=0.7000 ARFIMAModel: H_est=0.6676, H_true=0.3000 MRW: H_est=0.6226, H_true=0.7000 3. SVR Avg Error: 0.1364 (Range: 0.0147-0.4680) Avg Time: 0.000s | Data Models: 4 Convergence Rate: -0.0095 Mean Signed Error: 0.1291 Bias: 40.73% Stability: 0.0114 Estimated H values: fBm: H_est=0.7387, H_true=0.7000 fGn: H_est=0.7244, H_true=0.7000 ARFIMAModel: H_est=0.7680, H_true=0.3000 MRW: H_est=0.6853, H_true=0.7000 4. R/S Avg Error: 0.1777 (Range: 0.0062-0.4919) Avg Time: 0.083s | Data Models: 4 Convergence Rate: -0.3563 Mean Signed Error: 0.1777 Bias: 48.81% Stability: 0.0641 Estimated H values: fBm: H_est=0.7820, H_true=0.7000 fGn: H_est=0.8305, H_true=0.7000 ARFIMAModel: H_est=0.7919, H_true=0.3000 MRW: H_est=0.7062, H_true=0.7000 5. Higuchi Avg Error: 0.1819 (Range: 0.0373-0.4902) Avg Time: 0.002s | Data Models: 4 Convergence Rate: -0.7034 Mean Signed Error: 0.1818 Bias: 49.32% Stability: 0.1105 Estimated H values: fBm: H_est=0.8073, H_true=0.7000 fGn: H_est=0.7927, H_true=0.7000 ARFIMAModel: H_est=0.7902, H_true=0.3000 MRW: H_est=0.7373, H_true=0.7000 πŸ“Š DETAILED PERFORMANCE BY DATA MODEL: fBm: 1. Whittle: Error 0.0000, Time 0.000s 2. Periodogram: Error 0.0080, Time 0.001s 3. SVR: Error 0.0387, Time 0.000s fGn: 1. Whittle: Error 0.0000, Time 0.000s 2. SVR: Error 0.0244, Time 0.000s 3. Periodogram: Error 0.0618, Time 0.001s ARFIMAModel: 1. WaveletLeaders: Error 0.0535, Time 0.013s 2. MFDFA: Error 0.0817, Time 0.105s 3. GradientBoosting: Error 0.1471, Time 0.000s MRW: 1. Whittle: Error 0.0000, Time 0.001s 2. R/S: Error 0.0062, Time 0.084s 3. SVR: Error 0.0147, Time 0.000s 🎯 Benchmark completed successfully! βœ… Comprehensive benchmark completed! Success rate: 95.0% Total tests: 80 🎯 All benchmarks completed successfully! .. _rankings: 3. Creating Performance Rankings -------------------------------- Now let’s create comprehensive performance rankings and leaderboards from our benchmark results. .. code:: ipython3 # Create comprehensive leaderboard print("πŸ† Creating Performance Leaderboard...") print("=" * 70) # Combine all benchmark results all_results = { 'Classical': classical_results, 'ML': ml_results, 'Neural': neural_results, 'Comprehensive': comprehensive_results } # Create performance summary performance_data = [] for category, results in all_results.items(): print(f"πŸ” Processing {category} results...") print(f" Keys: {list(results.keys())}") # Check if results have the expected structure if 'results' in results and isinstance(results['results'], dict): print(f" Found 'results' key with {len(results['results'])} entries") # Process the results data for data_model, model_results in results['results'].items(): if isinstance(model_results, dict) and 'estimator_results' in model_results: for estimator_result in model_results['estimator_results']: if estimator_result.get('success', True): # Default to True if success not specified ci_lower = None ci_upper = None interval_method = None coverage_flag = None ci = estimator_result.get('confidence_interval') if isinstance(ci, (list, tuple)) and len(ci) == 2: ci_lower, ci_upper = ci uncertainty_blob = estimator_result.get('uncertainty', {}) if isinstance(uncertainty_blob, dict): primary = uncertainty_blob.get('primary_interval') if isinstance(primary, dict): interval_method = primary.get('method', interval_method) alt_ci = primary.get('confidence_interval') if ( (ci_lower is None or ci_upper is None) and isinstance(alt_ci, (list, tuple)) and len(alt_ci) == 2 ): ci_lower, ci_upper = alt_ci coverage_map = uncertainty_blob.get('coverage', {}) if isinstance(coverage_map, dict): if interval_method and interval_method in coverage_map: coverage_flag = coverage_map.get(interval_method) else: for value in coverage_map.values(): if value is not None: coverage_flag = value break ci_width = None if ci_lower is not None and ci_upper is not None: ci_width = ci_upper - ci_lower performance_data.append({ 'Category': category, 'Estimator': estimator_result['estimator'], 'True_H': estimator_result['true_hurst'], 'Estimated_H': estimator_result['estimated_hurst'], 'Error': estimator_result['error'], 'Execution_Time': estimator_result['execution_time'], 'Data_Model': data_model, 'CI_Lower': ci_lower, 'CI_Upper': ci_upper, 'CI_Width': ci_width, 'Interval_Method': interval_method, 'Coverage': coverage_flag }) else: print(f" ⚠️ Unexpected results structure for {category}") print(f" Available keys: {list(results.keys())}") print(f"\nπŸ“Š Total performance records collected: {len(performance_data)}") # Create DataFrame performance_df = pd.DataFrame(performance_data) if len(performance_df) > 0: print(f"πŸ“Š Loaded {len(performance_df)} performance records") # Calculate performance metrics performance_metrics = performance_df.groupby(['Category', 'Estimator']).agg({ 'Error': ['mean', 'std', 'min', 'max'], 'Execution_Time': ['mean', 'std'], 'CI_Width': ['mean', 'std'], 'Coverage': 'mean', 'True_H': 'count' }).round(4) print("\nπŸ“ˆ Performance Metrics Summary:") print(performance_metrics) # Create overall leaderboard print("\nπŸ† Overall Performance Leaderboard:") print("=" * 70) # Calculate composite scores leaderboard_data = [] for (category, estimator), group in performance_df.groupby(['Category', 'Estimator']): mean_error = group['Error'].mean() std_error = group['Error'].std() mean_time = group['Execution_Time'].mean() count = len(group) mean_ci_width = group['CI_Width'].dropna().mean() if 'CI_Width' in group else None coverage_rate = group['Coverage'].dropna().mean() if 'Coverage' in group else None # Composite score incorporates coverage to reward calibrated estimators coverage_factor = coverage_rate if coverage_rate is not None else 1.0 coverage_factor = max(coverage_factor, 0.01) composite_score = (1 / (1 + mean_error)) * (count / 10) * (1 / (1 + mean_time)) * coverage_factor leaderboard_data.append({ 'Category': category, 'Estimator': estimator, 'Mean_Error': mean_error, 'Std_Error': std_error, 'Mean_Time': mean_time, 'Mean_CI_Width': mean_ci_width, 'Coverage_Rate': coverage_rate, 'Count': count, 'Composite_Score': composite_score }) leaderboard_df = pd.DataFrame(leaderboard_data) leaderboard_df = leaderboard_df.sort_values('Composite_Score', ascending=False) print(leaderboard_df.round(4)) # Save leaderboard leaderboard_df.to_csv('outputs/performance_leaderboard.csv', index=False) print("\nπŸ’Ύ Leaderboard saved to outputs/performance_leaderboard.csv") # Significance analysis for the comprehensive benchmark comprehensive_significance = comprehensive_results.get('significance_analysis', {}) print("\nπŸ§ͺ Significance Testing (Comprehensive Benchmark)") if not comprehensive_significance: print("No significance analysis available.") else: status = comprehensive_significance.get('status', 'unavailable') print(f"Status: {status}") if status == 'ok': friedman = comprehensive_significance.get('friedman', {}) friedman_stat = friedman.get('statistic') friedman_p = friedman.get('p_value') if friedman_stat is not None and friedman_p is not None: print( f"Friedman χ²={friedman_stat:.4f} (p={friedman_p:.4f}) " f"across {friedman.get('n_data_models', 0)} data models " f"and {friedman.get('n_estimators', 0)} estimators" ) else: print(f"Friedman test unavailable: {friedman.get('error', 'insufficient data')}") mean_ranks = comprehensive_significance.get('mean_ranks', {}) if mean_ranks: mean_rank_df = ( pd.DataFrame(list(mean_ranks.items()), columns=['Estimator', 'Mean Rank']) .sort_values('Mean Rank') ) print("\nMean rank summary:") print(mean_rank_df.to_string(index=False)) post_hoc_entries = [] for res in comprehensive_significance.get('post_hoc', []): if res.get('p_value') is None: continue entry = { 'Estimator A': res['pair'][0], 'Estimator B': res['pair'][1], 'Holm p-value': float(res.get('holm_p_value')) if res.get('holm_p_value') is not None else None, 'Significant': bool(res.get('significant')), } if res.get('note'): entry['Note'] = res['note'] post_hoc_entries.append(entry) if post_hoc_entries: post_hoc_df = pd.DataFrame(post_hoc_entries).sort_values('Holm p-value') print("\nPairwise post-hoc tests (Holm-corrected):") print(post_hoc_df.to_string(index=False)) else: print("No pairwise differences reached significance after Holm correction.") else: print(comprehensive_significance.get('reason', 'No additional information.')) coverage_overview = performance_df['Coverage'].dropna() if not coverage_overview.empty: print("\n🎯 Coverage Summary (Comprehensive Benchmark)") print(f"Overall empirical coverage: {coverage_overview.mean():.2%}") coverage_by_estimator = performance_df.groupby('Estimator')['Coverage'].mean().dropna() if not coverage_by_estimator.empty: print("Per-estimator coverage:") for estimator_name, rate in coverage_by_estimator.sort_values(ascending=False).items(): print(f" {estimator_name}: {rate:.2%}") else: print("❌ No performance data available for leaderboard generation") .. parsed-literal:: πŸ† Creating Performance Leaderboard... ====================================================================== πŸ” Processing Classical results... Keys: ['timestamp', 'benchmark_type', 'contamination_type', 'contamination_level', 'total_tests', 'successful_tests', 'success_rate', 'data_models_tested', 'estimators_tested', 'results'] Found 'results' key with 4 entries πŸ” Processing ML results... Keys: ['timestamp', 'benchmark_type', 'contamination_type', 'contamination_level', 'total_tests', 'successful_tests', 'success_rate', 'data_models_tested', 'estimators_tested', 'results'] Found 'results' key with 4 entries πŸ” Processing Neural results... Keys: ['timestamp', 'benchmark_type', 'contamination_type', 'contamination_level', 'total_tests', 'successful_tests', 'success_rate', 'data_models_tested', 'estimators_tested', 'results'] Found 'results' key with 4 entries πŸ” Processing Comprehensive results... Keys: ['timestamp', 'benchmark_type', 'contamination_type', 'contamination_level', 'total_tests', 'successful_tests', 'success_rate', 'data_models_tested', 'estimators_tested', 'results'] Found 'results' key with 4 entries πŸ“Š Total performance records collected: 144 πŸ“Š Loaded 144 performance records πŸ“ˆ Performance Metrics Summary: Error Execution_Time \ mean std min max mean Category Estimator Classical CWT 0.3242 0.4224 0.0972 0.9573 0.0737 DFA 0.2197 0.2085 0.0675 0.5255 0.0068 DMA 0.1829 0.1868 0.0479 0.4514 0.0011 GPH 0.2396 0.1937 0.0721 0.5171 0.1128 Higuchi 0.1819 0.2077 0.0373 0.4902 0.0023 MFDFA 0.3889 0.2053 0.0817 0.5033 0.1110 Periodogram 0.1287 0.1620 0.0080 0.3676 0.0013 R/S 0.1777 0.2157 0.0062 0.4919 0.8095 WaveletLeaders 0.4523 0.2733 0.0535 0.6569 0.0142 WaveletLogVar 0.3880 0.3424 0.1012 0.8849 0.0006 WaveletVar 0.6041 0.4089 0.2345 1.1881 0.0011 WaveletWhittle 0.5900 0.2000 0.2900 0.6900 0.0071 Whittle 0.1000 0.2000 0.0000 0.4000 0.0005 Comprehensive CWT 0.3242 0.4224 0.0972 0.9573 0.0758 DFA 0.2197 0.2085 0.0675 0.5255 0.0066 DMA 0.1829 0.1868 0.0479 0.4514 0.0010 GPH 0.2396 0.1937 0.0721 0.5171 0.0023 GRU 0.2119 0.0052 0.2060 0.2187 0.0006 GradientBoosting 0.4308 0.1988 0.1471 0.5783 0.0003 Higuchi 0.1819 0.2077 0.0373 0.4902 0.0023 LSTM 0.2000 0.0002 0.1998 0.2002 0.0009 MFDFA 0.3699 0.1944 0.0817 0.4991 0.1066 Periodogram 0.1287 0.1620 0.0080 0.3676 0.0013 R/S 0.1777 0.2157 0.0062 0.4919 0.0822 RandomForest 0.5000 0.2000 0.2000 0.6000 0.0005 SVR 0.1364 0.2212 0.0147 0.4680 0.0001 WaveletLeaders 0.4397 0.2714 0.0535 0.6569 0.0130 WaveletLogVar 0.3880 0.3424 0.1012 0.8849 0.0005 WaveletVar 0.6041 0.4089 0.2345 1.1881 0.0010 WaveletWhittle 0.5900 0.2000 0.2900 0.6900 0.0072 Whittle 0.1000 0.2000 0.0000 0.4000 0.0006 ML GradientBoosting 0.4308 0.1988 0.1471 0.5783 0.0002 RandomForest 0.5000 0.2000 0.2000 0.6000 0.0002 SVR 0.1364 0.2212 0.0147 0.4680 0.0000 Neural GRU 0.2119 0.0052 0.2060 0.2187 0.0017 LSTM 0.2000 0.0002 0.1998 0.2002 0.0433 True_H std count Category Estimator Classical CWT 0.0161 4 DFA 0.0005 4 DMA 0.0001 4 GPH 0.2214 4 Higuchi 0.0002 4 MFDFA 0.0115 4 Periodogram 0.0001 4 R/S 1.4596 4 WaveletLeaders 0.0031 4 WaveletLogVar 0.0000 4 WaveletVar 0.0001 4 WaveletWhittle 0.0001 4 Whittle 0.0000 4 Comprehensive CWT 0.0147 4 DFA 0.0002 4 DMA 0.0000 4 GPH 0.0006 4 GRU 0.0002 4 GradientBoosting 0.0001 4 Higuchi 0.0002 4 LSTM 0.0001 4 MFDFA 0.0016 4 Periodogram 0.0001 4 R/S 0.0056 4 RandomForest 0.0001 4 SVR 0.0000 4 WaveletLeaders 0.0004 4 WaveletLogVar 0.0000 4 WaveletVar 0.0001 4 WaveletWhittle 0.0001 4 Whittle 0.0000 4 ML GradientBoosting 0.0000 4 RandomForest 0.0001 4 SVR 0.0000 4 Neural GRU 0.0023 4 LSTM 0.0854 4 πŸ† Overall Performance Leaderboard: ====================================================================== Category Estimator Mean_Error Std_Error Mean_Time Count \ 12 Classical Whittle 0.1000 0.2000 0.0005 4 30 Comprehensive Whittle 0.1000 0.2000 0.0006 4 6 Classical Periodogram 0.1287 0.1620 0.0013 4 22 Comprehensive Periodogram 0.1287 0.1620 0.0013 4 33 ML SVR 0.1364 0.2212 0.0000 4 25 Comprehensive SVR 0.1364 0.2212 0.0001 4 15 Comprehensive DMA 0.1829 0.1868 0.0010 4 2 Classical DMA 0.1829 0.1868 0.0011 4 19 Comprehensive Higuchi 0.1819 0.2077 0.0023 4 4 Classical Higuchi 0.1819 0.2077 0.0023 4 20 Comprehensive LSTM 0.2000 0.0002 0.0009 4 17 Comprehensive GRU 0.2119 0.0052 0.0006 4 34 Neural GRU 0.2119 0.0052 0.0017 4 14 Comprehensive DFA 0.2197 0.2085 0.0066 4 1 Classical DFA 0.2197 0.2085 0.0068 4 16 Comprehensive GPH 0.2396 0.1937 0.0023 4 35 Neural LSTM 0.2000 0.0002 0.0433 4 23 Comprehensive R/S 0.1777 0.2157 0.0822 4 3 Classical GPH 0.2396 0.1937 0.1128 4 27 Comprehensive WaveletLogVar 0.3880 0.3424 0.0005 4 9 Classical WaveletLogVar 0.3880 0.3424 0.0006 4 0 Classical CWT 0.3242 0.4224 0.0737 4 13 Comprehensive CWT 0.3242 0.4224 0.0758 4 31 ML GradientBoosting 0.4308 0.1988 0.0002 4 18 Comprehensive GradientBoosting 0.4308 0.1988 0.0003 4 26 Comprehensive WaveletLeaders 0.4397 0.2714 0.0130 4 8 Classical WaveletLeaders 0.4523 0.2733 0.0142 4 32 ML RandomForest 0.5000 0.2000 0.0002 4 24 Comprehensive RandomForest 0.5000 0.2000 0.0005 4 21 Comprehensive MFDFA 0.3699 0.1944 0.1066 4 5 Classical MFDFA 0.3889 0.2053 0.1110 4 11 Classical WaveletWhittle 0.5900 0.2000 0.0071 4 29 Comprehensive WaveletWhittle 0.5900 0.2000 0.0072 4 28 Comprehensive WaveletVar 0.6041 0.4089 0.0010 4 10 Classical WaveletVar 0.6041 0.4089 0.0011 4 7 Classical R/S 0.1777 0.2157 0.8095 4 Composite_Score 12 0.3634 30 0.3634 6 0.3539 22 0.3539 33 0.3520 25 0.3520 15 0.3378 2 0.3378 19 0.3377 4 0.3377 20 0.3330 17 0.3299 34 0.3295 14 0.3258 1 0.3257 16 0.3220 35 0.3195 23 0.3139 3 0.2900 27 0.2880 9 0.2880 0 0.2813 13 0.2808 31 0.2795 18 0.2795 26 0.2743 8 0.2716 32 0.2666 24 0.2665 21 0.2639 5 0.2592 11 0.2498 29 0.2498 28 0.2491 10 0.2491 7 0.1877 πŸ’Ύ Leaderboard saved to outputs/performance_leaderboard.csv .. _visualization: 4. Visualization and Export --------------------------- Let’s create comprehensive visualizations of our leaderboard results and export them in various formats. .. code:: ipython3 # Create comprehensive visualizations if len(performance_df) > 0: print("πŸ“Š Creating Performance Visualizations...") print("=" * 70) # Create figure with subplots fig, axes = plt.subplots(2, 3, figsize=(20, 12)) # 1. Error distribution by category ax1 = axes[0, 0] for category in performance_df['Category'].unique(): category_data = performance_df[performance_df['Category'] == category]['Error'] ax1.hist(category_data, alpha=0.7, label=category, bins=15) ax1.set_xlabel('Absolute Error') ax1.set_ylabel('Frequency') ax1.set_title('Error Distribution by Category') ax1.legend() ax1.grid(True, alpha=0.3) # 2. Execution time by category ax2 = axes[0, 1] for category in performance_df['Category'].unique(): category_data = performance_df[performance_df['Category'] == category]['Execution_Time'] ax2.hist(category_data, alpha=0.7, label=category, bins=15) ax2.set_xlabel('Execution Time (seconds)') ax2.set_ylabel('Frequency') ax2.set_title('Execution Time Distribution by Category') ax2.legend() ax2.grid(True, alpha=0.3) # 3. Error vs True H ax3 = axes[0, 2] for category in performance_df['Category'].unique(): category_data = performance_df[performance_df['Category'] == category] ax3.scatter(category_data['True_H'], category_data['Error'], alpha=0.7, label=category, s=50) ax3.set_xlabel('True Hurst Parameter') ax3.set_ylabel('Absolute Error') ax3.set_title('Error vs True Hurst Parameter') ax3.legend() ax3.grid(True, alpha=0.3) # 4. Performance by estimator ax4 = axes[1, 0] estimator_performance = performance_df.groupby('Estimator')['Error'].mean().sort_values() ax4.bar(range(len(estimator_performance)), estimator_performance.values, alpha=0.7) ax4.set_xlabel('Estimator') ax4.set_ylabel('Mean Absolute Error') ax4.set_title('Mean Error by Estimator') ax4.set_xticks(range(len(estimator_performance))) ax4.set_xticklabels(estimator_performance.index, rotation=45, ha='right') ax4.grid(True, alpha=0.3) # 5. Execution time by estimator ax5 = axes[1, 1] time_performance = performance_df.groupby('Estimator')['Execution_Time'].mean().sort_values() ax5.bar(range(len(time_performance)), time_performance.values, alpha=0.7) ax5.set_xlabel('Estimator') ax5.set_ylabel('Mean Execution Time (seconds)') ax5.set_title('Mean Execution Time by Estimator') ax5.set_xticks(range(len(time_performance))) ax5.set_xticklabels(time_performance.index, rotation=45, ha='right') ax5.grid(True, alpha=0.3) # 6. Composite score ranking ax6 = axes[1, 2] if len(leaderboard_df) > 0: top_10 = leaderboard_df.head(10) ax6.barh(range(len(top_10)), top_10['Composite_Score'], alpha=0.7) ax6.set_xlabel('Composite Score') ax6.set_ylabel('Rank') ax6.set_title('Top 10 Estimators by Composite Score') ax6.set_yticks(range(len(top_10))) ax6.set_yticklabels([f"{row['Category']} - {row['Estimator']}" for _, row in top_10.iterrows()]) ax6.grid(True, alpha=0.3) plt.tight_layout() plt.savefig('outputs/leaderboard_visualization.png', dpi=300, bbox_inches='tight') plt.show() # Create category-specific leaderboards print("\nπŸ“Š Category-Specific Leaderboards:") print("=" * 70) for category in performance_df['Category'].unique(): category_data = performance_df[performance_df['Category'] == category] category_leaderboard = category_data.groupby('Estimator').agg({ 'Error': ['mean', 'std'], 'Execution_Time': 'mean', 'True_H': 'count' }).round(4) print(f"\n{category} Category Leaderboard:") print(category_leaderboard) # Export results in multiple formats print("\nπŸ’Ύ Exporting Results...") print("=" * 70) # CSV export performance_df.to_csv('outputs/performance_data.csv', index=False) print("βœ… Performance data exported to CSV") # JSON export performance_df.to_json('outputs/performance_data.json', orient='records', indent=2) print("βœ… Performance data exported to JSON") # LaTeX table export if len(leaderboard_df) > 0: latex_table = leaderboard_df.to_latex(index=False, float_format='%.4f') with open('outputs/leaderboard_table.tex', 'w') as f: f.write(latex_table) print("βœ… Leaderboard table exported to LaTeX") print("\n🎯 All visualizations and exports completed successfully!") else: print("❌ No performance data available for visualization") .. parsed-literal:: πŸ“Š Creating Performance Visualizations... ====================================================================== .. image:: tutorial_05_leaderboards_files/tutorial_05_leaderboards_8_1.png .. parsed-literal:: πŸ“Š Category-Specific Leaderboards: ====================================================================== Classical Category Leaderboard: Error Execution_Time True_H mean std mean count Estimator CWT 0.3242 0.4224 0.0737 4 DFA 0.2197 0.2085 0.0068 4 DMA 0.1829 0.1868 0.0011 4 GPH 0.2396 0.1937 0.1128 4 Higuchi 0.1819 0.2077 0.0023 4 MFDFA 0.3889 0.2053 0.1110 4 Periodogram 0.1287 0.1620 0.0013 4 R/S 0.1777 0.2157 0.8095 4 WaveletLeaders 0.4523 0.2733 0.0142 4 WaveletLogVar 0.3880 0.3424 0.0006 4 WaveletVar 0.6041 0.4089 0.0011 4 WaveletWhittle 0.5900 0.2000 0.0071 4 Whittle 0.1000 0.2000 0.0005 4 ML Category Leaderboard: Error Execution_Time True_H mean std mean count Estimator GradientBoosting 0.4308 0.1988 0.0002 4 RandomForest 0.5000 0.2000 0.0002 4 SVR 0.1364 0.2212 0.0000 4 Neural Category Leaderboard: Error Execution_Time True_H mean std mean count Estimator GRU 0.2119 0.0052 0.0017 4 LSTM 0.2000 0.0002 0.0433 4 Comprehensive Category Leaderboard: Error Execution_Time True_H mean std mean count Estimator CWT 0.3242 0.4224 0.0758 4 DFA 0.2197 0.2085 0.0066 4 DMA 0.1829 0.1868 0.0010 4 GPH 0.2396 0.1937 0.0023 4 GRU 0.2119 0.0052 0.0006 4 GradientBoosting 0.4308 0.1988 0.0003 4 Higuchi 0.1819 0.2077 0.0023 4 LSTM 0.2000 0.0002 0.0009 4 MFDFA 0.3699 0.1944 0.1066 4 Periodogram 0.1287 0.1620 0.0013 4 R/S 0.1777 0.2157 0.0822 4 RandomForest 0.5000 0.2000 0.0005 4 SVR 0.1364 0.2212 0.0001 4 WaveletLeaders 0.4397 0.2714 0.0130 4 WaveletLogVar 0.3880 0.3424 0.0005 4 WaveletVar 0.6041 0.4089 0.0010 4 WaveletWhittle 0.5900 0.2000 0.0072 4 Whittle 0.1000 0.2000 0.0006 4 πŸ’Ύ Exporting Results... ====================================================================== βœ… Performance data exported to CSV βœ… Performance data exported to JSON βœ… Leaderboard table exported to LaTeX 🎯 All visualizations and exports completed successfully! .. _tut05-summary: 5. Summary and Next Steps ------------------------- Key Takeaways ~~~~~~~~~~~~~ 1. **Leaderboard Generation**: LRDBenchmark provides comprehensive tools for creating performance leaderboards: - **Multi-category Comparison**: Classical, ML, and Neural estimators - **Composite Scoring**: Combined accuracy, speed, and reliability metrics - **Statistical Analysis**: Confidence intervals and significance tests - **Publication-ready Output**: LaTeX, CSV, JSON formats 2. **Performance Rankings**: The system generates multiple types of leaderboards: - **Overall Leaderboard**: Combined performance across all categories - **Category-specific**: Rankings within each estimator category - **Metric-specific**: Rankings by accuracy, speed, or reliability - **Composite Scoring**: Weighted combination of multiple metrics 3. **Visualization**: Comprehensive plots and tables for: - **Error Distributions**: Performance across different scenarios - **Execution Time Analysis**: Computational efficiency comparison - **Scatter Plots**: Error vs true Hurst parameter relationships - **Bar Charts**: Direct performance comparisons Leaderboard Results ~~~~~~~~~~~~~~~~~~~ - **Top Performers**: Best estimators across different categories - **Performance Trade-offs**: Accuracy vs speed analysis - **Category Strengths**: Each category’s optimal use cases - **Statistical Significance**: Confidence in performance differences Next Steps ~~~~~~~~~~ 1. **Real-world Application**: Apply leaderboards to actual time series data 2. **Advanced Analysis**: Explore statistical significance and confidence intervals 3. **Custom Metrics**: Create domain-specific performance measures 4. **Interactive Dashboards**: Build web-based leaderboard interfaces 5. **Reproducible Validation**: Use :class:`lrdbenchmark.real_world_validation.RealWorldDataValidator` to generate deterministic surrogate datasets and provenance bundles for real-world studies. Files Generated ~~~~~~~~~~~~~~~ - ``outputs/performance_leaderboard.csv``: Complete leaderboard data - ``outputs/performance_data.csv``: Raw performance data - ``outputs/performance_data.json``: JSON format data - ``outputs/leaderboard_table.tex``: LaTeX table for publications - ``outputs/leaderboard_visualization.png``: Comprehensive visualization References ~~~~~~~~~~ 1. Taqqu, M. S., Teverovsky, V., & Willinger, W. (1995). Estimators for long-range dependence: an empirical study. Fractals, 3(04), 785-798. 2. Beran, J. (1994). Statistics for long-memory processes. CRC press. 3. Abry, P., & Veitch, D. (1998). Wavelet analysis of long-range-dependent traffic. IEEE Transactions on information theory, 44(1), 2-15. -------------- **Congratulations!** You’ve completed the comprehensive LRDBenchmark demonstration series. You now have a complete understanding of: - Data generation and visualization - Estimation and statistical validation - Custom model and estimator development - Comprehensive benchmarking - Leaderboard generation and analysis Additional Analyses ------------------- - Use ``summary["stratified_metrics"]`` from the comprehensive benchmark JSON to build Hurst, tail, length, and contamination slices before rendering leaderboards. - Call ``dashboard.generate_stratified_report(path_to_json)`` for markdown-ready stratified tables. - Run ``dashboard.create_advanced_diagnostics_visuals(path_to_advanced_json, output_dir=...)`` to produce scaling slope and robustness panel figures documenting estimator sensitivity to missingness, regime shifts, bursts, and seasonal drift.