Comprehensive Estimator Benchmarking
====================================
This notebook demonstrates the full benchmarking system of lrdbenchmark,
showing how to systematically compare estimators across different data
types, contamination scenarios, statistical diagnostics, and robustness
stress tests.
Overview
--------
The comprehensive benchmarking system allows you to:
1. **Compare Estimators**: Test multiple estimators on the same data
2. **Contamination Testing**: Evaluate robustness under various
contamination scenarios
3. **Performance Metrics**: Measure accuracy, speed, and reliability
4. **Statistical Analysis**: Generate confidence intervals, significance
tests, and stratified summaries
5. **Robustness Panels**: Run automated scaling diagnostics plus stress
panels for missingness, regime shifts, bursts, and seasonal drift
6. **Visualization**: Create publication-ready plots, stratified tables,
and robustness figures
Table of Contents
-----------------
1. `Setup and Imports <#setup>`__
2. `Benchmark System Overview <#overview>`__
3. `Classical Estimator Benchmarking <#classical>`__
4. `ML and Neural Network Benchmarking <#ml-neural>`__
5. `Contamination Robustness Testing <#contamination>`__
6. `Performance Analysis <#performance>`__
7. `Results Visualization <#visualization>`__
8. `Summary and Next Steps <#summary>`__
.. _setup:
1. Setup and Imports
--------------------
First, letβs import all necessary libraries and set up the comprehensive
benchmarking system.
.. code:: ipython3
# Standard scientific computing imports
import numpy as np
# lrdbenchmark imports - using simplified API
from lrdbenchmark import (
# Data models
FBMModel, FGNModel, ARFIMAModel, MRWModel, AlphaStableModel,
# Classical estimators
RSEstimator, DFAEstimator, GPHEstimator, WhittleEstimator,
# Machine Learning estimators
RandomForestEstimator, SVREstimator, GradientBoostingEstimator,
# Neural Network estimators
CNNEstimator, LSTMEstimator, GRUEstimator, TransformerEstimator,
# GPU utilities
gpu_is_available, get_device_info, clear_gpu_cache, monitor_gpu_memory
)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import time
import warnings
import subprocess
import gc
warnings.filterwarnings('ignore')
from lrdbenchmark.random_manager import initialise_global_rng
initialise_global_rng(1729) # Ensures reproducible benchmarks
# GPU Memory Management Functions
.. _overview:
2. Benchmark System Overview
----------------------------
The ComprehensiveBenchmark class provides a unified interface for
testing all estimator categories. Letβs examine its capabilities and run
our first benchmark.
.. code:: ipython3
# Initialize the comprehensive benchmark system
print("π§ Initializing Comprehensive Benchmark System...")
print("=" * 60)
benchmark = ComprehensiveBenchmark(output_dir="benchmark_results")
print("π Available benchmark types:")
print(" - 'classical': Classical statistical estimators")
print(" - 'ML': Machine learning estimators")
print(" - 'neural': Neural network estimators")
print(" - 'comprehensive': All estimators combined")
print("\nπ Available contamination types:")
print(" - 'additive_gaussian': Add Gaussian noise")
print(" - 'multiplicative_noise': Multiplicative noise")
print(" - 'outliers': Add outliers")
print(" - 'trend': Add trend")
print(" - 'seasonal': Add seasonal patterns")
print(" - 'missing_data': Remove data points")
print("\nπ Available data models:")
print(" - 'fbm': Fractional Brownian Motion")
print(" - 'fgn': Fractional Gaussian Noise")
print(" - 'arfima': ARFIMA processes")
print(" - 'mrw': Multifractal Random Walk")
# Run a simple classical benchmark
print("\nπ Running Classical Estimator Benchmark...")
print("=" * 60)
classical_results = benchmark.run_classical_benchmark(
data_length=1000,
save_results=True
)
print(f"β
Classical benchmark completed!")
print(f"Success rate: {classical_results['success_rate']:.1%}")
print(f"Total tests: {classical_results['total_tests']}")
print(f"Successful tests: {classical_results['successful_tests']}")
# Display summary
if 'summary' in classical_results:
print("\nπ Benchmark Summary:")
benchmark.print_summary(classical_results['summary'])
.. _contamination:
3. Contamination Robustness Testing
-----------------------------------
Real-world data often contains contamination that can affect estimator
performance. Letβs test how different estimators handle various
contamination scenarios.
.. code:: ipython3
# Test contamination robustness
print("π§ͺ Testing Contamination Robustness...")
print("=" * 60)
# Test different contamination types
contamination_types = [
'additive_gaussian',
'outliers',
'trend',
'seasonal'
]
contamination_results = {}
for contamination_type in contamination_types:
print(f"\nπ Testing {contamination_type} contamination...")
# Run benchmark with contamination
results = benchmark.run_classical_benchmark(
data_length=1000,
contamination_type=contamination_type,
contamination_level=0.5, # 50% contamination to show visible impact
save_results=False
)
contamination_results[contamination_type] = results
print(f" Success rate: {results['success_rate']:.1%}")
print(f" Total tests: {results['total_tests']}")
print(f" Successful tests: {results['successful_tests']}")
# Compare contamination effects
print("\nπ Contamination Effects Comparison:")
print("=" * 60)
comparison_data = []
for contamination_type, results in contamination_results.items():
comparison_data.append({
'Contamination': contamination_type,
'Success_Rate': results['success_rate'],
'Total_Tests': results['total_tests'],
'Successful_Tests': results['successful_tests']
})
comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.round(3))
# Visualize contamination effects
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Success rate by contamination type
ax1 = axes[0, 0]
ax1.bar(comparison_df['Contamination'], comparison_df['Success_Rate'], alpha=0.7)
ax1.set_title('Success Rate by Contamination Type', fontsize=12, fontweight='bold')
ax1.set_ylabel('Success Rate')
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3)
# Total tests by contamination type
ax2 = axes[0, 1]
ax2.bar(comparison_df['Contamination'], comparison_df['Total_Tests'], alpha=0.7)
ax2.set_title('Total Tests by Contamination Type', fontsize=12, fontweight='bold')
ax2.set_ylabel('Total Tests')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3)
# Successful tests by contamination type
ax3 = axes[1, 0]
ax3.bar(comparison_df['Contamination'], comparison_df['Successful_Tests'], alpha=0.7)
ax3.set_title('Successful Tests by Contamination Type', fontsize=12, fontweight='bold')
ax3.set_ylabel('Successful Tests')
ax3.tick_params(axis='x', rotation=45)
ax3.grid(True, alpha=0.3)
# Contamination impact (success rate vs baseline)
# Fix for blank plot issue - handle NaN baseline and zero impact
baseline_success = classical_results.get('success_rate', None)
if baseline_success is None or np.isnan(baseline_success):
print("β οΈ Baseline success rate not available or NaN, using mean of contamination results")
baseline_success = comparison_df['Success_Rate'].mean()
impact_data = comparison_df['Success_Rate'] - baseline_success
print(f"Baseline success rate: {baseline_success:.4f}")
print(f"Impact data: {impact_data.values}")
ax4 = axes[1, 1]
# Handle NaN values in impact_data
if np.any(np.isnan(impact_data)):
print("β οΈ Impact data contains NaN values, using zeros")
impact_data = np.zeros_like(impact_data)
# Handle case where all impacts are zero (all success rates are the same)
if np.all(impact_data == 0):
print("β οΈ All contamination impacts are zero - showing relative performance instead")
# Show relative performance by contamination type
impact_data = comparison_df['Success_Rate'] * 100 # Convert to percentage for visibility
ax4.bar(comparison_df['Contamination'], impact_data, alpha=0.7, color='blue')
ax4.set_title('Success Rate by Contamination Type', fontsize=12, fontweight='bold')
ax4.set_ylabel('Success Rate (%)')
ax4.set_ylim(0, 105)
else:
ax4.bar(comparison_df['Contamination'], impact_data, alpha=0.7,
color=['red' if x < 0 else 'green' for x in impact_data])
ax4.set_title('Contamination Impact on Success Rate', fontsize=12, fontweight='bold')
ax4.set_ylabel('Change in Success Rate')
ax4.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax4.tick_params(axis='x', rotation=45)
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('outputs/contamination_robustness.png', dpi=300, bbox_inches='tight')
plt.show()
print("\nβ
Contamination robustness testing completed!")
.. code:: ipython3
# Diagnostic: Check contamination results data quality
print("π Diagnostic: Checking contamination results data quality...")
print("=" * 60)
if 'contamination_results' in locals():
print("Contamination results keys:", list(contamination_results.keys()))
for contamination_type, results in contamination_results.items():
print(f"\n{contamination_type}:")
print(f" Keys: {list(results.keys())}")
print(f" Success rate: {results.get('success_rate', 'NOT FOUND')}")
print(f" Total tests: {results.get('total_tests', 'NOT FOUND')}")
print(f" Successful tests: {results.get('successful_tests', 'NOT FOUND')}")
# Check for NaN values
success_rate = results.get('success_rate', None)
if success_rate is not None:
if np.isnan(success_rate):
print(f" β οΈ Success rate is NaN!")
else:
print(f" β
Success rate is valid: {success_rate:.4f}")
else:
print(f" β Success rate not found in results")
else:
print("β contamination_results not found - contamination testing may have failed")
if 'classical_results' in locals():
print(f"\nClassical results keys: {list(classical_results.keys())}")
baseline_success = classical_results.get('success_rate', None)
if baseline_success is not None:
if np.isnan(baseline_success):
print("β οΈ Classical baseline success rate is NaN!")
else:
print(f"β
Classical baseline success rate: {baseline_success:.4f}")
else:
print("β Classical baseline success rate not found")
else:
print("β classical_results not found - classical benchmark may have failed")
print("\n" + "=" * 60)
::
The Kernel crashed while executing code in the current cell or a previous cell.
Please review the code in the cell(s) to identify a possible cause of the failure.
Click here for more info.
View Jupyter log for further details.
.. _summary:
4. Summary and Next Steps
-------------------------
Key Takeaways
~~~~~~~~~~~~~
1. **Comprehensive Benchmarking**: lrdbenchmark provides a unified
system for testing all estimator categories:
- **Classical**: Fast, interpretable, theoretically grounded
- **ML**: Robust, flexible, pre-trained models
- **Neural**: High accuracy, complex patterns
2. **Contamination Testing**: The system evaluates robustness under
various contamination scenarios:
- **Additive Gaussian**: Noise contamination
- **Outliers**: Extreme value contamination
- **Trend**: Systematic trend contamination
- **Seasonal**: Periodic pattern contamination
3. **Performance Metrics**: Comprehensive evaluation including:
- **Accuracy**: Mean absolute error, bias, variance
- **Speed**: Execution time, computational efficiency
- **Robustness**: Performance under contamination
- **Reliability**: Success rate, consistency
Benchmark Results
~~~~~~~~~~~~~~~~~
- **Classical Estimators**: Fast and reliable for standard LRD
processes
- **ML Estimators**: Good balance of accuracy and robustness
- **Neural Networks**: Highest accuracy for complex patterns
- **Contamination Effects**: Different estimators show varying
robustness
Next Steps
~~~~~~~~~~
1. **Leaderboard Generation**: Create performance rankings and
comparisons
2. **Real-world Application**: Apply to actual time series data
3. **Advanced Analysis**: Explore statistical significance and
confidence intervals
4. **Custom Benchmarks**: Create domain-specific benchmark scenarios
Files Generated
~~~~~~~~~~~~~~~
- ``benchmark_results/``: Complete benchmark results directory
- ``outputs/contamination_robustness.png``: Contamination effects
visualization
- Performance metrics and statistical analysis
References
~~~~~~~~~~
1. Taqqu, M. S., Teverovsky, V., & Willinger, W. (1995). Estimators for
long-range dependence: an empirical study. Fractals, 3(04), 785-798.
2. Beran, J. (1994). Statistics for long-memory processes. CRC press.
3. Abry, P., & Veitch, D. (1998). Wavelet analysis of
long-range-dependent traffic. IEEE Transactions on information
theory, 44(1), 2-15.
--------------
**Next Tutorial**: :doc:`tutorial_05_leaderboards` β learn how to create
performance leaderboards and rankings.
Stratified and Robustness Diagnostics
-------------------------------------
- Access ``summary["stratified_metrics"]`` to review error, coverage,
and confidence-width summaries across H regimes, tail classes, data
lengths, and contamination levels.
- For advanced runs, inspect the ``scaling_diagnostics`` and
``robustness_panel`` entries returned per estimator to quantify
scaling influence and stress-test drift.
- Use ``AnalyticsDashboard().create_advanced_diagnostics_visuals(...)``
to generate the scaling slope and robustness panel figures for slide
decks and manuscripts.