Validation Techniques and Statistical Tests
===========================================

This document provides comprehensive coverage of the validation techniques, statistical tests, and quality assurance methods used in LRDBench.

Monte Carlo Validation
=====================

Overview
---------

Monte Carlo validation is the primary method for assessing estimator performance on synthetic data with known parameters. This approach allows for controlled evaluation of estimator bias, variance, and overall accuracy.

Methodology
----------

**Data Generation Process**:

1. **Parameter Space Definition**: Define ranges for Hurst parameters (typically H ∈ [0.3, 0.9])
2. **Model Selection**: Choose data models (FBM, FGN, ARFIMA, MRW)
3. **Sample Size Selection**: Use multiple sample sizes (e.g., 500, 1000, 2000, 5000)
4. **Realization Count**: Generate N realizations for each parameter combination

**Mathematical Framework**:

For estimator :math:`\hat{H}` and true value :math:`H_0`:

.. math::

   \text{Bias} = \mathbb{E}[\hat{H}] - H_0 = \frac{1}{N} \sum_{i=1}^N \hat{H}_i - H_0

.. math::

   \text{Variance} = \mathbb{E}[(\hat{H} - \mathbb{E}[\hat{H}])^2] = \frac{1}{N-1} \sum_{i=1}^N (\hat{H}_i - \bar{H})^2

.. math::

   \text{MSE} = \mathbb{E}[(\hat{H} - H_0)^2] = \text{Bias}^2 + \text{Variance}

**Implementation**:

.. code-block:: python

   import numpy as np
   from lrdbenchmark import FBMModel, ComprehensiveBenchmark
   
   def monte_carlo_validation(H_values, sample_sizes, n_realizations=100):
       """Perform Monte Carlo validation."""
       results = {}
       
       for H in H_values:
           for n in sample_sizes:
               estimates = []
               
               for i in range(n_realizations):
                   # Generate synthetic data
                   model = FBMModel(H=H, sigma=1.0)
                   data = model.generate(n, seed=i)
                   
                   # Apply estimator
                   benchmark = ComprehensiveBenchmark()
                   result = benchmark.run_classical_benchmark(
                       data_length=n,
                       estimators=['dfa', 'gph', 'rs']
                   )
                   
                   # Collect estimates
                   for estimator_name, estimator_result in result.estimators.items():
                       if estimator_name not in results:
                           results[estimator_name] = {}
                       if (H, n) not in results[estimator_name]:
                           results[estimator_name][(H, n)] = []
                       
                       results[estimator_name][(H, n)].append(
                           estimator_result.mean_estimate
                       )
       
       return results

Bootstrap Methods
================

Overview
---------

Bootstrap methods provide non-parametric approaches to estimate confidence intervals, standard errors, and bias correction for estimators.

Types of Bootstrap
------------------

**1. Non-Parametric Bootstrap**:

Resample with replacement from the original data:

.. math::

   X^*_1, X^*_2, \ldots, X^*_n \sim \text{iid from } \{X_1, X_2, \ldots, X_n\}

**2. Parametric Bootstrap**:

Generate bootstrap samples from a fitted parametric model:

.. math::

   X^*_i \sim F_{\hat{\theta}}(x)

**3. Moving Block Bootstrap**:

Preserve temporal dependence by resampling blocks:

.. math::

   B_i = (X_i, X_{i+1}, \ldots, X_{i+l-1})

**Implementation**:

.. code-block:: python

   import numpy as np
   from scipy import stats
   
   def bootstrap_confidence_interval(data, estimator_func, n_bootstrap=1000, 
                                   confidence_level=0.95):
       """Calculate bootstrap confidence interval."""
       bootstrap_estimates = []
       
       for _ in range(n_bootstrap):
           # Resample with replacement
           bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
           
           # Apply estimator
           estimate = estimator_func(bootstrap_sample)
           bootstrap_estimates.append(estimate)
       
       # Calculate confidence interval
       alpha = 1 - confidence_level
       lower_percentile = (alpha / 2) * 100
       upper_percentile = (1 - alpha / 2) * 100
       
       ci_lower = np.percentile(bootstrap_estimates, lower_percentile)
       ci_upper = np.percentile(bootstrap_estimates, upper_percentile)
       
       return ci_lower, ci_upper, bootstrap_estimates

Cross-Validation
================

Overview
--------

Cross-validation is essential for machine learning estimators to prevent overfitting and assess generalization performance.

K-Fold Cross-Validation
-----------------------

**Algorithm**:

1. **Data Partitioning**: Divide data into K folds
2. **Training/Validation**: For each fold k:
   - Train on K-1 folds
   - Validate on fold k
3. **Performance Aggregation**: Average performance across all folds

**Mathematical Formulation**:

.. math::

   \text{CV} = \frac{1}{K} \sum_{k=1}^K L(y_k, f^{-k}(x_k))

where :math:`f^{-k}` is the estimator trained on all folds except fold k.

**Implementation**:

.. code-block:: python

   from sklearn.model_selection import KFold
   from sklearn.metrics import mean_squared_error
   
   def cross_validate_estimator(data, labels, estimator_class, k_folds=5):
       """Perform k-fold cross-validation."""
       kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
       cv_scores = []
       
       for train_idx, val_idx in kf.split(data):
           # Split data
           X_train, X_val = data[train_idx], data[val_idx]
           y_train, y_val = labels[train_idx], labels[val_idx]
           
           # Train estimator
           estimator = estimator_class()
           estimator.fit(X_train, y_train)
           
           # Predict and evaluate
           y_pred = estimator.estimate(X_val)
           mse = mean_squared_error(y_val, y_pred)
           cv_scores.append(mse)
       
       return np.mean(cv_scores), np.std(cv_scores)

Time Series Cross-Validation
----------------------------

For time series data, standard k-fold CV can lead to data leakage. Time series CV uses expanding windows:

.. code-block:: python

   def time_series_cv(data, labels, estimator_class, min_train_size=100):
       """Time series cross-validation with expanding windows."""
       cv_scores = []
       
       for i in range(min_train_size, len(data)):
           # Training set: all data up to index i
           X_train = data[:i]
           y_train = labels[:i]
           
           # Validation set: next observation
           X_val = data[i:i+1]
           y_val = labels[i:i+1]
           
           # Train and predict
           estimator = estimator_class()
           estimator.fit(X_train, y_train)
           y_pred = estimator.estimate(X_val)
           
           # Calculate error
           mse = mean_squared_error(y_val, y_pred)
           cv_scores.append(mse)
       
       return np.mean(cv_scores), np.std(cv_scores)

Robustness Analysis
===================

Overview
--------

Robustness analysis assesses estimator performance under various data conditions, including contamination, noise, and model misspecification.

Contamination Models
--------------------

**1. Additive Noise**:

.. math::

   X_t = X_t^{(0)} + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2)

**2. Outlier Contamination**:

.. math::

   X_t = \begin{cases}
   X_t^{(0)} & \text{with probability } 1-\epsilon \\
   X_t^{(0)} + \delta & \text{with probability } \epsilon
   \end{cases}

**3. Trend Contamination**:

.. math::

   X_t = X_t^{(0)} + \alpha t + \beta t^2

**Implementation**:

.. code-block:: python

   def robustness_analysis(data, estimator_func, contamination_levels):
       """Analyze estimator robustness to contamination."""
       results = {}
       
       for epsilon in contamination_levels:
           contaminated_data = data.copy()
           
           # Add contamination
           n_contaminated = int(epsilon * len(data))
           contamination_indices = np.random.choice(
               len(data), size=n_contaminated, replace=False
           )
           
           # Add outliers
           contaminated_data[contamination_indices] += np.random.normal(
               0, 5, size=n_contaminated
           )
           
           # Apply estimator
           estimate = estimator_func(contaminated_data)
           results[epsilon] = estimate
       
       return results

Breakdown Point Analysis
------------------------

The breakdown point is the smallest fraction of contaminated data that can cause the estimator to produce arbitrarily bad results.

**Implementation**:

.. code-block:: python

   def breakdown_point_analysis(data, estimator_func, max_contamination=0.5):
       """Estimate breakdown point of an estimator."""
       original_estimate = estimator_func(data)
       breakdown_point = None
       
       for epsilon in np.arange(0, max_contamination, 0.01):
           # Create contaminated data
           n_contaminated = int(epsilon * len(data))
           contaminated_data = data.copy()
           
           # Add extreme outliers
           contamination_indices = np.random.choice(
               len(data), size=n_contaminated, replace=False
           )
           contaminated_data[contamination_indices] = 1e6
           
           # Check if estimator breaks down
           try:
               contaminated_estimate = estimator_func(contaminated_data)
               
               # Check if estimate is reasonable
               if abs(contaminated_estimate - original_estimate) > 0.5:
                   breakdown_point = epsilon
                   break
           except:
               breakdown_point = epsilon
               break
       
       return breakdown_point

Validation Statistical Tests
============================

Hypothesis Testing
------------------

**1. Test for Long-Range Dependence**:

Null hypothesis: :math:`H_0: H = 0.5` (no LRD)
Alternative hypothesis: :math:`H_1: H \neq 0.5` (LRD present)

Test statistic:
.. math::

   T = \frac{\hat{H} - 0.5}{\text{SE}(\hat{H})}

**2. Test for Parameter Equality**:

Null hypothesis: :math:`H_0: H_1 = H_2`
Alternative hypothesis: :math:`H_1: H_1 \neq H_2`

Test statistic:
.. math::

   T = \frac{\hat{H}_1 - \hat{H}_2}{\sqrt{\text{SE}(\hat{H}_1)^2 + \text{SE}(\hat{H}_2)^2}}

**Implementation**:

.. code-block:: python

   from scipy import stats
   
   def test_lrd_presence(data, estimator_func, alpha=0.05):
       """Test for presence of long-range dependence."""
       # Estimate H
       H_estimate = estimator_func(data)
       
       # Bootstrap standard error
       bootstrap_estimates = []
       for _ in range(1000):
           bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
           bootstrap_estimates.append(estimator_func(bootstrap_sample))
       
       se_H = np.std(bootstrap_estimates)
       
       # Test statistic
       test_statistic = (H_estimate - 0.5) / se_H
       
       # Critical value
       critical_value = stats.norm.ppf(1 - alpha/2)
       
       # Decision
       reject_null = abs(test_statistic) > critical_value
       p_value = 2 * (1 - stats.norm.cdf(abs(test_statistic)))
       
       return {
           'test_statistic': test_statistic,
           'p_value': p_value,
           'reject_null': reject_null,
           'H_estimate': H_estimate,
           'standard_error': se_H
       }

Goodness-of-Fit Tests
---------------------

**1. Kolmogorov-Smirnov Test**:

Tests whether empirical distribution matches theoretical distribution.

.. math::

   D_n = \sup_x |F_n(x) - F(x)|

**2. Anderson-Darling Test**:

Weighted version of KS test, more sensitive to tails.

.. math::

   A^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F(x))^2}{F(x)(1-F(x))} dF(x)

**3. Chi-Square Test**:

Tests fit of observed frequencies to expected frequencies.

.. math::

   \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}

**Implementation**:

.. code-block:: python

   def goodness_of_fit_tests(data, theoretical_distribution):
       """Perform goodness-of-fit tests."""
       results = {}
       
       # Kolmogorov-Smirnov test
       ks_statistic, ks_pvalue = stats.kstest(data, theoretical_distribution)
       results['ks_test'] = {
           'statistic': ks_statistic,
           'p_value': ks_pvalue
       }
       
       # Anderson-Darling test
       ad_statistic, ad_critical_values, ad_significance_levels = stats.anderson(
           data, dist=theoretical_distribution
       )
       results['anderson_darling'] = {
           'statistic': ad_statistic,
           'critical_values': ad_critical_values,
           'significance_levels': ad_significance_levels
       }
       
       # Chi-square test
       observed, bins = np.histogram(data, bins='auto')
       expected = theoretical_distribution.pdf(bins[:-1]) * len(data) * np.diff(bins)
       
       chi2_statistic, chi2_pvalue = stats.chisquare(observed, expected)
       results['chi_square'] = {
           'statistic': chi2_statistic,
           'p_value': chi2_pvalue
       }
       
       return results

Model Selection
===============

Information Criteria
--------------------

**1. Akaike Information Criterion (AIC)**:

.. math::

   \text{AIC} = 2k - 2\ln(L)

where k is the number of parameters and L is the likelihood.

**2. Bayesian Information Criterion (BIC)**:

.. math::

   \text{BIC} = \ln(n)k - 2\ln(L)

where n is the sample size.

**3. Corrected AIC (AICc)**:

.. math::

   \text{AICc} = \text{AIC} + \frac{2k(k+1)}{n-k-1}

**Implementation**:

.. code-block:: python

   def model_selection_criteria(models, data):
       """Calculate model selection criteria."""
       results = {}
       
       for model_name, model in models.items():
           # Fit model and get likelihood
           model.fit(data)
           log_likelihood = model.log_likelihood(data)
           n_params = model.n_parameters
           n_samples = len(data)
           
           # Calculate criteria
           aic = 2 * n_params - 2 * log_likelihood
           bic = np.log(n_samples) * n_params - 2 * log_likelihood
           aicc = aic + (2 * n_params * (n_params + 1)) / (n_samples - n_params - 1)
           
           results[model_name] = {
               'AIC': aic,
               'BIC': bic,
               'AICc': aicc,
               'log_likelihood': log_likelihood,
               'n_parameters': n_params
           }
       
       return results

Validation Performance Metrics
==============================

Accuracy Metrics
----------------

**1. Mean Absolute Error (MAE)**:

.. math::

   MAE = \frac{1}{n} \sum_{i=1}^n |\hat{H}_i - H_i|

**2. Root Mean Square Error (RMSE)**:

.. math::

   RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (\hat{H}_i - H_i)^2}

**3. Mean Absolute Percentage Error (MAPE)**:

.. math::

   MAPE = \frac{100\%}{n} \sum_{i=1}^n \left|\frac{\hat{H}_i - H_i}{H_i}\right|

**4. Symmetric Mean Absolute Percentage Error (SMAPE)**:

.. math::

   SMAPE = \frac{100\%}{n} \sum_{i=1}^n \frac{2|\hat{H}_i - H_i|}{|\hat{H}_i| + |H_i|}

Precision Metrics
-----------------

**1. Standard Error**:

.. math::

   SE = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (\hat{H}_i - \bar{H})^2}

**2. Coefficient of Variation**:

.. math::

   CV = \frac{SE}{\bar{H}} \times 100\%

**3. Confidence Interval Width**:

.. math::

   CI_{width} = \hat{H}_{1-\alpha/2} - \hat{H}_{\alpha/2}

**Implementation**:

.. code-block:: python

   def calculate_performance_metrics(true_values, estimated_values):
       """Calculate comprehensive performance metrics."""
       metrics = {}
       
       # Convert to numpy arrays
       true_vals = np.array(true_values)
       est_vals = np.array(estimated_values)
       
       # Accuracy metrics
       errors = est_vals - true_vals
       abs_errors = np.abs(errors)
       
       metrics['MAE'] = np.mean(abs_errors)
       metrics['RMSE'] = np.sqrt(np.mean(errors**2))
       metrics['MAPE'] = 100 * np.mean(np.abs(errors / true_vals))
       metrics['SMAPE'] = 100 * np.mean(2 * abs_errors / (np.abs(est_vals) + np.abs(true_vals)))
       
       # Precision metrics
       metrics['Standard_Error'] = np.std(est_vals)
       metrics['Coefficient_of_Variation'] = (np.std(est_vals) / np.mean(est_vals)) * 100
       
       # Bias
       metrics['Bias'] = np.mean(errors)
       
       # Correlation
       metrics['Correlation'] = np.corrcoef(true_vals, est_vals)[0, 1]
       
       return metrics

Efficiency Metrics
-----------------

**1. Computational Complexity**:
Big-O notation for time and space complexity of estimators.

**2. Convergence Rate**:
Rate at which estimator approaches true value with increasing sample size.

**3. Asymptotic Efficiency**:
Ratio of estimator variance to Cramér-Rao lower bound.

**Implementation**:

.. code-block:: python

   import time
   import numpy as np
   from lrdbenchmark import FBMModel, ComprehensiveBenchmark
   import matplotlib.pyplot as plt

   def efficiency_analysis_example():
       """Demonstrate efficiency analysis of estimators."""
       
       # Test different sample sizes
       sample_sizes = [500, 1000, 2000, 4000, 8000]
       n_runs = 10
       
       # Initialize results
       results = {
           'dfa': {'times': [], 'estimates': []},
           'gph': {'times': [], 'estimates': []},
           'rs': {'times': [], 'estimates': []}
       }
       
       print("Running efficiency analysis...")
       
       for n in sample_sizes:
           print(f"Testing sample size: {n}")
           
           for estimator_name in results.keys():
               times = []
               estimates = []
               
               for i in range(n_runs):
                   # Generate data
                   model = FBMModel(H=0.7, sigma=1.0)
                   data = model.generate(n, seed=i)
                   
                   # Time estimation
                   start_time = time.time()
                   
                   benchmark = ComprehensiveBenchmark()
                   result = benchmark.run_classical_benchmark(
                       data_length=n,
                       estimators=[estimator_name]
                   )
                   
                   end_time = time.time()
                   
                   if estimator_name in result.estimators:
                       times.append(end_time - start_time)
                       estimates.append(
                           result.estimators[estimator_name].mean_estimate
                       )
               
               if times:
                   results[estimator_name]['times'].append(np.mean(times))
                   results[estimator_name]['estimates'].append(np.mean(estimates))
       
       # Plot computational complexity
       plt.figure(figsize=(12, 5))
       
       plt.subplot(1, 2, 1)
       for estimator_name in results.keys():
           if results[estimator_name]['times']:
               plt.loglog(sample_sizes[:len(results[estimator_name]['times'])], 
                         results[estimator_name]['times'], 
                         label=estimator_name.upper(), marker='o')
       
       plt.xlabel('Sample Size')
       plt.ylabel('Execution Time (seconds)')
       plt.title('Computational Complexity')
       plt.legend()
       plt.grid(True)
       
       # Plot convergence
       plt.subplot(1, 2, 2)
       true_H = 0.7
       for estimator_name in results.keys():
           if results[estimator_name]['estimates']:
               errors = np.abs(np.array(results[estimator_name]['estimates']) - true_H)
               plt.loglog(sample_sizes[:len(errors)], errors, 
                         label=estimator_name.upper(), marker='o')
       
       plt.xlabel('Sample Size')
       plt.ylabel('Absolute Error')
       plt.title('Convergence Rate')
       plt.legend()
       plt.grid(True)
       
       plt.tight_layout()
       plt.show()
       
       return results

   # Run the example
   if __name__ == "__main__":
       results = efficiency_analysis_example()
       print("Efficiency analysis completed!")

Quality Assurance
=================

Data Quality Checks
------------------

**1. Stationarity Tests**:

.. code-block:: python

   from statsmodels.tsa.stattools import adfuller, kpss
   from scipy import stats
   import numpy as np
   from lrdbenchmark import FBMModel, FGNModel

   def data_quality_checks_example():
       """Demonstrate data quality checks for time series."""
       
       # Generate different types of data
       models = {
           'FBM (H=0.7)': FBMModel(H=0.7, sigma=1.0),
           'FGN (H=0.8)': FGNModel(H=0.8, sigma=1.0),
           'Non-stationary': lambda: np.cumsum(np.random.normal(0, 1, 1000))
       }
       
       print("=== DATA QUALITY CHECKS ===")
       
       for model_name, model in models.items():
           print(f"\n--- {model_name} ---")
           
           # Generate data
           if callable(model):
               data = model()
           else:
               data = model.generate(1000, seed=42)
           
           # ADF Test (Augmented Dickey-Fuller)
           adf_stat, adf_pvalue, adf_critical = adfuller(data)[:3]
           print(f"ADF Test:")
           print(f"  Statistic: {adf_stat:.4f}")
           print(f"  p-value: {adf_pvalue:.4f}")
           print(f"  Stationary: {'Yes' if adf_pvalue < 0.05 else 'No'}")
           
           # KPSS Test
           kpss_stat, kpss_pvalue, kpss_critical = kpss(data)[:3]
           print(f"KPSS Test:")
           print(f"  Statistic: {kpss_stat:.4f}")
           print(f"  p-value: {kpss_pvalue:.4f}")
           print(f"  Stationary: {'Yes' if kpss_pvalue > 0.05 else 'No'}")
           
           # Normality Test (Shapiro-Wilk)
           shapiro_stat, shapiro_pvalue = stats.shapiro(data)
           print(f"Shapiro-Wilk Test:")
           print(f"  Statistic: {shapiro_stat:.4f}")
           print(f"  p-value: {shapiro_pvalue:.4f}")
           print(f"  Normal: {'Yes' if shapiro_pvalue > 0.05 else 'No'}")
           
           # Basic statistics
           print(f"Basic Statistics:")
           print(f"  Mean: {np.mean(data):.4f}")
           print(f"  Std: {np.std(data):.4f}")
           print(f"  Skewness: {stats.skew(data):.4f}")
           print(f"  Kurtosis: {stats.kurtosis(data):.4f}")

   # Run the example
   if __name__ == "__main__":
       data_quality_checks_example()

Estimator Validation
-------------------

**1. Consistency Checks**:

.. code-block:: python

   import numpy as np
   from lrdbenchmark import FBMModel, ComprehensiveBenchmark
   import matplotlib.pyplot as plt

   def estimator_validation_example():
       """Demonstrate estimator validation procedures."""
       
       # Test consistency with increasing sample size
       sample_sizes = [500, 1000, 2000, 4000, 8000]
       true_H = 0.7
       n_runs = 20
       
       results = {
           'dfa': {'estimates': [], 'std': []},
           'gph': {'estimates': [], 'std': []},
           'rs': {'estimates': [], 'std': []}
       }
       
       print("Running estimator validation...")
       
       for n in sample_sizes:
           print(f"Sample size: {n}")
           
           for estimator_name in results.keys():
               estimates = []
               
               for i in range(n_runs):
                   # Generate data
                   model = FBMModel(H=true_H, sigma=1.0)
                   data = model.generate(n, seed=i)
                   
                   # Apply estimator
                   benchmark = ComprehensiveBenchmark()
                   result = benchmark.run_classical_benchmark(
                       data_length=n,
                       estimators=[estimator_name]
                   )
                   
                   if estimator_name in result.estimators:
                       estimates.append(
                           result.estimators[estimator_name].mean_estimate
                       )
               
               if estimates:
                   results[estimator_name]['estimates'].append(np.mean(estimates))
                   results[estimator_name]['std'].append(np.std(estimates))
       
       # Plot consistency
       plt.figure(figsize=(12, 5))
       
       plt.subplot(1, 2, 1)
       for estimator_name in results.keys():
           if results[estimator_name]['estimates']:
               plt.semilogx(sample_sizes[:len(results[estimator_name]['estimates'])], 
                           results[estimator_name]['estimates'], 
                           label=estimator_name.upper(), marker='o')
       
       plt.axhline(y=true_H, color='red', linestyle='--', label='True H')
       plt.xlabel('Sample Size')
       plt.ylabel('Estimated H')
       plt.title('Consistency Check')
       plt.legend()
       plt.grid(True)
       
       # Plot standard deviation
       plt.subplot(1, 2, 2)
       for estimator_name in results.keys():
           if results[estimator_name]['std']:
               plt.loglog(sample_sizes[:len(results[estimator_name]['std'])], 
                         results[estimator_name]['std'], 
                         label=estimator_name.upper(), marker='o')
       
       plt.xlabel('Sample Size')
       plt.ylabel('Standard Deviation')
       plt.title('Precision vs Sample Size')
       plt.legend()
       plt.grid(True)
       
       plt.tight_layout()
       plt.show()
       
       return results

   # Run the example
   if __name__ == "__main__":
       results = estimator_validation_example()
       print("Estimator validation completed!")

Comprehensive Validation Workflow
--------------------------------

.. code-block:: python

   from lrdbenchmark import FBMModel, FGNModel, ComprehensiveBenchmark
   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt

   def comprehensive_validation_workflow():
       """Complete validation workflow for LRDBench estimators."""
       
       print("=== COMPREHENSIVE VALIDATION WORKFLOW ===")
       
       # 1. Define validation parameters
       H_values = np.linspace(0.3, 0.9, 13)
       sample_sizes = [500, 1000, 2000]
       n_realizations = 50
       
       # 2. Initialize results storage
       validation_results = []
       
       # 3. Run comprehensive validation
       for H in H_values:
           print(f"Testing H = {H:.2f}")
           
           for n in sample_sizes:
               for i in range(n_realizations):
                   # Generate data
                   model = FBMModel(H=H, sigma=1.0)
                   data = model.generate(n, seed=int(H*1000 + i))
                   
                   # Run benchmark
                   benchmark = ComprehensiveBenchmark()
                   result = benchmark.run_classical_benchmark(
                       data_length=n,
                       estimators=['dfa', 'gph', 'rs', 'higuchi']
                   )
                   
                   # Store results
                   for estimator_name, estimator_result in result.estimators.items():
                       validation_results.append({
                           'true_H': H,
                           'sample_size': n,
                           'realization': i,
                           'estimator': estimator_name,
                           'estimated_H': estimator_result.mean_estimate,
                           'error': abs(estimator_result.mean_estimate - H)
                       })
       
       # 4. Analyze results
       df = pd.DataFrame(validation_results)
       
       print(f"\n=== VALIDATION SUMMARY ===")
       print(f"Total tests: {len(df)}")
       print(f"H range: {df['true_H'].min():.2f} to {df['true_H'].max():.2f}")
       print(f"Sample sizes: {sorted(df['sample_size'].unique())}")
       print(f"Estimators: {sorted(df['estimator'].unique())}")
       
       # 5. Calculate performance metrics
       performance = df.groupby('estimator').agg({
           'error': ['mean', 'std', 'min', 'max'],
           'estimated_H': ['mean', 'std']
       }).round(4)
       
       print(f"\n=== PERFORMANCE METRICS ===")
       print(performance)
       
       # 6. Create visualization
       plt.figure(figsize=(15, 10))
       
       # Error distribution
       plt.subplot(2, 3, 1)
       for estimator in df['estimator'].unique():
           subset = df[df['estimator'] == estimator]
           plt.hist(subset['error'], alpha=0.7, label=estimator.upper(), bins=20)
       plt.xlabel('Absolute Error')
       plt.ylabel('Frequency')
       plt.title('Error Distribution')
       plt.legend()
       
       # Error vs True H
       plt.subplot(2, 3, 2)
       for estimator in df['estimator'].unique():
           subset = df[df['estimator'] == estimator]
           plt.scatter(subset['true_H'], subset['error'], alpha=0.6, label=estimator.upper())
       plt.xlabel('True H')
       plt.ylabel('Absolute Error')
       plt.title('Error vs True H')
       plt.legend()
       
       # Error vs Sample Size
       plt.subplot(2, 3, 3)
       for estimator in df['estimator'].unique():
           subset = df[df['estimator'] == estimator]
           plt.scatter(subset['sample_size'], subset['error'], alpha=0.6, label=estimator.upper())
       plt.xlabel('Sample Size')
       plt.ylabel('Absolute Error')
       plt.title('Error vs Sample Size')
       plt.legend()
       
       # Estimated vs True H
       plt.subplot(2, 3, 4)
       for estimator in df['estimator'].unique():
           subset = df[df['estimator'] == estimator]
           plt.scatter(subset['true_H'], subset['estimated_H'], alpha=0.6, label=estimator.upper())
       plt.plot([0.3, 0.9], [0.3, 0.9], 'r--', label='Perfect')
       plt.xlabel('True H')
       plt.ylabel('Estimated H')
       plt.title('Estimated vs True H')
       plt.legend()
       
       # Box plot by estimator
       plt.subplot(2, 3, 5)
       df.boxplot(column='error', by='estimator', ax=plt.gca())
       plt.title('Error Distribution by Estimator')
       plt.suptitle('')
       
       # Box plot by sample size
       plt.subplot(2, 3, 6)
       df.boxplot(column='error', by='sample_size', ax=plt.gca())
       plt.title('Error Distribution by Sample Size')
       plt.suptitle('')
       
       plt.tight_layout()
       plt.show()
       
       return df, performance

   # Run the comprehensive workflow
   if __name__ == "__main__":
       df, performance = comprehensive_validation_workflow()
       print("Comprehensive validation workflow completed!")

**1. Computational Complexity**: Big-O notation for time and space complexity
**2. Convergence Rate**: Rate at which estimator approaches true value
**3. Asymptotic Efficiency**: Ratio of estimator variance to Cramér-Rao lower bound

**Implementation**:

.. code-block:: python

   import time
   import psutil
   
   def efficiency_analysis(estimator_func, data_sizes):
       """Analyze computational efficiency."""
       results = {}
       
       for n in data_sizes:
           # Generate test data
           test_data = np.random.randn(n)
           
           # Measure execution time
           start_time = time.time()
           start_memory = psutil.Process().memory_info().rss
           
           estimator_func(test_data)
           
           end_time = time.time()
           end_memory = psutil.Process().memory_info().rss
           
           execution_time = end_time - start_time
           memory_usage = end_memory - start_memory
           
           results[n] = {
               'execution_time': execution_time,
               'memory_usage': memory_usage,
               'time_per_sample': execution_time / n,
               'memory_per_sample': memory_usage / n
           }
       
       return results

Validation References
=====================

1. Beran, J. (1994). Statistics for Long-Memory Processes. Chapman & Hall.
2. Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.
3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
4. Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice. OTexts.
5. Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis. Wiley.
6. Shumway, R. H., & Stoffer, D. S. (2017). Time Series Analysis and Its Applications. Springer.


Nonstationarity and Time-Varying H
==================================

Overview
--------

Classical LRD estimators assume stationarity—specifically that the Hurst parameter H
is constant throughout the time series. When this assumption is violated, estimators
can produce biased or misleading results.

**Key stationarity assumptions violated by time-varying H:**

1. **Constant autocovariance structure**: Classical estimators assume :math:`\gamma(k)` is time-invariant
2. **Scale invariance**: Power-law decay of correlations requires constant scaling properties
3. **Ergodicity**: Time averages should equal ensemble averages

Why Classical Estimators Fail
-----------------------------

**Regime Switching**:

When H switches between regimes (e.g., H=0.3 → H=0.8 at midpoint), classical estimators
produce weighted averages of the regime-specific H values, with weights depending on:

- Segment lengths
- Estimator type (spectral vs. time domain)
- Sample size

**Continuous Drift**:

Linear or smooth H(t) trajectories cause:

- Systematic bias toward the time-averaged H̄ = ∫H(t)dt/T
- Inflated variance estimates
- Poor confidence interval coverage

**Structural Breaks**:

Level shifts and variance changes create spurious long-range correlations:

.. math::

   \hat{H}_{\text{spurious}} = \frac{1}{2} + \frac{\log(1 + \text{break\_severity})}{\log(n)}

Generating Time-Varying H Signals
---------------------------------

LRDBench provides nonstationary generators in the ``generation`` module:

.. code-block:: python

   from lrdbenchmark.generation import (
       RegimeSwitchingProcess,
       ContinuousDriftProcess,
       StructuralBreakProcess,
       EnsembleTimeAverageProcess
   )
   
   # Regime switching: H=0.3 → H=0.8 at midpoint
   gen = RegimeSwitchingProcess(h_regimes=[0.3, 0.8], change_points=[0.5])
   result = gen.generate(1000)
   
   # Linear drift: H increases from 0.3 to 0.8
   gen = ContinuousDriftProcess(h_start=0.3, h_end=0.8, drift_type='linear')
   result = gen.generate(1000)
   
   # Returns dict with 'signal', 'h_trajectory', 'metadata'


Structural Break Detection
==========================

Overview
--------

The ``StructuralBreakDetector`` class provides multiple tests for detecting
change points that would invalidate stationarity assumptions.

Available Tests
---------------

**1. CUSUM Test**:

Cumulative sum test for mean shifts. The test statistic is:

.. math::

   S_k = \sum_{i=1}^k (X_i - \bar{X})

The maximum absolute deviation is compared against Brownian bridge critical values.

**2. Recursive CUSUM**:

Sequential/online detection suitable for real-time monitoring. Uses control limits
to detect when cumulative deviations exceed threshold.

**3. Chow Test**:

Tests whether regression coefficients differ before and after a hypothesized break:

.. math::

   F = \frac{(RSS_R - RSS_U)/k}{RSS_U/(n-2k)}

where :math:`RSS_R` and :math:`RSS_U` are restricted and unrestricted residual sums.

**4. ICSS Algorithm**:

Iterative Cumulative Sum of Squares (Inclán & Tiao, 1994) for detecting variance
change points.

Usage
-----

.. code-block:: python

   from lrdbenchmark.analysis.diagnostics import StructuralBreakDetector
   
   detector = StructuralBreakDetector(significance_level=0.05)
   
   # Run all tests
   result = detector.detect_all(data)
   
   if result['any_break_detected']:
       print("⚠️ Stationarity violated!")
       print(result['warnings'])
   
   # Individual tests
   cusum_result = detector.cusum_test(data)
   chow_result = detector.chow_test(data, break_index=500)


Nonequilibrium Physics Considerations
=====================================

Ergodicity Breaking
-------------------

In nonequilibrium systems, ensemble averages may differ from time averages:

.. math::

   \langle X \rangle_{\text{ensemble}} \neq \langle X \rangle_{\text{time}}

This occurs in aging systems (e.g., glassy dynamics) and subdiffusive processes.
Classical estimators assume ergodicity, leading to systematic errors when violated.

**Testing for Ergodicity**:

.. code-block:: python

   from lrdbenchmark.generation import EnsembleTimeAverageProcess
   
   gen = EnsembleTimeAverageProcess(H=0.7, aging_exponent=0.5)
   result = gen.generate_ensemble(n_realizations=100, length=1000)
   
   # Compare ensemble vs time averages
   ensemble_mean = result['ensemble_mean']  # Mean across realizations
   time_mean = result['time_mean']          # Mean across time for each realization

Aging Effects
-------------

Aging manifests as:

- Time-dependent diffusion coefficients
- Non-stationary waiting time distributions
- Power-law decay of relaxation functions

LRDBench's ``EnsembleTimeAverageProcess`` models aging via:

- Power-law aging: :math:`H(t) \propto t^{-\alpha}`
- Logarithmic aging: :math:`H(t) \propto \log(t)`
- Exponential aging: :math:`H(t) \to H_{\text{boundary}}`


Failure Mode Interpretation
===========================

Catalog of Classical Estimator Failures
---------------------------------------

+---------------------+---------------------+-------------------------+-------------------+
| Failure Mode        | Affected Estimators | Physics Regime          | Detection         |
+=====================+=====================+=========================+===================+
| Bias explosion      | R/S, DFA            | H → 0 or H → 1          | \|Bias\| > 0.15   |
+---------------------+---------------------+-------------------------+-------------------+
| Scale sensitivity   | Spectral methods    | Nonstationarity         | Sensitivity > 0.1 |
+---------------------+---------------------+-------------------------+-------------------+
| False positives     | GPH                 | Short records (n < 512) | Type I > 10%      |
+---------------------+---------------------+-------------------------+-------------------+
| Heavy-tail breakdown| All                 | α < 2 stable            | Variance overflow |
+---------------------+---------------------+-------------------------+-------------------+
| Ergodicity breaking | All                 | Aging/nonequilibrium    | Ensemble ≠ time   |
+---------------------+---------------------+-------------------------+-------------------+

Interpreting Benchmark Results
------------------------------

When interpreting failure benchmark results:

1. **Compare stationary vs. nonstationary MAE**: Relative degradation indicates sensitivity
2. **Check structural break detection**: If breaks detected, results are unreliable
3. **Examine H-dependent bias**: Estimators often fail more at H boundaries
4. **Consider sample size effects**: Short series exacerbate all failure modes

**Reporting Guidelines**:

- Report effect sizes (Cohen's d) for bias significance
- Apply Bonferroni/FDR correction for multiple comparisons
- Include 95% CI from bootstrap (≥500 resamples)
- Visualize with heatmaps: Estimator × H × Scenario