Validation Techniques and Statistical Tests

This document provides comprehensive coverage of the validation techniques, statistical tests, and quality assurance methods used in LRDBench.

Monte Carlo Validation

Overview

Monte Carlo validation is the primary method for assessing estimator performance on synthetic data with known parameters. This approach allows for controlled evaluation of estimator bias, variance, and overall accuracy.

Methodology

Data Generation Process:

Parameter Space Definition: Define ranges for Hurst parameters (typically H ∈ [0.3, 0.9])
Model Selection: Choose data models (FBM, FGN, ARFIMA, MRW)
Sample Size Selection: Use multiple sample sizes (e.g., 500, 1000, 2000, 5000)
Realization Count: Generate N realizations for each parameter combination

Mathematical Framework:

For estimator \(\hat{H}\) and true value \(H_0\):

\[\text{Bias} = \mathbb{E}[\hat{H}] - H_0 = \frac{1}{N} \sum_{i=1}^N \hat{H}_i - H_0\]

\[\text{Variance} = \mathbb{E}[(\hat{H} - \mathbb{E}[\hat{H}])^2] = \frac{1}{N-1} \sum_{i=1}^N (\hat{H}_i - \bar{H})^2\]

\[\text{MSE} = \mathbb{E}[(\hat{H} - H_0)^2] = \text{Bias}^2 + \text{Variance}\]

Implementation:

import numpy as np
from lrdbenchmark import FBMModel, ComprehensiveBenchmark

def monte_carlo_validation(H_values, sample_sizes, n_realizations=100):
    """Perform Monte Carlo validation."""
    results = {}

    for H in H_values:
        for n in sample_sizes:
            estimates = []

            for i in range(n_realizations):
                # Generate synthetic data
                model = FBMModel(H=H, sigma=1.0)
                data = model.generate(n, seed=i)

                # Apply estimator
                benchmark = ComprehensiveBenchmark()
                result = benchmark.run_classical_benchmark(
                    data_length=n,
                    estimators=['dfa', 'gph', 'rs']
                )

                # Collect estimates
                for estimator_name, estimator_result in result.estimators.items():
                    if estimator_name not in results:
                        results[estimator_name] = {}
                    if (H, n) not in results[estimator_name]:
                        results[estimator_name][(H, n)] = []

                    results[estimator_name][(H, n)].append(
                        estimator_result.mean_estimate
                    )

    return results

Bootstrap Methods

Overview

Bootstrap methods provide non-parametric approaches to estimate confidence intervals, standard errors, and bias correction for estimators.

Types of Bootstrap

1. Non-Parametric Bootstrap:

Resample with replacement from the original data:

\[X^*_1, X^*_2, \ldots, X^*_n \sim \text{iid from } \{X_1, X_2, \ldots, X_n\}\]

2. Parametric Bootstrap:

Generate bootstrap samples from a fitted parametric model:

\[X^*_i \sim F_{\hat{\theta}}(x)\]

3. Moving Block Bootstrap:

Preserve temporal dependence by resampling blocks:

\[B_i = (X_i, X_{i+1}, \ldots, X_{i+l-1})\]

Implementation:

import numpy as np
from scipy import stats

def bootstrap_confidence_interval(data, estimator_func, n_bootstrap=1000,
                                confidence_level=0.95):
    """Calculate bootstrap confidence interval."""
    bootstrap_estimates = []

    for _ in range(n_bootstrap):
        # Resample with replacement
        bootstrap_sample = np.random.choice(data, size=len(data), replace=True)

        # Apply estimator
        estimate = estimator_func(bootstrap_sample)
        bootstrap_estimates.append(estimate)

    # Calculate confidence interval
    alpha = 1 - confidence_level
    lower_percentile = (alpha / 2) * 100
    upper_percentile = (1 - alpha / 2) * 100

    ci_lower = np.percentile(bootstrap_estimates, lower_percentile)
    ci_upper = np.percentile(bootstrap_estimates, upper_percentile)

    return ci_lower, ci_upper, bootstrap_estimates

Cross-Validation

Overview

Cross-validation is essential for machine learning estimators to prevent overfitting and assess generalization performance.

K-Fold Cross-Validation

Algorithm:

Data Partitioning: Divide data into K folds
Training/Validation: For each fold k: - Train on K-1 folds - Validate on fold k
Performance Aggregation: Average performance across all folds

Mathematical Formulation:

\[\text{CV} = \frac{1}{K} \sum_{k=1}^K L(y_k, f^{-k}(x_k))\]

where \(f^{-k}\) is the estimator trained on all folds except fold k.

Implementation:

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cross_validate_estimator(data, labels, estimator_class, k_folds=5):
    """Perform k-fold cross-validation."""
    kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
    cv_scores = []

    for train_idx, val_idx in kf.split(data):
        # Split data
        X_train, X_val = data[train_idx], data[val_idx]
        y_train, y_val = labels[train_idx], labels[val_idx]

        # Train estimator
        estimator = estimator_class()
        estimator.fit(X_train, y_train)

        # Predict and evaluate
        y_pred = estimator.estimate(X_val)
        mse = mean_squared_error(y_val, y_pred)
        cv_scores.append(mse)

    return np.mean(cv_scores), np.std(cv_scores)

Time Series Cross-Validation

For time series data, standard k-fold CV can lead to data leakage. Time series CV uses expanding windows:

def time_series_cv(data, labels, estimator_class, min_train_size=100):
    """Time series cross-validation with expanding windows."""
    cv_scores = []

    for i in range(min_train_size, len(data)):
        # Training set: all data up to index i
        X_train = data[:i]
        y_train = labels[:i]

        # Validation set: next observation
        X_val = data[i:i+1]
        y_val = labels[i:i+1]

        # Train and predict
        estimator = estimator_class()
        estimator.fit(X_train, y_train)
        y_pred = estimator.estimate(X_val)

        # Calculate error
        mse = mean_squared_error(y_val, y_pred)
        cv_scores.append(mse)

    return np.mean(cv_scores), np.std(cv_scores)

Robustness Analysis

Overview

Robustness analysis assesses estimator performance under various data conditions, including contamination, noise, and model misspecification.

Contamination Models

1. Additive Noise:

\[X_t = X_t^{(0)} + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2)\]

2. Outlier Contamination:

\[\begin{split}X_t = \begin{cases} X_t^{(0)} & \text{with probability } 1-\epsilon \\ X_t^{(0)} + \delta & \text{with probability } \epsilon \end{cases}\end{split}\]

3. Trend Contamination:

\[X_t = X_t^{(0)} + \alpha t + \beta t^2\]

Implementation:

def robustness_analysis(data, estimator_func, contamination_levels):
    """Analyze estimator robustness to contamination."""
    results = {}

    for epsilon in contamination_levels:
        contaminated_data = data.copy()

        # Add contamination
        n_contaminated = int(epsilon * len(data))
        contamination_indices = np.random.choice(
            len(data), size=n_contaminated, replace=False
        )

        # Add outliers
        contaminated_data[contamination_indices] += np.random.normal(
            0, 5, size=n_contaminated
        )

        # Apply estimator
        estimate = estimator_func(contaminated_data)
        results[epsilon] = estimate

    return results

Breakdown Point Analysis

The breakdown point is the smallest fraction of contaminated data that can cause the estimator to produce arbitrarily bad results.

Implementation:

def breakdown_point_analysis(data, estimator_func, max_contamination=0.5):
    """Estimate breakdown point of an estimator."""
    original_estimate = estimator_func(data)
    breakdown_point = None

    for epsilon in np.arange(0, max_contamination, 0.01):
        # Create contaminated data
        n_contaminated = int(epsilon * len(data))
        contaminated_data = data.copy()

        # Add extreme outliers
        contamination_indices = np.random.choice(
            len(data), size=n_contaminated, replace=False
        )
        contaminated_data[contamination_indices] = 1e6

        # Check if estimator breaks down
        try:
            contaminated_estimate = estimator_func(contaminated_data)

            # Check if estimate is reasonable
            if abs(contaminated_estimate - original_estimate) > 0.5:
                breakdown_point = epsilon
                break
        except:
            breakdown_point = epsilon
            break

    return breakdown_point

Validation Statistical Tests

Hypothesis Testing

1. Test for Long-Range Dependence:

Null hypothesis: \(H_0: H = 0.5\) (no LRD) Alternative hypothesis: \(H_1: H \neq 0.5\) (LRD present)

Test statistic: .. math:

T = \frac{\hat{H} - 0.5}{\text{SE}(\hat{H})}

2. Test for Parameter Equality:

Null hypothesis: \(H_0: H_1 = H_2\) Alternative hypothesis: \(H_1: H_1 \neq H_2\)

Test statistic: .. math:

T = \frac{\hat{H}_1 - \hat{H}_2}{\sqrt{\text{SE}(\hat{H}_1)^2 + \text{SE}(\hat{H}_2)^2}}

Implementation:

from scipy import stats

def test_lrd_presence(data, estimator_func, alpha=0.05):
    """Test for presence of long-range dependence."""
    # Estimate H
    H_estimate = estimator_func(data)

    # Bootstrap standard error
    bootstrap_estimates = []
    for _ in range(1000):
        bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_estimates.append(estimator_func(bootstrap_sample))

    se_H = np.std(bootstrap_estimates)

    # Test statistic
    test_statistic = (H_estimate - 0.5) / se_H

    # Critical value
    critical_value = stats.norm.ppf(1 - alpha/2)

    # Decision
    reject_null = abs(test_statistic) > critical_value
    p_value = 2 * (1 - stats.norm.cdf(abs(test_statistic)))

    return {
        'test_statistic': test_statistic,
        'p_value': p_value,
        'reject_null': reject_null,
        'H_estimate': H_estimate,
        'standard_error': se_H
    }

Goodness-of-Fit Tests

1. Kolmogorov-Smirnov Test:

Tests whether empirical distribution matches theoretical distribution.

\[D_n = \sup_x |F_n(x) - F(x)|\]

2. Anderson-Darling Test:

Weighted version of KS test, more sensitive to tails.

\[A^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F(x))^2}{F(x)(1-F(x))} dF(x)\]

3. Chi-Square Test:

Tests fit of observed frequencies to expected frequencies.

\[\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}\]

Implementation:

def goodness_of_fit_tests(data, theoretical_distribution):
    """Perform goodness-of-fit tests."""
    results = {}

    # Kolmogorov-Smirnov test
    ks_statistic, ks_pvalue = stats.kstest(data, theoretical_distribution)
    results['ks_test'] = {
        'statistic': ks_statistic,
        'p_value': ks_pvalue
    }

    # Anderson-Darling test
    ad_statistic, ad_critical_values, ad_significance_levels = stats.anderson(
        data, dist=theoretical_distribution
    )
    results['anderson_darling'] = {
        'statistic': ad_statistic,
        'critical_values': ad_critical_values,
        'significance_levels': ad_significance_levels
    }

    # Chi-square test
    observed, bins = np.histogram(data, bins='auto')
    expected = theoretical_distribution.pdf(bins[:-1]) * len(data) * np.diff(bins)

    chi2_statistic, chi2_pvalue = stats.chisquare(observed, expected)
    results['chi_square'] = {
        'statistic': chi2_statistic,
        'p_value': chi2_pvalue
    }

    return results

Model Selection

Information Criteria

1. Akaike Information Criterion (AIC):

\[\text{AIC} = 2k - 2\ln(L)\]

where k is the number of parameters and L is the likelihood.

2. Bayesian Information Criterion (BIC):

\[\text{BIC} = \ln(n)k - 2\ln(L)\]

where n is the sample size.

3. Corrected AIC (AICc):

\[\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n-k-1}\]

Implementation:

def model_selection_criteria(models, data):
    """Calculate model selection criteria."""
    results = {}

    for model_name, model in models.items():
        # Fit model and get likelihood
        model.fit(data)
        log_likelihood = model.log_likelihood(data)
        n_params = model.n_parameters
        n_samples = len(data)

        # Calculate criteria
        aic = 2 * n_params - 2 * log_likelihood
        bic = np.log(n_samples) * n_params - 2 * log_likelihood
        aicc = aic + (2 * n_params * (n_params + 1)) / (n_samples - n_params - 1)

        results[model_name] = {
            'AIC': aic,
            'BIC': bic,
            'AICc': aicc,
            'log_likelihood': log_likelihood,
            'n_parameters': n_params
        }

    return results

Validation Performance Metrics

Accuracy Metrics

1. Mean Absolute Error (MAE):

\[MAE = \frac{1}{n} \sum_{i=1}^n |\hat{H}_i - H_i|\]

2. Root Mean Square Error (RMSE):

\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (\hat{H}_i - H_i)^2}\]

3. Mean Absolute Percentage Error (MAPE):

\[MAPE = \frac{100\%}{n} \sum_{i=1}^n \left|\frac{\hat{H}_i - H_i}{H_i}\right|\]

4. Symmetric Mean Absolute Percentage Error (SMAPE):

\[SMAPE = \frac{100\%}{n} \sum_{i=1}^n \frac{2|\hat{H}_i - H_i|}{|\hat{H}_i| + |H_i|}\]

Precision Metrics

1. Standard Error:

\[SE = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (\hat{H}_i - \bar{H})^2}\]

2. Coefficient of Variation:

\[CV = \frac{SE}{\bar{H}} \times 100\%\]

3. Confidence Interval Width:

\[CI_{width} = \hat{H}_{1-\alpha/2} - \hat{H}_{\alpha/2}\]

Implementation:

def calculate_performance_metrics(true_values, estimated_values):
    """Calculate comprehensive performance metrics."""
    metrics = {}

    # Convert to numpy arrays
    true_vals = np.array(true_values)
    est_vals = np.array(estimated_values)

    # Accuracy metrics
    errors = est_vals - true_vals
    abs_errors = np.abs(errors)

    metrics['MAE'] = np.mean(abs_errors)
    metrics['RMSE'] = np.sqrt(np.mean(errors**2))
    metrics['MAPE'] = 100 * np.mean(np.abs(errors / true_vals))
    metrics['SMAPE'] = 100 * np.mean(2 * abs_errors / (np.abs(est_vals) + np.abs(true_vals)))

    # Precision metrics
    metrics['Standard_Error'] = np.std(est_vals)
    metrics['Coefficient_of_Variation'] = (np.std(est_vals) / np.mean(est_vals)) * 100

    # Bias
    metrics['Bias'] = np.mean(errors)

    # Correlation
    metrics['Correlation'] = np.corrcoef(true_vals, est_vals)[0, 1]

    return metrics

Efficiency Metrics

1. Computational Complexity: Big-O notation for time and space complexity of estimators.

2. Convergence Rate: Rate at which estimator approaches true value with increasing sample size.

3. Asymptotic Efficiency: Ratio of estimator variance to Cramér-Rao lower bound.

Implementation:

import time
import numpy as np
from lrdbenchmark import FBMModel, ComprehensiveBenchmark
import matplotlib.pyplot as plt

def efficiency_analysis_example():
    """Demonstrate efficiency analysis of estimators."""

    # Test different sample sizes
    sample_sizes = [500, 1000, 2000, 4000, 8000]
    n_runs = 10

    # Initialize results
    results = {
        'dfa': {'times': [], 'estimates': []},
        'gph': {'times': [], 'estimates': []},
        'rs': {'times': [], 'estimates': []}
    }

    print("Running efficiency analysis...")

    for n in sample_sizes:
        print(f"Testing sample size: {n}")

        for estimator_name in results.keys():
            times = []
            estimates = []

            for i in range(n_runs):
                # Generate data
                model = FBMModel(H=0.7, sigma=1.0)
                data = model.generate(n, seed=i)

                # Time estimation
                start_time = time.time()

                benchmark = ComprehensiveBenchmark()
                result = benchmark.run_classical_benchmark(
                    data_length=n,
                    estimators=[estimator_name]
                )

                end_time = time.time()

                if estimator_name in result.estimators:
                    times.append(end_time - start_time)
                    estimates.append(
                        result.estimators[estimator_name].mean_estimate
                    )

            if times:
                results[estimator_name]['times'].append(np.mean(times))
                results[estimator_name]['estimates'].append(np.mean(estimates))

    # Plot computational complexity
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    for estimator_name in results.keys():
        if results[estimator_name]['times']:
            plt.loglog(sample_sizes[:len(results[estimator_name]['times'])],
                      results[estimator_name]['times'],
                      label=estimator_name.upper(), marker='o')

    plt.xlabel('Sample Size')
    plt.ylabel('Execution Time (seconds)')
    plt.title('Computational Complexity')
    plt.legend()
    plt.grid(True)

    # Plot convergence
    plt.subplot(1, 2, 2)
    true_H = 0.7
    for estimator_name in results.keys():
        if results[estimator_name]['estimates']:
            errors = np.abs(np.array(results[estimator_name]['estimates']) - true_H)
            plt.loglog(sample_sizes[:len(errors)], errors,
                      label=estimator_name.upper(), marker='o')

    plt.xlabel('Sample Size')
    plt.ylabel('Absolute Error')
    plt.title('Convergence Rate')
    plt.legend()
    plt.grid(True)

    plt.tight_layout()
    plt.show()

    return results

# Run the example
if __name__ == "__main__":
    results = efficiency_analysis_example()
    print("Efficiency analysis completed!")

Quality Assurance

Data Quality Checks

1. Stationarity Tests:

from statsmodels.tsa.stattools import adfuller, kpss
from scipy import stats
import numpy as np
from lrdbenchmark import FBMModel, FGNModel

def data_quality_checks_example():
    """Demonstrate data quality checks for time series."""

    # Generate different types of data
    models = {
        'FBM (H=0.7)': FBMModel(H=0.7, sigma=1.0),
        'FGN (H=0.8)': FGNModel(H=0.8, sigma=1.0),
        'Non-stationary': lambda: np.cumsum(np.random.normal(0, 1, 1000))
    }

    print("=== DATA QUALITY CHECKS ===")

    for model_name, model in models.items():
        print(f"\n--- {model_name} ---")

        # Generate data
        if callable(model):
            data = model()
        else:
            data = model.generate(1000, seed=42)

        # ADF Test (Augmented Dickey-Fuller)
        adf_stat, adf_pvalue, adf_critical = adfuller(data)[:3]
        print(f"ADF Test:")
        print(f"  Statistic: {adf_stat:.4f}")
        print(f"  p-value: {adf_pvalue:.4f}")
        print(f"  Stationary: {'Yes' if adf_pvalue < 0.05 else 'No'}")

        # KPSS Test
        kpss_stat, kpss_pvalue, kpss_critical = kpss(data)[:3]
        print(f"KPSS Test:")
        print(f"  Statistic: {kpss_stat:.4f}")
        print(f"  p-value: {kpss_pvalue:.4f}")
        print(f"  Stationary: {'Yes' if kpss_pvalue > 0.05 else 'No'}")

        # Normality Test (Shapiro-Wilk)
        shapiro_stat, shapiro_pvalue = stats.shapiro(data)
        print(f"Shapiro-Wilk Test:")
        print(f"  Statistic: {shapiro_stat:.4f}")
        print(f"  p-value: {shapiro_pvalue:.4f}")
        print(f"  Normal: {'Yes' if shapiro_pvalue > 0.05 else 'No'}")

        # Basic statistics
        print(f"Basic Statistics:")
        print(f"  Mean: {np.mean(data):.4f}")
        print(f"  Std: {np.std(data):.4f}")
        print(f"  Skewness: {stats.skew(data):.4f}")
        print(f"  Kurtosis: {stats.kurtosis(data):.4f}")

# Run the example
if __name__ == "__main__":
    data_quality_checks_example()

Estimator Validation

1. Consistency Checks:

import numpy as np
from lrdbenchmark import FBMModel, ComprehensiveBenchmark
import matplotlib.pyplot as plt

def estimator_validation_example():
    """Demonstrate estimator validation procedures."""

    # Test consistency with increasing sample size
    sample_sizes = [500, 1000, 2000, 4000, 8000]
    true_H = 0.7
    n_runs = 20

    results = {
        'dfa': {'estimates': [], 'std': []},
        'gph': {'estimates': [], 'std': []},
        'rs': {'estimates': [], 'std': []}
    }

    print("Running estimator validation...")

    for n in sample_sizes:
        print(f"Sample size: {n}")

        for estimator_name in results.keys():
            estimates = []

            for i in range(n_runs):
                # Generate data
                model = FBMModel(H=true_H, sigma=1.0)
                data = model.generate(n, seed=i)

                # Apply estimator
                benchmark = ComprehensiveBenchmark()
                result = benchmark.run_classical_benchmark(
                    data_length=n,
                    estimators=[estimator_name]
                )

                if estimator_name in result.estimators:
                    estimates.append(
                        result.estimators[estimator_name].mean_estimate
                    )

            if estimates:
                results[estimator_name]['estimates'].append(np.mean(estimates))
                results[estimator_name]['std'].append(np.std(estimates))

    # Plot consistency
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    for estimator_name in results.keys():
        if results[estimator_name]['estimates']:
            plt.semilogx(sample_sizes[:len(results[estimator_name]['estimates'])],
                        results[estimator_name]['estimates'],
                        label=estimator_name.upper(), marker='o')

    plt.axhline(y=true_H, color='red', linestyle='--', label='True H')
    plt.xlabel('Sample Size')
    plt.ylabel('Estimated H')
    plt.title('Consistency Check')
    plt.legend()
    plt.grid(True)

    # Plot standard deviation
    plt.subplot(1, 2, 2)
    for estimator_name in results.keys():
        if results[estimator_name]['std']:
            plt.loglog(sample_sizes[:len(results[estimator_name]['std'])],
                      results[estimator_name]['std'],
                      label=estimator_name.upper(), marker='o')

    plt.xlabel('Sample Size')
    plt.ylabel('Standard Deviation')
    plt.title('Precision vs Sample Size')
    plt.legend()
    plt.grid(True)

    plt.tight_layout()
    plt.show()

    return results

# Run the example
if __name__ == "__main__":
    results = estimator_validation_example()
    print("Estimator validation completed!")

Comprehensive Validation Workflow

from lrdbenchmark import FBMModel, FGNModel, ComprehensiveBenchmark
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def comprehensive_validation_workflow():
    """Complete validation workflow for LRDBench estimators."""

    print("=== COMPREHENSIVE VALIDATION WORKFLOW ===")

    # 1. Define validation parameters
    H_values = np.linspace(0.3, 0.9, 13)
    sample_sizes = [500, 1000, 2000]
    n_realizations = 50

    # 2. Initialize results storage
    validation_results = []

    # 3. Run comprehensive validation
    for H in H_values:
        print(f"Testing H = {H:.2f}")

        for n in sample_sizes:
            for i in range(n_realizations):
                # Generate data
                model = FBMModel(H=H, sigma=1.0)
                data = model.generate(n, seed=int(H*1000 + i))

                # Run benchmark
                benchmark = ComprehensiveBenchmark()
                result = benchmark.run_classical_benchmark(
                    data_length=n,
                    estimators=['dfa', 'gph', 'rs', 'higuchi']
                )

                # Store results
                for estimator_name, estimator_result in result.estimators.items():
                    validation_results.append({
                        'true_H': H,
                        'sample_size': n,
                        'realization': i,
                        'estimator': estimator_name,
                        'estimated_H': estimator_result.mean_estimate,
                        'error': abs(estimator_result.mean_estimate - H)
                    })

    # 4. Analyze results
    df = pd.DataFrame(validation_results)

    print(f"\n=== VALIDATION SUMMARY ===")
    print(f"Total tests: {len(df)}")
    print(f"H range: {df['true_H'].min():.2f} to {df['true_H'].max():.2f}")
    print(f"Sample sizes: {sorted(df['sample_size'].unique())}")
    print(f"Estimators: {sorted(df['estimator'].unique())}")

    # 5. Calculate performance metrics
    performance = df.groupby('estimator').agg({
        'error': ['mean', 'std', 'min', 'max'],
        'estimated_H': ['mean', 'std']
    }).round(4)

    print(f"\n=== PERFORMANCE METRICS ===")
    print(performance)

    # 6. Create visualization
    plt.figure(figsize=(15, 10))

    # Error distribution
    plt.subplot(2, 3, 1)
    for estimator in df['estimator'].unique():
        subset = df[df['estimator'] == estimator]
        plt.hist(subset['error'], alpha=0.7, label=estimator.upper(), bins=20)
    plt.xlabel('Absolute Error')
    plt.ylabel('Frequency')
    plt.title('Error Distribution')
    plt.legend()

    # Error vs True H
    plt.subplot(2, 3, 2)
    for estimator in df['estimator'].unique():
        subset = df[df['estimator'] == estimator]
        plt.scatter(subset['true_H'], subset['error'], alpha=0.6, label=estimator.upper())
    plt.xlabel('True H')
    plt.ylabel('Absolute Error')
    plt.title('Error vs True H')
    plt.legend()

    # Error vs Sample Size
    plt.subplot(2, 3, 3)
    for estimator in df['estimator'].unique():
        subset = df[df['estimator'] == estimator]
        plt.scatter(subset['sample_size'], subset['error'], alpha=0.6, label=estimator.upper())
    plt.xlabel('Sample Size')
    plt.ylabel('Absolute Error')
    plt.title('Error vs Sample Size')
    plt.legend()

    # Estimated vs True H
    plt.subplot(2, 3, 4)
    for estimator in df['estimator'].unique():
        subset = df[df['estimator'] == estimator]
        plt.scatter(subset['true_H'], subset['estimated_H'], alpha=0.6, label=estimator.upper())
    plt.plot([0.3, 0.9], [0.3, 0.9], 'r--', label='Perfect')
    plt.xlabel('True H')
    plt.ylabel('Estimated H')
    plt.title('Estimated vs True H')
    plt.legend()

    # Box plot by estimator
    plt.subplot(2, 3, 5)
    df.boxplot(column='error', by='estimator', ax=plt.gca())
    plt.title('Error Distribution by Estimator')
    plt.suptitle('')

    # Box plot by sample size
    plt.subplot(2, 3, 6)
    df.boxplot(column='error', by='sample_size', ax=plt.gca())
    plt.title('Error Distribution by Sample Size')
    plt.suptitle('')

    plt.tight_layout()
    plt.show()

    return df, performance

# Run the comprehensive workflow
if __name__ == "__main__":
    df, performance = comprehensive_validation_workflow()
    print("Comprehensive validation workflow completed!")

1. Computational Complexity: Big-O notation for time and space complexity 2. Convergence Rate: Rate at which estimator approaches true value 3. Asymptotic Efficiency: Ratio of estimator variance to Cramér-Rao lower bound

Implementation:

import time
import psutil

def efficiency_analysis(estimator_func, data_sizes):
    """Analyze computational efficiency."""
    results = {}

    for n in data_sizes:
        # Generate test data
        test_data = np.random.randn(n)

        # Measure execution time
        start_time = time.time()
        start_memory = psutil.Process().memory_info().rss

        estimator_func(test_data)

        end_time = time.time()
        end_memory = psutil.Process().memory_info().rss

        execution_time = end_time - start_time
        memory_usage = end_memory - start_memory

        results[n] = {
            'execution_time': execution_time,
            'memory_usage': memory_usage,
            'time_per_sample': execution_time / n,
            'memory_per_sample': memory_usage / n
        }

    return results

Validation References

Beran, J. (1994). Statistics for Long-Memory Processes. Chapman & Hall.
Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice. OTexts.
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis. Wiley.
Shumway, R. H., & Stoffer, D. S. (2017). Time Series Analysis and Its Applications. Springer.

Nonstationarity and Time-Varying H

Overview

Classical LRD estimators assume stationarity—specifically that the Hurst parameter H is constant throughout the time series. When this assumption is violated, estimators can produce biased or misleading results.

Key stationarity assumptions violated by time-varying H:

Constant autocovariance structure: Classical estimators assume \(\gamma(k)\) is time-invariant
Scale invariance: Power-law decay of correlations requires constant scaling properties
Ergodicity: Time averages should equal ensemble averages

Why Classical Estimators Fail

Regime Switching:

When H switches between regimes (e.g., H=0.3 → H=0.8 at midpoint), classical estimators produce weighted averages of the regime-specific H values, with weights depending on:

Segment lengths
Estimator type (spectral vs. time domain)
Sample size

Continuous Drift:

Linear or smooth H(t) trajectories cause:

Systematic bias toward the time-averaged H̄ = ∫H(t)dt/T
Inflated variance estimates
Poor confidence interval coverage

Structural Breaks:

Level shifts and variance changes create spurious long-range correlations:

\[\hat{H}_{\text{spurious}} = \frac{1}{2} + \frac{\log(1 + \text{break\_severity})}{\log(n)}\]

Generating Time-Varying H Signals

LRDBench provides nonstationary generators in the generation module:

from lrdbenchmark.generation import (
    RegimeSwitchingProcess,
    ContinuousDriftProcess,
    StructuralBreakProcess,
    EnsembleTimeAverageProcess
)

# Regime switching: H=0.3 → H=0.8 at midpoint
gen = RegimeSwitchingProcess(h_regimes=[0.3, 0.8], change_points=[0.5])
result = gen.generate(1000)

# Linear drift: H increases from 0.3 to 0.8
gen = ContinuousDriftProcess(h_start=0.3, h_end=0.8, drift_type='linear')
result = gen.generate(1000)

# Returns dict with 'signal', 'h_trajectory', 'metadata'

Structural Break Detection

Overview

The StructuralBreakDetector class provides multiple tests for detecting change points that would invalidate stationarity assumptions.

Available Tests

1. CUSUM Test:

Cumulative sum test for mean shifts. The test statistic is:

\[S_k = \sum_{i=1}^k (X_i - \bar{X})\]

The maximum absolute deviation is compared against Brownian bridge critical values.

2. Recursive CUSUM:

Sequential/online detection suitable for real-time monitoring. Uses control limits to detect when cumulative deviations exceed threshold.

3. Chow Test:

Tests whether regression coefficients differ before and after a hypothesized break:

\[F = \frac{(RSS_R - RSS_U)/k}{RSS_U/(n-2k)}\]

where \(RSS_R\) and \(RSS_U\) are restricted and unrestricted residual sums.

4. ICSS Algorithm:

Iterative Cumulative Sum of Squares (Inclán & Tiao, 1994) for detecting variance change points.

Usage

from lrdbenchmark.analysis.diagnostics import StructuralBreakDetector

detector = StructuralBreakDetector(significance_level=0.05)

# Run all tests
result = detector.detect_all(data)

if result['any_break_detected']:
    print("⚠️ Stationarity violated!")
    print(result['warnings'])

# Individual tests
cusum_result = detector.cusum_test(data)
chow_result = detector.chow_test(data, break_index=500)

Nonequilibrium Physics Considerations

Ergodicity Breaking

In nonequilibrium systems, ensemble averages may differ from time averages:

\[\langle X \rangle_{\text{ensemble}} \neq \langle X \rangle_{\text{time}}\]

This occurs in aging systems (e.g., glassy dynamics) and subdiffusive processes. Classical estimators assume ergodicity, leading to systematic errors when violated.

Testing for Ergodicity:

from lrdbenchmark.generation import EnsembleTimeAverageProcess

gen = EnsembleTimeAverageProcess(H=0.7, aging_exponent=0.5)
result = gen.generate_ensemble(n_realizations=100, length=1000)

# Compare ensemble vs time averages
ensemble_mean = result['ensemble_mean']  # Mean across realizations
time_mean = result['time_mean']          # Mean across time for each realization

Aging Effects

Aging manifests as:

Time-dependent diffusion coefficients
Non-stationary waiting time distributions
Power-law decay of relaxation functions

LRDBench’s EnsembleTimeAverageProcess models aging via:

Power-law aging: \(H(t) \propto t^{-\alpha}\)
Logarithmic aging: \(H(t) \propto \log(t)\)
Exponential aging: \(H(t) \to H_{\text{boundary}}\)

Failure Mode Interpretation

Catalog of Classical Estimator Failures

Failure Mode	Affected Estimators	Physics Regime	Detection
Bias explosion	R/S, DFA	H → 0 or H → 1	\|Bias\| > 0.15
Scale sensitivity	Spectral methods	Nonstationarity	Sensitivity > 0.1
False positives	GPH	Short records (n < 512)	Type I > 10%
Heavy-tail breakdown	All	α < 2 stable	Variance overflow
Ergodicity breaking	All	Aging/nonequilibrium	Ensemble ≠ time

Interpreting Benchmark Results

When interpreting failure benchmark results:

Compare stationary vs. nonstationary MAE: Relative degradation indicates sensitivity
Check structural break detection: If breaks detected, results are unreliable
Examine H-dependent bias: Estimators often fail more at H boundaries
Consider sample size effects: Short series exacerbate all failure modes

Reporting Guidelines:

Report effect sizes (Cohen’s d) for bias significance
Apply Bonferroni/FDR correction for multiple comparisons
Include 95% CI from bootstrap (≥500 resamples)
Visualize with heatmaps: Estimator × H × Scenario