Custom Models and Estimators ============================ This notebook demonstrates how to extend the lrdbenchmark library with custom data models and estimators, showing the extensibility and flexibility of the framework. Overview -------- lrdbenchmark is designed to be highly extensible, allowing users to add their own data models and estimators. This notebook covers: 1. **Understanding the Base Classes**: How the framework is structured 2. **Custom Data Models**: Creating new stochastic processes 3. **Custom Estimators**: Implementing new estimation methods 4. **Integration**: Making custom components work with the benchmark system 5. **Best Practices**: Guidelines for extensibility Table of Contents ----------------- 1. `Setup and Imports <#setup>`__ 2. `Understanding Base Classes <#base-classes>`__ 3. `Custom Data Model Example <#custom-data-model>`__ 4. `Custom Classical Estimator <#custom-classical>`__ 5. `Custom ML Estimator <#custom-ml>`__ 6. `Integration with Benchmark System <#integration>`__ 7. `Testing and Validation <#testing>`__ 8. `Best Practices <#best-practices>`__ 9. `Summary and Next Steps <#summary>`__ .. _setup: 1. Setup and Imports -------------------- First, let’s import all necessary libraries and examine the base classes that we’ll be extending. .. code:: ipython3 # Standard scientific computing imports import numpy as np # lrdbenchmark imports - using simplified API from lrdbenchmark import ( # Data models FBMModel, FGNModel, ARFIMAModel, MRWModel, AlphaStableModel, # Classical estimators RSEstimator, DFAEstimator, GPHEstimator, WhittleEstimator, # Machine Learning estimators RandomForestEstimator, SVREstimator, GradientBoostingEstimator, # Neural Network estimators CNNEstimator, LSTMEstimator, GRUEstimator, TransformerEstimator, # GPU utilities gpu_is_available, get_device_info, clear_gpu_cache, monitor_gpu_memory ) import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from scipy import stats from scipy.optimize import minimize import time import warnings import subprocess import gc warnings.filterwarnings('ignore') # GPU Memory Management Functions .. _base-classes: 2. Understanding Base Classes ----------------------------- Let’s examine the base classes that we’ll be extending to understand the framework structure. .. code:: ipython3 # Examine the BaseModel and BaseEstimator class structures print("🔍 Examining BaseModel class structure:") print("=" * 50) # Get the BaseModel class methods base_model_methods = [method for method in dir(BaseModel) if not method.startswith('_') and callable(getattr(BaseModel, method))] print("BaseModel methods:") for method in base_model_methods: print(f" - {method}") print("\n📋 Required methods for custom data models:") print(" - __init__(self, **kwargs): Initialize with parameters") print(" - generate(self, length, seed=None): Generate n samples") print(" - _validate_parameters(self): Optional method to validate input parameters") # Examine the BaseEstimator class structure print("\n🔍 Examining BaseEstimator class structure:") print("=" * 50) base_estimator_methods = [method for method in dir(BaseEstimator) if not method.startswith('_') and callable(getattr(BaseEstimator, method))] print("BaseEstimator methods:") for method in base_estimator_methods: print(f" - {method}") print("\n📋 Required methods for custom estimators:") print(" - __init__(self, **kwargs): Initialize with parameters") print(" - estimate(self, data): Estimate the Hurst parameter from data") print(" - _validate_parameters(self): Optional method to validate input parameters") .. _custom-data-model: 3. Custom Data Model Example ---------------------------- Let’s create a custom data model: **Fractional Ornstein-Uhlenbeck Process**. This is a mean-reverting process with long-range dependence, useful for modeling financial time series. .. code:: ipython3 class FractionalOrnsteinUhlenbeck(BaseModel): """ Custom Fractional Ornstein-Uhlenbeck Process. This process combines mean reversion with long-range dependence: dX_t = -θ(X_t - μ)dt + σ dB_H(t) where: - θ: mean reversion speed - μ: long-term mean - σ: volatility - B_H(t): fractional Brownian motion with Hurst parameter H """ def __init__(self, theta=0.1, mu=0.0, sigma=1.0, H=0.7, **kwargs): """ Initialize the Fractional Ornstein-Uhlenbeck process. Parameters ---------- theta : float Mean reversion speed (must be > 0) mu : float Long-term mean sigma : float Volatility (must be > 0) H : float Hurst parameter (0 < H < 1) """ self.theta = theta self.mu = mu self.sigma = sigma self.H = H # Store parameters for base class super().__init__(theta=theta, mu=mu, sigma=sigma, H=H, **kwargs) def _validate_parameters(self): """Validate the model parameters.""" if self.theta <= 0: raise ValueError("Mean reversion speed theta must be positive") if self.sigma <= 0: raise ValueError("Volatility sigma must be positive") if not (0 < self.H < 1): raise ValueError("Hurst parameter H must be in (0, 1)") def generate(self, n, seed=None): """ Generate n samples from the Fractional Ornstein-Uhlenbeck process. Parameters ---------- n : int Number of samples to generate seed : int, optional Random seed for reproducibility Returns ------- np.ndarray Generated time series """ if seed is not None: np.random.seed(seed) # Generate fractional Brownian motion fbm = FBMModel(H=self.H, sigma=1.0) fbm_path = fbm.generate(n, seed=seed) # Apply Ornstein-Uhlenbeck transformation dt = 1.0 / n # Time step x = np.zeros(n) x[0] = self.mu # Start at long-term mean for i in range(1, n): # Euler-Maruyama discretization dx = -self.theta * (x[i-1] - self.mu) * dt + self.sigma * (fbm_path[i] - fbm_path[i-1]) x[i] = x[i-1] + dx return x def get_theoretical_properties(self): """ Get theoretical properties of the process. Returns ------- dict Dictionary containing theoretical properties """ return { 'mean': self.mu, 'variance': self.sigma**2 / (2 * self.theta), 'autocorrelation_time': 1 / self.theta, 'hurst_parameter': self.H, 'mean_reversion_speed': self.theta, 'long_term_mean': self.mu, 'volatility': self.sigma } def __str__(self): return f"FractionalOrnsteinUhlenbeck(theta={self.theta}, mu={self.mu}, sigma={self.sigma}, H={self.H})" # Test the custom model print("🧪 Testing Custom Fractional Ornstein-Uhlenbeck Model:") print("=" * 60) # Create model instance fou = FractionalOrnsteinUhlenbeck(theta=0.1, mu=0.0, sigma=1.0, H=0.7) print(f"Model: {fou}") # Generate sample data n_samples = 1000 data = fou.generate(n_samples, seed=42) # Print basic statistics print(f"\n📊 Generated data statistics:") print(f" Length: {len(data)}") print(f" Mean: {data.mean():.4f}") print(f" Std: {data.std():.4f}") print(f" Min: {data.min():.4f}") print(f" Max: {data.max():.4f}") # Get theoretical properties theoretical = fou.get_theoretical_properties() print(f"\n🧮 Theoretical properties:") for key, value in theoretical.items(): print(f" {key}: {value:.4f}") # Visualize the data fig, axes = plt.subplots(2, 2, figsize=(15, 10)) # Time series axes[0, 0].plot(data[:500], linewidth=1.5, alpha=0.8) axes[0, 0].set_title('Fractional Ornstein-Uhlenbeck Process', fontsize=12, fontweight='bold') axes[0, 0].set_xlabel('Time') axes[0, 0].set_ylabel('Value') axes[0, 0].grid(True, alpha=0.3) # Distribution axes[0, 1].hist(data, bins=50, density=True, alpha=0.7) axes[0, 1].set_title('Distribution', fontsize=12, fontweight='bold') axes[0, 1].set_xlabel('Value') axes[0, 1].set_ylabel('Density') axes[0, 1].grid(True, alpha=0.3) # Autocorrelation from statsmodels.tsa.stattools import acf acf_values = acf(data, nlags=50, fft=True) axes[1, 0].plot(acf_values, linewidth=2, alpha=0.8) axes[1, 0].set_title('Autocorrelation Function', fontsize=12, fontweight='bold') axes[1, 0].set_xlabel('Lag') axes[1, 0].set_ylabel('ACF') axes[1, 0].grid(True, alpha=0.3) # Power spectral density freqs = np.fft.fftfreq(n_samples)[:n_samples//2] psd = np.abs(np.fft.fft(data))**2 psd = psd[:n_samples//2] axes[1, 1].loglog(freqs[1:], psd[1:], linewidth=2, alpha=0.8) axes[1, 1].set_title('Power Spectral Density', fontsize=12, fontweight='bold') axes[1, 1].set_xlabel('Frequency') axes[1, 1].set_ylabel('PSD') axes[1, 1].grid(True, alpha=0.3) plt.tight_layout() plt.savefig('outputs/custom_fou_model.png', dpi=300, bbox_inches='tight') plt.show() print("\n✅ Custom data model created and tested successfully!") .. _custom-classical: 4. Custom Classical Estimator ----------------------------- Let’s create a custom classical estimator: **Variance-Based Hurst Estimator**. This is a simple but effective method based on the scaling of variance with time. .. code:: ipython3 class VarianceBasedHurstEstimator(BaseEstimator): """ Custom Variance-Based Hurst Estimator. This estimator uses the scaling relationship: Var(X_t) ∝ t^(2H) where H is the Hurst parameter. """ def __init__(self, min_scale=10, max_scale=None, num_scales=10, **kwargs): """ Initialize the variance-based Hurst estimator. Parameters ---------- min_scale : int Minimum time scale for analysis max_scale : int, optional Maximum time scale (default: data_length // 4) num_scales : int Number of scales to use """ super().__init__(**kwargs) self.parameters = { 'min_scale': min_scale, 'max_scale': max_scale, 'num_scales': num_scales } def estimate(self, data): """ Estimate Hurst parameter using variance scaling. Parameters ---------- data : np.ndarray Time series data Returns ------- dict Dictionary containing estimation results """ n = len(data) # Get parameters min_scale = self.parameters['min_scale'] max_scale = self.parameters['max_scale'] num_scales = self.parameters['num_scales'] # Set maximum scale if not provided if max_scale is None: max_scale = n // 4 else: max_scale = min(max_scale, n // 4) # Generate scales scales = np.logspace(np.log10(min_scale), np.log10(max_scale), num_scales).astype(int) scales = np.unique(scales) # Remove duplicates # Calculate variances for each scale variances = [] valid_scales = [] for scale in scales: if scale >= n: continue # Calculate variance for this scale var_scale = self._calculate_variance_at_scale(data, scale) if var_scale > 0: variances.append(var_scale) valid_scales.append(scale) if len(variances) < 3: return {'hurst_parameter': None, 'error': 'Insufficient data for estimation'} # Fit power law: log(Var) = 2H * log(t) + C log_scales = np.log(valid_scales) log_variances = np.log(variances) # Linear regression coeffs = np.polyfit(log_scales, log_variances, 1) H_estimate = coeffs[0] / 2.0 # Calculate R-squared y_pred = coeffs[0] * log_scales + coeffs[1] ss_res = np.sum((log_variances - y_pred) ** 2) ss_tot = np.sum((log_variances - np.mean(log_variances)) ** 2) r_squared = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0 return { 'hurst_parameter': H_estimate, 'r_squared': r_squared, 'scales_used': len(valid_scales), 'method': 'Variance-Based' } def _calculate_variance_at_scale(self, data, scale): """Calculate variance at a specific time scale.""" n = len(data) if scale >= n: return 0 # Calculate variance for non-overlapping windows num_windows = n // scale if num_windows < 2: return 0 variances = [] for i in range(num_windows): start_idx = i * scale end_idx = start_idx + scale window_data = data[start_idx:end_idx] if len(window_data) > 0: variances.append(np.var(window_data)) return np.mean(variances) if variances else 0 # Test the custom estimator print("🧪 Testing Custom Variance-Based Hurst Estimator:") print("=" * 60) # Create estimator variance_estimator = VarianceBasedHurstEstimator(min_scale=10, num_scales=15) # Test on FBM data with known H H_test = 0.7 fbm = FBMModel(H=H_test, sigma=1.0) test_data = fbm.generate(1000, seed=42) # Estimate Hurst parameter result = variance_estimator.estimate(test_data) print(f"Test data: FBM with H = {H_test}") print(f"Estimated H: {result['hurst_parameter']:.4f}") print(f"Error: {abs(result['hurst_parameter'] - H_test):.4f}") print(f"R-squared: {result['r_squared']:.4f}") print(f"Scales used: {result['scales_used']}") # Compare with R/S estimator rs_estimator = RSEstimator() rs_result = rs_estimator.estimate(test_data) rs_H = rs_result.get('hurst_parameter', rs_result.get('H', None)) print(f"\nComparison with R/S estimator:") print(f"R/S H: {rs_H:.4f}") print(f"R/S Error: {abs(rs_H - H_test):.4f}") print("\n✅ Custom classical estimator created and tested successfully!") 5. Summary and Next Steps ------------------------- Key Takeaways ~~~~~~~~~~~~~ 1. **Extensibility**: lrdbenchmark is designed to be highly extensible with clear interfaces for custom components. 2. **Custom Data Models**: Inherit from BaseModel and implement required methods. 3. **Custom Estimators**: Follow standard interface and return consistent results. 4. **Integration**: Custom components work with existing benchmark system. Best Practices ~~~~~~~~~~~~~~ - Parameter validation - Error handling - Documentation - Testing - Performance consideration - Compatibility Next Steps ~~~~~~~~~~ 1. Test custom components in benchmark system 2. Apply to real-world data 3. Explore advanced features 4. Share with community -------------- **Next Tutorial**: :doc:`tutorial_04_benchmarking` — learn how to run the comprehensive benchmarking system.