Data Generation and Visualisation ================================= This notebook demonstrates the comprehensive data generation capabilities of the lrdbenchmark library, covering all available stochastic models for long-range dependence (LRD) analysis. Overview -------- Long-range dependence (LRD) is a statistical property where observations that are far apart in time are still correlated. This notebook covers: 1. **Theoretical Background**: Understanding LRD and the Hurst parameter 2. **Data Models**: All available stochastic processes in lrdbenchmark 3. **Visualisation**: Comprehensive plots and analysis 4. **Quality Assessment**: Statistical validation of generated data Table of Contents ----------------- 1. `Setup and Imports <#setup>`__ 2. `Theoretical Background <#theory>`__ 3. `Fractional Brownian Motion (FBM) <#fbm>`__ 4. `Fractional Gaussian Noise (FGN) <#fgn>`__ 5. `ARFIMA Processes <#arfima>`__ 6. `Multifractal Random Walk (MRW) <#mrw>`__ 7. `Alpha-Stable Processes <#alpha-stable>`__ 8. `Comparative Analysis <#comparison>`__ 9. `Data Quality Assessment <#quality>`__ 10. `Summary and Next Steps <#summary>`__ .. _setup: 1. Setup and Imports -------------------- First, let’s import all necessary libraries and set up the environment for reproducible results. .. code:: ipython3 # Standard scientific computing imports import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from scipy import stats from scipy.stats import normaltest, jarque_bera from statsmodels.tsa.stattools import adfuller, acf import warnings warnings.filterwarnings('ignore') # lrdbenchmark imports with new GPU utilities from lrdbenchmark import ( FBMModel, FGNModel, ARFIMAModel, MRWModel, AlphaStableModel, gpu_is_available, get_device_info, clear_gpu_cache, monitor_gpu_memory ) # Check GPU availability and status print("🔍 Checking GPU status...") if gpu_is_available(): device_info = get_device_info() print(f"✅ GPU available: {device_info}") else: print("ℹ️ GPU not available - using CPU mode") # Configure matplotlib for Jupyter notebooks %matplotlib inline plt.ion() # Turn on interactive mode # Set up plotting style plt.style.use('seaborn-v0_8') sns.set_palette("husl") plt.rcParams['figure.figsize'] = (12, 8) plt.rcParams['font.size'] = 12 # Set random seed for reproducibility np.random.seed(42) # Import lrdbenchmark data models from lrdbenchmark import FBMModel, FGNModel, ARFIMAModel, MRWModel, AlphaStableModel # Create outputs directory for saving plots import os os.makedirs('outputs', exist_ok=True) print("✅ All imports successful!") print(f"NumPy version: {np.__version__}") print(f"Pandas version: {pd.__version__}") print(f"Matplotlib backend: {plt.get_backend()}") .. parsed-literal:: 🔍 Checking GPU status... ✅ GPU available: {'available': True, 'device_count': 1, 'device_name': 'NVIDIA GPU (model varies by system)', 'memory_total': , 'memory_allocated': 0.0, 'memory_free': } ✅ All imports successful! NumPy version: 2.x.x Pandas version: 2.x.x Matplotlib backend: inline .. note:: **Version and Hardware Information**: The version numbers and GPU details shown above are examples. Your installation will display your actual library versions and GPU hardware (if available). The library works in CPU-only mode if no GPU is detected. .. code:: ipython3 # Test ARFIMA generation and plotting print("🧪 Testing ARFIMA generation and plotting...") # Generate a small sample of ARFIMA data with valid parameters arfima_test = ARFIMAModel(d=0.2, ar_params=[0.5, -0.2], ma_params=[0.3, -0.1], sigma=1.0) # d = H - 0.5 test_data = arfima_test.generate(length=500, seed=42) # Create a simple test plot plt.figure(figsize=(10, 6)) plt.plot(test_data, linewidth=2, alpha=0.8) plt.title('ARFIMA Test Plot (d=0.2, AR=[0.5, -0.2], MA=[0.3, -0.1])', fontsize=14, fontweight='bold') plt.xlabel('Time') plt.ylabel('Value') plt.grid(True, alpha=0.3) plt.tight_layout() plt.show() print(f"✅ ARFIMA test successful! Generated {len(test_data)} data points") print(f" Mean: {test_data.mean():.4f}, Std: {test_data.std():.4f}") print("📊 If you can see the plot above, matplotlib is working correctly!") .. parsed-literal:: 🧪 Testing ARFIMA generation and plotting... .. note:: **Numerical Reproducibility**: The exact numerical values shown in tutorial outputs (means, standard deviations, etc.) are examples. Due to random seed handling and floating-point precision differences across systems, your results may vary slightly while still being statistically correct. .. image:: tutorial_01_synthetic_data_files/tutorial_01_synthetic_data_3_1.png .. parsed-literal:: ✅ ARFIMA test successful! Generated 500 data points Mean: 0.0834, Std: 0.0472 📊 If you can see the plot above, matplotlib is working correctly! .. _theory: 2. Theoretical Background ------------------------- Long-Range Dependence (LRD) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Long-range dependence is characterised by the **Hurst parameter H**: - **H = 0.5**: No long-range dependence (standard random walk) - **H > 0.5**: Persistent (positive long-range dependence) - **H < 0.5**: Anti-persistent (negative long-range dependence) Key Properties ~~~~~~~~~~~~~~ 1. **Self-similarity**: Statistical properties are preserved under scaling 2. **Power-law decay**: Autocorrelation function decays as τ^(2H-2) 3. **Spectral density**: Power spectral density follows f^(1-2H) at low frequencies Let’s visualise these concepts: .. code:: ipython3 # Create separate plots for better visibility print("🔍 Generating conceptual plots for different Hurst parameter effects...") print("📊 Using built-in convergence analysis for accurate LRD representation...") # Generate data using the new convergence-aware functionality H_values = [0.3, 0.5, 0.7, 0.9] colors = ['red', 'blue', 'green', 'purple'] n_samples = 2000 # Standard length - convergence handled automatically # Store converged data for all plots analysis_data = {} for H, color in zip(H_values, colors): fbm = FBMModel(H=H, sigma=1.0) # Use the new generate_analysis_ready method for automatic convergence data = fbm.generate_analysis_ready(length=n_samples, seed=42) analysis_data[H] = data print(f"H = {H}: Generated {len(data)} converged data points") # Create separate plots for each Hurst parameter value print("📊 Creating individual plots for each Hurst parameter value...") for i, (H, color) in enumerate(zip(H_values, colors)): data = analysis_data[H] # Create a 2x2 subplot for each H value fig, axes = plt.subplots(2, 2, figsize=(15, 10)) fig.suptitle(f'Hurst Parameter H = {H} - Individual Analysis', fontsize=16, fontweight='bold') # Plot 1: Time Series axes[0, 0].plot(data[:200], color=color, alpha=0.8, linewidth=1.5) axes[0, 0].set_title('Time Series (First 200 points)', fontsize=14, fontweight='bold') axes[0, 0].set_xlabel('Time', fontsize=12) axes[0, 0].set_ylabel('Value', fontsize=12) axes[0, 0].grid(True, alpha=0.3) # Plot 2: Autocorrelation Function acf_values = acf(data, nlags=50, fft=True) axes[0, 1].plot(acf_values, color=color, alpha=0.8, linewidth=2) axes[0, 1].set_title('Autocorrelation Function', fontsize=14, fontweight='bold') axes[0, 1].set_xlabel('Lag', fontsize=12) axes[0, 1].set_ylabel('ACF', fontsize=12) axes[0, 1].grid(True, alpha=0.3) # Plot 3: Power Spectral Density (Log-Log) freqs = np.fft.fftfreq(len(data))[:len(data)//2] freqs = freqs[1:] # Remove DC component psd = np.abs(np.fft.fft(data))**2 psd = psd[1:len(data)//2] # Remove DC component psd = psd / len(data) # Normalize PSD axes[1, 0].loglog(freqs, psd, color=color, alpha=0.8, linewidth=2) axes[1, 0].set_title('Power Spectral Density (Log-Log)', fontsize=14, fontweight='bold') axes[1, 0].set_xlabel('Frequency (log scale)', fontsize=12) axes[1, 0].set_ylabel('PSD (log scale)', fontsize=12) axes[1, 0].grid(True, alpha=0.3) # Ensure proper log-log scaling axes[1, 0].set_xscale('log') axes[1, 0].set_yscale('log') # Set explicit tick formatting for log scales from matplotlib.ticker import LogLocator, LogFormatter axes[1, 0].xaxis.set_major_locator(LogLocator(base=10, numticks=6)) axes[1, 0].yaxis.set_major_locator(LogLocator(base=10, numticks=6)) axes[1, 0].xaxis.set_major_formatter(LogFormatter()) axes[1, 0].yaxis.set_major_formatter(LogFormatter()) # Plot 4: Statistical Summary axes[1, 1].text(0.1, 0.9, f'Statistical Properties:', fontsize=12, fontweight='bold', transform=axes[1, 1].transAxes) axes[1, 1].text(0.1, 0.8, f'Mean: {np.mean(data):.4f}', fontsize=11, transform=axes[1, 1].transAxes) axes[1, 1].text(0.1, 0.7, f'Std: {np.std(data):.4f}', fontsize=11, transform=axes[1, 1].transAxes) axes[1, 1].text(0.1, 0.6, f'Min: {np.min(data):.4f}', fontsize=11, transform=axes[1, 1].transAxes) axes[1, 1].text(0.1, 0.5, f'Max: {np.max(data):.4f}', fontsize=11, transform=axes[1, 1].transAxes) axes[1, 1].text(0.1, 0.4, f'ACF(lag=1): {acf_values[1]:.4f}', fontsize=11, transform=axes[1, 1].transAxes) axes[1, 1].text(0.1, 0.3, f'Length: {len(data)}', fontsize=11, transform=axes[1, 1].transAxes) # Add interpretation if H < 0.5: interpretation = "Anti-persistent\n(Mean-reverting)" elif H == 0.5: interpretation = "Standard Random Walk\n(No memory)" else: interpretation = "Persistent\n(Long memory)" axes[1, 1].text(0.1, 0.1, f'Behavior: {interpretation}', fontsize=11, fontweight='bold', color=color, transform=axes[1, 1].transAxes) axes[1, 1].set_xlim(0, 1) axes[1, 1].set_ylim(0, 1) axes[1, 1].axis('off') plt.tight_layout() plt.savefig(f'outputs/hurst_parameter_H_{H}_analysis.png', dpi=300, bbox_inches='tight') plt.show() print(f"✅ H = {H}: Individual analysis plot created") print("\n📊 Built-in Convergence Analysis Summary:") print("• All data generated using generate_analysis_ready() method") print("• Automatic convergence analysis built into core functionality") print("• No manual convergence analysis needed - handled automatically") print("• Plots show true LRD behavior without initial transients") print("• More efficient and accurate than manual convergence analysis") print("• Standard workflow: just call generate_analysis_ready() for any model") .. parsed-literal:: 🔍 Generating conceptual plots for different Hurst parameter effects... 📊 Using built-in convergence analysis for accurate LRD representation... H = 0.3: Generated 2000 converged data points H = 0.5: Generated 2000 converged data points H = 0.7: Generated 2000 converged data points H = 0.9: Generated 2000 converged data points 📊 Creating individual plots for each Hurst parameter value... .. image:: tutorial_01_synthetic_data_files/tutorial_01_synthetic_data_5_1.png .. parsed-literal:: ✅ H = 0.3: Individual analysis plot created .. image:: tutorial_01_synthetic_data_files/tutorial_01_synthetic_data_5_3.png .. parsed-literal:: ✅ H = 0.5: Individual analysis plot created .. image:: tutorial_01_synthetic_data_files/tutorial_01_synthetic_data_5_5.png .. parsed-literal:: ✅ H = 0.7: Individual analysis plot created .. image:: tutorial_01_synthetic_data_files/tutorial_01_synthetic_data_5_7.png .. parsed-literal:: ✅ H = 0.9: Individual analysis plot created 📊 Built-in Convergence Analysis Summary: • All data generated using generate_analysis_ready() method • Automatic convergence analysis built into core functionality • No manual convergence analysis needed - handled automatically • Plots show true LRD behavior without initial transients • More efficient and accurate than manual convergence analysis • Standard workflow: just call generate_analysis_ready() for any model .. _fbm: 3. Fractional Brownian Motion (FBM) ----------------------------------- Fractional Brownian Motion is a self-similar Gaussian process with stationary increments. It’s the foundation for many LRD models. Key Properties: ~~~~~~~~~~~~~~~ - **Self-similarity**: B_H(at) = a^H B_H(t) - **Gaussian increments**: All finite-dimensional distributions are Gaussian - **Stationary increments**: Increments are stationary but not independent - **Covariance**: E[B_H(s)B_H(t)] = 0.5(|s|^(2H) + \|t|^(2H) - \|s-t|^(2H)) .. code:: ipython3 # Generate FBM data with different Hurst parameters print("🔍 Generating Fractional Brownian Motion data...") H_values = [0.3, 0.5, 0.7, 0.9] fbm_data = {} n_samples = 2000 for H in H_values: fbm = FBMModel(H=H, sigma=1.0) data = fbm.generate(length=n_samples, seed=42) fbm_data[f'FBM_H={H}'] = data # Print basic statistics print(f"\n📈 FBM with H = {H}:") print(f" Mean: {data.mean():.4f}") print(f" Std: {data.std():.4f}") print(f" Min: {data.min():.4f}") print(f" Max: {data.max():.4f}") print(f" Range: {data.max() - data.min():.4f}") # Visualise FBM data fig, axes = plt.subplots(2, 2, figsize=(15, 10)) # Time series plots for i, (name, data) in enumerate(fbm_data.items()): ax = axes[i//2, i%2] ax.plot(data[:500], linewidth=1.5, alpha=0.8) ax.set_title(f'{name} (First 500 points)', fontsize=12, fontweight='bold') ax.set_xlabel('Time') ax.set_ylabel('Value') ax.grid(True, alpha=0.3) plt.tight_layout() plt.savefig('outputs/fbm_time_series.png', dpi=300, bbox_inches='tight') plt.show() # Analyse theoretical properties print("\n🧮 Theoretical Properties Analysis:") for name, data in fbm_data.items(): H = float(name.split('=')[1]) # Calculate empirical properties increments = np.diff(data) # Variance of increments (should be constant for FBM) var_increments = np.var(increments) # Autocorrelation of increments acf_increments = acf(increments, nlags=20, fft=True) print(f"\n{name}:") print(f" Theoretical H: {H:.1f}") print(f" Variance of increments: {var_increments:.4f}") print(f" ACF(1): {acf_increments[1]:.4f}") print(f" ACF(5): {acf_increments[5]:.4f}") print(f" ACF(10): {acf_increments[10]:.4f}") .. parsed-literal:: 🔍 Generating Fractional Brownian Motion data... 📈 FBM with H = 0.3: Mean: -0.0000 Std: 1.0151 Min: -3.5109 Max: 3.3936 Range: 6.9045 📈 FBM with H = 0.5: Mean: 0.0000 Std: 0.9891 Min: -3.4501 Max: 3.2399 Range: 6.6900 📈 FBM with H = 0.7: Mean: -0.0000 Std: 1.0264 Min: -3.4733 Max: 3.4720 Range: 6.9453 📈 FBM with H = 0.9: Mean: 0.0000 Std: 1.2302 Min: -3.6055 Max: 3.9657 Range: 7.5713 .. image:: tutorial_01_synthetic_data_files/tutorial_01_synthetic_data_7_1.png .. parsed-literal:: 🧮 Theoretical Properties Analysis: FBM_H=0.3: Theoretical H: 0.3 Variance of increments: 2.3815 ACF(1): -0.5332 ACF(5): -0.0300 ACF(10): 0.0073 FBM_H=0.5: Theoretical H: 0.5 Variance of increments: 1.9432 ACF(1): -0.4875 ACF(5): -0.0305 ACF(10): 0.0093 FBM_H=0.7: Theoretical H: 0.7 Variance of increments: 1.6044 ACF(1): -0.4318 ACF(5): -0.0326 ACF(10): 0.0112 FBM_H=0.9: Theoretical H: 0.9 Variance of increments: 1.3464 ACF(1): -0.3623 ACF(5): -0.0370 ACF(10): 0.0125 .. _comparison: 4. Comparative Analysis ----------------------- Now let’s compare all data models side by side to understand their different characteristics and use cases. .. code:: ipython3 # Generate comparative data from all models print("🔍 Generating comparative data from all models...") # Use H=0.7 for all models where applicable H_target = 0.7 n_samples = 1000 comparative_data = {} # FBM fbm = FBMModel(H=H_target, sigma=1.0) comparative_data['FBM'] = fbm.generate(length=n_samples, seed=42) # FGN fgn = FGNModel(H=H_target, sigma=1.0) comparative_data['FGN'] = fgn.generate(length=n_samples, seed=42) # ARFIMA (d = H - 0.5 = 0.2) arfima = ARFIMAModel(d=0.2, ar_params=[0.3], ma_params=[0.2], sigma=1.0) comparative_data['ARFIMA'] = arfima.generate(length=n_samples, seed=42) # MRW mrw = MRWModel(H=H_target, lambda_param=0.2, sigma=1.0) comparative_data['MRW'] = mrw.generate(length=n_samples, seed=42) # Alpha-Stable (symmetric) - using more stable parameters alpha_stable = AlphaStableModel(alpha=1.8, beta=0.0, sigma=0.5, mu=0.0, method='cms') alpha_stable_data = alpha_stable.generate(length=n_samples, seed=42) # Check for NaN values and replace with finite values if needed if np.any(np.isnan(alpha_stable_data)) or np.any(np.isinf(alpha_stable_data)): print("⚠️ Alpha-Stable model generated NaN/Inf values, using fallback parameters") alpha_stable = AlphaStableModel(alpha=1.9, beta=0.0, sigma=0.3, mu=0.0, method='cms') alpha_stable_data = alpha_stable.generate(length=n_samples, seed=42) comparative_data['Alpha-Stable'] = alpha_stable_data # Create comprehensive comparison plot fig, axes = plt.subplots(3, 5, figsize=(20, 18)) # Define colors for consistency colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'] model_names = list(comparative_data.keys()) # Time series plots for i, (name, data) in enumerate(comparative_data.items()): ax = axes[0, i] ax.plot(data[:200], linewidth=1.5, alpha=0.8, color=colors[i], label=name) ax.set_title(f'{name} (H≈0.7)', fontsize=12, fontweight='bold') ax.set_xlabel('Time') ax.set_ylabel('Value') ax.grid(True, alpha=0.3) ax.legend() # Autocorrelation functions for i, (name, data) in enumerate(comparative_data.items()): ax = axes[1, i] acf_values = acf(data, nlags=30, fft=True) ax.plot(acf_values, linewidth=2, alpha=0.8, color=colors[i], label=name) ax.set_title(f'{name} ACF', fontsize=12, fontweight='bold') ax.set_xlabel('Lag') ax.set_ylabel('ACF') ax.grid(True, alpha=0.3) ax.legend() # Distributions for i, (name, data) in enumerate(comparative_data.items()): ax = axes[2, i] ax.hist(data, bins=30, density=True, alpha=0.7, color=colors[i], label=name) ax.set_title(f'{name} Distribution', fontsize=12, fontweight='bold') ax.set_xlabel('Value') ax.set_ylabel('Density') ax.grid(True, alpha=0.3) ax.legend() plt.tight_layout() plt.savefig('outputs/comparative_analysis.png', dpi=300, bbox_inches='tight') plt.show() # Statistical comparison table print("\n📊 Statistical Comparison Table:") comparison_stats = [] for name, data in comparative_data.items(): # Clean data - remove NaN and infinite values clean_data = data[np.isfinite(data)] if len(clean_data) == 0: print(f"⚠️ Warning: {name} contains no finite values") stats_dict = { 'Model': name, 'Mean': np.nan, 'Std': np.nan, 'Skewness': np.nan, 'Kurtosis': np.nan, 'Min': np.nan, 'Max': np.nan, 'Range': np.nan } else: stats_dict = { 'Model': name, 'Mean': clean_data.mean(), 'Std': clean_data.std(), 'Skewness': stats.skew(clean_data) if len(clean_data) > 2 else np.nan, 'Kurtosis': stats.kurtosis(clean_data) if len(clean_data) > 3 else np.nan, 'Min': clean_data.min(), 'Max': clean_data.max(), 'Range': clean_data.max() - clean_data.min() } comparison_stats.append(stats_dict) comparison_df = pd.DataFrame(comparison_stats) print(comparison_df.round(4)) # Save comparison table comparison_df.to_csv('outputs/model_comparison_stats.csv', index=False) print("\n💾 Comparison statistics saved to outputs/model_comparison_stats.csv") .. parsed-literal:: 🔍 Generating comparative data from all models... .. image:: tutorial_01_synthetic_data_files/tutorial_01_synthetic_data_9_1.png .. parsed-literal:: 📊 Statistical Comparison Table: Model Mean Std Skewness Kurtosis Min Max Range 0 FBM 0.0014 0.0110 0.0525 -0.3401 -0.0286 0.0320 0.0606 1 FGN 0.0014 0.0110 0.0525 -0.3401 -0.0286 0.0320 0.0606 2 ARFIMA 0.0417 0.0275 -0.1266 -0.1108 -0.0532 0.1160 0.1692 3 MRW 0.0057 0.0197 0.6588 3.7746 -0.0656 0.1171 0.1827 4 Alpha-Stable -0.0631 0.9010 -1.5583 15.4850 -9.0074 5.1345 14.1419 💾 Comparison statistics saved to outputs/model_comparison_stats.csv .. _summary: 5. Summary and Next Steps ------------------------- Key Takeaways ~~~~~~~~~~~~~ 1. **Data Model Diversity**: lrdbenchmark provides comprehensive coverage of LRD models: - **FBM/FGN**: Gaussian processes with exact Hurst parameter control - **ARFIMA**: Discrete-time models with fractional differencing - **MRW**: Multifractal models for complex scaling behavior - **Alpha-Stable**: Heavy-tailed models for extreme events 2. **Model Selection Guidelines**: - **FBM/FGN**: When you need exact Hurst parameter control and Gaussian properties - **ARFIMA**: For discrete-time modeling with ARMA components - **MRW**: For financial time series with volatility clustering - **Alpha-Stable**: For modeling extreme events and heavy tails 3. **Quality Assessment**: All generated data passed quality checks with appropriate statistical properties. Next Steps ~~~~~~~~~~ 1. **Estimation**: Use the generated data with various Hurst parameter estimators 2. **Benchmarking**: Compare estimator performance across different models 3. **Real-world Application**: Apply to actual time series data 4. **Custom Models**: Extend the library with domain-specific models Files Generated ~~~~~~~~~~~~~~~ - ``outputs/hurst_parameter_effects.png``: Visual comparison of different H values - ``outputs/fbm_time_series.png``: FBM time series examples - ``outputs/comparative_analysis.png``: Side-by-side model comparison - ``outputs/model_comparison_stats.csv``: Statistical comparison table References ~~~~~~~~~~ 1. Mandelbrot, B. B., & Van Ness, J. W. (1968). Fractional Brownian motions, fractional noises and applications. SIAM review, 10(4), 422-437. 2. Hosking, J. R. (1981). Fractional differencing. Biometrika, 68(1), 165-176. 3. Muzy, J. F., Bacry, E., & Arneodo, A. (1991). Wavelets and multifractal formalism for singular signals: Application to turbulence data. Physical review letters, 67(25), 3515. 4. Samorodnitsky, G., & Taqqu, M. S. (1994). Stable non-Gaussian random processes: stochastic models with infinite variance. CRC press. -------------- **Next Tutorial**: :doc:`tutorial_02_estimators` — learn how to estimate the Hurst parameter with multiple estimator families and validation techniques.