Datasets API

Dataset generation utilities for benchmarking and testing.

Overview

Synthetic data generation for change point detection.

This module provides functions that generate synthetic time series data containing known change points. Its features include:

  • Multiple model types (mean, variance, regression, GLM, LASSO, ARMA, GARCH)

  • Rich metadata (SNR, difficulty scores, true parameters)

  • Realistic parameter generation

  • Multiple change patterns (jump, drift, coefficient changes)

  • Reproducibility through seed control

All make_* generator functions return dictionaries with ‘data’, ‘changepoints’, and ‘metadata’ keys for analysis and benchmarking.
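As a small illustration of how these keys fit together, the sketch below splits a series into segments at its change points. The dictionary is built by hand to stand in for a generator's output, and split_segments is a hypothetical helper, not part of fastcpd. It assumes each change point index marks the first sample of a new segment, which should be checked against the library's actual convention.

```python
def split_segments(data, changepoints):
    """Split a sequence into consecutive segments at the given indices.

    Assumes each change point is the index of the first sample of a new
    segment (an assumption of this sketch, not a documented convention).
    """
    bounds = [0] + sorted(int(cp) for cp in changepoints) + [len(data)]
    return [data[start:end] for start, end in zip(bounds[:-1], bounds[1:])]


# Hand-built stand-in for the dictionary a generator returns.
result = {
    "data": [0.1, -0.2, 0.0, 5.1, 4.9, 5.0, 5.2, 1.0, 0.9],
    "changepoints": [3, 7],
    "metadata": {},
}

segments = split_segments(result["data"], result["changepoints"])
for i, seg in enumerate(segments):
    print(f"segment {i}: {len(seg)} samples, mean {sum(seg) / len(seg):.2f}")
```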

fastcpd.datasets.add_annotation_noise(true_changepoints: List | ndarray, n_annotators: int = 5, noise_std: float = 5.0, agreement_rate: float = 0.8, seed: int | None = None) → List[List[int]]

Simulate multiple human annotators with varying agreement.

Useful for testing the covering metric and multi-annotator scenarios.

Parameters:
  • true_changepoints – Ground truth change points

  • n_annotators – Number of annotators to simulate

  • noise_std – Std of Gaussian noise added to CP locations

  • agreement_rate – Probability each annotator includes each CP

  • seed – Random seed

Returns:

List of lists; each sublist holds one annotator’s change points

Examples

>>> true_cps = [100, 200, 300]
>>> annotators = add_annotation_noise(true_cps, n_annotators=5)
>>> print(f"Annotator 1: {annotators[0]}")
>>> print(f"Annotator 2: {annotators[1]}")
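The parameter descriptions above suggest a simple procedure, sketched here with the standard library only. This is an illustration of the idea, not the library's implementation: simulate_annotators is a hypothetical name, and rounding to integers and clipping at zero are assumptions of this sketch.

```python
import random


def simulate_annotators(true_changepoints, n_annotators=5, noise_std=5.0,
                        agreement_rate=0.8, seed=None):
    """Return one list of noisy change points per simulated annotator."""
    rng = random.Random(seed)
    annotations = []
    for _ in range(n_annotators):
        marked = []
        for cp in true_changepoints:
            # Each annotator keeps each true change point with
            # probability agreement_rate.
            if rng.random() < agreement_rate:
                # Jitter the location with Gaussian noise, then round to an
                # integer index and clip at zero (assumptions of this sketch).
                noisy = int(round(cp + rng.gauss(0.0, noise_std)))
                marked.append(max(0, noisy))
        annotations.append(sorted(marked))
    return annotations


annotators = simulate_annotators([100, 200, 300], n_annotators=5, seed=0)
for i, cps in enumerate(annotators, start=1):
    print(f"Annotator {i}: {cps}")
```

Setting agreement_rate=1.0 and noise_std=0.0 reproduces the ground truth for every annotator, which makes a convenient sanity check.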

Dataset Generators

Parametric Datasets

fastcpd.datasets.make_mean_change(n_samples: int = 500, n_changepoints: int = 3, n_dim: int = 1, mean_deltas: List[float] | None = None, noise_std: float = 1.0, change_type: str = 'jump', seed: int | None = None) → Dict

Generate data with mean changes.

Parameters:
  • n_samples – Total number of samples

  • n_changepoints – Number of change points

  • n_dim – Data dimensionality

  • mean_deltas – List of mean shifts for each segment (auto-generated if None)

  • noise_std – Noise standard deviation

  • change_type – ‘jump’ (step change) or ‘drift’ (gradual)

  • seed – Random seed

Returns:

Dictionary with:

  • data: array of shape (n_samples, n_dim)

  • changepoints: array of CP indices

  • true_means: list of segment means

  • metadata: dict with SNR, deltas, difficulty, etc.

Return type:

Dict

Examples

>>> data_dict = make_mean_change(n_samples=500, n_changepoints=3)
>>> data = data_dict['data']
>>> cps = data_dict['changepoints']
>>> print(f"SNR: {data_dict['metadata']['snr']:.2f}")
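To make the two change_type options concrete, here is a dependency-free sketch of a univariate series with piecewise means. It is not the library's code: piecewise_mean_series is a hypothetical helper, and reading ‘drift’ as a linear ramp across the whole segment is an assumption made for illustration.

```python
import random


def piecewise_mean_series(n_samples, changepoints, means, noise_std=1.0,
                          change_type="jump", seed=None):
    """Univariate series whose mean moves between the given segment means.

    'jump' switches the mean at each change point; 'drift' is interpreted
    here as a linear ramp across the segment (an assumption of this sketch).
    """
    rng = random.Random(seed)
    bounds = [0] + list(changepoints) + [n_samples]
    series = []
    for k, (start, end) in enumerate(zip(bounds[:-1], bounds[1:])):
        prev_mean = means[k - 1] if k > 0 else means[0]
        for t in range(start, end):
            if change_type == "drift":
                # Fraction of the way through the current segment.
                frac = (t - start + 1) / (end - start)
                mu = prev_mean + frac * (means[k] - prev_mean)
            else:
                mu = means[k]
            series.append(mu + rng.gauss(0.0, noise_std))
    return series


jump = piecewise_mean_series(300, [100, 200], [0.0, 3.0, -1.0], seed=0)
drift = piecewise_mean_series(300, [100, 200], [0.0, 3.0, -1.0],
                              change_type="drift", seed=0)
print(len(jump), len(drift))
```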

fastcpd.datasets.make_variance_change(n_samples: int = 500, n_changepoints: int = 3, n_dim: int = 1, variance_ratios: List[float] | None = None, base_var: float = 1.0, change_type: str = 'multiplicative', seed: int | None = None) → Dict

Generate data with variance changes.

Parameters:
  • n_samples – Total number of samples

  • n_changepoints – Number of change points

  • n_dim – Data dimensionality

  • variance_ratios – List of variance multipliers (auto-generated if None)

  • base_var – Baseline variance

  • change_type – ‘multiplicative’ or ‘additive’

  • seed – Random seed

Returns:

Dictionary with:

  • data: array of shape (n_samples, n_dim)

  • changepoints: array of CP indices

  • true_variances: list of segment variances

  • metadata: dict with variance_ratios, kurtosis, etc.

Return type:

Dict

Examples

>>> data_dict = make_variance_change(n_samples=500, n_changepoints=2)
>>> print(data_dict['metadata']['variance_ratios'])
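One way to sanity-check a variance-change dataset is to compare each segment's sample variance with its intended value. The sketch below builds its own data with the standard library; with a real dataset, the same comparison could use data_dict['data'], data_dict['changepoints'] and data_dict['true_variances']. Treating each segment variance as base_var times its ratio is an assumed reading of ‘multiplicative’, and the ratios shown are made up.

```python
import random
import statistics

rng = random.Random(0)
base_var = 1.0
variance_ratios = [1.0, 4.0, 0.25]  # hypothetical multipliers, one per segment
changepoints = [2000, 4000]
n_samples = 6000

# Multiplicative reading (an assumption): segment variance = base_var * ratio.
true_variances = [base_var * ratio for ratio in variance_ratios]

# Draw zero-mean Gaussian noise with the target variance in each segment.
bounds = [0] + changepoints + [n_samples]
data = []
for var, start, end in zip(true_variances, bounds[:-1], bounds[1:]):
    data.extend(rng.gauss(0.0, var ** 0.5) for _ in range(end - start))

# Compare the empirical variance of each segment with its target.
sample_variances = [
    statistics.pvariance(data[start:end])
    for start, end in zip(bounds[:-1], bounds[1:])
]
for k, (target, sample) in enumerate(zip(true_variances, sample_variances)):
    print(f"segment {k}: target {target:.2f}, sample {sample:.2f}")
```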

Regression Datasets

fastcpd.datasets.make_regression_change(n_samples: int = 500, n_changepoints: int = 3, n_features: int = 3, coef_changes: str | List[ndarray] = 'random', noise_std: float = 0.5, correlation: float = 0.0, seed: int | None = None) → Dict

Generate linear regression data with coefficient changes.

Parameters:
  • n_samples – Total number of samples

  • n_changepoints – Number of change points

  • n_features – Number of covariates

  • coef_changes – ‘random’, ‘sign_flip’, ‘magnitude’, or list of coefficient arrays

  • noise_std – Error term std deviation

  • correlation – Covariate correlation (0 to 0.9)

  • seed – Random seed

Returns:

Dictionary with:

  • data: array (n_samples, n_features+1) where [:, 0] is y and [:, 1:] is X

  • changepoints: array of CP indices

  • true_coefs: array (n_segments, n_features) of coefficients per segment

  • X: covariate matrix (n_samples, n_features)

  • y: response vector (n_samples,)

  • metadata: dict with R², condition number, effect size

Return type:

Dict

Examples

>>> data_dict = make_regression_change(n_samples=300, n_changepoints=2, n_features=3)
>>> X = data_dict['X']
>>> y = data_dict['y']
>>> print(data_dict['metadata']['r_squared_per_segment'])
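The data layout described above (response in column 0, covariates in the remaining columns) can be illustrated with a small standalone sketch. It generates y = x·β + noise with a different β per segment, using a sign flip as the change. The coefficient values are hypothetical, the covariates are uncorrelated, and nothing here is taken from the library's implementation.

```python
import random

rng = random.Random(1)
n_samples, n_features, noise_std = 300, 3, 0.5
changepoints = [150]

# Hypothetical coefficients; the second segment flips every sign.
beta_first = [1.0, -2.0, 0.5]
true_coefs = [beta_first, [-b for b in beta_first]]

bounds = [0] + changepoints + [n_samples]
rows = []
for beta, start, end in zip(true_coefs, bounds[:-1], bounds[1:]):
    for _ in range(end - start):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        y = sum(b * xi for b, xi in zip(beta, x)) + rng.gauss(0.0, noise_std)
        rows.append([y] + x)  # response first, covariates after

# Recover the response and covariates from the combined layout.
y_col = [row[0] for row in rows]
X_rows = [row[1:] for row in rows]
print(len(rows), len(rows[0]), len(X_rows[0]))
```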

fastcpd.datasets.make_glm_change(n_samples: int = 500, n_changepoints: int = 3, n_features: int = 3, family: str = 'binomial', coef_changes: str | List[ndarray] = 'random', trials: int | None = None, correlation: float = 0.0, seed: int | None = None) → Dict

Generate GLM data with coefficient changes.

Parameters:
  • n_samples – Total number of samples

  • n_changepoints – Number of change points

  • n_features – Number of covariates

  • family – ‘binomial’ or ‘poisson’

  • coef_changes – ‘random’, ‘sign_flip’, or list of coefficient arrays

  • trials – Number of trials for binomial (default: 1 for logistic regression)

  • correlation – Covariate correlation (0 to 0.9)

  • seed – Random seed

Returns:

Dictionary with:

  • data: array (n_samples, n_features+1) where [:, 0] is y and [:, 1:] is X

  • changepoints: array of CP indices

  • true_coefs: array (n_segments, n_features) of coefficients per segment

  • X: covariate matrix (n_samples, n_features)

  • y: response vector (n_samples,)

  • metadata: dict with AUC (binomial), overdispersion (poisson), etc.

Return type:

Dict

Examples

>>> data_dict = make_glm_change(n_samples=400, family='binomial', n_features=3)
>>> X = data_dict['X']
>>> y = data_dict['y']
>>> print(data_dict['metadata']['separation_per_segment'])
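For the binomial family with a single trial, the underlying model is logistic regression: each response is a Bernoulli draw with success probability sigmoid(x·β). The sketch below simulates two segments with different coefficients using only the standard library. logistic_segment and the coefficient values are hypothetical, and the sketch does not claim to match how the library draws its data.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_segment(n, beta, rng):
    """Draw n rows [y, x_1, ..., x_p] from a logistic model with coefficients beta."""
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        y = 1 if rng.random() < p else 0  # one Bernoulli trial
        rows.append([y] + x)
    return rows


rng = random.Random(2)
# Hypothetical coefficients: the second segment flips every sign.
rows = (logistic_segment(200, [1.5, -1.0, 0.5], rng)
        + logistic_segment(200, [-1.5, 1.0, -0.5], rng))

y = [row[0] for row in rows]
print(f"positive rate, segment 0: {sum(y[:200]) / 200:.2f}")
print(f"positive rate, segment 1: {sum(y[200:]) / 200:.2f}")
```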

Time Series Datasets

fastcpd.datasets.make_arma_change(n_samples: int = 500, n_changepoints: int = 3, orders: List[Tuple[int, int]] | None = None, sigma_change: bool = False, innovation: str = 'normal', seed: int | None = None) → Dict

Generate ARMA time series with parameter changes.

Parameters:
  • n_samples – Total number of samples

  • n_changepoints – Number of change points

  • orders – List of (p,q) tuples for each segment (auto-generated if None)

  • sigma_change – If True, innovation variance also changes

  • innovation – ‘normal’, ‘t’, or ‘skew_normal’

  • seed – Random seed

Returns:

Dictionary with:

  • data: array (n_samples,)

  • changepoints: array of CP indices

  • true_params: list of dicts with ‘ar’, ‘ma’, ‘sigma’ for each segment

  • metadata: dict with stationarity checks, ACF, PACF

Return type:

Dict

Examples

>>> data_dict = make_arma_change(n_samples=500, orders=[(1,1), (2,0)])
>>> print(data_dict['metadata']['is_stationary'])
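As a rough picture of what a piecewise ARMA series looks like, the sketch below runs the ARMA recursion x_t = Σ φ_i x_{t-i} + ε_t + Σ θ_j ε_{t-j} and switches parameters at the change points. The parameter dictionaries mirror the documented ‘ar’, ‘ma’, ‘sigma’ keys, but the values, both function names, and the choice to carry the series state across segments are assumptions. The stationarity helper covers AR orders up to 2 only.

```python
import random


def ar_is_stationary(ar):
    """Stationarity check for AR orders 0, 1 and 2 only."""
    if len(ar) == 0:
        return True
    if len(ar) == 1:
        return abs(ar[0]) < 1.0
    if len(ar) == 2:
        phi1, phi2 = ar
        return phi1 + phi2 < 1.0 and phi2 - phi1 < 1.0 and abs(phi2) < 1.0
    raise NotImplementedError("higher orders need the AR polynomial roots")


def simulate_piecewise_arma(n_samples, changepoints, params, seed=None):
    """Simulate an ARMA series whose parameters switch at the change points."""
    rng = random.Random(seed)
    bounds = [0] + list(changepoints) + [n_samples]
    x, eps = [], []
    for p, start, end in zip(params, bounds[:-1], bounds[1:]):
        for t in range(start, end):
            e = rng.gauss(0.0, p["sigma"])
            value = e
            # Autoregressive part: weighted past observations.
            value += sum(phi * x[t - i]
                         for i, phi in enumerate(p["ar"], start=1) if t - i >= 0)
            # Moving-average part: weighted past innovations.
            value += sum(theta * eps[t - j]
                         for j, theta in enumerate(p["ma"], start=1) if t - j >= 0)
            x.append(value)
            eps.append(e)
    return x


# Hypothetical parameters: ARMA(1,1) followed by AR(2).
params = [
    {"ar": [0.6], "ma": [0.3], "sigma": 1.0},
    {"ar": [0.5, -0.3], "ma": [], "sigma": 1.0},
]
series = simulate_piecewise_arma(500, [250], params, seed=0)
print(len(series), [ar_is_stationary(p["ar"]) for p in params])
```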

fastcpd.datasets.make_garch_change(n_samples: int = 500, n_changepoints: int = 3, orders: List[Tuple[int, int]] | None = None, volatility_regimes: List[str] | None = None, seed: int | None = None) → Dict

Generate GARCH time series with volatility regime changes.

Parameters:
  • n_samples – Total number of samples

  • n_changepoints – Number of change points

  • orders – List of (p,q) tuples for each segment (auto-generated if None)

  • volatility_regimes – List of ‘low’, ‘medium’, ‘high’ for each segment

  • seed – Random seed

Returns:

Dictionary with:

  • data: array (n_samples,) of returns

  • changepoints: array of CP indices

  • true_params: list of dicts with ‘omega’, ‘alpha’, ‘beta’ for each segment

  • volatility: array (n_samples,) of conditional volatility

  • metadata: dict with volatility ratios, kurtosis

Return type:

Dict

Examples

>>> data_dict = make_garch_change(n_samples=600, n_changepoints=2)
>>> returns = data_dict['data']
>>> vol = data_dict['volatility']
>>> print(data_dict['metadata']['avg_volatility_per_segment'])
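The relationship between ‘omega’, ‘alpha’, ‘beta’ and the conditional volatility can be seen in the standard GARCH(1,1) recursion, σ²_t = ω + α r²_{t-1} + β σ²_{t-1}. The sketch below applies it with parameters that switch at a change point. The parameter values, the function name, and starting from the first regime's unconditional variance ω / (1 − α − β) are choices made for this illustration, not details taken from the library.

```python
import random


def simulate_piecewise_garch11(n_samples, changepoints, params, seed=None):
    """Simulate GARCH(1,1) returns with regime-specific parameters.

    Returns (returns, volatility), both lists of length n_samples.
    Each regime needs alpha + beta < 1 for a finite unconditional variance.
    """
    rng = random.Random(seed)
    bounds = [0] + list(changepoints) + [n_samples]
    first = params[0]
    # Start from the unconditional variance of the first regime.
    sigma2 = first["omega"] / (1.0 - first["alpha"] - first["beta"])
    returns, volatility = [], []
    for p, start, end in zip(params, bounds[:-1], bounds[1:]):
        for t in range(start, end):
            if t > 0:
                # Update the conditional variance from the previous step.
                sigma2 = (p["omega"] + p["alpha"] * returns[-1] ** 2
                          + p["beta"] * sigma2)
            returns.append(sigma2 ** 0.5 * rng.gauss(0.0, 1.0))
            volatility.append(sigma2 ** 0.5)
    return returns, volatility


# Hypothetical 'low' and 'high' volatility regimes.
params = [
    {"omega": 0.01, "alpha": 0.05, "beta": 0.90},
    {"omega": 0.50, "alpha": 0.10, "beta": 0.85},
]
returns, vol = simulate_piecewise_garch11(600, [300], params, seed=0)
low_avg = sum(vol[:300]) / 300
high_avg = sum(vol[300:]) / 300
print(f"average volatility: low regime {low_avg:.2f}, high regime {high_avg:.2f}")
```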

Example Usage

Basic Dataset Generation

from fastcpd.datasets import make_mean_change

# Generate a dataset with 3 change points; the seed argument makes it reproducible
data_dict = make_mean_change(
    n_samples=600,
    n_changepoints=3,
    seed=42
)

# Access components
print(f"Data shape: {data_dict['data'].shape}")
print(f"Change points: {data_dict['changepoints']}")

GLM Dataset

from fastcpd.datasets import make_glm_change

# Logistic regression dataset
data_dict = make_glm_change(
    n_samples=800,
    n_features=5,
    n_changepoints=2,
    family='binomial',
    seed=42
)

# Extract response and predictors
y = data_dict['data'][:, 0]
X = data_dict['data'][:, 1:]