Evaluation Metrics

fastcpd-python provides six evaluation metrics for assessing change point detection performance.

Overview

| Metric           | Description                                      | Range  | Best Value |
|------------------|--------------------------------------------------|--------|------------|
| Precision        | Fraction of detected CPs that are correct        | [0, 1] | 1.0        |
| Recall           | Fraction of true CPs that are detected           | [0, 1] | 1.0        |
| F1-Score         | Harmonic mean of precision and recall            | [0, 1] | 1.0        |
| Hausdorff        | Maximum distance between true and detected sets  | [0, ∞) | 0.0        |
| Covering         | Multi-annotator agreement metric                 | [0, 1] | 1.0        |
| Annotation Error | Average distance of detections from true CPs     | [0, ∞) | 0.0        |

Quick Start

Basic Evaluation

from fastcpd.metrics import evaluate_all
from fastcpd.segmentation import mean
from fastcpd.datasets import make_mean_change
import numpy as np

# Generate data with known change points
data_dict = make_mean_change(n_samples=500, n_changepoints=3, seed=42)

# Detect change points
result = mean(data_dict['data'], beta="MBIC")

# Evaluate all metrics
metrics = evaluate_all(
    true_cps=data_dict['changepoints'],
    pred_cps=result.cp_set.tolist(),
    n_samples=500,
    margin=10  # Tolerance window
)

# Display results
print(f"Precision: {metrics['point_metrics']['precision']:.3f}")
print(f"Recall:    {metrics['point_metrics']['recall']:.3f}")
print(f"F1-Score:  {metrics['point_metrics']['f1_score']:.3f}")

Example output:

Precision: 1.000
Recall:    0.667
F1-Score:  0.800

Individual Metrics

Precision and Recall

Precision: What fraction of detected change points are correct?

\[\text{Precision} = \frac{TP}{TP + FP}\]

Recall (Sensitivity): What fraction of true change points were detected?

\[\text{Recall} = \frac{TP}{TP + FN}\]

from fastcpd.metrics import precision_recall

true_cps = [100, 200, 300]
detected_cps = [98, 205, 350]

pr = precision_recall(true_cps, detected_cps, n_samples=500, margin=10)

print(f"Precision: {pr['precision']:.3f}")
print(f"Recall:    {pr['recall']:.3f}")

Parameters:

  • margin: Tolerance window (default: 10)

    • A detected CP within margin of a true CP counts as correct (see the matching sketch after the interpretation notes below)

    • Common values: 5-20 depending on application

Interpretation:

  • High Precision, Low Recall: Conservative detection (few false positives, many misses)

  • Low Precision, High Recall: Liberal detection (many false positives, few misses)

  • High Both: Excellent detection

  • Low Both: Poor detection
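
To make the margin-based matching concrete, the sketch below counts true positives, false positives, and false negatives by pairing each detection with the nearest still-unmatched true change point within the margin. It illustrates the definitions above and is not necessarily the library's exact matching routine:

def match_within_margin(true_cps, detected_cps, margin=10):
    # Pair each detection with the nearest still-unmatched true CP within margin.
    unmatched = set(range(len(true_cps)))
    tp = 0
    for d in detected_cps:
        candidates = [(abs(d - true_cps[i]), i) for i in unmatched
                      if abs(d - true_cps[i]) <= margin]
        if candidates:
            _, i = min(candidates)
            unmatched.discard(i)
            tp += 1
    fp = len(detected_cps) - tp   # detections with no nearby true CP
    fn = len(true_cps) - tp       # true CPs with no nearby detection
    return tp, fp, fn

tp, fp, fn = match_within_margin([100, 200, 300], [98, 205, 350], margin=10)
print(tp / (tp + fp), tp / (tp + fn))  # precision = 0.667, recall = 0.667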

F1-Score

Harmonic mean of precision and recall, balancing both metrics.

\[F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

from fastcpd.metrics import precision_recall

pr = precision_recall(true_cps, detected_cps, n_samples=500, margin=10)
print(f"F1-Score: {pr['f1_score']:.3f}")

Advantages:

  • Single metric summarizing performance

  • Balances precision and recall

  • Standard in machine learning

Disadvantages:

  • Doesn’t capture the magnitude of errors

  • Sensitive to margin parameter

Hausdorff Distance

Maximum distance between true and detected change point sets (in both directions).

\[d_H(A, B) = \max\left(\max_{a \in A} \min_{b \in B} |a - b|, \max_{b \in B} \min_{a \in A} |a - b|\right)\]

from fastcpd.metrics import hausdorff_distance

hd = hausdorff_distance(true_cps, detected_cps)
print(f"Hausdorff Distance: {hd['hausdorff']}")

Interpretation:

  • 0: Perfect match

  • Larger values: Worse match

  • Captures worst-case error

  • Useful for applications requiring guarantees

Use Cases:

  • Medical applications (where missing a change point is costly)

  • Safety-critical systems

  • Quality control

Annotation Error

Mean absolute error between optimally matched true and detected change points.

\[\text{AE} = \frac{1}{|\text{Matched Pairs}|} \sum_{(t,d) \in \text{Matched Pairs}} |t - d|\]

where matched pairs are determined by greedy closest-pair matching between true and detected sets.

from fastcpd.metrics import annotation_error

ae = annotation_error(true_cps, detected_cps)
print(f"Annotation Error: {ae['error']:.2f}")

Interpretation:

  • 0: All matched pairs have perfect localization

  • Lower is better

  • Measures average localization accuracy for matched pairs

  • Unmatched change points (when |true| ≠ |detected|) do not contribute to the error

Covering Metric

Multi-annotator agreement metric. Measures how well detected change points “cover” multiple sets of annotations.

\[\text{Covering} = \frac{1}{K} \sum_{k=1}^K \frac{|D \cap T_k^{\text{margin}}|}{|T_k|}\]

where \(T_k\) is the k-th annotator’s change points, and \(T_k^{\text{margin}}\) is expanded by margin.

from fastcpd.metrics import covering_metric

# Multiple annotators
true_cps_multi = [
    [100, 200, 300],     # Annotator 1
    [105, 195, 305],     # Annotator 2
    [98, 203, 298]       # Annotator 3
]

detected_cps = [102, 201, 299]

covering = covering_metric(true_cps_multi, detected_cps, margin=10)
print(f"Covering: {covering:.3f}")

Use Cases:

  • Datasets with multiple expert annotations

  • Ambiguous change point locations

  • Robustness assessment

Advantages:

  • Accounts for annotation uncertainty

  • More realistic than single ground truth

  • Commonly used in research papers

Evaluate All Metrics at Once

from fastcpd.metrics import evaluate_all

metrics = evaluate_all(
    true_cps=[100, 200, 300],
    pred_cps=[98, 205, 310],
    n_samples=500,
    margin=10
)

# Returns a dictionary of metrics; point metrics are nested under 'point_metrics'
# (see the access pattern above), other metrics sit at the top level
print("All Metrics:")
for metric_name, value in metrics.items():
    if isinstance(value, dict):
        for sub_name, sub_value in value.items():
            print(f"  {sub_name:20s}: {sub_value:.3f}")
    else:
        print(f"  {metric_name:20s}: {value:.3f}")

Example output:

All Metrics:
  precision           : 1.000
  recall              : 1.000
  f1_score            : 1.000
  hausdorff           : 10.000
  annotation_error    : 6.333
  one_to_one          : 1.000

Metric Return Values

Rich Dictionary Returns

Most metrics return a dictionary with detailed breakdown:

from fastcpd.metrics import precision_recall

result = precision_recall(true_cps, detected_cps, n_samples=500, margin=10)

# Dictionary with detailed fields
print(result)
# {
#     'precision': 0.667,
#     'recall': 0.667,
#     'f1_score': 0.667,
#     'true_positives': 2,
#     'false_positives': 1,
#     'false_negatives': 1,
#     'n_true': 3,
#     'n_detected': 3,
#     'margin': 10
# }

Advantages:

  • Detailed debugging information

  • Understand exactly what happened

  • Report multiple aspects

Choosing Metrics

By Application

| Application                | Recommended Metrics                  |
|----------------------------|--------------------------------------|
| General research           | Precision, Recall, F1-Score          |
| Medical/safety-critical    | Recall (don’t miss!), Hausdorff      |
| Quality control            | Precision (avoid false alarms), F1   |
| Multi-annotator data       | Covering metric                      |
| Exact localization needed  | Annotation Error, Hausdorff          |
| Algorithm comparison       | F1-Score, Covering                   |

By Constraint

Minimize False Positives (e.g., avoid unnecessary interventions):

  • Focus on Precision

  • Use conservative (larger) beta values

Minimize False Negatives (e.g., don’t miss critical events):

  • Focus on Recall

  • Use liberal (smaller) beta values

Balance Both:

  • Use F1-Score

  • Tune beta to maximize F1 (see the comparison sketch below)
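
As a quick illustration of conservative versus liberal detection, the sketch below reuses data_dict from the Quick Start example and compares a large penalty against a small one (a larger beta penalizes each additional change point more heavily, so it tends to detect fewer change points):

from fastcpd.segmentation import mean

conservative = mean(data_dict['data'], beta=30.0)   # fewer detections (more conservative)
liberal = mean(data_dict['data'], beta=5.0)         # more detections (more liberal)
print(len(conservative.cp_set), len(liberal.cp_set))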

Best Practices

Choosing the Margin

The margin parameter is critical:

# Too small: penalizes small localization errors
metrics_strict = evaluate_all(true_cps, detected_cps, n_samples=500, margin=2)

# Reasonable: 10-20 typical
metrics_moderate = evaluate_all(true_cps, detected_cps, n_samples=500, margin=10)

# Too large: accepts poor localization
metrics_loose = evaluate_all(true_cps, detected_cps, n_samples=500, margin=50)

Guidelines:

  • Small data (n<100): margin=2-5

  • Medium data (n=100-1000): margin=5-15

  • Large data (n>1000): margin=10-30

  • Application-specific: Use domain knowledge

Report Multiple Metrics

Don’t rely on a single metric:

# Report at least these three
print(f"Precision: {metrics['point_metrics']['precision']:.3f}")
print(f"Recall:    {metrics['point_metrics']['recall']:.3f}")
print(f"F1-Score:  {metrics['point_metrics']['f1_score']:.3f}")

# Plus application-specific
print(f"Hausdorff: {metrics['hausdorff']:.1f}")

Cross-Validation for Beta Selection

from fastcpd.segmentation import mean
from fastcpd.metrics import precision_recall
from fastcpd.datasets import make_mean_change

# Generate data
data_dict = make_mean_change(n_samples=500, n_changepoints=3, seed=42)

# Try different beta values
beta_values = [5.0, 10.0, 15.0, 20.0, 30.0, "BIC", "MBIC", "MDL"]
results = []

for beta_val in beta_values:
    result = mean(data_dict['data'], beta=beta_val)
    pr = precision_recall(
        data_dict['changepoints'],
        result.cp_set.tolist(),
        n_samples=500,
        margin=10
    )
    f1 = pr['f1_score']
    results.append((beta_val, f1, len(result.cp_set)))
    print(f"Beta={str(beta_val):8s}: F1={f1:.3f}, n_cp={len(result.cp_set)}")

# Select best beta
best_beta, best_f1, best_ncp = max(results, key=lambda x: x[1])
print(f"\nBest: Beta={best_beta}, F1={best_f1:.3f}")

Integration with Datasets

All dataset generators return metadata including ground truth change points:

from fastcpd.datasets import make_mean_change, make_glm_change
from fastcpd.metrics import evaluate_all
from fastcpd.segmentation import mean

# Mean change dataset
data_dict = make_mean_change(n_samples=500, n_changepoints=3, seed=42)
# Returns: data, changepoints, means, noise_std, SNR, etc.

# GLM dataset (stored under a separate name so data_dict still refers to the mean change data)
glm_dict = make_glm_change(
    n_samples=500,
    n_predictors=5,
    n_changepoints=2,
    family='binomial'
)
# Returns: data, changepoints, coefficients, AUC, etc.

# Detect and evaluate on the mean change dataset
result = mean(data_dict['data'], beta="MBIC")
metrics = evaluate_all(
    data_dict['changepoints'],
    result.cp_set.tolist(),
    n_samples=500,
    margin=10
)

Next Steps