STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data

NeurIPS 2025

¹University of Glasgow, ²BMW Group
Figure: STaRFormer mechanics, the two main operations that drive STaRFormer: a) semi-supervised contrastive learning and b) dynamic attention-based regional masking.

Abstract

Understanding user intent is essential for situational and context-aware decision-making. Motivated by a real-world scenario, this work addresses intent prediction for smart-device users in the vicinity of vehicles by modeling sequential spatiotemporal data. However, in practice, environmental factors and sensor limitations can result in non-stationary and irregularly sampled data, posing significant challenges. To address these issues, we propose STaRFormer, a Transformer-based approach that can serve as a universal framework for sequential modeling. STaRFormer utilizes a new dynamic attention-based regional masking scheme combined with a novel semi-supervised contrastive learning paradigm to enhance task-specific latent representations. Comprehensive experiments on 56 datasets varying in type (including non-stationary and irregularly sampled), domain, sequence length, number of training samples, and application demonstrate the efficacy of STaRFormer. We achieve notable improvements over state-of-the-art approaches.

Motivation

This work addresses the challenges posed by real-world time series data, which often exhibit non-stationarity and irregular sampling due to factors such as sensor technology, external conditions, and device malfunctions. Conventional sequence models, such as LSTMs and Transformers, typically assume that the data are fully observed, stationary, and sampled at regular intervals. We developed a versatile framework, STaRFormer, that can effectively model time series with these characteristics while remaining applicable to regular time series as well.

A demonstration of the use case is available here.

Technical TL;DR

The proposed STaRFormer framework introduces dynamic attention-based regional masking and a novel semi-supervised contrastive learning scheme to create task-informed latent embeddings, making the model robust to irregularities in time series. The approach additionally serves as an effective augmentation method that improves performance across time series types (including non-stationary and irregularly sampled), domains, and downstream tasks.
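To make the masking mechanic concrete, below is a minimal PyTorch sketch of one plausible reading of attention-based regional masking: per-timestep attention scores select one contiguous window per sequence to mask. The function name, the window-scoring rule (highest summed attention), and the fixed region length are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def attention_based_regional_mask(attn: torch.Tensor, region_len: int) -> torch.Tensor:
    """Pick, per sequence, the contiguous window of `region_len` timesteps
    with the highest summed attention; return a boolean mask over time.

    attn: (B, T) per-timestep scores, e.g. attention weights averaged
          over heads and query positions of a Transformer layer.
    """
    B, T = attn.shape
    # Summed score of every length-`region_len` window: (B, T - region_len + 1)
    window_scores = attn.unfold(1, region_len, 1).sum(dim=-1)
    start = window_scores.argmax(dim=1)              # (B,) chosen window start
    t = torch.arange(T, device=attn.device)          # (T,)
    return (t >= start[:, None]) & (t < (start + region_len)[:, None])  # (B, T), True = masked
```

In a masked-modeling setup, the masked timesteps of an input `x` of shape (B, T, D) would then be replaced, e.g. `x[mask] = mask_token` with a learnable mask embedding, before reconstruction or downstream training.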

STaRFormer Architecture
Semi-Supervised Contrastive Learning Components

I. Formulation for sequence-level prediction tasks

Composition of batch-wise (bw) and class-wise (cw) contrastive components.

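For illustration, here is a hedged PyTorch sketch of how the bw and cw components could be composed for sequence-level tasks. It assumes a SimCLR-style batch-wise term over two augmented views and a SupCon-style class-wise term restricted to labeled samples; the temperature `tau`, the weight `lam`, and the handling of unlabeled samples (label -1) are our assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(z1, z2, labels, tau=0.1, lam=0.5):
    """z1, z2: (B, D) embeddings of two views of each sequence.
    labels: (B,) class ids, with -1 marking unlabeled samples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)

    # Batch-wise (bw) term: InfoNCE across views; matched pairs are positives.
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    loss_bw = F.cross_entropy(logits, targets)

    # Class-wise (cw) term: supervised contrastive over labeled samples only.
    labeled = labels >= 0
    loss_cw = z1.new_zeros(())
    if labeled.sum() > 1:
        zl, yl = z1[labeled], labels[labeled]
        sim = zl @ zl.t() / tau
        sim.fill_diagonal_(float("-inf"))            # drop self-similarity
        pos = (yl[:, None] == yl[None, :]).float()
        pos.fill_diagonal_(0.0)
        log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
        loss_cw = -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()
    return loss_bw + lam * loss_cw
```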

II. Formulation for element-wise prediction tasks

Composition of batch-wise (bw), intra-class-wise (cw-intra) and inter-class-wise (cw-inter) contrastive components.

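A corresponding hedged sketch for element-wise tasks: per-timestep embeddings are flattened, and same-class positives are split into intra-sequence (cw-intra) and cross-sequence (cw-inter) pairs; the bw term would follow the sequence-level sketch above. Again, the names and details are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def elementwise_contrastive_terms(z, y, tau=0.1):
    """z: (B, T, D) per-timestep embeddings; y: (B, T) labels, -1 = unlabeled.
    Returns (loss_cw_intra, loss_cw_inter)."""
    B, T, D = z.shape
    zf = F.normalize(z.reshape(B * T, D), dim=1)
    yf = y.reshape(B * T)
    sid = torch.arange(B, device=z.device).repeat_interleave(T)  # sequence id per timestep

    keep = yf >= 0                                   # use labeled timesteps only
    zf, yf, sid = zf[keep], yf[keep], sid[keep]

    sim = zf @ zf.t() / tau
    sim.fill_diagonal_(float("-inf"))                # drop self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    same_class = yf[:, None] == yf[None, :]
    same_seq = sid[:, None] == sid[None, :]

    def supcon(pos_bool):
        pos = pos_bool.float()
        pos.fill_diagonal_(0.0)
        return -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()

    # cw-intra: same class within one sequence; cw-inter: same class across sequences.
    return supcon(same_class & same_seq), supcon(same_class & ~same_seq)
```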

Results

Classification

Non-stationary and spatiotemporal time series

Table 1: Results for spatiotemporal, non-stationary time series.

             DKT                              GL
             Accuracy        F0.5-Score       Accuracy
RNN          0.754 ± 0.010   0.754 ± 0.010    0.643
TrajFormer   -               -                0.855
SVM          -               -                0.861
LSTM         0.844 ± 0.003   0.843 ± 0.002    0.884
GRU          0.840 ± 0.003   0.840 ± 0.003    0.898
ST-GRU       -               -                0.913
Transformer  0.849 ± 0.002   0.849 ± 0.002    0.881
TARNet       0.781 ± 0.011   0.782 ± 0.012    0.880
TimesURL     0.724 ± 0.003   -                0.751
STaRFormer   0.852 ± 0.003   0.852 ± 0.003    0.932

Irregularly sampled time series

Table 2: Results for irregularly sampled time series (in %).

             P19                        P12                        PAM
             AUROC       AUPRC          AUROC       AUPRC          Accuracy    Precision   Recall      F1-Score
Transformer  80.7 ± 3.8  42.7 ± 7.7     83.3 ± 0.7  47.9 ± 3.6     83.5 ± 1.5  84.8 ± 1.5  86.0 ± 1.2  85.0 ± 1.3
Trans-mean   83.7 ± 1.8  45.8 ± 3.2     82.6 ± 2.0  46.3 ± 4.0     83.7 ± 2.3  84.9 ± 2.6  86.4 ± 2.1  85.1 ± 2.4
GRU-D        83.9 ± 1.7  46.9 ± 2.1     81.9 ± 2.1  46.1 ± 4.7     83.3 ± 1.6  84.6 ± 1.2  85.2 ± 1.6  84.8 ± 1.2
SeFT         81.2 ± 2.3  41.9 ± 3.1     73.9 ± 2.5  31.1 ± 4.1     67.1 ± 2.2  70.0 ± 2.4  68.2 ± 1.5  68.5 ± 1.8
mTAND        84.4 ± 1.3  50.6 ± 2.0     84.2 ± 0.8  48.2 ± 3.4     74.6 ± 4.3  74.3 ± 4.0  79.5 ± 2.8  76.8 ± 3.4
IP-Net       84.6 ± 1.3  38.1 ± 3.7     82.6 ± 1.4  47.6 ± 3.1     74.3 ± 3.8  75.6 ± 2.1  77.9 ± 2.2  76.6 ± 2.8
DGM²-O       86.7 ± 3.4  44.7 ± 11.7    84.4 ± 1.6  47.3 ± 3.6     82.4 ± 2.3  85.2 ± 1.2  83.9 ± 2.3  84.3 ± 1.8
MTGNN        81.9 ± 6.2  39.9 ± 8.9     74.4 ± 6.7  35.5 ± 6.0     83.4 ± 1.9  85.2 ± 1.7  86.1 ± 1.9  85.9 ± 2.4
Raindrop     87.0 ± 2.3  51.8 ± 5.5     82.8 ± 1.7  44.0 ± 3.0     88.5 ± 1.5  89.9 ± 1.5  89.9 ± 0.6  89.8 ± 1.0
ViTST        89.2 ± 2.0  53.1 ± 3.4     85.1 ± 0.8  51.1 ± 4.1     95.8 ± 1.3  96.2 ± 1.3  96.5 ± 1.2  96.1 ± 1.1
STaRFormer   89.4 ± 1.3  61.3 ± 3.4     85.3 ± 1.2  52.0 ± 1.7     97.6 ± 0.9  97.3 ± 0.4  97.6 ± 0.3  97.4 ± 0.3

Regular time series

Table 3: Classification results on the multivariate time series UEA Benchmark (30 datasets). The first block reports results on all datasets each method covers (DS = dataset count); the second and third blocks restrict the comparison to subsets of 28 and 9 datasets, respectively. 1-v-1 counts pairwise comparisons against STaRFormer.

                 All reported datasets                          Subset of 28 datasets       Subset of 9 datasets
Method           Avg. Acc.  Rank  Avg. Rank  Top  1-v-1  DS    Avg. Acc.  Rank  Avg. Rank   Avg. Acc.  Rank  Avg. Rank
ViTST            0.790      -     -          1    8      10    -          -     -           0.776      2     6.4
DTWD             0.608      -     -          0    28     29    0.604      15    11.2        0.702      15    11.8
Weasel-Muse      0.691      -     -          5    20     28    0.691      10    7.8         0.737      7     9.0
TST (TimesURL)   0.617      13    10.6       1    29     30    0.631      14    11.7        0.674      16    12.3
T-Loss           0.658      12    8.6        1    27     30    0.675      13    9.1         0.717      12    11.1
TS-TCC           0.668      11    9.2        1    27     30    0.680      11    10.3        0.708      14    11.3
TNC              0.670      10    9.9        0    29     30    0.677      12    11.0        0.715      13    10.8
TS2Vec           0.704      9     7.4        1    25     30    0.713      9     8.1         0.734      9     9.0
InfoTS           0.714      8     6.8        1    27     30    0.722      8     7.5         0.727      10    10.0
Rocket           0.715      7     5.5        5    19     30    0.730      6     5.8         0.756      5     6.7
Mini-Rocket      0.719      6     5.7        4    22     30    0.733      5     6.0         0.751      6     7.4
TST (TARNet)     0.729      5     6.5        6    23     30    0.724      7     7.5         0.771      3     3.9
InfoTSs          0.730      4     5.3        3    23     30    0.738      4     5.8         0.736      8     8.4
TimesURL         0.752      3     3.9        4    19     30    0.760      3     4.1         0.770      4     5.3
TARNet           0.755      2     4.9        7    21     30    0.770      2     5.2         0.717      11    6.3
STaRFormer       0.795      1     2.8        9    -      30    0.793      1     3.1         0.793      1     3.3

Anomaly Detection

Table 4: Anomaly detection results (univariate).

             Yahoo                               KPI
             F1-Score   Precision   Recall       F1-Score   Precision   Recall
SPOT         0.338      0.269       0.454        0.217      0.786       0.126
DSPOT        0.316      0.241       0.458        0.521      0.623       0.447
DONUT        0.026      0.013       0.825        0.347      0.371       0.326
SR           0.563      0.451       0.747        0.622      0.647       0.598
TS2Vec       0.745      0.729       0.762        0.677      0.929       0.533
TimesURL     0.749      0.748       0.750        0.688      0.925       0.546
STaRFormer   0.789      0.772       0.807        0.830      0.852       0.811

Time Series Extrinsic Regression

Table 5: Regression results (RMSE) on the TSR Benchmark (19 datasets).

                 Avg. Rel. Mean Difference ↓   Avg. Rel. Mean Rank ↓   Top Scores ↑
FPCR              0.028                        9                       0
FPCR-Bspline      0.029                        10                      0
SVR               0.387                        16                      0
SVR Optimised     0.208                        14                      0
Random Forest    -0.121                        6                       0
XGBoost          -0.132                        5                       3
1-NN-ED           0.288                        15                      0
5-NN-ED           0.051                        11                      0
1-NN-DTWD         0.125                        12                      0
5-NN-DTWD        -0.034                        8                       0
Rocket           -0.245                        2                       6
FCN              -0.160                        4                       1
ResNet           -0.119                        7                       1
Inception        -0.220                        3                       3
TARNet            0.170                        13                      0
STaRFormer       -0.254                        1                       9

Ablation Studies

Figure: t-SNE visualizations of STaRFormer's latent space Z. For datasets where our contrastive learning approach is highly effective (e.g., PS), distinct class clusters are clearly visible.

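Visualizations of this kind can be reproduced with standard tooling. Below is a minimal sketch using scikit-learn's TSNE, where `Z` (N x D latent embeddings collected from the trained encoder) and `y` (class labels) are placeholders for your own arrays:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

Z = np.random.randn(500, 128)           # stand-in for real latent embeddings
y = np.random.randint(0, 4, size=500)   # stand-in for real class labels

# Project the latent space to 2-D and color points by class.
Z2 = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(Z)
plt.scatter(Z2[:, 0], Z2[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE of latent space Z")
plt.show()
```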

BibTeX

@misc{2504.10097,
  author = {Maximilian Forstenhäusler and Daniel Külzer and Christos Anagnostopoulos and Shameem Puthiya Parambath and Natascha Weber},
  title  = {STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data},
  year   = {2025},
  eprint = {arXiv:2504.10097},
}