STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data

NeurIPS 2025

¹University of Glasgow, ²BMW Group
Figure: STaRFormer mechanics, the two main operations that drive STaRFormer: a) semi-supervised contrastive learning and b) dynamic attention-based regional masking.

Abstract

Understanding user intent is essential for situational and context-aware decision-making. Motivated by a real-world scenario, this work addresses intent prediction for smart-device users in the vicinity of vehicles by modeling sequential spatiotemporal data. However, in practice, environmental factors and sensor limitations can result in non-stationary and irregularly sampled data, posing significant challenges. To address these issues, we propose STaRFormer, a Transformer-based approach that can serve as a universal framework for sequential modeling. STaRFormer utilizes a new dynamic attention-based regional masking scheme combined with a novel semi-supervised contrastive learning paradigm to enhance task-specific latent representations. Comprehensive experiments on 56 datasets varying in type (including non-stationary and irregularly sampled), domain, sequence length, number of training samples, and application demonstrate the efficacy of STaRFormer. We achieve notable improvements over state-of-the-art approaches.

Motivation

This work addresses the challenges posed by real-world time series data, which often exhibit non-stationarity and irregular sampling due to factors such as sensor technology, external conditions, and device malfunctions. Conventional sequence models, such as LSTMs and Transformers, typically assume that the data are fully observed, stationary, and sampled at regular intervals. We developed a versatile framework, STaRFormer, that can effectively model time series with these characteristics while remaining applicable to regular time series as well.

A demonstration of the use case is available here.

Technical TL;DR

The proposed STaRFormer framework introduces dynamic attention-based regional masking and a novel semi-supervised contrastive learning scheme to create task-informed latent embeddings, making the model robust to irregularities in time series. The approach additionally serves as an effective augmentation method that improves performance across time series types (including non-stationary and irregularly sampled), domains, and downstream tasks.
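To make the masking mechanic concrete, below is a minimal PyTorch sketch of one plausible reading of attention-based regional masking: per-timestep attention scores select one contiguous window per sequence to mask. The function name, the window-scoring rule (highest summed attention), and the fixed region length are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def attention_based_regional_mask(attn: torch.Tensor, region_len: int) -> torch.Tensor:
    """Pick, per sequence, the contiguous window of `region_len` timesteps
    with the highest summed attention; return a boolean mask over time.

    attn: (B, T) per-timestep scores, e.g. attention weights averaged
          over heads and query positions of a Transformer layer.
    """
    B, T = attn.shape
    # Summed score of every length-`region_len` window: (B, T - region_len + 1)
    window_scores = attn.unfold(1, region_len, 1).sum(dim=-1)
    start = window_scores.argmax(dim=1)              # (B,) chosen window start
    t = torch.arange(T, device=attn.device)          # (T,)
    return (t >= start[:, None]) & (t < (start + region_len)[:, None])  # (B, T), True = masked
```

In a masked-modeling setup, the masked timesteps of an input `x` of shape (B, T, D) would then be replaced, e.g. `x[mask] = mask_token` with a learnable mask embedding, before reconstruction or downstream training.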

STaRFormer Architecture
Semi-Supervised Contrastive Learning Components

I. Formulation for sequence-level prediction tasks

Composition of batch-wise (bw) and class-wise (cw) contrastive components.

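For illustration, here is a hedged PyTorch sketch of how the bw and cw components could be composed for sequence-level tasks. It assumes a SimCLR-style batch-wise term over two augmented views and a SupCon-style class-wise term restricted to labeled samples; the temperature `tau`, the weight `lam`, and the handling of unlabeled samples (label -1) are our assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(z1, z2, labels, tau=0.1, lam=0.5):
    """z1, z2: (B, D) embeddings of two views of each sequence.
    labels: (B,) class ids, with -1 marking unlabeled samples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)

    # Batch-wise (bw) term: InfoNCE across views; matched pairs are positives.
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    loss_bw = F.cross_entropy(logits, targets)

    # Class-wise (cw) term: supervised contrastive over labeled samples only.
    labeled = labels >= 0
    loss_cw = z1.new_zeros(())
    if labeled.sum() > 1:
        zl, yl = z1[labeled], labels[labeled]
        sim = zl @ zl.t() / tau
        sim.fill_diagonal_(float("-inf"))            # drop self-similarity
        pos = (yl[:, None] == yl[None, :]).float()
        pos.fill_diagonal_(0.0)
        log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
        loss_cw = -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()
    return loss_bw + lam * loss_cw
```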

II. Formulation for element-wise prediction tasks

Composition of batch-wise (bw), intra-class-wise (cw-intra) and inter-class-wise (cw-inter) contrastive components.

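A corresponding hedged sketch for element-wise tasks: per-timestep embeddings are flattened, and same-class positives are split into intra-sequence (cw-intra) and cross-sequence (cw-inter) pairs; the bw term would follow the sequence-level sketch above. Again, the names and details are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def elementwise_contrastive_terms(z, y, tau=0.1):
    """z: (B, T, D) per-timestep embeddings; y: (B, T) labels, -1 = unlabeled.
    Returns (loss_cw_intra, loss_cw_inter)."""
    B, T, D = z.shape
    zf = F.normalize(z.reshape(B * T, D), dim=1)
    yf = y.reshape(B * T)
    sid = torch.arange(B, device=z.device).repeat_interleave(T)  # sequence id per timestep

    keep = yf >= 0                                   # use labeled timesteps only
    zf, yf, sid = zf[keep], yf[keep], sid[keep]

    sim = zf @ zf.t() / tau
    sim.fill_diagonal_(float("-inf"))                # drop self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    same_class = yf[:, None] == yf[None, :]
    same_seq = sid[:, None] == sid[None, :]

    def supcon(pos_bool):
        pos = pos_bool.float()
        pos.fill_diagonal_(0.0)
        return -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()

    # cw-intra: same class within one sequence; cw-inter: same class across sequences.
    return supcon(same_class & same_seq), supcon(same_class & ~same_seq)
```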

Results

Classification

Non-stationary and spatiotemporal time series

Table 1: Results for spatiotemporal, non-stationary time series.

             DKT                              GL
             Accuracy        F0.5-Score       Accuracy
RNN          0.754 ± 0.010   0.754 ± 0.010    0.643
TrajFormer   -               -                0.855
SVM          -               -                0.861
LSTM         0.844 ± 0.003   0.843 ± 0.002    0.884
GRU          0.840 ± 0.003   0.840 ± 0.003    0.898
ST-GRU       -               -                0.913
Transformer  0.849 ± 0.002   0.849 ± 0.002    0.881
TARNet       0.781 ± 0.011   0.782 ± 0.012    0.880
TimesURL     0.724 ± 0.003   -                0.751
STaRFormer   0.852 ± 0.003   0.852 ± 0.003    0.932

Irregularly sampled time series

Table 2: Results for irregularly sampled time series (in %).

             P19                        P12                        PAM
             AUROC       AUPRC          AUROC       AUPRC          Accuracy    Precision   Recall      F1-Score
Transformer  80.7 ± 3.8  42.7 ± 7.7     83.3 ± 0.7  47.9 ± 3.6     83.5 ± 1.5  84.8 ± 1.5  86.0 ± 1.2  85.0 ± 1.3
Trans-mean   83.7 ± 1.8  45.8 ± 3.2     82.6 ± 2.0  46.3 ± 4.0     83.7 ± 2.3  84.9 ± 2.6  86.4 ± 2.1  85.1 ± 2.4
GRU-D        83.9 ± 1.7  46.9 ± 2.1     81.9 ± 2.1  46.1 ± 4.7     83.3 ± 1.6  84.6 ± 1.2  85.2 ± 1.6  84.8 ± 1.2
SeFT         81.2 ± 2.3  41.9 ± 3.1     73.9 ± 2.5  31.1 ± 4.1     67.1 ± 2.2  70.0 ± 2.4  68.2 ± 1.5  68.5 ± 1.8
mTAND        84.4 ± 1.3  50.6 ± 2.0     84.2 ± 0.8  48.2 ± 3.4     74.6 ± 4.3  74.3 ± 4.0  79.5 ± 2.8  76.8 ± 3.4
IP-Net       84.6 ± 1.3  38.1 ± 3.7     82.6 ± 1.4  47.6 ± 3.1     74.3 ± 3.8  75.6 ± 2.1  77.9 ± 2.2  76.6 ± 2.8
DGM²-O       86.7 ± 3.4  44.7 ± 11.7    84.4 ± 1.6  47.3 ± 3.6     82.4 ± 2.3  85.2 ± 1.2  83.9 ± 2.3  84.3 ± 1.8
MTGNN        81.9 ± 6.2  39.9 ± 8.9     74.4 ± 6.7  35.5 ± 6.0     83.4 ± 1.9  85.2 ± 1.7  86.1 ± 1.9  85.9 ± 2.4
Raindrop     87.0 ± 2.3  51.8 ± 5.5     82.8 ± 1.7  44.0 ± 3.0     88.5 ± 1.5  89.9 ± 1.5  89.9 ± 0.6  89.8 ± 1.0
ViTST        89.2 ± 2.0  53.1 ± 3.4     85.1 ± 0.8  51.1 ± 4.1     95.8 ± 1.3  96.2 ± 1.3  96.5 ± 1.2  96.1 ± 1.1
STaRFormer   89.4 ± 1.3  61.3 ± 3.4     85.3 ± 1.2  52.0 ± 1.7     97.6 ± 0.9  97.3 ± 0.4  97.6 ± 0.3  97.4 ± 0.3

Regular time series

Table 3: Classification results on the multivariate time series UEA Benchmark (30 datasets). The first block reports results on all datasets each method covers (DS = dataset count); the second and third blocks restrict the comparison to subsets of 28 and 9 datasets, respectively. 1-v-1 counts pairwise comparisons against STaRFormer.

                 All reported datasets                          Subset of 28 datasets       Subset of 9 datasets
Method           Avg. Acc.  Rank  Avg. Rank  Top  1-v-1  DS    Avg. Acc.  Rank  Avg. Rank   Avg. Acc.  Rank  Avg. Rank
ViTST            0.790      -     -          1    8      10    -          -     -           0.776      2     6.4
DTWD             0.608      -     -          0    28     29    0.604      15    11.2        0.702      15    11.8
Weasel-Muse      0.691      -     -          5    20     28    0.691      10    7.8         0.737      7     9.0
TST (TimesURL)   0.617      13    10.6       1    29     30    0.631      14    11.7        0.674      16    12.3
T-Loss           0.658      12    8.6        1    27     30    0.675      13    9.1         0.717      12    11.1
TS-TCC           0.668      11    9.2        1    27     30    0.680      11    10.3        0.708      14    11.3
TNC              0.670      10    9.9        0    29     30    0.677      12    11.0        0.715      13    10.8
TS2Vec           0.704      9     7.4        1    25     30    0.713      9     8.1         0.734      9     9.0
InfoTS           0.714      8     6.8        1    27     30    0.722      8     7.5         0.727      10    10.0
Rocket           0.715      7     5.5        5    19     30    0.730      6     5.8         0.756      5     6.7
Mini-Rocket      0.719      6     5.7        4    22     30    0.733      5     6.0         0.751      6     7.4
TST (TARNet)     0.729      5     6.5        6    23     30    0.724      7     7.5         0.771      3     3.9
InfoTSs          0.730      4     5.3        3    23     30    0.738      4     5.8         0.736      8     8.4
TimesURL         0.752      3     3.9        4    19     30    0.760      3     4.1         0.770      4     5.3
TARNet           0.755      2     4.9        7    21     30    0.770      2     5.2         0.717      11    6.3
STaRFormer       0.795      1     2.8        9    -      30    0.793      1     3.1         0.793      1     3.3

Anomaly Detection

Table 4: Anomaly detection results (univariate).

             Yahoo                               KPI
             F1-Score   Precision   Recall       F1-Score   Precision   Recall
SPOT         0.338      0.269       0.454        0.217      0.786       0.126
DSPOT        0.316      0.241       0.458        0.521      0.623       0.447
DONUT        0.026      0.013       0.825        0.347      0.371       0.326
SR           0.563      0.451       0.747        0.622      0.647       0.598
TS2Vec       0.745      0.729       0.762        0.677      0.929       0.533
TimesURL     0.749      0.748       0.750        0.688      0.925       0.546
STaRFormer   0.789      0.772       0.807        0.830      0.852       0.811

Time Series Extrinsic Regression

Table 5: Regression results (RMSE) on the TSR Benchmark (19 datasets).

                 Avg. Rel. Mean Difference ↓   Avg. Rel. Mean Rank ↓   Top Scores ↑
FPCR              0.028                        9                       0
FPCR-Bspline      0.029                        10                      0
SVR               0.387                        16                      0
SVR Optimised     0.208                        14                      0
Random Forest    -0.121                        6                       0
XGBoost          -0.132                        5                       3
1-NN-ED           0.288                        15                      0
5-NN-ED           0.051                        11                      0
1-NN-DTWD         0.125                        12                      0
5-NN-DTWD        -0.034                        8                       0
Rocket           -0.245                        2                       6
FCN              -0.160                        4                       1
ResNet           -0.119                        7                       1
Inception        -0.220                        3                       3
TARNet            0.170                        13                      0
STaRFormer       -0.254                        1                       9

Ablation Studies

Figure: t-SNE visualizations of STaRFormer's latent space Z. For datasets where our contrastive learning approach is highly effective (e.g., PS), distinct class clusters are clearly visible.

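Visualizations of this kind can be reproduced with standard tooling. Below is a minimal sketch using scikit-learn's TSNE, where `Z` (N x D latent embeddings collected from the trained encoder) and `y` (class labels) are placeholders for your own arrays:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

Z = np.random.randn(500, 128)           # stand-in for real latent embeddings
y = np.random.randint(0, 4, size=500)   # stand-in for real class labels

# Project the latent space to 2-D and color points by class.
Z2 = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(Z)
plt.scatter(Z2[:, 0], Z2[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE of latent space Z")
plt.show()
```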

BibTeX

@misc{2504.10097,
  author = {Maximilian Forstenhäusler and Daniel Külzer and Christos Anagnostopoulos and Shameem Puthiya Parambath and Natascha Weber},
  title  = {STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data},
  year   = {2025},
  eprint = {arXiv:2504.10097},
}