Spatiotemporal Graph Neural Networks for Action Segmentation in Multimodal Data

Keystep recognition is a fine-grained video understanding task that aims to classify small, heterogeneous steps within long-form videos of human activities. Existing approaches perform poorly, reaching only 35-42% accuracy on the Ego-Exo4D benchmark dataset.

In collaboration with Intel Labs, we developed a flexible graph-learning framework for fine-grained keystep recognition that achieves state-of-the-art performance. Our approach, termed GLEVR, builds a graph in which each clip of the egocentric video corresponds to a node. During training we further leverage alignment between egocentric and exocentric videos to improve inference on egocentric videos, and we add automatic captioning as an additional modality. This simple graph-based approach effectively learns long-term dependencies in egocentric videos. Furthermore, the graphs are sparse and computationally efficient, allowing GLEVR to substantially outperform much larger models.
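To make the clip-as-node idea concrete, below is a minimal sketch assuming per-clip features from a frozen video encoder. The windowed temporal edges, the SAGEConv layers, and the build_clip_graph/KeystepGNN names are illustrative assumptions for this sketch, not the released GLEVR implementation.

```python
# Minimal sketch: treat each video clip as a graph node, connect clips
# within a small temporal window (keeping the graph sparse), and classify
# each node's keystep with a small GNN. Illustrative only; details such as
# the edge scheme and layer choice are assumptions, not the GLEVR code.
import torch
import torch.nn as nn
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv


def build_clip_graph(clip_feats: torch.Tensor, window: int = 2) -> Data:
    """One node per clip; edges link clips within +/- `window` positions."""
    n = clip_feats.size(0)
    src, dst = [], []
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if i != j:
                src.append(i)
                dst.append(j)
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return Data(x=clip_feats, edge_index=edge_index)


class KeystepGNN(nn.Module):
    """Two message-passing layers, then a per-node keystep classifier."""

    def __init__(self, in_dim: int, hidden_dim: int, num_keysteps: int):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_keysteps)

    def forward(self, data: Data) -> torch.Tensor:
        h = self.conv1(data.x, data.edge_index).relu()
        h = self.conv2(h, data.edge_index).relu()
        return self.head(h)  # one keystep logit vector per clip/node


# Example: 120 clips with 768-d features (dimensions are hypothetical).
feats = torch.randn(120, 768)
graph = build_clip_graph(feats)
logits = KeystepGNN(768, 256, num_keysteps=96)(graph)  # shape: (120, 96)
```

Note the role of sparsity here: each node exchanges messages with only a handful of temporal neighbors, which is what keeps inference cheap relative to attending over all clip pairs, while stacked message-passing layers still propagate information across long temporal ranges.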

We perform extensive experiments on the Ego-Exo4D dataset and show that GLEVR notably outperforms existing methods.

GLEVR won 1st place in the 2025 Ego-Exo4D keystep recognition challenge, out of more than 20 team submissions. Check out the extended abstract on arXiv.

Check out the code repository.

Figure: Example sample from the Ego-Exo4D benchmark, with multimodal ego- and exocentric videos, narrations, and commentaries.
Figure: keystep localization example.