Multi-human behavior prediction using vision language models

Date

2025

Abstract

Accurately predicting the motions and behaviors of multiple humans is crucial for mobile robots operating in human-populated environments. Because human behavior is inherently influenced by its surroundings, prediction must incorporate the context of the scene and the states of objects within the environment. Prior research focuses primarily on predicting actions in single-human scenarios from an egocentric view, whereas robotic applications require understanding the behavior of multiple humans from a third-person perspective. Moreover, few existing datasets capture multi-human behavior, especially from a third-person perspective. This gap in data availability further complicates the development of accurate and efficient prediction methods for real-world applications. This thesis addresses the problem of forecasting the actions of multiple humans within a scene from a third-person point of view. By leveraging Vision Language Models (VLMs) and scene graphs, it proposes a framework capable of predicting multi-human behavior in an indoor environment. Owing to the lack of a suitable dataset for multi-human behavior prediction, this thesis fine-tunes open-source VLMs on synthetic human behavior data and evaluates the resulting models on both synthetic sequences and real-world video recordings to assess their generalization capabilities. It also outlines the process of generating the synthetic data with a photorealistic simulator. This thesis presents VISTA (Vision And Scene Aware Temporal Action Anticipation), a fine-tuned VLM capable of predicting human behavior up to 5 seconds ahead in a single-shot manner. It also details a fine-tuning pipeline for VISTA that uses methods such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), resulting in a 13% improvement over existing methods.
Finally, this thesis presents several ablation studies that examine different components of the framework and the factors influencing behavior prediction.
