Multi-human behavior prediction using vision language models

dc.contributor.author: Panchal, Utsav
dc.date.accessioned: 2025-06-17T14:52:31Z
dc.date.issued: 2025
dc.description.abstract: The ability to accurately predict the motions and behaviors of multiple humans is crucial for mobile robots operating in human-populated environments. Because human behavior is inherently influenced by its surroundings, predictions must incorporate the context of the scene and the states of objects within the environment. Although prior research focuses primarily on predicting actions in single-human scenarios from an egocentric view, robotic applications require understanding the behaviors of multiple humans from a third-person perspective. However, few existing datasets capture multi-human behavior, especially from a third-person perspective. This gap in data availability further complicates the development of accurate and efficient prediction methods for real-world applications. This thesis addresses the problem of forecasting the actions of multiple humans within a scene from a third-person point of view. By leveraging Vision Language Models (VLMs) and scene graphs, this thesis proposes a framework capable of predicting multi-human behavior in an indoor environment. Owing to the lack of a suitable dataset for multi-human behavior prediction, this thesis fine-tunes open-source VLMs on synthetic human behavior data and evaluates the resulting models on both synthetic sequences and real-world video recordings to assess their generalization capabilities. This thesis also outlines the process of generating the synthetic data with a photorealistic simulator. This thesis presents VISTA (Vision And Scene Aware Temporal Action Anticipation), a fine-tuned VLM capable of predicting human behavior up to 5 seconds ahead in a single-shot manner. This work also details a fine-tuning pipeline for VISTA that uses Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), resulting in a 13% improvement over existing methods.
Finally, this thesis presents several ablation studies that examine different components of the framework and the factors influencing behavior prediction.
dc.identifier.other: 1929255721
dc.identifier.uri: http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-166150
dc.identifier.uri: https://elib.uni-stuttgart.de/handle/11682/16615
dc.identifier.uri: https://doi.org/10.18419/opus-16596
dc.language.iso: en
dc.rights: info:eu-repo/semantics/openAccess
dc.subject.ddc: 004
dc.title: Multi-human behavior prediction using vision language models
dc.type: masterThesis
ubs.fakultaet: Informatik, Elektrotechnik und Informationstechnik
ubs.institut: Institut für Architektur von Anwendungssystemen
ubs.publikation.seiten: 81
ubs.publikation.typ: Abschlussarbeit (Master)
ubs.unilizenz: OK

Files

Original bundle

Name: Utsav_Master_Thesis_IAAS.pdf
Size: 12.19 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 3.3 KB
Format: Item-specific license agreed upon to submission