LLAVIDAL

Abstract

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and the view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and from insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets and use it to create ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM that integrates videos, 3D skeletons, and HOIs to model the complex spatiotemporal relationships of ADL. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; we therefore propose a Multimodal Progressive (MMPro) training strategy that incorporates modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance on ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks.

Quantitative Results

Impact of ADL-X Training

ADLMCQ - Action Recognition

ADLMCQ - Temporal Completion

Effect of Introduction of Skeleton and Object Cues on ADLMCQ Action Recognition

Effect of Introduction of Skeleton and Object Cues on ADLMCQ Temporal Completion

Effect of Introduction of Skeleton and Object Cues on TSU Descriptions

Usage License

The dataset is released under the Creative Commons CC-BY license, which allows users to distribute, remix, adapt, and build upon the material in any medium or format, as long as the creator is attributed. The license permits commercial use of ADL-X. As the authors of this manuscript and collectors of this dataset, we reserve the right to distribute the data.

LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Comparison of LLVM vs LLAVIDAL: In real-world scenarios, models trained on web videos struggle to understand Activities of Daily Living because of the subtle nuances in the videos, whereas our LLAVIDAL model, trained on ADL-X, succeeds at understanding complex human-object interactions.

ADL-X Data Curation Process

Pipeline for extraction of Object and Skeleton modalities

LLAVIDAL leverages the MMPro (Multimodal Progressive) training strategy, which integrates modalities progressively across three stages.
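To make the idea of staged modality integration concrete, the following is a minimal sketch of a three-stage curriculum schedule. It is not the authors' implementation: the stage names, the order in which the object and skeleton cues are introduced, and the helper functions `newly_added` and `select_inputs` are all assumptions made for illustration. The page states only that modalities are incorporated progressively over three stages.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    modalities: Tuple[str, ...]  # modalities enabled during this stage


# ASSUMPTION: this stage order is illustrative only. The source says there
# are three stages but does not specify which modality enters when.
STAGES = (
    Stage("stage_1", ("video",)),
    Stage("stage_2", ("video", "object")),
    Stage("stage_3", ("video", "object", "skeleton")),
)


def newly_added(stage_index: int) -> Tuple[str, ...]:
    """Return the modalities introduced at this stage relative to the previous one."""
    previous = set(STAGES[stage_index - 1].modalities) if stage_index > 0 else set()
    return tuple(m for m in STAGES[stage_index].modalities if m not in previous)


def select_inputs(sample: Dict[str, object], stage_index: int) -> Dict[str, object]:
    """Keep only the inputs whose modality is enabled at this stage."""
    return {m: sample[m] for m in STAGES[stage_index].modalities}


if __name__ == "__main__":
    # Placeholder sample; real inputs would be feature tensors per modality.
    sample = {"video": "video_feats", "object": "object_feats", "skeleton": "skeleton_feats"}
    for i, stage in enumerate(STAGES):
        print(stage.name, "adds", newly_added(i), "and uses", sorted(select_inputs(sample, i)))
```

In a real training loop, each stage would run its own optimization pass over the selected inputs before the next modality is introduced, in contrast to the joint alignment of all modalities that the abstract reports as suboptimal.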

Qualitative Results

Video Description

Action Recognition

Temporal Action Completion
