⚠️ This website is for reviewers only ⚠️

LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Comparison of LLVM vs. LLAVIDAL: In real-world scenarios, models trained on web videos struggle to understand Activities of Daily Living because of the subtle nuances in the video, whereas our ADL-X-trained LLAVIDAL model succeeds in understanding complex human-object interactions.

Abstract

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks.

MMPro Training Strategy

LLAVIDAL leverages the MMPro (Multimodal Progressive) training strategy, which integrates the modalities progressively across three stages.
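To make the idea of a staged curriculum concrete, here is a minimal Python sketch of a three-stage schedule contrasted with a single joint-alignment stage. It is an illustration under stated assumptions, not the LLAVIDAL training code: the stage names, the order in which modalities are introduced, the cumulative inclusion of earlier modalities, the epoch counts, and the `train_one_epoch` callback are all hypothetical. Only the set of modalities (video, 3D skeleton, HOI) and the number of stages come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    """One curriculum stage: which modalities are active and for how long."""
    name: str
    modalities: Tuple[str, ...]
    epochs: int


# ASSUMPTION: the ordering and the cumulative inclusion are illustrative only;
# the page states that there are three stages but not what each one contains.
CURRICULUM = (
    Stage("stage_1", ("video",), 1),
    Stage("stage_2", ("video", "skeleton"), 1),
    Stage("stage_3", ("video", "skeleton", "hoi"), 1),
)

# Baseline for contrast: all modalities aligned jointly in a single stage,
# which the abstract reports as yielding suboptimal results.
JOINT_BASELINE = (Stage("joint", ("video", "skeleton", "hoi"), 3),)


def run_curriculum(
    stages: Sequence[Stage],
    train_one_epoch: Callable[[Tuple[str, ...]], None],
) -> List[Tuple[str, int, Tuple[str, ...]]]:
    """Run the stages in order and return a log of (stage, epoch, modalities).

    `train_one_epoch` is a hypothetical callback standing in for the real
    optimization step; it receives the modalities active in the current stage.
    """
    log = []
    for stage in stages:
        for epoch in range(stage.epochs):
            train_one_epoch(stage.modalities)
            log.append((stage.name, epoch, stage.modalities))
    return log


if __name__ == "__main__":
    # A no-op trainer is enough to inspect the schedule itself.
    for entry in run_curriculum(CURRICULUM, lambda modalities: None):
        print(entry)
```

Keeping the schedule as plain data separates the question of which modalities to introduce and when from the training loop, so the staged and joint variants can be swapped without touching the loop.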

Quantitative Results

Figure: Impact of ADL-X Training

Figure: ADLMCQ - Action Recognition

Figure: ADLMCQ - Temporal Completion

Figure: Effect of Introduction of Skeleton and Object Cues on ADLMCQ Action Recognition

Figure: Effect of Introduction of Skeleton and Object Cues on ADLMCQ Temporal Completion

Figure: Effect of Introduction of Skeleton and Object Cues on TSU Descriptions

Qualitative Results

Usage License

The dataset is released under the Creative Commons Attribution (CC BY) license, which allows users to distribute, remix, adapt, and build upon the material in any medium or format, as long as the creator is attributed. The license permits commercial use of ADL-X. As the authors of this manuscript and collectors of this dataset, we reserve the right to distribute the data.