Weakly Supervised Human and Object Detection via Spatiotemporal Interactions

Shuang Li1    Yilun Du1    Antonio Torralba1    Josef Sivic2    Bryan Russell3

1 MIT CSAIL    2 INRIA    3 Adobe

Paper | PyTorch code


Abstract

We introduce the task of weakly supervised learning for detecting a human and object in videos given a query action and object. This task poses unique challenges: a system knows neither whether the human-object interaction is present in the untrimmed video nor the actual spatiotemporal location of the human and object. To address these challenges, we introduce a weakly supervised spatiotemporal training loss that jointly associates spatiotemporal regions in a video with an action and object vocabulary and encourages temporal continuity of the visual appearance of moving objects. To train our model, we introduce a dataset comprising over 7k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.


Paper: arXiv, 2021

Dataset: V-HICO

Code: PyTorch

Citation

Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell.
Weakly Supervised Human and Object Detection via Spatiotemporal Interactions. BibTeX




Human and Object Detection in Videos

We seek to detect human-object interactions in videos. In this example, our system is able to find "human grooming horse" in a dataset of videos. Our approach learns to detect such interactions in a weakly supervised fashion.

Weakly supervised spatiotemporal loss

Our loss jointly aligns features of spatiotemporal regions in a video with (a) a language-embedding feature for the input query and (b) features of other spatiotemporal regions likely to contain the object.
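
As a rough illustration of these two objectives, below is a minimal PyTorch sketch. It assumes per-frame region features region_feats of shape (T, R, D), a language embedding query_emb for the paired action-object query, and embeddings neg_query_embs for unpaired queries; all names, shapes, and the specific contrastive form are assumptions for illustration and not the released implementation.

```python
import torch
import torch.nn.functional as F

def spatiotemporal_loss(region_feats, query_emb, neg_query_embs, tau=0.1):
    """Illustrative sketch of the two loss terms described above.

    region_feats:    (T, R, D) features of R region proposals in each of T frames
    query_emb:       (D,) language embedding of the paired action-object query
    neg_query_embs:  (N, D) language embeddings of unpaired (negative) queries
    """
    # Soft-attend over regions in each frame using similarity to the query,
    # then aggregate the regions into one feature per frame.
    sims = torch.einsum('trd,d->tr', F.normalize(region_feats, dim=-1),
                        F.normalize(query_emb, dim=-1))           # (T, R)
    attn = sims.softmax(dim=-1)                                   # (T, R)
    frame_feats = torch.einsum('tr,trd->td', attn, region_feats)  # (T, D)
    video_feat = F.normalize(frame_feats.mean(dim=0), dim=-1)     # (D,)

    # (a) Video-language alignment: score the paired query against negatives.
    pos = video_feat @ F.normalize(query_emb, dim=-1) / tau          # scalar
    neg = F.normalize(neg_query_embs, dim=-1) @ video_feat / tau     # (N,)
    logits = torch.cat([pos.view(1), neg])
    align_loss = F.cross_entropy(logits.unsqueeze(0),
                                 torch.zeros(1, dtype=torch.long))

    # (b) Temporal continuity: the attended object's appearance should change
    # smoothly between consecutive frames.
    frame_feats = F.normalize(frame_feats, dim=-1)
    continuity_loss = (1 - (frame_feats[:-1] * frame_feats[1:]).sum(-1)).mean()

    return align_loss + continuity_loss
```

The first term treats alignment as a contrastive classification over queries, while the second pulls together attended features of neighboring frames, one plausible reading of "other spatiotemporal regions likely to contain the object."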

Model of Spatiotemporal Interactions

Model architecture of the proposed method. (a) Overview of the training process. We take video frames as input and pass them through a temporal soft attention module and a feature aggregation module. The output features incorporate temporal and spatial information and are used to compute the spatiotemporal loss. (b) Illustration of the feature aggregation module. In each frame, we use Faster R-CNN/DensePose to extract object and human region proposals and an ROI pooling layer to extract their features. The object/human score functions take as input the ROI-pooled feature of a region proposal, the language phrase, and the frame feature $x'_t$ (not shown here), and output a score for the region. We aggregate the object/human region proposal features according to these scores to obtain the aggregated object/human features.
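
The following PyTorch sketch illustrates the score-and-aggregate step for a single frame. The class name RegionAggregator, the MLP score function, and all dimensions are hypothetical and chosen only to make the idea concrete; they are not taken from the released code.

```python
import torch
import torch.nn as nn

class RegionAggregator(nn.Module):
    """Illustrative sketch: score each region proposal against the phrase and
    frame context, then aggregate proposal features by their scores."""

    def __init__(self, region_dim, phrase_dim, frame_dim, hidden=512):
        super().__init__()
        self.score_fn = nn.Sequential(
            nn.Linear(region_dim + phrase_dim + frame_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, region_feats, phrase_emb, frame_feat):
        """
        region_feats: (R, D_r) ROI-pooled features of R proposals in one frame
        phrase_emb:   (D_p,)   embedding of the action-object phrase
        frame_feat:   (D_f,)   global frame feature x'_t
        """
        R = region_feats.size(0)
        context = torch.cat([phrase_emb, frame_feat]).expand(R, -1)
        scores = self.score_fn(torch.cat([region_feats, context], dim=-1))  # (R, 1)
        weights = scores.softmax(dim=0)                                     # (R, 1)
        aggregated = (weights * region_feats).sum(dim=0)                    # (D_r,)
        return aggregated, weights.squeeze(-1)

# Example usage with made-up sizes (10 proposals, 2048-d ROI features):
# agg = RegionAggregator(region_dim=2048, phrase_dim=300, frame_dim=512)
# feat, w = agg(torch.randn(10, 2048), torch.randn(300), torch.randn(512))
```

The same structure would be instantiated twice, once for human proposals and once for object proposals, so that each branch produces its own aggregated feature.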

Experiments

Ablation Studies on V-HICO

Evaluation of each component of the proposed model. Phrase (Phr) detection refers to correct localization (0.5 IoU) of the union of the human and object bounding boxes, while relationship (Rel) detection refers to correct localization of both the human and object bounding boxes.
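
To make the two metrics concrete, here is a small Python sketch that checks one human-object prediction against one ground-truth pair. The helpers iou, union_box, and is_correct are illustrative only and not the official evaluation script.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def union_box(a, b):
    """Smallest box enclosing both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def is_correct(pred_h, pred_o, gt_h, gt_o, thresh=0.5):
    """Return (phrase_correct, relationship_correct) for one prediction."""
    phrase = iou(union_box(pred_h, pred_o), union_box(gt_h, gt_o)) >= thresh
    rel = iou(pred_h, gt_h) >= thresh and iou(pred_o, gt_o) >= thresh
    return phrase, rel
```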


Comparison with Baselines

Evaluation of performance on HICO-DET compared to the methods of [1], [2], and [3] and to several random baselines. Phrase detection refers to correct localization (0.5 IoU) of the union of the human and object bounding boxes, while relationship detection refers to correct localization of both the human and object bounding boxes.


Comparison with Baselines on Unseen Classes

Evaluation of various models on the unseen test set of V-HICO, which consists of 53 object classes not seen during training. Evaluation is at IoU threshold 0.5. (ko) and (def) denote the known-object setting and the default setting, respectively.


Qualitative Results

Qualitative predictions of our model on the V-HICO test set, showing the predicted human bounding box (red), the predicted object bounding box (green), and the input action-object label (bottom).
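
For reference, a small OpenCV sketch that reproduces this visualization convention; the function name and argument layout are assumptions, not code from the project.

```python
import cv2

def draw_prediction(frame, human_box, object_box, label):
    """Overlay the predicted human box (red), object box (green), and the
    action-object label on a BGR frame. Boxes are (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = map(int, human_box)
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)   # red in BGR
    x1, y1, x2, y2 = map(int, object_box)
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)   # green in BGR
    cv2.putText(frame, label, (10, frame.shape[0] - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
    return frame
```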


Related Work

  1. "Weakly-supervised learning of visual relations", in ICCV 2017 (Oral).
  2. Luowei Zhou, Nathan Louis, Jason J. Corso "Weakly-supervised video object grounding from text by loss weighting and object interaction", in BMVC 2018.
  3. "Visual Relation Grounding in Videos", in ECCV 2020.



