Weakly Supervised Human and Object Detection via Spatiotemporal Interactions
Shuang Li1 Yilun Du1 Antonio Torralba1 Josef Sivic2 Bryan Russell3
1 MIT CSAIL 2 INRIA 3 Adobe
Paper | PyTorch code
Abstract
We introduce the task of weakly supervised learning for detecting a human and object in videos given a query action and object. Our task poses unique challenges as a system does not know whether the human-object interaction is present in the untrimmed video or the actual spatiotemporal location of the human and object. To address these challenges, we introduce a weakly supervised spatiotemporal training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects. To train our model, we introduce a dataset comprising over 7k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.

Paper: arxiv, 2021
Dataset: V-HICO
Code: PyTorch
Citation
Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell.
Weakly Supervised Human and Object Detection via Spatiotemporal Interactions. Bibtex
Human and Object Detection in Videos
We seek to detect human-object interactions in videos. In this example, our system finds "human grooming horse" in a dataset of videos. Our approach learns to detect such interactions in a weakly supervised fashion.
Weakly supervised spatiotemporal loss
Our loss jointly aligns features of spatiotemporal regions in a video with (a) a language-embedding feature for the input query and (b) features of other spatiotemporal regions likely to contain the object.
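As a rough illustration of these two terms, the sketch below combines (a) a softmax alignment between region features and the query embedding with (b) a cosine continuity term between consecutive frames. The tensor shapes, the soft positive mask pos_mask, and the temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def spatiotemporal_loss(region_feats, language_feat, pos_mask, temperature=0.1):
    """Illustrative sketch of a weakly supervised spatiotemporal loss.

    region_feats : (T, R, D) features of R region proposals in each of T frames
    pos_mask     : (T, R)    soft weights for regions likely to contain the object
    language_feat: (D,)      embedding of the query action-object phrase
    """
    T, R, D = region_feats.shape
    regions = F.normalize(region_feats.reshape(T * R, D), dim=-1)
    lang = F.normalize(language_feat, dim=-1)

    # (a) Region-language alignment: regions weighted as positives should score
    # higher against the query embedding than the remaining regions.
    sims = regions @ lang / temperature                       # (T*R,)
    weights = pos_mask.reshape(T * R)
    align_loss = -(weights * F.log_softmax(sims, dim=0)).sum() / weights.sum().clamp(min=1e-6)

    # (b) Temporal continuity: likely object regions in consecutive frames should
    # have similar visual appearance.
    frame_obj = F.normalize((pos_mask.unsqueeze(-1) * region_feats).sum(dim=1), dim=-1)  # (T, D)
    temporal_loss = (1 - (frame_obj[:-1] * frame_obj[1:]).sum(dim=-1)).mean()

    return align_loss + temporal_loss
```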
Model of Spatiotemporal Interactions
Model architecture of the proposed method. (a) Overview of the training process: video frames are passed to a temporal soft-attention module and a feature aggregation module, and the output features, which incorporate temporal and spatial information, are used to compute the spatiotemporal loss. (b) Illustration of the feature aggregation module. In each frame, we use Faster R-CNN/DensePose to obtain object and human region proposals and an ROI pooling layer to extract their features. The object/human score function takes as input the ROI-pooled feature of a region proposal, the language phrase, and the frame feature $x'_t$ (not shown here), and outputs a score for that region. We aggregate the object/human region proposal features weighted by these scores to obtain the aggregated object/human features.
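The score-and-aggregate step in (b) can be sketched roughly as follows; the feature dimensions, the MLP scorer, and the class name FeatureAggregation are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    """Sketch of scoring region proposals and aggregating their features."""

    def __init__(self, roi_dim=1024, lang_dim=300, frame_dim=2048, hidden=512):
        super().__init__()
        # Simple MLP scorer over [ROI feature, phrase embedding, frame feature].
        self.scorer = nn.Sequential(
            nn.Linear(roi_dim + lang_dim + frame_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, roi_feats, lang_feat, frame_feat):
        """
        roi_feats : (N, roi_dim)  ROI-pooled features of N human or object proposals
        lang_feat : (lang_dim,)   embedding of the action-object phrase
        frame_feat: (frame_dim,)  global feature x'_t of the current frame
        """
        n = roi_feats.size(0)
        context = torch.cat([lang_feat, frame_feat]).expand(n, -1)                 # shared per-frame context
        scores = self.scorer(torch.cat([roi_feats, context], dim=-1)).squeeze(-1)  # (N,)
        weights = scores.softmax(dim=0)                                             # soft attention over proposals
        aggregated = (weights.unsqueeze(-1) * roi_feats).sum(dim=0)                 # (roi_dim,)
        return aggregated, weights
```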
Experiments
Ablation Studies on V-HICO
Evaluation of each component of the proposed model. Phrase (Phr) detection refers to correct localization (0.5 IoU) of the union of the human and object bounding boxes, while relationship (Rel) detection refers to correct localization of both the human and object bounding boxes.
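For reference, the two criteria can be checked as in the sketch below (boxes in (x1, y1, x2, y2) format; the helper functions are our own and not the paper's evaluation script).

```python
import torch

def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2) tensors."""
    x1, y1 = torch.max(a[0], b[0]), torch.max(a[1], b[1])
    x2, y2 = torch.min(a[2], b[2]), torch.min(a[3], b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter).clamp(min=1e-6)

def union_box(h, o):
    """Smallest box enclosing the human box h and the object box o."""
    return torch.stack([torch.min(h[0], o[0]), torch.min(h[1], o[1]),
                        torch.max(h[2], o[2]), torch.max(h[3], o[3])])

def is_correct(pred_h, pred_o, gt_h, gt_o, thresh=0.5):
    # Phrase detection: IoU of the predicted and ground-truth union boxes.
    phrase = bool(box_iou(union_box(pred_h, pred_o), union_box(gt_h, gt_o)) >= thresh)
    # Relationship detection: both human and object boxes must be correctly localized.
    relationship = bool(box_iou(pred_h, gt_h) >= thresh) and bool(box_iou(pred_o, gt_o) >= thresh)
    return phrase, relationship
```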
Comparison with Baselines
Evaluation of performance on HICO-DET compared to the methods in [1], [2], and [3] and to different random baselines. Phrase detection refers to correct localization (0.5 IoU) of the union of the human and object bounding boxes, while relationship detection refers to correct localization of both the human and object bounding boxes.
Comparison with Baselines on Unseen Classes
Evaluation of various models on the unseen test set of V-HICO, which consists of 53 object classes unseen during training. Evaluation is at an IoU threshold of 0.5; (ko) and (def) denote the known-object and default settings, respectively.
Qualitative Results
Qualitative predictions of our model on the V-HICO test set, showing the predicted human bounding box (red), predicted object bounding box (green), and input action-object label (bottom).