V-HICO Dataset: Videos of Humans Interacting with Common Objects

V-HICO is a dataset of human-object interactions in videos. It contains 6,594 videos: 5,297 training videos, 635 validation videos, 608 test videos, and 54 unseen test videos. To measure both performance on common human-object interaction classes and generalization to new ones, we provide two test splits. The first contains the same human-object interaction classes as the training split; the second consists of novel classes that are unseen during training.
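
As a quick sanity check on the split sizes quoted above, the following minimal Python sketch records the four counts and confirms that they add up to the stated total. The split names used as dictionary keys ("train", "val", "test", "unseen_test") and the helper split_fractions are illustrative assumptions, not identifiers from the released dataset or code.

```python
# Split sizes as quoted in the dataset description.
# The key names are hypothetical labels chosen for this sketch.
SPLIT_SIZES = {
    "train": 5297,
    "val": 635,
    "test": 608,
    "unseen_test": 54,
}


def split_fractions(sizes):
    """Return each split's share of the total number of videos."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_videos = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

for name, count in SPLIT_SIZES.items():
    print(f"{name:12s} {count:5d} videos ({fractions[name]:.1%})")
print(f"{'total':12s} {total_videos:5d} videos")
```

The unseen test split is small relative to the others, so metrics reported on it rest on far fewer videos than those on the regular test split.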

V-HICO covers 244 object classes and 99 action classes, for a total of 756 action-object pair classes. The unseen test split contains 51 object classes and 32 action classes, forming 52 action-object pair classes. All videos are labeled with text annotations of the human action and the associated object. The test and unseen test splits also contain bounding-box annotations for both humans and objects.


Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell
International Conference on Computer Vision (ICCV), 2021.



Reach out to lishuang@mit.edu for questions, suggestions, and feedback.