Photo-realistic affordance-aware human insertion into scenes. We train a 1B-parameter diffusion model that achieves photo-realistic human insertion into scenes in an affordance-aware manner. We propose a novel self-supervised training scheme that learns by inpainting humans from two different frames of a video. When trained on a dataset of 2.4 million video clips, our model is capable of inserting diverse humans into diverse scenes. Additionally, at inference time, the model can be prompted to perform person hallucination, scene hallucination, partial body completion, and cloth swapping.
We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. We set up the task in a self-supervised fashion by learning to re-pose humans in video clips. We train a large-scale diffusion model on a dataset of 2.4M video clips that produces diverse, plausible poses while respecting the scene context. Given the learned human-scene composition, our model can also hallucinate realistic people and scenes when prompted without conditioning, and it enables interactive editing. A quantitative evaluation shows that our method synthesizes more realistic human appearance and more natural human-scene interactions than prior work.
We source two random frames from a video clip. We mask out the person in the first frame and use the person from the second frame as conditioning to inpaint the masked region. We concatenate the latent features of the masked background image and the rescaled mask with the noisy image and feed them to the denoising UNet. Reference person embeddings (CLIP ViT-L/14) are passed in via cross-attention.
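A minimal PyTorch-style sketch of this conditioning setup is shown below. The module names (vae_encode, clip_image_encoder, noise_scheduler, denoising_unet) and tensor shapes are hypothetical stand-ins for illustration, not the released implementation.

import torch
import torch.nn.functional as F

def build_unet_input(frame_a, person_mask_a, person_crop_b,
                     vae_encode, clip_image_encoder, noise_scheduler, timestep):
    """frame_a:       (B, 3, H, W)     scene frame to be inpainted
       person_mask_a: (B, 1, H, W)     binary mask of the person in frame_a
       person_crop_b: (B, 3, 224, 224) reference person from a second frame
    """
    # 1. Mask out the person in the first frame to get the background image.
    background = frame_a * (1.0 - person_mask_a)

    # 2. Encode the target frame and the masked background into latent space.
    target_latents = vae_encode(frame_a)      # (B, 4, h, w)
    bg_latents = vae_encode(background)       # (B, 4, h, w)

    # 3. Rescale the mask to the latent resolution.
    mask_latent = F.interpolate(person_mask_a, size=bg_latents.shape[-2:],
                                mode="nearest")  # (B, 1, h, w)

    # 4. Noise the target latents for the sampled diffusion timestep.
    noise = torch.randn_like(target_latents)
    noisy_latents = noise_scheduler.add_noise(target_latents, noise, timestep)

    # 5. Channel-wise concatenation of noisy latents, background latents and
    #    mask gives the UNet's spatial input.
    unet_input = torch.cat([noisy_latents, bg_latents, mask_latent], dim=1)

    # 6. Reference person embedding (CLIP ViT-L/14) enters via cross-attention.
    person_tokens = clip_image_encoder(person_crop_b)  # (B, N, D)

    return unet_input, person_tokens, noise

As in standard latent-diffusion training, the UNet would then predict the added noise from unet_input, the timestep, and person_tokens, with an MSE loss against noise.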
In addition to bounding boxes, we support generic masks like scribbles and segmentation masks.
At inference time, the model can be prompted to perform person hallucination by passing an empty person conditioning.
Shown here are comparisons with the DALL-E 2 and Stable Diffusion v1.5 baselines and our model (from left to right).
At inference time, the model can be prompted to perform scene hallucination by passing an empty scene conditioning.
Shown here are comparisons with the DALL-E 2 and Stable Diffusion v1.5 baselines and our model (from left to right).
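The two hallucination modes amount to dropping one of the two conditioning signals. The sketch below reuses the hypothetical helpers from the training sketch; the exact "empty" encodings (a blank reference image, zeroed background latents with an all-ones mask) are assumptions for illustration.

import torch
import torch.nn.functional as F

def make_conditioning(scene, scene_mask, person,
                      vae_encode, clip_image_encoder,
                      latent_hw=(64, 64), batch=1):
    # Person hallucination: no reference person, so condition on a blank image.
    if person is None:
        person = torch.zeros(batch, 3, 224, 224)
    person_tokens = clip_image_encoder(person)

    # Scene hallucination: no scene, so the background latents are empty and
    # the mask covers the whole frame.
    if scene is None:
        bg_latents = torch.zeros(batch, 4, *latent_hw)
        mask_latent = torch.ones(batch, 1, *latent_hw)
    else:
        bg_latents = vae_encode(scene * (1.0 - scene_mask))
        mask_latent = F.interpolate(scene_mask, size=latent_hw, mode="nearest")

    return bg_latents, mask_latent, person_tokens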
We are grateful to Fabian Caba Heilbron, Tobias Hinz, Kushal Kafle, Nupur Kumari, Yijun Li and Markus Woodson for insightful discussions regarding the data and training pipeline. This work was partly done by Sumith Kulal during an internship at Adobe Research. Additional funding was provided by ONR MURI and NSF GRFP.
BibTeX
@inproceedings{kulal2023affordance,
author = {Kulal, Sumith and Brooks, Tim and Aiken, Alex and Wu, Jiajun and Yang, Jimei and Lu, Jingwan and Efros, Alexei A. and Singh, Krishna Kumar},
title = {Putting People in Their Place: Affordance-Aware Human Insertion into Scenes},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023},
}
Slides
Feel free to use these slides to help explain our research: