Photo-realistic affordance-aware human insertion into scenes. We train a 1B-parameter diffusion model that achieves photo-realistic human insertion into scenes in an affordance-aware manner. We propose a novel self-supervised training scheme that learns by inpainting humans from two different frames of a video. When trained on a dataset of 2.4 million video clips, our model is capable of inserting diverse humans into diverse scenes. Additionally, at inference time, the model can be prompted to perform person hallucination, scene hallucination, partial body completion, and cloth swapping.
We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. We set up the task in a self-supervised fashion by learning to re-pose humans in video clips. We train a large-scale diffusion model on a dataset of 2.4M video clips that produces diverse, plausible poses while respecting the scene context. Given the learned human-scene composition, our model can also hallucinate realistic people and scenes when prompted without conditioning, and it enables interactive editing. A quantitative evaluation shows that our method synthesizes more realistic human appearance and more natural human-scene interactions than prior work.
We source two random frames from a video clip. We mask out the person in the first frame and use the person from the second frame as conditioning to inpaint the masked region. We concatenate the latent features of the masked background image and the rescaled mask with the noisy image and feed them to the denoising UNet. Reference person embeddings (CLIP ViT-L/14) are passed in via cross-attention.
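A minimal PyTorch-style sketch of this conditioning setup is shown below. The module names (vae_encode, clip_image_encoder, noise_scheduler, denoising_unet) and tensor shapes are hypothetical stand-ins for illustration, not the released implementation.

import torch
import torch.nn.functional as F

def build_unet_input(frame_a, person_mask_a, person_crop_b,
                     vae_encode, clip_image_encoder, noise_scheduler, timestep):
    """frame_a:       (B, 3, H, W)     scene frame to be inpainted
       person_mask_a: (B, 1, H, W)     binary mask of the person in frame_a
       person_crop_b: (B, 3, 224, 224) reference person from a second frame
    """
    # 1. Mask out the person in the first frame to get the background image.
    background = frame_a * (1.0 - person_mask_a)

    # 2. Encode the target frame and the masked background into latent space.
    target_latents = vae_encode(frame_a)      # (B, 4, h, w)
    bg_latents = vae_encode(background)       # (B, 4, h, w)

    # 3. Rescale the mask to the latent resolution.
    mask_latent = F.interpolate(person_mask_a, size=bg_latents.shape[-2:],
                                mode="nearest")  # (B, 1, h, w)

    # 4. Noise the target latents for the sampled diffusion timestep.
    noise = torch.randn_like(target_latents)
    noisy_latents = noise_scheduler.add_noise(target_latents, noise, timestep)

    # 5. Channel-wise concatenation of noisy latents, background latents and
    #    mask gives the UNet's spatial input.
    unet_input = torch.cat([noisy_latents, bg_latents, mask_latent], dim=1)

    # 6. Reference person embedding (CLIP ViT-L/14) enters via cross-attention.
    person_tokens = clip_image_encoder(person_crop_b)  # (B, N, D)

    return unet_input, person_tokens, noise

As in standard latent-diffusion training, the UNet would then predict the added noise from unet_input, the timestep, and person_tokens, with an MSE loss against noise.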
In addition to bounding boxes, we support generic masks like scribbles and segmentation masks.
At inference time, the model can be prompted to perform person hallucination by passing an empty person conditioning.
Shown here are comparisons with the DALL-E 2 and Stable Diffusion v1.5 baselines and our model (from left to right).
At inference time, the model can be prompted to perform scene hallucination by passing an empty scene conditioning.
Shown here are comparisons with the DALL-E 2 and Stable Diffusion v1.5 baselines and our model (from left to right).
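The two hallucination modes amount to dropping one of the two conditioning signals. The sketch below reuses the hypothetical helpers from the training sketch; the exact "empty" encodings (a blank reference image, zeroed background latents with an all-ones mask) are assumptions for illustration.

import torch
import torch.nn.functional as F

def make_conditioning(scene, scene_mask, person,
                      vae_encode, clip_image_encoder,
                      latent_hw=(64, 64), batch=1):
    # Person hallucination: no reference person, so condition on a blank image.
    if person is None:
        person = torch.zeros(batch, 3, 224, 224)
    person_tokens = clip_image_encoder(person)

    # Scene hallucination: no scene, so the background latents are empty and
    # the mask covers the whole frame.
    if scene is None:
        bg_latents = torch.zeros(batch, 4, *latent_hw)
        mask_latent = torch.ones(batch, 1, *latent_hw)
    else:
        bg_latents = vae_encode(scene * (1.0 - scene_mask))
        mask_latent = F.interpolate(scene_mask, size=latent_hw, mode="nearest")

    return bg_latents, mask_latent, person_tokens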
We are grateful to Fabian Caba Heilbron, Tobias Hinz, Kushal Kafle, Nupur Kumari, Yijun Li and Markus Woodson for insightful discussions regarding the data and training pipeline. This work was partly done by Sumith Kulal during an internship at Adobe Research. Additional funding was provided by ONR MURI and NSF GRFP.
BibTeX
@inproceedings{kulal2023affordance,
author = {Kulal, Sumith and Brooks, Tim and Aiken, Alex and Wu, Jiajun and Yang, Jimei and Lu, Jingwan and Efros, Alexei A. and Singh, Krishna Kumar},
title = {Putting People in Their Place: Affordance-Aware Human Insertion into Scenes},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023},
}
Slides
Feel free to use these slides to help explain our research: