We introduce a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step.
The dataset generation process consists of the following stages (a minimal code sketch follows the list):
1. We generate images with a latent diffusion model conditioned on a large real-world dataset.
2. A small subset of the data is manually labeled with head bounding boxes to train a binary head detector on the synthetic data.
3. For each detected head in the generated images, we predict the 3D head model parameters.
4. The final dataset is automatically filtered to remove noisy and privacy-sensitive samples.
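As a rough illustration, here is a minimal, self-contained Python sketch of these four stages. All components (generate_image, detect_heads, fit_3dmm, passes_filters) are hypothetical stand-ins returning dummy values, not the released pipeline:

import numpy as np

# Hypothetical stand-ins for the pipeline components; the actual diffusion
# model, head detector, 3DMM fitter, and filters are not reproduced here.
def generate_image(seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)
    return rng.integers(0, 255, (512, 512, 3), dtype=np.uint8)

def detect_heads(image: np.ndarray) -> list:
    return [(100, 80, 220, 210)]  # dummy (x1, y1, x2, y2) head box

def fit_3dmm(image: np.ndarray, box) -> np.ndarray:
    return np.zeros(150)  # dummy parameter vector (shape + expression + pose)

def passes_filters(image: np.ndarray, heads) -> bool:
    return len(heads) > 0  # stand-in for noise / privacy filtering

def build_dataset(num_samples: int) -> list:
    dataset = []
    for seed in range(num_samples):
        image = generate_image(seed)                      # stage 1
        boxes = detect_heads(image)                       # stage 2
        heads = [(b, fit_3dmm(image, b)) for b in boxes]  # stage 3
        if passes_filters(image, heads):                  # stage 4
            dataset.append({"image": image, "heads": heads})
    return dataset

print(len(build_dataset(10)))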
The data covers a wide variety of scenes and numbers of people, provides rich annotations for every human head, and is scalable to an arbitrary number of samples. The final dataset consists of 1,022,944 images with 2,219,146 heads annotated with bounding boxes and 3D model parameters.
VGGHeads extends the YOLO-NAS architecture to predict 3D Morphable Model (3DMM) parameters along with head bounding boxes from multi-scale feature maps. Rich ground-truth annotations allow us to train a model that estimates a compact 3D head representation of multiple people from an RGB image in a single forward pass. For each head, we predict a vector of 3DMM parameters disentangled into shape, expression, and pose (see the decoding sketch after the list below). Since every vertex can be mapped to a face part, bounding boxes for the head, the face, and other head parts can all be recovered from the reprojected head mesh. Compared to other methods, this setup encodes a more general representation that serves as a base for downstream head modeling tasks. The final predictions include:
Head and Face Bounding Box
3DMM Parameters
Head and Face 3D Vertices and 2D Landmarks
3D Head Pose
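To make the parameter disentanglement concrete, below is a minimal NumPy sketch of how a flat 3DMM vector could decode into posed 3D vertices and reprojected 2D landmarks. The basis matrices, dimensions, rotation convention, and orthographic camera are illustrative assumptions, not the model's actual components:

import numpy as np

# Illustrative sizes; the actual basis dimensions of the model differ.
N_VERTS, N_SHAPE, N_EXPR = 5023, 100, 50
rng = np.random.default_rng(0)
MEAN = rng.normal(size=(N_VERTS, 3))                  # template head mesh
SHAPE_BASIS = rng.normal(size=(N_SHAPE, N_VERTS, 3))  # shape blendshapes
EXPR_BASIS = rng.normal(size=(N_EXPR, N_VERTS, 3))    # expression blendshapes

def euler_to_matrix(yaw, pitch, roll):
    """Rotation matrix from Euler angles (one common convention)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    return Rz @ Ry @ Rx

def decode_head(params):
    """Split a flat 3DMM vector into shape / expression / pose and return
    posed 3D vertices plus a 2D orthographic reprojection."""
    shape = params[:N_SHAPE]
    expr = params[N_SHAPE:N_SHAPE + N_EXPR]
    yaw, pitch, roll, scale, tx, ty = params[N_SHAPE + N_EXPR:]

    # Linear 3DMM: mean mesh plus shape and expression offsets
    verts = MEAN + np.tensordot(shape, SHAPE_BASIS, axes=1) \
                 + np.tensordot(expr, EXPR_BASIS, axes=1)

    verts = verts @ euler_to_matrix(yaw, pitch, roll).T       # apply head pose
    landmarks_2d = scale * verts[:, :2] + np.array([tx, ty])  # reproject
    return verts, landmarks_2d

params = rng.normal(size=N_SHAPE + N_EXPR + 6)
vertices_3d, landmarks_2d = decode_head(params)
print(vertices_3d.shape, landmarks_2d.shape)  # (5023, 3) (5023, 2)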
The full 3D head mesh provides a strong conditioning signal for the image generation process. We demonstrate the ability to generate images with a controlled 3D head shape and pose by training a ControlNet and a T2I-Adapter conditioned on the meshes.
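As an illustration of this kind of conditioning, the sketch below uses the diffusers ControlNet pipeline; the checkpoint path path/to/mesh-controlnet and the rendered-mesh image head_mesh_render.png are hypothetical placeholders, not released artifacts:

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Hypothetical ControlNet checkpoint trained on rendered head meshes,
# as described above; not a published model id.
controlnet = ControlNetModel.from_pretrained(
    "path/to/mesh-controlnet", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The condition image would be a rendering of a predicted head mesh,
# fixing 3D head shape and pose while the prompt controls appearance.
mesh_render = load_image("head_mesh_render.png")
image = pipe(
    "a portrait photo of a person in a park",
    image=mesh_render,
    num_inference_steps=30,
).images[0]
image.save("controlled_generation.png")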
from head_detector import HeadDetector
detector = HeadDetector()
image_path = "your_image.jpg"
predictions = detector(image_path)
# predictions.heads contains a list of heads with .bbox, .vertices_3d, .head_pose params
predictions.draw()  # draw the detected heads on the image
@article{vggheads,
title={VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads},
author={Orest Kupyn and Eugene Khvedchenia and Christian Rupprecht},
year={2024},
eprint={2407.18245},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.18245},
}
We would like to thank Tetiana Martyniuk and Iro Laina for paper proofreading and valuable feedback. We also thank the Armed Forces of Ukraine for providing security to complete this work.
© Piñata Farms AI 2024