Research in document image understanding is impeded by the scarcity of high-quality data. As a step toward removing this blocker, we introduce ADoPD, a large-scale dataset for document page decomposition. ADoPD is distinguished from existing datasets by a novel data-driven document taxonomy discovery method used during data collection. Our approach combines large-scale pretrained models with a human-in-the-loop process to ensure diversity and balance in the resulting collection. Guided by this taxonomy, we collected and densely annotated document images for four document image understanding tasks: Doc2Mask, Doc2Box, Doc2Tag, and Doc2Seq. Specifically, each image is annotated with human-labeled entity masks and text bounding boxes, as well as automatically generated tags and captions. We provide detailed experimental analyses that validate our data-driven taxonomy discovery method and benchmark the four tasks with different models. We believe ADoPD has the potential to become a cornerstone dataset for future research on document image understanding.
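To make the annotation schema concrete, the following minimal Python sketch shows one way a single ADoPD example and its four task annotations might be represented in memory. All class and field names here are our own illustrative assumptions, not the dataset's actual release format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EntityMask:
    """Human-labeled entity mask (Doc2Mask). Fields are hypothetical."""
    label: str                  # entity category name
    polygon: List[List[float]]  # mask boundary as a list of (x, y) points

@dataclass
class TextBox:
    """Human-labeled text bounding box (Doc2Box). Fields are hypothetical."""
    bbox: List[float]           # [x_min, y_min, x_max, y_max] in pixels

@dataclass
class AdopdSample:
    """One document image with annotations for all four tasks (hypothetical schema)."""
    image_path: str
    masks: List[EntityMask] = field(default_factory=list)  # Doc2Mask
    boxes: List[TextBox] = field(default_factory=list)      # Doc2Box
    tags: List[str] = field(default_factory=list)           # Doc2Tag (auto-generated)
    caption: str = ""                                        # Doc2Seq (auto-generated)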
@inproceedings{gu2024adopd,
  title={{AD}o{PD}: A Large-Scale Document Page Decomposition Dataset},
  author={Jiuxiang Gu and Xiangxi Shi and Jason Kuen and Lu Qi and Ruiyi Zhang and Anqi Liu and Ani Nenkova and Tong Sun},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=x1ptaXpOYa}
}