ADOPD: A Large-Scale Document Page Decomposition Dataset

Research in document image understanding is impeded by the scarcity of high-quality data. As a step towards removing this blocker, we introduce ADOPD, a large-scale dataset for document page decomposition. ADOPD is distinguished from other datasets by a novel data-driven document taxonomy discovery method used for data collection. Our approach combines large-scale pretrained models with a human-in-the-loop process to ensure diversity and balance in the resulting collection. Guided by this taxonomy, we collected and densely annotated document images for four document image understanding tasks: Doc2Mask, Doc2Box, Doc2Tag, and Doc2Seq. Specifically, the annotations for each image include human-labeled entity masks, text bounding boxes, and automatically generated tags and captions. We provide detailed experimental analyses to validate our taxonomy discovery method and evaluate the four tasks with different models. We believe that ADOPD has the potential to become a cornerstone dataset for future research on document image understanding.
