LF AI & Data Foundation Launches DocLang Specification Working Group to Advance an Open Standard for AI-Native Documents

PR Newswire

New specification, supported by leading LF AI & Data member organizations IBM and Red Hat, as well as other organizations including ABBYY, complements the Docling open source project

SAN FRANCISCO, June 9, 2026 /PRNewswire/ -- LF AI & Data Foundation, the premier organization supporting open source innovation in artificial intelligence and data under the Linux Foundation, today announced the formation of the DocLang Specification Working Group. The working group leads a collaborative standards initiative to develop DocLang, an open, universal, AI-native document format designed to improve how enterprises prepare, exchange, and govern document data for AI systems.

Founded by LF AI & Data premier members IBM, NVIDIA, and Red Hat, as well as contributors ABBYY and HumanSignal, the DocLang Working Group will operate under the Joint Development Foundation’s vendor-neutral, open governance model to develop and maintain a specification that supports more reliable, interoperable document processing across AI and agentic workflows.

“Documents remain one of the most important sources of enterprise knowledge, but most were never designed for AI-driven workflows,” said Mark Collier, general manager of AI & Infrastructure at the Linux Foundation and executive director of LF AI & Data. “With the launch of the DocLang Working Group, we are bringing the open source community together to develop a vendor-neutral, interoperable standard that helps organizations prepare document data for AI more reliably, transparently, and at scale. Combined with projects like Docling, this effort can help create a more open foundation for document understanding across the AI ecosystem.”

“DocLang is the culmination of years of research into how documents can be represented more efficiently and more faithfully for AI systems,” said Peter Staar, Principal Research Scientist and Manager at IBM Software. “Our work began with innovations such as OTSL for compact table representation and DocTags for preserving document structure and semantics in a machine-readable form. Together with our industry partners, we have distilled these lessons into DocLang, a new AI-native format for unstructured content that is designed to represent arbitrarily complex documents in a way that aligns naturally with modern LLM tokenization and reasoning. Our vision is for DocLang to become a broadly adopted international standard for AI-ready documents, providing a consistent representation for both humans and machines, much as PDF became the universal standard for document exchange in the human-centric era.”

“NVIDIA looks forward to working with the Linux Foundation and the broader DocLang ecosystem to accelerate the adoption of this AI-native document format across industries,” said Kari Briski, Vice President, Generative AI, NVIDIA.

Enterprises today work across a fragmented landscape of document formats, including PDFs, JPEGs, and other file types built primarily for human consumption rather than AI interpretation. As organizations increasingly rely on generative AI and agentic systems, this disconnect can introduce complexity, raise costs, and reduce reliability when extracting meaning from business documents.

“DocLang is designed to solve one of the foundational problems in enterprise AI: documents were built for humans, not machines,” said Maxime Vermeir, Vice President, AI Strategy at ABBYY. “By introducing a minimal, standardized, and AI-native representation of document structure, layout, meaning and governance, DocLang creates a far more deterministic foundation for modern AI systems. This results in an AI native context layer at scale.”

DocLang is designed to support:

  • Preservation of both semantic meaning and geometric layout in a single AI-native format
  • Representation of structural elements such as headings, paragraphs, and tables alongside their position on the page
  • Embedded governance controls to help downstream systems enforce policies related to privacy, extraction scope, and model training permissions
  • Optimization for modern AI tokenization and modeling approaches to support more efficient and reliable document understanding
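
As a rough illustration of the first three goals above, the sketch below models a document element that carries its semantic type, its position on the page, and policy flags that a downstream system can check. This is not DocLang syntax, which this announcement does not describe. Every class, field, and function name here is a hypothetical stand-in chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BoundingBox:
    """Position of an element on a page (hypothetical fields and units)."""
    page: int
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class Governance:
    """Hypothetical policy flags; not taken from the DocLang specification."""
    contains_personal_data: bool = False
    extraction_allowed: bool = True
    training_allowed: bool = True


@dataclass
class Element:
    """One structural element: what it is, what it says, and where it sits."""
    kind: str  # e.g. "heading", "paragraph", "table"
    text: str
    box: BoundingBox
    governance: Governance = field(default_factory=Governance)


def usable_for_training(elements):
    """Keep only elements whose policy flags permit use in model training."""
    return [
        e for e in elements
        if e.governance.training_allowed
        and not e.governance.contains_personal_data
    ]


# A three-element page: two unrestricted elements and one restricted one.
doc = [
    Element("heading", "Quarterly Report", BoundingBox(1, 72, 72, 540, 100)),
    Element("paragraph", "Revenue grew in the third quarter.",
            BoundingBox(1, 72, 120, 540, 180)),
    Element("paragraph", "Contact: Jane Doe, jane@example.com",
            BoundingBox(1, 72, 200, 540, 230),
            Governance(contains_personal_data=True, training_allowed=False)),
]

allowed = usable_for_training(doc)
print([e.kind for e in allowed])
```

Keeping the policy flags on the element itself, rather than in a separate system, means any consumer that can read the element can also read its restrictions.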

DocLang and Docling

The new working group builds on the momentum of Docling, the open source document processing toolkit hosted by LF AI & Data. Originally developed by the AI for Knowledge team at IBM Research Zurich and released as open source in 2024, Docling has become a widely adopted project for converting documents into structured, AI-ready representations.

Docling serves as the processing and conversion layer, ingesting a range of document formats (including PDF, DOCX, PPTX, XLSX, HTML, and images) and transforming them into structured outputs using advanced models for layout analysis and table understanding. Its internal representation, DoclingDocument, captures text, tables, figures, reading order, and layout in a richly structured format.

DocLang complements that foundation by defining an open, interoperable standard for expressing and exchanging that structured output across systems. Together, Docling and DocLang create a more complete open source document AI stack under LF AI & Data, spanning document ingestion, parsing, standardized representation, and downstream consumption by language models and agentic AI systems.

Learn more and get involved

Organizations and individual contributors interested in helping shape the future of AI document processing are invited to participate in the DocLang Working Group. Membership is open to organizations committed to building open, interoperable AI infrastructure.

To learn more, adopt the standard, or contribute to the DocLang specification, visit https://doclang.ai/.

About the Linux Foundation
The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects, including Linux, Kubernetes, Model Context Protocol (MCP), OpenChain, OpenSearch, OpenSSF, OpenStack, PyTorch, Ray, RISC-V, SPDX and Zephyr, provide the foundation for global infrastructure. The Linux Foundation is focused on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org.

For a list of trademarks of The Linux Foundation, please see its trademark usage page: linuxfoundation.org/trademark-usage. Linux is a registered trademark of Linus Torvalds.

Media Contact
The Linux Foundation
pr@linuxfoundation.org

View original content to download multimedia: https://www.prnewswire.com/news-releases/lf-ai–data-foundation-launches-doclang-specification-working-group-to-advance-an-open-standard-for-ai-native-documents-302794922.html

SOURCE LF AI & Data Foundation