Over the past several decades, the health tech and drug discovery industries have accumulated an enormous volume of manually created documents. These include patient health records (PHR), clinical trial protocols, lab reports, investigator brochures, case report forms, adverse event reports, and regulatory submissions, among others. Most of these documents were created without standardized formats and were typically bound to the internal systems or documentation styles of the institutions or companies that generated them.
These documents were authored by humans and are often stored as unstructured PDFs or, at best, in complex, proprietary XML schemas. For example, clinical trial data submitted to repositories like ClinicalTrials.gov typically includes all the essential information—such as study objectives and endpoints, schedules of activities, drug candidate identifiers, and inclusion/exclusion criteria—but this information is primarily intended for human interpretation.
While ClinicalTrials.gov does offer data via an XML-based API, the XML format used is non-standardized and difficult to parse, making it challenging for automated systems to extract structured information efficiently. Moreover, many clinical trials, particularly those that failed or were discontinued, never made it to public registries like ClinicalTrials.gov. Instead, their data remains locked away in PDF files stored in corporate archives, inaccessible to modern computational tools.
Historically, this lack of structure was not a major concern because the primary consumers of these documents were human researchers, regulators, and medical professionals. However, this paradigm is rapidly changing.
With the advent of Artificial Intelligence, Machine Learning, and Natural Language Processing (NLP), there is an urgent and growing need to convert legacy health tech documents into machine-readable formats. AI tools cannot meaningfully analyze or learn from unstructured text unless the underlying information is organized in a structured and semantically annotated form.
This transformation involves not just digitization, but also the semantic annotation of content—tagging entities like drug names, biomarkers, adverse events, patient demographics, and trial outcomes with standardized vocabularies such as MeSH, SNOMED CT, or CDISC standards. Structured data formats like JSON, XML, or YAML allow computers to "understand" and work with the content.
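As a concrete illustration, a single annotated entity might be serialized to JSON like this. This is a minimal sketch, not a prescribed schema; the field names are our own, and the MeSH code shown is purely illustrative:

```python
import json

# Minimal sketch of semantic annotation: an entity found in free text is
# tagged with a term from a standardized vocabulary and serialized to JSON.
# Field names and the MeSH code below are illustrative, not a fixed schema.
annotation = {
    "text": "Patients received 81 mg aspirin daily.",
    "entities": [
        {
            "span": "aspirin",
            "type": "drug",
            "vocabulary": "MeSH",
            "code": "D001241",  # MeSH descriptor for Aspirin (illustrative)
        }
    ],
}

print(json.dumps(annotation, indent=2))
```

Once content is expressed this way, downstream tools can query it by entity type or vocabulary code instead of re-parsing prose.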
Furthermore, structured and annotated data is critical for training domain-specific language models, which in turn can be used to accelerate literature review, automate regulatory submissions, and even assist in clinical decision-making.
Recently, several standard digital formats have emerged that aim to bring structure and interoperability to health and clinical research data. Notable examples include USDM (Unified Study Definition Model) and FHIR (Fast Healthcare Interoperability Resources).
The USDM, maintained by CDISC, provides a standardized, machine-readable framework for defining clinical trials. It ensures that key trial attributes—such as objectives, endpoints, schedules, arms, and interventions—are captured in a consistent format across different systems and stakeholders. You can find more detailed information about USDM in another article we’ve published.
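To give a feel for what a machine-readable study definition looks like, here is a rough sketch built from the trial attributes listed above. Note that this is not the official USDM schema; the field names and values are invented for illustration only:

```python
import json

# Illustrative sketch only -- NOT the official CDISC USDM schema. The keys
# simply mirror the trial attributes mentioned in the text (objectives,
# endpoints, arms, interventions); all values are invented.
study = {
    "studyTitle": "Example Phase II Trial of Drug X",
    "objectives": [
        {"level": "primary", "text": "Evaluate efficacy of Drug X vs. placebo"}
    ],
    "endpoints": [{"text": "Change in biomarker Y at week 12"}],
    "arms": ["Drug X", "Placebo"],
    "interventions": [{"name": "Drug X", "dose": "50 mg daily"}],
}

print(json.dumps(study, indent=2))
```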
FHIR, on the other hand, is a widely adopted standard for the electronic exchange of healthcare information. It is maintained by HL7 and focuses on representing and sharing healthcare data such as patient records, medications, lab results, and encounter histories. We also have a dedicated article about FHIR.
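For comparison, FHIR resources are plain JSON objects with well-defined fields. The sketch below uses standard FHIR R4 Patient fields (`resourceType`, `name`, `gender`, `birthDate`); the patient data itself is invented:

```python
import json

# A minimal FHIR Patient resource. "resourceType", "name", "gender", and
# "birthDate" are standard FHIR R4 Patient fields; the values are invented.
patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-04-12",
}

print(json.dumps(patient, indent=2))
```

Because the structure is standardized, any FHIR-aware system can consume this resource without custom parsing logic.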
To convert unstructured documents like PDFs into standardized formats such as USDM or FHIR, a process known as document annotation is required. This involves identifying relevant sections of a document, extracting specific pieces of information (e.g., the primary objective of a clinical trial), and mapping them to corresponding fields in a structured format like JSON.
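The extraction step can be sketched with simple heuristics. The function below is a toy example of our own devising: it looks for a "Primary Objective" heading, captures the paragraph that follows, and maps it to a JSON field. Real protocols vary widely, so this pattern is illustrative, not production-ready:

```python
import json
import re

def extract_primary_objective(protocol_text: str) -> dict:
    """Heuristic sketch: find a 'Primary Objective' heading, capture the
    paragraph after it, and map it to a structured JSON field."""
    match = re.search(
        r"Primary Objective[:\s]*\n?(.+?)(?:\n\s*\n|$)",
        protocol_text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    objective = match.group(1).strip() if match else None
    return {"objectives": [{"level": "primary", "text": objective}]}

sample = """2.1 Primary Objective
To assess the safety and tolerability of Drug X in adults.

2.2 Secondary Objectives
..."""

print(json.dumps(extract_primary_objective(sample), indent=2))
```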
There are several approaches to document annotation. Two extremes can be outlined in the industry:
The first extreme is fully manual annotation. This method relies on a large team of relatively low-skilled workers using CRM-like tools to manually locate and extract information, which is then entered into structured formats like JSON or XML. While this method can be controlled and audited, it is slow, costly, and prone to human error—especially when outsourced.
The other extreme is fully automated, LLM-based extraction. In this scenario, the entire document is loaded into a language model with a sophisticated prompt, and the model is expected to extract and organize all relevant information autonomously. While attractive in theory, this approach presents several challenges: long documents can exceed the model's context window; processing entire documents is computationally expensive; the model's output is difficult to audit and may contain hallucinated values; and sending sensitive patient data to an external model provider raises privacy concerns.
In our experience, the most practical and scalable solution is a hybrid approach. This involves first using predefined heuristics (or light human review) to locate and extract only the relevant sections of a document, and then passing those reduced fragments to an LLM that maps the extracted content to the target structured format.
This strategy offers a balance between automation and control. By reducing the document size before sending it to the LLM, we stay within context limits and reduce computational costs. At the same time, by combining human intuition or predefined heuristics with model intelligence, we improve accuracy and maintain flexibility.
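The hybrid strategy above can be sketched in a few lines. The heuristic pre-filter is plain Python; `call_llm` is a hypothetical stand-in for whichever model API is used:

```python
import re

def select_relevant_sections(document: str, keywords: list[str]) -> str:
    """Heuristic pre-filter: keep only paragraphs mentioning a keyword,
    shrinking the input before it is sent to the model."""
    paragraphs = document.split("\n\n")
    kept = [p for p in paragraphs
            if any(k.lower() in p.lower() for k in keywords)]
    return "\n\n".join(kept)

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: plug in your model provider of choice here.
    raise NotImplementedError

def annotate(document: str) -> str:
    relevant = select_relevant_sections(
        document, keywords=["objective", "endpoint", "inclusion", "exclusion"]
    )
    prompt = (
        "Extract the primary objective and endpoints from the text below "
        "and return them as JSON.\n\n" + relevant
    )
    return call_llm(prompt)
```

The pre-filter is where the cost savings come from: only the paragraphs that survive the keyword screen ever reach the model.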
This semi-automated pipeline represents a feasible path forward—scalable, cost-effective, and adaptable to evolving document types and annotation standards.
In this article, we highlighted the critical need to annotate and structure legacy documents in the clinical and health tech domains. As AI and data-driven tools become central to research and care, extracting structured data from unstructured PDFs and non-standard formats is essential.
We reviewed emerging standards like USDM and FHIR, and explored various annotation methods—from fully manual to fully automated LLM-based approaches. While each has trade-offs, a hybrid method combining heuristics with language models offers a practical balance of scalability, accuracy, and privacy.
Modernizing document workflows is a key step toward unlocking the full potential of AI in healthcare—enabling better insights, faster innovation, and smarter clinical decisions.