Over the past several decades, the health tech and drug discovery industries have accumulated an enormous volume of manually created documents. These include patient health records (PHR), clinical trial protocols, lab reports, investigator brochures, case report forms, adverse event reports, and regulatory submissions, among others. Most of these documents were created without standardized formats and were typically bound to the internal systems or documentation styles of the institutions or companies that generated them.
These documents were authored by humans and are often stored as unstructured PDFs or, at best, in complex, proprietary XML schemas. For example, clinical trial data submitted to repositories like ClinicalTrials.gov typically includes all the essential information—such as study objectives and endpoints, schedules of activities, drug candidate identifiers, and inclusion/exclusion criteria—but this information is primarily intended for human interpretation.
While ClinicalTrials.gov does offer data via an XML-based API, the XML format used is non-standardized and difficult to parse, making it challenging for automated systems to extract structured information efficiently. Moreover, many clinical trials, particularly those that failed or were discontinued, never made it to public registries like ClinicalTrials.gov. Instead, their data remains locked away in PDF files stored in corporate archives, inaccessible to modern computational tools.
Historically, this lack of structure was not a major concern because the primary consumers of these documents were human researchers, regulators, and medical professionals. However, this paradigm is rapidly changing.
With the advent of Artificial Intelligence, Machine Learning, and Natural Language Processing (NLP), there is an urgent and growing need to convert legacy health tech documents into machine-readable formats. AI tools cannot meaningfully analyze or learn from unstructured text unless the underlying information is organized in a structured and semantically annotated form.
This transformation involves not just digitization, but also the semantic annotation of content—tagging entities like drug names, biomarkers, adverse events, patient demographics, and trial outcomes with standardized vocabularies such as MeSH, SNOMED CT, or CDISC standards. Structured data formats like JSON, XML, or YAML allow computers to "understand" and work with the content.
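As a concrete illustration, a single annotated entity might be serialized to JSON like this. This is a minimal sketch, not a prescribed schema; the field names are our own, and the MeSH code shown is purely illustrative:

```python
import json

# Minimal sketch of semantic annotation: an entity found in free text is
# tagged with a term from a standardized vocabulary and serialized to JSON.
# Field names and the MeSH code below are illustrative, not a fixed schema.
annotation = {
    "text": "Patients received 81 mg aspirin daily.",
    "entities": [
        {
            "span": "aspirin",
            "type": "drug",
            "vocabulary": "MeSH",
            "code": "D001241",  # MeSH descriptor for Aspirin (illustrative)
        }
    ],
}

print(json.dumps(annotation, indent=2))
```

Once content is expressed this way, downstream tools can query it by entity type or vocabulary code instead of re-parsing prose.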
Furthermore, structured and annotated data is critical for training domain-specific language models, which in turn can be used to accelerate literature review, automate regulatory submissions, and even assist in clinical decision-making.
Recently, several standard digital formats have emerged that aim to bring structure and interoperability to health and clinical research data. Notable examples include USDM (Unified Study Definition Model) and FHIR (Fast Healthcare Interoperability Resources).
The USDM, maintained by CDISC, provides a standardized, machine-readable framework for defining clinical trials. It ensures that key trial attributes—such as objectives, endpoints, schedules, arms, and interventions—are captured in a consistent format across different systems and stakeholders. You can find more detailed information about USDM in another article we’ve published.
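To give a feel for what a machine-readable study definition looks like, here is a rough sketch built from the trial attributes listed above. Note that this is not the official USDM schema; the field names and values are invented for illustration only:

```python
import json

# Illustrative sketch only -- NOT the official CDISC USDM schema. The keys
# simply mirror the trial attributes mentioned in the text (objectives,
# endpoints, arms, interventions); all values are invented.
study = {
    "studyTitle": "Example Phase II Trial of Drug X",
    "objectives": [
        {"level": "primary", "text": "Evaluate efficacy of Drug X vs. placebo"}
    ],
    "endpoints": [{"text": "Change in biomarker Y at week 12"}],
    "arms": ["Drug X", "Placebo"],
    "interventions": [{"name": "Drug X", "dose": "50 mg daily"}],
}

print(json.dumps(study, indent=2))
```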
FHIR, on the other hand, is a widely adopted standard for the electronic exchange of healthcare information. It is maintained by HL7 and focuses on representing and sharing healthcare data such as patient records, medications, lab results, and encounter histories. We also have a dedicated article about FHIR.
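For comparison, FHIR resources are plain JSON objects with well-defined fields. The sketch below uses standard FHIR R4 Patient fields (`resourceType`, `name`, `gender`, `birthDate`); the patient data itself is invented:

```python
import json

# A minimal FHIR Patient resource. "resourceType", "name", "gender", and
# "birthDate" are standard FHIR R4 Patient fields; the values are invented.
patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-04-12",
}

print(json.dumps(patient, indent=2))
```

Because the structure is standardized, any FHIR-aware system can consume this resource without custom parsing logic.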
To convert unstructured documents like PDFs into standardized formats such as USDM or FHIR, a process known as document annotation is required. This involves identifying relevant sections of a document, extracting specific pieces of information (e.g., the primary objective of a clinical trial), and mapping them to corresponding fields in a structured format like JSON.
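The extraction step can be sketched with simple heuristics. The function below is a toy example of our own devising: it looks for a "Primary Objective" heading, captures the paragraph that follows, and maps it to a JSON field. Real protocols vary widely, so this pattern is illustrative, not production-ready:

```python
import json
import re

def extract_primary_objective(protocol_text: str) -> dict:
    """Heuristic sketch: find a 'Primary Objective' heading, capture the
    paragraph after it, and map it to a structured JSON field."""
    match = re.search(
        r"Primary Objective[:\s]*\n?(.+?)(?:\n\s*\n|$)",
        protocol_text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    objective = match.group(1).strip() if match else None
    return {"objectives": [{"level": "primary", "text": objective}]}

sample = """2.1 Primary Objective
To assess the safety and tolerability of Drug X in adults.

2.2 Secondary Objectives
..."""

print(json.dumps(extract_primary_objective(sample), indent=2))
```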
There are several approaches to document annotation. Two extremes can be outlined in the industry:
The first extreme is fully manual annotation. This method relies on a large team of relatively low-skilled workers using CRM-like tools to manually locate and extract information, which is then entered into structured formats like JSON or XML. While this method can be controlled and audited, it is slow, costly, and prone to human error—especially when outsourced.
The other extreme is fully automated, LLM-based extraction. In this scenario, the entire document is loaded into a language model with a sophisticated prompt, and the model is expected to extract and organize all relevant information autonomously. While attractive in theory, this approach presents several challenges: long documents can exceed the model's context window; processing entire documents is computationally expensive; the model's output is difficult to audit and may contain hallucinated values; and sending sensitive patient data to an external model provider raises privacy concerns.
In our experience, the most practical and scalable solution is a hybrid approach. This involves first using predefined heuristics (or light human review) to locate and extract only the relevant sections of a document, and then passing those reduced fragments to an LLM that maps the extracted content to the target structured format.
This strategy offers a balance between automation and control. By reducing the document size before sending it to the LLM, we stay within context limits and reduce computational costs. At the same time, by combining human intuition or predefined heuristics with model intelligence, we improve accuracy and maintain flexibility.
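The hybrid strategy above can be sketched in a few lines. The heuristic pre-filter is plain Python; `call_llm` is a hypothetical stand-in for whichever model API is used:

```python
import re

def select_relevant_sections(document: str, keywords: list[str]) -> str:
    """Heuristic pre-filter: keep only paragraphs mentioning a keyword,
    shrinking the input before it is sent to the model."""
    paragraphs = document.split("\n\n")
    kept = [p for p in paragraphs
            if any(k.lower() in p.lower() for k in keywords)]
    return "\n\n".join(kept)

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: plug in your model provider of choice here.
    raise NotImplementedError

def annotate(document: str) -> str:
    relevant = select_relevant_sections(
        document, keywords=["objective", "endpoint", "inclusion", "exclusion"]
    )
    prompt = (
        "Extract the primary objective and endpoints from the text below "
        "and return them as JSON.\n\n" + relevant
    )
    return call_llm(prompt)
```

The pre-filter is where the cost savings come from: only the paragraphs that survive the keyword screen ever reach the model.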
This semi-automated pipeline represents a feasible path forward—scalable, cost-effective, and adaptable to evolving document types and annotation standards.
In this article, we highlighted the critical need to annotate and structure legacy documents in the clinical and health tech domains. As AI and data-driven tools become central to research and care, extracting structured data from unstructured PDFs and non-standard formats is essential.
We reviewed emerging standards like USDM and FHIR, and explored various annotation methods—from fully manual to fully automated LLM-based approaches. While each has trade-offs, a hybrid method combining heuristics with language models offers a practical balance of scalability, accuracy, and privacy.
Modernizing document workflows is a key step toward unlocking the full potential of AI in healthcare—enabling better insights, faster innovation, and smarter clinical decisions.