May 11, 2023 | AI and NLP, Clinical Data Strategy, Digital Quality Transformation, Year Round Prospective HEDIS®

Unstructured Data and Health Care Transformation

Last year ChatGPT exploded onto the scene, kicking off a flurry of technology development with impact across many industries. The technology underlying ChatGPT (known as large language models or LLMs) has actually been around and evolving for several years. And these models are just the latest episode in a much longer story of language technology development that has been accelerating for a decade or more. Still, there is no doubt that we are in a period of rapid technology change that could radically transform the way we work, including the way we work in healthcare.

Rebecca Jacobson, MD, MS, FACMI

Co-Founder, CEO, and President

I am certain that many new innovations for payers will result from this large step forward in language models, including better analytics of clinical data to help focus limited resources on interventions that produce better and more equitable healthcare. But to really use these language technologies at scale, health plans will need to figure out how to aggregate clinical data, specifically unstructured clinical data. That’s why this month’s blog takes a deep dive into unstructured clinical data. What is it? Why do you need it? And how do you manage it at scale?

What is Unstructured Clinical Data?

Unstructured data includes things like text, images, and video. It’s data that cannot be codified or separated into specific database fields without special methods (such as NLP or image processing). For a health plan quality team, the type of unstructured data that you know best is the member chart, something you manage routinely as part of your HEDIS® hybrid review season. Simply put, member charts come to you as unstructured data because they contain all types of data formatted as a series of clinical notes, often stored as one or more PDF documents.

Your Unstructured Data Needs are Increasing

As the NCQA hybrid methodology is retired, health plans are increasingly turning to prospective HEDIS® across key populations during the measurement year to preserve their HEDIS® rates and incentive programs. Health plans also need to support ongoing risk adjustment programs that rely on clinical review of charts. Data integration partners can now provide member charts in high volumes, and health plans can also integrate directly with practices to aggregate this type of data. As NLP methods continue to grow in sophistication, there will be new opportunities for using unstructured data, for example in care management, social needs identification, or detection of patterns of fraud, waste, and abuse. In fact, almost any health plan operation could be a potential use case for unstructured data.

You Can Only Go So Far with What You’ve Been Doing

And yet…many health plans still rely on file directories for storing and managing member charts. Unfortunately, file directories don’t scale to the volume that you’ll need as you transition to prospective HEDIS®, and they won’t provide the flexibility and speed you’ll want to easily leverage technologies such as NLP and LLMs. File directory style management also tends to create silos, because each business unit creates its own separate file store, resulting in duplication of data and multiplication of data retrieval costs. It’s not unusual to see health plans paying for the same chart to be retrieved three times for three different purposes in the same year. And then paying for an identical and unchanged chart the next year too!

Other health plans are buying off-the-shelf document repositories or using their data warehouses to create their own centralized infrastructures in collaboration with their internal IT partners. That’s definitely a step up from file directories. But off-the-shelf technology and most clinical data repositories just don’t have the capabilities needed to support healthcare data and analytics using unstructured data. That can hold you back as you build your analytic muscles to leverage NLP or even just try to move your team to year-round prospective review.

Wish List for Your New Infrastructure

What should you be looking for in a centralized unstructured data repository? We’ve made a list of the 11 things we think are most critical.

1. Scalable to Vast Stores of Unstructured Data. Be prepared to manage hundreds of millions or even billions of documents, stored both in their original formats and in NLP-friendly transformations for efficient use in downstream applications.

2. Historical and Real-Time. Use historical data to meet look-back periods. And support near-real-time use cases for surveillance. HEDIS® success requires both!

3. Indexed and Searchable Using User-Friendly Tools as well as APIs. Your staff needs to be able to quickly find members who meet specific criteria, enabling rapid development of quality improvement campaigns. But you should also be able to leverage the same search capabilities through APIs to support your data analytics efforts.

4. Format Conversion. Unstructured data can come in pure text formats such as RTF or TXT, multi-medium formats such as PDF and XML, or image formats such as TIFF. Your repository must be able to ingest these formats and convert them into TXT for NLP or display-friendly formats such as PDF so that downstream applications can consume this data in the format best suited for the use case.

5. Run Independently of EMRs. Integrate data from EMRs through a variety of mechanisms and formats. There is no right way to get data. Your repository should support them all.

6. Patient Identity Reconciliation. Match individual members to their unstructured data to create a unified record across providers.

7. Rich APIs. Due to the volumes involved, your in-house analytics and software solutions teams will need rich APIs to search and retrieve this data at scale. Near-real-time use cases cannot be supported with file-based transfers.

8. Cohort Management for Pipeline Orchestration. Your analytics workflows may vary by population or denominator, so your repository should support orchestration – the ability to define and save cohorts and move their clinical data through specific NLP workflows.

9. Strong Data Integrity and Version Control. Clinical documents go through a lifecycle from initial creation to final sign-off and can also be amended later. Your repository must have the capability to support multiple versions of the same document and be able to store these in a traceable manner for auditing support.

10. Data Validation. Support multiple methods for validating data. Empower your data team to support best practices in data quality.

11. Secure, HIPAA-Compliant, and HITRUST-Certified. Keep data privacy and security at the forefront.
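
To make item 9 a little more concrete, here is a minimal Python sketch of how a repository might keep a traceable version history for a clinical document. It is only an illustration under stated assumptions, not a description of any particular product: the names `DocumentVersion`, `DocumentHistory`, and `add_version`, and the status values used in the usage note, are hypothetical. The idea is to fingerprint each incoming copy with a SHA-256 hash so that an unchanged re-delivery is recognized rather than stored again, while a real change (new content or a new lifecycle status) is appended as a new version.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocumentVersion:
    """One immutable entry in a document's history (hypothetical structure)."""
    version: int
    sha256: str          # fingerprint of the document bytes
    status: str          # assumed lifecycle label, e.g. "draft" or "signed"
    stored_at: datetime


@dataclass
class DocumentHistory:
    """Append-only version history for a single clinical document."""
    document_id: str
    versions: list = field(default_factory=list)

    def add_version(self, content: bytes, status: str) -> DocumentVersion:
        digest = hashlib.sha256(content).hexdigest()
        latest = self.versions[-1] if self.versions else None
        # An identical copy with the same status is a re-delivery: keep the
        # existing record instead of storing (and paying for) it again.
        if latest and latest.sha256 == digest and latest.status == status:
            return latest
        new_version = DocumentVersion(
            version=len(self.versions) + 1,
            sha256=digest,
            status=status,
            stored_at=datetime.now(timezone.utc),
        )
        self.versions.append(new_version)
        return new_version
```

In use, calling `add_version(b"draft note", "draft")` twice on the same history leaves a single version on record, while a later `add_version(b"final note", "signed")` appends version 2. Earlier versions are never overwritten, which is what makes the history usable for audit support.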

Hey – That’s Us!

If you’re looking for a good example of an infrastructure with these eleven characteristics, take a peek at Constellation – Astrata’s newest standalone offering for the payer market. Constellation is a sophisticated healthcare document repository that will help you centralize your data and use it efficiently across all your use cases. Your analytics teams can use Constellation APIs to hook up to their own analytics workflows or offer our integration points to your third-party vendors, speeding up your integrations while lowering costs. Astrata can provide managed services along with the software to run your centralized asset, or license the software and provide training for your own IT team to manage it. Learn more here:

Download the Constellation product sheet.

Read more about how your organization can develop and implement a clinical data strategy.