March 11, 2025

HEDIS NLP in the era of LLMs – what if my organization wants to build our own?

As hybrid measurement is retired, your team is likely already switching to prospective HEDIS medical record review to help maintain your HEDIS rates, Star Ratings, and Medicaid P4P ratings. But that means abstracting records for a much larger population than your smaller HEDIS sample. Most payers already use Natural Language Processing (NLP) systems to triage cases so that they can process these much larger volumes. If you’re in the market to ramp up your prospective HEDIS program with NLP, you may be facing a build-versus-buy decision with your IT and Analytics colleagues. What should internal teams consider going into that decision?

Rebecca Jacobson, MD, MS, FACMI

Co-Founder, CEO, and President

One thing worth pointing out upfront is that, until the last two years, we rarely found health plan teams prepared to build their own population-scale NLP systems. With the increased availability of Large Language Models (LLMs), health plan IT teams now feel more confident that they can create an information extraction solution for HEDIS without deep NLP expertise. But a “build-it-ourselves” decision can also produce significant challenges and slowdowns. This blog provides an overall analysis of the build-it-yourself option to help quality leaders avoid the pitfalls and steer towards collaborative success.

What are we building exactly? 

It’s important to start with a solid lay-person’s understanding of what technology needs to be in place to create a successful prospective HEDIS MRR program. If your organization is building its own, you need to account for ALL of these pieces. The figure below is a good starting point for formulating a complete picture. 

Any complete solution starts with being able to ingest and normalize a wide variety of data formats, because unstructured data comes in all shapes and sizes: PDF, HL7, CCD, CCDA, HTML, faxes, and images are all common. Metadata (such as facility, provider, and encounter type) needs to be ingested and normalized too. Next, models must be built, typically one for each HEDIS measure. IT teams using LLMs will most likely choose an open-source LLM (such as Llama) or a proprietary LLM (such as those from OpenAI) and then engage in something called “prompt engineering”. In this phase, they formulate a query to the large language model that brings back the information you want, such as the compliance status and the specific evidence for closure of a COL (colorectal cancer screening) gap across an entire medical chart. Your team may engage in other model-building activities, such as fine-tuning or even model pre-training, designed to make your results more accurate. Throughout the process, they will need to evaluate the models iteratively to ensure sufficient accuracy before deployment. Following deployment, your team will need to monitor for drift, then update and redeploy models yearly, in sync with the new HEDIS specification.
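
For readers who want a concrete sense of what “prompt engineering” looks like in practice, here is a minimal, illustrative sketch in Python. It assumes the OpenAI Python SDK and a hypothetical chart_text variable holding an already-ingested, normalized chart; the model name and prompt wording are example placeholders, not Astrata’s implementation or a complete solution.

```python
# Illustrative sketch only: querying a proprietary LLM for COL gap evidence.
# Assumes the OpenAI Python SDK and an already-ingested, normalized chart
# (chart_text). Model name and prompt wording are hypothetical examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are reviewing a medical chart for HEDIS Colorectal
Cancer Screening (COL) compliance.

Chart:
{chart}

Answer in JSON with two fields:
  "compliant": true or false,
  "evidence": a direct quote from the chart documenting the screening
              (test type and date), or null if no evidence is found.
"""

def check_col_gap(chart_text: str) -> str:
    """Ask the LLM whether the chart contains evidence that closes a COL gap."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; an open-source model such as Llama could be swapped in
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(chart=chart_text)}],
        temperature=0,   # deterministic output makes evaluation easier
    )
    return response.choices[0].message.content
```

Getting a first version of a prompt like this to work is the easy part. The harder work is everything around it: splitting long charts that exceed the model’s context window, validating the output, mapping evidence back to HEDIS value sets, and evaluating results against abstractor-reviewed gold standards.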

Why do teams want to build their own? 

One reason for the increasing interest in building internally is that LLMs are an engaging and exciting new technology, one that seems particularly well-suited to teams with limited NLP experience. LLMs can be exhilarating to work with, especially at first. A good programmer working on the “model building” tasks above can make progress quickly, even with limited previous NLP experience. The speed with which they can get to 60-70% accuracy suggests that a near-perfect solution is just around the corner. However, attaining acceptable performance from the model and deploying it alongside all of the other needed technology and processes is a much bigger project. Internal teams should consider not just the language model, but also evaluation, monitoring, data ingestion and normalization pipelines, a user interface and process for expert review, and yearly rebuilds.

Is it financially advantageous to build our own? 

The perceived financial benefit is another reason why teams may decide to build their own rather than use an established AI vendor. Your team may have already constructed an initial comparison of the costs of a vendor-based AI solution versus an internal build. But there are five specific areas where you want to make sure you aren’t overlooking hidden costs of an internal build: 

  • Need for domain expertise. HEDIS domain expertise is the most important hidden cost to manage. Your build team needs access to substantial expertise to complete the project. This may start as a request to help the build team better understand the HEDIS specification or value sets, but it can grow into near-daily requests for examples of HEDIS evidence language, help evaluating results, and ongoing feedback during the review process. Even document annotation and gold-standard development by your abstractors may be needed. It takes both domain expertise and AI expertise to build the complete solution, so be sure to account for both.
  • Achieving sufficient accuracy. LLMs can generate good accuracy out of the box because they build on knowledge embedded in very large datasets (such as the entire internet). But getting to a level of accuracy sufficient to support a prospective HEDIS program across multiple measures and lines of business is much harder. In general, it’s easier to get high precision (few false positives) than high recall (few false negatives). Your goal should be to close all gaps for which there is evidence in the documentation; without sufficient recall, you’ll see a substantial decrease in the value of your prospective program (the sketch after this list shows how these two metrics are calculated).
  • Achieving sufficient coverage. A complete solution means producing models for all of the relevant HEDIS measures, covering both inclusions and exclusions. This usually means many more prompts to manage, and more effort from your team to help build them. Some measures (like COL and CCS) are usually easy to build, but others can be much more challenging. Ensure that the financial projections are not based only on the easiest measures to produce.
  • Maintaining models over time. It’s also wise to account for the changes in individual measures as well as new measures that result from the yearly HEDIS updates, and changes to Star Ratings and P4P programs. 
  • Include all of the necessary functionality, not just the LLM. Be sure that you’ve accounted for everything listed in the figure above, including development and maintenance of the user interface used to review and accept gap closures when allowable HEDIS evidence is present.
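
To make the precision and recall trade-off above concrete, here is a minimal, illustrative calculation in Python. The counts are hypothetical; the point is simply that the two metrics answer different questions, and both need to be measured against abstractor-reviewed gold standards.

```python
# Illustrative sketch: computing precision and recall for one measure's model
# against an abstractor-reviewed gold standard. All counts are hypothetical.

def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Precision: of the gaps the model says are closed, how many really are?
    Recall: of the gaps that really are closed, how many did the model find?"""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Example: the model flags 100 charts as having COL evidence; abstractors
# confirm 90 of them (90 TP, 10 FP) but find evidence in 30 more charts the
# model missed (30 FN).
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.90, recall = 0.75
```

In this hypothetical, the model looks strong on precision but misses a quarter of the closable gaps, which is exactly the shortfall that erodes the value of a prospective program.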

Once all of these hidden costs are included, organizations often find that the internal build is the more expensive option, particularly in terms of upfront costs.

What are the risks of building our own? 

Delayed speed to market, and the competitive edge lost with it, is the most important risk of a build-it-yourself strategy. As HEDIS measures continue to be removed from the hybrid methodology, health plans must leverage every tool available to preserve their rates. A “build-it-ourselves” strategy may take years to produce a usable platform for prospective HEDIS, and most health plans do not have that time to spare. That’s especially true for those competing against health plans that are already using established commercial systems.

Questions to ask of your IT or analytics colleagues 

To help you make the best decision possible, we’ve curated a list of questions that can assist you in developing a solid build plan. 

  • Have you accounted for all of the technologies and processes that need to be developed and managed (see figure)? 
  • What data formats will be included and excluded? 
  • How much time from the quality team will be needed? What parts of the development and maintenance process will we be needed for? 
  • How will you measure accuracy? How will we determine an acceptable threshold for accuracy? Will accuracy be measured over time to ensure that it does not degrade?  
  • How will the results of the solution be reviewed by quality experts for accuracy? How can we ensure that expert review doesn’t end up taking our abstractors as much time as, or more time than, abstraction does today?
  • What approach will you take to ensure that the solution generalizes to measures other than the easy ones (e.g. COL, CCS)? 
  • Will our internally built solution be able to identify exclusions too, or only inclusions? 
  • How will the system be monitored for drift over time? What will happen if we detect a drop in accuracy? (The sketch after this list shows one simple way to frame this.)
  • How will the system be maintained as measures change on a yearly basis? What will be needed to make sure that the system stays up to date?
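
For the drift question in particular, the underlying idea is simple to sketch. The example below is illustrative Python only, with hypothetical monthly figures and a hypothetical 0.85 recall threshold: score each month’s model output against a sample of abstractor-reviewed charts and flag any month where recall falls below the agreed threshold.

```python
# Illustrative sketch: flagging accuracy drift month over month.
# The monthly figures and the 0.85 recall threshold are hypothetical; in
# practice they would come from ongoing abstractor review of sampled charts.

RECALL_THRESHOLD = 0.85

monthly_recall = {
    "2025-01": 0.91,
    "2025-02": 0.90,
    "2025-03": 0.82,  # e.g., a new documentation template the model hasn't seen
}

for month, recall in monthly_recall.items():
    if recall < RECALL_THRESHOLD:
        print(f"{month}: recall {recall:.2f} below threshold {RECALL_THRESHOLD}, investigate")
    else:
        print(f"{month}: recall {recall:.2f} OK")
```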

For those who know that they want to buy instead of build, we are always delighted to share Astrata’s approach to scaling your prospective HEDIS season.