The well-publicized problems with healthcare.gov are disturbing, especially when we remember they might result in many people continuing without health insurance. But it seemed a step in the right direction when a recent news report differentiated between "front end" and "back end" problems. The back end problems were data issues, like a married applicant with two kids being sent to an insurer's systems as a man with three wives.

Coincidentally, I recently responded to a questionnaire about health care data. I've paraphrased the questions and my responses below. Perhaps the views of someone who's spent a lot of time in the health care engine room might provide some useful perspective.

In reading through the questions and my responses, picture a care management system that pulls patient, provider, and visit data together from multiple insurance companies and sends letters with helpful advice to patients with certain conditions.

1) What kinds of problems exist in collecting patient-centric data? What data quality issues are present?

In context, this question asked how well data from the patient-provider encounter (health care jargon for doctor visit) was recorded, including patient enrollment, data from the encounter, the diagnosis, and even matching patient encounter records to the right providers.

As a "back end" developer, most often at an insurance company, my job is to integrate data from several different sources into a single database. Based on my experience, I'd characterize 5 to 10% of the records as incomplete or inconsistent in some way. A smaller percentage of the records feature errors significant to the mission of whatever system I'm involved in building.

Quality problems might include:

  • Missing data for a given patient
  • Data entered incorrectly
  • Duplicate patient data, caused by key fields entered differently at separate care locations. Redundant data can also be entered multiple times at the same location if data validation controls are not in place

The result of the latter might be multiple records for the same person, or data for different people being incorrectly "matched" together into the same record.
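To make the duplicate-record problem concrete, here is a minimal sketch of why key fields entered differently at separate locations defeat naive matching. The field names, normalization rule, and sample records are invented for illustration; real patient-matching logic is far more involved.

```python
# Hypothetical sketch: matching on raw key fields vs. normalized ones.
# All field names and records below are invented for illustration.

def raw_key(rec):
    # Matching on raw strings treats "O'Brien" and "OBrien " as different people.
    return (rec["last_name"], rec["first_name"], rec["dob"])

def normalized_key(rec):
    # Normalizing case, punctuation, and whitespace catches more duplicates.
    def norm(s):
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return (norm(rec["last_name"]), norm(rec["first_name"]), rec["dob"])

records = [
    {"last_name": "O'Brien", "first_name": "Ann", "dob": "1970-03-01"},
    {"last_name": "OBrien ", "first_name": "ann", "dob": "1970-03-01"},
]

print(len({raw_key(r) for r in records}))         # 2 keys: the duplicate slips through
print(len({normalized_key(r) for r in records}))  # 1 key: the duplicate is detected
```

The flip side is that overly aggressive normalization can merge records for genuinely different people, which is the other failure mode described above.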

These data quality issues cause errors when systems use patient data to drive transaction processing. Using letters related to a specific condition – say, diabetes – as an example:

  • Missing data might cause a patient with diabetes to be omitted.
  • Invalid duplication of a patient record might result in duplicate letters for the same person.
  • Merging patient records incorrectly might cause someone without the given condition to receive a letter.
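The letter-driving logic itself can be pictured as a simple filter over patient records, which shows how a duplicate record translates directly into a duplicate letter. This is a hypothetical sketch; the flag name and records are invented.

```python
# Hypothetical sketch of the letter-selection step: every record flagged
# diabetic gets a letter. Data and field names are invented for illustration.

patients = [
    {"patient_id": "P1", "name": "Ann", "diabetic": True},
    {"patient_id": "P1", "name": "Ann", "diabetic": True},   # duplicate record
    {"patient_id": "P2", "name": "Bob", "diabetic": False},
]

letters = [p["patient_id"] for p in patients if p["diabetic"]]
print(letters)  # ['P1', 'P1'] -- the duplicate record means Ann is mailed twice
```

A missing record would drop a patient from `letters` entirely, and a wrongly merged record would put the wrong person in it, mirroring the three bullet points above.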

2) What steps are taken to improve upon the data or the collection methods?

From the perspective of a data integration professional without any influence on the point of data entry, the options are these:

  1. Detect and exclude invalid data,
  2. Include all data regardless of validity, and
  3. Interpolate or estimate correct values.

The first two options are the most common. In most business applications the third option is frowned upon because it changes source data without any way of knowing the actual value.
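The first two options can be sketched as a simple split of the source records, one path excluding invalid rows and the other passing everything through with a validity flag for downstream consumers. The validation rule and field names here are invented for illustration.

```python
# Hypothetical sketch of options 1 and 2: exclude invalid records, or
# include everything with a validity flag. Field names are invented.

def is_valid(rec):
    # A record needs a patient id and a diagnosis code to be usable here.
    return bool(rec.get("patient_id")) and bool(rec.get("diagnosis_code"))

source = [
    {"patient_id": "P1", "diagnosis_code": "E11.9"},  # complete
    {"patient_id": "P2", "diagnosis_code": None},     # missing diagnosis
]

# Option 1: detect and exclude invalid data (rejects are typically logged).
loaded = [r for r in source if is_valid(r)]
rejected = [r for r in source if not is_valid(r)]

# Option 2: include all data, flagging validity for downstream consumers.
flagged = [dict(r, valid=is_valid(r)) for r in source]
```

Option 3, interpolating or estimating values, would replace the missing diagnosis with a guess, which is exactly why business applications frown on it.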

3) What are the negative impacts of dealing with dirty data?

A business process cannot be 100% accurate if its source data isn't 100% accurate. And each data quality measure introduced to the data integration process increases its cost and adds transmission lag to the target system.

4) What efforts are in place to reduce the negative impacts? What kinds of processing or architecture helped make the difference?

Business process design and system improvements that introduce measures to correct data at the point of entry are the best way to ensure quality data.

I believe the health care industry has a lot of potential improvement here. Why do I have to fill out a 10-page form for error-prone manual entry at the doctor's office when the insurance company they share data with has all of my detailed information available for download? Of course those kinds of solutions quickly raise Big Brother-type questions, but for better or worse any insured person's health care details are already replicated across many provider and insurer systems. Our records are at risk even without the addition of this one very useful interface.

After entry, the data passes through the labyrinth of back end systems of large providers and insurance companies. Different systems, or even different locations, evolve different business rules and data definitions over time. It is critical that those developing integration processes understand the business aspects of all the different sources and conform them to a common standard that the target business process can interpret correctly. Projects that skimp on business analysis and source data research pay the price as many of their implicit assumptions turn out to be wrong. Maybe that's what's happened at some of the back-end systems served by healthcare.gov.
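Conforming source-specific codes to a common standard can be pictured as a per-source mapping step applied before integration. The source systems, codes, and mapping tables below are invented for illustration; real conformance layers cover many fields and far larger code sets.

```python
# Hypothetical sketch: two source systems encode the same attribute
# differently; a conforming step maps each to one target standard.
# System names, codes, and mappings are invented for illustration.

SOURCE_MAPS = {
    "system_a": {"M": "male", "F": "female"},
    "system_b": {"1": "male", "2": "female"},
}

def conform_gender(source_system, raw_value):
    # Unknown codes surface as "unknown" instead of silently passing through,
    # so the implicit assumptions the text warns about become visible.
    return SOURCE_MAPS.get(source_system, {}).get(raw_value, "unknown")

print(conform_gender("system_a", "F"))  # female
print(conform_gender("system_b", "2"))  # female
print(conform_gender("system_b", "X"))  # unknown
```

The point of the explicit "unknown" path is that a code nobody researched shows up in output and gets investigated, rather than being loaded under a wrong assumption.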

It's sometimes possible to correct data after the fact. For example, data analysts on one CapTech project for a large health insurer tracked data errors on reports back to the incorrect data in source systems, and recommended the corrections needed. The 12-month effort resulted in significant data quality improvement after diligent work by the 10 or so skilled data analysts on the team.