I've worked with health care data for the past few years, and in a recent conversation I realized it might be valuable to detail some of the complexities of health care data for those who might enter this growing field. Of course these considerations aren't unique to health care, but they are typical of the challenges that the new health care application developer or analyst might face.

Policyholders and People

At first glance in a health care data model, the entities Policyholder and Individual (or Person) might seem like the same thing, but the difference is very important. One person can be a policyholder under more than one plan. For example, someone may be covered under their employer's plan and also covered as a dependent under their spouse's plan. (In the applications I've seen, both the covered person and their dependent are considered "policyholders.") If a person is covered under more than one plan, and the plans are managed in different databases, systems can have difficulty matching the separate records and recognizing that the different policyholder records represent the same individual.
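To make the distinction concrete, here is a minimal sketch of linking policyholder records to a person. It is not how any particular system does it: the field names, the tiny nickname table, and the choice of normalized name plus birth date as the match key are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"tom": "thomas", "bill": "william", "liz": "elizabeth"}


@dataclass(frozen=True)
class Policyholder:
    policyholder_id: str  # unique per plan, not per person
    plan: str
    first_name: str
    last_name: str
    birth_date: str  # ISO date string such as "1970-03-14"


def match_key(rec):
    """Build a crude person-level key from a policyholder record."""
    first = rec.first_name.strip().lower()
    first = NICKNAMES.get(first, first)  # fold known nicknames together
    return (first, rec.last_name.strip().lower(), rec.birth_date)


def group_by_person(records):
    """Group policyholder IDs that appear to belong to one individual."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec.policyholder_id)
    return list(groups.values())


records = [
    Policyholder("A1", "employer plan", "Tom", "Smith", "1970-03-14"),
    Policyholder("B7", "spouse plan", "Thomas", "Smith", "1970-03-14"),
    Policyholder("C3", "employer plan", "Ann", "Lee", "1982-11-02"),
]
person_groups = group_by_person(records)
```

An exact key like this will both miss true matches (a mistyped birth date) and merge distinct people (two people with the same name and birth date), which is why matching in practice usually needs more evidence and some human review.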

This difficulty in "member matching" can complicate innovative new approaches like care management. Care management involves analyzing health records and intervening with individuals to reduce their health risk. For example, frequent emergency room visits for asthma may indicate Chronic Obstructive Pulmonary Disease (COPD), and if the patient is a smoker then a care management provider may begin intervention to promote smoking cessation.

Claims data is the most accessible source of health information because it unifies data from the many hospitals, practices, networks, clinics, etc., which may be involved in an individual's care. If care management analysis is based on policyholder claims, then there's always the risk that the system will miss patterns because it fails to connect health records for different policyholders who happen to be the same person.

Sheer Data Volume

Health care data becomes very large very quickly. Every doctor's visit results in a clinical record and a claim, or more likely a number of clinical records and claims for the doctor visit, labs, and prescriptions. So each visit might result in many rows added to tables representing treatment encounters, claims, and payments, each with relationships to tables representing providers, facilities, policies and plans, employer groups, etc. Of course a hospital visit would result in storage of many more records than a doctor visit. Insurance companies store all of this claims and treatment history along with member demographics, records of member privacy choices, and policy coverage detail and history of changes.

As the number of covered policyholders grows into the tens of millions, typical problems of scale emerge, and highly qualified database administrators must optimize servers and databases in situation-specific ways to ensure adequate transaction throughput, reporting performance, security, and integrity.

Even if database management concerns are overcome, these two additional issues generally emerge on projects involving health care data:

  • Hidden errors in membership data: In addition to the Policyholder/Person ambiguity noted above, very large member databases tend to harbor many duplicate records for the same individual due to data entry errors or differences. For example, it is typical to find one record for "Tom Smith" and another for the same person under "Thomas Smith", one of them with a typo in the address. Improving membership data quality is typically beyond the scope of a project with other business objectives, and as a result such ambiguities tend to be addressed at the project level with patches and workarounds.
  • Use of live data in test: In application development, sometimes it is prohibitively difficult to simulate real-world complexity and size in a test environment, so many organizations developing applications to run against very large databases test on live data. In the health care world this means allowing development and test staff access to protected health information (PHI), and therefore putting in place the same background checks and non-disclosure agreements for developers and testers as for health care staff and claims administrators. Some organizations instead put extensive resources into masking PHI or developing large and complex simulation databases. These solutions can be useful even though they displace the problem rather than solve it: those who develop the simulations are application developers who require access to PHI, and the simulation database must be tested extensively to ensure it is an accurate semblance of the original.
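As one illustration of what masking PHI can look like, here is a minimal sketch that replaces identifying fields with keyed hashes before records reach a test environment. The field names and the key handling are assumptions, and this is only one of several masking approaches. Keyed hashing keeps the same input mapping to the same token, so joins across tables still work, but on its own it would not be enough to de-identify a real data set.

```python
import hashlib
import hmac

# Assumption: the real key is managed outside the test environment.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical names of the fields treated as identifying.
PHI_FIELDS = ("name", "ssn", "address")


def pseudonymize(value, key=SECRET_KEY):
    """Return a stable, keyed token in place of an identifying value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in test data


def mask_record(record, phi_fields=PHI_FIELDS):
    """Copy a record, replacing identifying fields with tokens."""
    return {
        field: pseudonymize(str(value)) if field in phi_fields else value
        for field, value in record.items()
    }


original = {
    "name": "Thomas Smith",
    "ssn": "000-00-0000",
    "address": "1 Main St",
    "diagnosis_code": "J45",
}
masked = mask_record(original)
```

Note that the non-identifying fields pass through unchanged, which is the point of the exercise for testing but also the weakness: dates, rare diagnoses, and small geographic areas can still identify a person in combination.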

Integrating Many Inconsistent Sources

Claims and care management both pull data from many different networks, hospitals, customer companies, and other data sources, each with different data quality standards, different business processes, and different lead times to data extraction. A data element critical to a care management organization, for example, may be of poor quality because it is not relevant to the operational needs of the behavioral health provider that supplies it. Additionally, a medical imaging provider with impeccable paper records may send them to a less-than-stellar data entry firm for transcription, degrading the quality of outbound data with random errors. Profiling, quality assessment, and perhaps adjustment of project goals to match available data quality are critical early steps when integrating data from various health care sources.
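A simple first profiling step is to measure, per source, how often a critical field is actually populated. The sketch below assumes records arrive as dictionaries with a "source" key; that layout, the source names, and the example field are hypothetical.

```python
from collections import defaultdict


def completeness_by_source(records, field):
    """Return the fraction of records per source with a non-empty field."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for rec in records:
        source = rec["source"]
        totals[source] += 1
        value = rec.get(field)
        # Treat missing keys, None, and blank strings as unpopulated.
        if value is not None and str(value).strip() != "":
            filled[source] += 1
    return {source: filled[source] / totals[source] for source in totals}


claims = [
    {"source": "behavioral_health", "smoker_flag": ""},
    {"source": "behavioral_health", "smoker_flag": "Y"},
    {"source": "primary_care", "smoker_flag": "N"},
    {"source": "primary_care", "smoker_flag": "Y"},
]
profile = completeness_by_source(claims, "smoker_flag")
```

A low rate for one source is a prompt to talk to that source about how the field is captured, not proof that the populated values are correct; completeness is only one dimension of data quality.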

Structured versus Unstructured Data

Much valuable health care data is currently locked up in doctors' notes, call center recordings, charts, and other non-digital or free-form sources. The industry is beginning to scratch the surface of unlocking unstructured data, for example in imaging systems, but in general has yet to use emerging approaches like text mining to extract information from these unstructured sources.
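To give a flavor of the simplest kind of text mining, here is a rough sketch that pulls a smoking status out of a free-text note using keyword patterns. The phrases and the status labels are my own assumptions, and real clinical text is far messier than these rules can handle; treat it as an illustration of the idea rather than a usable extractor.

```python
import re

# Hypothetical phrase lists; checked in order from most to least specific.
NEGATED = re.compile(
    r"\b(non-?smoker|never smoked|denies (?:smoking|tobacco))\b", re.I
)
FORMER = re.compile(r"\b(former smoker|ex-smoker|quit smoking)\b", re.I)
CURRENT = re.compile(r"\b(smoker|smokes|smoking)\b", re.I)


def smoking_status(note):
    """Classify a note as non-smoker, former smoker, current smoker, or unknown."""
    if NEGATED.search(note):
        return "non-smoker"
    if FORMER.search(note):
        return "former smoker"
    if CURRENT.search(note):
        return "current smoker"
    return "unknown"


notes = [
    "Pt is a non-smoker.",
    "Former smoker, quit 2015.",
    "Patient smokes one pack per day.",
    "Follow-up for asthma.",
]
statuses = [smoking_status(note) for note in notes]
```

The order of the checks matters: "non-smoker" and "former smoker" both contain the word "smoker", so the more specific patterns have to be tested before the general one.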

Of course there are many more complexities the developer faces in working with health care data, and there are other approaches to the ones I've covered beyond those detailed here. Look for more at this site on health care considerations in coming months.