Data Cleansing
We described how to combine data from several sources and manage it, but what about the data itself? In the example portrayed at the beginning of this article, one of the phone numbers was apparently wrong; combining the two numbers would not solve the problem. Generally speaking, if the business processes are fed with outdated, incomplete, or simply incorrect data, how can one expect up-to-date, complete, and correct output?
Here Data Cleansing processes come into play. These processes try to identify incorrect, corrupt, or incomplete data and then attempt to correct it wherever possible. Identification can be based on sets of rules. For instance, a Social Security Number should always be 9 digits long and conform to the format described here. Thus, any entry in the SSN data field that does not conform to those rules could be flagged and routed to a remediation process.
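As a minimal sketch of such a rule-based check, the validator below flags SSNs that are not 9 digits or that violate the standard Social Security Administration constraints (area numbers 000, 666, and 900-999, group 00, and serial 0000 are never issued). The function name and the remediation-flag convention are illustrative assumptions, not part of any particular MDM product:

```python
import re

def flag_invalid_ssn(ssn: str) -> bool:
    """Return True if the SSN should be flagged for remediation."""
    # Must be exactly 9 digits, optionally formatted as AAA-GG-SSSS.
    digits = ssn.replace("-", "")
    if not re.fullmatch(r"\d{9}", digits):
        return True
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    # Area 000, 666, and 900-999 are never issued;
    # group 00 and serial 0000 are likewise invalid.
    if area in ("000", "666") or area >= "900":
        return True
    if group == "00" or serial == "0000":
        return True
    return False
```

An entry for which `flag_invalid_ssn` returns `True` would then be routed to the remediation process rather than loaded as-is.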
Sometimes it is difficult to determine if the data is correct and complete; if this is the case, data can be flagged and fed into a manual review process where human experts review the data and perform corrective actions.
Of course, one way of dealing with incomplete or outdated data is to simply ignore it and exclude it from any further processing. Sometimes that might be the best solution available. For instance, why should a computer hardware store care about 10-year-old customer data? Their business and customer base are changing so quickly that such old data could be irrelevant to today’s business decisions. When old data is likely not needed anymore to support the business processes one could simply archive that data “as-is” and move on.
Assuring that the data is of high quality, and defining the steps to maintain that quality, falls under the responsibility of a Data Steward. This role requires a strong understanding of both the technical aspects and the business side of the organization, acting as the liaison between the IT department and the business.
Implementation
The implementation of MDM starts with People.
Based on their knowledge of business requirements, these experts will
specify which areas of data (often also called “data entities”) should
be put under MDM control. They also need to define the ownership of the
data; that is, who is responsible for which data. Further, they set up
the rules that determine the flow of information within the organization
– and outside of the organization if applicable. An important task is
to define rules and algorithms that handle situations when multiple
sources could be used to obtain the same piece of data. Above we
outlined the situation that a customer is known both as “John Smith” and
“John E. Smith,” and that the business rule specifies that the name
with the most characters should be taken. During the implementation
these are the kind of business rules that must be established, tested,
and implemented.
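The "name with the most characters" rule from the example above is a simple survivorship rule and is straightforward to express in code. A minimal sketch (the function name is hypothetical):

```python
def pick_name(candidates: list[str]) -> str:
    """Survivorship rule: prefer the candidate name with the most characters."""
    return max(candidates, key=len)

# "John E. Smith" is longer than "John Smith", so it survives.
pick_name(["John Smith", "John E. Smith"])
```

During implementation, rules like this one would be established with the business experts, then tested against real source data before being deployed.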
Once those decisions have been made, we need to define Processes.
For instance, the exact processes to route the data automatically from
their sources to the destination must be worked out. But what happens if
the automatic workflow fails, e.g., because data records violate
business rules? Coming back to our previous example, what if there is a
third source listing the same customer as “John ED Smith”? Here the
“take the longest name”-rule would yield an ambiguous result. In cases
like this, manual workflows might have to be created that allow for
review and decision. Also, it must be determined who or what can author
(create, modify, delete) data. Finally, validation processes that assess
the completeness and correctness of the data have to be established,
e.g., address information could be validated using a USPS address
checker.
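The "John ED Smith" case can be made concrete: "John E. Smith" and "John ED Smith" are both 13 characters long, so the "take the longest name" rule ties. A sketch of a resolver that applies the rule automatically and routes ties to a manual-review workflow (function name and return convention are illustrative assumptions):

```python
def resolve_name(candidates: list[str]):
    """Apply the 'longest name' rule; route ties to manual review.

    Returns (winner, None) on a unique result, or (None, tied_candidates)
    when a human has to decide.
    """
    longest = max(len(c) for c in candidates)
    winners = [c for c in candidates if len(c) == longest]
    if len(winners) == 1:
        return winners[0], None
    # Ambiguous: more than one candidate of maximal length.
    return None, winners

resolve_name(["John Smith", "John E. Smith"])
# -> ("John E. Smith", None): the rule decides automatically.

resolve_name(["John Smith", "John E. Smith", "John ED Smith"])
# -> (None, ["John E. Smith", "John ED Smith"]): tie, manual review needed.
```

In a real implementation, the second return value would feed the manual workflow in which an expert reviews the tied records and makes the final call.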
Now that we know what data we want to put under MDM and how to process the data, it is time to look at the Technology
that enables us to implement all of this. The IT department will need
to furnish the appropriate infrastructure (on-premise or in the cloud),
such as servers, databases, and network connections, while the software
developers will implement algorithms and – if required – programming
interfaces that allow the data to be shared.
Maintaining High Data Quality
Depending on the MDM strategy, there will be an initial migration of data from the source to the target system(s), during which the data is run through all of the steps described above. Then, as new data arrives, that information must be combined with the already existing (“historic”) data.
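Combining arriving records with historic data can be sketched as follows. This assumes records are dictionaries keyed by a `customer_id` and that non-empty incoming fields overwrite historic ones ("newest value wins"); both the field names and that survivorship policy are illustrative assumptions:

```python
def merge_incoming(historic: dict, incoming: list[dict]) -> dict:
    """Fold incoming records into the historic master data."""
    for record in incoming:
        key = record["customer_id"]
        if key in historic:
            # Update only fields actually present in the incoming record,
            # so partial updates do not erase historic values.
            updates = {k: v for k, v in record.items() if v is not None}
            historic[key].update(updates)
        else:
            historic[key] = dict(record)
    return historic
```

In practice, each incoming record would first pass through the cleansing and validation steps described above before being merged.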
Continuously detecting data issues and quickly remediating them is vital, because data quality will otherwise deteriorate, just as the telephone number scenario illustrates: if you had deleted the outdated number as soon as you received your friend's new contact information, you would have avoided a lot of trouble.
The Costs of (not) doing MDM
Master Data Management is not easy (or cheap) to achieve. However, Harvard Business Review estimated that in the US alone, bad data carries a cost of over $3 trillion per year. Given this very large figure, every organization should seriously consider taking steps toward MDM; correct and complete data that is easy to access is a valuable asset. Accurate reporting based on good data yields an important understanding of what went right and wrong in the past, and those insights lead to data-driven decisions that can shape the future. As in life, the best advice is to start small: you could begin with one data entity where a clean-up and MDM effort would yield immediate improvements. Over time, you can scale this process, bringing your full (pertinent) data under MDM control.