Historically, businesses have been able to establish effective data governance programs because governance was built into the design of the data store itself. In today's big data environment, that built-in support largely disappears, and both data governance and data science may suffer as a result.

I recently wrote a white paper, "A strategy for establishing data governance in the big data world," that focuses on these issues and describes a strategy that CapTech has developed for addressing them. The paper also discusses the evolution of data storage since the late 1980s, when I started working in the technology industry, and how data governance has been affected by this evolution. This blog provides a brief overview of the white paper.

Data governance consists of the policies and practices that an organization establishes to manage its data assets. Ineffective data governance can create problems in such critical areas as traceability and data lineage, metadata, data quality, and data security. The implications are serious not only for data scientists, but also for the business as a whole. Without effective data governance:

  • Analysts and other data consumers can't find the data they are looking for.
  • The quality of data in storage is poor.
  • It's unclear where data came from and who has handled it.
  • Controls on data access aren't in place.

Built-in data governance

In the late 1980s, when I was doing COBOL programming on mainframe systems, we stored data in flat file structures, and the metadata amounted to little more than folder names. Those names offered some insight into the type of data stored in any particular folder and supported a rudimentary level of data governance.

In the 1990s, relational database management systems (RDBMSs) became widespread, increasing operational efficiency by reducing the amount of data required for transactions (e.g., credit card purchases) as well as the need for costly storage. Data governance was enhanced by the schema itself: table names, field names, and definitions for each.

Soon after, online analytical processing and the data warehouse were introduced, helping organizations leverage data for analysis and decision-making. Data governance arguably became more important in this model, as metadata was extended into the data warehouse.

The big data environment

Since 2004, a variety of big data technologies have emerged, enabling businesses to collect unprecedented volumes and varieties of data. Because these technologies leverage commodity hardware, storage costs have dropped significantly, leading to an ever-increasing quantity of data. But because these technologies don't assign metadata to files, analysts and other data consumers have trouble determining what data they have in storage. That makes it difficult to run analyses that convert data into business insights, and equally difficult to implement data governance initiatives.

CapTech has developed a strategy that addresses these problems through a process called ingestion registration: associating metadata with files as they are brought into the storage area, often referred to as a data lake. Beyond capturing high-level metadata, ingestion registration provides a level of lineage or traceability and can support both data security and data quality checks. A minimal sketch of the idea appears below.
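To make the pattern concrete, here is a short Python sketch of what an ingestion-registration step might look like. It is an illustration only, not CapTech's implementation; the names used here (register_ingestion, CatalogEntry, the catalog.jsonl file) are assumptions made for the example.

    # Hypothetical sketch of an ingestion-registration step -- not
    # CapTech's actual implementation. All names are illustrative.
    import hashlib
    import json
    import os
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    @dataclass
    class CatalogEntry:
        """High-level metadata captured when a file lands in the lake."""
        path: str            # where the file lives in the data lake
        source_system: str   # lineage: where the data came from
        owner: str           # accountable party; supports access control
        ingested_at: str     # when the file arrived
        size_bytes: int      # basic quality signal (e.g., zero-byte files)
        sha256: str          # checksum for traceability and integrity
        classification: str  # security tag, e.g., "public" or "pii"

    def register_ingestion(path: str, source_system: str, owner: str,
                           classification: str = "internal") -> CatalogEntry:
        """Compute and record metadata for a newly landed file."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        entry = CatalogEntry(
            path=path,
            source_system=source_system,
            owner=owner,
            ingested_at=datetime.now(timezone.utc).isoformat(),
            size_bytes=os.path.getsize(path),
            sha256=digest.hexdigest(),
            classification=classification,
        )
        # Append to a simple JSON-lines catalog; a real system would use
        # a metadata repository or catalog service instead.
        with open("catalog.jsonl", "a") as catalog:
            catalog.write(json.dumps(asdict(entry)) + "\n")
        return entry

In practice the catalog would live in a metadata repository or catalog service rather than a local file, but the principle is the same: every file that enters the lake is registered at the moment it lands, so consumers can discover it, trace its origin, and apply the appropriate security and quality controls.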

I liken the strategy to the use of a fish finder in a large body of water. Just as the fish finder helps anglers determine where fish are likely to be found, metadata helps data scientists identify the best places to look for business insights.

In upcoming blogs, I'll discuss the evolution of data storage over the past 25-plus years and how the CapTech strategy addresses data governance issues in the big data environment.

Click here to download the full white paper: A Strategy for Establishing Data Governance in the Big Data World