I have worked at multiple companies where the implementation of Cassandra resulted in an inability to query the operational data. In each of these cases, they had to build a parallel Extract, Transformation and Load (ETL) process that "replicated" the data to an RDBMS to support ad-hoc queries. This is one of the biggest pitfalls of implementing NoSQL tools and it can be avoided with a little pre-implementation planning.

A traditional RDBMS database is designed by modeling the data before implementing the physical storage structures (tables). The tables, relationships and indexes are optimized to support the primary data access paths; however, the relational technology supports many other data access paths. It is relatively simple to improve the performance of these additional data access paths through the implementation of indexes and other tools in the persistence layer.

For example, a system is built with a Recoding table that stores the recording of programs on a device. The primary key is Device IP Address, Program and Date (e.g., NOVA, 20140226190000). The data access path for the application code is Device IP Address. The Recording table will efficiently support applicaton operations on a single device. Other users will likely want to aggregate data across all devices for analysis. For example, how many devices recorded a program on a given date? The RDBMS can support this query without any modifications to the persistence layer.

It is much different in the NoSQL world. The data storage structures are created based on all anticipated data access paths. Each of the data structures support a single data access path. Continuing from the previous example, a Recording column family is created in Cassandra with a composite row key of Device IP Address, Program, and Date (e.g. This column family will provide very fast data access to the application for all queries at the Device IP Address grain. The difference between Cassandra and an RDBMS is that ALL queries MUST contain the Device IP Address in the predicate. The query will result in a full column family scan or an error in Cassandra if the Device IP Address is not in the predicate. In other words, there is no way to query by Date or any other attribute.

Additional data access paths could be supported by creating secondary indexes, by using 3rd party tools that index the data (e.g. Solr) or by creating additional column families that need to be populated by the application (e.g. duplication of data). For example, a RecordingByDate column family could be created that is keyed by Date.

All of these options require a significant amount of effort to support the unanticipated data access path. These data access paths are supported by an RDBMS without any modifications to the physical storage structures.

It is a common misconception that data modeling is not necessary in the NoSQL world since the NoSQL platforms tend to be schemaless. The example above illustrates the importance of thinking about the application data access paths and any other data access paths that may be required to support downstream consumers. A Data Architect and business stakeholder(s) need to be involved in the implementation of NoSQL platforms to ensure that the data will be accessible to the application and other consumers. A little pre-implementation can prevent a NoSQL implementation from becoming an operational data vault.