It seems like every week we are in the midst of a paradigm shift in the data space. Enterprise adoption of open-source technologies and cloud-based architectures can make it seem like you are always behind the curve. Most of our top clients have taken a leap into big data, but they are struggling to see how these solutions solve business problems. The fact that some big data technologies are now considered legacy after less than five years of use shows how complex and dynamic this space has become. Tools come and go, but streaming data is fundamentally changing how companies approach architecting solutions and accessing data going forward.

Architecting a traditional relational data solution is very prescriptive. When you try to analyze or create insights in this paradigm, you have a lot of supporting systems and code. These systems typically have to:
  • Connect to a source system to extract data
  • Apply business rules or logic through transformations/ETL
  • Load data into a data warehouse optimized for storage
  • Load/aggregate data within a data mart optimized for reporting
  • Build a report or structure to analyze the data
This creates many challenges. Some common themes are:
  • Batch processing creates stale data - average data age is measured in days (or sometimes months)
  • Batch processing can irreversibly transform source data - data consumers don't know where data originated or what business logic was applied
  • Raw source data is not persisted - limits the ability to reprocess data if business logic changes or additional rules are developed
So, why should you care about streaming data in an ever-changing landscape?

Streaming Data is Faster Data

Time is money. Capturing data and making it available quickly within an organization will be a differentiator for companies adopting a modern data architecture. Suppose a customer runs into an issue while applying for a mortgage on a bank's website and calls the customer service line for help. What if the customer service representative could know exactly what page the customer is on, what he or she was trying to do, and the specific error being displayed at the moment the customer calls? This would fundamentally change the way that service reps coach customers into becoming more self-sufficient.

One common misconception with streaming is that all data needs to be delivered in near real time. That is possible, but whether it is necessary varies by use case, and lower latency comes with additional cost. The main points to consider with streaming data are:
  • How much latency is acceptable on new data?
  • What volume of data are you working with?
  • Can the records be processed individually?
In the banking example above, near-real-time data is required, but an insurance claims processing engine may be satisfied with a micro-batching process that runs every five minutes. Either way, the streaming paradigm is significantly faster than many current solutions.
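To make the micro-batching idea concrete, here is a minimal Python sketch that groups a time-ordered stream of records into fixed five-minute windows. The record shape, the `micro_batches` function, and the sample claims are all hypothetical, chosen only for illustration; a production system would typically rely on a streaming framework rather than hand-rolled windowing.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: (event time in seconds, payload).
Record = Tuple[float, str]


def micro_batches(records: Iterable[Record], window_seconds: float) -> Iterator[List[Record]]:
    """Group time-ordered records into consecutive windows of window_seconds."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    batch: List[Record] = []
    window_end = None
    for ts, payload in records:
        if window_end is None:
            # The first record opens the first window.
            window_end = ts + window_seconds
        # Close the current batch once a record falls outside its window.
        while ts >= window_end:
            if batch:
                yield batch
                batch = []
            window_end += window_seconds
        batch.append((ts, payload))
    if batch:
        yield batch


# Illustrative claims arriving at 0s, 120s, 310s, and 650s.
claims = [(0, "claim-1"), (120, "claim-2"), (310, "claim-3"), (650, "claim-4")]
batches = list(micro_batches(claims, window_seconds=300))
for batch in batches:
    print(batch)
```

With a 300-second window, the first two claims are processed together and the later ones land in their own batches. Shrinking the window moves the process toward record-by-record delivery, which is the latency versus cost trade-off described above.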

Streaming Data is More Available

Another common practice with a streaming paradigm is to create a streaming hub. This eliminates the point-to-point connections commonly found within an ETL architecture. "Data Democratization" is a term I hear many clients discussing. In short, it means not storing data in silos. A streaming data hub supports sharing data across departments or lines of business and can significantly increase analytics and insight opportunities. Having a single view of a customer across all product offerings not only creates a streamlined experience for them, but it also allows you to better understand each customer's product utilization, behavior patterns, and willingness to try new products.

Data Democratization also instills the concepts of data producers and data consumers within the organization. Data producers are typically charged with making sure that all data is captured reliably to minimize data quality issues and produce it into the ecosystem quickly. Data consumers are primarily concerned with having that single view of the customer, knowing where data originated, understanding what logic was applied along the way, and accurately reporting results. The streaming data hub only underscores the need for a robust data governance policy to ensure that information is shared effectively, appropriate data security rules are enforced, and data quality checks are implemented.
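The producer and consumer roles around a hub can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the `StreamingHub` class, the `web-events` topic name, and the event fields are hypothetical, and a real hub would add persistence, security, and the governance controls mentioned above.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class StreamingHub:
    """Toy in-memory hub: producers publish to a topic, consumers subscribe to it."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # The producer never addresses a consumer directly, so there are
        # no point-to-point connections to maintain.
        for handler in self._subscribers[topic]:
            handler(event)


hub = StreamingHub()

# Two departments consume the same stream without knowing about each other.
support_view: List[Event] = []
marketing_view: List[Event] = []
hub.subscribe("web-events", support_view.append)
hub.subscribe("web-events", marketing_view.append)

# The website acts as the data producer.
hub.publish("web-events", {"customer_id": 42, "page": "mortgage-application", "error": "E-17"})
```

Adding a third consumer is a single `subscribe` call and requires no change on the producer side, which is the property that lets a hub replace a web of pairwise integrations.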

Streaming Data is More Flexible

In today's agile environment, flexibility is key: iterative development, experimentation, and failing fast are the norm. Cloud-based architectures offer an environment where storage is relatively cheap. Many leading organizations are realizing the benefits of data experimentation. These companies typically choose an architecture where two versions of data are stored - the raw form as it was originally captured and the enriched data with business transformations applied. The streaming paradigm is central to this experimentation approach because it serves data rapidly to support prototyping and deliver insights quickly.
  • You can reprocess data as things change. If the business rules that you've defined within your organization change, you now have historical raw data that you can reprocess to generate new answers or have a clear view into the lineage of events. Organizations can analyze this enriched data to discover new trends or correlations and engineer new data streams for organizational consumption.
  • You can future-proof your data as you define new metrics. If your business model changes and you define a new key performance indicator, streaming architectures allow you to process historical data to "seed" new metrics - keeping you from having to start from scratch. This gives your organization the ability to see what a measurement would look like when applied to existing data and offers the opportunity for more proof of concept exploration. When modeling these new KPIs, you might be able to generate some insight that you may have disregarded before. If the KPI doesn't make sense, there are opportunities to tweak or scrap it with minimal disruption to existing business processes.
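Both points can be illustrated with a short Python sketch: raw events are kept untouched, business rules are applied as a separate step, and a new KPI is computed over the full history. The event fields, the thresholds in `enrich_v1` and `enrich_v2`, and the `products_per_customer` KPI are all invented for this example and do not come from any particular system.

```python
from typing import Any, Callable, Dict, List

RawEvent = Dict[str, Any]

# Raw events are persisted exactly as captured and never modified.
raw_log: List[RawEvent] = [
    {"customer_id": 1, "product": "checking", "amount": 120.0},
    {"customer_id": 1, "product": "mortgage", "amount": 900.0},
    {"customer_id": 2, "product": "checking", "amount": 40.0},
]


def enrich_v1(event: RawEvent) -> RawEvent:
    # Original (hypothetical) business rule: high value means 500 or more.
    return {**event, "high_value": event["amount"] >= 500}


def enrich_v2(event: RawEvent) -> RawEvent:
    # Revised (hypothetical) business rule: the threshold drops to 100.
    return {**event, "high_value": event["amount"] >= 100}


def replay(log: List[RawEvent], enrich: Callable[[RawEvent], RawEvent]) -> List[RawEvent]:
    """Rebuild the enriched data set by reapplying a rule to the raw history."""
    return [enrich(event) for event in log]


def products_per_customer(log: List[RawEvent]) -> Dict[int, int]:
    """A new KPI, seeded from history: distinct products held by each customer."""
    products: Dict[int, set] = {}
    for event in log:
        products.setdefault(event["customer_id"], set()).add(event["product"])
    return {customer: len(held) for customer, held in products.items()}


enriched_old = replay(raw_log, enrich_v1)
enriched_new = replay(raw_log, enrich_v2)
kpi = products_per_customer(raw_log)
```

Because the enrichment functions return new dictionaries rather than editing the raw events, the rule change and the new KPI both run against the same unmodified history, and either can be discarded without affecting the other.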
Ultimately, the mechanics of streaming data offer benefits that can fundamentally change data processing for organizations. Much like a cloud migration, streaming data is a journey that impacts technical and business users from all departments. Organizations that think broadly about the data their applications generate, and about how to leverage that data in a more predictive analytics environment, will gain a market differentiator. This in turn feeds more mature machine learning and artificial intelligence algorithms, which is where the market is headed. Tools may come and go, but streaming data is here to stay.