A recent blog post by Raju Bodapati sparked a conversation with my coworker, Cameron Snapp, last night. In his post, Raju mentions the large amount of data that can be gathered from simple day-to-day tasks. Cameron and I took that further, commenting that basically anything you do can have absurd amounts of data collected on it. Sports analytics is a fairly established and easy example: the amount of baseball data, from pitching attributes to hitting quality to defensive alignments, is massive and is quickly becoming a target of data architects in Big Data efforts.

This prompted the question "how do you make that first slice?", on the assumption that once you have made a first pass at the data, you can then figure out where you're going with the mass of data you have. Butchers are used to a standard animal (most cows are… well… cow-shaped), so they know where to cut out that ribeye or fillet. Users of Big Data, on the other hand, don't have a standard starting point. We don't know if that first cut will yield an easy starting point for making the data useful, or if we've just inadvertently split the choicest cut and made valuable insight harder to identify.

For years, we IT database professionals have designed transactional and warehouse systems to house the information for analytics and BI. These designs are founded on understanding the business problems and reporting needs. Big Data presents a new problem because the slices of information aren't always known until they're discovered. Verizon has just filed for a patent on a DVR which can monitor the physical locations, facial expressions, conversation topics, etc. of its viewers. Think of the data being harvested here! Obviously there are targeted marketing opportunities, but Verizon will face the immediate issue of figuring out whether to slice the data by demographic, show viewed, time of viewing, amount viewed, how closely viewers are paying attention, or whether they are using mobile devices in tandem. The questions, and the slicing possibilities, are endless. So how will IT professionals design the system and know how to start cutting?

To continue the analogy, is there some method we can use to separate our mass of data into primal cuts? Take Raju's driving data example: a mechanic knows his primals may include drivetrain, suspension, and environment, but hand that same mechanic weather data and he likely will not know how to make his first cut. I suspect a heavy reliance on subject matter experts will always be necessary, rather than a generic approach which applies to all scenarios. There will eventually be people who can do Big Data analytics and know where to start, but will they always be subject-specific? For the foreseeable future, successful Big Data projects will require a partnership between SME and 'technician'; it could remain hard to find a butcher who can handle both cows and sharks.