Blog February 27, 2018

Data Science: The Positives and the Pitfalls

What is it?

Data science is a nebulous term that's thrown around all the time, and nobody seems to know exactly what it is. At CapTech, we have our own definition of data science, which doesn't always match our clients' definition when they come to us with data science projects. It is clear that there is still a lot of confusion around the field.

Why is it so confusing?

One reason is that there are a lot of buzzwords around data science: neurocomputing, machine learning, data mining and natural language processing are just a few. These are all components of data science, and some are disciplines within it. People in the field may have a specialized degree in analytics or a master's degree in statistics or math, and all of these skills and training play into the industry. Because people with such diverse backgrounds are practicing data science, we end up with many different definitions. Ultimately, though, data science boils down to using data to predict and influence the future.
You can do that with math and statistics. You can use a neural network or machine learning to try and figure out what might happen, but all of these different disciplines are working towards the same goal. Data science can be applied to all sorts of things, whether it's predicting consumer behavior or predicting where your autonomous car is going to drive.

You Need Experts

To do this (and do it well), you need the combined knowledge of different technical disciplines. However, unlike data engineering, you also need subject matter expertise. For example, if you're trying to use data to figure out who should be approved for a credit card, you need to understand banking and credit cards. You can't just send in a data scientist, apply a linear regression model and expect to achieve useful results.

You need metrics for tracking within the domain to which you're applying data science. For a successful data science solution, you need three types of skills:

  • Advanced math and statistics
  • Data engineering and software development
  • Business-oriented subject matter expertise

Now, possessing these skills doesn't mean that the models you build will be 100% accurate. When creating these solutions, it is crucial to have a dataset with known outcomes that you can test against. Typically, part of that data (the training dataset) is used to build the model, and a held-out part (the test dataset) is used to check it. For the credit card example above, you would take data with a known result you are trying to predict and run it through the predictive model to validate accuracy. Was it 50% accurate or 90% accurate? Knowing that, you can refine the model and the dataset until you reach an acceptable margin that works for your business-driven use case.
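
As a rough illustration of checking a model against data with known outcomes, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the applicant records are randomly generated, the "model" is a single debt-ratio cutoff, and the function names are hypothetical. It is not how a real credit decision would be built.

```python
import random

def make_applicants(n, seed=0):
    """Generate hypothetical (income, debt_ratio, repaid) records."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        income = rng.uniform(20_000, 120_000)
        debt_ratio = rng.uniform(0.0, 1.0)
        # Invented "truth" plus noise, purely for the demo.
        score = income / 120_000 - debt_ratio + rng.gauss(0, 0.2)
        rows.append((income, debt_ratio, score > 0))
    return rows

def accuracy(rows, cutoff):
    """Share of rows where 'approve if debt_ratio < cutoff' matches the known outcome."""
    correct = sum((debt < cutoff) == repaid for _, debt, repaid in rows)
    return correct / len(rows)

def fit_threshold(train):
    """Pick the debt-ratio cutoff that is most accurate on the training rows."""
    candidates = [step / 100 for step in range(101)]
    return max(candidates, key=lambda cutoff: accuracy(train, cutoff))

rows = make_applicants(1000)
train, test = rows[:800], rows[800:]   # fit on 800 rows, hold out 200

cutoff = fit_threshold(train)
train_accuracy = accuracy(train, cutoff)
test_accuracy = accuracy(test, cutoff)
print(f"cutoff={cutoff:.2f} train={train_accuracy:.0%} test={test_accuracy:.0%}")
```

The number to take to the business is the accuracy on the held-out rows, since those are the ones the rule never saw while it was being tuned.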

In theory, you could build a model that is 100% correct on the data you already have, but chasing 100% accuracy tends to produce a model that is overly restrictive and not at all predictive for new cases. The resulting output must be evaluated by the business to determine whether the result is acceptable.
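
One way to see how a "perfect" model can fail to predict anything is to imagine a model that simply memorizes the known data. The sketch below is an assumption-laden toy, not a real technique anyone should deploy: the applicants are randomly generated, and `memorizer` and `simple_rule` are hypothetical stand-ins for an over-fitted model and a modest one.

```python
import random

rng = random.Random(1)

def make_rows(n, first_id):
    """Hypothetical applicants: (applicant_id, debt_ratio, repaid)."""
    rows = []
    for applicant_id in range(first_id, first_id + n):
        debt_ratio = rng.random()
        # Invented outcome: lower debt usually means repaid, with some noise.
        repaid = debt_ratio + rng.gauss(0, 0.15) < 0.5
        rows.append((applicant_id, debt_ratio, repaid))
    return rows

known = make_rows(500, first_id=0)      # data we already have answers for
unseen = make_rows(500, first_id=500)   # new applicants

# The memorizer stores every known applicant's outcome verbatim.
memory = {applicant_id: repaid for applicant_id, _, repaid in known}

def memorizer(applicant_id, debt_ratio):
    # Perfect on applicants it has seen; declines everyone else.
    return memory.get(applicant_id, False)

def simple_rule(applicant_id, debt_ratio):
    # A deliberately modest rule that only looks at debt ratio.
    return debt_ratio < 0.5

def accuracy(model, rows):
    return sum(model(i, d) == r for i, d, r in rows) / len(rows)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(f"{name}: known={accuracy(model, known):.0%} unseen={accuracy(model, unseen):.0%}")
```

The memorizer scores 100% on the known data and then rejects every new applicant, which is restrictive and not predictive; the simple rule is imperfect on both sets but carries over to new cases.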

A good example of this is preventing money laundering. Banks and law enforcement are trying to catch people illegally moving money, and it takes a lot of manual analysis for investigators to sift through data systems and search for suspicious activity. But if we had an algorithm that looked at a large amount of data and could predict that there's an 80% chance someone is laundering money, that would provide a better starting point for an investigation. An 80% prediction is far from certain, but it's far better than manually looking at all of the data. Of course, for something like money laundering, a model that's 80% accurate might be acceptable, but it's important to remember that in other disciplines (like self-driving cars) much higher thresholds would need to be put in place; you don't want your car to be only 80% sure that something is a red light.
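
To make the idea of a threshold concrete, here is a small Python sketch. The account IDs, the scores and the `flag_for_review` helper are all hypothetical; the point is only that the same scores produce very different workloads depending on the cutoff the business chooses.

```python
# Hypothetical model output: account id -> estimated probability of laundering.
scores = {
    "acct-001": 0.05,
    "acct-002": 0.83,
    "acct-003": 0.40,
    "acct-004": 0.91,
    "acct-005": 0.12,
    "acct-006": 0.79,
}

def flag_for_review(scores, threshold):
    """Return account ids at or above the threshold, highest score first."""
    flagged = [acct for acct, p in scores.items() if p >= threshold]
    return sorted(flagged, key=lambda acct: scores[acct], reverse=True)

# An investigator's queue at the 80% cutoff discussed above.
laundering_queue = flag_for_review(scores, threshold=0.80)
# A much stricter cutoff, closer to what a safety-critical use would demand.
strict_queue = flag_for_review(scores, threshold=0.99)

print(laundering_queue)  # ['acct-004', 'acct-002']
print(strict_queue)      # []
```

At 80%, investigators start with two accounts instead of six. At 99%, nothing in this toy data qualifies, which shows why the acceptable threshold has to be set per use case.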

Risks and Rewards

Data science is a very powerful tool, but there are some associated risks. One of the most glaring is that you can't always explain the results in layman's terms. We can build self-learning neural networks, but because of the self-learning component, you can't point directly to how they reach their conclusions. In some industries, this can be incredibly problematic. For example, some universities are considering using data science to help them sort through applicants in the admissions process.

Think about a scenario where you apply to college and the data says that you're likely to fail out of school. The model might come to that conclusion based on your GPA or your test scores, but also based on demographic information or the fact that your parents didn't go to college. Using this data, a learning model may decide that there is a 90% chance that you're going to fail, ultimately resulting in a rejection letter. Schools have an incentive to use tools like this to reduce dropout rates, and they want to attract the best and the brightest to their institution. But these calculations could kill the American dream that anybody can make it. No school wants to be the next headline in the Wall Street Journal.

Learning models make the social and ethical implications difficult to control. Even if you exclude data such as race, gender and income, the model might still learn from other, correlated data and arrive at the same conclusions. If you can't explain why your model is profiling one group, that's an issue with ethical and, in some industries, legal ramifications.
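
A toy Python sketch shows how this can happen. The data here is invented: `group` stands in for a protected attribute, and `zip_area` is a hypothetical, seemingly neutral field that happens to line up with it 90% of the time. The rule never reads `group`, yet its decisions still split along group lines.

```python
import random

rng = random.Random(2)

# Hypothetical applicants. 'zip_area' tracks 'group' 90% of the time.
applicants = []
for _ in range(2000):
    group = rng.choice(["A", "B"])
    if rng.random() < 0.9:
        zip_area = "north" if group == "A" else "south"
    else:
        zip_area = "south" if group == "A" else "north"
    applicants.append({"group": group, "zip_area": zip_area})

def model(applicant):
    """A rule that never looks at 'group', only at the proxy field."""
    return applicant["zip_area"] == "north"

def admit_rate(group):
    """Share of a group's members that the rule admits."""
    members = [a for a in applicants if a["group"] == group]
    return sum(model(a) for a in members) / len(members)

rate_a = admit_rate("A")
rate_b = admit_rate("B")
print(f"group A admitted: {rate_a:.0%}, group B admitted: {rate_b:.0%}")
```

Dropping the sensitive column did not remove the pattern; it only made it harder to see. Comparing outcomes by group, as `admit_rate` does, is one simple way to look for it.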

Still, our clients are using this technology for countless applications in industries such as healthcare, retail and financial services. Data science can help transform a business and that's why people keep coming to us looking for solutions. So, if you're reading this now you might be wondering "where do I start?"

How to Begin

The first problem I see is people looking to apply data science without the underlying data in place. This is one of the reasons that, right now, data engineering is more important than data science. There's no point worrying about any of these solutions until the data is there, because you can't apply the algorithms without it.

The other barrier to entry in this field is skills. The data engineers that you have on staff are not trained to build a neural network. If you ask them to try, you will get one of two responses: "I don't know how to do that" or, even worse, they'll build something without fully understanding how it works. That's very dangerous. They may be able to get a model up and running, but knowing whether the results are actually accurate isn't easy and requires training to interpret.

This isn't on the same scale as having someone make a website that ends up looking beautiful but doesn't perform. While challenging, those fixes are fairly straightforward. This is a different problem. These systems can push out an answer that puts your entire business on the wrong track. So, if you're looking to dip your toe into the data science waters, start small and don't try to over-engineer or boil the ocean in one go. Try to solve one small thing. Oftentimes you don't need a complex model to get the lift you are looking for, so try less complex algorithms first.

Expertise is what makes data science projects successful. However, many organizations don't want to use consultants for this type of work because of the proprietary data involved. In some cases the results that data science produces are as secret as the recipe for Coca-Cola; if a company's pricing model leaked to the public, it could ruin its ability to stay competitive. This is one reason why The Economist calls data the most valuable resource in the world.

So, what do you do with this barrier to entry? You need smart people who know how to do this but there aren't enough people in the world. At the same time, you may not want to bring consultants inside and risk your competitive advantage being exposed.

That being said, you should use your consultants to get started. Have them help you build a data science team, stand up your production operations and educate your organization on data science. You'll get their expertise and can then continue to evolve your models with your newly trained staff. Don't lean on the consultants to come up with the algorithm that's going to make or break your company.

Also, you don't have to be a Coca-Cola to leverage these tools. In fact, if you're a relatively small company, it may even be easier. You probably have better organized data and fewer processes in the way of getting a data science center of excellence started. There is no reason to wait; just make sure you start small. By using data science carefully and correctly, you can help keep your organization one step ahead.