Open Source Data

The software world has made tremendous strides through open-source development. Is it time for an open-source movement in data science?

It's a provocative idea, but I would argue that the advantages of open-sourcing data outweigh the disadvantages. I would also argue that open-source data is coming our way regardless of how we might feel about it. If we prepare for the change, we may be able to manage the risks more effectively.

The chief advantage that open-sourcing has brought to software development is sheer speed. Just a few years ago, all of us were using proprietary software that had been built, run, and maintained by Microsoft and other companies. These companies built their products in a vacuum, without obtaining feedback from outside developers or sharing their ideas with them. Certainly, these companies brought us amazing innovations, and it's likely that innovations would have kept on coming even without the contributions of the open-source community.

Today, we can use, run, and modify countless freely available open-source software products to our hearts' content. And as more people contribute to building out the software, innovation cycles become far faster than they would have been if open-sourcing hadn't overtaken the software industry.

So why not open-source data? Imagine how it could serve the public good.

Healthcare. We could find effective cures for cancer and other diseases far more quickly if we gave the right people access to all the medical data available in the world.

Consumers. Open-source data could benefit consumers in countless ways. One example: if utility companies published aggregate billing and power-consumption data, customers could use it to identify ways to reduce energy consumption and costs. Lobbying groups could use similar data to press regulators into reducing utility rates, which would also benefit consumers.

Government. I recently met with a Department of Transportation CIO who wants to publish as much data as feasible about roads and traffic in his state. This data could be used by manufacturers of self-driving cars, which need access to massive amounts of data to navigate roads and traffic. Other government agencies also are making data available to the public. Check out the federal government's data.gov website, for example.

The data.gov website makes a compelling point about the value of open-sourcing. "Open government data is important because the more accessible, discoverable, and usable data is, the more impact it can have. These impacts include, but are not limited to: cost savings, efficiency, fuel for business, improved civic services, informed policy, performance planning, research and scientific discoveries, transparency and accountability, and increased public participation in the democratic dialogue."

I agree with that point of view, and it's why I advocate open-source data.

Of course, there is another point of view. Although open-sourcing means making data accessible to people who have good intentions, it also means making data available to hostile parties. And that can put personal privacy - or what's left of it - and public safety at risk.

While massive volumes of healthcare data could be used by researchers in the quest for the cure for cancer, the same data could be used by employers to identify and dismiss employees, or turn away prospective employees, who have (or are at risk for) cancer. Data related to the power grid and the nation's roads and highways could be used by terrorists to wreak havoc and cause deaths.

There are ways to limit such problems, although I'm not sure how effective they might be. Organizations that elect to release their data will need to strip or mask the individually identifying information without compromising the value of the data. They'll need to ensure compliance with regulatory requirements and privacy laws and consider the implications of letting competitors see their data. And of course, they'll have to decide whether releasing particular data sets would be of benefit to terrorists and other hostile parties.
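To make "strip or mask" a little more concrete, here is a minimal sketch of what that step might look like for a single record. Everything in it is an assumption made for illustration: the field names (name, email, phone, patient_id, age), the mask_record helper, and the choice of a keyed hash and 10-year age bands are hypothetical, not a description of how any particular organization does it. It drops the direct identifiers, replaces the record ID with a keyed hash so records can still be linked without exposing the original ID, and coarsens the age.

```python
import hashlib
import hmac

# Hypothetical field names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def mask_record(record, secret_key):
    """Return a masked copy of `record`; the input dict is not modified.

    - direct identifiers are dropped
    - `patient_id` (if present) is replaced by a truncated keyed hash
    - `age` (if present) is coarsened to a 10-year band such as "40-49"
    """
    masked = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    if "patient_id" in masked:
        # Keyed hash (HMAC-SHA256): the same ID always maps to the same token,
        # but the token cannot be recomputed without the secret key.
        digest = hmac.new(
            secret_key,
            str(masked["patient_id"]).encode("utf-8"),
            hashlib.sha256,
        )
        masked["patient_id"] = digest.hexdigest()[:16]

    if "age" in masked:
        low = (masked["age"] // 10) * 10
        masked["age"] = f"{low}-{low + 9}"

    return masked


if __name__ == "__main__":
    example = {
        "name": "Jane Doe",
        "email": "jane@example.com",
        "patient_id": 12345,
        "age": 47,
        "diagnosis": "melanoma",
    }
    print(mask_record(example, secret_key=b"keep-this-key-private"))
```

A step like this is only a starting point. Combinations of the fields that remain, such as an age band plus a rare diagnosis plus a location, can still single out an individual, which is part of why I'm not sure how effective these measures will be.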

One way to ease the transition to open-sourcing would be to establish an organization that serves as a data marketplace. In the open-source software world, the all-volunteer Apache Software Foundation plays such a role, and one of its tasks is to try to ensure that open-source software development is done for the public good. The foundation develops, stewards, and incubates more than 350 open-source projects and initiatives that cover a wide range of technologies. The sooner a similar foundation is established within the data world, the better.

Like it or not, open-sourcing is coming. If we prepare for the transition, we can anticipate and try to manage the risks while taking advantage of the benefits, notably high-speed innovation for the public good.