Managing Hadoop in the cloud


Originally, Hadoop's architects did not recommend running Hadoop in a cloud environment, but many businesses and other organizations today are successfully running and managing Hadoop in the cloud. The benefits are numerous and can include substantial cost-savings.

In Part 1 of this article, we discussed the benefits of a transient cloud cluster. Now we expand on examples of platforms that leverage transient architecture and provide management tools for these clusters. Cloudera's Hadoop platform and an open-source project by the Financial Industry Regulatory Authority (FINRA), referred to as Herd, are two prime examples of Hadoop management in the cloud. The advantages of flexibility, durability and cost-savings make Hadoop in the cloud an attractive platform. The growing trend and the emergence of management solutions are opening a path for the cloud to become the future home to big data.

Cloudera's cloud platform

One of the challenges that comes with this architecture is the question of how to manage data in an environment where clusters are transient. This requires a solution not traditionally part of the Hadoop architecture. Cloudera's Enterprise Hadoop platform provides guidance to run the Cloudera stack in the cloud; in particular, it provides a reference architecture for Amazon Web Services (AWS). The goal is to provide recommendations for enterprises looking to reduce cost while gaining greater flexibility and compute power.

The reference architecture is built around instances running in EC2 and recommends storing data in S3. EC2 is Amazon Elastic Compute Cloud, the AWS service for renting virtual servers; S3 (Simple Storage Service) is the AWS object storage service. Cloudera Manager and Cloudera Director work together to provide an easy-to-manage deployment tool for Hadoop clusters.

One benefit of the Cloudera package is that, once you get a handle on compute needs, you can use Cloudera Manager to perform software installation and management of your cluster. Manager replaces the need for a deployment tool such as Puppet or Chef to manage software and configuration across the nodes in the cluster. It allows you to deploy software packages, called parcels, via a Web interface, providing a one-stop cluster management dashboard. Manager doesn't get you in the cloud on its own, but it does simplify deployments. That, in turn, can save you time and money.

Similarly, Amazon provides a one-stop deployment called Elastic MapReduce (EMR). EMR is Hadoop cloud provisioning via the AWS management interface. This looks like it could be a great cloud option to get a cluster running quickly. One concern: You're limited to the packages EMR provides. As of this writing, Cloudera seems to have a more complete suite of tools and more customizable installation options than the EMR kit provides.

With respect to storage, the supported option for AWS deployments is ephemeral storage. AWS ephemeral storage is lost if instances are stopped or terminated. Because of this, persistent copies of data and results should be maintained in S3. Ephemeral storage gives you fast access to intermediate data as long as you manage the data you want to persist.
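
The storage pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cloudera or AWS tooling: a temporary directory stands in for instance-local ephemeral storage, an ordinary local directory stands in for an S3 bucket, and the function name `run_job_and_persist` is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_job_and_persist(records, durable_dir):
    """Process records on scratch space, then copy only the result to durable storage.

    `durable_dir` is a local directory standing in for an S3 bucket (an
    assumption so the sketch runs anywhere); a real job would upload instead.
    """
    durable_dir = Path(durable_dir)
    durable_dir.mkdir(parents=True, exist_ok=True)

    # The temporary directory plays the role of ephemeral storage: it is
    # deleted when the block exits, much as instance storage is lost when
    # an instance is stopped or terminated.
    with tempfile.TemporaryDirectory() as scratch:
        scratch = Path(scratch)

        # Intermediate data lives only on the fast, disposable scratch space.
        intermediate = scratch / "sorted.txt"
        intermediate.write_text("\n".join(sorted(records)))

        # The final result is produced on scratch space first.
        result = scratch / "result.txt"
        result.write_text(str(len(records)))

        # Persist only what must outlive the cluster.
        persisted = durable_dir / result.name
        shutil.copy(result, persisted)

    return persisted, intermediate
```

After the function returns, the persisted copy still exists while the intermediate file is gone, which is the trade-off the paragraph describes: fast local access during the job, with durability only for the data you explicitly copy out.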

Director, which is similar to Amazon's EMR, was recently added to the Cloudera suite. This provides a cloud-centric deployment tool for your Cloudera clusters. Director was designed to be a provider-neutral dashboard to set up and manage your cloud environments. Currently, the two supported providers are AWS and Google Cloud. Under the covers, Director works with the cloud provider application programming interface (API) to launch and deploy instances in the clusters. In addition to interacting with a Web user interface (UI), clients can interact with the Director API to manage environments and instances. This gives client applications the raw ability to launch on-demand instances and take advantage of transient processing power in the cluster.


This overall architecture builds on the Cloudera components to provide a modular and flexible approach to setting up Hadoop clusters in the cloud.

Director and EMR provide a similar management function. One important consideration is: How tied to your cloud platform would you like to be? EMR is an AWS product; it works well with AWS and will always be an AWS-centric tool. Director, in contrast, isn't tied to any specific cloud platform. It provides connectors to Amazon and the Google Cloud Platform, and offers an open-source service provider interface to build support for additional cloud providers. This allows migration to different providers if needed. If flexibility is important, a platform-agnostic tool such as Director would be a good choice.
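
To make the idea of a provider-neutral layer concrete, here is a minimal Python sketch of what such an interface might look like. It is not Director's actual service provider interface: the names `ComputeProvider`, `InMemoryProvider` and `run_transient` are hypothetical, and the "provider" only tracks instance identifiers in memory rather than calling a cloud API.

```python
from abc import ABC, abstractmethod


class ComputeProvider(ABC):
    """Hypothetical provider-neutral contract; one subclass per cloud."""

    @abstractmethod
    def launch(self, count):
        """Start `count` instances and return their identifiers."""

    @abstractmethod
    def terminate(self, instance_ids):
        """Stop the given instances."""


class InMemoryProvider(ComputeProvider):
    """Stand-in provider that only records which instances are 'running'."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.running = set()
        self._next = 0

    def launch(self, count):
        ids = [f"{self.prefix}-{self._next + i}" for i in range(count)]
        self._next += count
        self.running.update(ids)
        return ids

    def terminate(self, instance_ids):
        self.running.difference_update(instance_ids)


def run_transient(provider, count, work):
    """Launch instances, run `work` on them, and always release them afterward."""
    ids = provider.launch(count)
    try:
        return work(ids)
    finally:
        provider.terminate(ids)
```

Because `run_transient` depends only on the abstract contract, the calling code stays the same whichever provider object is passed in. That is the property that makes migrating between clouds feasible.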

EMR, on the other hand, includes built-in support for the transient processing model. While Director clusters are managed using a console, EMR clusters can be launched with a cluster lifecycle of either long-running or transient. With the transient lifecycle, the cluster shuts itself down when its work is finished, so nodes clean up after themselves automatically. This built-in transient feature is a benefit of Amazon's EMR.
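
As an illustration of the two lifecycles, the sketch below builds the request dictionary for the EMR `run_job_flow` call in the AWS SDK for Python (boto3), where the `KeepJobFlowAliveWhenNoSteps` flag decides whether the cluster stays up after its steps complete. The helper `build_emr_request` is hypothetical, and the release label, instance types, instance count and S3 path are placeholder values chosen for the example. The code only builds the dictionary and does not contact AWS.

```python
def build_emr_request(name, steps, transient=True):
    """Build keyword arguments in the shape boto3's EMR `run_job_flow` expects.

    All concrete values (release label, instance types, instance count)
    are placeholders for illustration, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False: the cluster terminates when its steps finish (transient).
            # True: the cluster stays up waiting for more work (long-running).
            "KeepJobFlowAliveWhenNoSteps": not transient,
        },
        "Steps": list(steps),
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# A single step; the bucket and script path are hypothetical.
example_step = {
    "Name": "example-step",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/job.py"],
    },
}

request = build_emr_request("nightly-batch", [example_step], transient=True)
```

With boto3 installed and credentials configured, the dictionary could be passed along as `boto3.client("emr").run_job_flow(**request)`.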

Using Herd to manage big data in the cloud

One of the issues FINRA struggled with was how to manage big data in the cloud. The financial industry faces regulatory requirements regarding metadata and lineage, and FINRA wanted to find a solution to manage and orchestrate data processing in a heterogeneous, cloud-based environment. The solution FINRA developed, and recently open-sourced, was a project named Herd. FINRA describes Herd as "big data governance for the cloud." The Herd application is based on a collection of APIs currently grouped into four categories: services that provide a unified data catalog; services that track data lineage; services that manage clusters; and, finally, services that orchestrate jobs across those clusters.


Unified data catalog and lineage

Herd is written in Java and uses the Spring framework. Its metadata is persisted in a PostgreSQL database, and it provides a generic data model for storing metadata along with the APIs to manage that metadata. Applications call these APIs to register new business data object instances. You can then use other services to generate Hive data definition language (DDL) based on the metadata in the catalog. All objects are versioned, and users can annotate them with custom attributes. Lineage can also be captured at the data object level, providing users with the ability to track the parent/child relationship between objects, a critical need for auditing. Forming a relationship between source data, artifacts and the processing job opens up the ability to track each step in the process that was used to arrive at a result. That provides an audit trail necessary for data governance.
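
The catalog ideas above (versioned objects, custom attributes, parent/child lineage and DDL generated from metadata) can be illustrated with a small in-memory model. The sketch is written in Python for brevity and is not Herd's data model or API: the class names, method names, table names and S3 locations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    name: str
    version: int
    columns: list       # list of (column_name, hive_type) pairs
    location: str
    parents: list = field(default_factory=list)     # names of upstream objects
    attributes: dict = field(default_factory=dict)  # free-form annotations


class Catalog:
    """Hypothetical in-memory metadata catalog."""

    def __init__(self):
        self._versions = {}  # name -> list of DataObject, oldest first

    def register(self, name, columns, location, parents=(), attributes=None):
        """Store a new version of `name` and return it."""
        history = self._versions.setdefault(name, [])
        obj = DataObject(name, len(history) + 1, list(columns), location,
                         list(parents), dict(attributes or {}))
        history.append(obj)
        return obj

    def latest(self, name):
        return self._versions[name][-1]

    def hive_ddl(self, name):
        """Generate a Hive CREATE EXTERNAL TABLE statement from the metadata."""
        obj = self.latest(name)
        cols = ",\n  ".join(f"`{c}` {t}" for c, t in obj.columns)
        return (f"CREATE EXTERNAL TABLE `{obj.name}` (\n  {cols}\n)\n"
                f"LOCATION '{obj.location}'")

    def lineage(self, name):
        """Return every ancestor of `name`, nearest first."""
        seen, queue = [], list(self.latest(name).parents)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                if parent in self._versions:
                    queue.extend(self.latest(parent).parents)
        return seen


catalog = Catalog()
catalog.register("trades_raw", [("id", "BIGINT")], "s3://example-bucket/raw/")
catalog.register("trades_clean", [("id", "BIGINT"), ("price", "DOUBLE")],
                 "s3://example-bucket/clean/", parents=["trades_raw"],
                 attributes={"owner": "data-team"})
catalog.register("daily_report", [("day", "STRING")],
                 "s3://example-bucket/report/", parents=["trades_clean"])
```

Walking `catalog.lineage("daily_report")` traces the report back through the cleaned data to the raw source, which is the kind of audit trail the paragraph describes.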

Cluster management and orchestration

Herd also provides a set of services for creating and managing clusters and for running jobs on those clusters. It leverages Activiti, an open-source business process management (BPM) workflow platform, which allows users to create orchestration workflows in BPMN 2.0. Herd exposes a list of tasks that can be integrated into Activiti workflows. It also provides a set of services for registering data object events and sending notifications about them. These can be used to notify a workflow when a particular data type is registered and ready for processing, enabling on-demand processing of data. When a data type becomes available, you spin up a cluster, run your MapReduce job, register your data products, and release resources. On-demand processing keeps operating costs down because you don't have to pay for processing nodes that aren't being used.
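
The notify-then-process flow described above can be sketched as a small publish/subscribe loop. This is a Python illustration under stated assumptions, not Herd's notification service or an Activiti workflow: the `Notifier` and `Cluster` classes, the job name and the data names are hypothetical, and the "cluster" only writes to a log instead of provisioning machines.

```python
class Cluster:
    """Stand-in for a transient cluster; a real one would call a cloud API."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("cluster started")
        return self

    def run(self, job, data_name):
        self.log.append(f"ran {job} on {data_name}")
        return f"{data_name}.{job}"

    def __exit__(self, exc_type, exc, tb):
        # Resources are released even if the job raises.
        self.log.append("cluster released")
        return False


class Notifier:
    """Registers data objects and notifies subscribers by data type."""

    def __init__(self):
        self._handlers = {}
        self.registered = []

    def subscribe(self, data_type, handler):
        self._handlers.setdefault(data_type, []).append(handler)

    def register(self, data_type, name):
        self.registered.append(name)
        for handler in self._handlers.get(data_type, []):
            handler(name)


log = []
notifier = Notifier()


def process_trades(name):
    # Spin up, run, register the product, then release on leaving the block.
    with Cluster(log) as cluster:
        product = cluster.run("aggregate", name)
        notifier.register("report", product)


notifier.subscribe("trades", process_trades)
notifier.register("trades", "trades-batch-1")
```

No cluster exists until the `trades` object is registered, and it is released as soon as the handler finishes, so compute is paid for only while there is data to process.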


Many companies are successfully running and managing Hadoop in the cloud. A transient architecture can be used as a way to gain big compute power at a manageable cost. Amazon, Cloudera and newer entries such as Herd provide tools for the enterprise to meet big data needs quickly. Given the advantages of flexibility, durability and cost-savings, running Hadoop in the cloud is an attractive way to meet data management needs. The Herd product owner at FINRA, Nate Weisz, tells us that FINRA sees it as the future. Increasingly, so do many others.

The next article in this series will explore data governance in the cloud with a focus on Herd and Cloudera.