This is the first in a series of articles discussing running Hadoop in the cloud. This article will explore the benefits and challenges of big data in the cloud. The second article will take a closer look at Cloudera and open-source Herd tools for cloud processing. The final article will take a closer look at data governance in the cloud.


Hadoop in the Cloud

Managing big data is becoming common at many organizations today, and Hadoop, an open-source software framework for big data, is no longer something that only large Web companies such as Facebook, eBay and Yahoo need to handle. Organizations of all sizes are finding that they need to make big data easier to manage while minimizing maintenance costs. They also need to support massively parallel processing of that data to meet business needs. An effective approach can save millions of dollars.

The recommended configurations for Hadoop have traditionally assumed cheap, bare-metal servers. But as enterprises move to cloud platforms such as Amazon Web Services (AWS), Microsoft Azure and the Google Cloud Platform, Hadoop and the cloud have become a natural fit. The cloud provides virtually unlimited storage alongside massive parallel compute power. Some cost-effective Hadoop solutions have evolved to incorporate transient clusters that process data as needed, but to make this work, Hadoop needs permanent storage that lives outside the cluster. This gives the enterprise big data processing power where and when it is needed.

Big data and the cloud: a match made in heaven?

Hadoop was not originally designed with cloud storage and computing in mind. One of the keys to the traditional Hadoop architecture is data locality, or bringing the processing to the data. At runtime, Hadoop identifies which server in a cluster hosts the data that needs to be processed, then allocates the processing job to that machine, minimizing the amount of data sent over the network. Processing data locally on a machine is much faster than shuffling it among networked storage devices.

Cloud computing turns this on its head. Processing can now assume that data is non-local: stored in the cloud and fetched when needed. This can be done with transient clusters of nodes that are allocated and deallocated on demand. Amazon's platform-as-a-service Hadoop offering, Elastic MapReduce (EMR), is built on this approach. EMR spins up transient Hadoop clusters of Amazon Elastic Compute Cloud (EC2) machines on demand, runs MapReduce or other distributed jobs against data in S3, and then releases the machines. S3 (Simple Storage Service) is an online file storage service, also offered by AWS.
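The create-process-release cycle described above can be sketched with the AWS SDK for Python (boto3). The cluster name, bucket names, instance types and script path below are hypothetical; the key detail is that the cluster is configured to terminate itself as soon as its steps finish, so compute is billed only while the job runs:

```python
def transient_cluster_request(log_bucket, script_bucket):
    """Build a request body for boto3's emr.run_job_flow().

    The cluster runs one step against data in S3 and then
    auto-terminates (KeepJobFlowAliveWhenNoSteps=False), so the
    EC2 nodes exist only for the duration of the job.
    """
    return {
        "Name": "transient-etl",              # hypothetical job name
        "ReleaseLabel": "emr-5.0.0",          # EMR software release
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 5,
            # Release the machines when there are no more steps to run:
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-events",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", f"s3://{script_bucket}/job.py"],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# To actually submit (requires boto3 and AWS credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   emr.run_job_flow(**transient_cluster_request("my-logs", "my-scripts"))
```

Because the data lives in S3 rather than on the cluster, nothing is lost when the nodes are released; the next run simply spins up a fresh cluster against the same bucket.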

One example of this approach comes from the Financial Industry Regulatory Authority (FINRA), which moved much of its big data to the Amazon cloud. FINRA uses transient Hadoop clusters running EMR for processing. It uses S3 for storage. In a case study, FINRA has noted that it can use the cloud for processing and storing 30 billion market events each day. This will save the organization $10 million to $20 million annually, according to the case study.

Other Hadoop-as-a-service platforms have sprung up as well. Among them is Qubole, which provides a layer of abstraction on top of the underlying cloud providers and allows even more flexibility by offering a choice of cloud provider instead of locking you into a single platform. Pinterest, for example, runs Hadoop against more than 10 petabytes of data in Amazon's cloud. It initially used EMR but, because of stability issues, eventually moved to Qubole. The underlying data and processing, however, still run on top of AWS.

Advantages of the cloud

There are significant advantages to pushing big data infrastructure into the cloud. One advantage involves decoupling storage from compute. Although this might seem at odds with Hadoop's principle of data locality, which minimizes the data sent over the network, decoupling allows flexibility not achievable in a traditional Hadoop architecture. For example, it allows multiple Hadoop clusters, each with different types of workloads, to point to the same underlying data. The business implication is that scaling becomes simple: storage expansion no longer requires procuring and installing additional hardware, and adding compute is only a matter of spinning up additional virtual nodes.

The cloud also provides a durability and availability of data that most private data centers would be hard-pressed to match. Amazon, for example, promises 99.999999999% durability (that's nine nines after the decimal point) for S3 standard storage, with availability of 99.99%. That means that if you stored 10,000 objects, you could expect to lose a single object about once every 10 million years. To be fair, the built-in Hadoop Distributed File System (HDFS) natively replicates data three times, so the probability of losing data there is also extremely low. But managing your own in-house cluster comes with the costs and headaches of managing additional hardware, and there is added complexity if you want to replicate the data to another location to facilitate disaster recovery.
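The "one object in 10 million years" figure follows directly from the durability number. Eleven nines of durability corresponds to an annual loss probability of 10^-11 per object, so for a sample of 10,000 objects:

```python
# Annual probability of losing any one object at "eleven nines"
# durability: 1 - 0.99999999999 = 1e-11
annual_loss_probability = 1e-11
objects = 10_000

# Expected objects lost per year across the whole sample
expected_losses_per_year = objects * annual_loss_probability

# Average years before the sample loses a single object: ~10 million
years_per_loss = 1 / expected_losses_per_year
print(f"{years_per_loss:,.0f} years per lost object")
```

The same arithmetic scales linearly: storing a million objects at that durability still works out to roughly one lost object per 100,000 years.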

Finally, the low cost of the cloud is attractive when compared with the cost of procuring and standing up hardware, not to mention the cost of installation and maintenance. In the cloud, you pay only for what you use. Lower-cost options may be available if some degree of flexibility is acceptable (for example, Amazon's spot pricing model) or if your service-level agreements are less demanding. Amazon's Glacier and Google's Cloud Storage Nearline provide lower-cost storage for data that is accessed infrequently. The competition among cloud providers has been good for consumers, as it is clearly driving down costs.

New approach with transient clusters

Amazon's EMR embraces temporary clusters for processing data, with the goal of spinning up processing power when needed. This reduces the ongoing costs of compute nodes. Transient nodes can be configured to the scale and size needed for a particular job and are an economical way to gain compute power with minimal maintenance cost.
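The economics are easy to illustrate with back-of-the-envelope arithmetic. The per-node hourly rate and cluster size below are hypothetical, not quoted prices; the point is the ratio between paying only while a daily batch job runs and keeping the same cluster up around the clock:

```python
hourly_rate = 0.25        # hypothetical per-node price, dollars/hour
nodes = 20                # cluster size for the job
job_hours_per_day = 3     # daily batch job runtime

# Transient cluster: billed only while the job runs
transient_monthly = hourly_rate * nodes * job_hours_per_day * 30

# Always-on cluster: billed for every hour of a 30-day month
always_on_monthly = hourly_rate * nodes * 24 * 30

print(transient_monthly, always_on_monthly)  # 450.0 vs. 3600.0
```

With a three-hour daily job, the always-on cluster costs eight times as much for the same work, before counting the maintenance effort it also demands.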

Cloudera Director allows you to take advantage of Hadoop in the cloud. Director is a management layer for Cloudera Hadoop that leverages the elastic nature of cloud clusters. It provides out-of-the-box integration with AWS and Google Cloud, and the Director Web user interface (UI) and application programming interface (API) provide controls with which to deploy, scale, clone and terminate clusters. Director handles all the heavy lifting required to set up and deploy nodes, and scaling can be performed with a few clicks in the Web UI. Although Director doesn't support transient clusters (that is, automatically expanding and releasing resources), it does provide a powerful tool for cloud management with an API that can be used in workflows.

For transient processing to work properly, durable storage must live outside the compute cluster. Transient nodes are destroyed after completing processing tasks, so data stored in transient clusters will be lost. You will still need to maintain data, metadata and artifacts. In the AWS ecosystem, S3 provides durable storage. In comparison with HDFS, S3 is a very strong contender, particularly in the areas of durability, scalability and simplicity. With respect to scalability, S3 provides an "infinitely" large bucket in which to put data. Scaling is transparent and immediate, and there is no need to manage cluster resizing due to space.
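In practice, pointing Hadoop at S3 instead of HDFS is largely a matter of configuration. A minimal sketch, assuming the s3a connector that ships with recent Hadoop releases (the credential values are placeholders):

```xml
<!-- core-site.xml: let Hadoop jobs read and write s3a:// paths -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

A job can then take a path such as `s3a://my-bucket/input/` in place of an `hdfs://` URI, and the durable copy of the data survives the cluster that processed it.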

However, S3 takes a back seat in performance. HDFS stores and processes data directly on the compute nodes, making data access fast and efficient. S3 has higher latency because it doesn't provide data locality; instead, you are reading from and writing to remote storage. The performance gap can be narrowed by tuning the size of the job and the characteristics of the compute nodes. And if your goal is to support transient processing clusters, the performance impact is not severe enough to rule out using S3 for persistent storage.



Hadoop was designed to provide cheap, distributed processing power for large volumes of non-volatile data. Those cheap nodes come at the price of procuring and maintaining individual machines. With the expanding availability of cloud providers, procurement becomes a matter of a few clicks in a management console, and Hadoop can benefit from that power as more providers and solutions come online.

Our next article in this series will explore how Cloudera and FINRA are addressing cluster management in the cloud.