Many organizations consider using cloud-based services to capture, store, analyze and visualize their data. Confronted with an ever-growing selection, it can be difficult to find the best option. Following a Gartner survey from June 2017, we give a brief overview of the data warehouse cloud services offered by three leading cloud service providers: Amazon Web Services, Google and Microsoft.

Gartner Comparison of Cloud Service Providers

Amazon Web Services

Amazon has been in the "as a Service" market for over 10 years. It offers a mature suite of cloud-based services grouped in four categories: Compute, Storage, Database, and Networking and Content Delivery. AWS also provides a suite of migration tools and services. All these resources are used subject to Amazon's security and identity services.

Of particular interest in the context of data warehousing are AWS's offerings in the storage and database categories. Here, the offerings include scalable cloud storage (Amazon S3), managed file storage (Elastic File System), and a variety of SQL (e.g. Amazon RDS, Amazon Redshift) and NoSQL databases (e.g. DynamoDB). Out of the box, these databases can handle petabyte-sized data volumes combined with high query speeds. The latter can be accelerated further by using Amazon's in-memory caching services.

Amazon Redshift is a database built on Postgres that is targeted specifically at the in-cloud data warehousing market. Its columnar storage and massively parallel query execution allow it to run complex queries against petabytes of structured data at reasonable speeds. Through its Redshift Spectrum engine, SQL queries can be run against unstructured data in Amazon's S3 storage. Thus, even data stored in a "data lake" (that is, a collection of unstructured information, possibly in combination with structured data) can be accessed through Redshift's SQL engine.
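To illustrate why columnar storage suits analytical queries, the following minimal Python sketch contrasts a row-oriented and a column-oriented layout of the same small table. The sample data, the column names, and both helper functions are hypothetical and serve only as an illustration; this is not a description of how Redshift is implemented internally.

```python
# Hypothetical sample data: three sales records in a row-oriented layout.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# Column-oriented layout of the same data: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_amount_row_store(table):
    """Sum 'amount' by visiting every full record."""
    return sum(record["amount"] for record in table)


def total_amount_column_store(table):
    """Sum 'amount' by reading only the 'amount' column."""
    return sum(table["amount"])


print(total_amount_row_store(rows))
print(total_amount_column_store(columns))
```

Both functions return the same total, but the column-oriented version never touches the "order_id" or "region" values. For aggregate queries over a few columns of a very wide table, this kind of layout can reduce the amount of data that has to be read.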

Redshift's data can be accessed by non-AWS applications through JDBC or ODBC mechanisms. A selection of data mining and data visualization products such as Tableau and Qlik can connect directly to the Redshift engine, run queries against the data and visualize the results. Since Redshift is based on (a subset of) Postgres, any desktop client supporting the Postgres data and communication model can connect to the Redshift database - at least in theory! In practice, one might encounter difficulties when trying to use the more "esoteric" functionalities of Postgres.
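As a rough sketch of what connecting a Postgres-style client involves, the Python snippet below assembles a PostgreSQL connection URI for a Redshift cluster. The function name, the cluster endpoint, the database name and the credentials are all made up for illustration; the only Redshift-specific detail used is its default port, 5439.

```python
from urllib.parse import quote


def redshift_connection_uri(host, database, user, password, port=5439):
    """Build a PostgreSQL-style connection URI for a Redshift cluster.

    5439 is Redshift's default port; all other values come from the caller.
    """
    # Percent-encode credentials so special characters do not break the URI.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


# Hypothetical cluster endpoint and credentials, for illustration only.
uri = redshift_connection_uri(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="report_user",
    password="p@ss/word",
)
print(uri)
```

A Postgres client library that accepts URIs of this form should in principle be able to use such a string, subject to the compatibility caveats described above.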

Amazon offers a choice between pricing options ranging from on-demand hourly rates to reserved instances at predefined size and performance levels. Components contributing to the fees charged include the amount of data stored and the computing resources used.

Scaling resources up or down is typically an active process requiring manual intervention via the AWS Console web portal or an API call.

AWS is an "as-a-Service" or "in the cloud" solution. However, reacting to competitive pressure, Amazon recently started to also offer hybrid solutions, that is, "in the cloud" combined with on-premises, by partnering with reputable providers including VMware, Intel, Microsoft, and SAP. Thus, data residing both on-premises and in the cloud can be handled and processed through AWS's services and applications.

Microsoft Azure

Microsoft started offering its Azure service in 2010, that is, about 7 years ago. It offers a wide range of services which include Computing, Storage and Data Management. All services are subject to Azure's Active Directory and other security services including Multi-Factor Authentication. One aspect that sets Azure apart is that Microsoft early on offered the possibility to run an application on an Azure stack in the cloud, installed on-premises alone, or as a hybrid combination of on-premises and cloud.

Azure offers a variety of storage options for structured data, unstructured data and any combination thereof (data lake). Further, the data and information can be organized in SQL and NoSQL databases. A Redis Cache engine can power applications that require high-throughput and low-latency access to data. The data can be accessed via a RESTful HTTP API.

Like AWS, Microsoft offers a dedicated engine for cloud warehousing: the SQL Data Warehouse provides a unified T-SQL interface to access data stored in Azure. It connects to other offerings including Microsoft's cloud-based Hadoop cluster HDInsight, Machine Learning and Power BI.

Further, many BI tools (Tableau, Qlik and others) can connect directly to Azure's data via ODBC.

All resources can be scaled up or down using Azure's console.

Pricing is determined by the amount of data stored on Azure and the computing resources used. Both "pay-as-you-go" pricing and pre-defined performance pricing models are available.

Google Cloud Platform

First launched in 2011, Google's Cloud Platform now offers a wide variety of services. Options for data storage include both SQL and NoSQL databases. Google uses the OAuth security mechanisms to control access to its services.

Google stresses the point that its services are "fully-managed", that is, resources are resized dynamically and largely automatically, and manual configuration is intentionally kept to a minimum.

BigQuery is Google's dedicated cloud data warehouse. Essentially, BigQuery is an ultra-fast SQL interface to data stored internally in BigQuery or externally (almost) anywhere in Google's cloud service ether. BigQuery can be accessed through a web UI, a command line tool, or a REST API.
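To give an impression of the REST route, the following Python sketch prepares (but deliberately does not send) an HTTP request to BigQuery's `jobs.query` method, authorized with an OAuth bearer token. The helper name, the project ID, the SQL statement and the token placeholder are assumptions made for illustration; a real call needs a valid access token obtained through Google's OAuth flow.

```python
import json
import urllib.request


def build_bigquery_query_request(project_id, sql, access_token):
    """Prepare, without sending, a request to the BigQuery jobs.query method."""
    url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    # Ask for standard SQL rather than BigQuery's legacy SQL dialect.
    body = json.dumps({"query": sql, "useLegacySql": False}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical project, query and token placeholder, for illustration only.
request = build_bigquery_query_request(
    project_id="my-project",
    sql="SELECT 1 AS answer",
    access_token="ACCESS_TOKEN_PLACEHOLDER",
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would submit the query. In practice, a client library or the command line tool is often more convenient than hand-built requests.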

Many BI tools can access BigQuery directly.

Google's sole pricing model is pay-as-you-go. Factors that generate charges include the amount of data stored, the amount of data transferred (from one Google service to another), the data scanned (during a query), and the computing resources used. Google does not offer a pre-paid pricing model.
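The way these factors add up to a monthly charge can be sketched in a few lines of Python. All unit prices and usage figures below are invented for illustration and do not reflect Google's actual price list; the `Rates` class and `monthly_charge` function are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical unit prices; real prices come from the provider's price list."""
    storage_per_gb_month: float
    transfer_per_gb: float
    scan_per_tb: float
    compute_per_hour: float


def monthly_charge(rates, stored_gb, transferred_gb, scanned_tb, compute_hours):
    """Return an itemized monthly bill as a dict, including a 'total' entry."""
    items = {
        "storage": stored_gb * rates.storage_per_gb_month,
        "transfer": transferred_gb * rates.transfer_per_gb,
        "scan": scanned_tb * rates.scan_per_tb,
        "compute": compute_hours * rates.compute_per_hour,
    }
    items["total"] = sum(items.values())
    return items


# Made-up rates and usage figures, for illustration only.
example_rates = Rates(storage_per_gb_month=0.02, transfer_per_gb=0.01,
                      scan_per_tb=5.0, compute_per_hour=0.5)
bill = monthly_charge(example_rates, stored_gb=1000, transferred_gb=200,
                      scanned_tb=4, compute_hours=10)
print(bill)
```

Itemizing the bill in this way makes it easier to see which factor dominates, for example whether storing the data or scanning it during queries drives the cost.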

Offering Comparison

To allow a better comparison of the various offerings, the table below lists the names of similar services offered by the three cloud providers.

| Category | Amazon Web Services | Microsoft Azure | Google Cloud Platform |
| --- | --- | --- | --- |
| Data Warehousing | Redshift | MS Azure SQL Data Warehouse* (with Redis Cache) | BigQuery |
| ETL | Data Pipeline, Glue | Data Factory | Cloud Dataflow |
| Storage | Simple Storage Service (S3), Elastic File System (EFS) | Blob Storage*, Disk Storage*, Data Lake Storage* | Google Cloud Storage* |
| Virtual Servers | EC2 | MS Azure Virtual Machines* | Google Compute Engine* |
| Serverless Computing | Lambda | MS Azure Functions* | Google Cloud Functions |
| BI / Visualization | QuickSight | Power BI | DataLab |

* MS and Google do not have a brand name for these services; they use generic names.

Considerations

An organization planning to use cloud services for its data warehousing must consider several important aspects when selecting a service provider.

Resource Management

Is the organization prepared for, and capable of, managing all aspects of a data warehouse deployed in the cloud? Managing the cloud service itself keeps control and expertise within the organization. However, having the required experts on its own payroll is a significant cost factor that should not be underestimated. Or should the organization instead rely on a service that takes over management to a large degree? At first glance this can be cheaper, but considering how such services limit direct control and potentially impose higher service fees, it might not be the best option.

Security Infrastructure

The security infrastructure of the cloud service should work seamlessly with the organization's security systems. Little is more irritating to users - and more expensive in terms of direct and hidden costs - than systems that cannot be accessed in a simple way, and where users gain access through different login pages using different accounts. A seamless security integration between cloud services and on-site services will drastically ease the burden of accessing the data stored in the cloud.

Solution Space

What solutions (beyond the data warehouse) does the organization need: ETL services? Computing power for data mining? Reporting and visualization? It is important to anticipate the requirements early on and take into account what each solution provider offers. Since transferring data in and out of a provider's ecosystem can be expensive in terms of both cost and time, it might be advantageous to select a service provider that offers most if not all of the solutions beyond data warehousing which the organization needs or is planning to use.

Costs

What is the total cost of ownership for a cloud-based solution? Some services offer fixed pricing for a pre-defined set of resources, which makes it easy to calculate the expected service-related costs upfront. In contrast, pay-as-you-go plans allow resources to be scaled up or down and are therefore more flexible. However, it can be difficult to budget the costs when operating in that way.

Further, not only should the costs for data warehousing be considered but also the costs for transferring data in and out of the provider's ecosystem, as well as the costs for all other services (solutions) the organization plans to use from this provider.
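The trade-off between a fixed plan and a pay-as-you-go plan can be made concrete with a small break-even calculation. The Python sketch below uses invented prices and hypothetical function names, and it assumes that data transfer is billed at the same rate under both plans, so that transfer costs do not shift the break-even point.

```python
def fixed_plan_cost(monthly_fee, transfer_gb, transfer_rate_per_gb):
    """Monthly cost of a fixed plan plus data transfer charges."""
    return monthly_fee + transfer_gb * transfer_rate_per_gb


def pay_as_you_go_cost(hours_used, hourly_rate, transfer_gb, transfer_rate_per_gb):
    """Monthly cost when paying per hour of use plus data transfer charges."""
    return hours_used * hourly_rate + transfer_gb * transfer_rate_per_gb


def break_even_hours(monthly_fee, hourly_rate):
    """Hours of use per month above which the fixed plan becomes cheaper."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return monthly_fee / hourly_rate


# Made-up prices, for illustration only.
FIXED_FEE = 600.0      # hypothetical fixed monthly fee
HOURLY_RATE = 1.5      # hypothetical on-demand price per hour
TRANSFER_RATE = 0.25   # hypothetical price per GB transferred

print("break-even hours:", break_even_hours(FIXED_FEE, HOURLY_RATE))
for hours in (200, 400, 600):
    fixed = fixed_plan_cost(FIXED_FEE, 100, TRANSFER_RATE)
    on_demand = pay_as_you_go_cost(hours, HOURLY_RATE, 100, TRANSFER_RATE)
    print(hours, fixed, on_demand)
```

Running such a calculation with an organization's own expected usage and the provider's actual prices gives a first indication of which pricing model is easier to budget for.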

Decision-Time

Cloud services are not a one-size-fits-all proposition. Depending on the existing technology stack, budget, accounting rules, and available technical expertise, organizations will come to different decisions. It is important to realize that the decision on which cloud service provider to use is as important - and as difficult - as decisions made for "on premise" deployment. It is vital to research and compare each option to find the solution that fits the organization best.