Your web browser is out of date. Update your browser for more security, speed and the best experience on this site.

Update your browser
CapTech Home Page

Blog June 4, 2018

Selecting a Data Storage Platform in Azure

Moving Your Data to Azure

You have decided to move your database to the cloud and you've chosen Azure as your provider. What platform or platforms should you use to store your data? What are the costs and where do you get the most bang for your buck?

Azure gives you many options for storage… looking at Platform as a Service (PaaS) you could decide to use Azure SQL Data Warehouse, Azure SQL Database, Cosmos DB, or the up and coming Managed Instance. There are Infrastructure as a Service (IaaS) options on Azure VMs with SQL Server installed. And then there are the raw storage options: Blob, Table, File, and Data Lake. With so many choices, what setup is right for you? Obviously, it depends... Is your data supporting a single online application? Are you simply running an analytical data mart? Or, are you pulling together an enterprise data warehouse? Maybe you are exploring real-time, unstructured data and trying to gain insights from streaming and loosely structured data.

Before making any decisions about where to put your data, you will want to fully understand your goals, your constraints, and your budget. Here are some use cases and considerations…

  1. Application database - If your database is supporting an online application, your primary goals are generally availability and scale. You want to ensure that the database is never "down" and that you are able to easily add processing resources to meet client demand. High volume data storage may or may not be important to your application, but consistent availability almost definitely is.

    To identify constraints, consider any dependencies you have on being able to access data created or maintained by another application. Do you have specific authentication requirements? Are you managing a third-party application or reliant on third-party processes running on the same server as your database? Are you willing to rewrite any data access routines or restructure data as part of the move?

    All of these factors and many more should be considered prior to choosing a platform as they could drastically change your ideal setup. But, typically for this type of use case, you will want to consider running on Azure SQL Database (PaaS) for its guaranteed up-time, recoverability, and ability to scale up processing. As a complimentary service, Blob storage may be a helpful way to inexpensively stow image or other media data. In the event that you rely on other third-party processes running along-side the database, or if you don't have the resources to retest and modify data access routines, you will need to consider SQL Server on a VM (IaaS) or the new Managed Instance (PaaS).

  2. Data Mart - if you are building a data mart for reporting and analysis, your primary goal is to capture the data in a meaningful and understandable way. You will want quick response times and probably limited costs. In this scenario, you should take into consideration the volume of data that you need to store in the mart and the sophistication of your users. Depending on those two factors, you may be able to use an Azure SQL database. This has the potential to be the lowest cost, lowest maintenance option. If your data volume exceeds the limits for Azure SQL Database, you might want to reconsider the scope of your Mart, but there is still the option to use SQL Server on an Azure VM. This option will allow you to host SQL Server and its full suite of tools including SSIS and SSAS and store data well beyond 4TB. A few other bonuses are that you can continue to maintain separate databases with cross-database queries; if you are relying on them currently and if you have an enterprise agreement, you can bring your own license to save on costs. The costs might still outweigh Azure SQL Database, especially when you add in potential VM maintenance, but the licensing offset helps.

  3. Data Warehouse - if you are building an enterprise data warehouse, you will have many goals in mind, including, but not limited to, efficient data access, scalability in both storage and compute, data security, and disaster recovery. There are several factors to consider before jumping into a specific configuration. How much data do you foresee in the warehouse? Do you have an existing warehouse in place that you intend to migrate? If so, what resources are you willing to devote to reworking your design, load routines and reporting solutions? What is your budget?

    For cases where you have vast quantities of data and need the ability to run complex analytical workloads, Azure SQL Data Warehouse (ADW) might be your best bet as the primary data platform. Keep in mind that there are concurrency constraints, though (only a specific number of queries can be run at once). So you will want to use "spokes," such as subject area specific data marts and Analysis Services cubes to satisfy your user base. These can be run together on SQL Server VMs or using PaaS on Azure SQL Database and Azure Analysis Services. As part of the overall design, you will also want to implement Azure Blob Storage or Azure Data Lake Store to inexpensively maintain history of raw data feeds and enable the ELT process through Polybase.

    For other use cases (i.e., smaller budget), you will want to consider your user base, your data volume and the overall development investment you want to make to decide between SQL Server on an Azure VM and a storage service with a complimentary HD Insights setup. SQL Server on Azure acts as your traditional setup, making it easy to port your existing warehouse. Some departments and companies are using Data Lake or even Blob storage as the primary data platform for their warehouse and using HDInsights to process the data via Spark, Hive, and other Hadoop-based technologies. Analysis can then be done on Analysis Services or purely through HDInsights, Data Lake Analytics, or USQL. Having this setup in place to complement the data warehouse will be beneficial regardless of the primary platform that you choose.

  4. Real-time analytics - if you are looking to use Azure for real time data analytics, there are several tools that you'll need to consider for generating, hosting, and consuming events. From a data storage perspective, however, you have several options. Your data format and your analytics use cases will help you decide which is best. Some questions to consider in this scenario: Does your data have a known schema? Is your data generated at a steady rate or in less-predictable ebbs and flows? What volume of data are you ingesting? How much do you need to combine real-time data with existing historical data?

    For several use cases, your best bet might be to implement Cassandra or Mongo on the new Cosmos DB. Cosmos DB hosts these and other technologies which use distributed data via partitioning to efficiently store and query event-based data. Alternatively, you can use Azure Data Warehouse to maintain historical data that will be easy to combine with real time data. Using this option, you will need to be careful with concurrency and watch out for storage costs as you will be using the premium SSD storage that comes with ADW. Many times, you only need to run your real-time event data through an established machine learning model and then store the results along with the raw data. In this case, blob or data lake storage will suffice as an archive for the real time data. Maybe you will want to pass the model output into an existing database to support an application or enhance a data warehouse, but the platform for that system will already have been established. Think through your specific use cases and balance the costs with benefits to make sure you are getting the most impact for your dollar.

The world of data storage in Azure is very exciting but can definitely be overwhelming when deciding where to spend your money. Half the battle is knowing the right questions to ask. The other half is knowing how to apply the answers to those questions to an optimal environment setup. Also, it helps to be able to draw on the collective experience of others who have done this before, either within or outside of your own organization, when making these decisions and implementing the solution.