What I fundamentally need to achieve for HADR is a database system that is, as much as possible, always available ('online and accessible'), complete ('no data loss'), and accurate ('no data corruption'). For the purpose of this discussion, I am ignoring backup strategies on the assumption that they exist primarily to roll back databases to a previous known state. I am also ignoring the issue of data consistency because ACID transactional consistency is largely dependent on the type of database and the design of the application. The alternative concept of consistency (as defined by Eric Brewer's CAP theorem) is really focused on the synchronisation of replicated databases, which is irrelevant to me if I am primarily trying to protect a master database.
Databases can become unavailable for planned reasons (patching and maintenance) or unplanned reasons caused by the failure of some system component (from a hard drive to an entire data centre). A high availability infrastructure must be designed to deal with all possible failures, from fault tolerant components to disaster recovery plans. The more that these capabilities are built into a platform, then the easier my job is and the less likely that I will have to compromise on the other system requirements.
Over the last few years, I have designed and deployed cloud, hybrid, and on-premises database solutions, with cloud deployments to both Azure SQL Databases (PaaS) and SQL Servers running in Azure VMs (IaaS). I have had my fair share of challenges when it comes to architecting disaster recovery solutions. The good news is that, throughout this period, I have seen the high availability features in Azure progressively increase and improve. In particular, the geo-restore and geo-replication features of the new Azure SQL Database service tiers that were introduced last year.
At this point I could give you a long list of the relative capabilities of Azure SQL Databases versus SQL Server in Azure VMs. Fascinating though it may be, this information is readily available online. What I want to do is give my personal observations on whether the differences in high availability features are sufficient to have influenced my choice of platform or architecture.
I know that, in the real world, the choice of platform will probably be determined by factors such as cost, security concerns, or development functionality. But I would make the case that satisfying high availability requirements should be one of the first decisions you make, not the last.
The high availability options available for SQL Server running in Azure VMs are fundamentally the same as on-premises instances. For SQL Server on premises, I would always recommend AlwaysOn Availability Groups, and the same goes for SQL Server in Azure. Now that Azure supports virtual network connections, we can place replicas in different regions for disaster recovery purposes.
So, do SQL Servers in an AlwaysOn Availability Group deployed to Azure satisfy my requirements for high availability? Well, yes and no. Replication to a remote data centre in Azure provides disaster recovery from a data centre outage but it requires the commit mode to be asynchronous to allow for network latency. The implication of this is that failover cannot be automatic and there is the risk of data loss (albeit small) between the primary and secondary replicas. The other issue I have is that failover to another data centre should be the last resort because it impacts the entire application. Because Azure VMs do not provide any built-in fault tolerance, we are dependent on replica databases to recover from a wide range of failures.
In comparison, active geo-replication, which is available for Premium Azure SQL Databases, appears to offer similar functionality to AlwaysOn Availability Groups. They both perform asynchronous replication of transactions to other Azure regions, offer the option of readable secondary replicas, and can be manually failed over. It is worth noting that geo-replication uses a buffer to store transactions queued up to be replicated. My experience is that it copes better with a large number of small transactions rather than fewer large transactions which can swamp the buffer.
Where Azure SQL Databases have an advantage is their fault-tolerant architecture, which is based on local replicas in different fault domains with automatic failure detection and failover. This built-in capability automatically protects the database from most failures apart from the loss of the data centre. For fault tolerance with SQL Server, Microsoft recommends putting all data-tier VMs in a dedicated Availability Set that should have at least 2 VMs to ensure at least one is available during planned or unplanned maintenance events. While each Availability Set has 2 fault domains and up to 5 update domains, none of this guarantees uninterrupted availability of your SQL Server VMs in the event of a local failure.
Another factor in favour of Azure SQL Database is its administrative simplicity. Microsoft have made efforts recently to facilitate the provisioning and administration of SQL Server VMs in Azure with automatic backups, automatic patching, and more virtual machine images (including AlwaysOn). However, none of this comes close to the convenience of enabling geo-replication for Azure SQL Databases and the security offered by built-in fault tolerance.
In conclusion then, if there were no other overriding considerations, I would be more comfortable specifying Azure SQL Databases for cloud deployments to meet my business continuity requirements.
As a postscript, I should mention that, while I have enjoyed the opportunity to design, implement, and test high availability solutions, none of them have failed yet, so I can't report on the real world behaviour in an emergency failover.