Two areas that need closer consideration in enterprise cloud


There are many postings on the risks and challenges of cloud migration.

One recent article, posted by Glen Robinson (@GlenPRobinson), had an excellent summary of the key risks and challenges to a broader migration to the cloud. In it he cites security, resilience, reputation and regulation as key considerations, followed by financial, licensing and talent from a commercial perspective.

I’d like to draw out two of these for closer inspection.


Security is of course a huge topic in its own right and will often be the first and most significant reason an organisation does not deploy into the public cloud. To help resolve this, the concept of a Shared Responsibility Model was created. Whilst AWS uses this term directly, it is not in any way exclusive to AWS. In a bimodal context (see previous blog for a broader view of this) there are two distinct areas of responsibility:

  • The public cloud provider (e.g. AWS) is responsible for the security of the cloud
  • The customer is responsible for the security and compliance in the cloud

Or, in other words, there should be no grey areas between the two.

Using the AWS Shared Responsibility Model as an excellent reference point, ‘of the cloud’ means that the cloud provider secures the following areas, which are, of course, linked directly to the services it provides. For AWS, this can be broken into two core elements:

  • Compute, storage, network, database
  • Regions, availability zones, edge locations

What this means is that there will be a range of services offered for each of these components, and within that a very full suite of security measures and controls. The other key point is that these are provider services, so you get only what is provided.

The second dimension, ‘in the cloud’, means securing the following areas, which by their very nature and context are owned by the customer. It is therefore wholly the responsibility of the customer to secure them:

  • Customer data
  • Platforms, applications, IDAM

Good examples of these would include deploying web application protection and using AWS Identity and Access Management (IAM), AWS CloudTrail (API tracking) and Amazon CloudWatch (alert triggering).

  • Operating system, network and firewall configuration

Examples here would include configuring AWS Security Groups (SG), using Amazon Machine Images (AMIs), including those that are pre-hardened, and deploying VPCs.

  • Client, server and network side encryption

The most important take-away here is that, whichever cloud provider an organisation contracts with, it is critical to understand the Shared Responsibility Model and thereafter to ensure that each element within the system has had the right treatment applied.
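One way to picture "each element has had the right treatment applied" is a simple coverage check. The sketch below is only an illustration: the area names come from the lists above, while the `controls_applied` record and the function name are hypothetical and not part of any AWS tooling.

```python
from enum import Enum


class Owner(Enum):
    PROVIDER = "provider"  # security "of" the cloud
    CUSTOMER = "customer"  # security "in" the cloud


# Areas taken from the two lists above; the mapping itself is illustrative.
RESPONSIBILITY = {
    "compute": Owner.PROVIDER,
    "storage": Owner.PROVIDER,
    "network": Owner.PROVIDER,
    "database": Owner.PROVIDER,
    "regions": Owner.PROVIDER,
    "availability zones": Owner.PROVIDER,
    "edge locations": Owner.PROVIDER,
    "customer data": Owner.CUSTOMER,
    "platforms": Owner.CUSTOMER,
    "applications": Owner.CUSTOMER,
    "identity and access management": Owner.CUSTOMER,
    "operating system": Owner.CUSTOMER,
    "network and firewall configuration": Owner.CUSTOMER,
    "encryption": Owner.CUSTOMER,
}


def untreated_customer_areas(controls_applied):
    """Return customer-owned areas with no control recorded against them.

    controls_applied is a hypothetical record: area name -> list of controls.
    """
    return sorted(
        area
        for area, owner in RESPONSIBILITY.items()
        if owner is Owner.CUSTOMER and not controls_applied.get(area)
    )


if __name__ == "__main__":
    controls = {
        "customer data": ["encryption at rest"],
        "applications": ["web application firewall"],
    }
    for area in untreated_customer_areas(controls):
        print("no control recorded for:", area)
```

Anything the function returns is a customer-side area that nobody has yet treated, which is exactly the kind of grey area the model is meant to remove.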


Resilience has been highlighted by some recent very high profile outages from both AWS and Azure.  This could be coined as “removing single points of failure”.

Firstly, and logically, when designing an application it is imperative that the requirements for uptime are understood and endorsed from a business perspective. Defining a requirement for “high availability” means the application can withstand the failure of individual or multiple components. The two most common terms and measures used in the industry are recovery time objective (RTO) and recovery point objective (RPO), the first being the time allowed to restore the process and the second being the maximum acceptable window of data loss. Getting an agreed, business-focused target for these is imperative to allow the solution to be designed correctly. Furthermore, it is necessary to ensure the business doesn’t simply say “it needs to be 100% fault tolerant” without the right context, which is business process paired with financial impact.
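To give the business conversation some numbers, an availability target can be turned into the downtime it actually permits. This is a back-of-the-envelope sketch; the function names are my own, and the RPO check assumes the simplest case where the worst-case data loss equals the interval between copies.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours per year a service may be down while still meeting the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def meets_rpo(copy_interval_minutes, rpo_minutes):
    """Assumes worst-case data loss equals the time between data copies."""
    return copy_interval_minutes <= rpo_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% allows about {hours:.2f} hours of downtime a year")
    print("hourly copies meet a 15 minute RPO:", meets_rpo(60, 15))
```

A 99.9% target, for instance, allows roughly 8.76 hours of downtime a year, which is a far easier figure to weigh against financial impact than a row of nines.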

Using AWS as an example, each of the key building blocks has functionality that provides levels of resiliency and redundancy. This aligns with the Shared Responsibility Model above, so understanding the deployment model for, say, an Amazon Virtual Private Cloud (VPC) and Elastic Load Balancing is key. Of course, comparing a quoted level of availability with reality can be interesting… (note that the often-quoted S3 figure of 99.999999999% refers to durability, not uptime; S3 Standard is designed for 99.99% availability).

So we see that there are a number of mechanisms for removing points of failure. Note these are not unique to a public cloud provider and have been the bread and butter of on-premise architectures for a long time … forever, in fact.

Key Areas of Focus in Removing Points of Failure:

Introduce redundancy: there are two major types, standby and active. Standby is where a spare resource takes over through a failover process, whereas active is where the workload is automatically distributed across all resources. Typically, standby is significantly easier to design and cheaper to deploy, so there is always a cost/benefit trade-off to be made.
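The benefit side of that trade-off can be roughed out with some simple arithmetic. The sketch below assumes components fail independently of one another, which real systems rarely do perfectly, and it models active redundancy only (standby would add failover time on top). The function names are illustrative.

```python
def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(availability, copies):
    """Availability when any one of several identical copies is enough.

    Assumes the copies fail independently of each other.
    """
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    single = 0.99
    print(f"two 99% components in series: {serial_availability([single, single]):.4f}")
    print(f"two 99% components in parallel: {redundant_availability(single, 2):.4f}")
```

Chaining components lowers the overall figure, while duplicating one raises it, which is why the single points of failure are the first place to spend money.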

Failure detection: automation is very much the hot topic here, as it allows not only the detection of failures but also the reaction to them. This is a recognition that failures will happen. The more you know about them, the better you can trend (even predict) them and the better prepared you will be. An interesting, more extreme perspective on this is the Netflix model, which is not only to detect failures but to create them as well, to ensure applications are deployed with the right level of resiliency (Chaos Monkey). I love the Netflix model as it introduces a culture that accepts infrastructure failure rather than one that is surprised by it.
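As a minimal sketch of detection paired with an automated reaction, the class below counts consecutive failed health checks and fires a callback once a threshold is reached. The class, its parameters and the threshold of three are assumptions made for illustration; they do not represent any particular AWS or Netflix tool.

```python
class FailureDetector:
    """Runs a health check and reacts after repeated consecutive failures."""

    def __init__(self, check, on_failure, threshold=3):
        self.check = check            # callable returning True when healthy
        self.on_failure = on_failure  # callable run when the threshold is hit
        self.threshold = threshold
        self.consecutive_failures = 0
        self.history = []             # kept so failures can be trended later

    def probe(self):
        """Run one health check and return True if the target is healthy."""
        try:
            healthy = bool(self.check())
        except Exception:
            # A check that raises is treated the same as an unhealthy result.
            healthy = False
        self.history.append(healthy)
        if healthy:
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        # React once, at the moment the threshold is reached.
        if self.consecutive_failures == self.threshold:
            self.on_failure()
        return False


if __name__ == "__main__":
    simulated = iter([True, False, False, False])
    detector = FailureDetector(
        check=lambda: next(simulated),
        on_failure=lambda: print("threshold reached, starting failover"),
    )
    for _ in range(4):
        detector.probe()
```

Requiring several consecutive failures before reacting is one simple way to avoid failing over on a momentary blip, and the recorded history is the raw material for trending.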

Data storage: at the core of every application will be the data. Therefore, techniques such as data replication, which introduce redundant copies of the data automatically, create fewer points of failure. Typically, there are two types of replication: synchronous and asynchronous. The key difference between the two is whether the application has to wait for the data to be written to all places (synchronous) or continues without waiting (asynchronous). Clearly there are significant factors in this choice because of potential latency issues. As always, when there is seemingly only a choice of two, a third option has been developed: “quorum based”, which is a hybrid of the first two. This choice is especially useful for large-scale distributed databases. Of course, replication is no substitute for actual data backup, which should be part of the overall disaster recovery plan.
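The quorum idea can be shown with a toy in-memory model. This is a sketch under stated assumptions, not how any particular database implements it: the `QuorumStore` class is hypothetical, everything runs in a single process, and a write simply goes to every replica that is currently up. The rule it illustrates is that the write quorum plus the read quorum must exceed the number of replicas, so every read overlaps the most recent write.

```python
class QuorumStore:
    """Toy quorum-replicated key/value store (single process, in memory)."""

    def __init__(self, replicas=3, write_quorum=2, read_quorum=2):
        if write_quorum + read_quorum <= replicas:
            raise ValueError("need W + R > N so reads overlap the latest write")
        self.replicas = [dict() for _ in range(replicas)]
        self.up = [True] * replicas  # flip entries to simulate outages
        self.w = write_quorum
        self.r = read_quorum
        self.version = 0

    def _live(self):
        return [i for i, ok in enumerate(self.up) if ok]

    def write(self, key, value):
        """Write to every live replica; fail if fewer than W are reachable."""
        live = self._live()
        if len(live) < self.w:
            raise RuntimeError("write quorum not available")
        self.version += 1
        for i in live:
            self.replicas[i][key] = (self.version, value)
        return len(live)

    def read(self, key):
        """Ask R live replicas and return the value with the newest version."""
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum not available")
        answers = [
            self.replicas[i][key] for i in live[: self.r] if key in self.replicas[i]
        ]
        if not answers:
            raise KeyError(key)
        return max(answers, key=lambda answer: answer[0])[1]


if __name__ == "__main__":
    store = QuorumStore()
    store.write("order", "created")
    store.up[0] = False            # replica 0 misses the next write
    store.write("order", "paid")
    store.up[0] = True             # replica 0 returns with stale data
    store.up[2] = False
    print(store.read("order"))     # the newer version wins
```

The application waits only for a quorum rather than for every copy, which sits between the latency of fully synchronous replication and the data-loss risk of asynchronous replication.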

Multi-data centre resilience: traditionally the hardest decision is when to perform a failover, especially when there is a short disruption and the length of the disruption is not known. In AWS, because each Region contains separate Availability Zones, data can be replicated across data centres synchronously, so failover can be automated and transparent to end users. There is quite a significant cost to this, so it comes down to the very first point of understanding the business imperative for resilience.

Hybrid: a much-discussed subject but a clear option for removing single points of failure. Hybrid in this context means that applications are not necessarily deployed in both the public and private infrastructures, but there is an option to deploy across them in the event of failure. Clearly this brings in a lot of additional parameters, some as significant as ensuring that the workloads can actually be operated in both environments and that the two are kept in continuous sync. On the face of it, this would appear to be a very costly option, as all the economies to be gained from moving to a public cloud provider could be eradicated. But it is definitely an option to be considered.

Additional References:

AWS Disaster Recovery

Glen Robinson: Bang goes the Cloud..or does it

Cloud Best Practices

Building fault tolerant applications

Stephen Orban on Hybrid


This post first appeared in Neil’s blog.

Neil Fagan

Neil Fagan is CTO of the UK Government Security and Intelligence Account in Global Infrastructure Services. He is an enterprise architecture expert, leading teams of architects who work on solutions from initial concept through delivery and support.

