
Architecting for the Cloud

AWS Best Practices

February 2016


© 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Notices
This document is provided for informational purposes only. It represents AWS’s
current product offerings and practices as of the date of issue of this document,
which are subject to change without notice. Customers are responsible for
making their own independent assessment of the information in this document
and any use of AWS’s products or services, each of which is provided “as is”
without warranty of any kind, whether express or implied. This document does
not create any warranties, representations, contractual commitments, conditions
or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities
and liabilities of AWS to its customers are controlled by AWS agreements, and
this document is not part of, nor does it modify, any agreement between AWS
and its customers.


Contents
Abstract
Introduction
The Cloud Computing Difference
IT Assets Become Programmable Resources
Global, Available, and Unlimited Capacity
Higher Level Managed Services
Security Built In
Design Principles
Scalability
Disposable Resources Instead of Fixed Servers
Automation
Loose Coupling
Services, Not Servers
Databases
Removing Single Points of Failure
Optimize for Cost
Caching
Security
Conclusion
Contributors
Further Reading
Notes


Abstract
This whitepaper is intended for solutions architects and developers who are
building solutions that will be deployed on Amazon Web Services (AWS). It
provides architectural patterns and advice on how to design systems that are
secure, reliable, high performing, and cost efficient. It discusses how to take
advantage of attributes that are specific to the dynamic nature of cloud
computing, such as elasticity and infrastructure automation. It also covers
general patterns, explaining how they are evolving and how they are applied in
the context of cloud computing.

Introduction
Migrating applications to AWS, even without significant changes (an approach
known as “lift and shift”), provides organizations the benefits of a secured and
cost-efficient infrastructure. However, to make the most of the elasticity and
agility possible with cloud computing, engineers will have to evolve their
architectures to take advantage of the AWS capabilities.

For new applications, AWS customers have been discovering cloud-specific IT
architecture patterns, driving even more efficiency and scalability. Those new
architectures can support anything from real-time analytics of Internet-scale data
to applications with unpredictable traffic from thousands of connected Internet
of Things (IoT) or mobile devices.

This paper will highlight the principles to consider whether you are migrating
existing applications to AWS or designing new applications for the cloud.

This whitepaper assumes a basic understanding of AWS services and solutions.
If you are new to AWS, first see the About AWS webpage1.


The Cloud Computing Difference

This section reviews how cloud computing differs from a traditional environment and why those new best practices have emerged.

IT Assets Become Programmable Resources

In a non-cloud environment you would have to provision capacity based on a guess of a theoretical maximum peak. This can result in periods where expensive resources sit idle or in occasions of insufficient capacity. With cloud computing, you can access as much or as little as you need, and dynamically scale to meet actual demand, while only paying for what you use. On AWS, servers, databases, storage, and higher-level application components can be instantiated within seconds. You can treat these as temporary and disposable resources, free from the inflexibility and constraints of a fixed and finite IT infrastructure. This resets the way you approach change management, testing, reliability, and capacity planning.

Global, Available, and Unlimited Capacity

Using the global infrastructure of AWS, you can deploy your application to the AWS Region2 that best meets your requirements (e.g., proximity to your end users, compliance, data residency constraints, or cost). For global applications, you can reduce latency to end users around the world by using the Amazon CloudFront content delivery network. It is also much easier to operate production applications and databases across multiple data centers to achieve high availability and fault tolerance. Together with the virtually unlimited on-demand capacity that is available to AWS customers, you can think differently about how to enable future expansion via your IT architecture.

Higher Level Managed Services

Apart from the compute resources of Amazon Elastic Compute Cloud (Amazon EC2), AWS customers also have access to a broad set of storage, database, analytics, application, and deployment services. Because these services are instantly available to developers, they reduce dependency on in-house specialized skills and allow organizations to deliver new solutions faster. These services are managed by AWS, which can lower operational complexity and cost. AWS managed services are designed for scalability and high availability, so they can reduce risk for your implementations.

Security Built In

In traditional IT, infrastructure security auditing would often be a periodic and manual process. The AWS cloud instead provides governance capabilities that enable continuous monitoring of configuration changes to your IT resources. Since AWS assets are programmable resources, your security policy can be formalized and embedded in the design of your infrastructure. With the ability to spin up temporary environments, security testing can become part of your continuous delivery pipeline. Finally, solutions architects can leverage a plethora of native AWS security and encryption features that can help achieve higher levels of data protection and compliance.

Design Principles

In this section, we provide design patterns and architectural options that can be applied in a wide variety of use cases.

Scalability

Systems that are expected to grow over time need to be built on top of a scalable architecture. Such an architecture can support growth in users, traffic, or data size with no drop in performance. It should provide that scale in a linear manner, where adding extra resources results in at least a proportional increase in ability to serve additional load. Growth should introduce economies of scale, and cost should follow the same dimension that generates business value out of that system. While cloud computing provides virtually unlimited on-demand capacity, your design needs to be able to take advantage of those resources seamlessly. There are generally two ways to scale an IT architecture: vertically and horizontally.

Scaling Vertically

Scaling vertically takes place through an increase in the specifications of an individual resource (e.g., upgrading a server with a larger hard drive or a faster CPU). On Amazon EC2, this can easily be achieved by stopping an instance and resizing it to an instance type that has more CPU, RAM, IO, or networking capabilities. This way of scaling can eventually hit a limit, and it is not always a cost-efficient or highly available approach. However, it is very easy to implement and can be sufficient for many use cases, especially in the short term.
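As a minimal sketch of a vertical scaling operation, the following Python snippet uses the AWS SDK (boto3) to stop an instance, change its instance type, and start it again; the instance ID and target instance type are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # placeholder instance ID

# The instance type can only be changed while the instance is stopped.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Resize to a larger instance type, then start the instance again.
ec2.modify_instance_attribute(InstanceId=instance_id,
                              InstanceType={"Value": "m4.2xlarge"})
ec2.start_instances(InstanceIds=[instance_id])
```

Note that anything held in instance memory is lost across the stop/start cycle, which is one reason this approach is not always the most highly available option.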

Scaling Horizontally

Scaling horizontally takes place through an increase in the number of resources (e.g., adding more hard drives to a storage array or adding more servers to support an application). This is a great way to build Internet-scale applications that leverage the elasticity of cloud computing. Not all architectures are designed to distribute their workload to multiple resources, so let's examine some of the possible scenarios.

Stateless Applications

When users or services interact with an application, they will often perform a series of interactions that form a session. A stateless application is an application that needs no knowledge of previous interactions and stores no session information. An example could be an application that, given the same input, provides the same response to any end user. A stateless application can scale horizontally, since any request can be serviced by any of the available compute resources (e.g., EC2 instances, AWS Lambda functions). With no session data to be shared, you can simply add more compute resources as needed. When that capacity is no longer required, any individual resource can be safely terminated (after running tasks have been drained). Those resources do not need to be aware of the presence of their peers – all that is required is a way to distribute the workload to them.

How to distribute load to multiple nodes

Push model: A popular way to distribute a workload is through the use of a load balancing solution like the Elastic Load Balancing (ELB) service. Elastic Load Balancing routes incoming application requests across multiple EC2 instances. An alternative approach would be to implement a DNS round robin (e.g., with Amazon Route 53). In this case, DNS responses return an IP address from a list of valid hosts in a round robin fashion. While easy to implement, this approach does not always work well with the elasticity of cloud computing. This is because even if you can set low time to live (TTL) values for your DNS records, caching DNS resolvers are outside the control of Amazon Route 53 and might not always respect your settings.

Pull model: Asynchronous event-driven workloads do not require a load balancing solution, because you can implement a pull model instead. In a pull model, tasks that need to be performed or data that needs to be processed can be stored as messages in a queue using Amazon Simple Queue Service (Amazon SQS) or as a streaming data solution like Amazon Kinesis. Multiple compute nodes can then pull and consume those messages, processing them in a distributed fashion.
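As an illustration of the pull model, here is a minimal boto3 worker sketch that long-polls an Amazon SQS queue and deletes each message only after it has been processed; the queue name and the process function are hypothetical.

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="work-queue")["QueueUrl"]  # hypothetical queue

def process(body):
    print("processing:", body)  # placeholder for the real task logic

while True:
    # Long polling (WaitTimeSeconds) reduces empty responses and API calls.
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Delete only after successful processing; otherwise the message
        # becomes visible again after the visibility timeout and is retried.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
```

Because any number of identical workers can run this loop, adding or terminating instances changes only the processing rate, not the correctness of the system.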

Stateless Components

In practice, most applications need to maintain some kind of state information. For example, web applications need to track whether a user is signed in so that they can present personalized content based on previous actions. An automated multi-step process also needs to track previous activity to decide what its next action should be. You can still make a portion of these architectures stateless by not storing anything in the local file system that needs to persist for more than a single request.

For example, web applications can use HTTP cookies to store information about a session at the client's browser (e.g., items in the shopping cart). The browser passes that information back to the server at each subsequent request, so that the application does not need to store it. However, there are two drawbacks with this approach. First, the content of the HTTP cookies can be tampered with at the client side, so you should always treat them as untrusted data that needs to be validated. Second, HTTP cookies are transmitted with every request, which means that you should keep their size to a minimum to avoid unnecessary latency.

Consider storing only a unique session identifier in an HTTP cookie and storing more detailed user session information server-side. Most programming platforms provide a native session management mechanism that works this way; however, this is often stored on the local file system by default, which results in a stateful architecture. A common solution to this problem is to store user session information in a database. Amazon DynamoDB is a great choice due to its scalability, high availability, and durability characteristics. For many platforms, there are open source drop-in replacement libraries that allow you to store native sessions in Amazon DynamoDB3.
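As a sketch of this server-side session pattern, the following assumes a hypothetical DynamoDB table named sessions with a partition key session_id; only the opaque identifier returned by create_session would be stored in the HTTP cookie.

```python
import time
import uuid
import boto3

table = boto3.resource("dynamodb").Table("sessions")  # hypothetical table

def create_session(user_id):
    session_id = str(uuid.uuid4())  # opaque ID to place in the HTTP cookie
    table.put_item(Item={
        "session_id": session_id,
        "user_id": user_id,
        "created_at": int(time.time()),
    })
    return session_id

def load_session(session_id):
    # Returns None if the session does not exist or has been purged.
    return table.get_item(Key={"session_id": session_id}).get("Item")
```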

Other scenarios require storage of larger files (e.g., user uploads or interim results of batch processes). By placing those files in a shared storage layer like Amazon S3 or Amazon Elastic File System (Amazon EFS), you can avoid the introduction of stateful components. Another example is that of a complex multi-step workflow where you need to track the current state of each execution. Amazon Simple Workflow Service (Amazon SWF) can be utilized to centrally store execution history and make these workloads stateless.

Stateful Components

Inevitably, there will be layers of your architecture that you won't turn into stateless components. First, by definition, databases are stateful. (They are covered separately in the Databases section.) In addition, many legacy applications were designed to run on a single server by relying on local compute resources. Other use cases might require client devices to maintain a connection to a specific server for prolonged periods of time. For example, real-time multiplayer gaming must offer multiple players a consistent view of the game world with very low latency. This is much simpler to achieve in a non-distributed implementation where participants are connected to the same server.

You might still be able to scale those components horizontally by distributing load to multiple nodes with "session affinity." In this model, you bind all the transactions of a session to a specific compute resource. You should be aware of the limitations of this model: existing sessions do not directly benefit from the introduction of newly launched compute nodes, and, more importantly, when a node is terminated or becomes unavailable, users bound to it will be disconnected and experience a loss of session-specific data (i.e., anything that is not stored in a shared resource like S3, EFS, or a database).

How to implement session affinity

For HTTP/S traffic, session affinity can be achieved through the "sticky sessions" feature of ELB4. Elastic Load Balancing will attempt to use the same server for that user for the duration of the session.

Another option, if you control the code that runs on the client, is to use client-side load balancing. This adds extra complexity, but can be useful in scenarios where a load balancer does not meet your requirements. For example, you might be using a protocol not supported by ELB, or you might need full control over how users are assigned to servers (e.g., in a gaming scenario you might need to make sure game participants are matched and connect to the same server). In this model, the clients need a way of discovering valid server endpoints to connect to directly. You can use DNS for that, or you can build a simple discovery API to provide that information to the software running on the client. In the absence of a load balancer, the health checking mechanism will also need to be implemented on the client side. You should design your client logic so that when server unavailability is detected, devices reconnect to another server with little disruption for the application.
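For the ELB sticky sessions option mentioned above, a load-balancer-generated cookie policy can be created and attached to a listener with boto3. This is only a sketch of the classic Elastic Load Balancing API, and the load balancer name is a placeholder.

```python
import boto3

elb = boto3.client("elb")  # classic Elastic Load Balancing
lb_name = "my-load-balancer"  # placeholder name

# Create a load-balancer-generated cookie policy, then attach it to the
# port 80 listener so a client keeps hitting the same back end instance.
elb.create_lb_cookie_stickiness_policy(LoadBalancerName=lb_name,
                                       PolicyName="sticky-http",
                                       CookieExpirationPeriod=3600)
elb.set_load_balancer_policies_of_listener(LoadBalancerName=lb_name,
                                           LoadBalancerPort=80,
                                           PolicyNames=["sticky-http"])
```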

Distributed Processing

Use cases that involve the processing of very large amounts of data (e.g., anything that can't be handled by a single compute resource in a timely manner) require a distributed processing approach. By dividing a task and its data into many small fragments of work, you can execute each of them on any of a larger set of available compute resources.

How to implement distributed processing

Offline batch jobs can be horizontally scaled by using a distributed data processing engine like Apache Hadoop. On AWS, you can use the Amazon Elastic MapReduce (Amazon EMR) service to run Hadoop workloads on top of a fleet of EC2 instances without the operational complexity. For real-time processing of streaming data, Amazon Kinesis partitions data into multiple shards that can then be consumed by multiple Amazon EC2 or AWS Lambda resources to achieve scalability. For more information on these types of workloads, you can refer to the "Big Data Analytics Options on AWS" whitepaper5.

Disposable Resources Instead of Fixed Servers

In a traditional infrastructure environment, you have to work with fixed resources due to the upfront cost and lead time of introducing new hardware. This drives practices like manually logging in to servers to configure software or fix issues, hardcoding IP addresses, running tests or processing jobs sequentially, etc. When designing for AWS, you have the opportunity to reset that mindset and take advantage of the dynamically provisioned nature of cloud computing. You can think of servers and other components as temporary resources: launch as many as you need, and use them only for as long as you need them.

Another issue with fixed, long-running servers is that of configuration drift. Changes and software patches applied through time can result in untested and heterogeneous configurations across different environments. This problem can be solved with the immutable infrastructure pattern. With this approach, a server, once launched, is never updated throughout its lifetime. Instead, when there is a problem or a need for an update, the server is replaced with a new one that has the latest configuration. In this way, resources are always in a consistent (and tested) state, and rollbacks become easier to perform.

Instantiating Compute Resources

Whether you are deploying a new environment for testing or increasing the capacity of an existing system to cope with extra load, you will not want to set up new resources manually with their configuration and code. It is important that you make this an automated and repeatable process that avoids long lead times and is not prone to human error. There are a few approaches to achieving this.

Bootstrapping

When you launch an AWS resource like an Amazon EC2 instance or an Amazon Relational Database Service (Amazon RDS) DB instance, you start with a default configuration. You can then execute automated bootstrapping actions, that is, scripts that install software or copy data to bring that resource to a particular state. You can parameterize configuration details that vary between different environments (e.g., production, test) so that the same scripts can be reused without modifications.

Bootstrapping in practice

You can use user data scripts and cloud-init6 directives or AWS OpsWorks lifecycle events7 to automatically set up new EC2 instances. You can use simple scripts or configuration management tools like Chef or Puppet; AWS OpsWorks natively supports Chef recipes and Bash/PowerShell scripts. In addition, through custom scripts and the AWS APIs, or through AWS CloudFormation support for AWS Lambda-backed custom resources8, it is possible to write provisioning logic that acts on almost any AWS resource.
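To make the bootstrapping idea concrete, here is a minimal sketch that launches an EC2 instance with a user data script; cloud-init runs the script on first boot. The AMI ID, bucket, and configuration path are hypothetical.

```python
import boto3

ec2 = boto3.client("ec2")

# A shell script passed as user data and executed on first boot.
user_data = """#!/bin/bash
yum install -y httpd
# Hypothetical bucket holding environment-specific configuration
aws s3 cp s3://my-config-bucket/prod/httpd.conf /etc/httpd/conf/httpd.conf
service httpd start
"""

ec2.run_instances(ImageId="ami-12345678",  # placeholder AMI ID
                  InstanceType="t2.micro",
                  MinCount=1, MaxCount=1,
                  UserData=user_data)
```

The same script can be reused for test and production by parameterizing the configuration path (here the "prod/" prefix).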

Golden Images

Certain AWS resource types, like Amazon EC2 instances, Amazon RDS DB instances, and Amazon Elastic Block Store (Amazon EBS) volumes, can be launched from a golden image: a snapshot of a particular state of that resource. When compared to the bootstrapping approach, a golden image results in faster start times and removes dependencies on configuration services or third-party repositories. This is important in auto-scaled environments where you want to be able to quickly and reliably launch additional resources in response to demand changes.

You can customize an Amazon EC2 instance and then save its configuration by creating an Amazon Machine Image (AMI)9. You can launch as many instances from the AMI as you need, and they will all include the customizations that you've made. Each time you want to change your configuration, you will need to create a new golden image, so you will need a versioning convention to manage your golden images over time. We recommend that you use a script to bootstrap the EC2 instances that you use to create your AMIs. This will give you a flexible way to test and modify those images through time.

Alternatively, if you have an existing on-premises virtualized environment, you can use VM Import/Export from AWS to convert a variety of virtualization formats to an AMI. You can also find and use prebaked shared AMIs provided either by AWS or third parties in the AWS Community AMI catalog or the AWS Marketplace.

While golden images are most commonly used when launching new EC2 instances, they can also be applied to resources like Amazon RDS databases or Amazon EBS volumes. For example, when launching a new test environment, you might want to prepopulate its database by instantiating it from a specific Amazon RDS snapshot, instead of importing the data from a lengthy SQL script.
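As a sketch of capturing a golden image, the following creates an AMI from a configured instance, using a name that encodes a version as suggested above; the instance ID is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2")

# Capture the configured instance as a versioned golden image.
image = ec2.create_image(InstanceId="i-0123456789abcdef0",  # placeholder
                         Name="webserver-v1.4.2",  # follows a versioning convention
                         Description="Golden image for the web tier")
print(image["ImageId"])  # future instances can be launched from this AMI
```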

Containers

Another option popular with developers is Docker, an open-source technology that allows you to build and deploy distributed applications inside software containers. Docker allows you to package a piece of software in a Docker image, a standardized unit for software development that contains everything the software needs to run: code, runtime, system tools, system libraries, etc. AWS Elastic Beanstalk and the Amazon EC2 Container Service (Amazon ECS) support Docker and enable you to deploy and manage multiple Docker containers across a cluster of Amazon EC2 instances.

Hybrid

It is possible to use a combination of the two approaches, where some parts of the configuration are captured in a golden image, while others are configured dynamically through a bootstrapping action.

The line between bootstrapping and golden image

Items that do not change often or that introduce external dependencies will typically be part of your golden image. For example, your web server software, which would otherwise have to be downloaded from a third-party repository each time you launch an instance, is a good candidate.

Items that change often or differ between your various environments can be set up dynamically through bootstrapping actions. For example, if you are deploying new versions of your application frequently, creating a new AMI for each application version might be impractical. You would also not want to hardcode the database hostname configuration in your AMI, because that would differ between the test and production environments. User data or tags can be used to allow you to use more generic AMIs that can be modified at launch time. For example, if you run web servers for various small businesses, they can all use the same AMI and retrieve their content from an Amazon S3 bucket location you specify in the user data at launch.

AWS Elastic Beanstalk follows the hybrid model. It provides preconfigured run time environments (each initiated from its own AMI10), but allows you to run bootstrap actions (through configuration files called .ebextensions11) and configure environment variables to parameterize the environment differences. For a more detailed discussion of the different ways you can manage deployments of new resources, please refer to the Overview of Deployment Options on AWS and Managing Your AWS Infrastructure at Scale whitepapers.

Infrastructure as Code

The application of the principles we have discussed does not have to be limited to the individual resource level. Since AWS assets are programmable, you can apply techniques, practices, and tools from software development to make your whole infrastructure reusable, maintainable, extensible, and testable. AWS CloudFormation templates give developers and systems administrators an easy way to create and manage a collection of related AWS resources, and to provision and update them in an orderly and predictable fashion. You can describe the AWS resources, and any associated dependencies or run time parameters, required to run your application. Your CloudFormation templates can live with your application in your version control repository, allowing architectures to be reused and production environments to be reliably cloned for testing.
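As a minimal infrastructure-as-code sketch, the following builds a CloudFormation template in code and creates a stack from it. A real template would also declare parameters, dependencies, and outputs; the AMI ID is a placeholder.

```python
import json
import boto3

# A deliberately minimal template describing a single EC2 instance.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-12345678",  # placeholder AMI ID
                "InstanceType": "t2.micro",
            },
        }
    },
}

cfn = boto3.client("cloudformation")
cfn.create_stack(StackName="web-tier", TemplateBody=json.dumps(template))
```

Because the template is plain text, it can be code reviewed and version controlled alongside the application it deploys.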

Automation

In a traditional IT infrastructure, you would often have to manually react to a variety of events. When deploying on AWS, there is a lot of opportunity for automation, so that you improve both your system's stability and the efficiency of your organization:

- AWS Elastic Beanstalk12 is the fastest and simplest way to get an application up and running on AWS. Developers can simply upload their application code, and the service automatically handles all the details, such as resource provisioning, load balancing, auto scaling, and monitoring.

- Amazon EC2 Auto recovery13: You can create an Amazon CloudWatch alarm that monitors an Amazon EC2 instance and automatically recovers it if it becomes impaired. A recovered instance is identical to the original instance, including the instance ID, private IP addresses, Elastic IP addresses, and all instance metadata. However, this feature is only available for applicable instance configurations; please refer to the Amazon EC2 documentation for an up-to-date description of those preconditions. In addition, during instance recovery the instance is migrated through an instance reboot, and any data that is in memory is lost.

- Auto Scaling14: With Auto Scaling, you can maintain application availability and scale your Amazon EC2 capacity up or down automatically according to conditions you define. You can use Auto Scaling to help ensure that you are running your desired number of healthy Amazon EC2 instances across multiple Availability Zones. Auto Scaling can also automatically increase the number of Amazon EC2 instances during demand spikes to maintain performance, and decrease capacity during less busy periods to optimize costs.

- Amazon CloudWatch Alarms15: You can create a CloudWatch alarm that sends an Amazon Simple Notification Service (Amazon SNS) message when a particular metric goes beyond a specified threshold for a specified number of periods (see the sketch after this list). Those Amazon SNS messages can automatically kick off the execution of a subscribed AWS Lambda function, enqueue a notification message to an Amazon SQS queue, or perform a POST request to an HTTP/S endpoint.

- Amazon CloudWatch Events16: The CloudWatch service delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules that you can set up in a couple of minutes, you can easily route each type of event to one or more targets: AWS Lambda functions, Amazon Kinesis streams, Amazon SNS topics, etc.

- AWS OpsWorks Lifecycle events17: AWS OpsWorks supports continuous configuration through lifecycle events that automatically update your instances' configuration to adapt to environment changes. These events can be used to trigger Chef recipes on each instance to perform specific configuration tasks. For example, when a new instance is successfully added to a Database server layer, the configure event could trigger a Chef recipe that updates the Application server layer configuration to point to the new database instance.

- AWS Lambda Scheduled events18: These events allow you to create a Lambda function and direct AWS Lambda to execute it on a regular schedule.
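As a sketch of the CloudWatch Alarms item above, the following creates an alarm that notifies a hypothetical SNS topic when average CPU utilization stays above 80% for two consecutive five-minute periods; the group name and topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="web-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],  # placeholder
    Statistic="Average",
    Period=300,           # five-minute periods
    EvaluationPeriods=2,  # sustained for two periods
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
)
```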

Loose Coupling

As application complexity increases, a desirable attribute of an IT system is that it can be broken into smaller, loosely coupled components. This means that IT systems should be designed in a way that reduces interdependencies: a change or a failure in one component should not cascade to other components.

Well-Defined Interfaces

A way to reduce interdependencies in a system is to allow the various components to interact with each other only through specific, technology-agnostic interfaces (e.g., RESTful APIs). In that way, technical implementation detail is hidden so that teams can modify the underlying implementation without affecting other components. As long as those interfaces maintain backwards compatibility, deployments of different components are decoupled. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. It handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management.

Service Discovery

Applications that are deployed as a set of smaller services will depend on the ability of those services to interact with each other. Because each of those services could be running across multiple compute resources, there needs to be a way for each service to be addressed. For example, in a traditional infrastructure, if your front end web service needed to connect with your back end web service, you could hardcode the IP address of the compute resource where this service was running. Although this approach can still work in cloud computing, if those services are meant to be loosely coupled, they should be able to be consumed without prior knowledge of their network topology details. Apart from hiding complexity, this also allows infrastructure details to change at any time. Loose coupling is a crucial element if you want to take advantage of the elasticity of cloud computing, where new resources can be launched or terminated at any point in time. In order to achieve that, you will need some way of implementing service discovery.

How to implement service discovery

For an Amazon EC2 hosted service, a simple way to achieve service discovery is through the Elastic Load Balancing service. Because each load balancer gets its own hostname, you have the ability to consume a service through a stable endpoint. This can be combined with DNS and private Amazon Route 53 zones, so that even the particular load balancer's endpoint can be abstracted and modified at any point in time.

Another option would be to use a service registration and discovery method to allow retrieval of the endpoint IP addresses and port number of any given service. Because service discovery becomes the glue between the components, it is important that it is highly available and reliable. If load balancers are not used, service discovery should also cater for things like health checking. Example implementations include custom solutions using a combination of tags, a highly available database, and custom scripts that call the AWS APIs, or open source tools like Netflix Eureka, Airbnb Synapse, or HashiCorp Consul.
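As a minimal sketch of the load-balancer-based approach, a client can resolve a service name to the stable DNS name of its load balancer; the load balancer name below is hypothetical.

```python
import boto3

elb = boto3.client("elb")

def service_endpoint(load_balancer_name):
    # Resolve a service to its load balancer's stable DNS name.
    lbs = elb.describe_load_balancers(LoadBalancerNames=[load_balancer_name])
    return lbs["LoadBalancerDescriptions"][0]["DNSName"]

backend_url = "http://" + service_endpoint("orders-service")  # hypothetical service
```

In practice you would hide even this lookup behind DNS (e.g., a private Amazon Route 53 alias record), so callers never see the load balancer name at all.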

Asynchronous Integration

Asynchronous integration is another form of loose coupling between services. This model is suitable for any interaction that does not need an immediate response and where an acknowledgement that a request has been registered will suffice. It involves one component that generates events and another that consumes them. The two components do not integrate through direct point-to-point interaction, but usually through an intermediate durable storage layer (e.g., an Amazon SQS queue or a streaming data platform like Amazon Kinesis).

Figure 1: Tight and Loose Coupling

This approach decouples the two components and introduces additional resiliency. So, for example, if a process that is reading messages from the queue fails, messages can still be added to the queue to be processed when the system recovers. It also allows you to protect a less scalable back end service from front end spikes and find the right tradeoff between cost and processing lag. For example, you can decide that you don't need to scale your database to accommodate an occasional peak of write queries, as long as you eventually process those queries asynchronously with some delay. Finally, by moving slow operations off of interactive request paths, you can also improve the end-user experience.

Examples of asynchronous integration

- A front end application inserts jobs in a queue system like Amazon SQS. A back-end system retrieves those jobs and processes them at its own pace.
- An API generates events and pushes them into Amazon Kinesis streams. A back-end application processes these events in batches to create aggregated time-series data stored in a database.
- Multiple heterogeneous systems use Amazon SWF to communicate the flow of work between them without directly interacting with each other.
- AWS Lambda functions can consume events from a variety of AWS sources (e.g., Amazon DynamoDB update streams, Amazon S3 event notifications). In this case, you don't even need to worry about implementing a queuing or other asynchronous integration method, because the service handles this for you.

Graceful Failure

Another way to increase loose coupling is to build applications in such a way that they handle component failure in a graceful manner. You can identify ways to reduce the impact to your end users and increase your ability to make progress on your offline procedures, even in the event of some component failure.

Graceful failure in practice

A request that fails can be retried with an exponential backoff and jitter strategy19, as sketched below, or it could be stored in a queue for later processing. For front-end interfaces, it might be possible to provide alternative or cached content instead of failing completely when, for example, your database server becomes unavailable. The Amazon Route 53 DNS failover feature also gives you the ability to monitor your website and automatically route your visitors to a backup site if your primary site becomes unavailable. You can host your backup site as a static website on Amazon S3 or as a separate dynamic environment.
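Here is a minimal sketch of the retry strategy named above: capped exponential backoff with full jitter, where the random sleep spreads out retries from many concurrent clients.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base=0.1, cap=5.0):
    """Retry an operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Sleep a random time up to the exponentially growing ceiling.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```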

Services, Not Servers

Developing, managing, and operating applications, especially at scale, requires a wide variety of underlying technology components. With traditional IT infrastructure, companies would have to build and operate all those components. AWS offers a broad set of compute, storage, database, analytics, application, and deployment services that help organizations move faster and lower IT costs. Architectures that do not leverage that breadth (e.g., if they use only Amazon EC2) might not be making the most of cloud computing, and might be missing an opportunity to increase developer productivity and operational efficiency.

Managed Services

On AWS, there is a set of services that provide building blocks that developers can consume to power their applications. These managed services include databases, machine learning, analytics, queuing, search, email, notifications, and more. For example, with the Amazon Simple Queue Service (Amazon SQS) you can offload the administrative burden of operating and scaling a highly available messaging cluster, while paying a low price for only what you use. Not only that, Amazon SQS is inherently scalable. The same applies to Amazon S3, where you can store as much data as you want and access it when needed, without having to think about capacity, hard disk configurations, replication, etc. Amazon S3 can also serve static assets of a web or mobile app. There are many other examples, such as Amazon CloudFront for content delivery, ELB for load balancing, Amazon DynamoDB for NoSQL databases, Amazon CloudSearch for search workloads, Amazon Elastic Transcoder for video encoding, Amazon Simple Email Service (Amazon SES) for sending and receiving emails, and more20.

Serverless Architectures

Another approach that can reduce the operational complexity of running applications is that of serverless architectures. It is possible to build both event-driven and synchronous services for mobile, web, analytics, and the Internet of Things (IoT) without managing any server infrastructure. These architectures can reduce costs because you are not paying for underutilized servers, nor are you provisioning redundant infrastructure to implement high availability. You can upload your code to the AWS Lambda compute service, and the service can run the code on your behalf using AWS infrastructure. With AWS Lambda, you are charged for every 100 ms your code executes and for the number of times your code is triggered. By using Amazon API Gateway, you can develop virtually infinitely scalable synchronous APIs powered by AWS Lambda. When combined with Amazon S3 for serving static content assets, this pattern can deliver a complete web application. For more details on this type of architecture, please refer to the "AWS Serverless Multi-Tier Architectures" whitepaper21.
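As a sketch of how little code a serverless service needs, here is a complete AWS Lambda handler in Python; the event field and response shape are illustrative rather than prescribed.

```python
import json

def handler(event, context):
    # 'event' carries the request payload; 'context' carries runtime metadata.
    name = event.get("name", "world")  # hypothetical request field
    return {"statusCode": 200,
            "body": json.dumps({"message": "Hello, " + name})}
```

There is no server, process manager, or scaling policy to define: AWS Lambda runs one handler invocation per event and scales out automatically.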

When it comes to mobile apps, there is one more way to reduce the surface of a server-based infrastructure. You can utilize Amazon Cognito, so that you don't have to manage a back end solution to handle user authentication, network state, storage, and sync. Amazon Cognito generates unique identifiers for your users. Those can be referenced in your access policies to enable or restrict access to other AWS resources on a per-user basis. Amazon Cognito provides temporary AWS credentials to your users, allowing the mobile application running on the device to interact directly with AWS Identity and Access Management (IAM)-protected AWS services. For example, using IAM you could restrict access to a folder within an Amazon S3 bucket to a particular end user.

For IoT applications, organizations have traditionally had to provision, operate, scale, and maintain their own servers as device gateways to handle the communication between connected devices and their services. AWS IoT provides a fully managed device gateway that scales automatically with your usage, without any operational overhead for you.

Databases

With traditional IT infrastructure, organizations were often limited to the database and storage technologies they could use. There could be constraints based on licensing costs and the ability to support diverse database engines. On AWS, these constraints are removed by managed database services that offer enterprise performance at open source cost. As a result, it is not uncommon for applications to run on top of a polyglot data layer, choosing the right technology for each workload.

Determining the right database technology for each workload

The following questions can help you decide which solutions to include in your architecture:

- Is this a read-heavy, write-heavy, or balanced workload? How many reads and writes per second are you going to need? How will those values change if the number of users increases?
- How much data will you need to store, and for how long? How quickly do you foresee this will grow? Is there an upper limit in the foreseeable future? What is the size of each object (average, min, max)? How are these objects going to be accessed?

- What are the requirements in terms of durability of data? Is this data store going to be your "source of truth"?
- What are your latency requirements? How many concurrent users do you need to support?
- What is your data model, and how are you going to query the data? Are your queries relational in nature (e.g., JOINs between multiple tables)? Could you denormalize your schema to create flatter data structures that are easier to scale?
- What kind of functionality do you require? Do you need strong integrity controls, or are you looking for more flexibility (e.g., schema-less data stores)? Do you require sophisticated reporting or search capabilities? Are your developers more familiar with relational databases than NoSQL?

This section discusses the different categories of database technologies for you to consider.

Relational Databases

Relational databases (often called RDBMS or SQL databases) normalize data into well-defined tabular structures known as tables, which consist of rows and columns. They provide a powerful query language, flexible indexing capabilities, strong integrity controls, and the ability to combine data from multiple tables in a fast and efficient manner. Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud.

Scalability

Relational databases can scale vertically (e.g., by upgrading to a larger Amazon RDS DB instance or adding more and faster storage). For read-heavy applications, you can also scale horizontally beyond the capacity constraints of a single DB instance by creating one or more read replicas. In addition, consider the use of Amazon RDS for Aurora, a database engine designed to deliver much higher throughput than standard MySQL running on the same hardware.
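As a sketch of adding read capacity, the following creates an asynchronous read replica of a hypothetical primary DB instance with boto3.

```python
import boto3

rds = boto3.client("rds")

# Create an asynchronous read replica of a hypothetical primary instance.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="appdb-replica-1",
    SourceDBInstanceIdentifier="appdb-primary",
)
```

Read-only queries that can tolerate replication lag can then be directed at the replica's endpoint.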

How to take advantage of read replicas

Read replicas are separate database instances that are replicated asynchronously. As a result, they are subject to replication lag and might be missing some of the latest transactions. Application designers need to consider which queries have tolerance to slightly stale data. Those queries can be executed on a read replica, while the rest should run on the primary node. Read replicas also cannot accept any write queries.

Relational database workloads that need to scale their write capacity beyond the constraints of a single DB instance require a different approach, called data partitioning or sharding. With this model, data is split across multiple database schemas, each running in its own autonomous primary DB instance. Although Amazon RDS removes the operational overhead of running those instances, sharding introduces some complexity to the application. The application's data access layer will need to be modified to be aware of how data is split, so that it can direct queries to the right instance. In addition, any schema changes will have to be performed across multiple database schemas, so it is worth investing some effort to automate this process.

High Availability

For any production relational database, we recommend the use of the Amazon RDS Multi-AZ deployment feature, which creates a synchronously replicated standby instance in a different Availability Zone (AZ). In case of failure of the primary node, Amazon RDS performs an automatic failover to the standby without the need for manual administrative intervention. When a failover is performed, there is a short period during which the primary node is not accessible. Resilient applications can be designed for graceful failure by offering reduced functionality (e.g., read-only mode by utilizing read replicas).

Anti-Patterns

If your application primarily indexes and queries data with no need for joins or complex transactions (especially if you expect a write throughput beyond the constraints of a single instance), consider a NoSQL database instead. If you have large binary files (audio, video, and image), it will be more efficient to store the actual files in Amazon Simple Storage Service (Amazon S3) and only hold the metadata for the files in your database. For more detailed relational database best practices, refer to the Amazon RDS documentation22.

NoSQL Databases

NoSQL is a term used to describe databases that trade some of the query and transaction capabilities of relational databases for a more flexible data model that seamlessly scales horizontally. NoSQL databases utilize a variety of data models, including graphs, key-value pairs, and JSON documents, and are widely recognized for ease of development, scalable performance, high availability, and resilience. Amazon DynamoDB is a fast and flexible NoSQL database23 service for applications that need consistent, single-digit millisecond latency at any scale. It is a fully managed cloud database and supports both document and key-value store models.

Scalability

NoSQL database engines will typically perform data partitioning and replication to scale both the reads and the writes in a horizontal fashion. They do this transparently, without the need to implement the data partitioning logic in the data access layer of your application. Amazon DynamoDB in particular manages table partitioning for you automatically, adding new partitions as your table grows in size or as read- and write-provisioned capacity changes. To make the most of Amazon DynamoDB scalability when designing your application, refer to the Amazon DynamoDB best practices24 section of the documentation.

High Availability

The Amazon DynamoDB service synchronously replicates data across three facilities in an AWS region to provide fault tolerance in the event of a server failure or Availability Zone disruption.

Anti-Patterns

If your schema cannot be denormalized, and your application requires joins or complex transactions, consider a relational database instead. If you have large binary files (audio, video, and image), consider storing the files in Amazon S3 and storing the metadata for the files in your database. When migrating or evaluating which workloads to migrate from a relational database to DynamoDB, you can refer to the "Best Practices for Migrating from RDBMS to DynamoDB"25 whitepaper for more guidance.

Data Warehouse

A data warehouse is a specialized type of relational database, optimized for analysis and reporting of large amounts of data. It can be used to combine transactional data from disparate sources (e.g., user behavior in a web application, data from your finance and billing system, CRM, etc.), making it available for analysis and decision-making. Traditionally, setting up, running, and scaling a data warehouse has been complicated and expensive. On AWS, you can leverage Amazon Redshift, a managed data warehouse service that is designed to operate at less than a tenth the cost of traditional solutions.

Scalability

Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing (MPP), columnar data storage, and targeted data compression encoding schemes. It is particularly suited to analytic and reporting workloads against very large data sets. The Amazon Redshift MPP architecture enables you to increase performance by increasing the number of nodes in your data warehouse cluster.

High Availability

Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. We recommend that you deploy production workloads in multi-node clusters, in which data written to a node is automatically replicated to other nodes within the cluster. Data is also continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster, automatically re-replicates data from failed drives, and replaces nodes as necessary. Refer to the Amazon Redshift FAQ26 for more information.

Anti-Patterns

Because Amazon Redshift is a SQL-based relational database management system (RDBMS), it is compatible with other RDBMS applications and business intelligence tools. Although Amazon Redshift provides the functionality of a typical RDBMS, including online transaction processing (OLTP) functions, it is not designed for these workloads. If you expect a high-concurrency workload that generally involves reading and writing all of the columns for a small number of records at a time, you should instead consider using Amazon RDS or Amazon DynamoDB.

Search

Applications that require sophisticated search functionality will typically outgrow the capabilities of relational or NoSQL databases. A search service can be used to index and search both structured and free text formats, and can support functionality that is not available in other databases, such as customizable result ranking, faceting for filtering, synonyms, and stemming.

On AWS, you have the choice between Amazon CloudSearch and Amazon Elasticsearch Service (Amazon ES). On the one hand, Amazon CloudSearch is a managed service that requires little configuration and will scale automatically. On the other hand, Amazon ES offers an open source API and gives you more control over the configuration details. Amazon ES has also evolved to become a lot more than just a search solution. It is often used as an analytics engine for use cases such as log analytics, real-time application monitoring, and click stream analytics.

Scalability

Both Amazon CloudSearch and Amazon ES use data partitioning and replication to scale horizontally. The difference is that with Amazon CloudSearch you do not need to worry about the number of partitions and replicas you will need, because the service handles all that automatically for you.

High Availability

Both services provide features that store data redundantly across Availability Zones. For details, please refer to each service's documentation.

Removing Single Points of Failure

Production systems typically come with defined or implicit objectives in terms of uptime. A system is highly available when it can withstand the failure of an individual component or multiple components (e.g., hard disks, servers, network links). You can think about ways to automate recovery and reduce disruption at every layer of your architecture. This section discusses high availability design patterns. For more details on the subject, refer to the "Building Fault Tolerant Applications" whitepaper27.

Introducing Redundancy

Single points of failure can be removed by introducing redundancy, which means having multiple resources for the same task. Redundancy can be implemented in either standby or active mode. In standby redundancy, when a resource fails, functionality is recovered on a secondary resource using a process called failover. The failover will typically require some time before it completes, and during that period the resource remains unavailable. The secondary resource can either be launched automatically only when needed (to reduce cost), or it can already be running idle (to accelerate failover and minimize disruption). Standby redundancy is often used for stateful components such as relational databases.

In active redundancy, requests are distributed to multiple redundant compute resources, and when one of them fails, the rest can simply absorb a larger share of the workload. Compared to standby redundancy, active redundancy can achieve better utilization and affect a smaller population when there is a failure.

Detect Failure

You should aim to build as much automation as possible in both detecting and reacting to failure. You can use services like ELB and Amazon Route 53 to configure health checks and mask failure by routing traffic to healthy endpoints. In addition, Auto Scaling can be configured to automatically replace unhealthy nodes. You can also replace unhealthy nodes using the Amazon EC2 auto-recovery28 feature or services such as AWS OpsWorks and AWS Elastic Beanstalk. It won't be possible to predict every possible failure scenario on day one, so make sure you collect enough logs and metrics to understand normal system behavior. After you understand that, you will be able to set up alarms for manual intervention or automated response.

Designing good health checks

Configuring the right health checks for your application will determine your ability to respond correctly and promptly to a variety of failure scenarios. Specifying the wrong health check can actually reduce your application's availability.

In a typical three-tier application, you configure health checks on the Elastic Load Balancing service. Design your health checks with the objective of reliably assessing the health of the back end nodes. A simple TCP health check would not detect the scenario where the instance itself is healthy but the web server process has crashed. Instead, you should assess whether the web server can return an HTTP 200 response for some simple request.
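As a sketch of such an application-level health check, the following configures a classic load balancer to probe an HTTP path instead of a bare TCP port, so a crashed web server process is detected even when the instance itself is up; the load balancer name and path are placeholders.

```python
import boto3

elb = boto3.client("elb")  # classic Elastic Load Balancing

elb.configure_health_check(
    LoadBalancerName="web-lb",  # placeholder name
    HealthCheck={
        "Target": "HTTP:80/healthcheck",  # must return HTTP 200 to pass
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 3,
    },
)
```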

At this layer, however, it might not be a good idea to configure what is called a deep health check, which is a test that depends on other layers of your application to be successful (this could result in false positives). For example, if your health check also assesses whether the instance can connect to a back end database, you risk marking all of your web servers as unhealthy when that database node becomes briefly unavailable. A layered approach is often the best. A deep health check might be appropriate to implement at the Amazon Route 53 level. By running a more holistic check that determines whether that environment is actually able to provide the required functionality, you can configure Amazon Route 53 to fail over to a static version of your website until your database is up and running again.

Durable Data Storage

Your application and your users will create and maintain a variety of data. It is crucial that your architecture protects both data availability and integrity. Data replication is the technique that introduces redundant copies of data. It can help horizontally scale read capacity, but it also increases data durability and availability. Replication can take place in a few different modes.

Synchronous replication only acknowledges a transaction after it has been durably stored in both the primary location and its replicas. It is ideal for protecting the integrity of data in the event of a failure of the primary node. Synchronous replication can also scale read capacity for queries that require the most up-to-date data (strong consistency). The drawback of synchronous replication is that the primary node is coupled to the replicas: a transaction can't be acknowledged before all replicas have performed the write. This can compromise performance and availability, especially in topologies that run across unreliable or high-latency network connections. For the same reason, it is not recommended to maintain many synchronous replicas.

Durability: No replacement for backups

Regardless of the durability of your solution, replication is no replacement for backups. Synchronous replication will redundantly store all updates to your data, even those that are the results of software bugs or human error. However, particularly for objects stored on Amazon S3, you can use versioning29 to preserve, retrieve, and restore any of their versions. With versioning, you can recover from both unintended user actions and application failures.
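Enabling versioning is a one-time bucket configuration; here is a minimal sketch with a placeholder bucket name.

```python
import boto3

s3 = boto3.client("s3")

# Once enabled, overwritten and deleted objects remain recoverable
# as previous versions.
s3.put_bucket_versioning(Bucket="my-data-bucket",  # placeholder bucket
                         VersioningConfiguration={"Status": "Enabled"})
```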

Asynchronous replication decouples the primary node from its replicas at the expense of introducing replication lag. This means that changes performed on the primary node are not immediately reflected on its replicas. Asynchronous replicas are used to horizontally scale the system's read capacity for queries that can tolerate that replication lag. Asynchronous replication can also be used to increase data durability, when some loss of recent transactions can be tolerated during a failover. For example, you can maintain an asynchronous replica of a database in a separate AWS region as a disaster recovery solution.

Quorum-based replication combines synchronous and asynchronous replication to overcome the challenges of large-scale distributed database systems. Replication to multiple nodes can be managed by defining a minimum number of nodes that must participate in a successful write operation. A detailed discussion of distributed data stores is beyond the scope of this document. You can refer to the Amazon Dynamo whitepaper30 to learn more about a core set of principles that can result in an ultra-scalable and highly reliable database system.

Data durability in practice

It is important to understand where each technology you are using fits in these data storage models. Their behavior during various failover or backup/restore scenarios should align to your recovery point objective (RPO) and your recovery time objective (RTO). You need to ascertain how much data you expect to lose and how quickly you would be able to resume operations. For example, the Redis engine for Amazon ElastiCache supports replication with automatic failover, but the Redis engine's replication is asynchronous. During a failover, it is highly likely that some recent transactions would be lost. However, Amazon RDS, with its Multi-AZ feature, is designed to provide synchronous replication to keep data on the standby node up-to-date with the primary.

Automated Multi-Data Center Resilience

Business-critical applications will also need protection against disruption scenarios affecting a lot more than just a single disk, server, or rack. In a traditional infrastructure, you would typically have a disaster recovery plan to allow a failover to a distant second data center, should there be a major disruption in the primary one. Because of the long distance between the two data centers, latency makes it impractical to maintain synchronous cross-data center copies of the data. As a result, a failover will most certainly lead to data loss or a very costly data recovery process. It becomes a risky and not always sufficiently tested procedure. Nevertheless, this is a model that provides excellent protection against a low-probability but huge-impact risk, such as a natural catastrophe that brings down your whole infrastructure for a long time.

A shorter interruption in a data center is a more likely scenario. For short disruptions, because the duration of the failure isn't predicted to be long, the choice to perform a failover is a difficult one and generally will be avoided. On AWS it is possible to adopt a simpler, more efficient protection from this type of failure. Each AWS region contains multiple distinct locations called Availability Zones. Each Availability Zone is engineered to be isolated from failures in other Availability Zones. An Availability Zone is a data center, and in some cases, an Availability Zone consists of multiple data centers. Availability Zones within a region provide inexpensive, low-latency network connectivity to other zones in the same region. This allows you to replicate your data across data centers in a synchronous manner so that failover can be automated and be transparent for your users.

It is also possible to implement active redundancy so that you don't pay for idle resources. For example, a fleet of application servers can be distributed across multiple Availability Zones and be attached to the Elastic Load Balancing service (ELB). When the Amazon EC2 instances of a particular Availability Zone fail their health checks, ELB will stop sending traffic to those nodes. When combined with Auto Scaling, the number of healthy nodes can automatically be rebalanced to the other Availability Zones with no manual intervention.

In fact, many of the higher level services on AWS are inherently designed according to the Multi-AZ principle. For example, Amazon RDS provides high availability and automatic failover support for DB instances using Multi-AZ deployments, while with Amazon S3 and Amazon DynamoDB your data is redundantly stored across multiple facilities.
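For example, requesting a synchronously replicated standby in another Availability Zone takes a single flag when creating an Amazon RDS instance. In this hedged boto3 sketch, the identifier, credentials, and sizing are all placeholders:

import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="mydb",               # placeholder identifier
    Engine="mysql",
    DBInstanceClass="db.m4.large",
    AllocatedStorage=100,                      # in GiB
    MasterUsername="admin",
    MasterUserPassword="replace-with-secret",  # placeholder, not a real secret
    # Provision a synchronously replicated standby in another Availability
    # Zone; failover to the standby is automatic and transparent to clients.
    MultiAZ=True,
)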

Fault Isolation and Traditional Horizontal Scaling
Though the active redundancy pattern is great for balancing traffic and handling instance or Availability Zone disruptions, it is not sufficient if there is something harmful about the requests themselves. For example, there could be scenarios where every instance is affected. If a particular request happens to trigger a bug that causes the system to fail over, then the caller may trigger a cascading failure by repeatedly trying the same request against all instances.

Shuffle Sharding
One fault-isolating improvement you can make to traditional horizontal scaling is called sharding. Similar to the technique traditionally used with data storage systems, instead of spreading traffic from all customers across every node, you can group the instances into shards. For example, if you have eight instances for your service, you might create four shards of two instances each (two instances for some redundancy within each shard) and distribute each customer to a specific shard. In this way, you are able to reduce the impact on customers in direct proportion to the number of shards you have. However, there will still be affected customers, so the key is to make the client fault tolerant. If the client can try every endpoint in a set of sharded resources until one succeeds, you get a dramatic improvement. This technique is called shuffle sharding and is described in more detail in the relevant blog post32.
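To make the idea concrete, the following illustrative Python sketch (the instance names are hypothetical, and the send function is a stand-in for a real network call) combines both halves of the technique: each customer is deterministically assigned a small "shuffle shard" of two instances out of a fleet of eight, and the fault-tolerant client tries every endpoint in its shard until one succeeds:

import hashlib
import itertools

INSTANCES = ["instance-{}".format(i) for i in range(8)]  # hypothetical fleet
SHARD_SIZE = 2

# All possible two-instance shards (28 combinations for eight instances).
SHARDS = list(itertools.combinations(INSTANCES, SHARD_SIZE))

def shard_for(customer_id):
    # Hash the customer ID so the same customer always maps to the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def send(endpoint, request):
    # Placeholder standing in for the real call to one service instance.
    return "{} handled {}".format(endpoint, request)

def call_with_failover(customer_id, request):
    # Fault-tolerant client: try each endpoint in the shard until one succeeds.
    last_error = None
    for endpoint in shard_for(customer_id):
        try:
            return send(endpoint, request)
        except Exception as error:
            last_error = error
    raise last_error

print(call_with_failover("customer-42", "GET /"))

A request that harms the instances it touches is then contained to one shard, rather than cascading across the whole fleet.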

Optimize for Cost
Just by moving existing architectures into the cloud, organizations can reduce capital expenses and drive savings as a result of the AWS economies of scale. By iterating and making use of more AWS capabilities, there is further opportunity to create cost-optimized cloud architectures. This section discusses the main principles of optimizing for cost with AWS cloud computing.

Right Sizing
AWS offers a broad range of resource types and configurations to suit a plethora of use cases. For example, services like Amazon EC2, Amazon RDS, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES) give you a lot of choice of instance types. In some cases, you should select the cheapest type that suits your workload's requirements. In other cases, using fewer instances of a larger instance type might result in lower total cost or better performance. You should benchmark and select the right instance type depending on how your workload utilizes CPU, RAM, network, storage size, and I/O.

Similarly, you can reduce cost by selecting the right storage solution for your needs. For example, Amazon S3 offers a variety of storage classes, including Standard, Reduced Redundancy, and Standard-Infrequent Access. Other services, such as Amazon EC2, Amazon RDS, and Amazon ES, support different Amazon Elastic Block Store (Amazon EBS) volume types (magnetic, general purpose SSD, provisioned IOPS SSD) that you should evaluate.

Continuous monitoring and tagging
Cost optimization is an iterative process. Your application and its usage will evolve through time. In addition, AWS iterates frequently and regularly releases new options. AWS provides tools33 to help you identify those cost-saving opportunities and keep your resources right-sized. To make those tools' outcomes easy to interpret, you should define and implement a tagging policy for your AWS resources. You can make tagging a part of your build process and automate it with AWS management tools like AWS Elastic Beanstalk and AWS OpsWorks. You can also use the managed rules provided by AWS Config to assess whether specific tags are applied to your resources or not.

Elasticity
Another way you can save money with AWS is by taking advantage of the platform's elasticity. Plan to implement Auto Scaling for as many Amazon EC2 workloads as possible, so that you horizontally scale up when needed and scale down automatically to reduce your spend when you don't need all that capacity anymore. In addition, you can automate turning off non-production workloads when not in use34. Ultimately, consider which compute workloads you could implement on AWS Lambda so that you never pay for idle or redundant resources. Where possible, replace Amazon EC2 workloads with AWS managed services that either don't require you to take any capacity decisions (e.g., ELB, Amazon CloudFront, Amazon SQS, Amazon Kinesis Firehose, AWS Lambda, Amazon SES, Amazon CloudSearch) or enable you to easily modify capacity as and when needed (e.g., Amazon DynamoDB, Amazon RDS, Amazon Elasticsearch Service).
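As a sketch of that kind of automation (the Environment=dev tag is a hypothetical convention, and the handler would typically run on a schedule, for example as an AWS Lambda function), the following boto3 code stops all running instances that are tagged as non-production:

import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Find running instances tagged as non-production (hypothetical tag).
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]

    if instance_ids:
        # Stopped instances no longer accrue compute charges
        # (attached EBS storage is still billed).
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}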

Take Advantage of the Variety of Purchasing Options
Amazon EC2 On-Demand instance pricing gives you maximum flexibility with no long-term commitments. There are two more ways to pay for Amazon EC2 instances that can help you reduce spend: Reserved Instances and Spot Instances.

Reserved Capacity
Amazon EC2 Reserved Instances allow you to reserve Amazon EC2 computing capacity in exchange for a significantly discounted hourly rate compared to On-Demand instance pricing. This is ideal for applications with predictable minimum capacity requirements. You can take advantage of tools like the AWS Trusted Advisor or Amazon EC2 usage reports to identify the compute resources that you use most of the time and that you should consider reserving. Depending on your Reserved Instance purchases, the discounts will be reflected in the monthly bill. Note that there is technically no difference between an On-Demand EC2 instance and a Reserved Instance; the difference lies in the way you pay for instances that you reserve. After you have purchased reserved capacity, you can use the Reserved Instance utilization reports to ensure you are still making the most of your reserved capacity.

Tip: You should not commit to Reserved Instance purchases before sufficiently benchmarking your application in production. Reserved capacity options exist for other services as well (e.g., Amazon Redshift, Amazon RDS, Amazon DynamoDB, and Amazon CloudFront).

Spot Instances
For less steady workloads, you can consider the use of Spot Instances. Amazon EC2 Spot Instances allow you to bid on spare Amazon EC2 computing capacity. Since Spot Instances are often available at a discount compared to On-Demand pricing, you can significantly reduce the cost of running your applications. Spot Instances are ideal for workloads that have flexible start and end times. Your Spot Instance is launched when your bid exceeds the current Spot market price, and it will continue to run until you choose to terminate it or until the Spot market price exceeds your bid. If the Spot market price increases above your bid price, your instance will be terminated automatically and you will not be charged for the partial hour that your instance has run.

Spot Instances are great for workloads that have tolerance to interruption. However, you can also use Spot Instances when you require more predictable availability:

Bidding strategy: You are charged the Spot market price (not your bid price) for as long as the Spot Instance runs. Your bidding strategy could be to bid much higher than that, with the expectation that even if the market price occasionally spikes, you would still be saving a lot of cost in the long term.

Mix with On-Demand: Consider mixing Reserved, On-Demand, and Spot Instances to combine a predictable minimum capacity with "opportunistic" access to additional compute resources, depending on the spot market price. This is a great way to improve throughput or application performance.
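As an illustrative sketch (the AMI ID, instance type, count, and bid price below are all placeholders), a Spot bid can be placed programmatically with boto3:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.10",                # maximum bid in USD per hour (placeholder)
    InstanceCount=2,
    LaunchSpecification={
        "ImageId": "ami-12345678",   # placeholder AMI ID
        "InstanceType": "m4.large",
        # Key pair, security groups, user data, etc. would also go here.
    },
)

for request in response["SpotInstanceRequests"]:
    print(request["SpotInstanceRequestId"], request["State"])

The instances launch only when the bid exceeds the current Spot market price, so the requesting code must tolerate both delayed starts and interruptions.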

Spot pricing best practices
Spot Instances allow you to bid on multiple instance types simultaneously. Because prices fluctuate independently for each instance type in an Availability Zone, you can often get more compute capacity for the same price if your app is designed to be flexible about instance types. Test your application on different instance types when possible, and bid on all instance types that meet your requirements to further reduce costs.

Spot Blocks for Defined-Duration Workloads: You can also bid for fixed-duration Spot Instances. These have different hourly pricing but allow you to specify a duration requirement. If your bid is accepted, your instance will continue to run until you choose to terminate it, or until the specified duration has ended. In this case, your instance will not be terminated due to changes in the Spot price (but of course, you should still design for fault tolerance, because a Spot Instance can still fail like any other EC2 instance).

Caching
Caching is a technique that stores previously calculated data for future use. This technique is used to improve application performance and increase the cost efficiency of an implementation. It can be applied at multiple layers of an IT architecture.

Application Data Caching
Applications can be designed so that they store and retrieve information from fast, managed, in-memory caches. Cached information may include the results of I/O-intensive database queries or the outcome of computationally intensive processing. When a result set is not found in the cache, the application can calculate it or retrieve it from a database and store it in the cache for subsequent requests. When a result set is found in the cache, the application can use that result directly, which improves latency for end users and reduces load on back-end systems. Your application can control how long each cached item will remain valid. In some cases, even a few seconds of caching for very popular objects can result in a dramatic decrease in the load on your database.

Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud. It supports two open-source in-memory caching engines: Memcached and Redis.
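The following Python sketch illustrates this cache-aside pattern against a Redis endpoint such as one provided by Amazon ElastiCache; the endpoint, key format, TTL, and database function are hypothetical, and it assumes the open-source redis client library is installed:

import json

import redis

# Hypothetical ElastiCache Redis endpoint.
cache = redis.StrictRedis(host="my-cache.abc123.use1.cache.amazonaws.com",
                          port=6379)

TTL_SECONDS = 60  # how long each cached item remains valid

def query_database(product_id):
    # Placeholder standing in for an I/O-intensive database query.
    return {"id": product_id, "name": "example"}

def get_product(product_id):
    key = "product:{}".format(product_id)
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: use the stored result directly.
        return json.loads(cached.decode("utf-8"))
    # Cache miss: retrieve from the database, then store for later requests.
    result = query_database(product_id)
    cache.setex(key, TTL_SECONDS, json.dumps(result))
    return result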

For more details on how to select the right engine for your workload, as well as a description of common ElastiCache design patterns, please refer to the "Performance at Scale with Amazon ElastiCache"35 whitepaper.

Edge Caching
Copies of static content (e.g., images, CSS files, streaming of pre-recorded video) and dynamic content (e.g., HTML responses, live video) can be cached at Amazon CloudFront, which is a content delivery network (CDN) consisting of multiple edge locations around the world. Edge caching allows content to be served by infrastructure that is closer to viewers, lowering latency and giving you the high, sustained data transfer rates needed to deliver large popular objects to end users at scale.

Requests for your content are carried back to Amazon S3 or your origin servers. If the origin is running on AWS, then requests will be transferred over optimized network paths for a more reliable and consistent experience. Amazon CloudFront can be used to deliver your entire website, including non-cacheable content. The benefit in that case is that Amazon CloudFront reuses existing connections between the Amazon CloudFront edge and the origin server, reducing connection setup latency for each origin request. Other connection optimizations are also applied to avoid Internet bottlenecks and fully utilize available bandwidth between the edge location and the viewer. This means that Amazon CloudFront can speed up the delivery of your dynamic content and provide your viewers with a consistent and reliable, yet personalized, experience when navigating your web application. Amazon CloudFront also applies the same performance benefits to upload requests as those applied to requests for downloading dynamic content.
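As a small illustration, when Amazon S3 is the origin you can influence how long edge locations keep a copy of each object by setting a Cache-Control header at upload time (the bucket name, key, and max-age below are placeholders):

import boto3

s3 = boto3.client("s3")

with open("logo.png", "rb") as image:
    s3.put_object(
        Bucket="my-static-assets",             # placeholder bucket name
        Key="images/logo.png",
        Body=image,
        ContentType="image/png",
        # CloudFront edge locations honor this header when caching the object.
        CacheControl="public, max-age=86400",  # cache for one day
    )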

Security
Most of the security tools and techniques that you might already be familiar with in a traditional IT infrastructure can be used in the cloud. At the same time, AWS allows you to improve your security in a variety of ways. AWS is a platform that allows you to formalize the design of security controls in the platform itself. It simplifies system use for administrators and those running IT, and makes your environment much easier to audit in a continuous manner. This section gives you a high-level overview of AWS security best practices. For a detailed view on how you can achieve a high level of security governance, please refer to the "Security at Scale: Governance in AWS"36 and the "AWS Security Best Practices"37 whitepapers.

Utilize AWS Features for Defense in Depth
AWS provides a wealth of features that can help architects build defense in depth. Starting at the network level, you can build a VPC topology that isolates parts of the infrastructure through the use of subnets, security groups, and routing controls. Services like AWS WAF, a web application firewall, can help protect your web applications from SQL injection and other vulnerabilities in your application code. For access control, you can use IAM to define a granular set of policies and assign them to users, groups, and AWS resources. Finally, the AWS platform offers a breadth of options for protecting data, whether it is in transit or at rest with encryption38. An exhaustive list of all security features is beyond the scope of this document, but you can learn more by visiting the AWS Security page39.

Offload Security Responsibility to AWS
AWS operates under a shared security responsibility model, where AWS is responsible for the security of the underlying cloud infrastructure and you are responsible for securing the workloads you deploy in AWS. This way, you can reduce the scope of your responsibility and focus on your core competencies through the use of AWS managed services. For example, when you use services such as Amazon RDS, Amazon ElastiCache, Amazon CloudSearch, etc., security patches become the responsibility of AWS. This not only reduces operational overhead for your team, but it could also reduce your exposure to vulnerabilities.

Reduce Privileged Access
When you treat servers as programmable resources, you can capitalize on that for benefits in the security space as well. When you can change your servers whenever you need to, you can eliminate the need for guest operating system access to production environments. If an instance experiences an issue, you can automatically or manually terminate and replace it. However, before you replace instances, you should collect and centrally store logs on your instances that can help you recreate issues in your development environment and deploy them as fixes through your continuous deployment process. This is particularly important in an elastic compute environment where servers are temporary. You can use Amazon CloudWatch Logs to collect this information. Where you don't have access and you need it, you can implement just-in-time access by using an API action to open up the network for management only when necessary.
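As an illustrative sketch of such a just-in-time mechanism (the security group ID and operator address are placeholders), an approval workflow could open SSH ingress with one API action and revoke it as soon as the maintenance window closes:

import boto3

ec2 = boto3.client("ec2")

SECURITY_GROUP_ID = "sg-0123456789abcdef0"  # placeholder security group
OPERATOR_CIDR = "203.0.113.10/32"           # placeholder admin source address

def open_ssh_for_maintenance():
    # Invoked only after the access request has been approved.
    ec2.authorize_security_group_ingress(
        GroupId=SECURITY_GROUP_ID,
        IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=OPERATOR_CIDR,
    )

def close_ssh_after_maintenance():
    # Returns the environment to its locked-down state.
    ec2.revoke_security_group_ingress(
        GroupId=SECURITY_GROUP_ID,
        IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=OPERATOR_CIDR,
    )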

You can integrate these requests for access with your ticketing system, so that access requests are tracked and dynamically handled only after approval.

Another common source of security risk is the use of service accounts. In a traditional environment, service accounts would often be assigned long-term credentials stored in a configuration file. On AWS, you can instead use IAM roles to grant permissions to applications running on Amazon EC2 instances through the use of short-term credentials. Those credentials are automatically distributed and rotated. For mobile applications, the use of Amazon Cognito allows client devices to get controlled access to AWS resources via temporary tokens. For AWS Management Console users, you can similarly provide federated access through temporary tokens instead of creating IAM users in your AWS account. In that way, an employee who leaves your organization and is removed from your organization's identity directory will also lose access to your AWS account.
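To illustrate the difference for service accounts, the following hedged sketch shows application code relying on an IAM role attached to its EC2 instance; no long-term credentials appear in code or configuration, because the SDK obtains and refreshes short-term credentials automatically (the bucket name is a placeholder):

import boto3

# No access keys are configured anywhere: when this runs on an EC2 instance
# with an IAM role attached, the SDK retrieves short-term credentials from
# the instance metadata service and rotates them automatically.
s3 = boto3.client("s3")

response = s3.list_objects_v2(Bucket="my-application-data")  # placeholder
for obj in response.get("Contents", []):
    print(obj["Key"])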

Security as Code
Traditional security frameworks, regulations, and organizational policies define security requirements related to things such as firewall rules, network access controls, internal/external subnets, and operating system hardening. You can implement these in an AWS environment as well, but you now have the opportunity to capture them all in a script that defines a "Golden Environment." This means you can create an AWS CloudFormation script that captures your security policy and reliably deploys it. Security best practices can now be reused among multiple projects and become part of your continuous integration pipeline. You can perform security testing as part of your release cycle, and automatically discover application gaps and drift from your security policy.

Additionally, for greater control and security, AWS CloudFormation templates can be imported as "products" into AWS Service Catalog40. This enables centralized management of resources to support consistent governance, security, and compliance requirements, while enabling users to quickly deploy only the approved IT services they need. You apply IAM permissions to control who can view and modify your products, and you define constraints to restrict the ways that specific AWS resources can be deployed for a product.

Real-Time Auditing
Testing and auditing your environment is key to moving fast while staying safe. Traditional approaches that involve periodic (and often manual or sample-based) checks are not sufficient, especially in agile environments where change is constant. On AWS, it is possible to implement continuous monitoring and automation of controls to minimize exposure to security risks. Services like AWS Config, Amazon Inspector, and AWS Trusted Advisor continually monitor for compliance or vulnerabilities, giving you a clear overview of which IT resources are in compliance, and which are not. With AWS Config rules you will also know if some component was out of compliance even for a brief period of time, making both point-in-time and period-in-time audits very effective. You can implement extensive logging for your applications (using Amazon CloudWatch Logs) and for the actual AWS API calls by enabling AWS CloudTrail41. AWS CloudTrail is a web service that records API calls to supported AWS services in your AWS account and delivers a log file to your Amazon S3 bucket. Logs can then be stored in an immutable manner and automatically processed to either notify or even take action on your behalf, protecting your organization from non-compliance. You can use AWS Lambda, Amazon EMR, the Amazon Elasticsearch Service, or third-party tools from the AWS Marketplace to scan logs to detect things like unused permissions, overuse of privileged accounts, usage of keys, anomalous logins, policy violations, and system abuse.
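As a brief illustration (the trail and bucket names are placeholders, and the bucket must already carry a policy that allows CloudTrail to write to it), an audit trail can be created and started with two boto3 calls:

import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.create_trail(
    Name="account-audit-trail",          # placeholder trail name
    S3BucketName="my-cloudtrail-logs",   # placeholder, pre-configured bucket
    IncludeGlobalServiceEvents=True,
)

# A new trail does not record anything until logging is explicitly started.
cloudtrail.start_logging(Name="account-audit-trail")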

Conclusion
This whitepaper provides guidance for designing architectures that make the most of the AWS platform by covering important principles and design patterns: from how to select the right database for your application, to architecting applications that can scale horizontally and with high availability. The topic of cloud computing architectures is broad and continuously evolving, and as each use case is unique, you will have to evaluate how those patterns can be applied to your implementation. Going forward, you can stay updated through the wealth of material available on the AWS website and the training and certification offerings of AWS.

Contributors
The following individual contributed to this document:
• Andreas Chatzakis, Manager, AWS Solutions Architecture

Further Reading
For more architecture examples, you can refer to the AWS Architecture Center42. For applications already running on AWS, we recommend you also go through the "AWS Well-Architected Framework" whitepaper43, which complements this document by providing a structured evaluation model. Finally, to validate your operational readiness, you can also refer to the comprehensive AWS Operational Checklist44.

Notes
1 About AWS: https://aws.amazon.com/about-aws/
2 The AWS global infrastructure: https://aws.amazon.com/about-aws/global-infrastructure/
3 For example, there is the PHP Amazon DynamoDB session handler (http://docs.aws.amazon.com/aws-sdk-php/v3/guide/service/dynamodb-session-handler.html) and the Tomcat Amazon DynamoDB session handler (http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/java-dg-tomcat-session-manager.html)
4 ELB sticky sessions: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-sticky-sessions.html
5 "Big Data Analytics Options on AWS" whitepaper: https://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf
6 Bootstrapping with user data scripts and cloud-init: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
7 AWS OpsWorks lifecycle events: http://docs.aws.amazon.com/opsworks/latest/userguide/workingcookbook-events.html
8 AWS Lambda-backed custom CloudFormation resources: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-custom-resources-lambda.html
9 Amazon Machine Images: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html
10 AMIs for the AWS Elastic Beanstalk runtimes: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/concepts.platforms.html
11 AWS Elastic Beanstalk customization with configuration files: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/ebextensions.html
12 AWS Elastic Beanstalk: https://aws.amazon.com/elasticbeanstalk/

13 Amazon EC2 auto recovery: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
14 Auto Scaling: https://aws.amazon.com/autoscaling/
15 Amazon CloudWatch alarms: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/AlarmThatSendsEmail.html
16 Amazon CloudWatch events: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchEvents.html
17 AWS OpsWorks lifecycle events: http://docs.aws.amazon.com/opsworks/latest/userguide/workingcookbook-events.html
18 AWS Lambda scheduled events: http://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html
19 Exponential Backoff and Jitter: http://www.awsarchitectureblog.com/2015/03/backoff.html
20 You can see the full list of AWS products here: http://aws.amazon.com/products/
21 "AWS Serverless Multi-Tier Architectures" whitepaper: https://d0.awsstatic.com/whitepapers/AWS_Serverless_Multi-Tier_Archiectures.pdf
22 Best Practices for Amazon RDS: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html
23 NoSQL databases on AWS: https://aws.amazon.com/nosql/
24 Best practices for Amazon DynamoDB: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html
25 Best practices for migrating from RDBMS to Amazon DynamoDB: https://d0.awsstatic.com/whitepapers/migration-best-practices-rdbms-to-dynamodb.pdf
26 Amazon Redshift FAQ: https://aws.amazon.com/redshift/faqs/

27 "Building Fault-Tolerant Applications" whitepaper: https://d0.awsstatic.com/whitepapers/aws-building-fault-tolerant-applications.pdf
28 Recover your instance: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
29 Amazon S3 versioning: http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
30 "Dynamo: Amazon's Highly Available Key-value Store": http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
31 "Using Amazon Web Services for Disaster Recovery": https://media.amazonwebservices.com/AWS_Disaster_Recovery.pdf
32 Shuffle sharding: http://www.awsarchitectureblog.com/2014/04/shuffle-sharding.html
33 Monitoring Your Usage and Costs: http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/monitoring-costs.html
34 Create alarms that stop or terminate an instance: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/UsingAlarmActions.html
35 "Performance at Scale with Amazon ElastiCache" whitepaper: https://d0.awsstatic.com/whitepapers/performance-at-scale-with-amazon-elasticache.pdf
36 "Security at Scale: Governance in AWS": https://d0.awsstatic.com/whitepapers/compliance/AWS_Security_at_Scale_Governance_in_AWS_Whitepaper.pdf
37 "AWS Security Best Practices": https://d0.awsstatic.com/whitepapers/aws-security-best-practices.pdf
38 Securing data at rest with encryption: https://d0.awsstatic.com/whitepapers/aws-securing-data-at-rest-with-encryption.pdf
39 AWS Security: http://aws.amazon.com/security
40 AWS Service Catalog: https://aws.amazon.com/servicecatalog/

41 "Security at Scale: Logging in AWS": https://d0.awsstatic.com/whitepapers/compliance/AWS_Security_at_Scale_Logging_in_AWS_Whitepaper.pdf
42 AWS Architecture Center: https://aws.amazon.com/architecture
43 "AWS Well-Architected Framework": http://d0.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf
44 AWS Operational Checklist: http://media.amazonwebservices.com/AWS_Operational_Checklists.pdf