
Scaling Amazon ElastiCache for Redis with Online Cluster Resizing

Posted on October 27, 2022

Categories: AWS


For customers who need to process enormous volumes of data at high speed, faster than conventional databases can handle, Amazon ElastiCache epitomizes much of what makes fast data a reality. Redis is one of the most popular NoSQL key-value stores thanks to its speed, simplicity, and in-memory design. With its microsecond latency, Redis has become the de facto choice for caching, and its support for complex data structures (such as lists, sets, and sorted sets) enables leaderboards, in-memory analytics, messaging, and other in-memory use cases.

We introduced Amazon ElastiCache for Redis, a fully managed in-memory data store with microsecond latency, as part of our AWS fast data journey four years ago. We have since enabled Redis cluster functionality, allowing customers to run faster and more scalable workloads. In its cluster configuration, ElastiCache for Redis supports up to 15 shards and up to 6.1 TB of in-memory capacity in a single cluster. But while the cluster configuration allowed larger deployments with excellent performance, resizing a cluster required a backup and restore, which meant taking the cluster offline.

Last month, we introduced online cluster resizing in ElastiCache. ElastiCache for Redis can now add and remove shards from a running cluster. You can dynamically scale your Redis cluster workloads out, or in, to accommodate changes in demand. ElastiCache resizes the cluster by adding or removing shards and redistributing keys evenly across the new shard configuration, all while the cluster stays online and continues serving requests. No application changes are required.


Having closely followed ElastiCache's development over the years, I'm thrilled that it is now used by thousands of businesses, including Airbnb, Hulu, McDonald's, Adobe, Expedia, Hudl, Grab, Duolingo, PBS, HERE, and Ubisoft. ElastiCache for Redis is remarkably simple to use and delivers reliable microsecond latencies. Our customers use it in their most demanding applications, serving millions of users. Whether the industry is gaming, ad tech, travel, or retail, speed always wins.

As Redis use cases expand, customers have asked for more flexibility to grow and shrink their workloads dynamically while staying highly available and continuing to handle incoming traffic. I recently spoke with a few gaming companies, for example, about their need for speed and flexibility in scaling, both out and in.

They deal with workloads that vary greatly with a game's popularity or with seasonal factors such as upcoming holidays. When a new title launches and players swarm to it, spiking the leaderboards, gaming platforms want to scale the cluster out while it stays online. When demand declines, they need to scale the environment back in to control costs, all while remaining online and serving incoming requests.

Our retail customers have reported similar difficulties managing the workload surges and dips around major sales events. Several customers have also described their experience self-managing Redis workloads and attempting online cluster resizing where offline resizing was not an option. Although open-source Redis provides primitives to help reshard a cluster, they are insufficient on their own: customers must cope with failures during cluster resizing and bear the cost of self-management.

Failures can leave the cluster inoperable, risking data loss and extended downtime until the cluster can be manually repaired.

At Amazon, our top priority has always been to innovate on behalf of our customers. With online cluster resizing, our goal was a fully managed cluster resharding experience that supports both scale-out and scale-in while maintaining open-source compatibility.

Delivering on that promise of greater elasticity, with the flexibility to resize workloads while preserving availability, consistency, and performance, has been an exciting journey of thought leadership and innovation.

Inside the engine

The key space in a Redis cluster is divided into 16,384 slots, and the slots are distributed among shards. When a cluster is resized, these slots must be redistributed. Redis cluster clients can automatically detect and track changes in slot assignments, so applications using Redis pick up the new configuration.
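To make the slot mapping concrete, here is a minimal sketch of how a key maps to one of the 16,384 slots, following the open-source Redis cluster specification (CRC16 of the key, modulo 16384, honoring hash tags). The function names are illustrative and not part of any ElastiCache API:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM variant), the checksum Redis cluster uses for keys."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to its cluster slot. If the key contains a non-empty
    hash tag like {user1000}, only the tag is hashed, so related keys
    can be pinned to the same slot (and thus the same shard)."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # ignore empty tags like "{}"
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

Keys that share a hash tag land in the same slot, which is how multi-key operations remain possible in a clustered deployment.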

On the server side, the slots must be moved explicitly. Cluster resizing is challenging because read and write requests must continue to be served against the same dataset while the number of shards changes and data is in motion. A scale-out resharding operation involves adding shards, creating a plan for redistributing the slots, migrating the slots, and then transferring slot ownership between shards once the slots have moved.
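With ElastiCache, those steps are managed for you; you request a target shard configuration and the service handles the migration. As a sketch, assuming the standard boto3 ElastiCache client, a scale-out request might look like this (the replication group ID in the usage note is a placeholder, and the shard-count check reflects the 15-shard limit mentioned above):

```python
MAX_SHARDS = 15  # per-cluster shard limit for ElastiCache for Redis cluster mode

def validate_shard_count(count: int) -> int:
    """Reject shard counts outside the supported range before calling the API."""
    if not 1 <= count <= MAX_SHARDS:
        raise ValueError(f"shard count must be between 1 and {MAX_SHARDS}, got {count}")
    return count

def scale_out(replication_group_id: str, target_shards: int) -> None:
    """Trigger an online scale-out to the given number of shards."""
    import boto3  # imported here so the sketch can be read without boto3 installed
    client = boto3.client("elasticache")
    client.modify_replication_group_shard_configuration(
        ReplicationGroupId=replication_group_id,
        NodeGroupCount=validate_shard_count(target_shards),
        ApplyImmediately=True,  # start resharding now; the cluster stays online
    )
```

For example, `scale_out("my-redis-cluster", 10)` would ask ElastiCache to reshard the cluster to 10 shards while it continues serving traffic.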

Atomic slot movement

ElastiCache’s online cluster resizing replaces open-source Redis’ atomic key migration with atomic slot migration. As each key is migrated to the target shard, ElastiCache keeps a duplicate of it, so the source shard retains ownership of the key until the entire slot and all of its keys have been migrated. This offers several advantages:

The dataset never experiences a slot split, since every key in a migrating slot remains owned by the source shard. This makes it simple to handle operations such as transactions, Lua scripts, and multi-key commands, ensuring complete API coverage while the cluster is being resharded.

While slot migration is in progress, the source shard continues to serve queries for keys whose slots have already been moved. This shrinks the client redirection window, reducing latency during migration operations.
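In open-source Redis, clients learn about moved keys through -MOVED and -ASK redirection replies, whose text format is defined in the Redis cluster specification; shrinking the window in which such redirects occur is what reduces latency here. A minimal sketch of parsing one (the helper name is illustrative):

```python
def parse_redirect(error: str):
    """Parse a Redis cluster redirection reply such as
    'MOVED 3999 127.0.0.1:6381' or 'ASK 3999 127.0.0.1:6381'.
    Returns (kind, slot, host, port), or None if it is not a redirect.
    A cluster-aware client uses MOVED to refresh its slot map and retry
    on the new shard; ASK means retry just this one request elsewhere."""
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    return parts[0], int(parts[1]), host, int(port)
```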

Key ownership remains with the source shard, ensuring that replicas within the source shard have the most recent key state. In the event of a failover, the replicas can continue executing commands with the latest state, without losing any data.

The system is more reliable. Because the source shard retains full ownership of the keys, it is simple to recover from failures, such as the target shard running out of memory, that might otherwise halt the migration.

Along the way, we have made various other improvements. A significant one is the use of multiple threads at the source shard: slot migration runs on a thread separate from the main I/O thread, so key migration no longer blocks I/O on the source, guaranteeing no impact on availability. To maintain data consistency, all data modifications made during the migration are asynchronously replicated to the destination shard.

Online cluster resizing is a feature our ElastiCache for Redis users will love. You can scale your ElastiCache for Redis 3.2.10 cluster out or in with no changes on the application side. See Online Cluster Resizing for further details on setting up clustered Redis and resharding a cluster.
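Scale-in uses the same managed API: you name the shards (node groups) to keep, and ElastiCache migrates slots off the rest before removing them. A hedged sketch, assuming the standard boto3 client and ElastiCache's four-digit node group IDs (the shard numbers and any group name you pass are placeholders):

```python
def node_group_id(n: int) -> str:
    """Format an ElastiCache node group (shard) ID, e.g. 1 -> "0001"."""
    return f"{n:04d}"

def scale_in(replication_group_id: str, shard_numbers: list[int]) -> None:
    """Scale in by naming which shards to retain; slots on the other
    shards are migrated off before those shards are removed."""
    import boto3  # imported here so the sketch can be read without boto3 installed
    client = boto3.client("elasticache")
    client.modify_replication_group_shard_configuration(
        ReplicationGroupId=replication_group_id,
        NodeGroupCount=len(shard_numbers),
        NodeGroupsToRetain=[node_group_id(n) for n in shard_numbers],
        ApplyImmediately=True,  # reshard now; the cluster keeps serving requests
    )
```

For example, `scale_in("my-redis-cluster", [1, 2, 3])` would keep shards 0001 through 0003 and retire the rest, online.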