@mosmeh
Created October 7, 2025 12:21

Keyfront design documentation

This document outlines the design principles and architecture of Keyfront, a distributed proxy that provides a better clustering solution for Redis/Valkey.

Motivation

Target use cases

Keyfront focuses on the use of Redis/Valkey as a large distributed cache, rather than as a primary database. In this use case, the following properties are desirable:

  • Scalability: The system should be able to scale out to handle large amounts of data and traffic.
  • High availability: The system should be able to continue operating even if some nodes fail.
  • Strong consistency: The system should not serve stale or incorrect data.
  • Low latency: The system should respond to requests quickly.
  • Manageability: The system should be easy to operate and maintain.

But the following properties and features are not required:

  • Durability: Losing cached data is acceptable.
  • Replication: Replication is not required for high availability as long as the system can quickly fail over to other nodes.
  • Backup and restore: Losing cached data is acceptable, so backup and restore is not required.

Issues with Redis/Valkey Cluster

Redis/Valkey has a built-in clustering solution called Redis/Valkey Cluster. However, it comes with some limitations, such as:

  • Scalability issues with large clusters
    • This stems mainly from the O(n^2) communication between nodes caused by the gossip protocol.
  • Lack of a strongly-consistent topology
    • The cluster topology (which nodes are in the cluster, and which slots they own) is spread through the gossip protocol and is not strongly consistent.
  • Manageability issues
    • Redis/Valkey Cluster lacks a central metadata store. This means configurations, ACLs, functions, etc. are not shared across nodes, and operators have to manually configure every node.

Those issues are described in more detail in valkey-io/valkey#384.

Additionally, Redis/Valkey Cluster has fundamental limitations on the consistency guarantees it can provide:

  • For shards to be highly available, they must have replicas. When a shard has only a single primary node and it fails, the shard becomes unavailable. When a shard has replicas, if the primary node fails, one of the replicas is promoted to primary.
  • The replication from primary to replicas is asynchronous, so if the primary fails, some writes can be lost.

Thus, achieving both high availability and strong consistency at the same time is not possible with Redis/Valkey Cluster.

Issues with existing clustering solutions

There are existing proxy solutions for Redis/Valkey, such as twemproxy. They are deployed in front of a fleet of Redis/Valkey nodes, and they route requests to the correct node based on the key, as shown in the diagram below:

┌────────┐ ┌────────┐ ┌────────┐
│ Client │ │ Client │ │ Client │
└────┬───┘ └────┬───┘ └────┬───┘
     └──────────┼──────────┘
            ┌───┴───┐
            │ Proxy │
            └───┬───┘
     ┌──────────┼──────────┐
 ┌───┴───┐  ┌───┴───┐  ┌───┴───┐
 │ Redis │  │ Redis │  │ Redis │
 └───────┘  └───────┘  └───────┘

This design has some limitations:

  • Increased latency
    • The proxy adds an extra hop for each request, which can increase latency.
  • Single point of failure
    • If the proxy goes down, the entire cluster becomes unavailable.
    • Some proxy solutions try to mitigate this by failing over to another proxy; however, the entire cluster still becomes temporarily unavailable until the failover is complete.

Another approach to clustering is to have clients obtain slot assignments from a metadata store and then communicate with the Redis/Valkey nodes directly.

          ┌──────────┐
          │ Metadata ├─────────────────┐
          │  store   │                 │
          └─────┬────┘                 │
     ┌──────────┼──────────┐      ┌────┴────┐
┌────┴───┐ ┌────┴───┐ ┌────┴───┐  │ Control │
│ Client │ │ Client │ │ Client │  │  plane  │
└────┬───┘ └────┬───┘ └────┬───┘  └────┬────┘
     ├──────────┼──────────┤           │
 ┌───┴───┐  ┌───┴───┐  ┌───┴───┐       │
 │ Redis ├──┤ Redis ├──┤ Redis ├───────┘
 └───────┘  └───────┘  └───────┘

In this design, the slot assignments in the metadata store are often updated by some control plane, which monitors the Redis/Valkey nodes and reassigns slots when nodes fail. This design has some limitations as well:

  • Need for custom client libraries
    • Clients need to be aware of the metadata store, so they must use custom client libraries that read the slot assignments from the metadata store and communicate with the appropriate Redis/Valkey nodes.
  • Difficulty of failure detection
    • The control plane needs to externally monitor the Redis/Valkey nodes to detect failures, which is tricky, especially when there are temporary network issues.
  • No protection against misbehaving clients
    • If we use unmodified Redis/Valkey nodes, they are not aware of the clustering, so they simply respond to any commands they receive. This means that if a client sends a command to a node that does not own the slot, the node will happily respond with incorrect data.
    • Even if the client implementation is correct, its view of the cluster topology can simply be out of date. In this case, the client will send commands to the wrong node.
  • Increased deployment complexity
    • The control plane needs to be deployed and maintained, which adds complexity to the deployment.

Design overview

Keyfront takes a "distributed proxy" approach, where multiple proxies are deployed across the cluster, and each proxy instance runs on the same host as a Redis/Valkey node, in a similar way to vttablet in Vitess.

   ┌────────┐        ┌────────┐        ┌────────┐
   │ Client │        │ Client │        │ Client │
   └────┬───┘        └────┬───┘        └────┬───┘
        ├─────────────────┼─────────────────┤
┌───────┼───────┐ ┌───────┼───────┐ ┌───────┼───────┐
│ Host  │       │ │ Host  │       │ │ Host  │       │
│   ┌───┴───┐   │ │   ┌───┴───┐   │ │   ┌───┴───┐   │  ┌──────────┐
│   │ Proxy ├───┼─┼───┤ Proxy ├───┼─┼───┤ Proxy ├───┼──┤ Metadata │
│   └───┬───┘   │ │   └───┬───┘   │ │   └───┬───┘   │  │  store   │
│   ┌───┴───┐   │ │   ┌───┴───┐   │ │   ┌───┴───┐   │  └──────────┘
│   │ Redis │   │ │   │ Redis │   │ │   │ Redis │   │
│   └───────┘   │ │   └───────┘   │ │   └───────┘   │
└───────────────┘ └───────────────┘ └───────────────┘

The metadata store is used for the following purposes:

  • Storing the cluster topology: the cluster topology is maintained in a strongly-consistent manner instead of being spread through a gossip protocol.
  • Leader election: the proxies use the metadata store to elect a single leader from among the proxy instances. The leader is responsible for managing the cluster topology and coordinating the proxies.
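
For illustration, the metadata could be laid out in the store roughly as follows (the key names and value formats are hypothetical, not Keyfront's actual schema):

    /keyfront/nodes/<node-id>  ->  {"addr": "10.0.0.1:7000", ...}   (registered with a lease/TTL)
    /keyfront/slots/<slot>     ->  <node-id>                        (strongly-consistent slot ownership)
    /keyfront/leader           ->  <node-id>                        (held by the current leader, with a lease)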

Unlike Redis/Valkey Cluster, which requires replica nodes in each shard for high availability, Keyfront achieves high availability by failing over the slots owned by a failed host to other hosts. Because replicas are not required, Keyfront does not suffer from the issues that arise from asynchronous replication.

This design addresses the issues of Redis/Valkey Cluster:

  • Scalability
    • As the proxy instances need to communicate only with the metadata store and the leader, the communication overhead is reduced to O(n), instead of the O(n^2) of the gossip protocol.
  • Strongly-consistent topology and manageability
    • The proxy instances store the cluster topology and other metadata in a central metadata store (etcd) to ensure strong consistency and manageability.

This design has the following advantages compared to the existing solutions:

  • Lower latency
    • As the proxy instances are deployed on the same host as the Redis/Valkey nodes, they can communicate via Unix domain sockets, which reduces latency compared to TCP communication.
  • High availability
    • If one host goes down, the other hosts can still serve requests, and other proxy instances can take over the slots owned by the failed host.
  • No need for custom client libraries
    • The Keyfront proxy implements the Redis/Valkey Cluster protocol, allowing clients to communicate with proxies using unmodified Redis/Valkey Cluster client libraries.

Proxying

Intercepting commands

The Keyfront proxies intercept commands from clients and relay them to the backend Redis/Valkey node that the proxy instance is responsible for. Before relaying a command, the proxy checks:

  • If the command takes keys as arguments (e.g. GET, SET, etc.)
    • If so, the proxy determines whether it owns the key's slot, and responds with a -MOVED redirect if it does not.
  • If the command is a cluster-related command (e.g. CLUSTER SLOTS)
    • If so, the proxy responds with an appropriate response that emulates the Redis/Valkey Cluster protocol.

Otherwise, the proxy relays the command to the backend Redis/Valkey node.
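
The routing check can be sketched as follows. The CRC16 hashing and hash-tag rules follow the Redis Cluster specification; owned_slots and slot_owner_addr are hypothetical stand-ins for the proxy's view of the topology read from the metadata store.

    # Minimal sketch of the slot routing check.
    def crc16(data: bytes) -> int:
        # CRC16-CCITT (XMODEM), the variant Redis Cluster uses for key hashing.
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
        return crc

    def key_slot(key: bytes) -> int:
        # If the key contains a non-empty {...} hash tag, only the tag is hashed.
        start = key.find(b"{")
        if start != -1:
            end = key.find(b"}", start + 1)
            if end > start + 1:
                key = key[start + 1:end]
        return crc16(key) % 16384

    def route(key: bytes, owned_slots: set[int], slot_owner_addr) -> bytes | None:
        # Returns a -MOVED redirect if this proxy does not own the key's slot,
        # or None if the command can be relayed to the local backend.
        slot = key_slot(key)
        if slot in owned_slots:
            return None
        host, port = slot_owner_addr(slot)  # hypothetical lookup into the cluster topology
        return b"-MOVED %d %s:%d\r\n" % (slot, host.encode(), port)

Cluster-aware client libraries already follow -MOVED redirects transparently, which is what makes this emulation compatible with unmodified clients.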

The extra proxy layer introduces some latency; however, this is mitigated by multiplexing requests from multiple clients into a single shared connection to the backend Redis/Valkey node, which achieves a similar effect to pipelining.
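
For illustration, a minimal asyncio-based multiplexer is sketched below. It relies on the backend answering requests in order, so a FIFO of pending waiters is enough to match replies to requests; real RESP parsing and error handling are omitted.

    import asyncio
    from collections import deque

    class BackendMux:
        # Funnels commands from many client connections into one backend connection.
        # Must be constructed inside a running event loop.
        def __init__(self, reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
            self.reader, self.writer = reader, writer
            self.pending: deque[asyncio.Future] = deque()
            self._read_task = asyncio.create_task(self._read_loop())

        async def send(self, raw_command: bytes) -> bytes:
            # Append the waiter and write the bytes back to back so replies match
            # requests in FIFO order; concurrent writes coalesce like a pipeline.
            fut = asyncio.get_running_loop().create_future()
            self.pending.append(fut)
            self.writer.write(raw_command)
            await self.writer.drain()
            return await fut

        async def _read_loop(self):
            while True:
                # Simplification: assumes single-line replies; a real proxy needs a full RESP parser.
                reply = await self.reader.readline()
                if not reply:
                    break
                self.pending.popleft().set_result(reply)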

Emulating cluster mode on Redis/Valkey running in standalone mode

The Redis/Valkey instances behind the proxies are run in standalone mode, and the proxies handle the clustering logic. This is achieved by mapping slots to databases.

  • Each Redis/Valkey instance is configured with 16384 databases (databases 16384), and each slot is mapped to a database.
  • The proxy automatically performs SELECT <slot> before executing commands on the backend Redis/Valkey node.
  • Commands that work on a slot (e.g. CLUSTER GETKEYSINSLOT) are translated to work on the corresponding database.
  • Commands that work on a database (e.g. FLUSHDB, DBSIZE, etc.) are translated to work on all databases.
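
A sketch of this translation is shown below; send_to_backend is a hypothetical helper that sends one command to the local Redis/Valkey instance (running with databases 16384) and returns its reply.

    def run_keyed_command(send_to_backend, slot: int, command: list[bytes]):
        # Keyed commands run in the database that emulates the key's slot.
        send_to_backend([b"SELECT", str(slot).encode()])
        return send_to_backend(command)

    def dbsize(send_to_backend, owned_slots: set[int]) -> int:
        # DBSIZE is translated to the sum over every database this node owns.
        total = 0
        for slot in owned_slots:
            send_to_backend([b"SELECT", str(slot).encode()])
            total += int(send_to_backend([b"DBSIZE"]))
        return total

In practice the SELECT can be pipelined together with the command itself (as in the multiplexer above), so the emulation does not cost an extra round trip.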

Clustering and high availability

Design principles

Keyfront is designed with the following principles:

  • Focus on use as a cache: Losing cached data is acceptable.
    • However, rolling back to older data or serving incorrect data is not acceptable.
    • Instead of migrating existing keys in a slot to a new node when reassigning slots, slots are reassigned instantly without moving the keys. The idea of cache-only mode is also described in redis/redis#4160.
  • No replicas: High availability is achieved by having multiple shards, instead of replicas.
    • If we have 2n nodes, let's have 2n shards instead of n shards with 2 nodes each.
    • Replication is not supported.
  • Prioritize consistency over availability: In case of network partitions, it is better to make some slots unavailable than serving incorrect data.

Node discovery

Unlike Redis/Valkey Cluster, which spreads cluster membership information through the gossip protocol, Keyfront uses the central metadata store to discover nodes in the cluster. Instead of calling CLUSTER MEET <existing node>, nodes are configured with the address of the metadata store; they register themselves with it when they start up and discover other nodes by reading from it.
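
A sketch of the registration flow, assuming etcd as the metadata store and the third-party python-etcd3 client (the key names, address, and TTL are illustrative):

    import etcd3

    def register_node(node_id: str, addr: str, ttl: int = 10):
        client = etcd3.client(host="metadata-store", port=2379)  # illustrative address
        # Register under a lease so the entry disappears automatically if the node dies.
        lease = client.lease(ttl)
        client.put(f"/keyfront/nodes/{node_id}", addr, lease=lease)
        return client, lease  # the caller periodically calls lease.refresh() as a heartbeat

    def discover_nodes(client) -> dict[str, str]:
        # Peers are discovered by listing the registration prefix instead of gossiping.
        nodes = {}
        for value, meta in client.get_prefix("/keyfront/nodes/"):
            node_id = meta.key.decode().rsplit("/", 1)[-1]
            nodes[node_id] = value.decode()
        return nodes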

Leader election

Keyfront uses the metadata store to elect a single leader node from among the proxy instances. The leader node is responsible for managing the cluster topology and coordinating the proxies.
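
The election can be sketched with etcd's compare-and-set transactions (again assuming etcd and python-etcd3; the leader key and TTL are illustrative): every proxy tries to create the leader key under a lease, only one transaction succeeds, and the key expires if the leader stops refreshing its lease.

    import etcd3

    LEADER_KEY = "/keyfront/leader"

    def try_become_leader(client, node_id: str, ttl: int = 10):
        lease = client.lease(ttl)
        became_leader, _ = client.transaction(
            # The put succeeds only if the key does not exist yet (version == 0).
            compare=[client.transactions.version(LEADER_KEY) == 0],
            success=[client.transactions.put(LEADER_KEY, node_id, lease)],
            failure=[],
        )
        if not became_leader:
            lease.revoke()
        return became_leader, lease  # the leader keeps calling lease.refresh() to stay leader

Proxies that lose the election can watch the leader key and retry the election when the key disappears.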

Auto-rebalancing of slots

The leader node is responsible for distributing slots evenly across the nodes in the cluster.

  • When a new node joins the cluster, the leader node moves some slots from existing nodes to the new node.
  • When a node leaves or fails, the leader node moves the slots from the failed node to other nodes.

A similar idea is described in redis/redis#3009.

Users can also manually trigger rebalancing of slots by calling management commands.
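
As an illustration of the rebalancing idea (a sketch, not Keyfront's actual algorithm), a rebalancing pass can target an even share of the 16384 slots per live node and move only the excess and orphaned slots:

    def rebalance(assignment: dict[int, str], live_nodes: list[str]) -> dict[int, str]:
        # assignment maps slot -> node id; assumes at least one live node.
        target = 16384 // len(live_nodes)
        owned = {node: [] for node in live_nodes}
        orphaned = []  # slots owned by failed or removed nodes
        for slot, node in assignment.items():
            (owned[node] if node in live_nodes else orphaned).append(slot)

        # Nodes above the even share give up their excess slots...
        for slots in owned.values():
            while len(slots) > target + 1:
                orphaned.append(slots.pop())
        # ...and nodes below it (including freshly joined, empty nodes) pick them up.
        for slots in owned.values():
            while orphaned and len(slots) < target:
                slots.append(orphaned.pop())
        # Distribute any remainder round-robin so every slot stays assigned.
        for i, slot in enumerate(orphaned):
            owned[live_nodes[i % len(live_nodes)]].append(slot)

        return {slot: node for node, slots in owned.items() for slot in slots}

Moving only the excess keeps most slots on their current owners, which minimizes cache loss given that keys are not migrated when a slot is reassigned.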
