17 changes: 17 additions & 0 deletions build-and-config/configuring-for-high-availability.mdx
@@ -11,6 +11,23 @@ In order to achieve high availability for your application, you need multiple re
Applications running Kubernetes v1.12 or lower do not support the multi-master feature on Cloud 66. If you have deployed an application via Cloud 66 before March 2019, you will need to **redeploy your application "with upgrades" and choose to perform a Kubernetes upgrade** (note that this will incur significant downtime as your cluster will be recreated). All applications deployed after **March 2019** on version v1.13 and above automatically support multi-master clusters.
</Callout>

## Choosing between a shared master and dedicated workers

A new Maestro Kubernetes cluster starts with a single master node that also runs your application workloads, a "shared master" topology. This is fine for development and small production loads, but it has a hard ceiling: the same node handles both the Kubernetes control plane (the API server, scheduler, controller manager, and etcd) and your application's containers, and the two compete for the same CPU, memory, and disk I/O.

The right time to add a **dedicated worker** (a node that runs *only* application pods, leaving the master to control-plane duties) is when you start seeing any of the following:

- **Slow `kubectl` responses or Dashboard timeline lag** when nothing else is changing. This usually means application containers are starving the API server of resources.
- **Scheduling decisions that take noticeably longer than they used to** (new pods sitting in `Pending` for tens of seconds before being placed).
- **Pod evictions on the master node**, where the kubelet evicts application pods because the node itself is under memory or disk pressure. These show up in the cluster's events feed and in the timeline.
- **etcd warnings in the master's logs** about slow writes (`took too long`, `apply request took too long`). etcd is the most latency-sensitive part of the control plane.

If you're hitting any of these on a single-master cluster, add one dedicated worker before trying other tuning options, as it is usually the fastest path back to a healthy cluster. The procedure is the same as for [adding any node](#adding-nodes-to-an-application): pick *Worker* in step 5.
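
As a rough illustration of how the symptoms listed above could be turned into a simple check, here is a minimal Python sketch. It is not part of Cloud 66 or Kubernetes: the `MasterSignals` structure, the `symptoms` function, and the 10-second threshold are all hypothetical, and you would need to collect the inputs yourself (for example from the events feed and the master's logs).

```python
from dataclasses import dataclass
from typing import List

# Assumption: "tens of seconds" in Pending is approximated as 10s or more.
PENDING_THRESHOLD_SECONDS = 10.0

@dataclass
class MasterSignals:
    """Hypothetical container for observations gathered by hand or by a script."""
    pending_seconds: List[float]   # how long recent pods sat in Pending
    evictions_on_master: int       # evictions seen on the master node
    etcd_log_lines: List[str]      # recent etcd log lines from the master

def symptoms(signals: MasterSignals) -> List[str]:
    """Return the shared-master pressure symptoms present in the signals."""
    found = []
    if any(s >= PENDING_THRESHOLD_SECONDS for s in signals.pending_seconds):
        found.append("slow scheduling")
    if signals.evictions_on_master > 0:
        found.append("pod evictions on master")
    # "apply request took too long" also contains this phrase.
    if any("took too long" in line for line in signals.etcd_log_lines):
        found.append("etcd slow writes")
    return found
```

Any non-empty result from `symptoms` corresponds to at least one of the signals described above, which is the point at which a dedicated worker is worth considering.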

<Callout type="info" title="One worker first, then think about HA">
Adding the first dedicated worker is a different decision from moving to a full HA topology (three masters plus workers). A cluster with a single master and a single worker is not HA, because losing the master still takes the cluster down, but it does take workload pressure off the master and is often enough for small production apps. Move to three masters when you also need the cluster to survive a master node failure.
</Callout>
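
The reason three masters is the usual HA minimum comes from majority quorum: a replicated control-plane store such as etcd needs more than half of its members available to keep accepting writes. The sketch below works through that arithmetic. It assumes one etcd member per master, which this page does not state explicitly, and the function names are illustrative only.

```python
def quorum(masters: int) -> int:
    """Smallest majority of a cluster with this many members."""
    return masters // 2 + 1

def tolerated_failures(masters: int) -> int:
    """How many members can be lost while a majority remains."""
    return masters - quorum(masters)

for n in (1, 2, 3, 5):
    print(n, "masters tolerate", tolerated_failures(n), "failure(s)")
```

Under this assumption, one master tolerates no failures, two masters tolerate no more than one does, and three is the smallest count that survives the loss of a single master.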

## Adding nodes to an application

To add nodes to an existing application: