Clustering


The Halon SMTP software is built for clustering, and we encourage all customers to use at least two nodes, for performance and fault tolerance. Given the nature of email, clustering operates in an active-active, master-master, share-nothing fashion. Traffic is normally directed to all nodes, for example by adding multiple DNS records. Because the system can operate without queues, it's possible to design a setup where any node can fail or be destroyed, without losing any data. All aspects of the system, such as logging, rate limiting and browsing the mail tracking, are clustered and multiplexed (the time required to perform such tasks doesn't increase with the number of cluster nodes).

The concept

Clustering can be done for many reasons, including:

  • Risk mitigation
  • Performance
  • Improved manageability with multiple systems

The clustering is based on a multi-master topology: all nodes in the cluster are connected to each other, and configuration can be performed on any appliance in your cluster. This reduces the risk of a single point of failure and allows appliances to be added to and taken out of the cluster without breaking the overall topology.

Together, all nodes share a "cluster" configuration, which includes everything except your local network settings (and a few other "unclusterable" values, like storage disk path). We also support private values on selected parts of the configuration, for example host names (advanced users may set even more private keys).
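
As a rough illustration of how shared and private values can combine, here is a minimal Python sketch. The flat dictionary layout, the key names and the `effective_config` helper are all hypothetical and do not reflect the actual configuration format; the sketch only assumes that a node's private values take precedence over the shared cluster values on that node.

```python
# Hypothetical sketch: the key names and flat-dict layout are assumptions,
# not the product's real configuration format.

def effective_config(cluster: dict, private: dict) -> dict:
    """Return the configuration a single node would run with.

    Shared cluster values apply to every node; the node's private
    values (for example its host name) override them locally.
    """
    merged = dict(cluster)   # copy, so the shared configuration is untouched
    merged.update(private)   # private keys win on this node only
    return merged


# One shared configuration, two nodes with a private host name each.
cluster_config = {"hostname": "mx.example.com", "rate_limit": 100}
node_a = effective_config(cluster_config, {"hostname": "mx1.example.com"})
node_b = effective_config(cluster_config, {"hostname": "mx2.example.com"})
```

Both nodes end up with the same `rate_limit`, while each keeps its own `hostname`.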

Cluster demystified

This chapter will turn our clustering magic into logic. It's highly recommended to read this chapter now, since it can be somewhat stressful to learn this once the cluster is broken and the knowledge is really needed.

Our cluster consists of a shared configuration, which is "Pushed" to the cluster each time you modify the configuration on one of the appliances. In order to keep the clustering logic sane, this can only be done when all appliances are synchronized.

How do we know when to synchronize the configuration?

[Figures: Successful Synchronization; Appliances ready to Join]

Well, each configuration revision is assigned a UUID (Universally Unique Identifier, or a unique randomised ID if you like). This UUID can be seen in the configuration when exported (e.g. uuid="bb30efa7-3c0f-4cb1-b0dd-fb601286d303"), but it cannot be changed manually: any attempt to import a configuration with a custom UUID value will make us generate a new one internally. This is simply because one should not try to outsmart our clustering mechanism.
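
The behaviour described above can be sketched in a few lines of Python. This is only an illustration: the `import_configuration` function and the dictionary layout are hypothetical, and the use of a random (version 4) UUID is an assumption based on the example value shown, not a statement about the internal implementation.

```python
import uuid


def import_configuration(config: dict) -> dict:
    """Hypothetical import step: any UUID supplied by the user is
    discarded and a fresh random one is assigned to the revision."""
    imported = dict(config)               # leave the caller's copy untouched
    imported["uuid"] = str(uuid.uuid4())  # assumption: random (v4) UUIDs
    return imported


# A manually edited export with a hand-picked UUID...
edited = {"uuid": "00000000-0000-0000-0000-000000000000",
          "hostname": "mx1.example.com"}

# ...gets a new revision UUID on import; the custom value is ignored.
imported = import_configuration(edited)
```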

The cluster's common goal is to have the latest generated UUID (configuration revision) shared on all appliances in the cluster. This is achieved by Pushing the configuration to all other appliances in the cluster which are synchronized.

All appliances keep track of a backlog (history file) of UUIDs which have either been Pushed or imported from the cluster (Accepted).

We consider a unit to be "in-sync":

  • If the latest UUID is somewhere in our backlog (this is a very common scenario when performing any kind of reconfiguration).
    • This also allows a unit to be shut down while all the other appliances are reconfigured. When the unit comes back online, its latest configuration will still be in our backlog, so the cluster will Push the new configuration onto the unit and the cluster will be in-sync once again.
  • If the latest UUID is empty (the unit is in "overwrite me" mode and is about to be overwritten).

We will consider a unit "out of sync" if:

  • The latest UUID is not in our backlog. This can happen if:
    • You perform configuration changes on two appliances too quickly, so the first unit hasn't had time to synchronize the cluster in between.
    • The communication between two or more appliances is down while the configuration is changed on these appliances.
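
The in-sync and out-of-sync rules above can be summarised as a small Python sketch. The function name `is_in_sync`, the list-based backlog and the placeholder UUID strings are hypothetical and only illustrate the logic as described; they are not taken from the actual implementation.

```python
from typing import List, Optional


def is_in_sync(unit_latest: Optional[str], backlog: List[str]) -> bool:
    """Decide, from the local node's point of view, whether another
    unit is in-sync and may therefore receive a Push.

    unit_latest -- the latest UUID reported by the other unit
                   (empty or None means "overwrite me" mode)
    backlog     -- UUIDs this node has Pushed or Accepted so far
    """
    if not unit_latest:
        return True                 # "overwrite me": about to be overwritten
    return unit_latest in backlog   # a known, older (or current) revision


# Revisions this node has seen, oldest first (placeholder values).
backlog = ["rev-1", "rev-2", "rev-3"]

was_offline = is_in_sync("rev-2", backlog)   # unit missed rev-3 while shut down
overwrite_me = is_in_sync(None, backlog)     # unit is joining the cluster
conflict = is_in_sync("rev-x", backlog)      # changed independently elsewhere
```

In this sketch the first two units would receive the latest configuration, while the third is out of sync because its revision was never seen by this node.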

We will consider all units to be synchronized when:

  • All appliances have the same UUID (Clustering -> Overview).

How do I know when the cluster is synchronized?

The cluster overview shows each appliance's latest UUID. When all are the same, the cluster is synchronized.
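
Expressed as code, the check is a simple comparison. The sketch below is hypothetical: the `cluster_synchronized` helper and the mapping of appliance names to UUIDs stand in for what you would read off the cluster overview, and the UUID values other than the example from this page are made up.

```python
def cluster_synchronized(latest_uuids: dict) -> bool:
    """True when every appliance reports the same latest UUID.

    latest_uuids maps an appliance name to the latest UUID shown
    for it in the cluster overview (hypothetical representation).
    """
    return len(set(latest_uuids.values())) == 1


synced = cluster_synchronized({
    "node1": "bb30efa7-3c0f-4cb1-b0dd-fb601286d303",
    "node2": "bb30efa7-3c0f-4cb1-b0dd-fb601286d303",
})

not_synced = cluster_synchronized({
    "node1": "bb30efa7-3c0f-4cb1-b0dd-fb601286d303",
    "node2": "11111111-2222-4333-8444-555555555555",  # made-up value
})
```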

How do I resolve a conflict?

[Figure: Failed Synchronization]

The easiest way is to put the broken unit in "Overwrite me" mode, as if it were entering the cluster for the first time, and let it accept the shared configuration.

How do I test a configuration on just one of the appliances?

You must disable automatic pushing of configurations. This will prevent this unit from synchronizing its changes to the cluster.

How do I keep the changes I made?

Re-enable automatic pushing and the configuration should be synchronized from this unit.

How do I discard the changes I made?

1. Push "Overwrite me" and force a synchronization from one of the other clustered appliances, or wait 5 minutes and watch the UUID be synchronized in the cluster overview.

2. Re-enable automatic pushing (it's important that you don't activate it until the cluster is repaired).

Can I cluster different software versions with each other?

Yes, as long as there are only one or two releases in between them, as new and removed features might otherwise cause incompatibility.

Are licenses synchronized in the cluster?

Licenses are not synchronized or shared in the cluster; only the configuration is.