High availability with ZFS has been around for some time. In this article we will look at the traditional way of doing high availability ZFS, as well as a completely different way of doing it.
The most common way of doing HA ZFS is to have 2 "head" nodes and a central JBOD system which houses the disks.
This uses external SAS (Serial Attached SCSI) cables to connect everything together.
Dual-port SAS drives are required, because both head nodes need a path to every disk. SATA or NVMe drives cannot be used.
This is how Nexenta, Open-E, QuantaStor, Syneto and other ZFS based systems do high availability.
If one of the head nodes fails, the other head node can failover and handle the ZFS pool and associated storage services.
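At its core, a takeover boils down to the surviving head force-importing the pool and restarting the services on top of it. A minimal sketch of that step, assuming a pool named `tank` (the pool name and the dry-run wrapper are illustrative, not from the article; a real HA framework adds fencing, quorum and health checks around this):

```python
import subprocess

def failover(pool, run=subprocess.run):
    """Sketch of a head-node takeover for a shared-JBOD ZFS pool."""
    # -f forces import of a pool that was last active on the failed head;
    # this is only safe once that head has been fenced off the shared SAS bus.
    run(["zpool", "import", "-f", pool], check=True)
    # Re-export the NFS/SMB shares defined on the pool's datasets.
    run(["zfs", "share", "-a"], check=True)

# Dry run: record the commands instead of executing them.
issued = []
failover("tank", run=lambda cmd, **kw: issued.append(cmd))
```

Because both heads can see the same dual-port SAS disks, no data needs to move during the takeover; only pool ownership changes.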
How is this scaled to increase storage capacity? By daisy-chaining additional JBOD enclosures onto the SAS chain.
It is a complex design that has many limitations.
If we want to scale beyond 2 head nodes, use SATA or NVMe drives, or expand capacity without disturbing the running system, then we cannot do it with the existing JBOD based architecture. We need a completely different cluster design to meet those requirements.
Back in 2013, we began planning for a next generation of cluster that supported ZFS.
At that time, we had a dual-node solution that used DRBD to handle the replication.
It supported only 2 nodes, with the controller and storage combined in the same system, and it could not scale beyond that.
What we really wanted was a highly scalable architecture. We wanted to remove the 2 controller limit and be able to connect as many storage nodes as we wanted.
And we wanted to support any disk type. HDD, SSD, SATA, SAS, NVMe, 2.5", 3.5" and PCIe. Anything from cheap consumer grade stuff all the way up to the expensive high end enterprise hardware.
It was pretty clear from the start that there was only one way to connect the nodes: it had to be a switch-based architecture.
We wanted high throughput and very low latency. We have customers who run hosting companies and the latency aspect is very important when these people are running thousands of virtual machines.
InfiniBand was the clear leader. At the time we began the design, InfiniBand offered up to 56 Gbit/s of bandwidth. Ethernet was only 10 Gbit/s.
When it comes to latency, InfiniBand delivers around 15% of the latency of Ethernet.
Then there is RDMA (Remote Direct Memory Access). It has been part of InfiniBand since the technology was released 15 years ago. It is a standard feature of InfiniBand, unlike Ethernet, where the requirements to get RDMA working are more complex. RDMA is the key to really low latency.
The icing on the cake was the pricing. InfiniBand really found its niche in supercomputing, where the profit margins are slim. Compared with Fibre Channel, a 56 Gbit InfiniBand adapter is around half the price of a 16 Gbit Fibre Channel adapter.
We had experience with InfiniBand going back to 2007. So it was a technology we were already familiar with and we had the test equipment to start building our proposed design.
The network interconnect is really simple. The controller nodes and storage nodes are connected by an InfiniBand or Ethernet switch.
The head nodes are basically the same specification as the head nodes used in the traditional design.
The storage nodes are not JBOD enclosures. Rather they are systems with an Intel processor and a hardware RAID controller. They have their own OS and run their own storage target.
To add new controllers, just configure them and plug into the switch. To add new storage nodes, again just configure them and plug into the switch. This does not affect the running cluster. ZFS pools can be expanded with new storage without downtime or new ZFS pools can be created. Cluster expansion is completely non-disruptive.
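Growing a pool with disks from a newly added storage node is a single online operation. A sketch, assuming the new node's disks appear on the controller as multipathed block devices (the pool and device names are illustrative, not from the article):

```python
import subprocess

def grow_pool(pool, devices, run=subprocess.run):
    """Sketch: attach a new mirrored vdev to a live pool."""
    # zpool add is an online operation; the pool keeps serving IO
    # while the new vdev is added, so expansion is non-disruptive.
    run(["zpool", "add", pool, "mirror", *devices], check=True)

# Dry run: record the command instead of executing it.
issued = []
grow_pool("tank", ["/dev/mapper/mpatha", "/dev/mapper/mpathb"],
          run=lambda cmd, **kw: issued.append(cmd))
```

The same command shape works whether the backing devices come from one storage node or several; ZFS only sees block devices.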
In the above diagram, the switch is a single point of failure. So we support multiple switches and multipath all IO from the storage nodes to the controllers.
This design can also be supported in a 2 node configuration.
We support combining the controller and storage node in the same system. In that configuration, a switch is not required and the nodes can be connected point-to-point. This can be expanded in future with the addition of a switch.
It was also important that we support Ethernet based networks. If you have an Ethernet network running at 10 Gbit or better, then that is also suitable to connect the controller and storage nodes.
RDMA is supported if your hardware is capable. The iSER protocol is used which offers a significant performance increase over TCP/IP based iSCSI.
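On the initiator side, moving from plain TCP/IP iSCSI to iSER is mostly a matter of logging in through an iSER-bound interface. A sketch using open-iscsi's iscsiadm (the target IQN and portal address are placeholders, not from the article):

```python
import subprocess

def login_iser(iqn, portal, iface="iser0", run=subprocess.run):
    """Sketch: create an iSER-bound open-iscsi interface and log in."""
    # Create a new iface record and bind it to the iser transport.
    run(["iscsiadm", "-m", "iface", "-I", iface, "-o", "new"], check=True)
    run(["iscsiadm", "-m", "iface", "-I", iface, "-o", "update",
         "-n", "iface.transport_name", "-v", "iser"], check=True)
    # Log in to the target over RDMA instead of TCP/IP.
    run(["iscsiadm", "-m", "node", "-T", iqn, "-p", portal,
         "-I", iface, "--login"], check=True)

# Dry run: record the commands instead of executing them.
issued = []
login_iser("iqn.2016-01.example:tank", "192.168.10.1",
           run=lambda cmd, **kw: issued.append(cmd))
```

If the RDMA hardware is absent, the same target can still be reached with the default TCP transport, which is what makes the mixed-hardware clusters practical.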
The current architecture is not just designed for InfiniBand and Ethernet networks. It is also designed to support future networks. One example is the Intel Omni-Path network, which has 100 Gbit/s of bandwidth. We plan to support that in early 2017.
Another example is InfiniBand HDR running at 200 Gbit/s and RDMA based Ethernet running at 100 Gbit/s. In short, the networking aspect of the architecture is very future proof.
Our cluster architecture was designed to overcome most of the limitations with the existing JBOD based designs. It is designed to scale to hundreds of nodes per cluster and be much easier to manage. It is also designed with an eye to the future. Our customers install these clusters and want them to run for many years. What is state of the art today will soon become yesterday's technology.
Storage technology which is not even available to buy yet can be supported when released. An example would be Intel's 3D XPoint. We can support any block device that can be installed into a standard server.
Networks which only exist on roadmaps can also be supported when released.
The future is definitely bright for high availability ZFS.