Hardware

Our “shared nothing” architecture replicates storage between the nodes in the cluster.

We use dedicated storage nodes and no JBODs. This eliminates single points of failure in the storage cluster.

Because the architecture is switch-based, it scales far better than daisy-chained, JBOD-based systems.

Each storage node is an Intel x86-based server that exports its storage to the controller nodes over InfiniBand or Ethernet.

The storage is exported as a single block device by a high-performance SAN service running on each storage node.

Architecture

In the most basic configuration, a two-node cluster, a single ZFS storage pool is created from the storage exported by both nodes.

Figure: a ZFS storage pool built from a mirrored VDEV.

Mirrored VDEVs are built from entire storage nodes, not from individual disks.
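
As a rough sketch of what this layout could look like at the ZFS level, the Python snippet below assembles a `zpool create` command in which each mirror member is the block device exported by one whole storage node. The pool name `tank` and the device paths are hypothetical placeholders, and the snippet only builds the command line; it does not run it.

```python
from typing import List, Sequence


def build_zpool_create(pool: str, mirrors: Sequence[Sequence[str]]) -> List[str]:
    """Return the argv for `zpool create` with one mirrored VDEV per group.

    Each inner sequence lists the block devices exported by the storage
    nodes that mirror each other (one device per node).
    """
    argv = ["zpool", "create", pool]
    for members in mirrors:
        if len(members) < 2:
            raise ValueError("a mirrored VDEV needs at least two storage nodes")
        argv.append("mirror")
        argv.extend(members)
    return argv


# Hypothetical device paths: one exported block device per storage node.
two_node = build_zpool_create("tank", [["/dev/node1", "/dev/node2"]])
print(" ".join(two_node))
```

The point of the sketch is that the arguments after `mirror` are node-level devices rather than individual disks.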

IO is written to each storage node at the same time. Replication is synchronous.
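
To illustrate what synchronous replication means here, the following is a minimal in-memory sketch, not the actual SAN or ZFS code path. The `StorageNode` class and `mirrored_write` function are hypothetical names invented for this example. The write is sent to every node in parallel and is acknowledged only after all of them have stored it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class StorageNode:
    """In-memory stand-in for the block device exported by one storage node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[int, bytes] = {}

    def write(self, lba: int, data: bytes) -> bool:
        # Store the block and confirm it.
        self.blocks[lba] = data
        return True


def mirrored_write(nodes: List[StorageNode], lba: int, data: bytes) -> bool:
    """Send the write to all nodes at once and wait for every confirmation."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(node.write, lba, data) for node in nodes]
        # Synchronous: the caller is only acknowledged once every node confirms.
        return all(f.result() for f in futures)


node_a, node_b = StorageNode("node1"), StorageNode("node2")
acknowledged = mirrored_write([node_a, node_b], lba=0, data=b"payload")
print(acknowledged, node_a.blocks == node_b.blocks)
```

By the time `mirrored_write` returns, both nodes hold the same block, which is the property that distinguishes synchronous from asynchronous replication.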

The advantages over traditional shared storage JBOD setups are:

No single point of failure
Servers can act as both controller and storage node
The hardware is cheaper to purchase and SAS drives are not required
High-performance NVMe drives can be used
Takes up less space in the rack
Power consumption is lower
InfiniBand bandwidth of up to 56 Gbit/s, compared to 12 Gbit/s for SAS
Easily add new nodes to scale the storage capacity
Live expansion of the storage pool without downtime
Mix HDD and SSD disks
One heartbeat packet is written to the entire node, not each disk
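
The heartbeat point in the list above can be made concrete with a little arithmetic. The sketch below counts heartbeat targets for a per-node scheme and a per-disk scheme; the node and disk counts are made-up illustrative values, and `heartbeat_targets` is a hypothetical helper.

```python
def heartbeat_targets(nodes: int, disks_per_node: int, per_node: bool) -> int:
    """Number of heartbeat writes needed in one heartbeat round."""
    if nodes < 0 or disks_per_node < 0:
        raise ValueError("counts must be non-negative")
    # Per-node: one write per storage node. Per-disk: one write per disk.
    return nodes if per_node else nodes * disks_per_node


# Illustrative values only: 4 storage nodes with 24 disks each.
per_node_count = heartbeat_targets(4, 24, per_node=True)
per_disk_count = heartbeat_targets(4, 24, per_node=False)
print(per_node_count, per_disk_count)
```

With these assumed numbers, a per-node heartbeat touches 4 targets per round where a per-disk heartbeat would touch 96.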

Because each storage node is accessed as a single block device, IO latency is much lower than when each disk is accessed individually, which improves performance and scalability.

HDD or SSD

Mechanical and solid state disks are supported.

Multiple ZFS pools are supported on the same controller, so you can create separate pools that favor either low cost or high speed.

Example Setups

Setups can range from basic two-node configurations to clusters with hundreds of nodes.

Example 1: Dual Storage Nodes

Two storage nodes are connected point-to-point, so a switch is not required. The replication link can be InfiniBand or Ethernet.

The nodes act simultaneously as both controller and storage nodes.

Example 2: Four Storage Nodes

Four storage nodes and an InfiniBand or Ethernet switch.

Nodes 1 and 2 act simultaneously as both controller and storage nodes.

Nodes 3 and 4 act as storage nodes.

This setup expands Example 1 above by adding a switch and two storage nodes.
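
One way the expansion from two to four storage nodes might look at the ZFS level is a `zpool add` that appends a second mirrored VDEV to the existing pool. The sketch below only builds that command line. The pool name and device paths are hypothetical, and pairing nodes 3 and 4 as a mirror is an assumption made for illustration, not something this page specifies.

```python
from typing import List, Sequence


def build_zpool_add(pool: str, members: Sequence[str]) -> List[str]:
    """Return the argv for adding one mirrored VDEV to an existing pool."""
    if len(members) < 2:
        raise ValueError("a mirrored VDEV needs at least two storage nodes")
    return ["zpool", "add", pool, "mirror", *members]


# Hypothetical devices exported by the two new storage nodes.
expand = build_zpool_add("tank", ["/dev/node3", "/dev/node4"])
print(" ".join(expand))
```

ZFS adds top-level VDEVs to an imported pool online, which matches the live expansion mentioned earlier.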

Example 3: Storage Controllers, Switch and Storage Nodes

Two controller nodes without internal storage, an InfiniBand or Ethernet switch and two storage nodes.

Example 4: Storage Controllers, Switches and Storage Nodes

Two controller nodes without internal storage, two InfiniBand or Ethernet switches and four storage nodes.

This setup expands Example 3 above by adding two storage nodes and a second InfiniBand or Ethernet switch.

This setup can tolerate the failure of one switch.
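
The tolerance claim can be checked with a small connectivity model. The sketch below assumes every controller and storage node is cabled to both switches, which this page does not state explicitly. The host names, switch names, and the `survives_single_switch_failure` helper are all hypothetical.

```python
from typing import Dict, Iterable, Set


def survives_single_switch_failure(
    links: Dict[str, Set[str]],
    controllers: Iterable[str],
    storage: Iterable[str],
) -> bool:
    """Check that each controller still reaches each storage node
    after any one switch fails. `links` maps a host to its switches."""
    controllers, storage = list(controllers), list(storage)
    switches = set().union(*links.values())
    for failed in switches:
        alive = switches - {failed}
        for c in controllers:
            for s in storage:
                # A path exists if both hosts share a surviving switch.
                if not (links[c] & links[s] & alive):
                    return False
    return True


both = {"sw1", "sw2"}
example4 = {h: set(both) for h in ["ctl1", "ctl2", "st1", "st2", "st3", "st4"]}
example3 = {h: {"sw1"} for h in ["ctl1", "ctl2", "st1", "st2"]}

print(survives_single_switch_failure(example4, ["ctl1", "ctl2"], ["st1", "st2", "st3", "st4"]))
print(survives_single_switch_failure(example3, ["ctl1", "ctl2"], ["st1", "st2"]))
```

Under these assumptions, the dual-switch layout passes the check, while the single-switch layout of Example 3 does not.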

Example 5: Mix of HDD- and SSD-Based Storage

This setup uses a mix of HDD-based and SSD-based storage nodes.

Separate ZFS pools are created from each node type.
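
A minimal sketch of how such a split could be planned: group the exported node devices by media type, then build one `zpool create` command per group, mirroring the nodes in pairs. The device paths, the pool naming scheme, and the pairing order are assumptions for illustration, and the commands are only assembled, not executed.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def plan_pools(nodes: Sequence[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each media type to the argv of a `zpool create` for its pool.

    `nodes` holds (exported_device, media_type) pairs, one per storage node.
    """
    by_media: Dict[str, List[str]] = defaultdict(list)
    for device, media in nodes:
        by_media[media].append(device)

    plans: Dict[str, List[str]] = {}
    for media, devices in by_media.items():
        if len(devices) % 2:
            raise ValueError(f"{media}: nodes must come in mirror pairs")
        argv = ["zpool", "create", f"{media}pool"]  # hypothetical pool name
        for i in range(0, len(devices), 2):
            argv += ["mirror", devices[i], devices[i + 1]]
        plans[media] = argv
    return plans


# Hypothetical inventory: two HDD-based and two SSD-based storage nodes.
inventory = [
    ("/dev/node1", "hdd"), ("/dev/node2", "hdd"),
    ("/dev/node3", "ssd"), ("/dev/node4", "ssd"),
]
plans = plan_pools(inventory)
for argv in plans.values():
    print(" ".join(argv))
```

Keeping the media types in separate pools means a slow HDD node never ends up as the mirror partner of a fast SSD node.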
