
High Availability Storage On Dell PowerEdge & HP ProLiant

11th October 2016

This article explains how to create a high availability storage cluster using Dell and HP servers.

Our software exports the cluster storage to the clients by iSCSI, NFS or SMB.

In this article we will export the cluster storage by iSCSI to a VMware host.

With the exception of a couple of post installation commands, all configuration is done in the web GUI.

No previous experience with Linux or ZFS is required.

Aims For This Example Cluster

We have a serious allergy to JBODs. They make us sneeze and our eyes itch. There is no remedy for this. Only complete avoidance.

Hardware

We are creating a high availability storage cluster from 2 nodes.

For illustrative purposes, we have used servers from 2 different vendors: Dell and HP.

Most users will use the same hardware vendor when creating a cluster, but you are not restricted to doing so.

We are using cheap consumer level Crucial MX300 SSDs in this article. You are free to use alternative disks including HDD or enterprise SSD. You can use SATA, SAS or NVMe disks.

We'll use a 10 Gigabit Ethernet interface to handle storage traffic to the clients.

We'll use a 40 Gigabit Ethernet interface to handle storage replication between the servers.

dell hp 1

Dell PowerEdge R730

HP ProLiant DL380 Gen9

Common To Both Systems

Software

The Cluster edition of Zetavault is installed on each system.

This is available to download as an ISO image with a 60-day trial period, so you can test it before deciding to purchase.

Download here:
http://www.zeta.systems/zetavault/download/

The software is installed to the 80GB Intel SSD.

A license is required for each system. A separate capacity-based license is also required; this covers the entire storage in the cluster.

The software license is perpetual and includes all support and maintenance. Nothing more to pay unless you add new physical storage.

Total Pricing

Since the HP system is available for lower cost — and it can be purchased without storage — we'll base the pricing on HP hardware.

The HP part numbers listed below have a slightly higher-spec CPU than the system we used for this article (an Intel E5-2620 v3).

The HP system has 16GB of memory. We'll upgrade it to 32GB.

We only need a single CPU installed in each server. Dual CPU is overkill.

In this article we will demonstrate replication over RDMA-based Ethernet. It is not a requirement: you could use the second port on the Intel X710 interface for TCP/IP-based replication instead.

We have used standard Internet retail pricing for all parts. If you have trade accounts, you should be able to do it for even less.

All pricing was done in October 2016. UK pricing excludes VAT.

Part | Part No | Price $ | Price £ | Qty | Total $ | Total £
HP ProLiant DL380 Gen9 server | 800073-S01 (US), 843556-425 (Europe) | $1860 | £1250 | 2 | $3720 | £2500
Crucial 16GB memory module | CT16G4RFD4213 | $95 | £85 | 2 | $190 | £170
Intel DC S3510 80GB SSD | SSDSC2BB080G601 | $84 | £72 | 2 | $168 | £144
Crucial MX300 525GB SSD | CT525MX300SSD1 | $120 | £92 | 12 | $1440 | £1104
Intel 10 Gigabit Ethernet interface | X710-DA2 | $342 | £290 | 2 | $684 | £580
Zetavault Cluster Edition license | N/A | $1400 | £1000 | 2 | $2800 | £2000
Zetavault 6TB capacity license | N/A | $378 | £270 | 1 | $378 | £270
Total pricing | | | | | $9380 | £6768

Optional Mellanox interfaces for RDMA-based replication:

Mellanox 40 Gigabit Ethernet interface | MCX314A-BCCT | $535 | £450 | 2 | $1070 | £900
Mellanox 40 Gigabit Ethernet 1M cable | MC2207130-001 | $73 | £70 | 1 | $73 | £70

Initial Setup

We need to perform the initial setup before configuring the cluster.

Install OS

Zetavault is installed on each server to the Intel SSD drive.

We also need to install some additional system packages for the cluster.

Log in to the system by SSH and become the root user.

Update the package cache:

apt-get update

Install the packages:

install-cluster

That will install the required packages.

Configure RAID

We will create a single RAID5 array from the six SSDs in each server.

For those of you with experience of running ZFS, don't worry. We will still be doing ZFS based RAID as well.

We can create the array in the Zetavault GUI.

On the Dell system:

raid dell 1

On the HP system:

raid hp 1

The RAID arrays will now initialize in the background. We can continue configuring the cluster while the arrays initialize.
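
While waiting, you can watch the initialization progress from the shell if you like. This assumes Zetavault builds the array with standard Linux software RAID (md); if your array sits on a hardware RAID controller, use the controller's own tools instead:

cat /proc/mdstat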

Configure Networking

We need to spend a moment deciding which network interfaces will be assigned to which roles.

On the Dell system:

network dell

On the HP system:

network hp

The interfaces with a green background are configured and running. The interfaces with a white background are not in use.

The interfaces will be used as follows:

eth0 | Gigabit | Management (HTTP and SSH)
eth1 | Gigabit | Cluster communication (cluster messaging between the nodes)
eth4 | 10 Gigabit | Storage client traffic (iSCSI in our example)
eth6 | 40 Gigabit | Replication traffic (reads and writes to the peer node's storage)

Make sure each IP address is on a different subnet.

You need a minimum of 4 interfaces, but it is possible to do it with 3 by using the management interface for cluster communication.

The storage client traffic can be handled by multiple interfaces. In this article we will keep it simple and use a single 10 Gigabit interface.

It is possible to use the Intel X710 interface to handle both storage client traffic and replication traffic. In that case, interface eth5 could be used. This is the second port on the Intel X710 interface. In this article we want to demonstrate RDMA based replication so we have added the Mellanox adapter for that.

To support hardware fencing, the Dell DRAC and HP iLO ports will also be used. They are not exposed to the OS as standard interfaces so they do not appear above.

Here is an image of the rear cabling (Dell top, HP bottom):

dell hp 2

1 Dell DRAC / HP iLO (Gigabit, to switch)
2 Management (Gigabit, to switch)
3 Cluster communication (Gigabit, connected point-to-point)
4 Storage client traffic (10 Gigabit SFP, to switch)
5 Replication traffic (40 Gigabit QSFP, connected point-to-point)
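
Once the interfaces are configured, it is worth confirming that the point-to-point links actually pass traffic. A quick check from the shell on controller1 (the peer addresses below are placeholders; substitute the IPs you assigned to eth1 and eth6 on the other node):

ping -c 3 -I eth1 <controller2-cluster-ip>
ping -c 3 -I eth6 <controller2-replication-ip>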

Cluster Configuration: Part 1

From here on, we'll refer to the Dell system as controller1 and the HP system as controller2.

Before we start, we need the iSCSI IQN initiator names for each system. These can be found in the GUI under Settings > Cluster.

controller1 iqn.1993-08.org.debian:01:8934d55876aa
controller2 iqn.1993-08.org.debian:01:8685bde9d083
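
These are Debian open-iscsi style IQNs. If you prefer the shell, the same name can usually be read from the initiator configuration file (assuming the standard open-iscsi layout underneath):

cat /etc/iscsi/initiatorname.iscsi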

The initial cluster configuration is wizard based.

Let's run the wizard on controller1.


wizard step 1

The kernel version is OK and the cluster packages have been installed. Click Next.


wizard step 2

Select the interface for the node communication. This is used to pass cluster messages between the nodes. Click Next.


wizard step 3

Since this is the first node we are configuring, no other nodes have been detected. Enter the cluster name. Click Next.


wizard step 4

Select the interface for the storage replication. Click Next.


wizard step 5

If your storage replication interface supports RDMA based Ethernet, you can enable it here.
Since our Mellanox interface supports it, we'll enable it. Click Next.


wizard step 6

Select the RAID array we created in the initial setup. Click Next.


wizard step 7

Enter the iSCSI initiator name of controller2. Click Next.


wizard step 8

Check the configuration. Click Next.


wizard step 9

The wizard will now perform the configuration.


In the GUI, go to Cluster > Nodes.

This system is the only controller node in the cluster.

cluster nodes 1


Now run the configuration wizard on controller2.

Perform the same steps as above.


wizard step 3 controller2

In step 3, controller1 will be detected. Make sure you select the same cluster as controller1. Click Next.


wizard step 7 controller2

In step 7, enter the iSCSI initiator name of controller1. Click Next.

Complete the wizard to perform the configuration.


On controller1, go to Cluster > Nodes.

controller2 has now joined the cluster.

cluster nodes 2

Confirm that the clocks are in sync. In the example above, controller2's clock is ahead of controller1's by 0.02 seconds.

Do not continue if the clocks are out of sync by more than 1 second. Wait for the NTP service to correct the clocks.
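
If the clocks refuse to converge, check that the NTP service on each node can actually reach its time sources. Assuming the standard ntpd client is in use, this lists the configured peers and their offsets:

ntpq -p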

Cluster Configuration: Part 2

The cluster is now created. We need to do some further configuration before we create the storage pool.

Configure IPMI

We need to configure IPMI to support hardware-based fencing. This is required for automatic failover to work in the event of a controller failure.

While still on the nodes page, click the "Info" button in the IPMI column for controller1.

ipmi 1

This is the output from the HP DL380 system.

Specifically we want to see the "Chassis Device" in the "Additional Device Support" section. That is used to do power control.

You will now need the IPMI configuration for the system. How this is configured is dependent on your vendor. For Dell, consult the DRAC documentation. For HP, consult the iLO documentation.

Specifically we need the IP address of the DRAC/iLO port, the IPMI username and the IPMI password.

If you don't have these, or you have not configured IPMI yet, you can come back to this later.

Click the edit button in the IPMI column for controller1.

ipmi 2

Enter the IP address, username and password.

Click Update.

Click the "Test" button in the IPMI column for controller1.

ipmi 3

Now perform the IPMI configuration and test for controller2.

This configuration is automatically copied between the cluster nodes. You only need to do this on one controller.

A fully working IPMI configuration is required for automatic failover to work in the event of controller failure.
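
The GUI test is authoritative, but you can also sanity-check the DRAC/iLO credentials from any Linux shell with ipmitool (placeholders shown; use the address, username and password you configured above):

ipmitool -I lanplus -H <drac-or-ilo-ip> -U <username> -P <password> chassis status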

Configure Disks

In the GUI, go to Cluster > Disks.

This is the local RAID array (type local) and the array from the peer controller (type iSCSI).

disks 1

On the local disk, click the edit button.

disks 2

We'll use the hostname for the disk alias ("controller1" for the local disk and "controller2" for the iSCSI disk).

Set the media as well.

Click Update.

Do the same for the iSCSI disk.

disks 3

All disks are multipathed. This is to support multiple switches in cluster setups where the storage replication is switch-based.

This configuration is automatically copied between the cluster nodes. You only need to do this on one controller.
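
If you want to inspect the multipath state from the shell, and assuming Zetavault sits on the standard Linux device-mapper multipath stack, this lists each disk together with its active paths:

multipath -ll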

Configure Virtual Networking

We need to create a virtual network interface that will move between controllers with the storage pool.

In the GUI, go to Cluster > Interfaces.

We have already configured the interface for storage traffic on each server. These are the IP addresses we will use:

controller1 192.168.20.37 (already configured)
controller2 192.168.20.38 (already configured)
virtual 192.168.20.39

Click the "Add Virtual Ethernet" button.

network 1

Select the parent.

Enter the IP address.

Click Add.

This configuration is not automatically copied between the cluster nodes. That is because the parent interface might be a different ethX device on each node.

Add the virtual interface on controller2 as well.
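
To confirm the virtual address is present, you can list the addresses on the parent interface of whichever controller currently owns it (eth4 matches the storage interface used in this article; yours may differ):

ip -4 addr show eth4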

Create Storage Pool

We will now create the ZFS storage pool.

Back to controller1.

Run the "Create ZFS Pool" wizard.


zfs create pool 1

These are the disks detected in the system. Click Next.


zfs create pool 2

Select "RAID 1" for the RAID level. Select each disk. Click Next.


zfs create pool 3

The disks are scanned to see if previous storage pools exist. Click Next.


zfs create pool 4

Initial pool options. Enter the pool name. Leave the rest as defaults. Click Next.


zfs create pool 5

Check the configuration. Click Next.


zfs create pool 6

The wizard will now create the pool.
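
Once the wizard finishes, you can confirm the mirror layout from the shell. A healthy pool shows both cluster disks online; the pool name pool1 matches the example used later in this article:

zpool status pool1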

Configure Pool

In the GUI, go to Cluster > Pools.

The pool is shown in the table. It is currently running, but it does not have a configuration.

pools 1

In the table row, click the add button.

pools 2

Select the virtual interface.

Select the "SAN" service.

Click Add.

The configuration is now shown in the "Config" part of the table.

pools 3

To add new interfaces and services this pool will handle, just edit the configuration here.

This configuration is automatically copied between the cluster nodes. You only need to do this on one controller.

SAN Configuration

That's the cluster configured. The systems are now replicating.

You can export the cluster storage by iSCSI, NFS or SMB. In this article we'll export it by iSCSI.

Run the "SAN Target" wizard.


san 01

The pool has been detected. Click Next.


san 02

Select "iSCSI". Click Next.


san 03

Select the virtual interface that was created earlier. Click Next.


san 04

Enter a target name and alias. Click Next.


san 05

Enter the group name. Click Next.


san 06

We need to add at least one initiator. Click the "Add new user" button.


san 07

Enter the iSCSI IQN initiator name for the VMware host.

Enter the alias as well. Click Add.


san 08

Select the initiator we just created. Click Next.


san 09

We need a volume to hold the data. Click the "Create new volume" button.


san 10

Enter the name of the volume. We have used "vol1" for the name.

Enter the size of the volume. We have set it to 1024 GB (1 TB).

Leave the rest of the options as the defaults. Click Create.


san 11

Select the volume we just created. Click Next.


san 12

Check the configuration. Click Next.


san 13

The wizard will now configure the SAN.


The SAN configuration is automatically copied between the cluster nodes. You only need to do this on one controller.

Any changes to the SAN configuration will be automatically copied between the cluster nodes.
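
For the curious: we assume the volume created by the wizard is a ZFS zvol, so the rough shell equivalent of the 1024 GB volume above would be something like the following (do not run this alongside the wizard; it is only to illustrate what is happening underneath):

zfs create -V 1024G pool1/vol1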

VMware Integration

We will now connect our VMware host.

Networking

If you already have a VMware host with a configured iSCSI software adapter, you can skip this part.

We need to create a vSwitch for the iSCSI traffic.

In the vSphere Client, go to Configuration > Networking.

vmware 01

Click "Add Networking..." in the top right.

vmware 02

Select "VMkernel". Click Next.

vmware 03

Select the 10 Gigabit adapter. Click Next.

vmware 04

Give the switch a name in the "Network Label" box. Leave the other options as the defaults. Click Next.

vmware 05

Enter the IP address. Make sure this is on the same subnet as the virtual interface we created above. Click Next.

vmware 06

Confirm the settings. Click Finish.

The switch is now created.

vmware 07

Next to the switch, click "Properties..."

vmware 08

Leaving the vSwitch selected, click "Edit..."

vmware 09

Change the MTU to 9000. Click OK.

On the left, select the switch name.

vmware 10

Click "Edit..."

vmware 11

Change the MTU to 9000. Click OK.

That is the networking configured.
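
Before moving on, it's worth confirming that jumbo frames pass end-to-end from the ESXi host to the storage virtual IP. This assumes the storage-side interfaces and any switch ports in the path are also set to MTU 9000; 8972 bytes leaves room for the IP and ICMP headers:

vmkping -d -s 8972 192.168.20.39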

Storage Adapter

If you already have a VMware host with a configured iSCSI software adapter, you can skip this part.

Go to Configuration > Storage Adapters.

Click "Add..." in the top right.

vmware 13

Select "Add Software iSCSI Adapter". Click OK.

The "iSCSI Software Adapter" now appears in the storage adapters list.

vmware 14

Select the iSCSI adapter. Click "Properties..." in the bottom right.

Click on the "Network Configuration" tab.

vmware 15

Click the "Add..." button.

vmware 16

Select the storage vSwitch we created. Click OK.

Click Close.

iSCSI Discovery

Go to Configuration > Storage Adapters.

Select the iSCSI adapter. Click "Properties..." in the bottom right.

Click on the "Dynamic Discovery" tab.

vmware 17

Click "Add..."

vmware 18

Enter the address of the virtual interface. Click OK.

iSCSI discovery will now take place.

Click on the "Static Discovery" tab.

vmware 19

Confirm the discovery has been successful. The correct target name should be shown.

Click Close.

You will be prompted to rescan the host bus adapter. Click Yes.

vmware 20

The iSCSI disk is now present. The storage is now connected to the VMware host.

From there you can choose to create a datastore on the disk or connect the disk directly to a VM using pass-through.
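
If you prefer the ESXi command line, the dynamic discovery and rescan steps above can also be done with esxcli. The adapter name vmhba33 is only an example; list the adapters first and substitute your own:

esxcli iscsi adapter list
esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=192.168.20.39:3260
esxcli storage core adapter rescan --adapter=vmhba33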

VMware Performance

We ran the benchmark in a Windows Server 2012 R2 virtual machine running on an ESX 6.0.0 host.

The ESX host is connected to the cluster storage by iSCSI over a single 10 Gigabit Ethernet connection exactly as per the instructions above.

A datastore was created from the iSCSI LUN and formatted as VMFS5.

vmware 21

The virtual machine OS disk and test disk exist on this datastore.

We used Iometer 1.1.0 for the test.

We created a 30 GB virtual disk to run the test on.

iometer

The test is the "unofficial storage performance" VMware benchmark.

Details here:
http://vmktree.org/iometer/

Here is the current discussion thread on the VMware Technology Network:
https://communities.vmware.com/thread/197844

Here are the results:

Test Name | Latency (ms) | Avg IOPS | Avg MB/s
Max Throughput 100% Read | 1.82 | 32,825 | 1,025
Real Life 60% Random 65% Read | 1.49 | 35,983 | 281
Max Throughput 50% Read | 1.50 | 39,796 | 1,243
Random 8K 70% Read | 1.43 | 41,026 | 320

The benchmark was done with the storage systems doing replication over a 10 Gigabit Ethernet TCP/IP link. RDMA was not used for storage replication.

Failover

We can migrate the storage pool between controllers without storage client downtime.

We can do this manually, for example when we want to reboot a controller for firmware updates.

In the event of controller failure, this will happen automatically if IPMI is configured.

In the GUI, go to Cluster > Overview.

failover 1

The pool is running on controller1.

Click on controller2.

failover 2

Select Fetch on "pool1". Click Process.

failover 3

The pool will migrate to controller2. The virtual interface and SAN service will follow.

failover 4

The pool, virtual interface(s) and storage service(s) are now running on controller2.

Let's do this while a virtual machine is running Iometer and playing a video.

This video shows a VMware guest running Windows Server 2012 R2.

The Iometer benchmark tool is started, and the storage pool is migrated from controller1 to controller2.

A video is also played while migration takes place.

Automatic Rebuild

Storage rebuild is completely automatic.

In the GUI, go to Cluster > Overview.

rebuild 1

The pool is in the normal state.

Click on the pool name.

rebuild 2

This is the full ZFS pool status.

Let's reboot controller2.

rebuild 3

The pool is now degraded. This is akin to a disk failing in a RAID 1 mirror.

The storage services will continue as normal. The storage clients are unaffected.

Click on the pool name.

rebuild 4

It shows that the "controller2" disk is offline and the pool is degraded.

Wait for controller2 to boot.

rebuild 5

The controller has now booted and automatically rejoined the cluster.

Click on the pool name.

rebuild 6

The storage on controller2 is now synchronizing with the storage on controller1.

In ZFS terminology, the pool is "resilvering".

Wait for the pool to finish resilvering.

rebuild 1

Click on the pool name.

rebuild 7

The pool has now finished resilvering.

The storage on each node is now fully synchronized.

The pool is in the normal state.

What About Something Larger?

Our cluster architecture is designed to scale both up and out. You are not restricted to 2 nodes. You can scale to 100s of nodes in a single cluster.

Here is an example of a cluster with 4x head nodes (no internal storage), 2x SSD based storage nodes and 2x HDD based storage nodes.

cluster

The head nodes are HP ProLiant DL320e Gen8.

The storage nodes are Supermicro, with SSD drives in the 2.5" systems and HDD drives in the 3.5" systems.

An Ethernet or InfiniBand switch handles the connection between the head nodes and the storage nodes.

Conclusion

It's now easier than ever to build high availability storage systems.

Gone are the days of "working" with a vendor for a couple of weeks before they give you a price.

Gone are the days of being tied to complex and expensive JBOD-based systems that restrict you to dual-port SAS drives.

You can now build high availability storage clusters using SATA, SAS, SSD and NVMe disks.

An entry-level cluster can be purchased for what some vendors would charge in annual support alone.

Software-defined storage has most definitely come of age.