Recently, our laboratory acquired eight NVMe SSDs, which we attached to a server via PCIe adapter cards and pooled into a RAID array using ZFS. To give the other servers in the lab fast access to the storage pool, I picked up two ConnectX-4 CX4121A 10GbE network cards from a second-hand marketplace to link two of the servers, and set up NFS over RDMA between them.

The seller did not include optical modules, so I picked up two generic Huawei 10G modules at about 15-20 RMB each.

Driver Installation

I visited NVIDIA’s official website to download the NVIDIA Firmware Tools (MFT). Given that our laboratory uses Ubuntu, I downloaded the package mft-4.25.0-62-x86_64-deb.tgz.

After downloading, I extracted the archive and ran the bundled install.sh.
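A minimal sketch of those steps, assuming the archive unpacks into a directory of the same name:

tar xzf mft-4.25.0-62-x86_64-deb.tgz
cd mft-4.25.0-62-x86_64-deb
./install.sh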

Once installed, I started the MST service with mst start and checked the card status with mst status, which also shows which PCI device the NIC corresponds to.

# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4117_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:04:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

Since these cards are used solely for the server-to-server link, static IPs and routes can simply be configured on both machines.
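For example, the link can be brought up temporarily with the ip command; the interface name enp4s0f0 and the client address 7.0.115.2 are assumptions for illustration, while 7.0.115.1 is the server address used later in this post (a persistent setup would go into netplan or the equivalent):

# on the storage server
ip addr add 7.0.115.1/24 dev enp4s0f0
ip link set enp4s0f0 up

# on the client server
ip addr add 7.0.115.2/24 dev enp4s0f0
ip link set enp4s0f0 up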

Firmware Update

I went to NVIDIA’s firmware download page and downloaded the appropriate zip file for the card (taking care to distinguish between the Ethernet and InfiniBand variants).

After identifying the device (the path with the /dev/mst prefix from mst status), I flashed the firmware with flint -d <device_name> -i <binary image> burn.
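Concretely, with the device name reported by mst status above, the process might look like the following; the image file name is a placeholder for the .bin extracted from the downloaded zip, and mlxfwreset ships with MFT (a reboot works as well):

# check the currently installed firmware version
flint -d /dev/mst/mt4117_pciconf0 query
# burn the new image
flint -d /dev/mst/mt4117_pciconf0 -i fw-ConnectX4Lx.bin burn
# reset the device so the new firmware takes effect
mlxfwreset -d /dev/mst/mt4117_pciconf0 reset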

Installing MLNX_OFED

Although these are Ethernet cards, Mellanox’s RDMA kernel modules still require installing the MLNX_OFED package that is nominally provided for IB cards. I downloaded the latest MLNX_OFED_LINUX-23.07-0.5.0.0-ubuntu22.04-x86_64.tgz directly from NVIDIA’s website, without opting for the LTS version.

After extracting, I ran:

./mlnxofedinstall
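Once the installer finishes, the Mellanox modules can be reloaded and the installation checked; openibd, ofed_info, and ibv_devinfo are all installed by MLNX_OFED, and a reboot works just as well as the restart:

/etc/init.d/openibd restart   # reload the Mellanox kernel modules
ofed_info -s                  # print the installed OFED version
ibv_devinfo                   # the ConnectX-4 ports should appear with link_layer: Ethernet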

Then I manually installed the NFS-RDMA kernel module, typically found at ./DEBS/mlnx-nfsrdma-dkms_23.04-OFED.23.04.0.5.3.1_all.deb; alternatively, find . | grep nfs | grep .deb will locate it. Install it with dpkg -i ./DEBS/mlnx-nfsrdma-dkms_23.04-OFED.23.04.0.5.3.1_all.deb, and be patient, as DKMS needs to build the kernel module.
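Once dpkg returns, it is worth confirming that DKMS actually built and installed the module for the running kernel; dkms and modinfo are standard tools, so a quick check could look like:

dkms status | grep -i rdma        # the mlnx-nfsrdma module should show as installed
modinfo rpcrdma | grep filename   # the module path should point at the freshly built .ko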

NFS Server Installation

The NFS server was installed on the storage-server side:

apt install nfs-kernel-server
systemctl start nfs-kernel-server.service

Before starting NFS, the corresponding kernel module needs to be loaded and the RDMA port added to NFS’s portlist:

/sbin/modprobe rpcrdma
echo 'rdma 20049' | tee /proc/fs/nfsd/portlist

To have this happen automatically, the /lib/systemd/system/nfs-kernel-server.service file can be edited to include the ExecStartPre and ExecStartPost lines shown below.

[Unit]
Description=NFS server and services
DefaultDependencies=no
Requires=network.target proc-fs-nfsd.mount
Requires=nfs-mountd.service
Wants=rpcbind.socket network-online.target
Wants=rpc-statd.service nfs-idmapd.service
Wants=rpc-statd-notify.service
Wants=nfsdcld.service

After=network-online.target local-fs.target
After=proc-fs-nfsd.mount rpcbind.socket nfs-mountd.service
After=nfs-idmapd.service rpc-statd.service
After=nfsdcld.service
Before=rpc-statd-notify.service

# GSS services dependencies and ordering
Wants=auth-rpcgss-module.service
After=rpc-gssd.service gssproxy.service rpc-svcgssd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=-/usr/sbin/exportfs -r
ExecStartPre=/sbin/modprobe rpcrdma
ExecStart=/usr/sbin/rpc.nfsd
ExecStartPost=/bin/bash -c "sleep 3 && echo 'rdma 20049' | tee /proc/fs/nfsd/portlist"
ExecStop=/usr/sbin/rpc.nfsd 0
ExecStopPost=/usr/sbin/exportfs -au
ExecStopPost=/usr/sbin/exportfs -f

ExecReload=-/usr/sbin/exportfs -r

[Install]
WantedBy=multi-user.target

Then reload systemd and restart the NFS service.
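These are the standard systemd commands:

systemctl daemon-reload
systemctl restart nfs-kernel-server.service

After the restart, /proc/fs/nfsd/portlist should list both the standard TCP port 2049 and RDMA’s 20049: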

# cat /proc/fs/nfsd/portlist

rdma 20049
rdma 20049
tcp 2049
tcp 2049

Exporting Directories

One option is to edit /etc/exports directly.
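For reference, a hypothetical entry matching this setup (the path and network come from the rest of this post; the options mirror the exportfs -v output further down) might look like this, followed by exportfs -ra to apply it:

/data/pool-name 7.0.115.0/24(rw,async,no_root_squash,no_subtree_check)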

Alternatively, if the directory lives on a ZFS filesystem, the following command allows access to the pool-name pool from the 7.0.115.0/24 network:

zfs set sharenfs="rw=@7.0.115.0/24,no_root_squash,async" pool-name

Running exportfs -v should then show the exported directory:

# exportfs -v
/data/pool-name 7.0.115.0/24(async,wdelay,hide,no_subtree_check,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)

Client Operations

The client also needs to load rpcrdma before mounting; this can likewise go into an ExecStartPre= line of a systemd unit, or simply be run by hand:

modprobe rpcrdma
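Alternatively, to make the module load persist across reboots without touching any unit files, systemd’s modules-load.d mechanism works (a minimal sketch):

echo rpcrdma > /etc/modules-load.d/rpcrdma.conf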

Then proceed with mounting:

mount 7.0.115.1:/data/pool-name /data/pool-name -o rdma,port=20049,async,noatime,nodiratime -vvvv

If all goes well, the following can be added to /etc/fstab:

7.0.115.1:/data/pool-name /data/pool-name nfs rdma,port=20049,async,noatime,nodiratime 0 0
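To double-check that the mount really negotiated RDMA, the active mount options can be inspected; nfsstat ships with nfs-common, and /proc/mounts works just as well:

nfsstat -m                    # the flags should include proto=rdma,port=20049
grep pool-name /proc/mounts   # same information from the kernel’s mount table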

Speed Test

I used fio to test the sequential write speed:

fio --name=testfile --directory=/data/pool-name/speedtest --size=2G --numjobs=10 --rw=write --bs=1000M --ioengine=libaio --fdatasync=1 --runtime=60 --time_based --group_reporting --eta-newline=1s

The write speed essentially saturated the 10G network card (1078 MiB/s), and iftop showed no traffic on the NIC during the test; since RDMA bypasses the kernel network stack, this confirms that the NFS traffic was indeed being carried over RDMA.

testfile: (groupid=0, jobs=10): err= 0: pid=3968057: Thu Aug 24 08:00:00 2023
  write: IOPS=1, BW=1078MiB/s (1130MB/s)(67.4GiB/64006msec); 0 zone resets
    slat (msec): min=328, max=6801, avg=3973.96, stdev=1633.37
    clat (nsec): min=1780, max=12590, avg=3690.00, stdev=1745.87
     lat (msec): min=328, max=6801, avg=3973.96, stdev=1633.37
    clat percentiles (nsec):
     |  1.00th=[ 1784],  5.00th=[ 2064], 10.00th=[ 2192], 20.00th=[ 2800],
     | 30.00th=[ 2928], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[ 3408],
     | 70.00th=[ 3824], 80.00th=[ 4576], 90.00th=[ 4896], 95.00th=[ 6688],
     | 99.00th=[12608], 99.50th=[12608], 99.90th=[12608], 99.95th=[12608],
     | 99.99th=[12608]
   bw (  MiB/s): min=19984, max=20000, per=100.00%, avg=19997.62, stdev= 0.93, samples=64
   iops        : min=   16, max=   20, avg=19.40, stdev= 0.23, samples=64
  lat (usec)   : 2=2.90%, 4=71.01%, 10=24.64%, 20=1.45%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=20, max=9970, avg=597.50, stdev=1058.05
    sync percentiles (nsec):
     |  1.00th=[   20],  5.00th=[   50], 10.00th=[  110], 20.00th=[  161],
     | 30.00th=[  231], 40.00th=[  382], 50.00th=[  470], 60.00th=[  532],
     | 70.00th=[  612], 80.00th=[  708], 90.00th=[  948], 95.00th=[ 1464],
     | 99.00th=[ 9920], 99.50th=[ 9920], 99.90th=[ 9920], 99.95th=[ 9920],
     | 99.99th=[ 9920]
  cpu          : usr=1.00%, sys=7.62%, ctx=1122377, majf=0, minf=141
  IO depths    : 1=233.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,69,0,0 short=92,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1078MiB/s (1130MB/s), 1078MiB/s-1078MiB/s (1130MB/s-1130MB/s), io=67.4GiB (72.4GB), run=64006-64006msec
