Recently, our laboratory acquired eight NVMe SSDs, which we connected to a server via PCIe adapter cards and configured into a RAID using ZFS. To enable rapid access to the storage pool from other servers in the lab, I picked up two ConnectX-4 CX4121A 10GbE network cards from a second-hand platform to link two servers, and set up NFS over RDMA.
The second-hand seller did not include optical modules, so I separately picked up two 10G Huawei modules for about 15-20 RMB each.
Driver Installation
I visited NVIDIA’s official website to download the NVIDIA Firmware Tools (MFT). Given that our laboratory uses Ubuntu, I downloaded the package mft-4.25.0-62-x86_64-deb.tgz.
After downloading, I ran the install.sh script from the installation package.
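A minimal sketch of the steps, assuming the archive unpacks into a directory named after the package:
tar xzf mft-4.25.0-62-x86_64-deb.tgz
cd mft-4.25.0-62-x86_64-deb
./install.sh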
Once installed, I started MST by running mst start and checked the device status with mst status, which also shows which NIC on this machine corresponds to the card.
# mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4117_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:04:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 00
For network cards solely used for server interconnection, static IPs and routes can be configured for both servers.
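On Ubuntu this can be done with netplan. A minimal sketch for the storage server, assuming the interface is named enp4s0f0 (the file name is also illustrative) and using the 7.0.115.0/24 addressing that appears later in this post:
# /etc/netplan/99-rdma-link.yaml
network:
  version: 2
  ethernets:
    enp4s0f0:
      addresses:
        - 7.0.115.1/24
Apply it with netplan apply, and give the other server an address such as 7.0.115.2/24.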
Firmware Update
I went to NVIDIA’s firmware download page to download the appropriate zip file (note the distinction between Ethernet and IB cards).
After finding the device (the name with the /dev/mst prefix), I used flint -d <device_name> -i <binary image> burn to flash the firmware.
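For example, with the device name reported by mst status above (the .bin file name is only a placeholder for the firmware image extracted from the downloaded zip):
flint -d /dev/mst/mt4117_pciconf0 -i fw-ConnectX4Lx.bin burn
flint -d /dev/mst/mt4117_pciconf0 query
flint query prints the firmware versions, which makes it easy to confirm the burn succeeded.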
Installing MLNX_OFED
Although I purchased Ethernet cards, the RDMA kernel modules from Mellanox require installing the MLNX_OFED package provided for IB cards. I directly downloaded the latest MLNX_OFED_LINUX-23.07-0.5.0.0-ubuntu22.04-x86_64.tgz from NVIDIA's website, without opting for the LTS version.
After extracting, I ran:
./mlnxofedinstall
Then I manually installed the NFS-RDMA kernel module, typically found at ./DEBS/mlnx-nfsrdma-dkms_23.04-OFED.23.04.0.5.3.1_all.deb. Alternatively, find . | grep nfs | grep .deb can be used to locate it. Install it with dpkg -i ./DEBS/mlnx-nfsrdma-dkms_23.04-OFED.23.04.0.5.3.1_all.deb. Patience is required, as DKMS needs to build the kernel module.
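A quick sanity check that the module was built and installed, assuming the DKMS entry's name contains nfsrdma:
dkms status | grep nfsrdma
modinfo rpcrdma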
NFS Server Installation
The NFS server was installed on the storage server:
apt install nfs-kernel-server
systemctl start nfs-kernel-server.service
Before starting NFS, the corresponding kernel module needs to be loaded and the RDMA port added to NFS's portlist:
/sbin/modprobe rpcrdma
echo 'rdma 20049' | tee /proc/fs/nfsd/portlist
To have this happen automatically from then on, the /lib/systemd/system/nfs-kernel-server.service file can be edited to add ExecStartPre and ExecStartPost entries:
[Unit]
Description=NFS server and services
DefaultDependencies=no
Requires=network.target proc-fs-nfsd.mount
Requires=nfs-mountd.service
Wants=rpcbind.socket network-online.target
Wants=rpc-statd.service nfs-idmapd.service
Wants=rpc-statd-notify.service
Wants=nfsdcld.service
After=network-online.target local-fs.target
After=proc-fs-nfsd.mount rpcbind.socket nfs-mountd.service
After=nfs-idmapd.service rpc-statd.service
After=nfsdcld.service
Before=rpc-statd-notify.service
# GSS services dependencies and ordering
Wants=auth-rpcgss-module.service
After=rpc-gssd.service gssproxy.service rpc-svcgssd.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=-/usr/sbin/exportfs -r
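# Added for NFS over RDMA: load the RDMA transport module before nfsd starts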
ExecStartPre=/sbin/modprobe rpcrdma
ExecStart=/usr/sbin/rpc.nfsd
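# Added for NFS over RDMA: register the RDMA listener on port 20049 after nfsd is up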
ExecStartPost=/bin/bash -c "sleep 3 && echo 'rdma 20049' | tee /proc/fs/nfsd/portlist"
ExecStop=/usr/sbin/rpc.nfsd 0
ExecStopPost=/usr/sbin/exportfs -au
ExecStopPost=/usr/sbin/exportfs -f
ExecReload=-/usr/sbin/exportfs -r
[Install]
WantedBy=multi-user.target
Then run systemctl daemon-reload and restart the NFS service. Afterwards, the NFS port list should show both the standard 2049 and RDMA's 20049.
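That is:
systemctl daemon-reload
systemctl restart nfs-kernel-server.service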
# cat /proc/fs/nfsd/portlist
rdma 20049
rdma 20049
tcp 2049
tcp 2049
Exposing Mount Directories
Edit /etc/exports, or, if using a ZFS filesystem, the following command allows access to the pool-name pool from the 7.0.115.0/24 network:
zfs set sharenfs="[email protected]/24,no_root_squash,async" pool-name
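For a non-ZFS filesystem, a roughly equivalent /etc/exports entry would look like the following (using the mount point shown below), applied with exportfs -r:
/data/pool-name 7.0.115.0/24(rw,async,no_root_squash,no_subtree_check)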
After running exportfs -v, the exported directory should be visible:
# exportfs -v
/data/pool-name 7.0.115.0/24(async,wdelay,hide,no_subtree_check,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)
Client Operations
The client also needs to load rpcrdma before mounting; as on the server, this can be put in an ExecStartPre= of the relevant systemd unit:
modprobe rpcrdma
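Alternatively, systemd's modules-load.d mechanism can load the module at every boot:
echo rpcrdma > /etc/modules-load.d/rpcrdma.conf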
Then proceed with mounting:
mount 7.0.115.1:/data/pool-name /data/pool-name -o rdma,port=20049,async,noatime,nodiratime -vvvv
If all goes well, the following can be added to /etc/fstab:
7.0.115.1:/data/pool-name /data/pool-name nfs rdma,port=20049,async,noatime,nodiratime 0 0
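To double-check that the mount actually negotiated RDMA, nfsstat -m lists the mounted NFS filesystems together with their options; proto=rdma and port=20049 should appear among them:
nfsstat -m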
Speed Test
I used fio to test the sequential write speed:
fio --name=testfile --directory=/data/pool-name/speedtest --size=2G --numjobs=10 --rw=write --bs=1000M --ioengine=libaio --fdatasync=1 --runtime=60 --time_based --group_reporting --eta-newline=1s
The write speed was able to max out the 10G network card (1078MiB/s), and iftop showed no traffic on the NIC during the test; since RDMA bypasses the kernel network stack that iftop observes, this indicates the NFS traffic was indeed being carried directly over RDMA.
testfile: (groupid=0, jobs=10): err= 0: pid=3968057: Thu Aug 24 08:00:00 2023
write: IOPS=1, BW=1078MiB/s (1130MB/s)(67.4GiB/64006msec); 0 zone resets
slat (msec): min=328, max=6801, avg=3973.96, stdev=1633.37
clat (nsec): min=1780, max=12590, avg=3690.00, stdev=1745.87
lat (msec): min=328, max=6801, avg=3973.96, stdev=1633.37
clat percentiles (nsec):
| 1.00th=[ 1784], 5.00th=[ 2064], 10.00th=[ 2192], 20.00th=[ 2800],
| 30.00th=[ 2928], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[ 3408],
| 70.00th=[ 3824], 80.00th=[ 4576], 90.00th=[ 4896], 95.00th=[ 6688],
| 99.00th=[12608], 99.50th=[12608], 99.90th=[12608], 99.95th=[12608],
| 99.99th=[12608]
bw ( MiB/s): min=19984, max=20000, per=100.00%, avg=19997.62, stdev= 0.93, samples=64
iops : min= 16, max= 20, avg=19.40, stdev= 0.23, samples=64
lat (usec) : 2=2.90%, 4=71.01%, 10=24.64%, 20=1.45%
fsync/fdatasync/sync_file_range:
sync (nsec): min=20, max=9970, avg=597.50, stdev=1058.05
sync percentiles (nsec):
| 1.00th=[ 20], 5.00th=[ 50], 10.00th=[ 110], 20.00th=[ 161],
| 30.00th=[ 231], 40.00th=[ 382], 50.00th=[ 470], 60.00th=[ 532],
| 70.00th=[ 612], 80.00th=[ 708], 90.00th=[ 948], 95.00th=[ 1464],
| 99.00th=[ 9920], 99.50th=[ 9920], 99.90th=[ 9920], 99.95th=[ 9920],
| 99.99th=[ 9920]
cpu : usr=1.00%, sys=7.62%, ctx=1122377, majf=0, minf=141
IO depths : 1=233.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,69,0,0 short=92,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1078MiB/s (1130MB/s), 1078MiB/s-1078MiB/s (1130MB/s-1130MB/s), io=67.4GiB (72.4GB), run=64006-64006msec
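As a further cross-check that the data really travelled over RDMA, the RoCE hardware counters under /sys/class/infiniband keep climbing during the test even though iftop shows nothing; mlx5_0 here is an assumed device name (ibdev2netdev shows which device maps to which NIC):
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data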