Vhost vs local NVMe-over-fabrics targets

Sungjoon Ahn
Apr 23, 2019

Overview

We show how to set up vhost targets as a local SPDK storage service and compare their basic throughput and latency numbers against local NVMe-over-fabrics connections, which are an alternative way to offer local storage service based on SPDK.

SPDK provides an accelerated vhost target by applying the same user-space and polling techniques used elsewhere in SPDK. Since SPDK polls for vhost submissions, it can signal the guest's virtio driver to skip notifications on I/O submission. As a result, CPU usage can be significantly reduced on heavy I/O workloads. Currently, SPDK provides Vhost-SCSI, Vhost-BLK, and Vhost-NVMe as vhost targets. In this note, we use Vhost-BLK for the target storage.

Qemu VM connecting to vhost targets

We set up a Qemu VM that utilizes Vhost-BLK targets as shown in the diagram.

Fig 1. QEMU/SPDK vhost data flow (from spdk.io web site)

First, create a QEMU VM and install Ubuntu 18.10 desktop on it. We continue to use the OCSSD qemu-nvme fork, which is based on QEMU emulator version 3.1.50, for our VM. The SPDK vhost target requires QEMU version 2.12 or later for Vhost-BLK targets.

cbuser@pm111:~/github/qemu-nvme/bin$ ./qemu-img create -f qcow2 \
~/work/qemu/u1.qcow2 20G
cbuser@pm111:~/github/qemu-nvme/bin$ ./qemu-system-x86_64 -m 4G \
-enable-kvm -drive if=virtio,file=/home/cbuser/work/qemu/u1.qcow2 \
-cdrom /tmp/ubuntu-18.10-desktop-amd64.iso -vnc :2

Second, we start the SPDK vhost target service on the physical machine that hosts the QEMU VM. For the target block devices, we use an SPDK malloc RAM drive and an NVMe SSD.

cbuser@pm101:~$ sudo NRHUGE=20 ~/github/spdk/scripts/setup.sh
cbuser@pm101:~$ sudo ~/github/spdk/app/vhost/vhost --wait-for-rpc \
-r localhost:7778 -S /var/tmp &
cbuser@pm101:~$ sudo ~/github/spdk/scripts/rpc.py -s localhost -p 7778 \
start_subsystem_init
cbuser@pm101:~$ sudo ~/github/spdk/scripts/rpc.py -s localhost -p 7778 \
construct_malloc_bdev -b Malloc1 512 512
cbuser@pm101:~$ sudo ~/github/spdk/scripts/rpc.py -s localhost -p 7778 \
construct_vhost_blk_controller vhost.0 Malloc1
cbuser@pm101:~$ sudo ~/github/spdk/scripts/rpc.py -s localhost -p 7778 \
construct_nvme_bdev -b NVMe0 -t PCIe -a 0000:d8:00.0
cbuser@pm101:~$ sudo ~/github/spdk/scripts/rpc.py -s localhost -p 7778 \
construct_vhost_blk_controller vhost.1 NVMe0n1
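
Since every step above repeats the same rpc.py prefix, the sequence can be folded into a small helper. The following is only a sketch: the names RPC_PY, RPC_PORT, DRY_RUN, run, rpc and setup_vhost_targets are our own illustrative choices and not part of SPDK, while the RPC method names and arguments are the ones used above. By default it only prints the commands, so it can be reviewed before anything is executed.

```shell
# Illustrative variables (not SPDK settings): adjust to your checkout and port.
RPC_PY="${RPC_PY:-$HOME/github/spdk/scripts/rpc.py}"
RPC_PORT="${RPC_PORT:-7778}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One rpc.py call against the vhost app listening on localhost:$RPC_PORT.
rpc() {
    run sudo "$RPC_PY" -s localhost -p "$RPC_PORT" "$@"
}

# Same five RPC steps as in the transcript, in the same order.
setup_vhost_targets() {
    rpc start_subsystem_init
    rpc construct_malloc_bdev -b Malloc1 512 512
    rpc construct_vhost_blk_controller vhost.0 Malloc1
    rpc construct_nvme_bdev -b NVMe0 -t PCIe -a 0000:d8:00.0
    rpc construct_vhost_blk_controller vhost.1 NVMe0n1
}

setup_vhost_targets
```

Running it with DRY_RUN=0 would issue the actual calls, assuming the vhost app has already been started with --wait-for-rpc as shown above.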

Third, we launch QEMU with the following parameters. Note that we pass /var/tmp/vhost.[0,1] to identify the vhost targets we created in the second step.

cbuser@pm111:~$ sudo ~/github/qemu-nvme/bin/qemu-system-x86_64 \
--enable-kvm -cpu host -smp 1 -m 4G \
-drive file=/home/cbuser/work/qemu/u1.qcow2,if=none,id=disk \
-device ide-hd,drive=disk,bootindex=0 \
-object memory-backend-file,id=mem0,size=4G,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem0 \
-chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.0 \
-device vhost-user-blk-pci,chardev=spdk_vhost_blk0,num-queues=4 \
-chardev socket,id=spdk_vhost_blk1,path=/var/tmp/vhost.1 \
-device vhost-user-blk-pci,chardev=spdk_vhost_blk1,num-queues=4 \
-net nic,macaddr=DE:AD:BE:EF:01:41 \
-net tap,ifname=tap0,script=q_br_up.sh,downscript=q_br_down.sh -vnc :2
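
Each vhost target contributes one -chardev/-device pair to the QEMU command line, and the pairs differ only in the socket index. As a small sketch of that pattern, the helper below prints the pair for each index it is given. VHOST_DIR, NUM_QUEUES and vhost_blk_args are hypothetical names we introduce here; the option strings themselves are copied from the command above.

```shell
# Illustrative variables: socket directory passed to the vhost app with -S,
# and the queue count used for each vhost-user-blk device.
VHOST_DIR="${VHOST_DIR:-/var/tmp}"
NUM_QUEUES="${NUM_QUEUES:-4}"

# Print one "-chardev ..." line and one "-device ..." line per socket index.
vhost_blk_args() {
    for i in "$@"; do
        printf '%s\n' "-chardev socket,id=spdk_vhost_blk$i,path=$VHOST_DIR/vhost.$i"
        printf '%s\n' "-device vhost-user-blk-pci,chardev=spdk_vhost_blk$i,num-queues=$NUM_QUEUES"
    done
}

# The two targets created earlier: vhost.0 (Malloc1) and vhost.1 (NVMe0n1).
vhost_blk_args 0 1
```

The printed lines can be pasted into the QEMU invocation; adding a third target would only require creating vhost.2 on the SPDK side and passing one more index.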

Inside the VM, you can verify that the vhost block devices have been created as expected. In the example below, they appear as the vda and vdb rows.

cbuser@qemu-nvme141:~$ lsblk --output \
"NAME,KNAME,MODEL,HCTL,SIZE,VENDOR,SUBSYSTEMS"
NAME   KNAME MODEL         HCTL     SIZE VENDOR   SUBSYSTEMS
fd0    fd0                            4K          block:platform
loop0  loop0                        3.7M          block
loop1  loop1                      140.9M          block
...(truncated)...
sda    sda   QEMU HARDDISK 1:0:0:0   20G ATA      block:scsi:pci
└─sda1 sda1                          20G          block:scsi:pci
vda                                 512M 0x1af4   block:virtio:pci
vdb                                 477G 0x1af4   block:virtio:pci
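
When the guest has many loop devices, it can be easier to filter the listing down to the virtio rows. The sketch below keeps the header plus every row whose last column (SUBSYSTEMS) mentions virtio. The function name virtio_rows is our own, and the sample input is a shortened copy of the listing above so the snippet can be tried without a VM.

```shell
# Keep the header line and any row whose last field contains "virtio".
virtio_rows() {
    awk 'NR == 1 || $NF ~ /virtio/'
}

# Demo on a shortened copy of the listing above.
virtio_rows <<'EOF'
NAME   SIZE VENDOR   SUBSYSTEMS
sda     20G ATA      block:scsi:pci
vda    512M 0x1af4   block:virtio:pci
vdb    477G 0x1af4   block:virtio:pci
EOF
```

Inside the VM, the same filter would be fed from lsblk, for example: lsblk --output "NAME,SIZE,VENDOR,SUBSYSTEMS" | virtio_rows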

Comparison with local NVMe-over-fabrics

One can set up local NVMe-over-fabrics targets backed by an SPDK malloc drive as described in our previous tech notes. Here we show the additional SPDK setup commands needed for the NVMe-oF RDMA target backed by the NVMe drive.

cbuser@pm111:~$ sudo ~/github/spdk/app/nvmf_tgt/nvmf_tgt --wait-for-rpc \
-r localhost:7779
cbuser@pm111:~$ sudo ~/github/spdk/scripts/rpc.py -s localhost -p 7779 \
construct_nvme_bdev -b NVMe0 -t PCIe -a 0000:d8:00.0
cbuser@pm111:~$ sudo ~/github/spdk/scripts/rpc.py -s localhost -p 7779 \
nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode02 NVMe0n1
cbuser@pm111:~$ sudo ~/github/spdk/scripts/rpc.py -s localhost -p 7779 \
nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode02 -t RDMA \
-a 10.0.0.11 -s 4421
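
For the target to show up as a kernel block device on the same machine, the host also has to connect to the listener. The sketch below assumes the Linux kernel NVMe-oF initiator with nvme-cli installed, which this note does not show; the subsystem NQN, address and port are taken from the listener command above, and NQN, TARGET_ADDR, TARGET_PORT, DRY_RUN, run and nvmf_connect are illustrative names. By default it only prints the commands.

```shell
# Values copied from the nvmf_subsystem_add_listener call above.
NQN="nqn.2016-06.io.spdk:cnode02"
TARGET_ADDR="10.0.0.11"
TARGET_PORT="4421"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Load the kernel RDMA initiator and connect to the SPDK subsystem.
nvmf_connect() {
    run sudo modprobe nvme-rdma
    run sudo nvme connect -t rdma -n "$NQN" -a "$TARGET_ADDR" -s "$TARGET_PORT"
}

nvmf_connect
```

After a successful connect, the namespace should appear as an additional /dev/nvmeXnY device; the exact name depends on what else is attached to the host.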

We ran latency and throughput tests with fio. For simplicity, we passed command line parameters to fio instead of using job files. Example parameters for a throughput test and a latency test are shown below.

cbuser@qemu-nvme141:~$ sudo fio --name=global --filename=/dev/vda \
--direct=1 --rw=read --norandommap --ioengine=libaio --bs=256k \
--iodepth=64 --numjobs=1 --time_based --runtime=60 --group_reporting \
--name=job1
cbuser@pm111:~$ sudo fio --name=global --filename=/dev/nvme1n1 \
--direct=1 --rw=randwrite --norandommap --ioengine=libaio --bs=4k \
--iodepth=1 --numjobs=1 --time_based --runtime=60 --group_reporting \
--name=job1
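
The two invocations differ only in the device, the access pattern, the block size and the queue depth: large sequential I/O at depth 64 for throughput, and 4k random I/O at depth 1 for latency. As a sketch of that parameterization, the helper below prints the fio command for a given combination. The function name fio_cmd is hypothetical, and every fio option is copied from the commands above.

```shell
# Print the fio command line for one test.
#   $1 = device, $2 = rw pattern, $3 = block size, $4 = iodepth
fio_cmd() {
    echo "sudo fio --name=global --filename=$1 --direct=1 --rw=$2" \
         "--norandommap --ioengine=libaio --bs=$3 --iodepth=$4" \
         "--numjobs=1 --time_based --runtime=60 --group_reporting --name=job1"
}

# Throughput test inside the VM against the vhost malloc device.
fio_cmd /dev/vda read 256k 64
# Latency test on the host against the local NVMe-oF device.
fio_cmd /dev/nvme1n1 randwrite 4k 1
```

Printing the command rather than running it keeps the sketch safe to try; note that a write test overwrites whatever is on the target device.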

The test results are presented in the graphs below.

Fig 2. Throughput of malloc/nvme over vhost/nvmf

For throughput, the SPDK malloc drive over vhost delivers the highest numbers. NVMe-oF throughput numbers are much lower. The NVMe drive's write throughput of about 2.1GB/s appears to be bounded by the SSD's own performance, which we verified with additional device tests. The local NVMe-oF throughput maxes out around 2.4GB/s.

Fig 3. Latency of malloc/nvme over vhost/nvmf

For latency, vhost performs about 3.2usec to 7.9usec better than NVMe-oF RDMA connections. For all tests, the NVMe SSDs were preconditioned with writes totaling twice the disk capacity, which shows up as much higher read latency numbers for both vhost and NVMe-oF.

Fig 4. vhost against reference performance numbers

Lastly, we measured reference numbers, i.e. the best read throughput and latency, of SPDK malloc and the NVMe SSD. Read throughput was virtually a tie, while the vhost protocol seems to add about 5.2usec to the read latency.

Summary

The vhost and NVMe-oF RDMA protocols are useful tools for exposing local SPDK devices as kernel block devices. This is one of the powerful features of SPDK, since it makes SPDK devices available to a multitude of existing applications. The vhost protocol showed superior performance, while local NVMe-oF RDMA connections still showed sufficient performance for scenarios where applications run on the same physical machine. Our measurements here are not meant to be comprehensive. Interested readers should run their own tests before determining whether this configuration suits their requirements.

Software versions

Sticking to the following software versions should ensure that all the commands in this article work in your environment:

Linux OS: Ubuntu 18.10 Desktop for both the physical machine and the QEMU virtual machine
Linux kernel: version 4.18.0
SPDK: v19.01-642-g130a5f772

Questions?

Contact info@circuitblvd.com for additional information.