RocksDB in local vs NVMe-oF environment
We would like to know how much impact NVMe-oF (NVMe-over-Fabrics) has on real-world database applications. To answer this question, let's run the RocksDB performance benchmark on both a local NVMe SSD and a remote NVMe-oF setup.
For NVMe-oF, we use Mellanox NICs with a 56 Gb/s Ethernet link connecting the two servers. Clone RocksDB from its GitHub page, follow the installation instructions there, and make sure to use version 6.0.1 with a release build via 'git checkout v6.0.1; make release'. For both the local and the remote NVMe-oF setup, we use a single NVMe SSD. Two types of SSDs, a TLC NAND SSD and an Optane SSD, are used to see whether faster storage hardware produces different test results. The specification of the server that runs the RocksDB benchmark over the local NVMe SSD, and of the NVMe-oF target, is:
CPU: 48 logical core Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
DRAM: 64 GB DDR4
1.8 TB TLC NAND NVMe SSD / 480 GB Optane SSD
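The clone-and-build steps above can be sketched as follows; the GitHub URL is the standard upstream repository, and everything else comes straight from the version pinned in this article:

```shell
# Clone RocksDB, pin the version used in this article, and do a release build.
git clone https://github.com/facebook/rocksdb.git
cd rocksdb
git checkout v6.0.1     # version pinned for reproducibility
make release            # optimized build; also produces the db_bench tool
```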
We use the benchmark tests for flash storage described in the RocksDB GitHub wiki. A summary of the five benchmark tests:
T1. Bulk Load of keys in Random Order - The keys are inserted in random order. The database is empty at the beginning of this benchmark run and gradually fills up. All files are loaded to L0 and then get merge-sorted into L1.
T2. Bulk Load of keys in Sequential Order - The keys are inserted in sequential order. The database is empty at the beginning of this benchmark run and gradually fills up. Multi-threaded compactions occur at multiple levels simultaneously.
T3. Random Write - Two billion keys are randomly overwritten into the database. The DB created in T2 is used as the starting point. The test is single-threaded, while compaction runs in parallel on 20 threads.
T4. Random Read - Measure the random read performance of a database with one billion keys. Random reads are carried out by a single thread.
T5. Multi-threaded Read and Single-threaded Write - Randomly read 100 million keys out of a database with 8 billion keys while existing keys receive ongoing updates. There are 32 dedicated random-read threads. A separate thread issues writes to the database at best effort.
For all test cases T1 to T5, we reduced the data size to 1/10 because the original 8 billion key-value pairs, worth 3.2 TB, are too big for our single-SSD setup. Table 2 presents performance measurements from the local TLC NAND SSD setup. Note that each test is best characterized by different performance metrics, as indicated by the populated table entries. Considering the reduced data size and different system parameters such as the number of SSDs, our numbers are reasonable compared to the numbers published on the GitHub wiki page (Table 1): we observe at most a single order-of-magnitude difference across the metrics. These results can therefore be regarded as a sanity-check pass.
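A minimal sketch of how T1 through T5 can be driven with RocksDB's tools/benchmark.sh wrapper, following the flash-storage benchmark page on the wiki. The subcommand names are from that wiki page; the DB_DIR/WAL_DIR paths are placeholders, and the NUM_KEYS value encodes our 1/10 reduction of the original 8-billion-key data set:

```shell
# Environment for tools/benchmark.sh (paths are placeholders for the SSD under test).
export DB_DIR=/mnt/nvme/rocksdb
export WAL_DIR=/mnt/nvme/rocksdb-wal
export NUM_KEYS=$((8000000000 / 10))   # 800 million keys, 1/10 of the original set

tools/benchmark.sh bulkload            # T1: bulk load, random key order
tools/benchmark.sh fillseq             # T2: bulk load, sequential key order
tools/benchmark.sh overwrite           # T3: random writes over the T2 database
tools/benchmark.sh readrandom          # T4: single-threaded random reads
tools/benchmark.sh readwhilewriting    # T5: multi-threaded reads, one writer
```

For the remote runs, only DB_DIR and WAL_DIR change to point at the filesystem mounted on the NVMe-oF block device; the benchmark invocations stay identical.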
Now let's compare the results from the local NVMe and the remote NVMe-oF setup for the TLC NAND SSD. The NVMe-oF numbers in Table 3 are consistently worse, but by fairly small margins, approximately 0.3% to 20% (Table 4). This meets our expectation: NVMe-oF over RDMA is supposed to add very little latency on top of the local PCIe NVMe interface.
We wondered how the RocksDB benchmark would perform on a faster SSD device. As Tables 5 and 6 show, the Optane SSD excels in T4, Random Read. This aligns with the Optane SSD's well-known, much lower random-read latency compared to TLC NAND flash devices. Throughput and write numbers are expected to be similar between the TLC NAND and Optane SSDs.
Our final table shows the NVMe-oF/local ratios for the Optane SSD. This setup, too, shows consistently better numbers for the local configuration, but again the differences are small, ranging from under 0.1% to about 11%.
We showed that NVMe-oF over RDMA adds very little latency over the local PCIe NVMe interface for two different SSD types. Pooling many NVMe SSDs and exposing them over NVMe-oF should therefore deliver the benefits of disaggregation while providing a similar level of application performance. Our RocksDB comparison tests support this point.
Our comparisons are, however, rather simple and should be taken with a grain of salt. For example, the data size was reduced from the original test set, which resulted in shallower/smaller LSM trees; such factors may affect performance. In addition, the RocksDB benchmark does not include test cases where point-query performance dominates the overall performance. Circuit Blvd has expertise in combining database software and low-latency SSDs to accelerate point-query-dominated workloads. We will post some of our data in upcoming tech notes.
Sticking to the following software versions will ensure that all the commands in this article work in your environment:
Linux OS: Ubuntu 18.10 Desktop for both local and remote machine
Linux kernel: 5.0.9
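For completeness, here is a hedged sketch of attaching the remote SSD on the benchmark machine with nvme-cli once the target exports it over RDMA. The IP address, port, and NQN below are placeholders for illustration, not values from our lab setup:

```shell
# On the initiator (benchmark) machine: load the RDMA transport and
# connect to the NVMe-oF target. Address, port, and NQN are placeholders.
modprobe nvme-rdma
nvme discover -t rdma -a 192.168.1.10 -s 4420
nvme connect -t rdma -a 192.168.1.10 -s 4420 -n nqn.2019-05.io.example:nvmf-target
nvme list   # the remote namespace now appears as a local /dev/nvmeXnY device
```

After the connect, the remote SSD is a regular block device, so the RocksDB benchmark runs against it exactly as it does against the local drive.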
Contact email@example.com for additional information.