Overview
We continue from the previous technote on the collaboration between Coinplug and Circuit Blvd. We collect transient transactions-per-second (TPS) numbers while a key-value store is filled from empty up to N key-value pairs. The workload mixes small random writes and small random reads. We measure combined read-write TPS to find which parameters contribute most to overall performance.
System parameters for the benchmarks
Server - local x86 server and AWS t3.xlarge EC2 instance (with similar CPU and DRAM specs)
Key-value store database - RocksDB and GriffinDB
Local server SSD - Intel Optane and Samsung TLC
AWS SSD - gp2 (general tier) and io1 (provisioned tier)
Benchmark workload
The workload keeps writing user key-value pairs while randomly selecting and reading pairs that were written previously. The pseudocode of the workload follows. On each loop iteration, bulk-kv-put() generates kv_per_loop key-value pairs in batches of batch_size, which are pushed down to the storage device as small random writes. The function kv-get() executes small random reads over the data that has already been written.
loop_count=50
kv_per_loop=1000000
batch_size=100
start_seq=1
for i in [1..loop_count]
    end_seq = start_seq + kv_per_loop - 1
    bulk-kv-put(start_seq, end_seq, batch_size)
    kv-get(1, end_seq, kv_per_loop)
    start_seq += kv_per_loop
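The pseudocode above can be sketched as runnable Python. This is only an illustration under assumptions: the store is any dict-like object rather than RocksDB or GriffinDB, and make_key() and make_value() are hypothetical helpers (a SHA-256 digest as a 32-byte key and a fixed-size dummy value), since the technote does not say how keys and values were generated.

```python
import hashlib
import random


def make_key(seq):
    # Hypothetical key generator: 32-byte hash of the sequence number.
    return hashlib.sha256(str(seq).encode()).digest()


def make_value(seq):
    # Hypothetical value generator: small fixed-size payload.
    return seq.to_bytes(8, "big") * 8


def run_workload(store, loop_count=50, kv_per_loop=1_000_000, batch_size=100, seed=0):
    """Run the mixed write/read loop against a dict-like store; return read hits."""
    rng = random.Random(seed)
    start_seq = 1
    read_hits = 0
    for _ in range(loop_count):
        end_seq = start_seq + kv_per_loop - 1
        # bulk-kv-put: write sequence numbers start_seq..end_seq in batches.
        for batch_start in range(start_seq, end_seq + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, end_seq)
            batch = {make_key(s): make_value(s) for s in range(batch_start, batch_end + 1)}
            store.update(batch)
        # kv-get: kv_per_loop random reads over everything written so far.
        for _ in range(kv_per_loop):
            seq = rng.randint(1, end_seq)
            if store.get(make_key(seq)) is not None:
                read_hits += 1
        start_seq += kv_per_loop
    return read_hits
```

Running it with small parameters, for example run_workload({}, loop_count=3, kv_per_loop=100, batch_size=10), exercises the same access pattern at a size that finishes instantly. Because the keys are hashes of the sequence numbers, sequential writes still land on effectively random keys.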
TPS graphs
(1) GriffinDB vs. RocksDB on the local server
First, we compare GriffinDB and RocksDB on the local test server. The x-axis represents the key-value sequence numbers explained in the previous section. GriffinDB outperforms RocksDB by roughly 2x. The secondary y-axis displays system buffer cache consumption. RocksDB's TPS drops once the system buffer cache is used up. GriffinDB's buffer cache usage decreases because it stores its index in main memory, which is shared with the buffer cache.
(2) GriffinDB vs. RocksDB on the AWS server
On the AWS server, we test two SSD types. AWS gp2 is a general-purpose SSD that is charged by capacity only. AWS io1 is a performance-tier SSD whose price includes an IOPS target. We use 200GB capacity for both. According to AWS policy, the IOPS:GB ratio is 3:1 for gp2 and 50:1 for io1. This translates to 600 IOPS for gp2 and 10,000 IOPS for io1. The following graph shows a big difference between gp2 and io1 after buffer cache usage is saturated. This is expected since gp2's IOPS is very low. Because gp2's slow performance lengthened the test time, we decided to cut that test short. GriffinDB did not perform better than RocksDB on gp2. On io1, GriffinDB shows better performance after buffer cache usage is saturated. However, TPS numbers were less than half of those on the local server, even though we matched the EC2 instance's CPU and DRAM configuration to that of the local server. We suspect that this is due to the difference in SSD performance (io1 vs. local SSD) and to AWS EC2's VM and other overheads.
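The IOPS figures above follow directly from the quoted ratios and the 200GB volume size. A minimal sketch of the arithmetic, where the function name is hypothetical and the ratios are simply the ones stated in the text:

```python
def volume_iops(capacity_gb, iops_per_gb):
    """IOPS implied by a capacity and an IOPS:GB ratio."""
    return capacity_gb * iops_per_gb


CAPACITY_GB = 200

gp2_iops = volume_iops(CAPACITY_GB, 3)    # 3:1 ratio quoted for gp2
io1_iops = volume_iops(CAPACITY_GB, 50)   # 50:1 ratio quoted for io1

# How many times more IOPS io1 offers than gp2 at the same capacity.
io1_to_gp2 = io1_iops / gp2_iops
```

At equal capacity, io1 provides roughly 16.7 times the IOPS of gp2, which is consistent with the large gap seen once the buffer cache no longer absorbs the reads.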
(3) Different SSD types on the local server
After observing the impact of SSD performance in the AWS benchmarks, we ran additional tests over different SSD types in the local server environment. As you can see in the graph below, RocksDB does not show a meaningful TPS difference between the Intel Optane SSD and the Samsung NVMe TLC NAND SSD. The drives' raw random read IOPS numbers are quite different, though: 550K for Intel Optane versus 14K for the Samsung 960 Pro. GriffinDB results, on the other hand, show a meaningful difference between Optane and TLC NAND SSD. Noticing this particular result, we ran GriffinDB on main memory using the tmpfs file system. The result shows distinctly better TPS than what was obtained on the Intel Optane SSD. But we were also surprised that the difference was relatively small, smaller than the difference between the Intel Optane and Samsung TLC SSD curves. We concluded that GriffinDB's higher performance exposes raw SSD latency differences while RocksDB does not. Note that this test required almost double the number of key-value pairs to confirm the TPS difference trends.
(4) Amplification factors
What are the major factors that contribute to the TPS differences between GriffinDB and RocksDB? We believe the number one factor is the read and write amplification factor (RAF and WAF for short). As we explained in the previous technote, the internal storage architectures of both Metadium and Ethereum suffer a lot from the read/write amplification of the Merkle Patricia Trie (MPT) and of RocksDB/LevelDB. GriffinDB greatly reduces both RAF and WAF by using in-memory key index tables and append-only value logs. In the following graphs, you can see that GriffinDB's RAF and WAF are more than 3x and 20x better, respectively, than RocksDB's in our benchmark tests. Note that changes to the MPT-based key-value store would reduce RAF and WAF further, but this was not attempted in this benchmark work.
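To make the idea of an in-memory key index over an append-only value log concrete, here is a toy sketch. It is not GriffinDB's implementation: the class and its counters are hypothetical, the "device" is a bytearray, and amplification is computed with the usual definition (bytes moved to or from the device divided by bytes the user asked to write or read). Index persistence, block-granular I/O, and garbage collection are all left out.

```python
class ToyLogStore:
    """Toy key-value store: in-memory index plus an append-only value log."""

    def __init__(self):
        self.log = bytearray()   # stands in for the on-device value log
        self.index = {}          # key -> (offset, length), held in memory
        self.user_written = 0
        self.device_written = 0
        self.user_read = 0
        self.device_read = 0

    def put(self, key, value):
        # Values are only ever appended; the index points at the newest copy.
        offset = len(self.log)
        self.log += value
        self.index[key] = (offset, len(value))
        self.user_written += len(value)
        self.device_written += len(value)

    def get(self, key):
        entry = self.index.get(key)
        if entry is None:
            return None          # miss is answered from memory, no device read
        offset, length = entry
        self.user_read += length
        self.device_read += length
        return bytes(self.log[offset:offset + length])

    def waf(self):
        return self.device_written / self.user_written

    def raf(self):
        return self.device_read / self.user_read
```

In this idealized toy each value is written once and read with a single access, so both factors come out as 1.0. An LSM-tree store such as RocksDB rewrites data during compaction and may consult several levels per read, which is where its higher amplification comes from.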
Discussions
(1) Metadium workload characteristics
Let's look more closely at the Metadium/Ethereum storage workload characteristics. The workload consists of random writes and random reads. But what are the actual data sizes? We know that most keys are 32 bytes long (256-bit cryptographic hash keys). The following table shows instrumented data for RLP-encoded MPT node sizes. One can see that 67-byte nodes account for more than half of the total, and that 36-, 67-, and 83-byte nodes together contribute more than 75%. Also, the largest node is only 532 bytes, far smaller than the usual 4K I/O request size to storage devices.
We initially thought that part of GriffinDB's higher TPS came from the temporal locality of how MPT nodes are created. As GriffinDB stores values in append-only mode, adjacent values on the storage device should have strong temporal locality. We collected statistics on GriffinDB writes but did not find strong temporal locality. The following table shows statistics about this temporal locality. We measured how temporally related consecutive reads to GriffinDB are. If two consecutive reads are within DISTANCE of each other on the value log, we count them as an adjacent instance. DISTANCE=0 means that consecutive reads are adjacent only when their values were written next to each other. DISTANCE=4KB means that consecutive reads are regarded as adjacent when their byte distance on the value log is less than 4KB. We see that under 6% of GriffinDB reads are adjacent. This shows that the expected temporal locality had only a marginal impact on the overall TPS. Temporal locality could become stronger as MPTs get more populated. But the 256-bit cryptographic hash key space is so sparse that it would take a huge number of data entries for this to show up.
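The adjacency measurement described above can be sketched as follows. This is an assumption-laden illustration, not the instrumentation actually used: each read is represented as a hypothetical (offset, length) pair on the value log, the gap is the number of bytes between the two records, and a pair counts as adjacent when the gap is at most DISTANCE (the text says "less than" for 4KB; the boundary choice here is ours).

```python
def gap_bytes(a, b):
    """Bytes separating two value-log records given as (offset, length)."""
    (off_a, len_a), (off_b, len_b) = a, b
    if off_a <= off_b:
        return off_b - (off_a + len_a)
    return off_a - (off_b + len_b)


def adjacent_fraction(reads, distance):
    """Fraction of consecutive read pairs whose gap is at most `distance`."""
    pairs = list(zip(reads, reads[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if gap_bytes(a, b) <= distance)
    return hits / len(pairs)


# Five reads in the order they were issued; sizes mimic small MPT nodes.
reads = [(0, 67), (67, 36), (100000, 83), (103, 67), (1000, 67)]
frac_0 = adjacent_fraction(reads, 0)        # only back-to-back records count
frac_4k = adjacent_fraction(reads, 4096)    # records within 4KB count
```

In this made-up trace one of four pairs is adjacent at DISTANCE=0 and two of four at DISTANCE=4KB. Re-reading the same record yields a negative gap, so it is counted as adjacent under this definition.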
(2) GriffinDB features for additional optimizations
We implemented additional GriffinDB features to optimize for Metadium/Ethereum workload characteristics. One effort is to leverage the sparseness and the uniformity of the 256-bit cryptographic hash key distribution. As the distribution is uniform, in-memory hash buckets maintain almost the same fill rate when each bucket is large enough. This enables a simple but efficient hash bucket implementation. As the distribution is sparse, we do not need to store the full 256 bits of each key in the in-memory hash table indices. Storing partial keys saves a lot of main memory, while key collisions happen only with very low probability in most practical situations.
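The partial-key idea can be illustrated with a small sketch. Everything here is hypothetical rather than GriffinDB's actual layout: the bucket is chosen from the leading bits of the key, only the trailing FINGERPRINT_BYTES of the key are kept in memory, and the constants are arbitrary. A real store would still need to verify the full key against the value log on lookup to resolve the rare collision.

```python
import hashlib

NUM_BUCKET_BITS = 10     # hypothetical: 1024 buckets
FINGERPRINT_BYTES = 8    # hypothetical: keep 8 of the 32 key bytes


def bucket_of(key):
    # Uniform hash keys mean the leading bits spread entries evenly.
    return int.from_bytes(key[:4], "big") >> (32 - NUM_BUCKET_BITS)


def fingerprint_of(key):
    return key[-FINGERPRINT_BYTES:]


class PartialKeyIndex:
    """In-memory index mapping truncated keys to value-log offsets."""

    def __init__(self):
        self.buckets = [[] for _ in range(1 << NUM_BUCKET_BITS)]

    def insert(self, key, offset):
        bucket = self.buckets[bucket_of(key)]
        fp = fingerprint_of(key)
        for i, (stored_fp, _) in enumerate(bucket):
            if stored_fp == fp:
                bucket[i] = (fp, offset)   # overwrite with the newest offset
                return
        bucket.append((fp, offset))

    def lookup(self, key):
        fp = fingerprint_of(key)
        for stored_fp, offset in self.buckets[bucket_of(key)]:
            if stored_fp == fp:
                return offset   # caller should confirm the full key on disk
        return None
```

With these constants each entry stores 8 key bytes instead of 32, and two distinct keys collide only if they agree on all 74 bits used (10 bucket bits plus 64 fingerprint bits), which is extremely unlikely for uniformly distributed hash keys.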
Summary
Metadium shows a huge improvement over Ethereum in terms of TPS. One of the big contributors was replacing LevelDB with RocksDB. Building on that work, we obtained even higher TPS by leveraging GriffinDB. Our prototype implementation of GriffinDB performed well in both on-prem and cloud environments over different types of storage devices. It exposed raw device latency differences that RocksDB did not. Circuit Blvd is developing OCSSD-based storage systems that would enhance performance even further. For instance, the TPS numbers presented in these technotes are average values; QoS latencies would be greatly improved on OCSSD-based systems.
Questions?
Contact info@circuitblvd.com for additional information.