[Part-1] Achieving 70% lossless compression for Redis Writes
Combining Kryo serialization, Lettuce's DEFLATE compression, and a change to the stored data structure can give developers impressive memory savings!
This post sat in my drafts list for the last two years. However, a recent use case compelled me to revisit the most optimal way of compressing data before storing it in Redis. Before going further, I need to give due credit to a blog post from DoorDash that triggered the idea of improving compression: https://doordash.engineering/2019/01/02/speeding-up-redis-with-compression/
Dataset of Embeddings
Before getting into the actual exercise, let us define the single row that we will work on. For clarity, this row comes from an m x n matrix of embeddings whose rows are of type A and columns are of type B. In classic examples like matrix factorization used for recommendations, you can think of A as users and B as items, with the value in each cell being a double. Our goal is to store this m x n dataset in Redis.
For the sake of brevity, however, we will consider just one row and work on it.
The following gist contains a sample row that we will use for this exercise:
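The gist itself is not embedded in this extract, but purely as an illustration, such a row can look roughly like the JSON below; the field names other than _1 and data, and all of the values, are made up for this example.

```json
{
  "id": "row-42",
  "data": [
    { "_1": "item-001", "_2": 0.8231 },
    { "_1": "item-002", "_2": -0.1174 },
    { "_1": "item-003", "_2": 0.5092 }
  ]
}
```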
Here’s a list of tasks we will perform on this JSON object before storing it in Redis:
- Try changing the data structure of the data field.
- Try serializing the data field in a couple of different ways.
- Use a custom codec if #2 is possible.
We will use Java/Scala for this exercise. A Python equivalent is not covered here, because the stress tests were conducted on the JVM, with memory profiling done against a JVM-compatible serializer. If you are using Python, you may want to refer to this excellent article: https://itecnote.com/tecnote/python-which-is-the-best-way-to-compress-json-to-store-in-a-memory-based-store-like-redis-or-memcache/
Step 1: Changing the data structure of the data field
In this case, instead of storing the entire dictionary as is, I decided to use an auxiliary data structure called embeddingMap, where I extracted the _1 field into an index map.
This step split the big JSON blob into two auxiliary data structures:
- An embeddingMap that stores the _1 field with its index, in a Map[String, Integer]
- A list of Double whose order matches the indices stored in the embeddingMap
Creating these two datasets by itself reduced the memory footprint by more than 50%. The embeddingMap can be stored in DynamoDB with DAX enabled and fetched at runtime to map the embeddings back as required (we won't get into this part for now). In the subsequent sections, we will focus on how efficiently we can store the List[Double] in Redis.
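The gist for this step is not embedded in this extract either, but the transformation is simple enough to sketch. Here is a minimal illustration in Java, assuming the row arrives as an ordered map of _1 keys to embedding values; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: split one embedding row into the two auxiliary structures
// described above - an index map for the _1 keys and an ordered list of values.
public class EmbeddingRowSplitter {

    // Maps each _1 key to its position, preserving the original order.
    public static Map<String, Integer> buildEmbeddingMap(LinkedHashMap<String, Double> row) {
        Map<String, Integer> embeddingMap = new LinkedHashMap<>();
        int index = 0;
        for (String key : row.keySet()) {
            embeddingMap.put(key, index++);
        }
        return embeddingMap;
    }

    // The embedding values, in exactly the same order as the indices above.
    public static List<Double> buildValues(LinkedHashMap<String, Double> row) {
        return new ArrayList<>(row.values());
    }
}
```

Only the list of values needs to go to Redis; the embeddingMap lives in the lookup store (DynamoDB in my case).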
Step 2: Try to find the best serializer
Redis stores everything as byte[]. Therefore, we need to find the most optimal serializer to turn our data into bytes before storing it in Redis. We will focus on writing this List[Float] to Redis (in reality, the length of this embedding list can be on the scale of 1000's). A rough sketch of a comparison harness follows, and the memory usage measured for each approach is shown after it.
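The exact harness is not shown in the post, so the following is only a plausible sketch of how the three byte[] payloads (Kryo, Jackson, plain GZIP) can be produced before being written to Redis. Jackson's ObjectMapper as the JSON serializer, the buffer sizes, and gzipping the JSON form are assumptions on my part (jackson-databind would be an extra dependency).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.zip.GZIPOutputStream;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Output;
import com.fasterxml.jackson.databind.ObjectMapper;

// Plausible sketch of producing the three payloads that were written to Redis
// as raw bytes and then measured with MEMORY USAGE.
public class SerializerComparison {

    // Kryo: compact binary serialization of the list.
    static byte[] kryoBytes(List<Float> row) {
        Kryo kryo = new Kryo();
        kryo.setRegistrationRequired(false);
        try (Output output = new Output(4096, -1)) { // 4 KB initial buffer, grows as needed
            kryo.writeClassAndObject(output, row);
            return output.toBytes();
        }
    }

    // Jackson: plain JSON bytes.
    static byte[] jacksonBytes(List<Float> row) throws IOException {
        return new ObjectMapper().writeValueAsBytes(row);
    }

    // Plain GZIP: here the JSON form is gzipped (the post does not spell out the input).
    static byte[] gzipBytes(List<Float> row) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(jacksonBytes(row));
        }
        return out.toByteArray();
    }
}
```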
Without compression:

```
// Using Kryo
127.0.0.1:30001> memory usage a-test-key-a
(integer) 16952

// Using Jackson
127.0.0.1:30001> memory usage a-test-key-b
-> Redirected to slot [8404] located at 127.0.0.1:30002
(integer) 65592

// Using plain GZIP
127.0.0.1:30002> memory usage a-test-key-c
-> Redirected to slot [12533] located at 127.0.0.1:30003
(integer) 15928
```
With compression enabled (using CompressionCodec.valueCompressor(codec, CompressionCodec.CompressionType.DEFLATE)):
```
// Using Kryo
127.0.0.1:30003> memory usage a-test-key-a
-> Redirected to slot [4279] located at 127.0.0.1:30001
(integer) 11320

// Using Jackson
127.0.0.1:30001> memory usage a-test-key-b
-> Redirected to slot [8404] located at 127.0.0.1:30002
(integer) 15928

// Using plain GZIP
127.0.0.1:30002> memory usage a-test-key-c
-> Redirected to slot [12533] located at 127.0.0.1:30003
(integer) 15928
```
As seen in the snippets above, Kryo stood out as the winner, with roughly 30% less memory after compression. I was quite fascinated by this result and decided to write my own implementation of a RedisCodec<T, U>.
Step 3: Building a custom codec for creating a Redis connection to AWS ElastiCache
After investigating Kryo more deeply, I noticed that its developers did not make it inherently thread-safe. What that means is that if a Kryo instance is being used for serialization by one thread, another thread using the same instance can cause buffer overflows/leaks, resulting in garbage bytes in the output. A deeper discussion of Kryo's thread-safety can be found here: https://github.com/EsotericSoftware/kryo/issues/188
In my case, I decided to instead use an ObjectPool of Kryo instances. The rest of this article focuses on building a custom RedisCodec<T, U> that uses the Kryo serializer for writing bytes to Redis.
Before following along with the code, add this dependency to your build.gradle or pom.xml:
implementation("com.esotericsoftware.kryo:kryo5:5.0.0-RC6")
Step 3.1: A utility for creating a Kryo Pool
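The original gist is not embedded in this extract, so below is a minimal sketch of what such a utility can look like, built on Kryo 5's own com.esotericsoftware.kryo.util.Pool. The class name and pool sizing are illustrative, and the package names assume the standard kryo artifact (the relocated kryo5 artifact prefixes them differently).

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.util.Pool;

// Builds a thread-safe pool of Kryo instances so that no two threads ever
// share the same Kryo instance mid-serialization.
public final class KryoPoolFactory {

    private KryoPoolFactory() {
    }

    public static Pool<Kryo> newPool(int maxCapacity) {
        // threadSafe = true, softReferences = true
        return new Pool<Kryo>(true, true, maxCapacity) {
            @Override
            protected Kryo create() {
                Kryo kryo = new Kryo();
                // Our payloads are plain collections, so skip explicit class registration.
                kryo.setRegistrationRequired(false);
                return kryo;
            }
        };
    }
}
```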
Step 3.2: A KryoProvider
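A minimal sketch of the provider contract, assuming all the layers above need are "serialize to bytes" and "deserialize from bytes" (the method names are mine):

```java
// Abstraction over "give me Kryo-backed (de)serialization" so that the Serde layer
// does not need to know whether instances are pooled, thread-local, or created per call.
public interface KryoProvider {

    // Serializes the given object graph into a byte array.
    byte[] serialize(Object value);

    // Deserializes a byte array written by serialize() back into an object.
    Object deserialize(byte[] bytes);
}
```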
Step 3.3: Create an implementation of KryoProvider
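Here is a sketch of a pooled implementation, combining the pool utility from Step 3.1 with Kryo's Output/Input streams; the buffer sizes, pool capacity, and class name are assumptions:

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import com.esotericsoftware.kryo.util.Pool;

// Borrows a Kryo instance from the pool for every call and always returns it,
// so serialization stays thread-safe without creating a new Kryo each time.
public class PooledKryoProvider implements KryoProvider {

    private final Pool<Kryo> pool = KryoPoolFactory.newPool(16);

    @Override
    public byte[] serialize(Object value) {
        Kryo kryo = pool.obtain();
        try (Output output = new Output(4096, -1)) { // 4 KB initial buffer, grows as needed
            kryo.writeClassAndObject(output, value);
            return output.toBytes();
        } finally {
            pool.free(kryo);
        }
    }

    @Override
    public Object deserialize(byte[] bytes) {
        Kryo kryo = pool.obtain();
        try (Input input = new Input(bytes)) {
            return kryo.readClassAndObject(input);
        } finally {
            pool.free(kryo);
        }
    }
}
```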
Step 3.4: Creating a custom Serde<T, U>
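The Serde contract itself is not shown in this extract; here is a minimal sketch, under the assumption that the two type parameters stand for the domain type T and the serialized form U (the method names are mine):

```java
// A generic Serde contract: T is the domain type, U is its serialized form
// (byte[] in our case, since Redis stores bytes).
public interface Serde<T, U> {

    U serialize(T value);

    T deserialize(U serialized);
}
```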
Now, using the Kryo provider created in steps 3.2 and 3.3, we have an AbstractSerde<T, U> that concrete Serdes can extend directly:
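A sketch of what that base class can look like. For simplicity I pin the serialized form to byte[] (since that is ultimately what goes into Redis), so only one type parameter remains in this sketch; the post's AbstractSerde<T, U> presumably keeps both generic.

```java
// Base Serde that funnels everything through a KryoProvider. The serialized form
// is pinned to byte[] here, so concrete Serdes only need to name their domain type.
public abstract class AbstractSerde<T> implements Serde<T, byte[]> {

    private final KryoProvider kryoProvider;

    protected AbstractSerde(KryoProvider kryoProvider) {
        this.kryoProvider = kryoProvider;
    }

    @Override
    public byte[] serialize(T value) {
        return kryoProvider.serialize(value);
    }

    @Override
    @SuppressWarnings("unchecked")
    public T deserialize(byte[] serialized) {
        return (T) kryoProvider.deserialize(serialized);
    }
}
```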
Now, using this AbstractSerde<T, U>, we can create a concrete implementation for our List<Float>:
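With the generic plumbing in place, the concrete Serde is tiny; a sketch:

```java
import java.util.List;

// Concrete Serde for our embedding row: List<Float> <-> byte[].
public class FloatSerde extends AbstractSerde<List<Float>> {

    public FloatSerde(KryoProvider kryoProvider) {
        super(kryoProvider);
    }
}
```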
Step 3.5: Use your FloatSerde in Lettuce's RedisCodec<String, List<Float>>
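Lettuce's codec contract is the RedisCodec<K, V> interface with decodeKey/decodeValue/encodeKey/encodeValue. Here is a sketch of a codec that keeps keys as UTF-8 strings and serializes values with the FloatSerde above (the class and field names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

import io.lettuce.core.codec.RedisCodec;

// Lettuce codec that stores keys as plain UTF-8 strings and values as
// Kryo-serialized List<Float> payloads produced by the FloatSerde above.
public class FloatWithKryoRedisCodec implements RedisCodec<String, List<Float>> {

    private final FloatSerde serde;

    public FloatWithKryoRedisCodec(FloatSerde serde) {
        this.serde = serde;
    }

    @Override
    public String decodeKey(ByteBuffer bytes) {
        return StandardCharsets.UTF_8.decode(bytes).toString();
    }

    @Override
    public List<Float> decodeValue(ByteBuffer bytes) {
        byte[] payload = new byte[bytes.remaining()];
        bytes.get(payload);
        return serde.deserialize(payload);
    }

    @Override
    public ByteBuffer encodeKey(String key) {
        return StandardCharsets.UTF_8.encode(key);
    }

    @Override
    public ByteBuffer encodeValue(List<Float> value) {
        return ByteBuffer.wrap(serde.serialize(value));
    }
}
```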
Phew!! Well, that was a lot of code. Now you can use the FloatWithKryoRedisCodec to initialize a connection using the lettuce 6.0.x library.
How to use this codec is clearly documented on Lettuce's official GitHub wiki: https://github.com/lettuce-io/lettuce-core/wiki/Codecs
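For completeness, here is a hedged end-to-end sketch under a few assumptions: a placeholder localhost endpoint instead of a real ElastiCache URI, the plain RedisClient rather than the cluster client, and the codec wrapped with the same DEFLATE compressor used in the measurements above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.codec.CompressionCodec;
import io.lettuce.core.codec.RedisCodec;

public class EmbeddingWriteExample {

    public static void main(String[] args) {
        // Kryo-backed codec, wrapped so that values are DEFLATE-compressed on write.
        RedisCodec<String, List<Float>> codec = CompressionCodec.valueCompressor(
                new FloatWithKryoRedisCodec(new FloatSerde(new PooledKryoProvider())),
                CompressionCodec.CompressionType.DEFLATE);

        RedisClient client = RedisClient.create("redis://localhost:6379"); // placeholder endpoint
        try (StatefulRedisConnection<String, List<Float>> connection = client.connect(codec)) {
            List<Float> embeddings = new ArrayList<>(Arrays.asList(0.12f, -0.87f, 1.04f));
            connection.sync().set("a-test-key-a", embeddings);
            System.out.println(connection.sync().get("a-test-key-a"));
        } finally {
            client.shutdown();
        }
    }
}
```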
That’s all for today. In Part 2 of this blog post, I will focus on reviewing and adding the results of stress and load tests to confirm that the extra serialization effort by Kryo has no bearing on read latency (I have tested this, but haven't had time to write it up yet. Stay tuned for just a week!)
What do you think? Can you try implementing this? Would love to hear from you about the usage!