Great questions! > what is the volume your system is operating at? This varies, ...

scaleout1 · on Oct 29, 2018

Thanks for the reply, got a few more questions for you :-)

Let's say you are counting distinct IPs used by `users` using HLL. Let's say you start getting DDoSed by certain users. Since I am assuming you are not doing a shuffle before writing to FDB, you will be locking the user, reading the HLL, deserializing, merging, and writing back to FDB from multiple machines, which will result in a lot of rejected transactions and retries. My question is whether the retries unwind fast enough, or whether you will end up dropping data on the floor once you exhaust the retry count.
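To make the concern concrete, here is a minimal sketch of a read-merge-write loop with a bounded retry count. It is only an illustration: `VersionedStore`, `merge_with_retry`, and the `interfere` hook are hypothetical names, the store is an in-memory stand-in rather than the FDB API, and a plain set stands in for the HLL.

```python
class VersionedStore:
    """In-memory stand-in for a store with optimistic concurrency (not the FDB API)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys read as version 0 with an empty aggregate.
        return self._data.get(key, (0, frozenset()))

    def commit(self, key, read_version, value):
        current_version, _ = self._data.get(key, (0, frozenset()))
        if current_version != read_version:
            return False  # another writer committed since our read
        self._data[key] = (current_version + 1, value)
        return True


def merge_with_retry(store, key, partial, max_retries=3, interfere=None):
    """Read-merge-write; returns attempts used, or None if retries were exhausted."""
    for attempt in range(1, max_retries + 1):
        version, value = store.read(key)
        merged = value | partial  # a set union stands in for an HLL merge
        if interfere is not None:
            interfere(attempt)  # hook that simulates a concurrent writer
        if store.commit(key, version, merged):
            return attempt
    return None  # the partial aggregate was never written


store = VersionedStore()


def one_conflict(attempt):
    # A competing writer sneaks in during the first attempt only.
    if attempt == 1:
        v, val = store.read("user:42")
        store.commit("user:42", v, val | {"10.0.0.7"})


def always_conflict(attempt):
    # A competing writer wins on every attempt.
    v, val = store.read("user:9")
    store.commit("user:9", v, val | {f"10.0.1.{attempt}"})


attempts = merge_with_retry(store, "user:42", frozenset({"10.0.0.1"}), interfere=one_conflict)
dropped = merge_with_retry(store, "user:9", frozenset({"10.0.0.1"}), interfere=always_conflict)
```

In this toy setup a single conflict costs one extra attempt, while a key that is contended on every attempt exhausts the retry budget and the partial aggregate is lost, which is the scenario the question is about.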

monstrado · on Oct 30, 2018

Turns out we are doing a shuffle :) - We're using Apache Flink for the aggregation step (5 second window) which performs a merge on key before writing the value out. So at the end of the day, we would only read/deserialize/merge/write once every 5 seconds, that is of course assuming we received data for the HLL aggregation.
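As a rough illustration of why the merge-on-key step reduces write pressure, here is a sketch in plain Python. It is not Flink code: `preaggregate` and the event tuples are hypothetical, and a set stands in for the HLL. The idea is that every event for the same key in the same 5 second window collapses into one partial aggregate, so there is at most one read/merge/write per key per window.

```python
from collections import defaultdict

WINDOW_SECONDS = 5  # window length mentioned above


def preaggregate(events):
    """Group (timestamp, user, ip) events into one partial aggregate per (window, user)."""
    partials = defaultdict(set)  # a set stands in for an HLL sketch
    for ts, user, ip in events:
        # Align the timestamp to the start of its tumbling window.
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        partials[(window_start, user)].add(ip)
    return dict(partials)


events = [
    (0.5, "alice", "10.0.0.1"),
    (1.2, "alice", "10.0.0.2"),
    (3.9, "alice", "10.0.0.1"),
    (4.1, "bob", "10.0.0.9"),
    (6.0, "alice", "10.0.0.3"),
]
partials = preaggregate(events)

# One downstream write per (window, user) instead of one per event.
writes_needed = len(partials)
```

Here five events turn into three writes, and a burst of traffic for one user inside a window still produces a single write for that user.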

However, due to the need for HA, we might run two or three clusters in different AZs, which means we might have a few servers writing a partial aggregation to the same row. That is where the awesomeness of FDB plays a role.
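One reason several writers can safely target the same row is that an HLL merge is an element-wise max over registers, so it is commutative and idempotent. The sketch below shows just that property with tiny made-up register arrays; `merge_registers` is a hypothetical helper, not part of any HLL library, and it says nothing about how the transactions themselves are ordered.

```python
def merge_registers(a, b):
    """Merge two HLL register arrays of equal length by element-wise max."""
    if len(a) != len(b):
        raise ValueError("register arrays must have the same length")
    return [max(x, y) for x, y in zip(a, b)]


stored = [1, 0, 0, 0]  # value currently in the row (made-up numbers)
az1 = [0, 3, 1, 0]     # partial aggregate from one cluster
az2 = [2, 1, 1, 4]     # partial aggregate from another cluster

# The final registers do not depend on which cluster writes first.
order_a = merge_registers(merge_registers(stored, az1), az2)
order_b = merge_registers(merge_registers(stored, az2), az1)

# Re-applying a partial that was already merged changes nothing.
replayed = merge_registers(order_a, az1)
```

Whatever order the partial aggregations arrive in, and even if one is applied twice, the stored registers end up the same, as long as each read-merge-write is applied atomically.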

That being said, our P99 latency writing to FDB is typically very low (a few ms). We're usually doing 4,000 to 5,000 transactions a second at any given time.