The HyperLogLog Algorithm: How it works and why you will love it Thursday 12:50 Casablanca
Twitter: @byucesoy Blog: www.citusdata.com/blog/ LinkedIn: byucesoy Company website: citusdata.com
I'm Burak Yucesoy. I work at Citus Data as a software engineer. In the past, I spent time on a wide range of topics, from flying robots to machine learning. Now I'm working on distributed databases and PostgreSQL.
Mostly by developing extensions on top of PostgreSQL and attending conferences. I was at PostgresOpen SV, pgconf.eu, and PGDay Istanbul as either a speaker or an attendee. Apart from that, I'm part of the team that develops Citus, our open source PostgreSQL extension, which makes PostgreSQL a distributed database. I'm also the maintainer of the postgresql-hll extension.
I attended the previous pgconf.eu conference in Warsaw as a speaker. I enjoyed it a lot, thanks to the good selection of talks and the opportunity to meet new people from the PostgreSQL community.
I'll talk about the HyperLogLog algorithm for estimating COUNT(DISTINCT) queries. Estimating COUNT(DISTINCT) sounds like a very specific thing with a small application area, but when you think about it, you realize that almost any application needs to run some sort of COUNT(DISTINCT) query.
HyperLogLog can estimate cardinality very quickly and with a very small memory footprint, but more importantly, it does this elegantly, perhaps as elegantly as binary search. The idea behind HyperLogLog is so simple that I was amazed by its accuracy when I first saw the algorithm.
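To make that idea concrete, here is a minimal Python sketch of a HyperLogLog counter. It is only an illustration under stated assumptions, not the postgresql-hll implementation: the class name `HyperLogLog`, the choice of SHA-1 as the hash, and the default of 2^12 registers are all picked for this example. Each item is hashed, the first `p` bits of the hash select a register, and the register remembers the longest run of leading zeros seen in the remaining bits. The estimate is a bias-corrected harmonic mean over the registers, with a linear-counting fallback for small sets.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog counter (illustrative; names and defaults are assumptions)."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers; the alpha formula below needs p >= 7.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take a 64-bit hash from SHA-1 (chosen for determinism, not speed).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                  # first p bits pick a register
        rest = h & ((1 << width) - 1)     # remaining bits feed the rank
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias correction, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog()
for user_id in range(100_000):
    hll.add(user_id)
print(round(hll.count()))  # an estimate close to 100000, not the exact count
```

With 4096 registers the sketch stores only a few kilobytes of state regardless of how many items are added, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Adding the same item again never changes the registers, which is what makes the structure suitable for distinct counts.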
I'll talk about the HyperLogLog algorithm, its internals, and its real-world applications in production. Since the talk will be a self-contained and complete overview of the algorithm, from internals to real-world applications, I believe both application developers and algorithm enthusiasts can enjoy it.
Not much, but some knowledge about aggregations in PostgreSQL would be nice.
Improvements to partitioning and JIT.