I have a database with 1 billion
val pairs with 20 million unique
keys. On average, each
key is associated with 50
key val key1 val1 key1 val2 key1 val3 key2 val2 key2 val7 . . .
I ran the following and got the standard deviation of the number of
vals per each unique
select avg(cnt), stddev(cnt) from (select count(key) as cnt, key from original_db)
This gives avg(cnt) = 50 and stddev(cnt)=137
I would like to create a subset of
keys from this table such that the avg(cnt) of the subset is 100. This means that on average, each unique key in the subset table is associated with an average of ~ 100 values.
You can aggregate and use a cumulative average to calculate a running average:
select key from (select key, count(*) as cnt, avg(count(*)) over (order by cnt desc, key) as running_avg from t ) t where running_avg >= 100;
In other words, this takes all the keys have have 100+ values and then keeps taking a smaller number while the cumulative average is 100 or over.
Do note that this could return no keys, if no keys have 100 values.