creating a random subset of a table with an average number of counts per keys

I have a database with 1 billion key val pairs with 20 million unique keys. On average, each key is associated with 50 vals.

key  val
key1 val1
key1 val2
key1 val3
key2 val2
key2 val7

I ran the following and got the standard deviation of the number of vals per each unique key.

select avg(cnt), stddev(cnt)
  from (select count(key) as cnt, key
        from original_db)

This gives avg(cnt) = 50 and stddev(cnt)=137

I would like to create a subset of keys from this table such that the avg(cnt) of the subset is 100. This means that on average, each unique key in the subset table is associated with an average of ~ 100 values.


You can aggregate and use a cumulative average to calculate a running average:

select key
from (select key, count(*) as cnt,
             avg(count(*)) over (order by cnt desc, key) as running_avg
      from t
     ) t
where running_avg >= 100;

In other words, this takes all the keys have have 100+ values and then keeps taking a smaller number while the cumulative average is 100 or over.

Do note that this could return no keys, if no keys have 100 values.