I have a database with 1 billion `key`

`val`

pairs with 20 million unique `key`

s. On average, each `key`

is associated with 50 `val`

s.

key val key1 val1 key1 val2 key1 val3 key2 val2 key2 val7 . . .

I ran the following and got the standard deviation of the number of `val`

s per each unique `key`

.

select avg(cnt), stddev(cnt) from (select count(key) as cnt, key from original_db)

This gives avg(cnt) = 50 and stddev(cnt)=137

I would like to create a subset of `key`

s from this table such that the avg(cnt) of the subset is 100. This means that on average, each unique key in the subset table is associated with an average of ~ 100 values.

## Answer

You can aggregate and use a cumulative average to calculate a running average:

select key from (select key, count(*) as cnt, avg(count(*)) over (order by cnt desc, key) as running_avg from t ) t where running_avg >= 100;

In other words, this takes all the keys have have 100+ values and then keeps taking a smaller number while the cumulative average is 100 or over.

Do note that this could return no keys, if no keys have 100 values.