# creating a random subset of a table with an average number of counts per keys

I have a database with 1 billion `key` `val` pairs with 20 million unique `key`s. On average, each `key` is associated with 50 `val`s.

```key  val
key1 val1
key1 val2
key1 val3
key2 val2
key2 val7
.
.
.
```

I ran the following and got the standard deviation of the number of `val`s per each unique `key`.

```select avg(cnt), stddev(cnt)
from (select count(key) as cnt, key
from original_db)
```

This gives avg(cnt) = 50 and stddev(cnt)=137

I would like to create a subset of `key`s from this table such that the avg(cnt) of the subset is 100. This means that on average, each unique key in the subset table is associated with an average of ~ 100 values.

You can aggregate and use a cumulative average to calculate a running average:

```select key
from (select key, count(*) as cnt,
avg(count(*)) over (order by cnt desc, key) as running_avg
from t
) t
where running_avg >= 100;
```

In other words, this takes all the keys have have 100+ values and then keeps taking a smaller number while the cumulative average is 100 or over.

Do note that this could return no keys, if no keys have 100 values.