I’m working with genetic data in which alleles were observed n times in t number of chromosomes sequenced. In other words, n successes in t trials.
I want to include an estimate of each allele’s frequency as a feature in a machine learning algorithm. I can of course get a point estimate with n/t, but I want to represent the confidence of that point estimate — i.e. something about the likelihood of that estimate.
Now, I believe the negative binomial (or just binomial) distribution would be the right one to use, but
- How can I estimate the parameters of the distribution in Python?
- What representation of the distribution would be ideal as a feature for classical (non-NN) machine learning? A conservative estimate might be the 95% CI upper bound, but how would I calculate that, and is there a better way to featurize the distribution than just taking that one value?
I suppose that all of the required information can be calculated by means of standard statistical methods, without applying machine learning.
The MLE of the parameter p of your binomial distribution Bin(t, p) is just n/t, as you correctly suggested. If you want a confidence interval instead of a point estimate, one way to get it is the normal approximation interval

$$\hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{t}}, \qquad \hat{p} = \frac{n}{t},$$

where z is the 1 − α/2 quantile of the standard normal distribution. Other intervals are available depending on your modelling assumptions; see: Binomial confidence intervals.
A 95% CI for p, centred on p̂ = n/t, can be calculated as indicated above with z = 1.96.
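As a rough illustration, here is a minimal Python sketch of the normal-approximation (Wald) interval, p̂ ± z·sqrt(p̂(1 − p̂)/t). The function names are made up for this example, and clipping the bounds to [0, 1] is my own choice. I have also added the Wilson score interval, one of the alternatives usually listed under binomial confidence intervals, because the Wald interval collapses to zero width when n = 0 or n = t, which is common for rare alleles.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(n, t, alpha=0.05):
    """Normal-approximation CI for p, given n successes in t trials."""
    p_hat = n / t
    # z is the (1 - alpha/2) quantile of the standard normal (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / t)
    # Clip to [0, 1]; this clipping is an assumption of the sketch
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def wilson_interval(n, t, alpha=0.05):
    """Wilson score CI; behaves better for small t or p near 0 or 1."""
    p_hat = n / t
    z = NormalDist().inv_cdf(1 - alpha / 2)
    denom = 1 + z ** 2 / t
    centre = (p_hat + z ** 2 / (2 * t)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / t + z ** 2 / (4 * t ** 2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    print(wald_interval(5, 100))
    print(wilson_interval(5, 100))
    print(wilson_interval(0, 100))  # non-degenerate even when n = 0
```

If SciPy or statsmodels is available, they ship their own interval routines, but the standard library is enough for these two.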
As for feature engineering for the machine learning algorithm: since your parametric distribution depends on only one estimated parameter, p (t is given), you can use p̂ directly as a feature that, together with t, fully identifies the distribution. You can of course also add the CI bounds or the variance as additional features. It all depends on what exactly you are trying to learn and what your final objective or criterion is.
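To make that concrete, one possible featurization per allele is sketched below. The helper name `allele_features` and the dictionary keys are hypothetical, and the variance used is the plug-in estimate p̂(1 − p̂)/t of the variance of p̂; treat it as a starting point rather than a recommended feature set.

```python
from math import sqrt
from statistics import NormalDist


def allele_features(n, t, alpha=0.05):
    """Build a small feature dict from n observations in t chromosomes.

    Hypothetical helper: the feature names are illustrative only.
    """
    p_hat = n / t                       # MLE of the allele frequency
    variance = p_hat * (1 - p_hat) / t  # plug-in variance of p_hat
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(variance)     # normal-approximation half-width
    return {
        "p_hat": p_hat,
        "variance": variance,
        "ci_low": max(0.0, p_hat - half_width),
        "ci_high": min(1.0, p_hat + half_width),
        "t": t,                         # sample size carries the confidence
    }


if __name__ == "__main__":
    # Same point estimate, very different uncertainty
    print(allele_features(5, 100))
    print(allele_features(500, 10000))
```

The two calls in the example share p̂ = 0.05 but differ in t, so the variance and CI features are what distinguish them for the downstream model.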