I am interested in fitting a 2-component Gaussian Mixture Model to the data shown below. However, since what I am plotting here are log-transformed counts normalized to be between 0-1, the maximum value my data will ever take is 0. When I try a naive fit using sklearn.mixture.GaussianMixture (code below), I get the resulting fit, which is obviously not what I want.
from sklearn.mixture import GaussianMixture import numpy as np # start with some count data in (0,1] logged_counts = np.log(counts) model = GaussianMixture(2).fit(logged_counts.reshape(-1,1)) # plot resulting fit x_range = np.linspace(np.min(logged_counts), 0, 1000) pdf = np.exp(model.score_samples(x_range.reshape(-1, 1))) responsibilities = model.predict_proba(x_range.reshape(-1, 1)) pdf_individual = responsibilities * pdf[:, np.newaxis] plt.hist(logged_counts, bins='auto', density=True, histtype='stepfilled', alpha=0.5) plt.plot(x_range, pdf, '-k', label='Mixture') plt.plot(x_range, pdf_individual, '--k', label='Components') plt.legend() plt.show()
I would love it if I could fix the mean of the top component at 0 and only optimize the other mean, the two variances, and the mixing fractions. (Additionally I would love to be able to use a half-normal for the component on the right.) Is there a simple way to do this with built-in functions in python/sklearn, or will I have to build that model myself using some probabilistic programming language?
Afaik, you cannot do exactly what you want in sklearn.
Imho, basically there are multiple strategies: (i) implement GMM yourself, (ii) switch to another language/framework, (iii) adapt GMM code, or (iv) adapt.
(i) You probably do not want to do this unless you want to learn for yourself.
(ii) You could use stan and adapt the code in the last paragraph to have a fixed component of your choice (distribution type and parameters)
(iii) You could do (i) but slightly adapt the sklearn code or simply use the methods for estimation but with you own slight modifications.
- Gaussian Mixture model will not work here (as you mentioned) because you require a truncated Normal distribution for the “first” (fixed) component.
- If you would not require to fit for the variance of the fixed component then you can always just substract your fixed component from the data. (i.e. for each point subtract the point’s quantile-value from the point value)
- If you don’t mind the precision in the estimate, you can make two passed: First use GMM to identify both components. Then look only at the data from the component you want to be fixed. Fit a truncated gaussian model (use
.fit(data)). Then subtract the resulting parameters from your original data (like in option 2). And then fit a GMM. to find out the next component.
Hope this helps 🙂