Keywords: supervised learning | semi-supervised learning | unsupervised learning | mixture model | gaussian mixture model | pymc3 | label-switching | identifiability | normal distribution | pymc3 potentials

We now study the learning of mixture models with MCMC. We have already done this in the case of the Zero-Inflated Poisson model; here we will stick to Gaussian mixture models.

%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import pymc3 as pm
import theano.tensor as tt

Supervised and Semi-Supervised Learning

In other notes, we used mixture models to categorize the various types of learning. The key difference was how we wrote the likelihoods. Let’s use those likelihoods to undertake sampling in supervised and semi-supervised ways; we have already seen unsupervised sampling.
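
Schematically, with mixture weights $\lambda$ and component parameters $\mu_k, \sigma_k$, the supervised likelihood uses the observed labels directly,

$$\prod_{i \in {\rm labeled}} p(z_i \vert \lambda)\, N(x_i \vert \mu_{z_i}, \sigma_{z_i}),$$

while the semi-supervised likelihood multiplies in a marginal term for each point whose label we have not observed:

$$\prod_{j \in {\rm unlabeled}} \sum_{k} p(z_j = k \vert \lambda)\, N(x_j \vert \mu_k, \sigma_k).$$

In the models below we will not marginalize explicitly; we will instead let the sampler draw the unobserved $z_j$ as discrete latent variables, which targets the same posterior over the continuous parameters.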

We’ll use the heights-weights data set, as we do have a story for why these ought to be separately Gaussian-distributed for men and women.

df=pd.read_csv("data/01_heights_weights_genders.csv")
df.head()
Gender Height Weight
0 Male 73.847017 241.893563
1 Male 68.781904 162.310473
2 Male 74.110105 212.740856
3 Male 71.730978 220.042470
4 Male 69.881796 206.349801
df.describe()
Height Weight
count 10000.000000 10000.000000
mean 66.367560 161.440357
std 3.847528 32.108439
min 54.263133 64.700127
25% 63.505620 135.818051
50% 66.318070 161.212928
75% 69.174262 187.169525
max 78.998742 269.989699

A KDE plot confirms our intuition:

sns.kdeplot(df.Weight, df.Height)

*[Figure: 2-D KDE plot of Weight vs. Height]*

To keep things simple, we’ll work in one dimension, sticking with the weight marginal. Plotting it, there definitely seems to be a double-humped distribution:

sns.kdeplot(df.Weight)
plt.hist(df.Weight, bins=100, alpha=0.2, normed=True);

*[Figure: KDE and histogram of the Weight marginal, showing a double-humped distribution]*

df.shape
(10000, 3)

We first take a random subsample of 500 points and then do the train-test split. Notice that we keep only a very small number of points (4%, i.e. 20 points) in the training set.

makesmall = np.random.choice(range(df.shape[0]), size=500, replace=False)  # random subsample of 500 points
datax = df.Weight.values
dataz = 1*(df.Gender == 'Male')  # encode Male as 1, Female as 0
datax = datax[makesmall]
dataz = dataz.values[makesmall]
from sklearn.model_selection import train_test_split
# stratified split: only 4% of the 500 points (20 of them) go into the training set
itr, ite = train_test_split(range(dataz.shape[0]), train_size=0.04, stratify=dataz)
xte = datax[ite]
xtr = datax[itr]
zte = dataz[ite]
ztr = dataz[itr]

zte.shape, zte.mean(), ztr.mean(), ztr.shape
((480,), 0.48749999999999999, 0.5, (20,))

We use two-category Categoricals to make the cluster assignments. Notice that the training assignments are now observed stochastics, in addition to the $x$s.

with pm.Model() as classmodel1:
    p1 = pm.Uniform('p', 0, 1)
    p2 = 1 - p1
    p = tt.stack([p1, p2])
    #Notice the "observed" below
    assignment_tr = pm.Categorical("assignment_tr", p, 
                                observed=ztr)
    sds = pm.Uniform("sds", 0, 100, shape=2)
    centers = pm.Normal("centers", 
                        mu=np.array([130, 170]), 
                        sd=np.array([20, 20]), 
                        shape=2)
    # potentials for identifiability: keep both mixture weights away from zero
    # and order the centers, which avoids label-switching between the clusters
    p_min_potential = pm.Potential('lam_min_potential', tt.switch(tt.min(p) < .1, -np.inf, 0))
    order_centers_potential = pm.Potential('order_centers_potential',
                                         tt.switch(centers[1]-centers[0] < 0, -np.inf, 0))
    
    # and to combine it with the observations:
    observations = pm.Normal("obs", mu=centers[assignment_tr], sd=sds[assignment_tr], observed=xtr)
    
with classmodel1:
    step1 = pm.Metropolis(vars=[p, sds, centers])
    trace_cm1_full = pm.sample(10000, step=[step1], njobs=2)
Multiprocess sampling (2 chains in 2 jobs)
CompoundStep
>Metropolis: [centers]
>Metropolis: [sds_interval__]
>Metropolis: [p_interval__]
100%|██████████| 10500/10500 [00:05<00:00, 1864.73it/s]
The number of effective samples is smaller than 10% for some parameters.
trace_cm1 = trace_cm1_full[2000::5]
pm.traceplot(trace_cm1);

*[Figure: traceplots of centers, p, and sds for the supervised model]*

pm.autocorrplot(trace_cm1);

*[Figure: autocorrelation plots for centers, p, and sds]*

center_trace = trace_cm1["centers"]
std_trace = trace_cm1["sds"]
#identify the clusters
colors = ["#348ABD", "#A60628"] \
if center_trace[-1, 0] > center_trace[-1, 1] \
    else ["#A60628", "#348ABD"]
norm = sp.stats.norm
x = np.arange(0, 300, 1)
posterior_center_means = center_trace.mean(axis=0)
posterior_std_means = std_trace.mean(axis=0)
posterior_p_mean = 1 - ztr.mean()


y = posterior_p_mean * norm.pdf(x, loc=posterior_center_means[0],
                                scale=posterior_std_means[0])
plt.plot(x, y, label="Cluster 0 (using posterior-mean parameters)", color=colors[1], lw=3)
plt.fill_between(x, y, color=colors[1], alpha=0.3)

y = (1 - posterior_p_mean) * norm.pdf(x, loc=posterior_center_means[1],
                                      scale=posterior_std_means[1])
plt.plot(x, y, label="Cluster 1 (using posterior-mean parameters)", color=colors[0], lw=3)
plt.fill_between(x, y, color=colors[0], alpha=0.3)
plt.hist(xtr, bins=10, histtype="step", normed=True, color="k",
     lw=2, label="histogram of training data")
# plt.hist(xte, bins=20, histtype="step", normed=True, color="k",
#      lw=2, label="histogram of  test data")
plt.legend(loc="upper right")
plt.title("Visualizing Clusters using posterior-mean parameters")

*[Figure: the two posterior-mean Gaussian clusters overlaid on a histogram of the training data]*

We use the $p(x \vert z)$ to do our classification. What we really want is $p(z \vert x)$, but we can use Bayes’ theorem to invert this, as shown in the diagram below.

*[Figure: generative vs. discriminative modeling, i.e. $p(x \vert z)$ vs. $p(z \vert x)$]*

xgivenz0 = lambda x: norm.pdf(x, loc=posterior_center_means[0],
                                scale=posterior_std_means[0])
xgivenz1 = lambda x: norm.pdf(x, loc=posterior_center_means[1],
                                      scale=posterior_std_means[1])
zpred1 = 1*(xgivenz1(xte) > xgivenz0(xte))
from sklearn.metrics import confusion_matrix, accuracy_score
confusion_matrix(zte, zpred1)
array([[223,  23],
       [ 13, 221]])
accuracy_score(zte, zpred1)
0.92500000000000004
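
The hard classification above compares the two class-conditional likelihoods directly, which amounts to assuming equal priors over the clusters (and here the training fraction is exactly 0.5). If we want the full posterior probability $p(z \vert x)$, a minimal sketch of the explicit Bayes inversion, using the posterior-mean parameters and the training class fraction as a plug-in prior, could look like this:

prior_z0 = posterior_p_mean      # p(z=0), the cluster-0 fraction (0.5 here)
prior_z1 = 1 - posterior_p_mean  # p(z=1)

def pz1_given_x(x):
    # p(z=1|x) = p(x|z=1)p(z=1) / [ p(x|z=0)p(z=0) + p(x|z=1)p(z=1) ]
    num = prior_z1 * xgivenz1(x)
    den = prior_z0 * xgivenz0(x) + prior_z1 * xgivenz1(x)
    return num / den

zpred1_bayes = 1*(pz1_given_x(xte) > 0.5)
accuracy_score(zte, zpred1_bayes)  # identical to zpred1 here, since the priors are equal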

Semi-Supervised Learning

In the previous section we did the work on the testing set separately. Here we’d like to model the fact that we have a partial likelihood available for the testing set: we have observed the $x$s, but not the $z$s.

Thus in our model we will include both the training and testing $x$s, but only the training $z$s, as observations, fitting for the testing $z$s. We now want the trace of the testing assignments. That is, we want $p(z \vert x)$, and we’ll use the MCMC model to do the whole Bayes’ theorem inversion for us!

with pm.Model() as classmodel2:
    p1 = pm.Uniform('p', 0, 1)
    p2 = 1 - p1
    p = tt.stack([p1, p2])
    assignment_tr = pm.Categorical("assignment_tr", p, 
                                observed=ztr)
    # we do not know the test assignments
    assignment_te = pm.Categorical("assignment_te", p, 
                                shape=xte.shape[0])
    sds = pm.Uniform("sds", 0, 100, shape=2)
    centers = pm.Normal("centers", 
                        mu=np.array([130, 170]), 
                        sd=np.array([20, 20]), 
                        shape=2)
    p_min_potential = pm.Potential('lam_min_potential', tt.switch(tt.min(p) < .1, -np.inf, 0))
    order_centers_potential = pm.Potential('order_centers_potential',
                                         tt.switch(centers[1]-centers[0] < 0, -np.inf, 0))
    
    # and to combine it with the observations:
    observations_tr = pm.Normal("obs_tr", mu=centers[assignment_tr], sd=sds[assignment_tr], observed=xtr)
    observations_te = pm.Normal("obs_te", mu=centers[assignment_te], sd=sds[assignment_te], observed=xte)
with classmodel2:
    step1 = pm.Metropolis(vars=[p, sds, centers])
    # ElemwiseCategorical is deprecated (see the warning below); CategoricalGibbsMetropolis is its replacement
    step2 = pm.ElemwiseCategorical(vars=[assignment_te])
    trace_cm2_full = pm.sample(15000, step=[step1, step2])
//anaconda/envs/py3l/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: ElemwiseCategorical is deprecated, switch to CategoricalGibbsMetropolis.
  This is separate from the ipykernel package so we can avoid doing imports until
Multiprocess sampling (2 chains in 2 jobs)
CompoundStep
>CompoundStep
>>Metropolis: [centers]
>>Metropolis: [sds_interval__]
>>Metropolis: [p_interval__]
>ElemwiseCategorical: [assignment_te]
100%|██████████| 15500/15500 [03:07<00:00, 82.60it/s]
The estimated number of effective samples is smaller than 200 for some parameters.
pm.summary(trace_cm2_full)
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
assignment_te__0 0.002533 0.050268 0.000324 0.000000 0.000000 28072.0 0.999982
assignment_te__1 0.002700 0.051891 0.000370 0.000000 0.000000 30000.0 1.000017
assignment_te__2 0.348267 0.476421 0.005942 0.000000 1.000000 7151.0 1.000092
assignment_te__3 0.993433 0.080768 0.000493 1.000000 1.000000 29885.0 0.999967
assignment_te__4 0.356933 0.479095 0.006023 0.000000 1.000000 7371.0 0.999968
assignment_te__5 0.000900 0.029986 0.000169 0.000000 0.000000 30000.0 0.999968
assignment_te__6 0.002700 0.051891 0.000293 0.000000 0.000000 27560.0 0.999967
assignment_te__7 0.081333 0.273346 0.002749 0.000000 1.000000 12044.0 0.999967
assignment_te__8 0.000400 0.019996 0.000118 0.000000 0.000000 25748.0 1.000067
assignment_te__9 0.000167 0.012909 0.000073 0.000000 0.000000 30000.0 0.999973
assignment_te__10 0.998100 0.043548 0.000268 1.000000 1.000000 29204.0 0.999995
assignment_te__11 0.006100 0.077864 0.000525 0.000000 0.000000 28606.0 0.999989
assignment_te__12 0.997700 0.047903 0.000257 1.000000 1.000000 28382.0 0.999990
assignment_te__13 0.045233 0.207815 0.001690 0.000000 0.000000 19230.0 0.999976
assignment_te__14 0.002367 0.048591 0.000328 0.000000 0.000000 24049.0 1.000046
assignment_te__15 0.995133 0.069592 0.000396 1.000000 1.000000 28033.0 1.000025
assignment_te__16 0.001700 0.041196 0.000256 0.000000 0.000000 30000.0 0.999973
assignment_te__17 0.977300 0.148945 0.001122 1.000000 1.000000 19770.0 0.999969
assignment_te__18 0.999467 0.023088 0.000131 1.000000 1.000000 30000.0 0.999975
assignment_te__19 0.012967 0.113131 0.000892 0.000000 0.000000 21124.0 0.999974
assignment_te__20 0.031900 0.175734 0.001611 0.000000 0.000000 13911.0 0.999970
assignment_te__21 0.996733 0.057061 0.000340 1.000000 1.000000 29198.0 0.999972
assignment_te__22 0.004233 0.064926 0.000359 0.000000 0.000000 30000.0 0.999980
assignment_te__23 0.882233 0.322332 0.002970 0.000000 1.000000 12202.0 0.999969
assignment_te__24 0.000467 0.021597 0.000125 0.000000 0.000000 30000.0 0.999967
assignment_te__25 0.997800 0.046853 0.000255 1.000000 1.000000 30000.0 0.999975
assignment_te__26 0.013033 0.113417 0.000792 0.000000 0.000000 27063.0 0.999986
assignment_te__27 0.018133 0.133434 0.001063 0.000000 0.000000 22463.0 0.999997
assignment_te__28 0.996700 0.057351 0.000342 1.000000 1.000000 23843.0 0.999983
assignment_te__29 0.653367 0.475898 0.005825 0.000000 1.000000 6356.0 1.000146
... ... ... ... ... ... ... ...
assignment_te__455 0.999467 0.023088 0.000147 1.000000 1.000000 30000.0 1.000000
assignment_te__456 0.999467 0.023088 0.000122 1.000000 1.000000 30000.0 0.999975
assignment_te__457 0.000000 0.000000 0.000000 0.000000 0.000000 30000.0 NaN
assignment_te__458 0.974033 0.159036 0.001079 1.000000 1.000000 24994.0 0.999968
assignment_te__459 0.001867 0.043165 0.000280 0.000000 0.000000 29178.0 1.000026
assignment_te__460 0.891267 0.311304 0.002964 0.000000 1.000000 15793.0 1.000008
assignment_te__461 0.974933 0.156328 0.001106 1.000000 1.000000 22775.0 1.000055
assignment_te__462 0.004767 0.068876 0.000523 0.000000 0.000000 29477.0 1.000113
assignment_te__463 0.544267 0.498037 0.006115 0.000000 1.000000 6374.0 1.000008
assignment_te__464 0.826533 0.378650 0.003447 0.000000 1.000000 11093.0 1.000141
assignment_te__465 0.976833 0.150433 0.001156 1.000000 1.000000 22704.0 0.999969
assignment_te__466 0.001533 0.039128 0.000238 0.000000 0.000000 30000.0 0.999970
assignment_te__467 0.003500 0.059057 0.000430 0.000000 0.000000 28636.0 1.000038
assignment_te__468 0.000267 0.016328 0.000090 0.000000 0.000000 30000.0 1.000033
assignment_te__469 0.254200 0.435411 0.005559 0.000000 1.000000 6911.0 1.000004
assignment_te__470 0.978400 0.145373 0.000986 1.000000 1.000000 24908.0 1.000021
assignment_te__471 0.632167 0.482216 0.006195 0.000000 1.000000 6355.0 0.999973
assignment_te__472 0.000200 0.014141 0.000092 0.000000 0.000000 30000.0 0.999989
assignment_te__473 1.000000 0.000000 0.000000 1.000000 1.000000 30000.0 NaN
assignment_te__474 0.000133 0.011546 0.000065 0.000000 0.000000 30000.0 1.000000
assignment_te__475 0.998300 0.041196 0.000281 1.000000 1.000000 29067.0 0.999967
assignment_te__476 0.016967 0.129146 0.001001 0.000000 0.000000 22061.0 0.999986
assignment_te__477 0.999933 0.008165 0.000047 1.000000 1.000000 30000.0 1.000033
assignment_te__478 0.205700 0.404212 0.004851 0.000000 1.000000 6375.0 0.999970
assignment_te__479 0.962967 0.188843 0.001394 1.000000 1.000000 15035.0 0.999968
centers__0 133.514264 1.777552 0.070120 130.176436 137.083078 463.0 1.000104
centers__1 189.559856 1.942276 0.081149 185.929568 193.411621 428.0 0.999968
p 0.517260 0.033518 0.001251 0.452766 0.582704 580.0 0.999972
sds__0 18.286038 1.217578 0.042751 15.778710 20.579906 783.0 0.999990
sds__1 17.934786 1.345961 0.053764 15.344399 20.467248 497.0 1.000131

485 rows × 7 columns

Since this is partially an unsupervised scenario, it takes much longer to sample, and we stop sampling even though there is some correlation left over.

trace_cm2 = trace_cm2_full[5000::5]
pm.traceplot(trace_cm2, varnames=['centers', 'p', 'sds']);

*[Figure: traceplots of centers, p, and sds for the semi-supervised model]*

pm.autocorrplot(trace_cm2, varnames=['centers', 'p', 'sds']);

*[Figure: autocorrelation plots of centers, p, and sds for the semi-supervised model]*
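
Rather than just eyeballing the autocorrelation plots, we can quantify the problem through effective sample sizes and Gelman-Rubin statistics; a small sketch, assuming the `pm.effective_n` and `pm.gelman_rubin` diagnostics available in this version of PyMC3:

# small effective sample sizes relative to the number of draws indicate strong
# autocorrelation; Rhat close to 1 indicates the two chains agree
print(pm.effective_n(trace_cm2_full, varnames=['centers', 'p', 'sds']))
print(pm.gelman_rubin(trace_cm2_full, varnames=['centers', 'p', 'sds']))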

There is still quite a bit of autocorrelation and we might want to sample longer. But let us see how we did.

assign_trace = trace_cm2['assignment_te']
# the posterior mean of each test assignment is an estimate of p(z=1 | x); threshold at 0.5
zpred2 = 1*(assign_trace.mean(axis=0) > 0.5)
confusion_matrix(zte, zpred2)
array([[233,  13],
       [ 17, 217]])
accuracy_score(zte, zpred2)
0.9375

The accuracy has improved just by having some additional terms in the likelihood!
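
As a quick check on where the extra likelihood terms changed decisions, we can compare the two sets of test predictions computed above (a small sketch; `disagree` is just an illustrative name):

# test points on which the supervised (zpred1) and semi-supervised (zpred2)
# classifiers disagree; these should mostly lie where the two clusters overlap
disagree = np.where(zpred1 != zpred2)[0]
print(len(disagree), "disagreements out of", len(zte), "test points")
print("weights of the disagreeing points:", np.sort(xte[disagree]))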


cmap = mpl.colors.LinearSegmentedColormap.from_list("BMH", colors)
plt.scatter(xte, 1 - assign_trace.mean(axis=0), cmap=cmap,
        c=assign_trace.mean(axis=0), s=10)
plt.ylim(-0.05, 1.05)
plt.xlim(35, 300)
plt.title("Probability of data point belonging to cluster 0")
plt.ylabel("probability")
plt.xlabel("value of data point")

*[Figure: posterior probability of cluster-0 membership for each test point, plotted against its weight]*