Keywords: normal distribution |  multivariate normal |  oceanic tools |  correlation |  covariance |  posterior |  posterior predictive |  gaussian process |  poisson regression |  Data: Kline2.csv |  distmatrix.csv |  Download Notebook

%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import pymc3 as pm

Reading in our data

We read back the Oceanic tools data

df = pd.read_csv("data/Kline2.csv", sep=';')
df.head()
| | culture | population | contact | total_tools | mean_TU | lat | lon | lon2 | logpop |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Malekula | 1100 | low | 13 | 3.2 | -16.3 | 167.5 | -12.5 | 7.003065 |
| 1 | Tikopia | 1500 | low | 22 | 4.7 | -12.3 | 168.8 | -11.2 | 7.313220 |
| 2 | Santa Cruz | 3600 | low | 24 | 4.0 | -10.7 | 166.0 | -14.0 | 8.188689 |
| 3 | Yap | 4791 | high | 43 | 5.0 | 9.5 | 138.1 | -41.9 | 8.474494 |
| 4 | Lau Fiji | 7400 | high | 33 | 5.0 | -17.7 | 178.1 | -1.9 | 8.909235 |

And center it

df['logpop_c'] = df.logpop - df.logpop.mean()
df.head()
| | culture | population | contact | total_tools | mean_TU | lat | lon | lon2 | logpop | logpop_c |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Malekula | 1100 | low | 13 | 3.2 | -16.3 | 167.5 | -12.5 | 7.003065 | -1.973939 |
| 1 | Tikopia | 1500 | low | 22 | 4.7 | -12.3 | 168.8 | -11.2 | 7.313220 | -1.663784 |
| 2 | Santa Cruz | 3600 | low | 24 | 4.0 | -10.7 | 166.0 | -14.0 | 8.188689 | -0.788316 |
| 3 | Yap | 4791 | high | 43 | 5.0 | 9.5 | 138.1 | -41.9 | 8.474494 | -0.502510 |
| 4 | Lau Fiji | 7400 | high | 33 | 5.0 | -17.7 | 178.1 | -1.9 | 8.909235 | -0.067769 |

And read in the matrix of inter-society distances (in thousands of kilometers)

dfd = pd.read_csv("data/distmatrix.csv", header=None)
dij=dfd.values
dij
array([[ 0.   ,  0.475,  0.631,  4.363,  1.234,  2.036,  3.178,  2.794,
         1.86 ,  5.678],
       [ 0.475,  0.   ,  0.315,  4.173,  1.236,  2.007,  2.877,  2.67 ,
         1.965,  5.283],
       [ 0.631,  0.315,  0.   ,  3.859,  1.55 ,  1.708,  2.588,  2.356,
         2.279,  5.401],
       [ 4.363,  4.173,  3.859,  0.   ,  5.391,  2.462,  1.555,  1.616,
         6.136,  7.178],
       [ 1.234,  1.236,  1.55 ,  5.391,  0.   ,  3.219,  4.027,  3.906,
         0.763,  4.884],
       [ 2.036,  2.007,  1.708,  2.462,  3.219,  0.   ,  1.801,  0.85 ,
         3.893,  6.653],
       [ 3.178,  2.877,  2.588,  1.555,  4.027,  1.801,  0.   ,  1.213,
         4.789,  5.787],
       [ 2.794,  2.67 ,  2.356,  1.616,  3.906,  0.85 ,  1.213,  0.   ,
         4.622,  6.722],
       [ 1.86 ,  1.965,  2.279,  6.136,  0.763,  3.893,  4.789,  4.622,
         0.   ,  5.037],
       [ 5.678,  5.283,  5.401,  7.178,  4.884,  6.653,  5.787,  6.722,
         5.037,  0.   ]])
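
A quick sanity check: the matrix should be symmetric with a zero diagonal, and its entries are distances in thousands of kilometers.

# the distance matrix must be symmetric with a zero diagonal
assert np.allclose(dij, dij.T)
assert np.allclose(np.diag(dij), 0.0)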

Implementing the simple tools:logpop model and varying intercepts models

import theano.tensor as tt

with pm.Model() as m2c_onlyp:
    betap = pm.Normal("betap", 0, 1)
    alpha = pm.Normal("alpha", 0, 10)
    loglam = alpha + betap*df.logpop_c
    y = pm.Poisson("ntools", mu=tt.exp(loglam), observed=df.total_tools)
    trace2c_onlyp = pm.sample(6000, tune=1000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [alpha, betap]
100%|██████████| 7000/7000 [00:05<00:00, 1374.78it/s]
pm.summary(trace2c_onlyp)
| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| betap | 0.239422 | 0.031505 | 0.000286 | 0.178178 | 0.301947 | 10383.0 | 0.999918 |
| alpha | 3.478431 | 0.057299 | 0.000577 | 3.366003 | 3.587042 | 10055.0 | 0.999939 |

Notice that $\beta_P$ has a posterior mean of about 0.24.
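
On the outcome scale the model is $\lambda = e^{\alpha + \beta_P \log(P)_c}$, so a one-unit increase in centered log-population multiplies the expected tool count by $e^{0.24} \approx 1.27$, roughly a 27% increase.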

We also implement the varying-intercepts-per-society model from before:

with pm.Model() as m3c:
    betap = pm.Normal("betap", 0, 1)
    alpha = pm.Normal("alpha", 0, 10)
    sigmasoc = pm.HalfCauchy("sigmasoc", 1)
    alphasoc = pm.Normal("alphasoc", 0, sigmasoc, shape=df.shape[0])
    loglam = alpha + alphasoc + betap*df.logpop_c 
    y = pm.Poisson("ntools", mu=tt.exp(loglam), observed=df.total_tools)
with m3c:
    trace3 = pm.sample(6000, tune=1000, nuts_kwargs=dict(target_accept=.95))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [alphasoc, sigmasoc_log__, alpha, betap]
100%|██████████| 7000/7000 [00:28<00:00, 247.87it/s]
The number of effective samples is smaller than 25% for some parameters.
pm.summary(trace3)
| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| betap | 0.258089 | 0.082296 | 0.001338 | 0.096656 | 0.424239 | 4166.0 | 0.999973 |
| alpha | 3.446465 | 0.120746 | 0.002027 | 3.199103 | 3.687472 | 3653.0 | 0.999921 |
| alphasoc__0 | -0.209619 | 0.247940 | 0.003688 | -0.718741 | 0.259506 | 4968.0 | 1.000043 |
| alphasoc__1 | 0.038430 | 0.219664 | 0.002941 | -0.404914 | 0.487200 | 5961.0 | 0.999917 |
| alphasoc__2 | -0.050901 | 0.195434 | 0.002468 | -0.447657 | 0.339753 | 5818.0 | 0.999921 |
| alphasoc__3 | 0.324157 | 0.189557 | 0.002763 | -0.031798 | 0.699002 | 4321.0 | 0.999929 |
| alphasoc__4 | 0.039406 | 0.175986 | 0.002227 | -0.301135 | 0.401451 | 6167.0 | 1.000062 |
| alphasoc__5 | -0.320429 | 0.208348 | 0.003087 | -0.733230 | 0.055638 | 4967.0 | 0.999927 |
| alphasoc__6 | 0.144230 | 0.172236 | 0.002496 | -0.168542 | 0.513625 | 5458.0 | 0.999972 |
| alphasoc__7 | -0.174227 | 0.184070 | 0.002252 | -0.568739 | 0.162993 | 6696.0 | 0.999919 |
| alphasoc__8 | 0.273610 | 0.174185 | 0.002854 | -0.050347 | 0.627762 | 4248.0 | 1.000032 |
| alphasoc__9 | -0.088533 | 0.291865 | 0.004870 | -0.679972 | 0.487844 | 4385.0 | 0.999929 |
| sigmasoc | 0.312019 | 0.129527 | 0.002224 | 0.098244 | 0.588338 | 2907.0 | 0.999981 |
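
The hyperparameter $\sigma_{society} \approx 0.31$ tells us that the society-level intercepts typically deviate by about 0.31 on the log scale, corresponding to multiplicative deviations of roughly $e^{0.31} \approx 1.37$ in expected tool counts.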

A model with a custom covariance matrix

The assumption now is that the intercepts for these societies are correlated, with the correlation depending on the distance between them.

We use a custom covariance matrix in which the covariance between two societies decays exponentially with the square of the distance between them.

You have seen this before! This is an example of a Gaussian Process Covariance Matrix.
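
Concretely, matching the model code below, the covariance between societies $i$ and $j$ is a squared-exponential (Gaussian) kernel in the distance $D_{ij}$, plus a small diagonal jitter:

$$K_{ij} = \eta^2 \exp\left(-\rho^2 D_{ij}^2\right) + \delta_{ij}\,(0.01)$$

Here $\eta^2$ sets the maximum covariance between intercepts and $\rho^2$ controls how quickly that covariance decays with distance.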

Here is the complete model:

with pm.Model() as mgc:
    betap = pm.Normal("betap", 0, 1)
    alpha = pm.Normal("alpha", 0, 10)
    etasq = pm.HalfCauchy("etasq", 1)
    rhosq = pm.HalfCauchy("rhosq", 1)
    means = tt.stack([0.0]*10)
    # small diagonal jitter for numerical stability
    sigma_matrix = tt.nlinalg.diag([0.01]*10)
    # squared-exponential kernel on the distance matrix
    cov = tt.exp(-rhosq*dij*dij)*etasq + sigma_matrix
    gammasoc = pm.MvNormal("gammasoc", means, cov=cov, shape=df.shape[0])
    loglam = alpha + gammasoc + betap*df.logpop_c
    y = pm.Poisson("ntools", mu=tt.exp(loglam), observed=df.total_tools)
with mgc:
    mgctrace = pm.sample(10000, tune=2000, nuts_kwargs=dict(target_accept=.95))
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [gammasoc, rhosq_log__, etasq_log__, alpha, betap]
100%|██████████| 12000/12000 [05:11<00:00, 38.58it/s]
The acceptance probability does not match the target. It is 0.895944134094, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 25% for some parameters.
pm.summary(mgctrace)
| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| betap | 0.247414 | 0.116188 | 0.001554 | 0.005191 | 0.474834 | 5471.0 | 1.000157 |
| alpha | 3.512649 | 0.357406 | 0.007140 | 2.790415 | 4.239959 | 2111.0 | 1.001964 |
| gammasoc__0 | -0.269862 | 0.455632 | 0.008620 | -1.239333 | 0.607492 | 2569.0 | 1.001484 |
| gammasoc__1 | -0.117755 | 0.445192 | 0.008520 | -1.041095 | 0.757448 | 2439.0 | 1.001758 |
| gammasoc__2 | -0.165474 | 0.430544 | 0.008116 | -1.042846 | 0.686170 | 2406.0 | 1.001881 |
| gammasoc__3 | 0.299581 | 0.387140 | 0.007481 | -0.481745 | 1.079855 | 2365.0 | 1.001936 |
| gammasoc__4 | 0.026350 | 0.382587 | 0.007425 | -0.763338 | 0.770842 | 2292.0 | 1.001728 |
| gammasoc__5 | -0.458827 | 0.389807 | 0.006976 | -1.286453 | 0.231992 | 2481.0 | 1.001517 |
| gammasoc__6 | 0.097538 | 0.377499 | 0.007064 | -0.653992 | 0.840048 | 2382.0 | 1.001464 |
| gammasoc__7 | -0.263660 | 0.378417 | 0.006917 | -1.077743 | 0.404521 | 2407.0 | 1.001890 |
| gammasoc__8 | 0.233544 | 0.361616 | 0.006715 | -0.510818 | 0.909164 | 2407.0 | 1.001721 |
| gammasoc__9 | -0.123068 | 0.473731 | 0.006671 | -1.034810 | 0.850175 | 3985.0 | 1.000439 |
| etasq | 0.354953 | 0.660893 | 0.009717 | 0.001437 | 1.114851 | 4904.0 | 1.000206 |
| rhosq | 2.306880 | 30.113269 | 0.343453 | 0.000437 | 4.550517 | 8112.0 | 0.999955 |
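
Note the enormous sd of rhosq relative to its mean: its posterior is strongly right-skewed, so the median is a more representative summary than the mean (and is what we use below):

np.median(mgctrace['rhosq']), np.mean(mgctrace['rhosq'])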
pm.traceplot(mgctrace);

[Figure: traceplots for the mgc model parameters]

pm.autocorrplot(mgctrace);

[Figure: autocorrelation plots for the mgc model parameters]

# collect the gammasoc posterior samples for each society
d = {}
for i, v in enumerate(df.culture.values):
    d[v] = mgctrace['gammasoc'][:, i]
dfsamps = pd.DataFrame.from_dict(d)
dfsamps.head()
| | Chuuk | Hawaii | Lau Fiji | Malekula | Manus | Santa Cruz | Tikopia | Tonga | Trobriand | Yap |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.282612 | 0.059503 | 0.145726 | -0.024485 | -0.283392 | -0.064665 | 0.211578 | 0.421679 | -0.337896 | 0.552060 |
| 1 | 0.138628 | -0.023895 | 0.202573 | 0.083340 | -0.147399 | 0.037469 | 0.259075 | 0.432713 | -0.513998 | 0.579821 |
| 2 | 0.257026 | 0.065751 | 0.263479 | -0.175381 | -0.196812 | -0.139182 | -0.060890 | 0.358894 | -0.478396 | 0.349795 |
| 3 | 0.138580 | -0.494443 | 0.124155 | 0.065190 | -0.183672 | 0.130431 | 0.207621 | 0.186890 | -0.280590 | 0.541165 |
| 4 | 0.034011 | 0.036418 | 0.224733 | -0.600040 | -0.405896 | -0.531054 | -0.348027 | 0.026920 | -0.589172 | 0.220345 |
dfsamps.describe()
| | Chuuk | Hawaii | Lau Fiji | Malekula | Manus | Santa Cruz | Tikopia | Tonga | Trobriand | Yap |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 |
| mean | 0.097538 | -0.123068 | 0.026350 | -0.269862 | -0.263660 | -0.165474 | -0.117755 | 0.233544 | -0.458827 | 0.299581 |
| std | 0.377509 | 0.473743 | 0.382597 | 0.455644 | 0.378426 | 0.430555 | 0.445203 | 0.361625 | 0.389816 | 0.387149 |
| min | -3.359014 | -3.773198 | -3.389680 | -3.856332 | -3.592680 | -3.704032 | -3.634145 | -3.229867 | -3.808333 | -3.046678 |
| 25% | -0.080201 | -0.369415 | -0.156227 | -0.507103 | -0.444126 | -0.377180 | -0.344367 | 0.064836 | -0.650613 | 0.107982 |
| 50% | 0.108659 | -0.112151 | 0.039389 | -0.242684 | -0.244814 | -0.140033 | -0.098324 | 0.241970 | -0.429267 | 0.304480 |
| 75% | 0.287604 | 0.132445 | 0.220311 | -0.015149 | -0.057941 | 0.070862 | 0.129670 | 0.415369 | -0.233960 | 0.500363 |
| max | 2.908532 | 3.523802 | 2.684128 | 2.373157 | 2.501715 | 2.561251 | 2.619592 | 3.046274 | 2.093728 | 3.017744 |

Plotting posteriors and predictives

Let's plot the covariance functions for 100 random posterior samples from the trace.

# draw eta^2 and rho^2 jointly, from the same posterior samples
sel = np.random.choice(len(mgctrace['etasq']), replace=False, size=100)
smalleta = mgctrace['etasq'][sel]
smallrho = mgctrace['rhosq'][sel]
with sns.plotting_context('poster'):
    d=np.linspace(0,10,100)
    for i in range(100):
        covarod = lambda d: smalleta[i]*np.exp(-smallrho[i]*d*d)
        plt.plot(d, covarod(d),alpha=0.1, color='k')
    medetasq=np.median(mgctrace['etasq'])
    medrhosq=np.median(mgctrace['rhosq'])
    covarodmed = lambda d: medetasq*np.exp(-medrhosq*d*d)
    plt.plot(d, covarodmed(d),alpha=1.0, color='k', lw=3)
    plt.ylim([0,1]); 

[Figure: posterior covariance functions against distance, with the median in bold]

The x-axis is in thousands of kilometers. Notice how almost all of the covariance damps out by 4,000 km. As a rough check (a sketch using the median $\rho^2$ computed above), the distance at which the median kernel falls to a fraction $f$ of its maximum $\eta^2$ is $d = \sqrt{\ln(1/f)/\rho^2}$:
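
f = 0.05
# solve eta^2 * exp(-rho^2 * d^2) = f * eta^2  =>  d = sqrt(log(1/f) / rho^2)
np.sqrt(np.log(1/f) / medrhosq)

Let's calculate the median correlation matrix: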

medkij = np.diag([0.01]*10)+medetasq*(np.exp(-medrhosq*dij*dij))
# from statsmodels
def cov2corr(cov, return_std=False):
    '''convert covariance matrix to correlation matrix

    Parameters
    ----------
    cov : array_like, 2d
        covariance matrix, see Notes
    return_std : bool
        If this is true then the standard deviation is also returned.
        By default only the correlation matrix is returned.

    Returns
    -------
    corr : ndarray (subclass)
        correlation matrix

    Notes
    -----
    This function does not convert subclasses of ndarrays. This requires
    that division is defined elementwise. np.ma.array and np.matrix are allowed.

    '''
    cov = np.asanyarray(cov)
    std_ = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std_, std_)
    if return_std:
        return corr, std_
    else:
        return corr
medcorrij=cov2corr(medkij)
medcorrij
array([[  1.00000000e+00,   8.71785980e-01,   8.14096714e-01,
          4.99776683e-04,   5.21038310e-01,   1.84041573e-01,
          1.73288749e-02,   4.30502586e-02,   2.41593313e-01,
          2.65025654e-06],
       [  8.71785980e-01,   1.00000000e+00,   9.16627626e-01,
          9.51206322e-04,   5.20017938e-01,   1.92806478e-01,
          3.57157022e-02,   5.63298362e-02,   2.06001853e-01,
          1.47715424e-05],
       [  8.14096714e-01,   9.16627626e-01,   1.00000000e+00,
          2.58767794e-03,   3.67501866e-01,   2.99604292e-01,
          6.68407750e-02,   1.05368392e-01,   1.21399142e-01,
          8.95698536e-06],
       [  4.99776683e-04,   9.51206322e-04,   2.58767794e-03,
          1.00000000e+00,   9.34889296e-06,   8.60394889e-02,
          3.65244813e-01,   3.38258916e-01,   3.09612981e-07,
          1.25890327e-09],
       [  5.21038310e-01,   5.20017938e-01,   3.67501866e-01,
          9.34889296e-06,   1.00000000e+00,   1.56160443e-02,
          1.52965471e-03,   2.23880332e-03,   7.56772084e-01,
          7.38793465e-05],
       [  1.84041573e-01,   1.92806478e-01,   2.99604292e-01,
          8.60394889e-02,   1.56160443e-02,   1.00000000e+00,
          2.63213982e-01,   7.15782789e-01,   2.33071089e-03,
          2.24573987e-08],
       [  1.73288749e-02,   3.57157022e-02,   6.68407750e-02,
          3.65244813e-01,   1.52965471e-03,   2.63213982e-01,
          1.00000000e+00,   5.31771921e-01,   1.06386644e-04,
          1.61407707e-06],
       [  4.30502586e-02,   5.63298362e-02,   1.05368392e-01,
          3.38258916e-01,   2.23880332e-03,   7.15782789e-01,
          5.31771921e-01,   1.00000000e+00,   1.98487950e-04,
          1.55709958e-08],
       [  2.41593313e-01,   2.06001853e-01,   1.21399142e-01,
          3.09612981e-07,   7.56772084e-01,   2.33071089e-03,
          1.06386644e-04,   1.98487950e-04,   1.00000000e+00,
          4.04514877e-05],
       [  2.65025654e-06,   1.47715424e-05,   8.95698536e-06,
          1.25890327e-09,   7.38793465e-05,   2.24573987e-08,
          1.61407707e-06,   1.55709958e-08,   4.04514877e-05,
          1.00000000e+00]])

We'll put the correlations in a dataframe, scaled to percentages, to see them more clearly

dfcorr = pd.DataFrame(medcorrij*100).set_index(df.culture.values)
dfcorr.columns = df.culture.values
dfcorr
| | Malekula | Tikopia | Santa Cruz | Yap | Lau Fiji | Trobriand | Chuuk | Manus | Tonga | Hawaii |
|---|---|---|---|---|---|---|---|---|---|---|
| Malekula | 100.000000 | 87.178598 | 81.409671 | 4.997767e-02 | 52.103831 | 18.404157 | 1.732887 | 4.305026 | 24.159331 | 2.650257e-04 |
| Tikopia | 87.178598 | 100.000000 | 91.662763 | 9.512063e-02 | 52.001794 | 19.280648 | 3.571570 | 5.632984 | 20.600185 | 1.477154e-03 |
| Santa Cruz | 81.409671 | 91.662763 | 100.000000 | 2.587678e-01 | 36.750187 | 29.960429 | 6.684077 | 10.536839 | 12.139914 | 8.956985e-04 |
| Yap | 0.049978 | 0.095121 | 0.258768 | 1.000000e+02 | 0.000935 | 8.603949 | 36.524481 | 33.825892 | 0.000031 | 1.258903e-07 |
| Lau Fiji | 52.103831 | 52.001794 | 36.750187 | 9.348893e-04 | 100.000000 | 1.561604 | 0.152965 | 0.223880 | 75.677208 | 7.387935e-03 |
| Trobriand | 18.404157 | 19.280648 | 29.960429 | 8.603949e+00 | 1.561604 | 100.000000 | 26.321398 | 71.578279 | 0.233071 | 2.245740e-06 |
| Chuuk | 1.732887 | 3.571570 | 6.684077 | 3.652448e+01 | 0.152965 | 26.321398 | 100.000000 | 53.177192 | 0.010639 | 1.614077e-04 |
| Manus | 4.305026 | 5.632984 | 10.536839 | 3.382589e+01 | 0.223880 | 71.578279 | 53.177192 | 100.000000 | 0.019849 | 1.557100e-06 |
| Tonga | 24.159331 | 20.600185 | 12.139914 | 3.096130e-05 | 75.677208 | 0.233071 | 0.010639 | 0.019849 | 100.000000 | 4.045149e-03 |
| Hawaii | 0.000265 | 0.001477 | 0.000896 | 1.258903e-07 | 0.007388 | 0.000002 | 0.000161 | 0.000002 | 0.004045 | 1.000000e+02 |

Notice the strong correlations in the upper left (Malekula, Tikopia, and Santa Cruz) and between Manus and Trobriand. McElreath has a distance plot which I reproduce below.

To produce a plot like the one on the right, we compute the posterior predictive from the correlation-free part of the model and then overlay the correlations.

from scipy.stats import poisson
def compute_pp_no_corr(lpgrid, trace):
    alphatrace = trace['alpha']
    betaptrace = trace['betap']
    tl = len(trace)  # samples per chain; we ran 2 chains
    gl = lpgrid.shape[0]
    lam = np.empty((gl, 2*tl))
    # center the grid with the same mean used to center logpop at fit time
    lpgrid_c = lpgrid - df.logpop.mean()
    for i, v in enumerate(lpgrid_c):
        temp = alphatrace + betaptrace*v
        lam[i,:] = poisson.rvs(np.exp(temp))
    return lam
lpgrid = np.linspace(6,13,30)
pp = compute_pp_no_corr(lpgrid, mgctrace)
ppmed = np.median(pp, axis=1)
pphpd = pm.stats.hpd(pp.T)
import itertools
corrs = {}
for i, j in itertools.product(range(10), range(10)):
    if i < j:
        corrs[(i, j)] = medcorrij[i, j]
corrs
{(0, 1): 0.87178598035086485,
 (0, 2): 0.81409671378956938,
 (0, 3): 0.00049977668270061984,
 (0, 4): 0.52103830967748688,
 (0, 5): 0.18404157314186861,
 (0, 6): 0.017328874910928813,
 (0, 7): 0.04305025856741021,
 (0, 8): 0.241593313072233,
 (0, 9): 2.6502565440185815e-06,
 (1, 2): 0.9166276258652174,
 (1, 3): 0.00095120632192379471,
 (1, 4): 0.5200179376567835,
 (1, 5): 0.19280647828541175,
 (1, 6): 0.035715702211517139,
 (1, 7): 0.05632983623575849,
 (1, 8): 0.20600185316804365,
 (1, 9): 1.4771542363619412e-05,
 (2, 3): 0.0025876779443579313,
 (2, 4): 0.36750186564264986,
 (2, 5): 0.29960429169647806,
 (2, 6): 0.066840774961541019,
 (2, 7): 0.10536839180821188,
 (2, 8): 0.12139914234488278,
 (2, 9): 8.9569853640368644e-06,
 (3, 4): 9.348892959453282e-06,
 (3, 5): 0.086039488876625755,
 (3, 6): 0.36524481329909764,
 (3, 7): 0.33825891559247928,
 (3, 8): 3.0961298086947269e-07,
 (3, 9): 1.2589032666550904e-09,
 (4, 5): 0.01561604425648749,
 (4, 6): 0.0015296547115497437,
 (4, 7): 0.0022388033209531861,
 (4, 8): 0.75677208354568837,
 (4, 9): 7.3879346458710708e-05,
 (5, 6): 0.26321398249820332,
 (5, 7): 0.71578278931845729,
 (5, 8): 0.0023307108876489215,
 (5, 9): 2.2457398672039314e-08,
 (6, 7): 0.53177192062078238,
 (6, 8): 0.00010638664384256532,
 (6, 9): 1.6140770654271327e-06,
 (7, 8): 0.00019848794950225336,
 (7, 9): 1.557099581470421e-08,
 (8, 9): 4.0451487726182869e-05}
with sns.plotting_context('poster'):
    plt.plot(df.logpop, df.total_tools, 'o', color="g")
    lpv = df.logpop.values
    ttv = df.total_tools.values
    for a, x, y in zip(df.culture.values, lpv, ttv):
        plt.annotate(a, xy=(x, y))
    # join each pair of societies with a line whose opacity is the
    # median correlation between their intercepts
    for i in range(10):
        for j in range(10):
            if i < j:
                plt.plot([lpv[i], lpv[j]], [ttv[i], ttv[j]], 'k', alpha=corrs[(i, j)])
    plt.plot(lpgrid, ppmed, color="g")
    plt.fill_between(lpgrid, pphpd[:,0], pphpd[:,1], color="g", alpha=0.2, lw=1)
    plt.ylim([10, 75])
    plt.xlim([6.5, 13])

[Figure: total tools against log population, with the posterior-predictive median and HPD band, and society pairs joined by lines shaded by correlation]

Notice how distance probably pulls Lau Fiji up from the median, and how Manus and Trobriand sit below the median but are highly correlated. A smaller effect can be seen in the Malekula-Tikopia-Santa Cruz triangle on the left. Of course, the causality here is uncertain.