Homework 2

Summary

**Due Date: ** Friday, Febrary 9th, 2017 at 10am

Instructions:

Upload your final answers as well as your iPython notebook containing all work to Canvas.
Structure your notebook and your work to maximize readability.

Problem 1: Monte Carlo Integration

Let $X$ be a random variable with distribution described by the following pdf:

$f_X(x) = \begin{cases} \frac{1}{12}(x-1), &1\leq x\leq 3\\ -\frac{1}{12}(x-5), &3< x\leq 5\\ \frac{1}{6}(x-5), &5< x\leq 7\\ -\frac{1}{6}(x-9), &7< x\leq 9\\ 0, &otherwise \end{cases}$

Let $h$ be the following function of $X$:

$h(X) = \frac{1}{3\sqrt{2}\pi}\mathrm{exp}\left\{ -\frac{1}{18}\left( X - 5\right)^2\right\}$

Compute $\mathbb{E}[h(X)]$ via Monte Carlo simulation using the following sampling methods:

inverse transform sampling
rejection sampling with both uniform proposal distribution and normal proposal distribution (steroids) (with appropriately chosen parameters)

Problem 2: Variance Reduction

Part A

Compute the variance of each estimate of $\mathbb{E}[h(X)]$ obtained in Problem 1. What do you see?

Part B (Stratified Sampling)

Often, a complex integral can be computed with more ease if one can break up the domain of the integral into pieces and if on each piece of the domain the integral is simplified.

Find a natural way to divide the domain of $X$ and express $\mathbb{E}[h(X)]$ as an correctly weighted sum of integrals over the pieces of the domain of $X$. (This constitutes the essentials of Stratified Sampling)
Estimate each integral in the summand using rejection sampling using a normal proposal distribution (with sensibly chosen parameters). From these, estimate $\mathbb{E}[h(X)]$.
Compute the variance of your estimate of $\mathbb{E}[h(X)]$. Compare with the variance of your previous estimate of $\mathbb{E}[h(X)]$ (in Part A, using rejection sampling, a normal proposal distribution over the entire domain of $X$).

Problem 3: Linear Regression

Consider the following base Regression class, which roughly follows the API in the python package scikit-learn.

Our model is the the multivariate linear model whose MLE solution or equivalent cost minimization was talked about in lecture:

$y = X\beta + \epsilon$ where $y$ is a length $n$ vector, $X$ is an $m \times p$ matrix created by stacking the features for each data point, and $\beta$ is a $p$ length vector of coefficients.

The class showcases the API:

$fit(X, y)$: Fits linear model to $X$ and $y$.

$get_params()$: Returns $\hat{\beta}$ for the fitted model. The parameters should be stored in a dictionary with keys “intercept” and “coef” that give us $\hat{\beta_0}$ and $\hat{\beta_{1:}}$. (The second value here is thus a numpy array of coefficient values)

$predict(X)$: Predict new values with the fitted model given $X$.

$score(X, y)$: Returns $R^2$ value of the fitted model.

$set_params()$: Manually set the parameters of the linear model.

class Regression(object):
    
    def __init__(self):
        self.params = dict()
    
    def get_params(self, k):
        return self.params[k]
    
    def set_params(self, **kwargs):
        for k,v in kwargs.iteritems():
            self.params[k] = v
        
    def fit(self, X, y):
        raise NotImplementedError()
        
    def predict(self, X):
        raise NotImplementedError()
        
    def score(self, X, y):
        raise NotImplementedError()

Part A: a class for Ordinary Least Squares

Inherit from this class to create an ordinary Least Squares Linear Regression class.

It’s signature will look like this:

class OLS(Regression):

Implement fit, predict and score. This will involve some linear algebra. (You might want to read up on pseudo-inverses before you directly implement the linear algebra on the lecure slides).

$R^2$ score

To implement score, look below:

The $R^2$ score is defined as: ${R^{2} = {1-{SS_E \over SS_T}}}$

Where:

$SS_T=\sum_i (y_i-\bar{y})^2, SS_R=\sum_i (\hat{y_i}-\bar{y})^2, SS_E=\sum_i (y_i - \hat{y_i})^2$ where ${y_i}$ are the original data values, $\hat{y_i}$ are the predicted values, and $\bar{y_i}$ is the mean of the original data values.

Part B: test your code

We’ll create a synthetic data set using the code below. (Read the documentation for make_regression to see what is going on).

from sklearn.datasets import make_regression
import numpy as np
np.random.seed(99)
X, y, coef = make_regression(30,10, 10, bias=1, noise=2, coef=True)
coef

array([ 76.6568183 ,  77.67682678,  63.78807738,  19.3299907 ,
        59.01638708,  53.13633737,  28.77629958,  10.01888939,
         9.25346811,  59.55220395])

Verify that your code recovers these coefficients approximately on doing the fit. Plot the predicted y against the actual y. Also calculate the score using the same sets X and y. The usage will look something like:

lr = OLS()
lr.fit(X,y)
lr.get_params['coef']
lr.redict(X,y)
lr.score(X,y)