Harvard University
Spring 2018
Instructors: Rahul Dave
**Due Date:** Thursday, February 1st, 2018 at 11:59pm
Instructions:
- Upload your final answers as well as your iPython notebook containing all work to Canvas.
- Structure your notebook and your work to maximize readability.
Problem 1: Bayes Theorem
Coding not required
Part A:
Your child has been randomly selected for Type I diabetes screening, using a highly accurate new test that boasts a false positive rate of 1% and a false negative rate of 0%. The prevalence of Type I diabetes in children is approximately 0.228%.
1. Should your child test positive, what is the probability that he/she has Type I diabetes?
2. Should you be concerned enough to ask for further testing or treatment for your child? Suppose an independent test with the same false positive and false negative rates is available.
Later, you read online that Type I diabetes is 6 times more prevalent in prematurely born children.
3. If this statistic is true, what is the probability that your child, who is prematurely born, has Type I diabetes?
4. Subjectively, given the new information, should you be concerned enough to ask for treatment for your child?
Justify your decisions using your calculations.
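The calculations Part A asks for are applications of Bayes' theorem, $P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}$. As a minimal sketch, the helper below computes the probability of having a condition given a positive test, and handles a second independent positive test by reusing the first posterior as the new prior. The function name `posterior_positive` is hypothetical, and the numbers in the usage lines are illustrative assumptions, deliberately not the values from this problem.

```python
def posterior_positive(prior, false_positive_rate, false_negative_rate):
    """Return P(condition | positive test) using Bayes' theorem."""
    sensitivity = 1.0 - false_negative_rate  # P(positive | condition)
    # Total probability of a positive result, with or without the condition.
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


# Illustrative numbers only (not the ones in the problem statement).
after_one = posterior_positive(
    prior=0.01, false_positive_rate=0.05, false_negative_rate=0.10
)
# A second, independent positive test: the first posterior becomes the prior.
after_two = posterior_positive(
    prior=after_one, false_positive_rate=0.05, false_negative_rate=0.10
)
print(f"after one positive test:  {after_one:.4f}")
print(f"after two positive tests: {after_two:.4f}")
```

Changing the `prior` argument is also how a revised prevalence (for example, for a higher-risk subgroup) would enter the calculation.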
Part B:
During shopping week, you’re trying to decide between two classes based on the criterion that the class must have a lenient grading system.
Both classes have harsh TFs who give assignments lower grades than the work merits. You know from rumors that in one class 35% of the TF staff are harsh graders and in the other class 15% are harsh graders, but you don’t know which class is which.
So, you decide to conduct an experiment: submit an assignment to be graded. Fortunately, both classes offer an optional Homework 0 that is graded as extra credit. Unfortunately, you have time to complete the problem set for only one of these classes.
Suppose you randomly pick the Homework 0 from Class A to complete and suppose that you receive a grade that you believe is lower than the quality of your work warrants.
- Based on this evidence, what is the probability that Class A has the harsher grading system? Which class should you drop based on the results of your experiment (or do you not have sufficient evidence to decide)?
Justify your decisions using your calculations.
Problem 2: Maximum Likelihood Estimates
Coding required for Part B only
Suppose you observe the following data set: $\mathbf{x}^{(0)} = (0.5, 2.5), \mathbf{x}^{(1)} = (3.2, 1.3), \mathbf{x}^{(2)} = (2.72, 5.84), \mathbf{x}^{(3)}= (10.047, 0.354)$. By convention, for any vector $\mathbf{x}$, we will denote the first component of $\mathbf{x}$ by $x_{1}$ and the second component by $x_{2}$. Suppose that the data are drawn from the same two-dimensional probability distribution with pdf $f_X$, that is, $\mathbf{x}^{(i)} \overset{iid}{\sim} f_X$, where

You should assume that $\lambda_1, \lambda_0 > 0$ and that $f_X$ is supported on the nonnegative quadrant of $\mathbb{R}^2$ (i.e. $f_X$ is zero when either component is negative).
Part A:
What are the values of $\lambda_0$ and $\lambda_1$ that maximize the likelihood of the observed data?
Support your answer with full and rigorous analytic derivations.
Part B:
Visualize the data along with the distribution you determined in part A (in two dimensions or three).
Problem 3: Frequentist Stats
Coding required
Read the data set contained in Homework_1_Data.txt. Each data point is a two-dimensional vector, $\mathbf{x} = (x_1, x_2)$.
Part (A): Visualization and Interpretation
- Make a 2-D visualization of the distribution of the data.
- Visualize the pdf, $f_X$, of the underlying distribution of the data.
- Visualize the distribution defined by $f_{x_2 \mid x_1}$ for $x_1 \in [3.99, 4.01]$.
- Visualize the distribution defined by $f_{x_1}$.
Part (B): Estimation using bootstrap
- Empirically estimate the mean of the distribution $f_{x_1}$. Also estimate the SE (standard error) of this estimate.
- Empirically estimate the standard deviation of the distribution $f_{x_2 \mid x_1}$, for $x_1 \in [3.99, 4.01]$. Also estimate the SE (standard error) of this estimate.
- Given the SE, how many digits in your standard deviation estimate are significant? Explain why.
In obtaining estimates for this problem we want you to
- define a function called get_bootstrap_sample(dataset) to generate each bootstrap sample
- and then another function perform_bootstrap(dataset) to generate all the samples.
They should both take as parameters the dataset from which you’ll be drawing samples. perform_bootstrap should call get_bootstrap_sample and return a sequence of bootstrap samples. get_bootstrap_sample should return an individual bootstrap sample.
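One possible shape for these two functions is sketched below, using only the Python standard library. The function names and the `dataset` parameter come from the description above; the optional `n_samples` argument, its default of 1000, and the synthetic one-dimensional data are assumptions made for illustration and are not part of the assignment. The sketch treats the dataset as a sequence of observations, so rows of a two-dimensional data set (for example, tuples) would be resampled as whole units.

```python
import random
import statistics


def get_bootstrap_sample(dataset):
    """Return one bootstrap sample: len(dataset) draws with replacement."""
    return random.choices(dataset, k=len(dataset))


def perform_bootstrap(dataset, n_samples=1000):
    """Return a list of n_samples bootstrap samples (n_samples is assumed)."""
    return [get_bootstrap_sample(dataset) for _ in range(n_samples)]


# Synthetic stand-in data; the real data would come from Homework_1_Data.txt.
random.seed(0)
data = [random.gauss(4.0, 2.0) for _ in range(500)]

samples = perform_bootstrap(data)
# The statistic of interest, recomputed on every bootstrap sample.
boot_means = [statistics.fmean(sample) for sample in samples]

estimate = statistics.fmean(data)
# The bootstrap SE is the standard deviation of the resampled statistic.
standard_error = statistics.stdev(boot_means)
print(f"mean estimate: {estimate:.3f}, bootstrap SE: {standard_error:.3f}")
```

The same pattern applies to any statistic: replacing `statistics.fmean` with `statistics.stdev` in the list comprehension yields a bootstrap SE for a standard deviation estimate.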
Problem 4: Missing Data
Coding required
Missing data is a very important topic in statistics and machine learning. Let us begin to explore how to handle missing data in our analysis. We’ll be working with a dataset from the UCI Machine Learning Repository that uses a variety of wine chemical predictors to classify wines grown in the same region in Italy. Each row contains 13 (mostly chemical) predictors of the response variable, wine class, including things like alcohol content, hue, and phenols. Unfortunately, some of the predictor values were lost in measurement. Please load wine_quality_missing.csv. (If pandas makes your life easier, feel free to use it.)
- One way to handle missing data is to ignore it entirely by removing any rows that have any missing values. This is called drop imputation. Use drop imputation on our wine quality dataset. How many rows does our drop-imputed dataset have?
- Another way to handle missing data is to replace any missing value with the mean of the non-missing values in that column. This is called mean imputation. How many rows does our mean-imputed dataset have?
- Empirically estimate the pdf of the distribution of the sample mean of the ash predictor under drop imputation and mean imputation. Use the perform_bootstrap function from Problem 3.
- Compare the standard errors (SE) of the distributions under drop imputation and mean imputation in the previous question. Can you suggest one or two reasons why they may be different?
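The two imputation strategies described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a prescribed solution: the function names `drop_impute` and `mean_impute` are hypothetical, missing values are assumed to be encoded as NaN, every column is assumed to have at least one observed value, and the small `toy` table stands in for the wine data.

```python
import math


def drop_impute(rows):
    """Keep only the rows that have no missing (NaN) entries."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]


def mean_impute(rows):
    """Replace each NaN with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        col_means.append(sum(observed) / len(observed))
    return [
        [col_means[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


nan = math.nan
# Toy table with two columns and two missing entries (illustrative only).
toy = [
    [1.0, 10.0],
    [2.0, nan],
    [nan, 30.0],
    [4.0, 40.0],
]

dropped = drop_impute(toy)  # rows containing a NaN are removed
filled = mean_impute(toy)   # row count is unchanged, NaNs are filled in
print(len(dropped), len(filled))
```

With pandas, `DataFrame.dropna()` and `DataFrame.fillna(df.mean())` perform the same two operations on numeric columns.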