Notes
https://towardsdatascience.com/interpreting-the-coefficients-of-linear-regression-cc31d4c6f235
https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9
https://towardsdatascience.com/experimentation-in-data-science-90521e74ee4c
https://towardsdatascience.com/search?q=linear%20regression
https://towardsdatascience.com/detecting-stationarity-in-time-series-data-d29e0a21e638
https://towardsdatascience.com/how-to-predict-a-time-series-part-1-6d7eb182b540
Classical forecasting methodology (ARIMA, exponential smoothing state space models, moving average, etc.)
https://www.analyticsvidhya.com/blog/2018/08/auto-arima-time-series-modeling-python-r/
Time Series: https://towardsdatascience.com/time-series-forecasting-arima-models-7f221e9eee06
https://towardsdatascience.com/time-series-forecasting-with-prophet-54f2ac5e722e
https://towardsdatascience.com/forecasting-exchange-rates-using-arima-in-python-f032f313fc56
https://towardsdatascience.com/forecasting-with-prophet-d50bbfe95f91
https://towardsdatascience.com/exploring-the-sp500-with-r-part-2-asset-analysis-657d3c1caf60
https://towardsdatascience.com/time-series-machine-learning-regression-framework-9ea33929009a
https://colab.research.google.com/notebooks/widgets.ipynb#scrollTo=QKk_E6-QRVPW
Multicollinearity occurs when independent variables in a regression model are correlated.
Recently at a meetup regarding AI, the topic of statistics came up during discussion. Statistical methods are a great tool to quantify your tests and check for a significant impact between your independent variables (variables that you control and can change; think of the x-axis terms in a graph) and the dependent variable (the variable that changes in response to your independent variables; the y-axis terms in a graph).
For example, y = 2x means that for every unit change in x (our independent variable), y (the dependent variable) changes by 2. This is quite simple when we have one independent variable and one dependent variable. However, the issue becomes more complex when our dataset has multiple variables, referred to as features, that can affect our target variable (the variable we are trying to predict using linear regression).
It becomes even more complicated when the features become "dependent" on each other. Another way of saying this: when we change an independent variable and expect a change in the dependent variable, we find that another independent variable has also changed. Those two independent variables are now codependent, or collinear with each other. Add in more features that are collinear with each other and we get multicollinearity.
There are two main types of multicollinearity. The first is structural multicollinearity (e.g., an independent variable and its square), which is simply a byproduct: since more often than not you create it from an existing independent variable, you will be able to track it. Think of when we take a dataset and apply a log to scale or normalize all features; that would be an example of structural multicollinearity. The second, and more dangerous in my opinion, is data multicollinearity. This is already embedded within the feature set of your dataframe (e.g., a pandas DataFrame) and is much harder to observe.
What problems can this cause? It makes our coefficient estimates unstable and inflates their variance, which weakens our p-values (the significance values). This can lead to overfitting, where the model does great on the known training set but fails on an unknown testing set. Because multicollinearity leads to higher standard errors and lower statistical significance, it makes it difficult to ascertain how important a feature is to the target variable. And with lower significance, we will fail to reject the null hypothesis, which leads to a type II error in our hypothesis testing.
The next question I was asked during discussion was the same question you may have: how do we notice it? The best way I found was to test each independent feature against the others. This is not much work if you have a limited number of features, which is what I did for one of my projects that involved a linear base model. But when you have features numbering in the hundreds or more, it is much more difficult. At that point it also becomes a dimensionality issue (the curse of dimensionality), and the best option is to use PCA (Principal Component Analysis) to reduce your features. Another method of detection is to look at the coefficient scores.
If you can identify which variables are affected by multicollinearity and the strength of the correlation, you're well on your way to determining whether you need to fix it. Fortunately, there is a very simple test to assess multicollinearity in your regression model: the variance inflation factor (VIF), which identifies correlation between independent variables and the strength of that correlation. A VIF of 1 means a feature is not collinear with the others (i.e., there is no correlation between that feature and the rest); the higher the number, the more strongly correlated the feature is with the others. If VIF returns a number greater than 5, the offending features should be reduced, for example with PCA. A minimal VIF sketch appears below.
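A minimal sketch of computing VIF with statsmodels; the toy DataFrame and feature names here are invented for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Toy data: feature_b is almost a linear function of feature_a,
# so both should come back with a high VIF.
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature_a": rng.normal(size=100)})
df["feature_b"] = 2 * df["feature_a"] + rng.normal(scale=0.1, size=100)
df["feature_c"] = rng.normal(size=100)

X = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # a VIF above ~5 flags a feature as collinear with the others
```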
There is one final statistical tool that is great for analyzing the variance between features: ANOVA, which stands for Analysis of Variance. Generally, the higher the variance between the variables, the less likely they are related (or correlated). But perhaps this will be a topic for a future discussion. I will link my source for this discussion below, which also contains a great example you can try out.
Multiple vs. multivariate linear regression: a regression analysis with one dependent variable and 8 independent variables is NOT a multivariate regression; it's a multiple regression. Multivariate analysis ALWAYS refers to having multiple dependent variables. So when you're in SPSS, choose univariate GLM for this model, not multivariate.
It is tempting to ignore or remove outliers. Don’t do it without a very good reason. Always do external research in order to find hypothesized true population statistics. This is probably easier said than done.
In econometrics, it is said that a linear regression model presents heteroscedasticity when the variance of the disturbances is not constant across observations. This violates one of the basic hypotheses on which the linear regression model is based.
Recall that one of the basic assumptions of linear regression is that the errors have constant variance. Under heteroscedasticity, the data are heterogeneous: the observations come from probability distributions with different variances.
Obviously, observations where the disturbance term u has low variance (like that for X1 in the source's figure) will tend to be a better guide to the underlying relationship than those (like that for X5) where it has a relatively high variance.
When the distribution is not the same for each observation, the disturbance term is said to be subject to heteroscedasticity. There are two major consequences of heteroscedasticity. One is that the standard errors of the regression coefficients are estimated wrongly and the t-tests (and F test) are invalid.
The other is that OLS is an inefficient estimation technique. An alternative technique which gives relatively high weight to the relatively low-variance observations (weighted least squares) should tend to yield more accurate estimates; a sketch follows.
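A minimal sketch of that alternative, weighted least squares, using statsmodels on synthetic heteroscedastic data; the weights, taken as inverse variances, are an assumption of the sketch:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 100)
# Disturbance variance grows with x: classic heteroscedasticity.
y = 3 + 2 * x + rng.normal(scale=0.5 * x)
X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()
# WLS with weights proportional to 1/variance down-weights the noisy observations.
wls_fit = sm.WLS(y, X, weights=1.0 / (0.5 * x) ** 2).fit()
print(ols_fit.params, wls_fit.params)
```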
https://towardsdatascience.com/pyspark-in-google-colab-6821c2faf41c
Cook's distance measures the effect on the regression of deleting a point, so given this information it is worth investigating the points with extreme/high Cook's distances.
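A short sketch of extracting Cook's distances from a fitted statsmodels OLS model; the data is synthetic, and the 4/n cutoff is just a common rule of thumb:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(size=50)
results = sm.OLS(y, sm.add_constant(x)).fit()

influence = results.get_influence()
cooks_d, _ = influence.cooks_distance  # returns (distances, p-values)
# Flag points whose Cook's distance is unusually large.
print(np.where(cooks_d > 4 / len(cooks_d))[0])
```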
https://towardsdatascience.com/model-assumptions-for-regression-problems-e4591af44901
When looking for relationships in your data, it’s no good just fitting a regression and hoping for the best. Linear regression has many assumptions that need to be met in order for the model to be accurate. Some of the common assumptions are:
- Linearity: A linear relationship exists between the dependent and predictor variables. If no linear relationship exists, then linear regression is an inaccurate representation of the data
- No multicollinearity: Predictor variables are not collinear, meaning they aren’t highly correlated
- No autocorrelation (serial correlation in time): Autocorrelation is when a variable is correlated with itself across observations
- Homoscedasticity: There is no pattern in the residuals, meaning that the variance is constant
- Normality: residuals, independent, and dependent variables must be normally distributed, and the residual average is zero, indicating that the data are evenly spread across the regression line (a few of these checks are sketched below)
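A rough sketch of spot-checking some of these assumptions with statsmodels, on synthetic data; the checks shown are illustrative, not exhaustive:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 4 + 1.5 * x + rng.normal(size=200)
results = sm.OLS(y, sm.add_constant(x)).fit()

print(durbin_watson(results.resid))  # ~2 suggests no autocorrelation
print(jarque_bera(results.resid))    # tests residual normality (JB stat, p-value, skew, kurtosis)
print(results.resid.mean())          # ~0 for OLS with an intercept
```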
https://towardsdatascience.com/regression-analysis-linear-regression-239df26a94ac
Ridge Regression
A standard linear or polynomial regression will fail when there is high collinearity among the feature variables. Collinearity is the existence of near-linear relationships among the independent variables. The presence of high collinearity can be detected in a few different ways:
- A regression coefficient is not significant even though, theoretically, that variable should be highly correlated with Y.
- When you add or delete an X feature variable, the regression coefficients change dramatically.
- Your X feature variables have high pairwise correlations (check the correlation matrix).
We can first look at the optimization function of a standard linear regression to gain some insight as to how ridge regression can help:
$\min_w \| Xw - y \|^2$
Where X represents the feature variables, w represents the weights, and y represents the ground truth. Ridge Regression is a remedial measure taken to alleviate collinearity amongst regression predictor variables in a model. Collinearity is a phenomenon in which one feature variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. Since the feature variables are correlated in this way, the final regression model is quite restricted and rigid in its approximation, i.e., it has high variance. To alleviate this issue, Ridge Regression adds a small squared bias factor to the variables:
$\min_w \| Xw - y \|^2 + z\| w \|^2$
Such a squared bias factor pulls the feature variable coefficients away from this rigidness, introducing a small amount of bias into the model but greatly reducing the variance.
A few key points about Ridge Regression:
- The assumptions of this regression are the same as those of least squares regression, except normality is not assumed.
- It shrinks the coefficient values but does not set them to zero, which means it provides no feature selection.
Lasso Regression
Lasso Regression is quite similar to Ridge Regression in that both techniques have the same premise. We are again adding a biasing term to the regression optimization function in order to reduce the effect of collinearity and thus the model variance. However, instead of using a squared bias like ridge regression, lasso uses an absolute-value bias:
$\min_w \| Xw - y \|^2 + z\| w \|_1$
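A minimal sketch contrasting the two penalties with scikit-learn, where alpha plays the role of z in the formulas above; the toy data is constructed to be nearly collinear:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=100)  # two nearly collinear features
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)  # shrunk toward zero, but none exactly zero
print(lasso.coef_)  # some coefficients driven exactly to zero (feature selection)
```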
Regression coefficients are estimates of the unknown population parameters and describe the relationship between a predictor variable and the response. In linear regression, coefficients are the values that multiply the predictor values. Suppose you have the regression equation y = 3X + 5: the coefficient 3 means y is expected to increase by 3 for each one-unit increase in X, and 5 is the predicted value of y when X = 0.
In regression with multiple independent variables, the coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one, holding all the other independent variables constant. Remember to keep in mind the units which your variables are measured in.
The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 100% indicates that the model explains all the variability of the response data around its mean.
R² is defined as the ratio of the model sum of squares to the total sum of squares, multiplied by 100 to express it as a percentage. It is often called the coefficient of determination.
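In symbols, matching that definition:

$R^2 = \frac{SS_{\text{model}}}{SS_{\text{total}}} = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$ (multiply by 100 to express it as a percentage).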
Never do a regression analysis unless you have already found at least a moderately strong correlation between the two variables. (A good rule of thumb is it should be at or beyond either positive or negative 0.50.) If the data don’t resemble a line to begin with, you shouldn’t try to use a line to fit the data and make predictions (but people still try).
https://github.com/justmarkham
Linear Regression Basic Example
2012 Births by Count    float64
2013 Births by Count    int64
2014 Births by Count    int64
2015 Births by Count    int64
2016 Births by Count    float64
2017 Births by Count    int64
dtype: object

Missing Values? True
How many from what columns?
2012 Births by Count    7
2013 Births by Count    0
2014 Births by Count    0
2015 Births by Count    0
2016 Births by Count    1
2017 Births by Count    0
dtype: int64
total missing 8

I'm gonna bite the bullet and say that NaN == 0.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Approach 1: statsmodels OLS
https://stackoverflow.com/questions/19991445/run-an-ols-regression-with-pandas-data-frame
I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:
Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?
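The usual answer is the statsmodels formula API. Here is a sketch, with the toy A/B/C frame reconstructed from the table shown further below:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# R-style formula: predict A from B and C.
result = smf.ols(formula="A ~ B + C", data=df).fit()
print(result.summary())
```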
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64

OLS Regression Results

Dep. Variable: | A | R-squared: | 0.579 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.158 |
Method: | Least Squares | F-statistic: | 1.375 |
Date: | Fri, 09 Aug 2019 | Prob (F-statistic): | 0.421 |
Time: | 05:21:02 | Log-Likelihood: | -18.178 |
No. Observations: | 5 | AIC: | 42.36 |
Df Residuals: | 2 | BIC: | 41.19 |
Df Model: | 2 | | |
Covariance Type: | nonrobust | | |

coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 14.9525 | 17.764 | 0.842 | 0.489 | -61.481 | 91.386 |
B | 0.4012 | 0.650 | 0.617 | 0.600 | -2.394 | 3.197 |
C | 0.0004 | 0.001 | 0.650 | 0.583 | -0.002 | 0.003 |

Omnibus: | nan | Durbin-Watson: | 1.061 |
---|---|---|---|
Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.498 |
Skew: | -0.123 | Prob(JB): | 0.780 |
Kurtosis: | 1.474 | Cond. No. | 5.21e+04 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.21e+04. This might indicate that there are strong multicollinearity or other numerical problems.

How to directly get R-squared, coefficients, and p-values:
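Continuing the sketch above, the quantities can be read directly off the fitted result; these are standard statsmodels result attributes:

```python
# Direct accessors on the fitted statsmodels result from the sketch above.
print(result.params)    # coefficients: Intercept, B, C
print(result.pvalues)   # per-coefficient p-values
print(result.rsquared)  # coefficient of determination
```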
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64

Intercept    0.489
B            0.600
C            0.583
dtype: float64

0.578871792091864

Approach 2: sklearn LinearRegression()
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
array([4.01182386e-01, 3.51587361e-04])
14.95247950395367
array([22.98737801, 27.07022252, 18.97238987, 31.00786144, 49.96214816])

A | B | C | A_pred | |
---|---|---|---|---|
0 | 10 | 20 | 32 | 22.987378 |
1 | 20 | 30 | 234 | 27.070223 |
2 | 30 | 10 | 23 | 18.972390 |
3 | 40 | 40 | 23 | 31.007861 |
4 | 50 | 50 | 42523 | 49.962148 |
Walkthrough on all Communities and Years
- Fill missing values + assessment
- Predicting future indicator values
- Hypothesis testing (p-values)
- Model fit (R²)
- Intro to cross validation
2012 Births by Count    float64
2013 Births by Count    int64
2014 Births by Count    int64
2015 Births by Count    int64
2016 Births by Count    float64
2017 Births by Count    int64
dtype: object

Missing Values? True
How many from what columns?
2012 Births by Count    7
2013 Births by Count    0
2014 Births by Count    0
2015 Births by Count    0
2016 Births by Count    1
2017 Births by Count    0
dtype: int64
total missing 8

I'm gonna bite the bullet and say that NaN == 0.

2012 Births by Count | 2013 Births by Count | 2014 Births by Count | 2015 Births by Count | 2016 Births by Count | 2017 Births by Count | |
---|---|---|---|---|---|---|
0 | 49.0 | 39 | 60 | 38 | 59.0 | 41 |
1 | 64.0 | 42 | 49 | 59 | 62.0 | 44 |
2 | 41.0 | 40 | 37 | 43 | 32.0 | 37 |
3 | 37.0 | 31 | 53 | 33 | 34.0 | 45 |
4 | 22.0 | 23 | 28 | 22 | 28.0 | 28 |
Predicting the 2017 values that we already know
array([48.0174213 , 45.08597785, 32.88262277, 37.37845509, 20.4051096 ])

Predicting 2017 values using new (fake) data
_2012_Births_by_Count | _2013_Births_by_Count | _2014_Births_by_Count | _2015_Births_by_Count | _2016_Births_by_Count | |
---|---|---|---|---|---|
0 | 18 | 30 | 33 | 10 | 27 |
Plotting the Least Squares Line
Let's make predictions for the smallest and largest observed values of x, and then use the predicted values to plot the least squares line:
_2012_Births_by_Count | _2013_Births_by_Count | _2014_Births_by_Count | _2015_Births_by_Count | _2016_Births_by_Count | |
---|---|---|---|---|---|
0 | 0.0 | 1 | 2 | 1 | 0.0 |
1 | 106.0 | 139 | 154 | 149 | 148.0 |
_2012_Births_by_Count | _2013_Births_by_Count | _2014_Births_by_Count | _2015_Births_by_Count | _2016_Births_by_Count | _2017_Births_by_Count | A_pred | |
---|---|---|---|---|---|---|---|
0 | 49.0 | 39 | 60 | 38 | 59.0 | 41 | 48.017421 |
1 | 64.0 | 42 | 49 | 59 | 62.0 | 44 | 45.085978 |
2 | 41.0 | 40 | 37 | 43 | 32.0 | 37 | 32.882623 |
3 | 37.0 | 31 | 53 | 33 | 34.0 | 45 | 37.378455 |
4 | 22.0 | 23 | 28 | 22 | 28.0 | 28 | 20.405110 |
Map our predicted Min and Max value
X_new has all the same X columns as our original dataset, but only two rows.
The first row holds the max value for each column; the second holds the min value.
preds holds the predicted Y values for the X_new max and min rows.
... Perhaps I should also draw a line of best fit for each column too
Confidence in our Model

Question: Is linear regression a high bias/low variance model, or a low bias/high variance model?
Answer: High bias/low variance. Under repeated sampling, the line will stay roughly in the same place (low variance), but the average of those models won't do a great job capturing the true relationship (high bias). Note that low variance is a useful characteristic when you don't have a lot of training data!
A closely related concept is confidence intervals. Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: If the population from which this sample was drawn was sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.
lower (2.5%) | upper (97.5%) | |
---|---|---|
Intercept | -10... | -3.... |
_2012_Births_by_Count | 0.... | 0.... |
_2013_Births_by_Count | 0.... | 0.... |
_2014_Births_by_Count | 0.... | 0.... |
_2015_Births_by_Count | -0.... | 0.... |
_2016_Births_by_Count | 0.... | 0.... |
Keep in mind that we only have a single sample of data, and not the entire population of data. The "true" coefficient is either within this interval or it isn't, but there's no way to actually know. We estimate the coefficient with the data we do have, and we show uncertainty about that estimate by giving a range that the coefficient is probably within.
Note that using 95% confidence intervals is just a convention. You can create 90% confidence intervals (which will be more narrow), 99% confidence intervals (which will be wider), or whatever intervals you like.
Hypothesis Testing and p-values Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.
(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative hypothesis may indeed be true, except that you just don't have enough data to show that.)
As it relates to model coefficients, here is the conventional hypothesis test:
- null hypothesis: There is no relationship between _2012_Births_by_Count and _2017_Births_by_Count (and thus β1 equals zero)
- alternative hypothesis: There is a relationship between _2012_Births_by_Count and _2017_Births_by_Count (and thus β1 is not equal to zero)
How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Relatedly, the p-value is the probability of observing a coefficient this far from zero if the true coefficient were actually zero:
Intercept                1....
_2012_Births_by_Count    4....
_2013_Births_by_Count    3....
_2014_Births_by_Count    9....
_2015_Births_by_Count    8....
_2016_Births_by_Count    7....
dtype: float64

If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the 95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again, using 0.05 as the cutoff is just a convention.)
In this case, the p-values for _2012, _2013, _2014, and _2016_Births_by_Count are far less than 0.05, so we believe there is a relationship between those features and _2017_Births_by_Count.
Note that we generally ignore the p-value for the intercept.
How Well Does the Model Fit the Data?
The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model.
0.8608769112394646

Is that a "good" R-squared value? It's hard to say. The threshold for a good R-squared value depends widely on the domain. Therefore, it's most useful as a tool for comparing different models.
Intercept               -7.1948
_2012_Births_by_Count    0.0548
_2013_Births_by_Count    0.3743
_2014_Births_by_Count    0.4904
_2015_Births_by_Count    0.0045
_2016_Births_by_Count    0.1413
dtype: float64

How do we interpret these coefficients? Holding the other years fixed, an increase of 1,000 in _2012_Births_by_Count is associated with an increase in _2017_Births_by_Count of about 54.8 (coefficient 0.0548).
Dep. Variable: | _2017_Births_by_Count | R-squared: | 0.861 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.857 |
Method: | Least Squares | F-statistic: | 238.9 |
Date: | Fri, 09 Aug 2019 | Prob (F-statistic): | 1.27e-80 |
Time: | 05:33:10 | Log-Likelihood: | -694.36 |
No. Observations: | 199 | AIC: | 1401. |
Df Residuals: | 193 | BIC: | 1420. |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -7.1948 | 1.865 | -3.857 | 0.000 | -10.874 | -3.516 |
_2012_Births_by_Count | 0.0548 | 0.027 | 2.065 | 0.040 | 0.002 | 0.107 |
_2013_Births_by_Count | 0.3743 | 0.060 | 6.202 | 0.000 | 0.255 | 0.493 |
_2014_Births_by_Count | 0.4904 | 0.056 | 8.771 | 0.000 | 0.380 | 0.601 |
_2015_Births_by_Count | 0.0045 | 0.027 | 0.162 | 0.871 | -0.050 | 0.059 |
_2016_Births_by_Count | 0.1413 | 0.031 | 4.613 | 0.000 | 0.081 | 0.202 |
Omnibus: | 6.567 | Durbin-Watson: | 1.916 |
---|---|---|---|
Prob(Omnibus): | 0.037 | Jarque-Bera (JB): | 10.684 |
Skew: | -0.045 | Prob(JB): | 0.00479 |
Kurtosis: | 4.132 | Cond. No. | 332. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
What are a few key things we learn from this output?
- _2012_Births_by_Count, _2013_Births_by_Count, _2014_Births_by_Count, and _2016_Births_by_Count have significant p-values, whereas _2015_Births_by_Count does not. Thus we reject the null hypothesis (that there is no association with _2017_Births_by_Count) for every year except 2015, and fail to reject it for _2015_Births_by_Count.
- Every feature is positively associated with _2017_Births_by_Count; only the intercept is slightly negative. (This is largely irrelevant, since we generally ignore the intercept; it also would not matter if, like 2015, it had failed to reject the null hypothesis.)
- This model has a high R-squared (0.861)
Feature Selection
How do I decide which features to include in a linear model? Here's one idea:
Try different models, and only keep predictors in the model if they have small p-values. Check whether the R-squared value goes up when you add new predictors. What are the drawbacks to this approach?
Linear models rely upon a lot of assumptions (such as the features being independent), and if those assumptions are violated (which they usually are), R-squared and p-values are less reliable. Using a p-value cutoff of 0.05 means that if you add 100 predictors to a model that are pure noise, 5 of them (on average) will still be counted as significant. R-squared is susceptible to overfitting, and thus there is no guarantee that a model with a high R-squared value will generalize. Below is an example:
0.8608578796125359 0.8608769112394646

R-squared will always increase as you add more features to the model, even if they are unrelated to the response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.
There is an alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
So is there a better approach to feature selection? Cross-validation. It provides a more reliable estimate of out-of-sample error, and thus is a better way to choose which of your models will best generalize to out-of-sample data. There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models. Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models.
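A minimal sketch of 5-fold cross-validation with scikit-learn; the feature matrix and response here are synthetic placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, 1.0, -0.25]) + rng.normal(scale=0.5, size=200)

# 5 folds = 5 different validation sets carved out of the training data.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())  # estimated out-of-sample MSE
```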
One-Hot Encoding
2012 Births by Count | 2013 Births by Count | 2014 Births by Count | 2015 Births by Count | 2016 Births by Count | 2017 Births by Count | centroid | ctract | geometry | |
---|---|---|---|---|---|---|---|---|---|
0 | 49.0 | 39 | 60 | 38 | 59.0 | 41 | PO... | 10100 | PO... |
1 | 64.0 | 42 | 49 | 59 | 62.0 | 44 | PO... | 10200 | PO... |
2 | 41.0 | 40 | 37 | 43 | 32.0 | 37 | PO... | 10300 | PO... |
3 | 37.0 | 31 | 53 | 33 | 34.0 | 45 | PO... | 10400 | PO... |
4 | 22.0 | 23 | 28 | 22 | 28.0 | 28 | PO... | 10500 | PO... |
2012 Births by Count | 2013 Births by Count | 2014 Births by Count | 2015 Births by Count | 2016 Births by Count | 2017 Births by Count | ctract | |
---|---|---|---|---|---|---|---|
0 | 49.0 | 39 | 60 | 38 | 59.0 | 41 | 10100 |
1 | 64.0 | 42 | 49 | 59 | 62.0 | 44 | 10200 |
2 | 41.0 | 40 | 37 | 43 | 32.0 | 37 | 10300 |
3 | 37.0 | 31 | 53 | 33 | 34.0 | 45 | 10400 |
4 | 22.0 | 23 | 28 | 22 | 28.0 | 28 | 10500 |
We have to represent the categorical feature numerically, but we can't simply code it as 0=rural, 1=suburban, 2=urban, because that would imply an ordered relationship between suburban and urban (and thus that urban is somehow "twice" the suburban category).
Instead, we create dummy variables:
2012 Births by Count | 2013 Births by Count | 2014 Births by Count | 2015 Births by Count | 2016 Births by Count | 2017 Births by Count | ctract | ctract_10200 | ctract_10300 | ctract_10400 | ctract_10500 | ctract_20100 | ctract_20200 | ctract_20300 | ctract_30100 | ctract_30200 | ctract_40100 | ctract_40200 | ctract_60100 | ctract_60200 | ctract_60300 | ctract_60400 | ctract_70100 | ctract_70200 | ctract_70300 | ctract_70400 | ctract_80101 | ctract_80102 | ctract_80200 | ctract_80301 | ctract_80302 | ctract_80400 | ctract_80500 | ctract_80600 | ctract_80700 | ctract_80800 | ctract_90100 | ctract_90200 | ctract_90300 | ctract_90400 | ... | ctract_270701 | ctract_270702 | ctract_270703 | ctract_270801 | ctract_270802 | ctract_270803 | ctract_270804 | ctract_270805 | ctract_270901 | ctract_270902 | ctract_270903 | ctract_271001 | ctract_271002 | ctract_271101 | ctract_271102 | ctract_271200 | ctract_271300 | ctract_271400 | ctract_271501 | ctract_271503 | ctract_271600 | ctract_271700 | ctract_271801 | ctract_271802 | ctract_271900 | ctract_272003 | ctract_272004 | ctract_272005 | ctract_272006 | ctract_272007 | ctract_280101 | ctract_280102 | ctract_280200 | ctract_280301 | ctract_280302 | ctract_280401 | ctract_280402 | ctract_280403 | ctract_280404 | ctract_280500 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49.0 | 39 | 60 | 38 | 59.0 | 41 | 10100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 64.0 | 42 | 49 | 59 | 62.0 | 44 | 10200 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 41.0 | 40 | 37 | 43 | 32.0 | 37 | 10300 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 37.0 | 31 | 53 | 33 | 34.0 | 45 | 10400 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 22.0 | 23 | 28 | 22 | 28.0 | 28 | 10500 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows Ă— 205 columns
Why do we only need k-1 dummy variables, not k? Because k-1 dummies capture all of the information about the feature and implicitly define the omitted level as the baseline. (In general, if you have a categorical feature with k levels, you create k-1 dummy variables.)
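A small sketch of the k-1 encoding with pandas, using the rural/suburban/urban example from above; drop_first=True drops the baseline level:

```python
import pandas as pd

df = pd.DataFrame({"Area": ["rural", "suburban", "urban", "rural"]})
dummies = pd.get_dummies(df["Area"], prefix="Area", drop_first=True)
print(dummies)  # Area_suburban and Area_urban; rural is the implicit baseline
```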
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False) array([ 5.01789652e-03, 2.82093942e-02, -5.11060660e-03, -3.89506920e-03, , 6.30364071e-03, -2.75592185e-02, 4.03538526e-07, -7.05904702e-02, , -2.62541104e-02, 4.98347149e-01, 1.97954600e-01, -2.74147758e-01, , -3.43483104e-01, -1.26575571e-01, -2.11275765e-01, -4.13686086e-01, , 2.03280825e-01, 1.74182528e-01, -5.65360615e-01, -2.05796262e-01, , -4.17322963e-01, 7.57817793e-02, 4.42163091e-01, -2.37286514e-01, , -3.26927778e-01, -5.61378089e-02, 1.02119043e-01, -3.85539566e-01, , -4.18949762e-02, -4.92567060e-02, 4.03449925e-01, -4.72343178e-01, , -3.47964220e-01, -2.99397195e-01, 1.18206788e-02, -2.88980937e-01, , 2.32276252e-01, 1.46307207e-01, -7.02015446e-03, 3.47518427e-02, , -9.96328892e-02, -4.20163408e-01, -2.42348673e-01, -3.36781034e-01, , -2.47585854e-01, -4.85139378e-01, -2.85306808e-01, -8.33809898e-02, , -2.39559310e-01, -3.24077023e-01, -1.69704119e-01, -3.53696504e-02, , -3.59200172e-02, -8.34328447e-02, 1.49083606e-01, 8.44962855e-03, , 4.08503078e-02, -1.56888292e-01, -1.10467348e+00, -1.20324076e-01, , -4.89942089e-02, 2.06411272e-01, 2.66198664e-01, -1.35015942e-01, , -4.11475363e-01, -1.47565919e-01, 1.07268585e-01, 6.04407903e-02, , -3.80713811e-01, -2.74666077e-01, -3.68234971e-01, -5.24365162e-02, , -1.62987650e-01, -2.34502560e-01, -1.68560109e-01, -7.46277908e-01, , 2.24848371e-01, 4.84102064e-02, -5.74381603e-01, 2.27208547e-01, , 3.89080886e-02, 2.58464268e-01, 2.37267429e-01, 8.34705289e-02, , -3.35492814e-01, -1.21339815e-01, -9.56313604e-02, -5.73365959e-01, , -9.33791821e-02, -4.88610532e-01, -1.01115012e-01, 3.07066435e-01, , -5.19014403e-01, -2.35269301e-02, -1.13894052e-01, -5.62728959e-01, , -7.13761441e-01, -1.49381236e-01, -4.66695354e-01, -4.57121936e-01, , -9.00914506e-01, -5.13884450e-01, 2.49570865e-01, 2.69352383e-01, , -5.70537579e-02, -4.15040000e-01, -3.85756258e-01, -4.21966264e-01, , -2.07624047e-01, -2.54500368e-01, -1.74247495e-01, -6.13382450e-01, , -3.20014626e-01, -4.38006680e-01, 5.65624660e-02, 2.91031659e-01, , 2.96937988e-02, -1.88173203e-01, -3.31275162e-01, -2.04396061e-01, , -3.78796198e-01, 2.72117792e-01, -3.08606692e-01, -3.93049820e-01, , 1.50286689e-01, -1.33900580e-01, 3.39595137e-01, 9.40968089e-02, , -2.58122291e-03, -4.51778861e-01, -4.08765763e-01, -7.40820701e-01, , 4.89815426e-01, -6.27111011e-01, -2.87048409e-02, -8.41748929e-01, , -4.40951878e-01, -4.19425835e-01, -3.86796717e-01, -2.32334523e-01, , 2.62540971e-01, 9.88602548e-02, -1.23736058e-01, 3.02126388e-02, , 5.27128208e-02, -2.17247106e-01, -5.51295332e-01, 5.05078751e-01, , -2.45401947e-01, 1.21494040e-01, 1.15289096e-02, -7.09341148e-01, , 1.15802429e-01, 5.76976831e-02, -5.33097298e-02, -6.05233658e-01, , 1.69056572e-01, -1.02774876e-01, -5.59055791e-01, -7.83825564e-01, , -7.69899156e-01, -5.84046935e-01, -3.20980227e-01, -1.82631770e-01, , 5.48169687e-02, 6.30844712e-01, -3.31358487e-01, -3.66148660e-01, , -1.46899614e-01, -2.95586259e-01, -9.53017786e-02, -7.01265629e-01, , 2.02367260e-02, -3.43963700e-02, -8.20683499e-02, -3.76452185e-01, , -6.40674692e-01, -2.27789251e-01, -1.33203919e-01, -6.89731425e-01, , 2.14841966e-01, -1.84254648e-01, -3.06131648e-01, -5.81607635e-01, , -2.68500634e-01, -5.98094317e-01, -8.59661335e-02, -1.24054718e-01, , 1.00258810e-02, -5.69871190e-01, -1.23297208e-01, 2.04223093e-01, , 9.24465310e-02, -1.07292144e+00, 3.14535139e-01, -3.50649350e-02, , 2.29626265e-01, -2.82603589e-01, -5.84840233e-01, 5.45894247e-02, , -2.74493696e-01, 
-3.24250886e-01, -2.13903790e-01, 1.24591649e-01])

[('TV', 0.04574401036331379), ('Radio', 0.18786669552525814), ('Newspaper', -0.0010876977267108706), ('IsLarge', 0.077396607497479411), ('Area_suburban', -0.10656299015958708), ('Area_urban', 0.26813802165220019)]
How do we interpret the coefficients?
Holding all other variables fixed, being a suburban area is associated with an average decrease in Sales of 106.56 widgets (as compared to the baseline level, which is rural). Being an urban area is associated with an average increase in Sales of 268.13 widgets (as compared to rural). A final note about dummy encoding: If you have categories that can be ranked (e.g., strongly disagree, disagree, neutral, agree, strongly agree), you can potentially use a single dummy variable and represent the categories numerically (such as 1, 2, 3, 4, 5).
What Didn't We Cover?
- Detecting collinearity
- Diagnosing model fit
- Transforming predictors to fit non-linear relationships
- Interaction terms
- Assumptions of linear regression
- And so much more!

You could certainly go very deep into linear regression, and learn how to apply it really, really well. It's an excellent way to start your modeling process when working on a regression problem. However, it is limited by the fact that it can only make good predictions if there is a linear relationship between the features and the response, which is why more complex methods (with higher variance and lower bias) will often outperform linear regression.
Therefore, we want you to understand linear regression conceptually, understand its strengths and weaknesses, be familiar with the terminology, and know how to apply it. However, we also want to spend time on many other machine learning models, which is why we aren't going deeper here.
Resources
To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning, from which this lesson was adapted. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter. To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression. This introduction to linear regression is much more detailed and mathematically thorough, and includes lots of good advice. This is a relatively quick post on the assumptions of linear regression.
More Things with the Data: SPLOM, 3D Scatter Chart, and Correlation Matrix
Making your own prophecies
https://towardsdatascience.com/forecasting-with-prophet-d50bbfe95f91
https://www.kaggle.com/neuromusic/avocado-prices
- Prophet only takes data as a dataframe with a ds (datestamp) column and a y column (the value we want to forecast). So first, let's convert the dataframe to the appropriate format.
- Create an instance of the Prophet class and then fit our dataframe to it.
- Create a dataframe with the dates for which we want a prediction to be made with make_future_dataframe(). Then specify the number of days to forecast using the periods parameter.
- Call predict to make a prediction and store it in the forecast dataframe. What’s neat here is that you can inspect the dataframe and see the predictions as well as the lower and upper boundaries of the uncertainty interval.
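A sketch of those steps in code; the toy dataframe below is a placeholder for your own series, and depending on the installed version the package is imported as fbprophet or prophet:

```python
import pandas as pd
from prophet import Prophet  # older installs: from fbprophet import Prophet

# df must have exactly these two columns: 'ds' (datestamp) and 'y' (value).
df = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=365, freq="D"),
    "y": range(365),  # placeholder values; substitute your own series
})

m = Prophet()
m.fit(df)

future = m.make_future_dataframe(periods=90)  # extend 90 days past the last date
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```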
ds | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|
254 | 201... | 1.... | 1.... | 1.... |
255 | 201... | 1.... | 1.... | 1.... |
256 | 201... | 1.... | 1.... | 1.... |
257 | 201... | 1.... | 1.... | 1.... |
258 | 201... | 1.... | 1.... | 1.... |
Last but not least, we can evaluate the forecast using Prophet’s cross validation procedure.
- Use the cross_validation() function on the model and specify the forecast horizon with the horizon parameter.
- Next, call performance_metrics() to get a table with various prediction performance metrics.
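A sketch of those two calls, assuming the fitted model m from the sketch above; the 90-day horizon is an arbitrary choice:

```python
from prophet.diagnostics import cross_validation, performance_metrics

df_cv = cross_validation(m, horizon="90 days")
df_p = performance_metrics(df_cv)
print(df_p[["horizon", "mse", "rmse", "mae", "mape", "coverage"]].head())
```

The table below shows the kind of output performance_metrics() produces.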
horizon | mse | rmse | mae | mape | coverage | |
---|---|---|---|---|---|---|
0 | 9 days | 0.... | 0.... | 0.... | 0.... | 0.... |
1 | 10 ... | 0.... | 0.... | 0.... | 0.... | 0.... |
2 | 11 ... | 0.... | 0.... | 0.... | 0.... | 0.... |
3 | 12 ... | 0.... | 0.... | 0.... | 0.... | 0.... |
4 | 13 ... | 0.... | 0.... | 0.... | 0.... | 0.... |
We can plot the mean absolute percent error (MAPE) over the forecast horizon to determine how trustworthy our forecast is. Here, we use a percentage error instead of the mean squared error (MSE) simply because it is easier to interpret.
We see for this forecast that accuracy decreases as the forecast horizon expands. Overall, the error rises from 5% during the first month to around 10% over the course of the next two months.
Machine Learning: Polynomial Regression with Python SKlearn
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Recall and Precision Diagnostic (ROC Curve) Example
Create Confusion Matrix: Threshold, TP, FP, TN, FN
Calculate: Precision, Recall, F1, TPR, FPR
Create ROC Curve
Required Imports
Create Confusion Matrix Numbers
Threshold, TP, FP, TN, FN
Calculate Precision, Recall, F1, TPR, FPR
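A sketch of how these numbers could be produced; y_true and y_score below are synthetic stand-ins for the notebook's labels and predicted probabilities:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
y_true = rng.integers(0, 2, size=100)
y_score = np.clip(y_true * 0.3 + rng.uniform(size=100) * 0.7, 0, 1)

rows = []
for threshold in np.linspace(0, 1, 11):
    pred = (y_score >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tn = np.sum((pred == 0) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    recall = tp / (tp + fn) if tp + fn else 0.0      # recall == TPR
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    rows.append((threshold, recall, precision, f1, recall, fpr))

metrics = pd.DataFrame(
    rows, columns=["threshold", "recall", "precision", "f1", "tpr", "fpr"])
print(metrics)
```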
tp 50.0 fp 50.0 tn 0.0 fn 0.0
tp 48.0 fp 47.0 tn 3.0 fn 2.0
tp 47.0 fp 40.0 tn 9.0 fn 4.0
tp 45.0 fp 31.0 tn 16.0 fn 8.0
tp 44.0 fp 23.0 tn 22.0 fn 11.0
tp 42.0 fp 16.0 tn 29.0 fn 13.0
tp 36.0 fp 12.0 tn 34.0 fn 18.0
tp 30.0 fp 11.0 tn 38.0 fn 21.0
tp 20.0 fp 4.0 tn 43.0 fn 33.0
tp 12.0 fp 3.0 tn 45.0 fn 40.0
tp 0.0 fp 0.0 tn 50.0 fn 50.0

threshold | recall | precision | f1 | tpr | fpr | |
---|---|---|---|---|---|---|
0 | 0.0 | 1 | 0.5 | 0.666667 | 1 | 1 |
1 | 0.1 | 0.96 | 0.505263 | 0.662069 | 0.96 | 0.94 |
2 | 0.2 | 0.921569 | 0.54023 | 0.681159 | 0.921569 | 0.816327 |
3 | 0.3 | 0.849057 | 0.592105 | 0.697674 | 0.849057 | 0.659574 |
4 | 0.4 | 0.8 | 0.656716 | 0.721311 | 0.8 | 0.511111 |
5 | 0.5 | 0.763636 | 0.724138 | 0.743363 | 0.763636 | 0.355556 |
6 | 0.6 | 0.666667 | 0.75 | 0.705882 | 0.666667 | 0.26087 |
7 | 0.7 | 0.588235 | 0.731707 | 0.652174 | 0.588235 | 0.22449 |
8 | 0.8 | 0.377358 | 0.833333 | 0.519481 | 0.377358 | 0.0851064 |
9 | 0.9 | 0.230769 | 0.8 | 0.358209 | 0.230769 | 0.0625 |
10 | 1.0 | 0 | 0 | 0 | 0 | 0 |
Receiver Operating Characteristic (ROC) Curve
Cross Validation
To pick the optimal model, we need to use a validation set. Cross validation is even better than a single validation set because it uses numerous validation sets created from the training data; in this case, 5 different validation sets (5-fold cross validation). The model that performs best in cross validation is usually the optimal model, because it has shown that it can learn the relationships without overfitting.
We want to try to capture the data using a polynomial function. A polynomial is defined by its degree, the highest power of x. A line has a degree of 1 because it is of the form $y = b_1 x + b_0$, where $b_1$ is the slope and $b_0$ is the intercept. A third-degree polynomial has the form $y = b_3 x^3 + b_2 x^2 + b_1 x + b_0$, and so on. The higher the degree of the polynomial, the more flexible the model. A more flexible model is prone to overfitting because it can "bend" to follow the training data.
The following function creates a polynomial of the specified degree and plots the results. We can use these results to determine the optimal degree, achieving the right balance between over- and underfitting. (A sketch of such a function appears below.)
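The original function isn't reproduced here, but a minimal sketch of the idea with scikit-learn, on synthetic data, might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(-1, 1, size=120)).reshape(-1, 1)
y = np.sin(2 * x).ravel() + rng.normal(scale=0.1, size=120)

def cv_error(degree, x, y):
    """Fit a polynomial of the given degree; return its 5-fold CV error (MSE)."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    return -scores.mean()

for degree in range(1, 13):
    print(degree, cv_error(degree, x, y))
```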
degrees | cross_valid | |
---|---|---|
0 | 4 | 0.010549 |
1 | 5 | 0.010637 |
2 | 7 | 0.010665 |
3 | 6 | 0.010887 |
4 | 8 | 0.011182 |
5 | 3 | 0.011695 |
6 | 9 | 0.011757 |
7 | 11 | 0.011769 |
8 | 10 | 0.011902 |
9 | 12 | 0.012642 |
Final Model with Comparison
The model with the lowest cross validation error had four degrees. Therefore, we will use a 4th degree polynomial for the final model.
Bayesian Linear Regression (INCOMPLETE)
Position | Level | Salary | |
---|---|---|---|
0 | Business Analyst | 1 | 45000 |
1 | Junior Consultant | 2 | 50000 |
2 | Senior Consultant | 3 | 60000 |
3 | Manager | 4 | 80000 |
4 | Country Manager | 5 | 110000 |
The primary variable of interest is the salary, so let's take a look at its distribution to check for skew:
[Histogram: distribution of Salary across 14 bins from 45,000 to 1,000,000. Counts per bin: 5, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, so the distribution is heavily right-skewed. The notebook's axis labels ('Level') and title ('Distribution of Final Grades') are leftovers from the template this was adapted from.]