--- title: "Exercise 5" output: word_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Initial settings Set the working directory: ```{r message=FALSE, warning=FALSE} setwd("C:/Users/Favero/Dropbox/exam/SPORTMAN/R/4_WoW") rm(list=ls()) ``` Install and load the relevant packages: ```{r NBAdata, message=FALSE, warning=FALSE, paged.print=FALSE} listofpackages <- c("lrmest", "dplyr","assertthat","bindrcpp","glue","pkgconfig","utf8","cli","ellipse","reshape2","ggplot2","dygraphs","aod") for (j in listofpackages){ if(sum(installed.packages()[, 1] == j) == 0) { install.packages(j) } library(j, character.only = T) } ``` Read the csv file to extract the data for the analysis and check the type of data: ```{r} NBAdata=read.csv("C:/Users/favero/Dropbox/exam/SPORTMAN/R/data_L_2020/Teams_overall2020.csv", header = T, stringsAsFactors = F, sep = ";") typeof(NBAdata)#to check the type of data ``` ## Exercise 5.1 Using regression analysis assess the relative performance of the NBA Efficiency measure and of the model proposed in the class to explain WINS over the sample of seasons 1994-2005 (omitting season 1999). # Data transformation Construct the new variables: ```{r} NBAdata$FGMISS=NBAdata$FGA-NBAdata$FG NBAdata$FTMISS=NBAdata$FTA-NBAdata$FT NBAdata$OFGMISS=NBAdata$OFGA-NBAdata$OFG NBAdata$OFTMISS=NBAdata$OFTA-NBAdata$OFT NBAdata$MISS=NBAdata$FGMISS+NBAdata$FTMISS NBAdata$OMISS=NBAdata$OFGMISS+NBAdata$OFTMISS NBAdata$W.=NBAdata$W/(NBAdata$W+NBAdata$L) ``` # NBA Efficency Measures Compute the NBA Efficency Measures: ```{r} NBAdata$eff_NBA=NBAdata$PTS+NBAdata$TRB+NBAdata$STL+NBAdata$BLK-NBAdata$MISS-NBAdata$TOV NBAdata$eff_O_NBA=NBAdata$OPTS+NBAdata$OTRB+NBAdata$OSTL+NBAdata$OBLK-NBAdata$OMISS-NBAdata$OTOV NBAdata$rel_effNBA=NBAdata$eff_NBA-NBAdata$eff_O_NBA ``` #Modelling process Employed possessions: ```{r} NBAdata$empl_poss=NBAdata$FGA + 0.45*NBAdata$FTA + NBAdata$TOV - NBAdata$ORB NBAdata$ptsxgame=NBAdata$PTS/NBAdata$G NBAdata$ptsxposs=NBAdata$PTS/NBAdata$empl_poss ``` To calculate the acquired possessions, we have to do some additional steps before. Field goal attempt differenced: ```{r} NBAdata$FGAD=NBAdata$FGA-NBAdata$OFG-NBAdata$OTOV-NBAdata$TRB+NBAdata$TOV ``` Regression Analysis to check variable construction: ```{r} reg9405 = subset(NBAdata,(Season<2006 & Season>1993 & Season!=1999),select=c(FGAD,OFT,FTA)) reg_1 = lm(reg9405$FGAD ~ reg9405$OFT + reg9405$FTA ) summary(reg_1) ``` Hypothesis Testing: ```{r} wald.test(vcov(reg_1), coef(reg_1), H0 = c(0.45, -0.45), df = 313, L = matrix(c(0, 0, 1, 0, 0, 1), ncol = 3), verbose = T) ``` Create a proxy using the regressions results: NBAdata$TEAM_R=coef(reg_1)[1]+reg_1$residuals ```{r} NBAdata$TEAM_R=NBAdata$FGAD-0.45*NBAdata$OFT+0.45*NBAdata$FTA NBAdata$TEAM_R2=NBAdata$FGA-NBAdata$OFG-NBAdata$OTOV-NBAdata$TRB+NBAdata$TOV-0.45*NBAdata$OFT+0.45*NBAdata$FTA TeamReb_test=NBAdata$TEAM_R2-NBAdata$TEAM_R ``` The TeamReb_test returns a serie of 0, so the result of the test is positive and the two equations give the same result. ```{r, echo = FALSE} plot(x = NBAdata$Season, y = NBAdata$TEAM_R, main = "Estimated Team Rebounds", ylab = "TR",ylim = c(200,4000), xlab = "Seasons",col = "red") points(x = NBAdata$Season, y = NBAdata$DRB,col = "blue") ``` Acquired possessions: ```{r} NBAdata$acq_poss=NBAdata$OTOV + NBAdata$DRB+NBAdata$TEAM_R+ NBAdata$OFG + 0.45*NBAdata$OFT NBAdata$ptsaxgame=NBAdata$OPTS/NBAdata$G NBAdata$ptsall_poss=NBAdata$OPTS/NBAdata$acq_poss ``` Now counstruct the efficency as the difference between the points per possession and the opponent points per acquired possession: ```{r} NBAdata$eff=(NBAdata$ptsxposs-NBAdata$ptsall_poss) ``` Model estimation to measure the impact of defensive and offensive efficiency: ```{r} s9405 = subset(NBAdata,(Season>1993 & Season<2006 & Season!=1999)) ``` ```{r, echo = FALSE} plot(x = s9405$Season, y = s9405$W, main = "WINS", ylab = "TR",ylim = c(0,85), xlab = "Seasons",col = "red") ``` Regression using NBA efficency measures: ```{r} reg_NBA_eff = lm(s9405$W ~ s9405$eff_NBA + s9405$eff_O_NBA) summary(reg_NBA_eff) ``` The R squared of the regression is: ```{r} summary(reg_NBA_eff)$r.squared ``` The adjusted R squared of the regression is: ```{r} adjRsq_NBA_eff=0.9118 ``` Regression using NBA relative efficiency measure: ```{r} reg_NBA_releff = lm(s9405$W ~ s9405$rel_effNBA ) summary(reg_NBA_releff) ``` The R squared of the regression is: ```{r} summary(reg_NBA_releff)$r.squared ``` The adjusted R squared of the regression is: ```{r} adjRsq_NBA_releff=0.9108 ``` Regression using model efficiency measure: ```{r} reg_MODEL=lm(s9405$W ~ s9405$eff) summary(reg_MODEL) ``` The R squared of the regression is: ```{r} summary(reg_MODEL)$r.squared ``` The adjusted R squared of the regression is: ```{r} adjRsq_MODEL= 0.9479 ``` Regression using both the efficiency measures: ```{r} reg_both=lm(s9405$W ~ s9405$eff + s9405$eff_NBA + s9405$eff_O_NBA ) summary(reg_both) ``` The R squared of the regression is: ```{r} summary(reg_both)$r.squared ``` The adjusted R squared of the regression is: ```{r} adjRsq_both= 0.9479 ``` # Answer The performance of the model-constructed efficiency seems to be higher, considering that the parameter of the first 3 regressions are all significant at 0.001 level and the R square is higher when the regressor is the modelled efficiency. Hence, it is able to better explain the variability of the model. Moreover, adding in the third regression the NBA efficiency measures, the adjR^2 (purified by degrees of freedom) remains the same (0.9479). This means that, adding them, there are not additional informations. In fact, in the fourth (reg_both) regression, the coefficients of NBA efficiency measures are not significant at any level. This is the model that allows us to assess the validity of the model against the NBA efficiency. In fact, there is evidence that the new efficency model dominates the NBA efficiency model in term of explaining WINS. ## Exercise 5.2 Consider the baseline MODEL for WINS estimated in the lecture over the sample 1994-2005. # Exercise 5.2.1 The model estimated in the lecture drops Season 1998-99. Why? # Answer Season 1999 is excluded because of strikes that occured during that year. Therefore, number of games is smaller than a regular season. All variables are affected by this situation. What happens to the results obtained in class if you extend the sample by including Season 1998-1999? # Impact calculation Include the Season 1998-1999: ```{r} regeff_1999 = subset(NBAdata,(Season>1993 & Season<2006)) ``` Repeat the two regressions. The one on NBA efficency measure: ```{r} reg_NBA_1999 = lm(regeff_1999$W ~ regeff_1999$eff_NBA + regeff_1999$eff_O_NBA ) summary(reg_NBA_1999) ``` And the one on the model efficency measure: ```{r} reg_MODEL_1999=lm(regeff_1999$W ~ regeff_1999$eff) summary(reg_MODEL_1999) ``` Calculate and compare the R squared ratios: ```{r} R2_ratio_NBA = summary(reg_NBA_1999)$r.squared/summary(reg_NBA_eff)$r.squared R2_ratio_NBA ``` The R2_ratio_NBA is greater than 1, so linear regression is not sensitive to the presence of outliers. ```{r} R2_ratio_model = summary(reg_MODEL_1999)$r.squared/summary(reg_MODEL)$r.squared R2_ratio_model ``` The R2_ratio_model is lower than 1, so linear regression is sensitive to the presence of outliers. # Answer R2 decrease for all regressions due to Season 1999's strikes (outlier). The drop in R2 is bigger for the regression where the regressor is the modelled efficiency. The coefficents of both regressions remain significant at 0.001 level. # Exercise 5.2.2 Add to the baseline model a Season effect and a Team effect and test their significance. Using the Dataset s9405, the two dummies are constructed as follows: ```{r} s9405$SeasonEffect<- ifelse(s9405$Season=="2002",1,0) s9405$TeamEffect<- ifelse(s9405$Team=="Chicago Bulls",1,0) ``` Estimation of the model including the two dummies: ```{r} reg_M_w_dummies=lm(s9405$W ~ s9405$eff + s9405$SeasonEffect + s9405$TeamEffect) summary(reg_M_w_dummies) ``` The R squared of the model with the two dummies is: ```{r} summary(reg_M_w_dummies)$r.squared ``` # Answer The coefficients of the two dummies are not significant at any level and the Rsqr does not increase. Hence, Season Effect and Team Effect dummies are not able to add relevant informations to the model and to explain additional variance. ## Exercise 5.3 Compare the Team Rebounds constructed by the formula used in the baseline programme with the proxy obtained by considering the sum of residuals and estimated constant in the regression of Total Possession Differenced on Free Throws and Opponent Free Throws. Construction of Team Rebounds with the sum of residuals and constant in the regression of Total Possession Differenced on Free Throws and Opponent Free Throws: ```{r} NBAdata$empl_poss_diff=NBAdata$FGA+NBAdata$TOV-NBAdata$ORB NBAdata$acq_poss_diff=NBAdata$OTOV+NBAdata$DRB+NBAdata$OFG NBAdata$tot_poss_diff=NBAdata$empl_poss_diff-NBAdata$acq_poss_diff NBAdata$TEAM_R=NBAdata$FGAD-0.45*NBAdata$OFT+0.45*NBAdata$FTA ``` Extract the data of interest: ```{r} regTPD = subset(NBAdata,(Season<2006 & Season>1993 & Season!=1999),select=c(tot_poss_diff,OFT,FTA,TEAM_R)) ``` Regression of Total Possession Differenced on Free Throws and Opponent Free Throws: ```{r} reg_2 = lm(regTPD$tot_poss_diff ~ regTPD$OFT + regTPD$FTA ) summary(reg_2) ``` ```{r} TEAM_R_TPD=coef(reg_2)[1]+reg_2$residuals ``` ```{r} compare=abs(TEAM_R_TPD-regTPD$TEAM_R) print(sum(compare)/length(compare)) ``` # Answer The difference between TEAM_R_TPD and regTPD$TEAM_R is on average 51 in absolute value. ## Exercise 5.4 Using the simulation model, solve the following points. # Exercise 5.4.1 Derive the effect of a made free throws on WINS. # Deterministic Simulation Baseline no shock, specify values for each stats: ```{r} FTA=mean(s9405$FTA) TOV=mean(s9405$TOV) ORB=mean(s9405$ORB) X3P=mean(s9405$X3P) X2P=mean(s9405$X2P) FGMISS=mean(s9405$FGMISS) FGA=X3P+X2P+FGMISS FT=mean(s9405$FT) OTOV=mean(s9405$OTOV) DRB=mean(s9405$DRB) TEAM_R=mean(s9405$TEAM_R) OFG=mean(s9405$OFG) OFT=mean(s9405$OFT) OX3P=mean(s9405$O3P) OX2P=mean(s9405$O2P) OFT=mean(s9405$OFT) ``` Identities: ```{r} PTS=3*X3P+2*X2P+1*FT empl_poss=FGA + 0.44*FTA + TOV - ORB pts_poss=PTS/empl_poss OPTS=3*OX3P+2*OX2P+1*OFT acq_poss=OTOV + DRB+TEAM_R+ OFG + 0.45*OFT ptsall_poss=OPTS/acq_poss ``` Alternative scenario, schock on FT_AL: ```{r} FTA_AL=mean(s9405$FTA) TOV_AL=mean(s9405$TOV) ORB_AL=mean(s9405$ORB) X3P_AL=mean(s9405$X3P) X2P_AL=mean(s9405$X2P) FGMISS_AL=mean(s9405$FGMISS) FGA_AL=X3P_AL+X2P_AL+FGMISS_AL FT_AL=mean(s9405$FT)+1 OTOV_AL=mean(s9405$OTOV) DRB_AL=mean(s9405$DRB) TEAM_R_AL=mean(s9405$TEAM_R) OFG_AL=mean(s9405$OFG) OFT_AL=mean(s9405$OFT) OX3P_AL=mean(s9405$O3P) OX2P_AL=mean(s9405$O2P) OFT_AL=mean(s9405$OFT) ``` Identities: ```{r} PTS_AL=3*X3P_AL+2*X2P_AL+1*FT_AL empl_poss_AL=FGA_AL + 0.44*FTA_AL + TOV_AL - ORB_AL pts_poss_AL=PTS_AL/empl_poss_AL OPTS_AL=3*OX3P_AL+2*OX2P_AL+1*OFT_AL acq_poss_AL=OTOV_AL + DRB_AL+TEAM_R_AL+ OFG_AL + 0.45*OFT_AL ptsall_poss_AL=OPTS_AL/acq_poss_AL ``` Now add the last equation and simulate WINS under the two scenarios: ```{r} W_BL <- reg_MODEL$coefficients[1] +reg_MODEL$coefficients[2]*(pts_poss-ptsall_poss) W_AL <- reg_MODEL$coefficients[1] +reg_MODEL$coefficients[2]*(pts_poss_AL-ptsall_poss_AL) VAL=W_AL-W_BL VAL ``` # Answer Estimation of WINS change of +3% by increasing Free Throw by 1. # Exercise 5.4.2 Compare the distribution of wins form the simulated baseline model with that observed in the data and comment on the difference between them. # Stochastic Simulation Baseline no shock, specify values for each stats: ```{r} FTA=mean(s9405$FTA) TOV=mean(s9405$TOV) ORB=mean(s9405$ORB) X3P=mean(s9405$X3P) X2P=mean(s9405$X2P) FGMISS=mean(s9405$FGMISS) FGA=X3P+X2P+FGMISS FT=mean(s9405$FT) OTOV=mean(s9405$OTOV) DRB=mean(s9405$DRB) TEAM_R=mean(s9405$TEAM_R) OFG=mean(s9405$OFG) OFT=mean(s9405$OFT) OX3P=mean(s9405$O3P) OX2P=mean(s9405$O2P) OFT=mean(s9405$OFT) ``` Identities: ```{r} PTS=3*X3P+2*X2P+1*FT empl_poss=FGA + 0.44*FTA + TOV - ORB pts_poss=PTS/empl_poss OPTS=3*OX3P+2*OX2P+1*OFT acq_poss=OTOV + DRB+TEAM_R+ OFG + 0.45*OFT ptsall_poss=OPTS/acq_poss ``` WINS from data: ```{r} hist(NBAdata$W, breaks = seq(min(NBAdata$W), max(NBAdata$W), l = 20+1),prob=TRUE, main = "WINS from data") curve(dnorm(x,mean=mean(NBAdata$W),sd=sd(NBAdata$W)),col='darkblue',lwd=2,add=TRUE) up=mean(NBAdata$W)+2*sd(NBAdata$W) down=mean(NBAdata$W)-2*sd(NBAdata$W) ``` On average, data tell us that number of wins in a Season is 40. The 95% of observations are in the interval (14.7,66). Preparing to simulate two scenarios simulation: ```{r} nrep <- 10^3 tT <- 1 # 1 obs W_BL <- VAL_STAT <- array(0, c(tT, nrep)) # the containers ``` ```{r} for (i in 1:nrep){ x <- rt( n=tT, df=314 )*summary(reg_MODEL)$coefficients[2,2] W_BL[, i] <- reg_MODEL$coefficients[1] +(reg_MODEL$coefficients[2]+x)*(pts_poss-ptsall_poss) } VAL=W_BL ``` ```{r} hist(VAL, breaks = seq(min(VAL), max(VAL), l = 20+1),prob=TRUE, main = "histogram of effects") curve(dnorm(x,mean=mean(VAL),sd=sd(VAL)),col='darkblue',lwd=2,add=TRUE) up_BL=mean(VAL)+2*sd(VAL) down_BL=mean(VAL)-2*sd(VAL) ``` # Answer Stochastic smiulations with baseline model predicts 41 wins on average and very low variance. It is almost deterministic because range of predicted values is very small. # Thank you! Gabriele Ammoni 1828664 Mario Signoretti 3005442 Giuditta ViganĂ² 3010382