---
title: "Exercise 5"
output:
  word_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Initial settings

Set the working directory:

```{r message=FALSE, warning=FALSE}
setwd("C:/Users/Favero/Dropbox/exam/SPORTMAN/R/4_WoW")
rm(list=ls())
```

Install and load the relevant packages:

```{r NBAdata, message=FALSE, warning=FALSE, paged.print=FALSE}
listofpackages <- c("lrmest", "dplyr","assertthat","bindrcpp","glue","pkgconfig","utf8","cli","ellipse","reshape2","ggplot2","dygraphs","aod")

for (j in listofpackages){
  if(sum(installed.packages()[, 1] == j) == 0) {
    install.packages(j)
  }
  library(j, character.only = T)
}
```

Read the csv file to extract the data for the analysis and check the type of data:

```{r}
NBAdata=read.csv("C:/Users/favero/Dropbox/exam/SPORTMAN/R/data_L_2020/Teams_overall2020.csv", header = T, stringsAsFactors = F, sep = ";")
typeof(NBAdata)#to check the type of data
```

## Exercise 5.1

Using regression analysis assess the relative performance of the NBA Efficiency measure and of the model proposed in the class to explain WINS over the sample of seasons 1994-2005 (omitting season 1999).

# Data transformation

Construct the new variables:

```{r}
NBAdata$FGMISS=NBAdata$FGA-NBAdata$FG
NBAdata$FTMISS=NBAdata$FTA-NBAdata$FT
NBAdata$OFGMISS=NBAdata$OFGA-NBAdata$OFG
NBAdata$OFTMISS=NBAdata$OFTA-NBAdata$OFT
NBAdata$MISS=NBAdata$FGMISS+NBAdata$FTMISS
NBAdata$OMISS=NBAdata$OFGMISS+NBAdata$OFTMISS
NBAdata$W.=NBAdata$W/(NBAdata$W+NBAdata$L)
```

# NBA Efficency Measures

Compute the NBA Efficency Measures:

```{r}
NBAdata$eff_NBA=NBAdata$PTS+NBAdata$TRB+NBAdata$STL+NBAdata$BLK-NBAdata$MISS-NBAdata$TOV
NBAdata$eff_O_NBA=NBAdata$OPTS+NBAdata$OTRB+NBAdata$OSTL+NBAdata$OBLK-NBAdata$OMISS-NBAdata$OTOV
NBAdata$rel_effNBA=NBAdata$eff_NBA-NBAdata$eff_O_NBA
```

#Modelling process

Employed possessions:

```{r}
NBAdata$empl_poss=NBAdata$FGA + 0.45*NBAdata$FTA + NBAdata$TOV - NBAdata$ORB
NBAdata$ptsxgame=NBAdata$PTS/NBAdata$G
NBAdata$ptsxposs=NBAdata$PTS/NBAdata$empl_poss
```

To calculate the acquired possessions, we have to do some additional steps before.

Field goal attempt differenced:

```{r}
NBAdata$FGAD=NBAdata$FGA-NBAdata$OFG-NBAdata$OTOV-NBAdata$TRB+NBAdata$TOV
```

Regression Analysis to check variable construction:

```{r}
reg9405 = subset(NBAdata,(Season<2006 & Season>1993 & Season!=1999),select=c(FGAD,OFT,FTA))
reg_1 = lm(reg9405$FGAD ~ reg9405$OFT + reg9405$FTA )
summary(reg_1)
```

Hypothesis Testing:

```{r}
wald.test(vcov(reg_1), coef(reg_1), H0 = c(0.45, -0.45), df = 313,
          L = matrix(c(0, 0, 1, 0,
                       0, 1), ncol = 3), verbose = T)
```

Create a proxy using the regressions results:

NBAdata$TEAM_R=coef(reg_1)[1]+reg_1$residuals

```{r}
NBAdata$TEAM_R=NBAdata$FGAD-0.45*NBAdata$OFT+0.45*NBAdata$FTA
NBAdata$TEAM_R2=NBAdata$FGA-NBAdata$OFG-NBAdata$OTOV-NBAdata$TRB+NBAdata$TOV-0.45*NBAdata$OFT+0.45*NBAdata$FTA
TeamReb_test=NBAdata$TEAM_R2-NBAdata$TEAM_R
```

The TeamReb_test returns a serie of 0, so the result of the test is positive and the two equations give the same result.

```{r, echo = FALSE}
plot(x = NBAdata$Season, y = NBAdata$TEAM_R, main = "Estimated Team Rebounds", ylab = "TR",ylim = c(200,4000), xlab = "Seasons",col = "red")
points(x = NBAdata$Season, y = NBAdata$DRB,col = "blue")
```

Acquired possessions:

```{r}
NBAdata$acq_poss=NBAdata$OTOV + NBAdata$DRB+NBAdata$TEAM_R+ NBAdata$OFG + 0.45*NBAdata$OFT
NBAdata$ptsaxgame=NBAdata$OPTS/NBAdata$G
NBAdata$ptsall_poss=NBAdata$OPTS/NBAdata$acq_poss
```

Now counstruct the efficency as the difference between the points per possession and the opponent points per acquired possession:

```{r}
NBAdata$eff=(NBAdata$ptsxposs-NBAdata$ptsall_poss)
```

Model estimation to measure the impact of defensive and offensive efficiency:

```{r}
s9405 = subset(NBAdata,(Season>1993 & Season<2006 & Season!=1999))
```

```{r, echo = FALSE}
plot(x = s9405$Season, y = s9405$W, main = "WINS", ylab = "TR",ylim = c(0,85), xlab = "Seasons",col = "red")
```

Regression using NBA efficency measures:

```{r}
reg_NBA_eff = lm(s9405$W ~  s9405$eff_NBA + s9405$eff_O_NBA)
summary(reg_NBA_eff)
```

The R squared of the regression is:

```{r}
summary(reg_NBA_eff)$r.squared
```

The adjusted R squared of the regression is:

```{r}
adjRsq_NBA_eff=0.9118
```

Regression using NBA relative efficiency measure:

```{r}
reg_NBA_releff = lm(s9405$W ~  s9405$rel_effNBA )
summary(reg_NBA_releff)
```

The R squared of the regression is:

```{r}
summary(reg_NBA_releff)$r.squared
```

The adjusted R squared of the regression is:

```{r}
adjRsq_NBA_releff=0.9108
```

Regression using model efficiency measure:

```{r}
reg_MODEL=lm(s9405$W ~  s9405$eff)
summary(reg_MODEL)
```

The R squared of the regression is:

```{r}
summary(reg_MODEL)$r.squared
```

The adjusted R squared of the regression is:

```{r}
adjRsq_MODEL= 0.9479
```

Regression using both the efficiency measures:

```{r}
reg_both=lm(s9405$W ~  s9405$eff + s9405$eff_NBA + s9405$eff_O_NBA )
summary(reg_both)
```

The R squared of the regression is:

```{r}
summary(reg_both)$r.squared
```

The adjusted R squared of the regression is:

```{r}
adjRsq_both= 0.9479
```

# Answer

The performance of the model-constructed efficiency seems to be higher, considering that the parameter of the first 3 regressions are all significant at 0.001 level and the R square is higher when the regressor is the modelled efficiency.
Hence, it is able to better explain the variability of the model.
Moreover, adding in the third regression the NBA efficiency measures, the adjR^2 (purified by degrees of freedom) remains the same (0.9479). 
This means that, adding them, there are not additional informations.
In fact, in the fourth (reg_both) regression, the coefficients of NBA efficiency measures are not significant at any level.
This is the model that allows us to assess the validity of the model against the NBA efficiency.
In fact, there is evidence that the new efficency model dominates the NBA efficiency model in term of explaining WINS.

## Exercise 5.2

Consider the baseline MODEL for WINS estimated in the lecture over the sample 1994-2005.

# Exercise 5.2.1

The model estimated in the lecture drops Season 1998-99. Why?

# Answer

Season 1999 is excluded because of strikes that occured during that year.
Therefore, number of games is smaller than a regular season. All variables are affected by this situation.


What happens to the results obtained in class if you extend the sample by including Season 1998-1999?

# Impact calculation

Include the Season 1998-1999:

```{r}
regeff_1999 = subset(NBAdata,(Season>1993 & Season<2006))
```

Repeat the two regressions. The one on NBA efficency measure:

```{r}
reg_NBA_1999 = lm(regeff_1999$W ~  regeff_1999$eff_NBA + regeff_1999$eff_O_NBA )
summary(reg_NBA_1999)
```

And the one on the model efficency measure:

```{r}
reg_MODEL_1999=lm(regeff_1999$W ~  regeff_1999$eff)
summary(reg_MODEL_1999)
```

Calculate and compare the R squared ratios:

```{r}
R2_ratio_NBA = summary(reg_NBA_1999)$r.squared/summary(reg_NBA_eff)$r.squared
R2_ratio_NBA
```

The R2_ratio_NBA is greater than 1, so linear regression is not sensitive to the presence of outliers.

```{r}
R2_ratio_model = summary(reg_MODEL_1999)$r.squared/summary(reg_MODEL)$r.squared
R2_ratio_model
```

The R2_ratio_model is lower than 1, so linear regression is sensitive to the presence of outliers.

# Answer

R2 decrease for all regressions due to Season 1999's strikes (outlier).
The drop in R2 is bigger for the regression where the regressor is the modelled efficiency.
The coefficents of both regressions remain significant at 0.001 level.

# Exercise 5.2.2

Add to the baseline model a Season effect and a Team effect and test their significance.

Using the Dataset s9405, the two dummies are constructed as follows:

```{r}
s9405$SeasonEffect<- ifelse(s9405$Season=="2002",1,0)
s9405$TeamEffect<- ifelse(s9405$Team=="Chicago Bulls",1,0)
```

Estimation of the model including the two dummies:

```{r}
reg_M_w_dummies=lm(s9405$W ~  s9405$eff + s9405$SeasonEffect + s9405$TeamEffect)
summary(reg_M_w_dummies)
```

The R squared of the model with the two dummies is:

```{r}
summary(reg_M_w_dummies)$r.squared
```

# Answer

The coefficients of the two dummies are not significant at any level and the Rsqr does not increase.
Hence, Season Effect and Team Effect dummies are not able to add relevant informations to the model and to explain additional variance.

## Exercise 5.3

Compare the Team Rebounds constructed by the formula used in the baseline programme with the proxy obtained by considering the sum of residuals and estimated constant in the regression of Total Possession Differenced on Free Throws and Opponent Free Throws.

Construction of Team Rebounds with the sum of residuals and constant in the regression of Total Possession Differenced on Free Throws and Opponent Free Throws:

```{r}
NBAdata$empl_poss_diff=NBAdata$FGA+NBAdata$TOV-NBAdata$ORB
NBAdata$acq_poss_diff=NBAdata$OTOV+NBAdata$DRB+NBAdata$OFG
NBAdata$tot_poss_diff=NBAdata$empl_poss_diff-NBAdata$acq_poss_diff
NBAdata$TEAM_R=NBAdata$FGAD-0.45*NBAdata$OFT+0.45*NBAdata$FTA
```

Extract the data of interest:
```{r}
regTPD = subset(NBAdata,(Season<2006 & Season>1993 & Season!=1999),select=c(tot_poss_diff,OFT,FTA,TEAM_R))
```

Regression of Total Possession Differenced on Free Throws and Opponent Free Throws:

```{r}
reg_2 = lm(regTPD$tot_poss_diff ~ regTPD$OFT + regTPD$FTA )
summary(reg_2)
```

```{r}
TEAM_R_TPD=coef(reg_2)[1]+reg_2$residuals
```

```{r}
compare=abs(TEAM_R_TPD-regTPD$TEAM_R)
print(sum(compare)/length(compare))
```

# Answer

The difference between TEAM_R_TPD and regTPD$TEAM_R is on average 51 in absolute value.

## Exercise 5.4

Using the simulation model, solve the following points.

# Exercise 5.4.1

Derive the effect of a made free throws on WINS.

# Deterministic Simulation

Baseline no shock, specify values for each stats:

```{r}
FTA=mean(s9405$FTA)
TOV=mean(s9405$TOV)
ORB=mean(s9405$ORB)
X3P=mean(s9405$X3P)
X2P=mean(s9405$X2P)
FGMISS=mean(s9405$FGMISS)
FGA=X3P+X2P+FGMISS
FT=mean(s9405$FT)
OTOV=mean(s9405$OTOV)
DRB=mean(s9405$DRB)
TEAM_R=mean(s9405$TEAM_R)
OFG=mean(s9405$OFG)
OFT=mean(s9405$OFT)
OX3P=mean(s9405$O3P)
OX2P=mean(s9405$O2P)
OFT=mean(s9405$OFT)
```

Identities:

```{r}
PTS=3*X3P+2*X2P+1*FT
empl_poss=FGA + 0.44*FTA + TOV - ORB
pts_poss=PTS/empl_poss
OPTS=3*OX3P+2*OX2P+1*OFT
acq_poss=OTOV + DRB+TEAM_R+ OFG + 0.45*OFT
ptsall_poss=OPTS/acq_poss
```

Alternative scenario, schock on FT_AL:

```{r}
FTA_AL=mean(s9405$FTA)
TOV_AL=mean(s9405$TOV)
ORB_AL=mean(s9405$ORB)
X3P_AL=mean(s9405$X3P)
X2P_AL=mean(s9405$X2P)
FGMISS_AL=mean(s9405$FGMISS)
FGA_AL=X3P_AL+X2P_AL+FGMISS_AL
FT_AL=mean(s9405$FT)+1
OTOV_AL=mean(s9405$OTOV)
DRB_AL=mean(s9405$DRB)
TEAM_R_AL=mean(s9405$TEAM_R)
OFG_AL=mean(s9405$OFG)
OFT_AL=mean(s9405$OFT)
OX3P_AL=mean(s9405$O3P)
OX2P_AL=mean(s9405$O2P)
OFT_AL=mean(s9405$OFT)
```

Identities:

```{r}
PTS_AL=3*X3P_AL+2*X2P_AL+1*FT_AL
empl_poss_AL=FGA_AL + 0.44*FTA_AL + TOV_AL - ORB_AL
pts_poss_AL=PTS_AL/empl_poss_AL
OPTS_AL=3*OX3P_AL+2*OX2P_AL+1*OFT_AL
acq_poss_AL=OTOV_AL + DRB_AL+TEAM_R_AL+ OFG_AL + 0.45*OFT_AL
ptsall_poss_AL=OPTS_AL/acq_poss_AL
```

Now add the last equation and simulate WINS under the two scenarios:

```{r}
W_BL <- reg_MODEL$coefficients[1] +reg_MODEL$coefficients[2]*(pts_poss-ptsall_poss)
W_AL <- reg_MODEL$coefficients[1] +reg_MODEL$coefficients[2]*(pts_poss_AL-ptsall_poss_AL)
VAL=W_AL-W_BL
VAL
```

# Answer

Estimation of WINS change of +3% by increasing Free Throw by 1.

# Exercise 5.4.2

Compare the distribution of wins form the simulated baseline model with that observed in the data and comment on the difference between them.

# Stochastic Simulation

Baseline no shock, specify values for each stats:

```{r}
FTA=mean(s9405$FTA)
TOV=mean(s9405$TOV)
ORB=mean(s9405$ORB)
X3P=mean(s9405$X3P)
X2P=mean(s9405$X2P)
FGMISS=mean(s9405$FGMISS)
FGA=X3P+X2P+FGMISS
FT=mean(s9405$FT)
OTOV=mean(s9405$OTOV)
DRB=mean(s9405$DRB)
TEAM_R=mean(s9405$TEAM_R)
OFG=mean(s9405$OFG)
OFT=mean(s9405$OFT)
OX3P=mean(s9405$O3P)
OX2P=mean(s9405$O2P)
OFT=mean(s9405$OFT)
```

Identities:

```{r}
PTS=3*X3P+2*X2P+1*FT
empl_poss=FGA + 0.44*FTA + TOV - ORB
pts_poss=PTS/empl_poss
OPTS=3*OX3P+2*OX2P+1*OFT
acq_poss=OTOV + DRB+TEAM_R+ OFG + 0.45*OFT
ptsall_poss=OPTS/acq_poss
```

WINS from data:

```{r}
hist(NBAdata$W, breaks = seq(min(NBAdata$W), max(NBAdata$W), l = 20+1),prob=TRUE, main = "WINS from data")
curve(dnorm(x,mean=mean(NBAdata$W),sd=sd(NBAdata$W)),col='darkblue',lwd=2,add=TRUE)
up=mean(NBAdata$W)+2*sd(NBAdata$W)
down=mean(NBAdata$W)-2*sd(NBAdata$W)
```

On average, data tell us that number of wins in a Season is 40.
The 95% of observations are in the interval (14.7,66).

Preparing to simulate two scenarios simulation:

```{r}
nrep <- 10^3
tT <- 1 # 1 obs
W_BL <- VAL_STAT <- array(0, c(tT, nrep)) # the containers
```

```{r}
for (i in 1:nrep){
  x <- rt( n=tT, df=314 )*summary(reg_MODEL)$coefficients[2,2]
  W_BL[, i] <- reg_MODEL$coefficients[1] +(reg_MODEL$coefficients[2]+x)*(pts_poss-ptsall_poss)
}
VAL=W_BL
```

```{r}
hist(VAL, breaks = seq(min(VAL), max(VAL), l = 20+1),prob=TRUE, main = "histogram of effects")
curve(dnorm(x,mean=mean(VAL),sd=sd(VAL)),col='darkblue',lwd=2,add=TRUE)
up_BL=mean(VAL)+2*sd(VAL)
down_BL=mean(VAL)-2*sd(VAL)
```

# Answer

Stochastic smiulations with baseline model predicts 41 wins on average and very low variance. 
It is almost deterministic because range of predicted values is very small.


# Thank you!

Gabriele Ammoni   1828664

Mario Signoretti  3005442

Giuditta Viganò   3010382