---
title: "Data Handling with R"
output:
  pdf_document:
    toc: yes
  html_document:
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Preliminary Steps  
The preliminary steps in any R program are cleaning the memory, setting up the working directory and loading and installing the relevant packages 
```{r}
#clear the environment 
rm(list=ls()) 
## ------------------------------------------------------------------------
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# packages used
listofpackages <- c("dygraphs", "dplyr","ellipse","reshape2","ggplot2","highcharter","xts","xlsx","readxl",
"quantmod","foreign")

for (j in listofpackages){
  if(sum(installed.packages()[, 1] == j) == 0) {
    install.packages(j)
  }
  library(j, character.only = T)
}

```

## Loading data from different formats 
First we need to read the data and recognize their format. In the following we read the data from  .txt, .csv and build in databases.We also check the structure of the data imported from different files, which is that of a "dataframe" or of a "tibble". 

### Loading data from .txt format
```{r}

args(read.table)
data_readtable1=read.table("demo_data.txt",sep="\t",header=TRUE)
head(data_readtable1) 
str(data_readtable1)
```
### Loading data from .csv format
```{r}
args(read.csv)
data_readcsv=read.csv("demo_data.csv")
head(data_readcsv)
str(data_readcsv)
```
### Loading data from R inbuilt dataset
```{r}
data(iris)
head(iris)
str(iris)
```
### Loading data from foreign software (SPSS, STATA)
```{r}
#SPSS
data_SPSS=read.spss("personnel.sav")
head(data_SPSS)
#data are read as a list then convert this list into a dataframe
data_SPSS=as.data.frame(data_SPSS)
head(data_SPSS)
#STATA
data_Stata=read.dta("Income.dta")
head(data_Stata)
```
##From a Tibble to a .xts object  
As many of the applications in asset pricing deal with time series there is a specific format to deal with this data a .xts object
```{r}
raw_data           = read_xlsx("2023_monthly_stocks.xlsx") 
names(raw_data)[1] = 'Date'
typeof(raw_data)
typeof(raw_data$Date)
typeof(raw_data$AXP)
typeof(raw_data$CSCO)
str(raw_data)

dates <-seq(as.Date("1985-02-01"),length=462, by="months")
params <- c("Date","AXP","AMGN","AAPL","BA","CAT","CSCO", "DJI")
data <- raw_data[, c(params)]
data<- na.omit(data)
data <- data %>% 
  mutate(Date = as.Date(Date, format = "%Y-%m-%d"))

params1 <- c("AXP","AMGN","AAPL","BA","CAT","CSCO", "DJI")
tsdata <- xts(raw_data[, c(params1)], order.by=dates) # creates a time series object
tsdata <- na.omit(tsdata) # omitting the rows with NA presence
data<- na.omit(data)
## having created the database with all observation we generate a subset 
#tsdata1 <- tsdata["1992-02-01/1993-02-01"]
#data=subset(data,select=c(1:12))
```
##Data Handling 
We Illustrate data handling with the us_stocks.csv file 
```{r}
data_stocks=read.csv(file="us_stocks.csv",header=TRUE)
head(data_stocks)
names(data_stocks)
colnames(data_stocks)

msft_prices1=data_stocks$MSFT #the data is returned as a vector
head(msft_prices1)
msft_prices2=data_stocks[["MSFT"]]
#the data is returned as a vector
head(msft_prices2)
#the following returns data as a data frame
msft_prices3=data_stocks["MSFT"]
#can also be used to access multiple columns
head(msft_prices3)


## ------------------------------------------------------------------------
#MSFT is in the second column and leaving the row index blank returns all the rows for the particular column

msft_prices4=data_stocks[,2]

head(msft_prices4)
#all the elements in row 4
data_stocks[4,]

## ------------------------------------------------------------------------
#First create a vector having the returns for msft
msft_ret=100*diff(log(data_stocks$MSFT)) 
#combine the vector with the data 
#data_stocks_r=cbind(data_stocks,MSFT_RET=msft_ret)
#this will generate an error message 
#different length 
length(msft_ret) 
length(data_stocks$MSFT)
#add one more value to vector msft_ret 
msft_ret=c(0,msft_ret)
#check the length 
length(msft_ret)
#lets combine now (it should work)
data_stocks_r=cbind(data_stocks,MSFT_RET=msft_ret)
head(data_stocks_r)#shows one more column added to the data

## ------------------------------------------------------------------------
#create two dataframes from data_stocks
data_r1=data_stocks[1:10,]#first 10 rows
data_r2=data_stocks[2775:2784,]#last 10 rows
data_stocks_rbind=rbind(data_r1,data_r2)
print(data_stocks_rbind)

## ------------------------------------------------------------------------
#select all rows with Apple prices above 100
data_aaplgr100=data_stocks[data_stocks$AAPL>100,] 
head(data_aaplgr100)
min(data_aaplgr100$AAPL) #check if the prices are above 100
#this give NA as the minimum which indicates that data frame has NA 
#lets remove NAs from data_aaplgr100 using na.omit function
data_aaplgr100=na.omit(data_aaplgr100)
#now check the minimum again
min(data_aaplgr100$AAPL)

## ------------------------------------------------------------------------
head(data_stocks)#notice NAs in GOOG
data_stocks_googlena=data_stocks[!is.na(data_stocks$GOOG),]
head(data_stocks_googlena)#after removing NAs
#the above can still leave NAs in other columns
#use na.omit to remove all the blank data
data_stocks_naomit=na.omit(data_stocks)

## ------------------------------------------------------------------------
data_msft=data_stocks_naomit[data_stocks_naomit$MSFT<=30&data_stocks_naomit$MSFT>20,]
min(data_msft$MSFT)#check 

## ------------------------------------------------------------------------
args(subset.data.frame)
aaplgr100=subset(data_stocks_naomit,AAPL>100)
head(aaplgr100)
min(aaplgr100$AAPL)

```

##Saving data as an object in .Rdata format
```{r}
save(data_readtable1,file="data1.Rdata")
 #saving data into another text file
write.table(data_readtable1,file="data1.txt")
load("data1.Rdata")#using load to load R data 
head(data_readtable1)

```

## Extract an R file from R markdown 
For a literate programming document, we can either compile it to a report
(run the code), or extract the program code in it. They are called
“weaving” and “tangling,” respectively. Apparently the function knit()
is for weaving, and the corresponding tangling function is purl() in
knitr. For example,

install.packages("knitr", dependencies = TRUE)
library(knitr)
purl("your-file.Rnw")
purl("your-file.Rmd")

The result of tangling is an R script.

## Asking Chat GPT for Advice 
### What is the difference between a tibble and dataframe ? 
n R, both tibbles and data frames are data structures used for working with tabular data. However, tibbles are a modern and enhanced version of data frames introduced by the tidyverse ecosystem, which includes packages like dplyr and ggplot2. Here are the key differences between tibbles and data frames:

Printing:

Tibbles provide a more concise and informative printout compared to data frames. When you print a tibble, it shows only the first few rows and all columns that fit on the screen, making it easier to inspect the data.
Data frames, on the other hand, may print a large number of rows and columns by default, making it harder to quickly grasp the structure of the data.
Column Types:

Tibbles are stricter about column types. They don't automatically convert character vectors to factors, which is a common behavior in data frames. This can help prevent unexpected type conversions.
Data frames may convert character columns to factors by default, which can lead to unintended consequences.
Column Names:

Tibbles allow non-standard column names, such as names with spaces, dots, or special characters. These column names need to be enclosed in backticks (``) to be referenced.
Data frames typically enforce stricter naming conventions, and non-standard names may require extra effort to work with.
Subsetting:

Tibbles use non-standard evaluation (NSE), which means you can refer to columns directly by their names without the need for quotation marks.
Data frames use standard evaluation (SE), so you often need to use quotation marks when referring to columns.
Row Names:

Tibbles do not have row names, whereas data frames can have row names by default.
Data frames can store row names, but they are not typically used in modern data analysis workflows.
Data Frame Size:

Tibbles are not limited by the default maximum number of rows or columns displayed in the console. Data frames may have display limits set by default.
Package Dependency:

Tibbles are associated with the tibble package, which is part of the tidyverse ecosystem. You need to load this package to work with tibbles.
Data frames are a built-in part of base R and do not require additional packages.
In practice, tibbles are often preferred in data analysis workflows within the tidyverse ecosystem due to their improved usability, stricter column typing, and more informative printing. However, data frames are still widely used and are compatible with a wide range of R packages and functions. You can easily convert between tibbles and data frames using functions like as_tibble() and as.data.frame().
### How do I extract a .R file from a .Rmd file in R ?