Course, academic year 2022-2023

30607 - FOUNDATIONS OF DATA SCIENCE

Department of Decision Sciences

Course taught in English
CLEAM (6 credits - II sem. - OP  |  SECS-S/01) - CLEF (6 credits - II sem. - OP  |  SECS-S/01) - CLEACC (6 credits - II sem. - OP  |  SECS-S/01) - BESS-CLES (6 credits - II sem. - OP  |  SECS-S/01) - WBB (6 credits - II sem. - OP  |  SECS-S/01) - BIEF (6 credits - II sem. - OP  |  SECS-S/01) - BIEM (6 credits - II sem. - OP  |  SECS-S/01) - BIG (6 credits - II sem. - OP  |  SECS-S/01) - BEMACS (6 credits - II sem. - OP  |  SECS-S/01) - BAI (6 credits - II sem. - OP  |  SECS-S/01)
Course Director:
OMIROS PAPASPILIOPOULOS

Classes: 31 (II sem.)
Instructors:
Class 31: OMIROS PAPASPILIOPOULOS


Suggested background knowledge

Basic Mathematics, Probability and Statistics. Some programming experience and some experience with basic data analysis are recommended. Overall, the expected student profile is one very keen on quantitative analysis, eager to understand how machine learning algorithms work and to get first-hand experience with the different components of data science, from basic data warehousing to elements of causal inference. Some videos and online exercises will be made available a couple of months before the course starts. These are meant to help students familiarize themselves with the fundamental background in order to make the most of the course. The first TA session will take place before the course starts and will address questions that stem from this introductory material.

Mission & Content Summary

MISSION

The purpose of the course is to bring Bocconi to the international forefront of undergraduate education in Social Sciences by providing hands-on training in the foundations of Data Science, especially targeted at people who are less exposed to, or even unaware of, this material. The course is based on Jupyter notebooks using Python implementations in a Google Colab environment that requires no installation by the student. The code needed to carry out all analyses is provided in the notebooks. Case studies also involve text and image data. The course has three main components, as outlined below. At the end of each component there is a data project that students do jointly with the instructor. There is also a more substantial project based on a real application from industry. There will also be an associated competition via the Bocconi Data Science Challenges Platform.

CONTENT SUMMARY

Part A: The basics
+ Intro to course and case studies
+ (Less) Basic Python programming
+ Basic data management and visualization with Python
+ Messy data and feature engineering 


Part B: Predictive modelling
+ Fundamentals: supervised learning and optimization
+ Lasso regression
+ Classification
+ Representation learning pt1: trees, bagging and boosting
+ Representation learning pt2: neural networks

 

Part C: Uncertainty quantification and causal inference
+ Stability 
+ Split sample methods, bootstrap, conformal inference
+ Elements of causal inference
+ Treatment effect estimation and double machine learning
+ Causal forests
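
For readers who want a flavour of the Part C material before the course, the bootstrap idea listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not course material: the function name bootstrap_mean_interval, its parameters and the sample values are all hypothetical, and the percentile interval shown is only one of several bootstrap interval constructions.

```python
import random
import statistics


def bootstrap_mean_interval(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `data`.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample the data with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Read off the empirical alpha/2 and 1 - alpha/2 quantiles.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements, used only to exercise the function.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
low, high = bootstrap_mean_interval(sample)
print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"95% bootstrap interval for the mean: [{low:.2f}, {high:.2f}]")
```

The interval width gives a rough sense of how much the estimated mean would vary if the data were collected again, which is the kind of uncertainty statement this part of the course is concerned with.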

 

Part D: Wrap up
+ Student project presentations


Intended Learning Outcomes (ILO)

KNOWLEDGE AND UNDERSTANDING

At the end of the course students will be able to...
  • define data analysis methodology
  • carry out basic data warehousing to represent, visualize and transform data
  • build, train and evaluate machine learning models and algorithms
  • integrate machine learning with uncertainty quantification and basic causal inference
  • develop models, algorithms and code
  • understand the fundamental machine learning methodologies

APPLYING KNOWLEDGE AND UNDERSTANDING

At the end of the course students will be able to...
  • apply appropriate data analysis methodologies
  • choose appropriate machine learning algorithms and evaluate their performance
  • produce measures of uncertainty associated with statistical learning
  • carry out causal inference using appropriate assumptions and algorithms
  • develop and adapt Python code for all the above tasks

Teaching methods

  • Face-to-face lectures
  • Exercises (exercises, database, software etc.)
  • Case studies / Incidents (traditional, online)
  • Individual assignments
  • Group assignments
  • Interactive class activities (role playing, business game, simulation, online forum, instant polls)

DETAILS

A combination of five basic approaches:
0. Videos distributed before the course that review background knowledge in Statistics, computing and Python
1. A few lectures on the foundations of the methodology
2. Most lectures are based on Jupyter notebooks, where models and algorithms are illustrated directly on data and students can interact with the code
3. Guided project sessions
4. TA sessions on more practical coding aspects


Assessment methods

                                                                                Continuous assessment   Partial exams   General exam
  • Written individual exam (traditional/online)                                                                        x
  • Individual assignment (report, exercise, presentation, project work etc.)  x
  • Group assignment (report, exercise, presentation, project work etc.)       x

ATTENDING AND NOT ATTENDING STUDENTS

  1. 9/31 of the mark is based on exercises given at the end of each theme, which correspond to the guided project sessions.
  2. 13/31 of the mark is for a group project for Part B, done in groups of 4. This will take the form of a hackathon managed through the Bocconi Data Science Challenges Platform.
  3. 9/31 of the mark is based on an individual final exam.
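
The three weights add up to 31. As a rough illustration of the arithmetic, the sketch below combines hypothetical component scores into a mark out of 31. It assumes each component is scored as a fraction between 0 and 1, which the syllabus does not state; the names WEIGHTS and final_mark and the example scores are invented for the illustration, and any rounding rule applied in practice is not modelled.

```python
# Weights out of 31, as listed in the assessment breakdown.
WEIGHTS = {"exercises": 9, "group_project": 13, "final_exam": 9}


def final_mark(fractions):
    """Combine per-component scores (fractions in [0, 1]) into a mark out of 31.

    Hypothetical helper: the scoring scale of each component is an assumption.
    """
    for name in WEIGHTS:
        if not 0 <= fractions[name] <= 1:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


# Made-up scores, used only to show the calculation.
example = {"exercises": 0.9, "group_project": 0.8, "final_exam": 0.7}
mark = final_mark(example)
print(f"{mark:.1f} / {sum(WEIGHTS.values())}")
```

With these made-up scores the contributions are 9 × 0.9 = 8.1, 13 × 0.8 = 10.4 and 9 × 0.7 = 6.3, for a total of 24.8 out of 31.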

Teaching materials


ATTENDING AND NOT ATTENDING STUDENTS

0. Videos distributed before the course
1. Jupyter notebooks
2. Lecture notes

 

Suggested references:
0. Art of Statistics
https://www.amazon.it/Art-Statistics-Learning-Data/dp/0241398630
This is an excellent book for understanding modern Statistics, and it can serve as preparation before starting the course.

 

The following three books can be used to gain a deeper understanding of the machine learning methods we will cover:

 

1. Elements of Statistical Learning
https://www.amazon.it/Elements-Statistical-Learning-Inference-Prediction/dp/0387848576

 

2. Pattern Recognition and Machine Learning
https://www.amazon.it/Pattern-Recognition-Machine-Learning-Christopher/dp/0387310738

 

3. Deep Learning
https://www.deeplearningbook.org/

 

Parts of the course will also be based on the forthcoming book:

 

4. Veridical Data Science: The Practice of Responsible Data Analysis and Decision Making

5. There will be references to certain articles. The following four are particularly relevant for the aims of this course:
   + Statistical Modeling: The Two Cultures (2001) by Leo Breiman
   + Prediction, Estimation and Attribution (2020) by Brad Efron
   + Statistics in the big data era: Failures of the machine (2018) by David Dunson
   + 50 years of Data Science (2017) by David Donoho

Last change 06/06/2022 19:23