30607 - FOUNDATIONS OF DATA SCIENCE
Course taught in English
Basic Mathematics, Probability and Statistics. Some programming experience and some experience with basic data analysis are recommended. Overall, the expected student is very keen on quantitative analysis, eager to understand how machine learning algorithms work, and eager to gain first-hand experience with the different components of data science, from basic data warehousing to elements of causal inference. A couple of months before the course starts, some videos and online exercises will be made available. These are meant to help students familiarize themselves with the fundamental background in order to make the most of the course. The first TA session will take place before the course starts and will address questions arising from this introductory material.
The purpose of the course is to bring Bocconi to the international forefront of undergraduate education in Social Sciences by providing hands-on training in the foundations of Data Science, especially targeted at people who are less exposed to, or even unaware of, this material. The course is based on Jupyter notebooks using Python implementations in a Google Colab environment that requires no installation by the student. The code needed to carry out all analyses is provided in the notebooks. Case studies also involve text and image data. The course has three main components, as outlined below. At the end of each component there is a data project that students carry out jointly with the instructor. There is also a more substantial project based on a real industry application, with an associated competition run via the Bocconi Data Science Challenges Platform.
Part A: The basics
+ Intro to course and case studies
+ (Less) Basic Python programming
+ Basic data management and visualization with Python
+ Messy data and feature engineering
Part B: Predictive modelling
+ Fundamentals: supervised learning and optimization
+ Lasso regression
+ Classification
+ Representational learning pt1: trees, bagging and boosting
+ Representational learning pt2: neural networks
Part C: Uncertainty quantification and causal inference
+ Stability
+ Split sample methods, bootstrap, conformal inference
+ Elements of causal inference
+ Treatment effect estimation and double machine learning
+ Causal forests
Part D: Wrap up
+ Student project presentations
- define data analysis methodology
- carry out basic data warehousing to represent, visualize and transform data
- build, train and evaluate machine learning models and algorithms
- integrate machine learning with uncertainty quantification and basic causal inference
- develop models, algorithms and code
- understand the fundamental machine learning methodologies
- apply appropriate data analysis methodologies
- choose appropriate machine learning algorithms and evaluate their performance
- produce measures of uncertainty associated with the statistical learning
- carry out causal inference using appropriate assumptions and algorithms
- develop and adapt Python code for all the above tasks
- Face-to-face lectures
- Exercises (exercises, database, software etc.)
- Case studies / Incidents (traditional, online)
- Individual assignments
- Group assignments
- Interactive class activities (role playing, business game, simulation, online forum, instant polls)
Combination of 5 basic approaches:
0. Videos distributed before the course that review background knowledge in Statistics, computing and Python
1. A few lectures on the foundations of the methodology
2. Most of the lectures are based on Jupyter notebooks, where models and algorithms are illustrated directly on data and students can interact with the code
3. Guided project sessions
4. TA sessions on more practical coding aspects
Continuous assessment | Partial exams | General exam | |
---|---|---|---|
x | |||
x | |||
x |
- 9/31 of the mark is based on exercises given at the end of each theme, corresponding to the guided project sessions.
- 13/31 of the mark is for a group project for Part B, done in groups of 4. This will take the form of a hackathon managed through the Bocconi Data Science Challenges Platform.
- 9/31 of the mark is based on an individual final exam.
0. Videos distributed before the course
1. Jupyter notebooks
2. Lecture notes
Suggested references:
0. Art of Statistics
https://www.amazon.it/Art-Statistics-Learning-Data/dp/0241398630
This is an excellent book for understanding modern Statistics, and it can serve as preparation before starting the course.
The following three books can be used to gain a deeper understanding of the machine learning methods we will cover:
1. Elements of Statistical Learning
https://www.amazon.it/Elements-Statistical-Learning-Inference-Prediction/dp/0387848576
2. Pattern recognition and machine learning
https://www.amazon.it/Pattern-Recognition-Machine-Learning-Christopher/dp/0387310738
3. Deep Learning
https://www.deeplearningbook.org/
Parts of the course will also be based on the forthcoming book:
4. Veridical Data Science: The Practice of Responsible Data Analysis and Decision Making
5. There will be references to certain articles. The following four are particularly relevant for the aims of this course:
+ Statistical Modeling: The Two Cultures (2001) by Leo Breiman
+ Prediction, Estimation and Attribution (2020) by Brad Efron
+ Statistics in the big data era: Failures of the machine (2018) by David Dunson
+ 50 years of Data Science (2017) by David Donoho