30401 - MATHEMATICS AND STATISTICS - MODULE 2 (STATISTICS)
Department of Decision Sciences
OMIROS PAPASPILIOPOULOS
Suggested background knowledge
Mission & Content Summary
MISSION
CONTENT SUMMARY
The course is organized in themes. Each theme starts with an overview, introduces some motivating data and the associated scientific questions, and then develops the statistical tools (models, algorithms, mathematical concepts) needed to extract knowledge from the data and address the motivating questions. Each theme finishes with a summary and exercises.
The themes are:
1. Data visualization and summarization
Data: heart attack study, Shipman's dead patients, daily homicides, test results, jelly beans competition
Concepts: barplots, box plots, means and medians and variational formulation, logarithmic scale, correlation and distance correlation
2. From randomization to randomness
Data: chocolates and Nobel prizes, university admission data, death penalty data
Concepts: spurious correlations, experimental vs observational data, random numbers, randomized controlled trials, confounders, Simpson's paradox
3. What is probability and what is it useful for
Concepts: Bernoulli distribution, probability densities, Poisson distribution, series and limits, learning a model from the data
4. The calculus of probability
Concepts: events, basic set theory, axioms of probability
5. More models for more data
Data: birth weights, human heights, heart transplant survival data
Concepts: density functions, Gaussian distribution, survival analysis, exponential distribution, censoring, gamma distribution and special functions, uniform distribution, transformation of variables, simulation of random variables
6. Joint distributions, independence and combinatorics
Data: 10-year maturity bonds, heights of fathers and sons, the Sally Clark story
Concepts: joint and marginal distributions, independence, statistical arguments in Law, the binomial distribution
7. Expectation
Concepts: expected value and interpretation, properties of expectation, moments, variance, standard deviation and interpretation, the uncertainty rule of thumb, skewness and interpretation, sample and population moments
8. Elements of Network Science
Data: the Internet, employee communication network, the actor network
Concepts: Erdos-Renyi network model, degree distributions, six degrees of separation, heavy tails, scale-free property, power laws, the Student-t distribution
9. Concentration, inequalities and limit theorems
Concepts: Markov inequality, Chebyshev inequality, uncertainty quantification, weak law of large numbers, a basic understanding of the central limit theorem
10. Statistical learning
Data: cholesterol and heart disease, arm-folding and sex, bowel cancer rates in the UK
Concepts: quantifying evidence in data about a hypothesis, p-value, Fisher's exact test, multiple testing, confidence intervals from concentration inequalities, bootstrap and confidence intervals, funnel plots
Intended Learning Outcomes (ILO)
KNOWLEDGE AND UNDERSTANDING
+ formulate statistical learning questions
+ identify appropriate data analysis methodologies
+ carry out uncertainty quantification
+ learn basic models from data
APPLYING KNOWLEDGE AND UNDERSTANDING
+ choose appropriate data summaries and visualization
+ carry out basic network analysis
+ derive basic probability calculations
+ use statistical learning tools
Teaching methods
- Practical Exercises
- Collaborative Works / Assignments
- Interaction/Gamification
DETAILS
Exercises (Exercises, database, software etc.):
Special sessions will provide exercises, examples and illustrations of concepts and methods, including with the help of the statistical software R.
Group assignments:
Students will be given a project to work on in groups; it will involve both methodology and data analysis.
Assessment methods
Continuous assessment | Partial exams | General exam | |
---|---|---|---|
|
x | x | |
|
x |
ATTENDING AND NON-ATTENDING STUDENTS
Students may choose between the following two options:
- Two partial written exams (a mid-term and a final) that contribute to the final grade with a 50% weight each.
- A single general written exam (after the end of the course) that counts for 100% of the final mark.
The tests consist of exercises. They aim at ascertaining students' mastery of concepts and results discussed during lectures as well as an adequate knowledge of R.
In each test the maximum grade is 31.
The assessment method is the same for both attending and non-attending students.
Students who take the mid-term exam may still take the general exam instead of taking the final exam.
Importantly, access to the final (or second partial) exam follows the rules indicated in Section 7.6 of the Guide to the University.
There will be an optional group project worth a maximum of 1.5/31 points. These points are added to the total mark obtained through either of the options above, but only for exams taken before the end of June; the project mark cannot be carried over to exams taken after June.
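As an illustration of the arithmetic described above, here is a minimal sketch of how a total mark might be computed. The function name `course_mark` and its parameters are hypothetical, and the sketch assumes no rounding and no cap on the total, since the syllabus specifies neither.

```python
MAX_EXAM_MARK = 31.0      # maximum grade in each test, as stated above
MAX_PROJECT_POINTS = 1.5  # maximum bonus from the optional group project


def course_mark(exam_marks, project_points=0.0, before_end_of_june=True):
    """Combine exam marks and the optional project bonus (hypothetical helper).

    exam_marks: [general] for the single general exam, or
                [mid_term, final] for the two partial exams (50% each).
    Rounding and any cap on the total are not specified in the syllabus,
    so none is applied here.
    """
    if len(exam_marks) not in (1, 2):
        raise ValueError("expected one general exam mark or two partial exam marks")
    for mark in exam_marks:
        if not 0 <= mark <= MAX_EXAM_MARK:
            raise ValueError("each exam mark must lie between 0 and 31")
    if not 0 <= project_points <= MAX_PROJECT_POINTS:
        raise ValueError("project points must lie between 0 and 1.5")

    # Equal weights: 100% for one exam, 50% each for two partial exams.
    base = sum(exam_marks) / len(exam_marks)
    # The project bonus only applies to exams taken before the end of June.
    bonus = project_points if before_end_of_june else 0.0
    return base + bonus
```

For example, `course_mark([28, 26], project_points=1.5)` combines two partial exams with the full project bonus, while `course_mark([27], project_points=1.0, before_end_of_june=False)` drops the bonus because the exam is taken after June.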
Teaching materials
ATTENDING AND NON-ATTENDING STUDENTS
The teaching material will be primarily that developed during the classes and distributed to the students in PDF format after each class.
The course will use examples and extracts primarily from the first book listed below. It is advisable to acquire this book either in its original publication or its Italian translation (it is also available as an e-book), since it is an excellent modern resource to learn Probability and Statistics and why these are fundamental in anything that has to do with learning from data.
The early chapters of Grimmett and Stirzaker provide an excellent, more technical introduction to Probability. The introduction and some appendices of Bishop provide an excellent and accessible introduction to statistical machine learning and the use of Probability and Statistics for designing and analyzing algorithms. Ross is a textbook whose syllabus correlates highly with the contents of this course. For a number of basic concepts the corresponding Wikipedia pages are a great resource. Please use those instead of random blogs, webpages or videos posted on YouTube.
- Spiegelhalter, The Art of Statistics: How to Learn from Data, Penguin, 2019, ISBN 978-1541618510 (available also in Italian translation)
- Barabasi, Network Science
- Grimmett and Stirzaker, Probability and Random Processes, Oxford, Fourth Edition, 2020, ISBN 978-0198847595
- Bishop, Pattern Recognition and Machine Learning, Springer, 2006, ISBN 978-0387310732
- Ross, Introduction to Probability and Statistics for Engineers and Scientists, Academic Press, Fourth Edition, 2014