ST443      Half Unit
Machine Learning and Data Mining

This information is for the 2024/25 session.

Teacher responsible

Prof Milan Vojnovic

Availability

This course is compulsory on the MSc in Data Science. This course is available on the MPA in Data Science for Public Policy, MSc in Applied Social Data Science, MSc in Econometrics and Mathematical Economics, MSc in Geographic Data Science, MSc in Health Data Science, MSc in Quantitative Methods for Risk Management, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

This course has a limited number of places (it is controlled access) and demand is typically high. This may mean that you're not able to get a place on this course.

Pre-requisites

The course will be taught from a statistical perspective and students must have a very solid understanding of linear regression models.

Students are not permitted to take this course alongside Algorithmic Techniques for Data Mining (MA429).

Course content

Machine learning and data mining are emerging fields situated between statistics and computer science. They focus on objectives such as prediction, classification and clustering, particularly in contexts where datasets are large, commonly referred to as the world of 'big data'.

This course will commence with the classical statistical methodology of linear regression as a foundation. From there, it will progress to provide an introduction to machine learning and data mining methods from a statistical perspective. In this framework, machine learning will be conceptualised as 'statistical learning', aligning with the titles of the books in the essential reading list.

The course aims to cover modern non-linear methods such as spline methods, generalised additive models, decision trees, random forests, bagging, boosting and support vector machines. Additionally, it will delve into advanced approaches, such as ridge regression, the lasso, linear discriminant analysis, k-means clustering, and nearest neighbours. 

Teaching

The first part of the course reviews regression methods and covers logistic regression, linear and quadratic discriminant analysis, cross-validation, variable selection, nearest neighbours and shrinkage methods.

The second part of the course introduces non-linear models and covers splines, generalised additive models, tree methods, bagging, random forests, boosting, support vector machines, principal components analysis, k-means, and hierarchical clustering.

This course will be delivered through a combination of classes, lectures and Q&A sessions totalling a minimum of 35 hours across Autumn Term. This course includes a reading week in Week 6 of Autumn Term.

Formative coursework

Students will be expected to produce 5 problem sets in the Autumn Term (AT).

The problem sets will consist of both theoretical questions and data problems that require implementing, on a computer, various methods covered in class.

Indicative reading

James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning. 2nd Edition, Springer, 2021. Available online at https://www.statlearning.com/

Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd Edition, Springer, 2009. Available online at http://statweb.stanford.edu/~tibs/ElemStatLearn/index.html

Assessment

Exam (70%, duration: 2 hours) in the spring exam period.
Project (30%) in Week 11 of the AT.

Student performance results

(2020/21 - 2022/23 combined)

Classification % of students
Distinction 33.5
Merit 43.8
Pass 15.2
Fail 7.6

Key facts

Department: Statistics

Total students 2023/24: 68

Average class size 2023/24: 18

Controlled access 2023/24: Yes

Value: Half Unit

Course selection videos

Some departments have produced short videos to introduce their courses. Please refer to the course selection videos index page for further information.

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills