ST445      Half Unit
Managing and Visualising Data

This information is for the 2023/24 session.

Teacher responsible

Dr Chengchun Shi

Availability

This course is compulsory on the MSc in Data Science and MSc in Health Data Science. This course is available on the MPA in Data Science for Public Policy, MSc in Applied Social Data Science, MSc in Geographic Data Science, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is available with permission as an outside option to students on other programmes where regulations permit.

This course has a limited number of places (it is controlled access) and demand is typically very high. Priority is given to students for whom the course is compulsory, then to students in the Department of Statistics whose programme regulations list the course as an option, students on the MSc in Applied Social Data Science, and students on the MSc in Geographic Data Science. Students from outside these programmes may not get a place.

Pre-requisites

Students who have no previous experience in Python are required to take an online pre-sessional Python course from the Digital Skills Lab (https://moodle.lse.ac.uk/course/view.php?id=7696).

Course content

The course focuses on the fundamental principles and best practices of data manipulation and visualisation. It uses Python as the primary programming language, together with various software packages.

The first five weeks focus on data manipulation, covering basic concepts such as data types and data models. Students learn how to create data model instances, load data into them, and manipulate and query the data. The course covers data structures for scientific computing and their manipulation through the Python package NumPy, and high-level data structures and functions for working with structured or tabular data through the Python package Pandas. We will also cover the basic concepts of relational data models and the SQL query language for creating and querying database tables.
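As an illustration of creating and querying database tables, the following is a minimal sketch using Python's built-in sqlite3 module. The table names (students, marks), the column names, the sample rows and the helper average_mark_by_programme are all hypothetical and chosen only for this example. The course reading list points to PostgreSQL, so the database engine used in teaching may differ.

```python
import sqlite3

# In-memory database; all table names, column names and rows below are
# illustrative assumptions, not course material.
conn = sqlite3.connect(":memory:")

# Create two related tables (a simple relational data model).
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, programme TEXT)"
)
conn.execute(
    "CREATE TABLE marks ("
    "student_id INTEGER REFERENCES students(id), "
    "problem_set INTEGER, "
    "mark REAL)"
)

# Load data into the tables using parameterised inserts.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Ada", "Data Science"), (2, "Ben", "Statistics"), (3, "Chen", "Data Science")],
)
conn.executemany(
    "INSERT INTO marks VALUES (?, ?, ?)",
    [(1, 1, 72.0), (1, 2, 68.0), (2, 1, 55.0), (3, 1, 80.0), (3, 2, 90.0)],
)


def average_mark_by_programme(connection):
    """Join the two tables and return the mean mark per programme."""
    query = """
        SELECT s.programme, AVG(m.mark) AS avg_mark
        FROM students AS s
        JOIN marks AS m ON m.student_id = s.id
        GROUP BY s.programme
        ORDER BY s.programme
    """
    return connection.execute(query).fetchall()


result = average_mark_by_programme(conn)
print(result)
```

The query joins the two tables on the student identifier and aggregates with GROUP BY, which are the core operations behind most relational queries.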

The last five weeks focus on data visualisation, starting with exploratory data analysis using various statistical plots. We will explain visualisations used to evaluate binary classifiers, such as receiver operating characteristic (ROC) curves and precision-recall curves. We will explain the principles of some dimensionality reduction methods used to visualise high-dimensional data points, from classical methods such as multidimensional scaling to more recent methods such as stochastic neighbour embedding. We will discuss the basic principles of graph data visualisation methods and different graph layouts. The visualisations will be implemented in code using Python packages such as Matplotlib, Seaborn, and various scikit-learn modules.
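To make the idea of an ROC curve concrete, here is a minimal sketch that computes the points of the curve and the area under it using only the Python standard library. The function names roc_points and area_under_curve and the example labels and scores are hypothetical. In practice a library routine such as scikit-learn's roc_curve, together with Matplotlib for plotting, would normally be used instead.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    labels: 0/1 class labels; scores: classifier scores, higher = more positive.
    Assumes both classes are present, otherwise a rate would divide by zero.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores are classified together at one threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative data only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

Each point corresponds to one decision threshold. Lowering the threshold moves the curve up when a positive example is admitted and to the right when a negative example is admitted.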

Teaching

20 hours of lectures and 15 hours of seminars in the AT.

This course will be delivered through a combination of classes, lectures and Q&A sessions totalling a minimum of 35 hours in Autumn Term. This course includes a reading week in Week 6 of Autumn Term.

Students are required to install Python on their own laptops and use their own laptops in the seminar sessions.

Formative coursework

Students will be expected to produce 8 problem sets in the AT.

Indicative reading

  • McKinney, W., Python for Data Analysis, 2nd Edition, O’Reilly, 2017
  • Müller, A. C. and Guido, S., Introduction to Machine Learning with Python, O’Reilly, 2016
  • Géron, A., Hands-On Machine Learning with Scikit-Learn & TensorFlow, O’Reilly, 2017
  • Ramakrishnan, R. and Gehrke, J., Database Management Systems, 3rd Edition, McGraw Hill, 2002
  • Obe, R. and Hsu, L., PostgreSQL: Up and Running, 3rd Edition, O’Reilly, 2017
  • Robinson, I., Webber, J. and Eifrem, E., Graph Databases, 2nd Edition, O’Reilly, 2015
  • Murray, S., Interactive Data Visualization for the Web, O’Reilly, 2013
  • Matplotlib, https://matplotlib.org
  • Seaborn: statistical data visualization, https://seaborn.pydata.org
  • scikit-learn: Machine Learning in Python, http://scikit-learn.org

Assessment

Project (80%) in the WT.
Continuous assessment (20%) in the AT.

Students are required to hand in solutions to 2 problem sets, each accounting for 10% of the final assessment (20% in total). In addition, there will be a take-home exam (80%) in the form of a final project, in which students will demonstrate their ability to manage data and visualise it through effective statistical graphics, using the principles they have learnt on the course.

Student performance results

(2019/20 - 2021/22 combined)

Classification % of students
Distinction 89.1
Merit 9.8
Pass 0.5
Fail 0.5

Key facts

Department: Statistics

Total students 2022/23: 93

Average class size 2022/23: 32

Controlled access 2022/23: Yes

Lecture capture used 2022/23: Yes (MT)

Value: Half Unit

Course selection videos

Some departments have produced short videos to introduce their courses. Please refer to the course selection videos index page for further information.

Personal development skills

  • Self-management
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills