Module Learning Outcomes Assessed:
B1: COMPUTATIONAL THINKING: develop and understand algorithms to solve problems; measure and optimise algorithm complexity; appreciate the limits of what may be done algorithmically in reasonable time or at all.
B2: PROGRAMMING: create working solutions to a variety of computational and real-world problems using multiple programming languages chosen as appropriate for the task.
B4: DATA SCIENCE: work with (potentially large) datasets; using appropriate storage technology; applying statistical analysis to draw meaningful conclusions; and using modern machine learning tools to discover hidden patterns.
B5: SOFTWARE DEVELOPMENT: develop a product from the initial stage of requirement / analysis all the way through development to its final stages of testing / evaluation.
B6: PROFESSIONAL PRACTICE: understand professional practices of the modern IT industry which include those technical (e.g. version control / automated testing) but also social, ethical & legal responsibilities.
B7: TRANSFERABLE SKILLS: apply a wide variety of degree level transferable skills including time management, team working, written and verbal presentation to both experts and non-experts, and critical reflection on own and others' work.
B8: ADVANCED WORK: apply the above to advanced topics selected according to the interests of individual students.
Task and Mark distribution:
The report is graded out of 150 and contributes 10 credits towards the module. Resit marks are capped at 40%.
For detailed guidance on mark allocation, see the grading scheme below.
This is also available as a separate Excel document on Aula.
Your original submission has been graded and feedback provided. Using the written feedback, along with the marks for each part, you are required to improve your work before resubmitting it for the resit assessment. For convenience, the details are repeated below.
Please note that work which has not been improved may attract lower marks at the second submission.
Over the course of this module you have been introduced to a range of techniques that may be used for programming a big data project. This assessment allows you to pull together these techniques in a realistic scenario to complete a big data analysis project. Below is a realistic project scenario. By using the techniques presented during class you are to carry out the project and write a final project report for your client.
In line with real-world projects, where the client has rejected your work and requested improvements, work which is not improved in line with the feedback may be marked lower.
You have been approached by a client who analyses atmospheric science and climate model data. They have developed a new analysis technique, but it takes too long to run for them to use it. They have asked you to investigate the use of big data techniques to reduce the processing time.
They have a large volume of data to process, and the analysis needs to be repeated frequently. They have the following basic requirements:
- Current analysis time is approximately 2.5 hours to analyse the climate model output data for a 1-hour time period.
- The data for a single day of model output is approximately 250 MB. However, they have over 100 years’ worth of data to analyse, making a total of over 9 TB.
- Each day, they need to analyse the new data set for that day, so they wish to complete the analysis of the data for a 24-hour period (25 data sets) in under 2 hours.
- It is not possible to hold all of this data in memory at one time, so the new process should load only 1 hour of data for processing at a time. If parallel processing is to occur, then 1 hour of data per worker can be loaded as needed.
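As a rough sizing of the requirements above, the arithmetic can be sketched in a few lines of Python. The figures are those quoted in the list; the variable names are illustrative only, and the worker count assumes ideal linear scaling, which real systems rarely achieve, so it should be read as a lower bound rather than a prediction.

```python
import math

# Figures taken from the requirements above.
HOURS_PER_DATASET = 2.5   # analysis time for 1 hour of model output
DATASETS_PER_DAY = 25     # data sets in one 24-hour period
TARGET_HOURS = 2.0        # the daily analysis must finish in under this
MB_PER_DAY = 250          # approximate size of one day of model output
YEARS = 100

# Time to analyse one day of data with the current sequential process.
sequential_hours = HOURS_PER_DATASET * DATASETS_PER_DAY

# Speedup needed to bring that down to the target.
required_speedup = sequential_hours / TARGET_HOURS

# Assumption: ideal linear scaling. "Under 2 hours" is a strict bound,
# so take the next whole number above the required speedup.
min_workers_ideal = math.floor(required_speedup) + 1

# Archive size in decimal terabytes, ignoring leap days (assumption).
archive_tb = MB_PER_DAY * 365 * YEARS / 1_000_000

print(sequential_hours, required_speedup, min_workers_ideal, archive_tb)
# prints: 62.5 31.25 32 9.125
```

Note that a single data set currently takes 2.5 hours on one worker. If work could only be shared out as whole data sets, no number of workers would bring the total below 2.5 hours, so the target also depends on dividing the work within each data set or on making the analysis itself faster.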
You have been tasked with investigating the use of parallel processing to achieve the analysis speed required, with the following expectations:
- Test and compare the processing speed of sequential and parallel processing.
- Extrapolate your findings to indicate the number of processors required to achieve the target processing time.
- Test how your code responds to common errors, e.g. data that is text instead of numeric, use of NaN in the data as an error code.
- Run automated tests that allow your client to set the tests running and return later to see the results, without user intervention.
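One possible shape for the expectations above is sketched below in Python, using only the standard library. Every name here (`validate`, `analyse_hour`, `run_sequential`, `run_parallel`, `timed`, `workers_needed`) is hypothetical, and the mean calculation is only a stand-in for the client's analysis, which is not described in this brief. The extrapolation uses Amdahl's law, which is one common model and assumes you have estimated the parallel fraction from your own timings.

```python
import concurrent.futures as cf
import math
import time


def validate(values):
    """Return a list of problems found in one hour of data (empty if clean)."""
    if len(values) == 0:
        return ["no data"]
    problems = []
    for i, v in enumerate(values):
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            problems.append(f"index {i}: non-numeric value {v!r}")
        elif isinstance(v, float) and math.isnan(v):
            problems.append(f"index {i}: NaN")
    return problems


def analyse_hour(values):
    """Stand-in analysis (a mean). Returns (result, problems)."""
    problems = validate(values)
    if problems:
        # Report the problem instead of raising, so an unattended
        # batch run carries on with the remaining data sets.
        return None, problems
    return sum(values) / len(values), []


def run_sequential(hours):
    """Analyse each hour of data one after another."""
    return [analyse_hour(h) for h in hours]


def run_parallel(hours, workers, executor_cls=cf.ProcessPoolExecutor):
    """Analyse the hours on a pool of workers, one hour per task."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(analyse_hour, hours))


def timed(fn, *args, **kwargs):
    """Call fn and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - start


def workers_needed(required_speedup, parallel_fraction):
    """Smallest worker count whose Amdahl's-law speedup exceeds the target.

    Amdahl's law: S(n) = 1 / ((1 - p) + p / n) for parallel fraction p.
    Returns None when the serial part alone makes the target unreachable.
    """
    slack = 1.0 / required_speedup - (1.0 - parallel_fraction)
    if slack <= 0:
        return None
    return math.floor(parallel_fraction / slack) + 1
```

A timing comparison would then call `timed(run_sequential, hours)` and `timed(run_parallel, hours, workers)` for several worker counts and log the results to a file, so the run needs no user intervention. On platforms that start worker processes by spawning a fresh interpreter (Windows and macOS by default), the calling code needs to sit under an `if __name__ == "__main__":` guard.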
The data has been provided by the European Centre for Medium-Range Weather Forecasts (ECMWF)