In the third assignment we experimented with optimizations around CPU caching and instruction
scheduling, putting some things into practice from the material discussed in the textbook chapters
on Memory Hierarchy Design and Instruction-Level Parallelism. This final assignment will consider
some aspects from the chapters on Data-Level Parallelism (DLP) and Thread-Level Parallelism
(TLP). In fact, you will be adapting a program in order to explicitly expose parallelism on
multi-core processors and GPUs.
As in the third assignment, this assignment is structured around tasks. Each task involves the
implementation of a performance improvement (in this case parallelization), followed by
experiments to analyze and quantify its effectiveness. As deliverables we again expect a report and
the modified materials that were used to conduct the experiments.
The tasks are built upon the same image processing kernels as used in assignment 3. For
completeness' sake we have again included a brief explanation of some aspects of this framework as
A suitable version of CUDA, which is required for the task on GPU programming, is not installed
on the LIACS workstations by default. We have prepared installations of this software in the
ca2021 environment. This environment can be enabled as follows:
Note that remotelx (huisuil) is not equipped with a GPU. However, all LIACS lab computers
are equipped with an NVIDIA GPU.
1 Processing frames of a video
All tasks in this assignment revolve around a program that applies image operations to a video,
that is, to a sequence of images instead of a single image. To spare you from dealing with video formats, we
have expanded (part of) a video into frames, and stored these as separate PNG files. You can find
this sequence of frames in the directory with test data files:
The idea is to apply image compositing and the grayscale effect on all frames of this video.
A sequential single-core implementation is provided in the materials for this assignment, filename
video-cpu.cpp. This baseline implementation simply processes the frames one by one on a single
The result of the program will be a directory containing the modified frames. If you want to
see the result as a video, you can use the command (ffmpeg must be installed):
ffplay -framerate 30 -i frame%04d.png
You can also reassemble the modified frames into a new MP4 file. This can be achieved with the
ffmpeg -r 30 -f image2 -i frame%04d.png -vcodec libx264 -crf 25 \
-pix_fmt yuv420p -s 1920x1024 test.mp4
After completion of this command, open test.mp4 with a video player.
You may assume that for the image operations, the input image will always have dimensions
that are a multiple of 64 (but are not necessarily square). This property will greatly simplify
optimization, since many corner cases do not have to be checked and do not have to be dealt with.
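To illustrate why the multiple-of-64 guarantee helps, here is a minimal sketch of a tiled grayscale loop that needs no remainder handling at the image borders. It is not taken from the assignment framework: the function name grayscale_tiled, the constant kTile, the interleaved RGB buffer layout, and the integer luma weights are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tile size; 64 matches the guaranteed dimension multiple.
constexpr std::size_t kTile = 64;

// Convert an interleaved 8-bit RGB image to 8-bit grayscale, tile by tile.
// The integer weights 77/150/29 sum to 256 and approximate 0.299/0.587/0.114
// (an assumption; the framework may use different weights).
std::vector<std::uint8_t> grayscale_tiled(const std::vector<std::uint8_t>& rgb,
                                          std::size_t width, std::size_t height)
{
    // Because both dimensions are multiples of the tile size, every tile is
    // complete and no partial-tile corner cases exist.
    assert(width % kTile == 0 && height % kTile == 0);
    assert(rgb.size() == width * height * 3);

    std::vector<std::uint8_t> gray(width * height);
    for (std::size_t ty = 0; ty < height; ty += kTile) {
        for (std::size_t tx = 0; tx < width; tx += kTile) {
            // Inner loops always run exactly kTile iterations.
            for (std::size_t y = ty; y < ty + kTile; ++y) {
                for (std::size_t x = tx; x < tx + kTile; ++x) {
                    const std::size_t i = y * width + x;
                    const unsigned r = rgb[3 * i];
                    const unsigned g = rgb[3 * i + 1];
                    const unsigned b = rgb[3 * i + 2];
                    gray[i] = static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
                }
            }
        }
    }
    return gray;
}
```

Without the guarantee, the two inner loops would need upper bounds such as min(ty + kTile, height), which complicates both vectorization and the mapping onto GPU thread blocks.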
Task 1: Multi-core programming
The first task is to implement two CPU multi-core variants (employing Thread-Level Parallelism
(TLP)) of the video processing code:
• Data parallelism: iterate over the images and process individual images in parallel. So, a
single image is divided over the CPU cores and all cores work together on the same image.
• Task parallelism: distribute the images over the cores and thus process multiple images in
parallel. So in this case every CPU core operates on an individual image.
We suggest you use OpenMP to implement these approaches. Use Internet resources to find out
how to use OpenMP. Make two copies of the baseline file video-cpu.cpp and add the required
OpenMP pragmas. Do not forget to add the new targets to the Makefile.
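As a rough illustration of the difference between the two variants, the sketch below applies a placeholder operation to a list of frames in both ways. The Frame struct, process_pixel, and the two function names are hypothetical and do not come from video-cpu.cpp; the real code works on the framework's own image type and kernels. The schedule clauses are only one possible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one video frame: 8-bit samples, row-major.
struct Frame {
    int width = 0;
    int height = 0;
    std::vector<std::uint8_t> pixels;
};

// Placeholder per-pixel operation (inversion) standing in for the real kernels.
inline std::uint8_t process_pixel(std::uint8_t v)
{
    return static_cast<std::uint8_t>(255 - v);
}

// Data parallelism: frames are handled one after another, and the rows of
// each single frame are divided over the threads.
void process_data_parallel(std::vector<Frame>& frames)
{
    for (Frame& f : frames) {
        assert(f.pixels.size() == static_cast<std::size_t>(f.width) * f.height);
        #pragma omp parallel for schedule(static)
        for (int y = 0; y < f.height; ++y) {
            for (int x = 0; x < f.width; ++x) {
                const std::size_t i = static_cast<std::size_t>(y) * f.width + x;
                f.pixels[i] = process_pixel(f.pixels[i]);
            }
        }
    }
}

// Task parallelism: the frames themselves are distributed over the threads,
// and each frame is processed sequentially by exactly one thread.
void process_task_parallel(std::vector<Frame>& frames)
{
    const int n = static_cast<int>(frames.size());
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        Frame& f = frames[i];
        for (std::uint8_t& v : f.pixels) {
            v = process_pixel(v);
        }
    }
}
```

With GCC or Clang the pragmas take effect only when compiling with -fopenmp; without that flag they are ignored and both functions run sequentially, which is a convenient way to check that the parallel versions produce the same output as the baseline.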
In the experiments we expect you to analyze and compare the performance of both approaches.
Take the following into account:
1. Perform the experiments on a computer with at least 4 cores, preferably 6 or more. The
computers in the LIACS lab rooms have at least 6 physical cores (but are HyperThreaded,
so 12 logical cores are available). You can check the number of cores and threads using the
2. Perform experiments with different numbers of cores. Which approach scales better when
more cores are allocated to this process? You can use the -n command line option of
video-cpu to configure the number of cores to use.
3. Compare overall execution time and also consider the processing times of individual frames.
For each frame the ‘load time’ and ‘frame time’ are reported. The ‘load time’ is the time
required to read the PNG file into memory. The ‘frame time’ is the time required to apply
the image operations (tiling and grayscale) to this image.
4. Repeat the measurements, for instance 3 to 5 times.
5. Note that video-cpu has a -r command line option to configure the number of times the
experiment should be repeated. -c outputs all experiment results in CSV format. Take
advantage of this! You can further process the results in CSV using a script or in a spreadsheet.