In this third assignment we will look into optimizing single-threaded performance. To do
so, we will investigate two ways to reduce execution time: (1) improving utilization of the CPU
caches, and (2) reducing pipeline stalls by applying instruction scheduling. Note that this corresponds
to the material discussed in the textbook in the chapters on Memory Hierarchy Design and the
first part of Instruction-Level Parallelism.
This assignment is split into three tasks based on these two aspects. For each task, you
are provided with baseline implementations that need to be improved. First, the
performance of the baseline needs to be analyzed. Second, you come up with a plan to
improve the performance and implement it. Third, you quantify whether this improvement
has any effect. Finally, you are required to write up your findings in a report. The deliverable will
consist of the report and the modified materials that you used to conduct the experiments.
Note that all analyses in your report must be backed with measurement results. We take
nothing for granted! Just reporting execution time before and after optimization is not sufficient.
For instance, for the caching experiments we expect your analysis to be backed by statistics of the
caching subsystem (e.g. the number of cache misses). The correctness of your optimized code is
also assessed: when your code is 20 times as fast, it must of course still produce the correct result
and must not take shortcuts that lead to incorrect results.
Since analysis and optimization are the main aims of this assignment, the quality of the delivered
code will not be assessed. Truth be told, in the real world most code that is being experimented
with is a mess. Only once the solution and effective optimizations have been found is a cleaned-up,
readable version of the code written.
For task 2 some custom software is required: an Alpha toolchain and the SimpleScalar simulator
(please note that we made a few slight modifications to the SimpleScalar simulator, so use the
version we prepared for you and do not compile it on your own). We have prepared installations
of all necessary software in the ca2021 environment. This environment can be enabled as follows:
Note that for this assignment we assume you are using gcc version 7.5 as the main C compiler
(this is the case for the computers in the lab rooms). If you use a different compiler, please do
note this in your report. It could be the case that more recent versions of gcc already perform
some of the transformations that you are requested to do by hand.
In many cases, we want to study the performance of a particular computation, for example matrix
multiplication or an image operation such as converting colors to grayscale. People typically refer
to such “core” computations as computational kernels. Every task in this assignment involves one
or more computational kernels that are provided as part of the materials. A kernel is often wrapped
in a simple main function that is used to start the program and invoke the kernel. Regularly, you
will find that kernels execute in only a few milliseconds. To be able to take good measurements
of execution time, the kernel is usually repeated a number of times (say 50 times, or 500 times,
depending on the execution time of the particular kernel). Take this into account when processing
your measurement results.
The computational kernels that are provided as part of the materials are baseline kernels. You
will be making modifications to these kernels and subsequently compare the performance of the
modified (optimized) code to the baseline. When working on a task, we strongly recommend
making a copy of the baseline and modifying this copy to complete the task. After making a copy
of the file, a new target can be added to the Makefile. This way you can create separate executables
for the baseline and the variants with optimizations, which will be very helpful when performing
your experiments.
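As a sketch, such a Makefile fragment might look as follows; the source file names matmul.c and matmul_opt.c are hypothetical stand-ins for the actual baseline and your optimized copy:

```make
# Hypothetical file names; substitute the actual kernel sources.
CC = gcc
CFLAGS = -O2 -Wall

all: matmul matmul_opt

matmul: matmul.c
	$(CC) $(CFLAGS) -o $@ $<

matmul_opt: matmul_opt.c
	$(CC) $(CFLAGS) -o $@ $<

clean:
	rm -f matmul matmul_opt
```

With separate targets, both executables can be built side by side and measured under identical compiler flags.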
Task 1: CPU caching
In the first task we will investigate the effect of the (classic) loop interchange and loop blocking
transformations on CPU cache performance. Recall that we also discussed these transformations
in class. It is sometimes quite hard to evaluate the effectiveness of caching optimizations on
desktop hardware powered by x86 CPUs. This is due to the fact that multiple processes are
active at a given time, all making use of the cache, and due to aggressive prefetching performed
by the CPU. Therefore, you will conduct experiments based on simulation (using cachegrind as
cache simulator) and on real hardware (by measuring execution time and sampling hardware
performance counters). We expect your report to include results both from the simulation and
from running the program directly on the CPU. A short document on the use of cachegrind
and hardware performance counters for software performance analysis on real hardware can be
found on BrightSpace. Additionally, as an example, a brief report on a short investigation of the
application of loop interchange to a matrix-vector problem can be found on BrightSpace.