Performance Optimization via Register and Cache Reuse
Submission Requirements and Instructions
Your iLearn submission for this project should contain 2 files: a C source file called
mygemm.c and a PDF report including all experimental results with the necessary tables,
charts, and analysis. Do not zip the 2 files in any way.
About the mygemm.c file: it is part of the test framework for this project and
originally contains several blank matrix multiplication functions that need to be
implemented. In the coding part, you only need to complete the matrix multiplication
functions in the mygemm.c file, and all your code should be written in this file.
Refer to the manual file posted with the framework for more information on how to use
the framework. For convenience while testing, you can modify the other code
files in the framework during coding and debugging, but remember that when your
code is graded, all the other code files in the framework will be the original versions.
About running your code: your code is evaluated on the Tardis machine, and the test
framework is built for it, so it is highly recommended that you always run
your code on Tardis. The manual file of the framework contains the necessary information
on how to run it. In addition, the attached file tardis-tutorial.pdf
provides further information and instructions for running code on Tardis.
About the corner cases: when the matrix size is not a multiple of the block size, some
corner cases arise. Handling them is not mandatory in this project. To avoid
them, some matrix sizes in the test framework have been slightly modified. In short,
you do not need to deal with the corner cases. However, you can highlight it in
your report if you have succeeded in handling them.
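For readers who do want to handle the corner cases, one common approach is to clamp each block's upper bound to the matrix size, so that the last block in each dimension may be shorter than the others. The following is a minimal sketch, not the framework's required implementation; the names blocked_gemm, min_int, and bs are hypothetical and do not come from the framework, and the matrices are assumed to be square and stored in row-major order.

```c
#include <assert.h>

/* Hypothetical helper: the smaller of two ints. */
static int min_int(int x, int y) { return x < y ? x : y; }

/* Blocked multiply C += A * B for row-major n x n matrices.
 * bs is the block size (assumed name); it does not need to divide n. */
void blocked_gemm(const double *a, const double *b, double *c, int n, int bs)
{
    assert(n > 0 && bs > 0);
    for (int i0 = 0; i0 < n; i0 += bs) {
        int i_end = min_int(i0 + bs, n);      /* clamp: last block may be short */
        for (int j0 = 0; j0 < n; j0 += bs) {
            int j_end = min_int(j0 + bs, n);
            for (int k0 = 0; k0 < n; k0 += bs) {
                int k_end = min_int(k0 + bs, n);
                /* Multiply the (possibly partial) blocks. */
                for (int i = i0; i < i_end; i++)
                    for (int j = j0; j < j_end; j++) {
                        double r = c[i * n + j];   /* keep the running sum in a local */
                        for (int k = k0; k < k_end; k++)
                            r += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = r;
                    }
            }
        }
    }
}
```

With n = 10 and bs = 4, for example, the blocks in each dimension cover indices 0 to 3, 4 to 7, and 8 to 9, so the final block is only 2 wide and no element is read or written out of bounds.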
Problems
1. Register Reuse (50 points).
Part 1. (20 points) Assume your computer can complete 4 double-precision floating-point
operations per cycle when the operands are in registers, and that it takes an additional delay of
100 cycles to read/write one operand from/to memory. The clock frequency of your
computer is 2 GHz. How long will it take for your computer to finish the following
algorithms, dgemm0 and dgemm1, respectively, for n = 1000? How much time is wasted
on reading/writing operands from/to memory? Implement the algorithms dgemm0 and
dgemm1 and test them on TARDIS with n = 64, 128, 256, 512, 1024, 2048 or n = 66,
126, 258, 510, 1026, 2046 in the framework provided. Check the correctness of your
implementation with the framework, and report the time spent in the triple loop for
each algorithm, which is output by the framework. Calculate the performance (in Gflops)
of each algorithm. Performance is often defined as the number of floating-point
operations performed per second. A performance of 1 Gflops means 1 billion
floating-point operations per second.
/*dgemm0: simple ijk version triple loop algorithm*/
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        for (k=0; k<n; k++)
            c[i*n+j] += a[i*n+k] * b[k*n+j];

/*dgemm1: simple ijk version triple loop algorithm with register reuse*/
for (i=0; i<n; i++)
    for (j=0; j<n; j++) {
        register double r = c[i*n+j];
        for (k=0; k<n; k++)
            r += a[i*n+k] * b[k*n+j];
        c[i*n+j] = r;
    }
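The Gflops calculation requested above can be sketched as follows. Both dgemm0 and dgemm1 perform one multiplication and one addition per innermost iteration, so an n x n multiply performs 2n^3 floating-point operations. The function name gemm_gflops is hypothetical and not part of the framework, and the elapsed time is assumed to be the triple-loop time in seconds as reported by the framework.

```c
#include <assert.h>

/* Hypothetical helper: performance in Gflops of an n x n matrix multiply
 * that took `seconds` seconds. The triple loop runs n^3 inner iterations,
 * each with one multiply and one add, for 2 * n^3 operations in total. */
double gemm_gflops(int n, double seconds)
{
    assert(n > 0);
    if (seconds <= 0.0)
        return 0.0;   /* timer resolution too coarse to measure this run */
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

For example, a run with n = 1000 that takes 1.0 second corresponds to 2 * 10^9 operations in one second, so gemm_gflops(1000, 1.0) returns 2.0.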