# Performance Optimization via Register and Cache Reuse

## Submission Requirements and Instructions

Your submission to iLearn for this project should contain two files: a C source file named mygemm.c and a PDF report that includes all experimental results with the necessary tables, charts, and analysis. Do not zip the two files in any way.

About the mygemm.c file: it is part of the test framework for this project and originally contains several empty matrix multiplication functions that need to be implemented. For the coding part, you only need to complete the matrix multiplication functions in mygemm.c, and all of your code should be in this file. Refer to the manual posted with the framework for more information on how to use the framework. For convenience when testing, you may modify the other source files in the framework while coding and debugging, but remember that when your code is graded, all other files in the framework will be the original versions.

About running your code: both the evaluation of your code and the test framework are based on the Tardis machine, so it is highly recommended that you always run your code on Tardis. The framework manual contains the necessary information on how to run it. In addition, the attached file tardis-tutorial.pdf provides further information and instructions for running code on Tardis.

About corner cases: when the matrix size is not a multiple of the block size, some corner cases arise. Handling them is not mandatory in this project. To avoid them, some matrix sizes in the test framework have been slightly modified. As a reminder, you do not need to deal with the corner cases; however, you may highlight it in your report if you have handled them successfully.
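
To illustrate what such a corner case looks like, here is a minimal sketch of a tiled multiply that still works when n is not a multiple of the block size. The function name `blocked_gemm_any_n`, its signature, and the helper `min_size` are hypothetical and not part of the framework; the sketch assumes row-major n x n matrices and a block size greater than zero. It shows one common way to handle partial tiles, not a required or tuned implementation.

```c
#include <stddef.h>

/* Smaller of two sizes; used to clamp a tile's end to the matrix edge. */
static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Hypothetical helper (not part of the framework): C += A * B for
 * row-major n x n matrices, tiled with block size bs (assumed > 0).
 * Clamping each tile's upper bound to n covers the partial tiles
 * that appear when n is not a multiple of bs. */
void blocked_gemm_any_n(const double *a, const double *b, double *c,
                        size_t n, size_t bs)
{
    for (size_t i0 = 0; i0 < n; i0 += bs) {
        size_t i1 = min_size(i0 + bs, n);          /* end of row tile */
        for (size_t j0 = 0; j0 < n; j0 += bs) {
            size_t j1 = min_size(j0 + bs, n);      /* end of column tile */
            for (size_t k0 = 0; k0 < n; k0 += bs) {
                size_t k1 = min_size(k0 + bs, n);  /* end of inner tile */
                for (size_t i = i0; i < i1; i++)
                    for (size_t j = j0; j < j1; j++) {
                        double r = c[i * n + j];
                        for (size_t k = k0; k < k1; k++)
                            r += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = r;
                    }
            }
        }
    }
}
```

With n = 5 and bs = 2, for example, the last tile in each dimension covers a single row or column instead of two, and the clamped bounds keep every index inside the matrix.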

## Problems

### 1. Register Reuse (50 points).

Part 1. (20 points) Assume your computer can complete 4 double-precision floating-point operations per cycle when the operands are in registers, and that it takes an additional delay of 100 cycles to read or write one operand from or to memory. The clock frequency of your computer is 2 GHz. How long will it take your computer to finish each of the following algorithms, dgemm0 and dgemm1, for n = 1000? How much time is wasted on reading/writing operands from/to memory? Implement dgemm0 and dgemm1 and test them on Tardis with n = 64, 128, 256, 512, 1024, 2048 or n = 66, 126, 258, 510, 1026, 2046 in the framework provided. Check the correctness of your implementation with the framework, and report the time spent in the triple loop for each algorithm, as output by the framework. Calculate the performance (in Gflops) of each algorithm. Performance is often defined as the number of floating-point operations performed per second; a performance of 1 Gflops means 1 billion floating-point operations per second.

    /* dgemm0: simple ijk version triple loop algorithm */
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            for (k=0; k<n; k++)
                c[i*n+j] += a[i*n+k] * b[k*n+j];

    /* dgemm1: simple ijk version triple loop algorithm with register reuse */
    for (i=0; i<n; i++)
        for (j=0; j<n; j++) {
            register double r = c[i*n+j];
            for (k=0; k<n; k++)
                r += a[i*n+k] * b[k*n+j];
            c[i*n+j] = r;
        }
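
As a rough sketch of how the two loops and the Gflops calculation fit together, the fragments can be wrapped in functions and paired with two small helpers. The function signatures and the helpers `gemm_gflops` and `time_gemm` are assumptions made for illustration; the actual function names and timing mechanism in mygemm.c and the framework may differ. The flop count assumes one multiply and one add per inner-loop iteration, that is, 2n^3 operations in total.

```c
#include <time.h>

/* dgemm0 from the listing above, wrapped in a function.
 * The signature is an assumption, not the framework's. */
void dgemm0(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* dgemm1 from the listing above: c[i*n+j] is kept in a local
 * variable, so it is read and written once per (i, j) pair. */
void dgemm1(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            register double r = c[i * n + j];
            for (int k = 0; k < n; k++)
                r += a[i * n + k] * b[k * n + j];
            c[i * n + j] = r;
        }
}

/* Hypothetical helper: Gflops for an n x n multiply that took
 * `seconds`, counting n^3 multiplies plus n^3 adds. */
double gemm_gflops(int n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}

/* Hypothetical helper: CPU time in seconds consumed by one call
 * of `gemm`, measured with the standard clock() function. */
double time_gemm(void (*gemm)(const double *, const double *, double *, int),
                 const double *a, const double *b, double *c, int n)
{
    clock_t start = clock();
    gemm(a, b, c, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A call such as `gemm_gflops(n, time_gemm(dgemm1, a, b, c, n))` would then give the measured rate for one run. For very small n the elapsed time can round to zero at the resolution of `clock()`, so the rate is only meaningful for the larger sizes. When reporting results for the assignment, the time printed by the framework should be used instead of this timer.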