Abstract
This HOWTO gives a brief overview of how to optimize software for apeNEXT.
Table of Contents
In this document it is assumed that the source code was compiled in the following way (see HOWTO-next-compile for details):
nlcc -S -os -gp sourcefile.c
mpp -os3 sourcefile.sasm
sofan sourcefile.masm sourcefile-sofan.masm
shaker +a -z sourcefile-sofan.masm
Please keep in mind that optimization in particular depends on the compile chain, so if you observe numbers other than those in the following examples, this may be due to an updated compile chain.
There are two tools for statically analysing the optimization of an apeNEXT program. The nperf script gives a rough overview of how many instructions the different functions use and how well the floating point unit is utilized. The exact execution order of the instructions is written in the microcode, which can be inspected with the dispminit script.
nperf -asm=sourcefile.sasm -c -l -a sourcefile-sofan.mem > sourcefile-sofan.perf
The -asm option tells nperf where to find the source code, which is included in the sasm file if it was compiled with nlcc -gp. The option -l tells the script to also write the labels, -a lets it also print hidden labels, and -c enables instruction counting. A typical output then looks like
0000f1 | =============== sourcefile.c ==============
| complex myFunction(complex * in1, complex * in2) { --> myFunction (L)
| complex sp=0.;
55 5 % C: 0 F: 0 M: 0 X: 0 L: 0 I: 3 IQO: 0/0 0/0 0/0
000128 | for (i=0; i<VOLUME; i++) { --> LBL_5 (S)
| sp+=conj(in1[i])*conj(in2[i]);
| }
56 9 % C: 1 F: 0 M: 0 X: 0 L: 0 I: 4 IQO: 0/0 2/2 0/0
dispminit sourcefile-sofan.mem > sourcefile-sofan.ncd
The file sourcefile-sofan.ncd contains the microcode in human-readable form. Here one can see how the different units of the processor work in parallel. Each line represents one instruction (VLIW, very long instruction word; see the Wikipedia article for details), and each column represents one of these units or an address/register number on which these units will operate.
The first line in the file gives these columns a name:
ADDR DISP C5 MCC STKC FLW IOC AGU ASEL BS5 P5 BS4 C4 P4 MPC BS3 C3 P3 P2 P1 P0 zADDR
MCC      | unit responsible for initiating memory access
IOC      | unit responsible for storing data that arrived from memory in the registers
MPC      | unit for floating point and integer arithmetic
P0 to P2 | pointers to the input registers of the MPC
P3       | pointer to the output register of the MPC
This part of the HOWTO is not finished and will be extended.
Dynamic analysis of the code means running it on apeNEXT and querying the runtime counters. In contrast to static analysis, you also gather information on how long the processor was stalling, i.e. waiting for something to happen, for example data arriving from memory. This helps to identify memory/network bottlenecks and gives the most significant measure of how well the code is optimized.
See HOWTO-runcounters for details.
This part of the HOWTO will explain some optimization techniques using the example of a simple scalar product written in C.
Whether the techniques described here are reasonable depends strongly on the function that has to be optimized. Also keep in mind that one technique may diminish the speedup gained from another.
A first, naive version of the scalar product is code like this:
complex scalar_product(complex * in1, complex * in2) {
    int i;
    complex sp=0.;
    for (i=0; i<VOLUME; i++) {
        sp+=conj(in1[i])*in2[i];
    }
    return sp;
}
What could we expect from this code? For each iteration, the processor has to fetch two values from memory, multiply them, and store the result. In principle, memory access and floating point calculation could be done in parallel, since different units of the processor are responsible for them. Also, the result does not have to be written back to memory every time, since it will be used again in the following operation. Neglecting all loop overhead and assuming this possible parallelisation, the bottleneck is the memory bandwidth, and we are left with at least 2 clock cycles per iteration, since the apeNEXT architecture's memory bandwidth is one word per cycle.
Looking for the label scalar_product in the resulting nperf output, we find something like
000df6 | } --> scalar_product (L)
| complex scalar_product(complex * in1, complex * in2) {
| complex sp=0.;
55 5 % C: 0 F: 0 M: 0 X: 0 L: 0 I: 3 IQO: 0/0 0/0 0/0
000e2d | for (i=0; i<VOLUME; i++) { --> LBL_118 (S)
| sp+=conj(in1[i])*in2[i];
| }
56 9 % C: 1 F: 0 M: 0 X: 0 L: 0 I: 4 IQO: 0/0 2/2 0/0
Here and in the following sections we will give the runtime counters, normalized by the volume. Please keep in mind that these numbers still depend on the volume, since for example the function overhead does not scale with the volume.
The runtime counters are
clk 85.705771 run 49.126877 mbusy 0.009375 mwait 22.949518 nbusy 0.000000 qwait 13.620000
The first optimization we will apply is to put the code of the loop body into the cache. This tells the processor to load the instructions only once at the beginning of the loop, instead of every time the loop body is executed. Since the instructions are situated in the same memory as the data, instruction loading would otherwise compete with the loading of data into the processor.
We instruct the compiler to do this via the #pragma cache, which is a special syntax extension added to the C compiler for apeNEXT.
The downside of this instruction caching is that during the loading procedure of instructions into the cache, no other operations can be performed. It follows that instruction caching only makes sense if the loop body is going to be executed many times.
{
    #pragma cache
    for (i=0; i<VOLUME; i++) {
        sp+=conj(in1[i])*in2[i];
    }
}
Now the nperf output is
000e65 | return sp; --> scalar_product_cache (L)
| complex scalar_product_cache(complex * in1, complex * in2) {
| complex sp=0.;
| {
62 3 % C: 0 F: 0 M: 0 X: 0 L: 0 I: 2 IQO: 0/0 0/0 0/0
000ea3 | --> __DCL_START1 (S)
4 25 % C: 0 F: 0 M: 0 X: 0 L: 0 I: 1 IQO: 0/0 0/0 0/0
000ea7 | for (i=0; i<VOLUME; i++) { --> LBL_124 (S)
| sp+=conj(in1[i])*in2[i];
| }
| }
65 5 % C: 1 F: 0 M: 0 X: 0 L: 0 I: 2 IQO: 0/0 2/2 0/0
000ee8 | --> __DCL_END1 (S)
21 10 % C: 0 F: 0 M: 0 X: 0 L: 0 I: 2 IQO: 0/0 0/0 0/0
The runtime counters are (divided by the volume)
clk 45.840278 run 39.195627 mbusy 0.023125 mwait 0.024767 nbusy 0.000000 qwait 6.596759
A loop produces some overhead at the beginning and end of the loop body, which can become significant if the body itself is very small. It is therefore helpful to duplicate the body, either explicitly or via a pragma that has been introduced for apeNEXT. Here we use the pragma:
{
    #pragma cache
    {
        #pragma unroll 8
        for (i=0; i<VOLUME; i++) {
            sp+=conj(in1[i])*in2[i];
        }
    }
}
The amount of unrolling has to be tuned carefully: if the body is unrolled too much, it becomes very large, and loading it into the cache before the loop starts can easily outweigh the time gained (or the code may not even fit into the cache).
000f3f | for (i=0; i<VOLUME; i++) { --> LBL_130 (S)
| sp+=conj(in1[i])*in2[i];
| }
| }
| }
128 13 % C: 8 F: 0 M: 0 X: 0 L: 0 I: 9 IQO: 0/0 16/16 0/0
The runtime counters are (divided by the volume)
clk 29.011161 run 12.945627 mbusy 0.041250 mwait 0.046468 nbusy 0.000000 qwait 15.977816
The next step is to improve the memory access. For every single load from memory, the processor has to load the address into a specific register and then initiate the memory access. A subsequent access can then be processed at the earliest 4 clock cycles later. But we can instruct the processor to load a larger amount of data into the registers with one instruction. We therefore define a new type which holds (for example) 8 complex numbers:
typedef struct {
    complex c[8];
} b8complex;

complex scalar_product_burst_cache(complex * in1, complex * in2) {
    register int i;
    register complex sp=0.;
    register b8complex a1, a2;
    {
        #pragma cache
        {
            for (i=0; i<VOLUME; i+=8) {
                a1=*((b8complex *) &in1[i]);
                a2=*((b8complex *) &in2[i]);
                sp+=conj(a1.c[0])*a2.c[0];
                sp+=conj(a1.c[1])*a2.c[1];
                sp+=conj(a1.c[2])*a2.c[2];
                sp+=conj(a1.c[3])*a2.c[3];
                sp+=conj(a1.c[4])*a2.c[4];
                sp+=conj(a1.c[5])*a2.c[5];
                sp+=conj(a1.c[6])*a2.c[6];
                sp+=conj(a1.c[7])*a2.c[7];
            }
        }
    }
    return sp;
}
This includes an implicit loop unrolling.
This results in the nperf output for the loop
132 8 % C: 8 F: 0 M: 0 X: 0 L: 0 I: 2 IQO: 0/0 16/2 0/0
The runtime counters are (divided by the volume)
clk 14.357542 run 13.445627 mbusy 0.032500 mwait 0.029160 nbusy 0.000000 qwait 0.850255
The apeNEXT processor is capable of doing one CNORM operation (like c1*c2+c3) per clock cycle, but the result will not be available immediately in the following cycle. As a consequence, each summation in the above example has to wait for the calculation of the previous line, resulting in a long waiting time. A better approach is to remove the dependencies of the calculations by having one variable for each summation, and add up the partial sums after the end of the loop:
for (i=0; i<VOLUME; i+=8) {
    a1=*((b8complex *) &in1[i]);
    a2=*((b8complex *) &in2[i]);
    sp0+=conj(a1.c[0])*a2.c[0];
    sp1+=conj(a1.c[1])*a2.c[1];
    sp2+=conj(a1.c[2])*a2.c[2];
    sp3+=conj(a1.c[3])*a2.c[3];
    sp4+=conj(a1.c[4])*a2.c[4];
    sp5+=conj(a1.c[5])*a2.c[5];
    sp6+=conj(a1.c[6])*a2.c[6];
    sp7+=conj(a1.c[7])*a2.c[7];
}
//partial sum
sp0=sp0+sp1;
sp2=sp2+sp3;
sp4=sp4+sp5;
sp6=sp6+sp7;
sp0=sp0+sp2;
sp4=sp4+sp6;
return sp0+sp4;
The nperf output for the loop is
76 13 % C: 8 F: 0 M: 0 X: 0 L: 0 I: 2 IQO: 0/0 16/2 0/0
The runtime counters are (divided by the volume)
clk 7.175015 run 6.453752 mbusy 0.029375 mwait 0.023264 nbusy 0.000000 qwait 0.668624
To reduce the waiting time for memory accesses, it may be beneficial to use prefetching: the memory access is initiated in one iteration, and the data is then consumed in the following iteration.
complex sp_cache_prefetch(complex * in1, complex * in2) {
    register int i;
    register complex sp=0, a1, a2, *rin1, *rin2;
    // Copy addresses to registers
    rin1=in1;
    rin2=in2;
    // prefetch values for first iteration
    prefetch(*((double *) &(rin1[0])));
    prefetch(*((double *) &(rin2[0])));
    {
        #pragma cache
        {
            for (i=1; i<VOLUME+1; i++) {
                asm("\tPRAGMA_FREE_QTR");
                prefetch(*((double *) &(rin1[i])));
                prefetch(*((double *) &(rin2[i])));
                fetch(a1);
                fetch(a2);
                sp+=conj(a1)*a2;
            }
        }
    }
    // fetch data that was prefetched in the last iteration to empty the queue
    fetch(a1);
    fetch(a2);
    return sp;
}
The asm("\tPRAGMA_FREE_QTR"); instructs the compiler not to take care of dependencies between M2Q and Q2R in that section (this is a workaround until a high-level pragma is introduced in nlcc; check the nlcc documentation for up-to-date information). These are exactly the two commands that initiate the memory access and put the data into the registers of the processor. If there are any other memory accesses, they should be made strictly local (i.e. they should not use the queue) via the #pragma localmem pragma (see the nlcc documentation for details). After the last iteration we have to fetch the extra elements that were prefetched in that iteration, in order to empty the queue. Otherwise, a following memory access would read that data and the program would not behave as expected. This additional fetching adds some overhead to the end of the function.
The nperf output for the loop is
52 6 % C: 1 F: 0 M: 0 X: 0 L: 0 I: 2 IQO: 0/0 2/2 0/0
The runtime counters are (divided by the volume)
clk 30.547697 run 26.207502 mbusy 0.022730 mwait 0.052006 nbusy 0.000000 qwait 4.265459
Using all optimization techniques together, the code could look like this
complex sp_cache_burst_partial_prefetch_unroll(complex * in1, complex * in2) {
    register complex sp0,sp1,sp2,sp3,sp4,sp5,sp6,sp7,*rin1,*rin2;
    register b8complex a1, a2;
    register int i,j;
    register int ind;
    // Copy addresses to registers
    rin1=in1;
    rin2=in2;
    // prefetch values for first iterations
    {
        #pragma unroll
        for (i=0; i<8*8; i+=8) {
            prefetch(*((b8complex *) &(rin1[i])));
            prefetch(*((b8complex *) &(rin2[i])));
        }
    }
    sp0=0.; sp1=0.; sp2=0.; sp3=0.;
    sp4=0.; sp5=0.; sp6=0.; sp7=0.;
    {
        #pragma cache
        {
            #pragma unroll 8
            for (i=8*8; i<VOLUME+8*8; i+=8) {
                asm("\tPRAGMA_FREE_QTR");
                prefetch(*((b8complex *) &(rin1[i])));
                prefetch(*((b8complex *) &(rin2[i])));
                fetch(a1);
                fetch(a2);
                sp0+=conj(a1.c[0])*a2.c[0];
                sp1+=conj(a1.c[1])*a2.c[1];
                sp2+=conj(a1.c[2])*a2.c[2];
                sp3+=conj(a1.c[3])*a2.c[3];
                sp4+=conj(a1.c[4])*a2.c[4];
                sp5+=conj(a1.c[5])*a2.c[5];
                sp6+=conj(a1.c[6])*a2.c[6];
                sp7+=conj(a1.c[7])*a2.c[7];
            }
        }
    }
    // fetch data that was prefetched in the last iterations to empty the queue
    {
        #pragma unroll
        for (j=0; j<8; j++) {
            fetch(a1);
            fetch(a2);
        }
    }
    //partial sum
    sp0=sp0+sp1;
    sp2=sp2+sp3;
    sp4=sp4+sp5;
    sp6=sp6+sp7;
    sp0=sp0+sp2;
    sp4=sp4+sp6;
    return sp0+sp4;
}
The nperf output for the loop is
194 38 % C: 64 F: 0 M: 0 X: 0 L: 0 I: 9 IQO: 0/0 128/16 0/0
The runtime counters are (divided by the volume)
clk 4.000001 run 3.060002 mbusy 0.816691 mwait 0.123307 nbusy 0.000000 qwait 0.000000