apeNEXT optimization HOWTO

Nils Christian

Abstract

This HOWTO gives a brief overview of how to optimize software for apeNEXT.


Table of Contents

Tools for analyzing performance
Static analysis
Dynamic analysis
Optimization techniques
Hold loop body in cache
Loop unrolling
Burst memory access
Partial sum
Prefetching
Putting it all together

Tools for analyzing performance

In this document it is assumed that the source code was compiled in the following way (see HOWTO-next-compile for details):

nlcc -S -os -gp sourcefile.c
mpp -os3 sourcefile.sasm
sofan sourcefile.masm sourcefile-sofan.masm
shaker +a -z sourcefile-sofan.masm

Please keep in mind that optimization in particular depends on the compile chain, so if you observe numbers different from those in the following examples, this may be due to an updated compile chain.

Static analysis

There are two tools to statically analyse the optimization of an apeNEXT program. The nperf script gives you a rough overview of how many instructions the different functions use and how well the floating point unit is used. The exact execution order of the instructions is written in the microcode, which can be inspected with the dispminit script.

nperf

nperf -asm=sourcefile.sasm -c -l -a sourcefile-sofan.mem > sourcefile-sofan.perf
The -asm option tells nperf where to find the source code, which is included in the sasm file if it was compiled with nlcc -gp. The option -l tells the script to also write the labels, -a additionally prints hidden labels, and -c enables instruction counting. A typical output then looks like

0000f1   | =============== sourcefile.c ==============
         |  complex myFunction(complex * in1, complex * in2) { -->  myFunction (L)
         |    complex sp=0.;
        55   5 %  C: 0  F: 0  M: 0  X: 0  L: 0  I: 3  IQO: 0/0  0/0  0/0

000128   |    for (i=0; i<VOLUME; i++) {                       --> LBL_5 (S)
         |      sp+=conj(in1[i])*conj(in2[i]);
         |    }
        56   9 %  C: 1  F: 0  M: 0  X: 0  L: 0  I: 4  IQO: 0/0  2/2  0/0

Labels are prefixed by -->. After each section (delimited by two labels) a summary of the instructions is given. The counters are
  • C number of CNORM operations
  • F number of FNORM operations
  • M number of LUTMOVE operations
  • X number of LUTCROSS operations
  • L number of LUT operations
  • I number of integer operations
  • IQO number of MTR input operations / load bursts, number of QTR input operations / MTQ load bursts, and number of RTM output operations / store bursts
For up to date documentation, see nperf -h.

Tip

If you use sourcefile-sofan.ncd (produced by dispminit, see below) as input file instead of sourcefile-sofan.mem, you can save some time.
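
For example, the nperf invocation from above stays the same, only the input file changes (assuming the .ncd file has already been generated by dispminit as described below):

nperf -asm=sourcefile.sasm -c -l -a sourcefile-sofan.ncd > sourcefile-sofan.perf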

dispminit

dispminit sourcefile-sofan.mem > sourcefile-sofan.ncd

The file sourcefile-sofan.ncd contains the microcode in a human-readable form. Here one can see how the different units of the processor work in parallel. Each line represents one instruction (VLIW, very long instruction word; see the Wikipedia article for details), each column represents one of these units or an address/register number on which these units will operate.

The first line in the file gives these columns a name:

  ADDR      DISP    C5  MCC   STKC  FLW  IOC  AGU   ASEL BS5 P5 BS4 C4 P4  MPC  BS3 C3 P3 P2 P1 P0          zADDR

For a first understanding it is enough to focus on the following elements:
  • MCC: unit responsible for initiating memory access
  • IOC: unit responsible for storing data that arrived from memory in the registers
  • MPC: unit for floating point and integer arithmetic
  • P0 to P2: pointers to the input registers of the MPC
  • P3: pointer to the output register of the MPC

This part of the HOWTO is not finished and will be extended.

Dynamic analysis

Dynamic analysis of the code means running it on apeNEXT and querying the runtime counters. In contrast to the static analysis, you also gather information on how long the processor was stalling, i.e. waiting for something to happen, for example data arriving from memory. This can help to identify memory/network bottlenecks and gives the most significant measure of how well the code is optimized.

See HOWTO-runcounters for details.

Optimization techniques

This part of the HOWTO will explain some optimization techniques using the example of a simple scalar product written in C.

Whether the techniques described here are reasonable depends strongly on the function that has to be optimized. Also keep in mind that one technique may cancel out the speedup gained by another.

A first, naive version of the scalar product looks like this:

complex scalar_product(complex * in1, complex * in2) {
  int i;
  complex sp=0.;
  for (i=0; i < VOLUME; i++) {
    sp+=conj(in1[i])*in2[i];
  }
  return sp;
}

What can we expect from this code? For each iteration, the processor has to fetch two values from memory, multiply them and accumulate the result. In principle, memory access and floating point calculation can be done in parallel, since different units of the processor are responsible for them. Also, the result does not have to be written back to memory every time, since it will be used again in the following iteration. Neglecting all loop overhead and assuming this parallelisation, the bottleneck is the memory bandwidth and we are left with at least 2 clock cycles per iteration, since the apeNEXT architecture's memory bandwidth is one word per cycle.

Looking for the label scalar_product in the resulting nperf output, we find something like

000df6   |  }                                         --> scalar_product (L)
         |  complex scalar_product(complex * in1, complex * in2) {
         |    complex sp=0.;
        55   5 %  C: 0  F: 0  M: 0  X: 0  L: 0  I: 3  IQO: 0/0  0/0  0/0

000e2d   |    for (i=0; i<VOLUME; i++) {                     --> LBL_118 (S)
         |      sp+=conj(in1[i])*in2[i];
         |    }
        56   9 %  C: 1  F: 0  M: 0  X: 0  L: 0  I: 4  IQO: 0/0  2/2  0/0

The interesting part here is the for-loop: it takes roughly 56 cycles, has a floating point pipeline filling of 9% and performs 1 complex and 4 integer operations. This is rather bad performance, since in theory the processor could perform such operations in every cycle.

Here and in the following sections we will give the runtime counters, normalized by the volume. Please keep in mind that these numbers still depend on the volume, since for example the function overhead does not scale with the volume.

The runtime counters are

clk 85.705771  run 49.126877   mbusy 0.009375  mwait 22.949518 nbusy 0.000000  qwait 13.620000

So on average 85.7 cycles are needed per iteration, of which the machine was in run mode for only 49.1 cycles; the rest of the time it was stalling.

Hold loop body in cache

The first optimization we will apply is to put the code of the loop body into the cache. This tells the processor to load the instructions only once at the beginning of the loop, instead of every time the loop body is executed. Since the instructions reside in the same memory as the data, instruction loading would otherwise compete with the loading of data into the processor.

We instruct the compiler to do this via the #pragma cache, which is a special syntax extension added to the C compiler for apeNEXT.

The downside of this instruction caching is that during the loading procedure of instructions into the cache, no other operations can be performed. It follows that instruction caching only makes sense if the loop body is going to be executed many times.

  {
#pragma cache
    for (i=0; i<VOLUME; i++) {
      sp+=conj(in1[i])*in2[i];
    }
  }

Now the nperf output is

000e65   |    return sp;                        --> scalar_product_cache (L)
         |  complex scalar_product_cache(complex * in1, complex * in2) {
         |    complex sp=0.;
         |    {
        62   3 %  C: 0  F: 0  M: 0  X: 0  L: 0  I: 2  IQO: 0/0  0/0  0/0
000ea3   |                                              --> __DCL_START1 (S)
         4  25 %  C: 0  F: 0  M: 0  X: 0  L: 0  I: 1  IQO: 0/0  0/0  0/0

000ea7   |      for (i=0; i<VOLUME; i++) {                   --> LBL_124 (S)
         |        sp+=conj(in1[i])*in2[i];
         |      }
         |    }
        65   5 %  C: 1  F: 0  M: 0  X: 0  L: 0  I: 2  IQO: 0/0  2/2  0/0
000ee8   |                                                --> __DCL_END1 (S)
        21  10 %  C: 0  F: 0  M: 0  X: 0  L: 0  I: 2  IQO: 0/0  0/0  0/0

It looks as if the loop got longer, but this is not the case: nperf only counts the number of instructions between labels, and the loop already ends before the __DCL_END1 label. Indeed, having a look at the timings (see below), this simple optimization speeds up the scalar product by a factor of 1.9.

The runtime counters are (divided by the volume)

clk 45.840278  run 39.195627   mbusy 0.023125  mwait 0.024767  nbusy 0.000000  qwait 6.596759

Not only has the runtime decreased in comparison with the unoptimized version, but also the time the processor spent stalling.

Loop unrolling

A loop produces some overhead at the beginning and end of the loop body, which can become significant if the body itself is very small. It is therefore helpful to duplicate the body, either explicitly (a sketch of this variant is given at the end of this section) or via a pragma that has been introduced for apeNEXT. Here we use the pragma:

  {
#pragma cache
    {
#pragma unroll 8
      for (i=0; i<VOLUME; i++) {
	sp+=conj(in1[i])*in2[i];
      }
    }
  }

The amount of unrolling has to be carefully tuned: if the body is unrolled too much it becomes very large, and loading it into the cache before the loop starts can easily cancel out the time gained (or the code may not even fit into the cache).

000f3f   |        for (i=0; i<VOLUME; i++) {                 --> LBL_130 (S)
         |      sp+=conj(in1[i])*in2[i];
         |        }
         |      }
         |    }
       128  13 %  C: 8  F: 0  M: 0  X: 0  L: 0  I: 9  IQO: 0/0  16/16  0/0

The static analysis gives 128/8 cycles = 16 cycles per iteration.

The runtime counters are (divided by the volume)

clk 29.011161  run 12.945627   mbusy 0.041250  mwait 0.046468  nbusy 0.000000  qwait 15.977816

Compared to the example above, the stalling time has increased again. This is due to the fact that the memory accesses are now more aggressive.
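
As a comparison, the explicit variant of unrolling mentioned at the beginning of this section could look like the following minimal sketch (using the same variables as in the examples above and assuming that VOLUME is a multiple of 4):

  {
#pragma cache
    for (i=0; i<VOLUME; i+=4) {
      sp+=conj(in1[i  ])*in2[i  ];
      sp+=conj(in1[i+1])*in2[i+1];
      sp+=conj(in1[i+2])*in2[i+2];
      sp+=conj(in1[i+3])*in2[i+3];
    }
  }

The loop counter now advances by 4, so the loop overhead is paid only once per four multiplications, at the price of a larger body.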

Burst memory access

The next step is to improve the memory access. For every single load from memory the processor has to put the address into a specific register and then initiate the memory access; a subsequent access can be processed at the earliest 4 clock cycles later. But we can instruct the processor to load a bigger amount of data into the registers with one instruction. We therefore define a new type which holds (for example) 8 complex numbers:

typedef struct {
  complex c[8];
} b8complex;

complex scalar_product_burst_cache(complex * in1, complex * in2) {
  register int i;
  register complex sp=0.;
  register b8complex a1, a2;

  {
#pragma cache
    {
      for (i=0; i<VOLUME; i+=8) {
        a1=*((b8complex *) &in1[i]);
        a2=*((b8complex *) &in2[i]);
        sp+=conj(a1.c[0])*a2.c[0];
        sp+=conj(a1.c[1])*a2.c[1];
        sp+=conj(a1.c[2])*a2.c[2];
        sp+=conj(a1.c[3])*a2.c[3];
        sp+=conj(a1.c[4])*a2.c[4];
        sp+=conj(a1.c[5])*a2.c[5];
        sp+=conj(a1.c[6])*a2.c[6];
        sp+=conj(a1.c[7])*a2.c[7];
      }
    }
  }
  return sp;
}

This includes an implicit loop unrolling.

This results in the nperf output for the loop

       132   8 %  C: 8  F: 0  M: 0  X: 0  L: 0  I: 2  IQO: 0/0  16/2  0/0

This means that there are 132/8=16.5 instructions per iteration.

The runtime counters are (divided by the volume)

clk 14.357542  run 13.445627   mbusy 0.032500  mwait 0.029160  nbusy 0.000000  qwait 0.850255

As expected, the stalling time has been greatly reduced compared to the bare loop unrolling.

Partial sum

The apeNEXT processor is capable of executing one CNORM operation (like c1*c2+c3) per clock cycle, but the result is not available in the immediately following cycle. As a consequence, each summation in the above example has to wait for the calculation in the previous line, resulting in a long waiting time. A better approach is to remove the dependencies between the calculations by having one variable for each summation and adding up the partial sums after the end of the loop:

  for (i=0; i<VOLUME; i+=8) {
    a1=*((b8complex *) &in1[i]);
    a2=*((b8complex *) &in2[i]);
    sp0+=conj(a1.c[0])*a2.c[0];
    sp1+=conj(a1.c[1])*a2.c[1];
    sp2+=conj(a1.c[2])*a2.c[2];
    sp3+=conj(a1.c[3])*a2.c[3];
    sp4+=conj(a1.c[4])*a2.c[4];
    sp5+=conj(a1.c[5])*a2.c[5];
    sp6+=conj(a1.c[6])*a2.c[6];
    sp7+=conj(a1.c[7])*a2.c[7];
  }
  //partial sum
  sp0=sp0+sp1;
  sp2=sp2+sp3;
  sp4=sp4+sp5;
  sp6=sp6+sp7;

  sp0=sp0+sp2;
  sp4=sp4+sp6;
  return sp0+sp4;

The nperf output for the loop is

        76  13 %  C: 8  F: 0  M: 0  X: 0  L: 0  I: 2  IQO: 0/0  16/2  0/0

Thus there are 76/8=9.5 instructions per multiplication, plus some overhead at the end of the function for the summation.

The runtime counters are (divided by the volume)

clk 7.175015   run 6.453752    mbusy 0.029375  mwait 0.023264  nbusy 0.000000  qwait 0.668624

Prefetching

To reduce the waiting time for memory accesses, it may be beneficial to use prefetching: the memory access is initiated in one iteration and the fetched data is then used in the following iteration.

complex sp_cache_prefetch(complex * in1, complex * in2) {
  register int i;
  register complex sp=0, a1, a2, *rin1, *rin2;
  // Copy addresses to registers
  rin1=in1; rin2=in2;

  // prefetch values for first iteration
  prefetch(*((double *) &(rin1[0])));
  prefetch(*((double *) &(rin2[0])));
  {
#pragma cache
    {
      for (i=1; i<VOLUME+1; i++) {
        asm("\tPRAGMA_FREE_QTR");
        prefetch(*((double *) &(rin1[i])));
        prefetch(*((double *) &(rin2[i])));
        fetch(a1);
        fetch(a2);
        sp+=conj(a1)*a2;
      }
    }
  }
  // fetch data that was prefetched in the last iteration to empty the queue
  fetch(a1);
  fetch(a2);

  return sp;
}

The asm("\tPRAGMA_FREE_QTR"); instructs the compiler not to take care of the dependencies between M2Q and Q2R in that section (this is a workaround until a high-level pragma has been introduced in nlcc; check the nlcc documentation for up-to-date information). These are exactly the two commands that initiate the memory access and put the data into the registers of the processor. If there are any other memory accesses, they should be made strictly local (i.e. they should not use the queue) via the #pragma localmem pragma (see the nlcc documentation for details).

After the last iteration we have to fetch the extra elements that were prefetched in that iteration, in order to empty the queue. Otherwise, a following memory access would read that data and the program would not behave as expected. This additional fetching adds some overhead to the end of the function.
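
As an illustration of the last point, a minimal sketch of such a strictly local access is given below. The array tmp is purely hypothetical, and the placement of #pragma localmem (applied to an enclosing block, in analogy to #pragma cache above) is an assumption here; please check the nlcc documentation for the exact syntax.

        {
#pragma localmem
          // hypothetical additional memory access inside the prefetch loop;
          // made strictly local so it does not go through the queue
          tmp[i] = sp;
        }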

The nperf output for the loop is

        52   6 %  C: 1  F: 0  M: 0  X: 0  L: 0  I: 2  IQO: 0/0  2/2  0/0

The runtime counters are (divided by the volume)

clk 30.547697  run 26.207502   mbusy 0.022730  mwait 0.052006  nbusy 0.000000  qwait 4.265459

Although prefetching is used, the qwait time is not negligible. In this particular case the loop body is probably not long enough to hide the latency.

Putting it all together

Using all optimization techniques together, the code could look like this

complex sp_cache_burst_partial_prefetch_unroll(complex * in1, complex * in2) {
  register complex sp0,sp1,sp2,sp3,sp4,sp5,sp6,sp7,*rin1,*rin2;
  register b8complex a1, a2;
  register int i,j;
  register int ind;

  // Copy addresses to registers
  rin1=in1; rin2=in2;

  // prefetch values for first iteration
  {
#pragma unroll
    for (i=0; i<8*8; i+=8) {
      prefetch(*((b8complex *) &(rin1[i])));
      prefetch(*((b8complex *) &(rin2[i])));
    }
  }
  sp0=0.; sp1=0.; sp2=0.; sp3=0.;
  sp4=0.; sp5=0.; sp6=0.; sp7=0.;
  
  {
#pragma cache
    {
#pragma unroll 8
      for (i=8*8; i<VOLUME+8*8; i+=8) {
	asm("\tPRAGMA_FREE_QTR");
	prefetch(*((b8complex *) &(rin1[i])));
	prefetch(*((b8complex *) &(rin2[i])));
	fetch(a1);
	fetch(a2);
	sp0+=conj(a1.c[0])*a2.c[0];
	sp1+=conj(a1.c[1])*a2.c[1];
	sp2+=conj(a1.c[2])*a2.c[2];
	sp3+=conj(a1.c[3])*a2.c[3];
	sp4+=conj(a1.c[4])*a2.c[4];
	sp5+=conj(a1.c[5])*a2.c[5];
	sp6+=conj(a1.c[6])*a2.c[6];
	sp7+=conj(a1.c[7])*a2.c[7];
      }
    }
  }
  // fetch data that was prefetched in the last iteration to empty the queue
  {
#pragma unroll
    for (j=0; j<8; j++) {
      fetch(a1);
      fetch(a2);
    }
  }

  //partial sum
  sp0=sp0+sp1;
  sp2=sp2+sp3;
  sp4=sp4+sp5;
  sp6=sp6+sp7;

  sp0=sp0+sp2;
  sp4=sp4+sp6;

  return sp0+sp4;
}

The nperf output for the loop is

       194  38 %  C: 64  F: 0  M: 0  X: 0  L: 0  I: 9  IQO: 0/0  128/16  0/0

This is 194/(8*8)=3.03 instructions per iteration.

The runtime counters are (divided by the volume)

clk 4.000001   run 3.060002    mbusy 0.816691  mwait 0.123307  nbusy 0.000000  qwait 0.000000

So the overall speedup compared to the simple scalar_product is roughly 21. The optimal value of 2 cycles per iteration, as estimated at the beginning of this chapter, is not reached due to loop and function overhead.