Programming language benchmarks on the AMD64 dual core

Programming language benchmarks
on the AMD64 dual core 3600+

This page is about a simple polynomial benchmark program that was tested on an AMD64 dual core processor running at 2GHz (the '3600+' variety). The 32 bit version is always the same binary that was compiled and run on the P-III. One purpose of this exercise was to see to what extent the existing multiprogramming technology available in Ada 95 (as implemented, for example, by the gnat compiler available with gcc 4.0) leads to speed improvements on multicore processors. Times are in seconds.

Debian Woody ran on the i386 and Ubuntu desktop 64, v6.06 on the AMD64.

In multitasking examples the real time (that is, the elapsed time for the whole command, including system time) is reported and marked (rt), while in the other cases the user time is reported. On the dual core machine the user time is the sum of the times both cores are kept busy, and is less relevant.
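
To make the distinction concrete, here is a minimal C sketch (not part of the original benchmark) that records both elapsed time and processor time around a single precision version of the kernel. The names `kernel`, `timing` and `time_kernel` are made up for illustration. It assumes a C11 library for `timespec_get`; `clock()` reports processor time for the whole process, which on Linux includes system time and is summed over threads, so it only approximates the user time printed by `time`.

```c
#include <time.h>

/* Illustrative workload: the page's polynomial kernel in single precision. */
float kernel(int n, float x) {
    float pol[100], mu = 10.0f, pu = 0.0f;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < 100; j++) {
            mu = (mu + 2.0f) / 2.0f;
            pol[j] = mu;
        }
        float s = 0.0f;
        for (int j = 0; j < 100; j++)
            s = x * s + pol[j];
        pu += s;
    }
    return pu;
}

/* Timings for one run, in seconds. */
typedef struct {
    double real_s;  /* elapsed (wall-clock) time */
    double cpu_s;   /* processor time charged to the process */
    float result;   /* kept so the compiler cannot drop the work */
} timing;

timing time_kernel(int n, float x) {
    struct timespec w0, w1;
    timing t;
    timespec_get(&w0, TIME_UTC);
    clock_t c0 = clock();
    t.result = kernel(n, x);
    clock_t c1 = clock();
    timespec_get(&w1, TIME_UTC);
    t.real_s = (double)(w1.tv_sec - w0.tv_sec)
             + (double)(w1.tv_nsec - w0.tv_nsec) / 1e9;
    t.cpu_s = (double)(c1 - c0) / CLOCKS_PER_SEC;
    return t;
}
```

In a single-threaded run the two figures should be close; with two busy threads on two cores the processor time can approach twice the elapsed time, which is why the real time is the more meaningful number for the tasking rows.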

Some examples are run with 500_000 loops, some with 5_000_000 (designated 'x10'); in the latter case the resulting time is divided by 10 and marked with an 'x' in front of the timing. This serves to illustrate the per-process overhead that appears in some tasking examples.
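
What the 'x10' runs reveal can be sketched with a simple linear model. The model is an assumption of this example, not something measured on this page: a run costs a fixed per-process overhead plus a term proportional to the number of loops. Given a normal timing and an 'x' timing (already divided by 10), the two unknowns follow directly. The names `cost_split` and `split_overhead` are illustrative.

```c
/* Assumed model: t(k) = overhead + k * work, where k is the loop multiplier. */
typedef struct {
    double overhead;  /* fixed cost per process, seconds */
    double work;      /* cost of the 500_000-loop workload itself, seconds */
} cost_split;

/* t1: time of the normal run; tx: time of the x10 run divided by 10. */
cost_split split_overhead(double t1, double tx) {
    cost_split c;
    double t10 = 10.0 * tx;      /* undo the division by 10 */
    c.work = (t10 - t1) / 9.0;   /* t10 - t1 = 9 * work */
    c.overhead = t1 - c.work;    /* t1 = overhead + work */
    return c;
}
```

For instance, a normal timing of 0.53 and an 'x' timing of 0.21 would, under this model, attribute roughly 0.36 s to overhead and 0.17 s to the computation itself.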

Language	P-III	AMD64x2	AMD64x2
	300MHz	3600+, 32 bit	3600+, 64 bit
Ada + hand vectorized F90, gcc-4.0+Intel fortran 10, x100 cases			xx 0.03
vectorizable F90, Intel fortran 10			0.06
C, SSE2 intrinsics by hand, gcc 4.0.3			0.08
C, SSE2 intrinsics by hand, Intel C++			0.08
vectorizable C, Intel C++ v10			0.11
vectorizable C, gcc 4.0.3			0.20
vectorizable F90, gfortran 4.0			0.21
g77 2.9.5/gfortran-4.0 amd64	2.73	0.41	0.40
C, float, hand optimised gcc-amd64 4.0.3	3.65		0.41
hand optimised C, Intel C++, V10			0.41
C, Intel C++, v10			0.81
C, double gcc-i386 V2.95.4/gcc-amd64 4.0.3	3.14	0.41	0.41
Ada, Float gcc-i386 V2.95.4/gcc-amd64 4.0.3	2.76	0.41	0.51
Ada, Long_Float gcc-i386 V2.95.4/gcc-amd64 4.0.3	3.21	0.46	0.51
Ada, non-comm tasks gcc-i386 V2.95.4/gcc-amd64 4.0.3 (rt)	5.59	0.53	0.39
Ada, non-comm tasks, x10 cases (rt)	x 2.78	x 0.21	x 0.28
Ada, comm tasks gcc-i386 V2.95.4/gcc-amd64 4.0.3 (rt)	5.16	0.38	0.34
Ada, comm tasks, x10 cases (rt)	x 4.98	x 0.50	x 0.32
Ada, comm tasks, x10 cases, x1000 comm (rt)	x 5.35	x 0.50	x 0.39
CommonLisp single-float/SBCL 0.9.8			0.60
Perl bytecompiled 5.6.1 i386/5.8.7 amd64	515.04		70.48
R 1.5.1 i386/2.2.1 amd64	5662.84		84.92

Discussion: In all cases the 32 bit binaries, compiled on the i386, ran faster on the AMD64 than 64 bit binaries compiled, with more recent compilers, on the AMD64 itself.

There was some overhead associated with tasking, which occurred once per process and disappeared when the running time was made 10 times longer. The best improvement between i386@300MHz and AMD64x2@3600+ was obtained by running the same 32 bit two-task binary, and was about 13.3 times; on the i386 that binary was not visibly slower than the single task binary.

Communication between tasks, using rendezvous, added more overhead, which seems to have been handled better in 64 bit native mode.

The spectacular improvement in R performance may be due, in part, to improvement in the implementation with V 2.2.1. CommonLisp compiled code is faster than C (can't avoid noticing :)).

However, a very substantial improvement in speed was obtained by employing vectorization features of the intel fortran compiler, in particular when using a data type consisting of an 8-real vector (piece-of-eight :). The seven-fold increase in speed was obtained using a single core.

Conclusion: It seems the best approach is to develop programs by dividing the job between tasks, if possible noncommunicating, in Ada, even when they are compiled on single-core machines. Recompilation on the 64 bit platform seems unnecessary; the portable 32 bit binary is at least as good as the 64 bit one. Combining this with vectorization by a compiler such as the Intel Fortran compiler could lead to the maximum exploitation of the SSE instructions and the multicore facilities.

Ada Long_Float

atestl.adb

with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_Io; use Ada.Text_Io;

procedure Atestl is

   Pol: array(1..100) of Long_Float;
   N: Integer:= Integer'Value(Argument(1));
   X: Long_Float:= Long_Float'Value(Argument(2));
   S: Long_Float;
   Mu: Long_Float:= 10.0;
   Pu: Long_Float:= 0.0;

begin
   for I in 1..N loop
      for J in 1..100 loop
         Mu := (Mu + 2.0) / 2.0;
         Pol(J) := Mu;
      end loop;
      S := 0.0;
      for J in 1..100 loop
         S := X * S + Pol(J);
      end loop;
      Pu := Pu+S;
   end loop;
   Put_Line(Long_Float'Image(Pu));
end Atestl;

Compile and run with:

gcc -c -O6 atestl.adb
gnatmake atestl
time ./atestl 500000 0.2
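
The table includes plain C results, but the C source is not reproduced on this page. As a rough guide only, the Ada kernel above might be rendered in C as follows. This is an illustrative sketch, not the file that was actually timed, and the function name `poly_bench` is made up.

```c
/* Double-precision version of the kernel: n outer loops, evaluation point x. */
double poly_bench(long n, double x) {
    double pol[100];
    double mu = 10.0, pu = 0.0;
    for (long i = 0; i < n; i++) {
        /* Fill the coefficients; mu converges quickly towards 2. */
        for (int j = 0; j < 100; j++) {
            mu = (mu + 2.0) / 2.0;
            pol[j] = mu;
        }
        /* Evaluate the polynomial at x by Horner's rule. */
        double s = 0.0;
        for (int j = 0; j < 100; j++)
            s = x * s + pol[j];
        pu += s;
    }
    return pu;
}
```

A driver would read the loop count and x from the command line, as the Ada program does, and print `poly_bench(n, x)`. With x = 0.2 each outer loop contributes very nearly 2.5 to the sum, since the last coefficients are all 2 and 2/(1 - 0.2) = 2.5.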

Ada Non-communicating tasks

patest0.adb

with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_Io; use Ada.Text_Io;

procedure Patest0 is

   N: Integer:= 250_000;
   X: Float:= 0.2;

   task type Atest;

   task body Atest is

      Pol: array(1..100) of Float;
      S: Float;
      Mu: Float:= 10.0;
      Pu: Float:= 0.0;

   begin
      for I in 1..N loop
         for J in 1..100 loop
            Mu := (Mu + 2.0) / 2.0;
            Pol(J) := Mu;
         end loop;
         S := 0.0;
         for J in 1..100 loop
            S := X * S + Pol(J);
         end loop;
         Pu := Pu+S;
      end loop;
      Put_Line(Float'Image(Pu));
   end Atest;

   At1, At2: Atest;

begin
   null;
end;

Compile and run with:

gcc -c -O6 patest0.adb
gnatmake patest0
time ./patest0

The 250_000 is replaced with 2_500_000 for the 'x10' run.

Ada Communicating tasks

patest.adb

with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_Io; use Ada.Text_Io;

procedure Patest is

   NN: Integer:= 2_500_000;
   Fact: Integer:= 1;
   XX: Float:= 0.2;

   task type Atest is
      entry Compute(N0: Integer; X0: Float);
   end Atest;

   task body Atest is

      Pol: array(1..100) of Float;
      S: Float;
      Mu: Float:= 10.0;
      Pu: Float:= 0.0;
      N: Integer;
      X: Float;

   begin
      loop
         accept Compute(N0: Integer; X0: Float) do
            N := N0;
            X := X0;
         end Compute;
         if N=0 then
            exit;
         end if;
         for I in 1..N loop
            for J in 1..100 loop
               Mu := (Mu + 2.0) / 2.0;
               Pol(J) := Mu;
            end loop;
            S := 0.0;
            for J in 1..100 loop
               S := X * S + Pol(J);
            end loop;
            Pu := Pu+S;
         end loop;
      end loop;
      Put_Line(Float'Image(Pu));
   end Atest;

   At1, At2, At3, At4: Atest;

begin
   Nn := Integer'Value(Argument(1));
   Xx := Float'Value(Argument(2));
   if Argument_Count>2 then
      Fact := Integer'Value(Argument(3));
   end if;
   for L in 1..Fact loop
      At1.Compute(Nn,Xx);
      At2.Compute(Nn,Xx);
      At3.Compute(Nn,Xx);
      At4.Compute(Nn,Xx);
   end loop;
   At1.Compute(0,0.0);
   At2.Compute(0,0.0);
   At3.Compute(0,0.0);
   At4.Compute(0,0.0);
end;

Compile and run with:

gcc -c -O6 patest.adb
gnatmake patest
time ./patest 125000 0.2 1

The third argument gives the 'intensity of communication'. If you use arguments like 125 0.2 1000 it will perform 1000 entry calls.
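
The arithmetic behind these arguments can be sketched in a few lines of C (an illustration with made-up function names): each of the four tasks runs N loops per Compute call, and the main procedure issues Fact rounds of calls followed by one terminating call per task.

```c
/* Number of tasks declared in patest.adb (At1 .. At4). */
enum { TASKS = 4 };

/* Total outer loops executed by all tasks together. */
long total_loops(long n_per_call, long fact) {
    return (long)TASKS * n_per_call * fact;
}

/* Entry calls accepted by each task: fact working calls plus the
   final Compute(0, 0.0) that tells the task to exit. */
long entry_calls_per_task(long fact) {
    return fact + 1;
}
```

Both `total_loops(125000, 1)` and `total_loops(125, 1000)` give 500000, so the two command lines do the same amount of computation and differ only in how often the tasks rendezvous.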

Straight fortran

tespol.f



      program tespol
      dimension pol(100)
      real pol
      integer i,j,n
      real su,pu,mu
      real x

      n = 500000
      x = 0.2
      mu = 10.0
      pu = 0.0
      do i = 1,n
         do j=1,100
            mu = (mu + 2.0) / 2.0
            pol(j) = mu
         enddo
         su = 0.0
         do j=1,100
            su = x * su + pol(j)
         enddo
         pu = pu + su
      enddo
      write (*,*) pu
      end

Compile with:

g77 -o tespol -O6 tespol.f

or ...

ifort -o tespol tespol.f

or ...

gfortran-4.0 -o tespol -O6 tespol.f

Hand vectorized fortran

tespol8.f


      program tespol
      dimension pol(100,8)
      real pol
      integer i,j,n
      real pu,mu
      real x
      dimension su(8)
      real su

      n = 61250
      x = 0.2
      mu = 10.0
      pu = 0.0
      do i = 1,n
         do j=1,100
            mu = (mu + 2.0) / 2.0
            pol(j,:) = mu
         enddo
         su = 0.0
         do j=1,100
            su(:) = x * su(:) + pol(j,:)
         enddo
         pu = pu + sum(su)
      enddo
      write (*,*) pu
      end

Compile with:

ifort -o tespol8 tespol8.f

or ...

gfortran-4.0 -o tespol8 -O6 tespol8.f

Vectorizable C


tepol8.c

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  float mu = 10.0;
  float x, s[8];
  float pu = 0.0;
  int i, j, n, c;
  float pol[100][8];

  if (argc < 3) {
    fprintf(stderr, "usage: %s loops x\n", argv[0]);
    return 1;
  }
  n = atoi(argv[1]);
  x = atof(argv[2]);
  for(i=0; i<n; i++) {
    for (j=0; j<100; j++) {
      mu = (mu + 2.0) / 2.0;
      for(c=0; c<8; c++) /* vectorized by icc */
        pol[j][c] = mu;
    }
    for(c=0; c<8; c++) /* vectorized by icc */
      s[c] = 0.0;
    for (j=0; j<100; j++) {
      for(c=0; c<8; c++) /* vectorized by icc */
        s[c] = x*s[c] + pol[j][c];
    }
    for(c=0; c<8; c++)
      pu += s[c];
  }
  printf("%f\n",pu);
  return 0;
}


Compile and run with:

icc -o tepol8 tepol8.c
time ./tepol8 500000 0.2


or

gcc -O6 -o tepol8 tepol8.c
time ./tepol8 500000 0.2






C using xmm intrinsics


intepol8.c

#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>

float pol[800];
float pubuf[8];

int main(int argc, char **argv) {
  __m128 mu, x, s1, s2, polj1, polj2, two, pu;
  int i, j, n;

  if (argc < 3) {
    fprintf(stderr, "usage: %s loops x\n", argv[0]);
    return 1;
  }
  n = atoi(argv[1]);
  x = _mm_set1_ps(atof(argv[2]));
  /* mu starts at 10 and pu at 0, as in the other versions */
  mu = _mm_set1_ps(10.0);
  two = _mm_set1_ps(2.0);
  pu = _mm_set1_ps(0.0);

  for(i=0; i<n; i++) {
    for (j=0; j<100; j++) {
      mu = _mm_div_ps(_mm_add_ps(mu, two), two);
      _mm_storeu_ps(pol+8*j,mu);
      _mm_storeu_ps(pol+8*j+4,mu);
    }
    s1 = _mm_set1_ps(0.0);
    s2 = _mm_set1_ps(0.0);
    for (j=0; j<100; j++) {
      polj1 = _mm_loadu_ps(pol+8*j);
      polj2 = _mm_loadu_ps(pol+8*j+4);
      s1 = _mm_add_ps(_mm_mul_ps(x,s1), polj1);
      s2 = _mm_add_ps(_mm_mul_ps(x,s2), polj2);
    }
    pu = _mm_add_ps(pu,s1);
    pu = _mm_add_ps(pu,s2);
  }
  /* each lane of pu holds s1 + s2, so four lanes cover all eight sums */
  _mm_storeu_ps(pubuf,pu);
  printf("%f\n",pubuf[0]+pubuf[1]+pubuf[2]+pubuf[3]);
  return 0;
}



Compile and run with:

icc -o intepol8 intepol8.c
time ./intepol8 62500 0.2


or

gcc -O6 -o intepol8 intepol8.c
time ./intepol8 62500 0.2






Vectorizable fortran and Ada multitasking


bipol.adb

with Ada.Text_Io; use Ada.Text_Io;

procedure Bipol is

  pragma Linker_Options("stespol8.o");
  pragma Linker_Options("ztezpol8.o");
  pragma Linker_Options("-ldl");


  function Stespol8 return Float;
  pragma Import(Fortran, Stespol8,"stespol8_");
  function ztezpol8 return Float;
  pragma Import(Fortran, ztezpol8,"ztezpol8_");

  task Ctas1;
  task body Ctas1 is

     Sum: Float:= 0.0;

  begin
     for I in 1..100 loop
        Sum := Sum+Stespol8;
     end loop;
     Put_Line(Float'Image(sum));
  end Ctas1;



  task Ctas2;
  task body Ctas2 is

     Sum: Float:= 0.0;

  begin
     for I in 1..100 loop
        Sum := Sum+Ztezpol8;
     end loop;
     Put_Line(Float'Image(Sum));
  end Ctas2;


begin
  null;
end Bipol;


stespol8.f

      function stespol8()
      dimension pol(100,8)
      real pol
      integer i,j,n
      real pu,mu
      real x
      dimension su(8)
      real su

      n = 3125000
      x = 0.2
      mu = 10.0
      pu = 0.0
      do i = 1,n
         do j=1,100
            mu = (mu + 2.0) / 2.0
            pol(j,:) = mu
         enddo
         su = 0.0
         do j=1,100
            su(:) = x * su(:) + pol(j,:)
         enddo
         pu = pu + sum(su)
      enddo
      stespol8 = pu
      return
      end


ztezpol8.f


      function ztezpol8()
      dimension pol(100,8)
      real pol
      integer i,j,n
      real pu,mu
      real x
      dimension su(8)
      real su

      n = 3125000
      x = 0.2
      mu = 10.0
      pu = 0.0
      do i = 1,n
         do j=1,100
            mu = (mu + 2.0) / 2.0
            pol(j,:) = mu
         enddo
         su = 0.0
         do j=1,100
            su(:) = x * su(:) + pol(j,:)
         enddo
         pu = pu + sum(su)
      enddo
      ztezpol8 = pu
      return
      end

Compile with:

ifort -c -o stespol8.o stespol8.f
ifort -c -o ztezpol8.o ztezpol8.f
gnatmake bipol