Programming language benchmarks
on the AMD64 dual core 3600+
This page is about a simple polynomial benchmark program that was tested on an AMD64 dual-core processor running at 2 GHz (the '3600+' variety). The 32-bit version is always the same binary that was compiled and run on the P-III. One purpose of this exercise was to see to what extent the existing multiprogramming technology available in Ada 95 (as implemented, for example, by the GNAT compiler shipped with gcc 4.0) leads to speed improvements on multicore processors. Times are in seconds.
The i386 ran Debian Woody; the AMD64 ran the 64-bit Ubuntu desktop, v6.06.
In the multitasking examples the real time (that is, the elapsed time for the whole command, including system time) is reported and marked '(rt)'; in the other cases the user time is reported. On the dual-core machine the user time is the sum of the times both cores are kept busy, and is less relevant.
Some examples are run with 500_000 loops and some with 5_000_000 (designated 'x10'). For the latter, the resulting time is divided by 10 and marked with an 'x' in front of the timing. This serves to illustrate the per-process overhead that appears in some tasking examples.
Discussion: In all cases the 32-bit binaries, compiled on the i386, ran faster on the AMD64 than the 64-bit binaries compiled, with more recent compilers, on the AMD64 itself.
There was some overhead associated with tasking, which occurred once per process and disappeared when the running time was made 10 times longer. The best improvement between i386@300MHz and AMD64x2@3600+ was obtained by running the same 32-bit two-task binary, and was about 13.3 times; that binary was not visibly slower than the single-task binary on the i386.
Communication between tasks, using rendezvous, added more overhead, which seems to have been handled better in 64-bit native mode.
The spectacular improvement in R performance may be due, in part, to improvement in the implementation with V 2.2.1. CommonLisp compiled code is faster than C (can't avoid noticing :)).
However, a very substantial improvement in speed was obtained by employing the vectorization features of the Intel Fortran compiler, in particular when using a data type consisting of an 8-real vector (piece-of-eight :). The seven-fold increase in speed was obtained using a single core.
Conclusion: It seems the best approach is to develop programs by dividing the job between tasks, if possible non-communicating, in Ada, even when they are compiled on single-core machines. Recompilation on the 64-bit platform seems unnecessary: the portable 32-bit binary is at least as good as the 64-bit one. Combining this with vectorization by a compiler such as the Intel Fortran compiler could lead to the maximum exploitation of the SSE instructions and the multicore facilities.
atestl.adb
with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_Io; use Ada.Text_Io;

procedure Atestl is
   Pol: array(1..100) of Long_Float;
   N: Integer:= Integer'Value(Argument(1));
   X: Long_Float:= Long_Float'Value(Argument(2));
   S: Long_Float;
   Mu: Long_Float:= 10.0;
   Pu: Long_Float:= 0.0;
begin
   for I in 1..N loop
      for J in 1..100 loop
         Mu := (Mu + 2.0) / 2.0;
         Pol(J) := Mu;
      end loop;
      S := 0.0;
      for J in 1..100 loop
         S := X * S + Pol(J);
      end loop;
      Pu := Pu+S;
   end loop;
   Put_Line(Long_Float'Image(Pu));
end Atestl;

Compile and run with:

gcc -c -O6 atestl.adb
gnatmake atestl
time ./atestl 500000 0.2
patest0.adb
with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_Io; use Ada.Text_Io;

procedure Patest is
   N: Integer:= 250_000;
   X: Float:= 0.2;

   task type Atest;

   task body Atest is
      Pol: array(1..100) of Float;
      S: Float;
      Mu: Float:= 10.0;
      Pu: Float:= 0.0;
   begin
      for I in 1..N loop
         for J in 1..100 loop
            Mu := (Mu + 2.0) / 2.0;
            Pol(J) := Mu;
         end loop;
         S := 0.0;
         for J in 1..100 loop
            S := X * S + Pol(J);
         end loop;
         Pu := Pu+S;
      end loop;
      Put_Line(Float'Image(Pu));
   end Atest;

   At1, At2: Atest;
begin
   null;
end;

Compile and run with:

gcc -c -O6 patest0.adb
gnatmake patest0
time ./patest0

The 250_000 is replaced with 2_500_000 for the 'x10' run.
patest.adb
with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_Io; use Ada.Text_Io;

procedure Patest is
   NN: Integer:= 2_500_000;
   Fact: Integer:= 1;
   XX: Float:= 0.2;

   task type Atest is
      entry Compute(N0: Integer; X0: Float);
   end Atest;

   task body Atest is
      Pol: array(1..100) of Float;
      S: Float;
      Mu: Float:= 10.0;
      Pu: Float:= 0.0;
      N: Integer;
      X: Float;
   begin
      loop
         accept Compute(N0: Integer; X0: Float) do
            N := N0;
            X := X0;
         end Compute;
         if N=0 then
            exit;
         end if;
         for I in 1..N loop
            for J in 1..100 loop
               Mu := (Mu + 2.0) / 2.0;
               Pol(J) := Mu;
            end loop;
            S := 0.0;
            for J in 1..100 loop
               S := X * S + Pol(J);
            end loop;
            Pu := Pu+S;
         end loop;
      end loop;
      Put_Line(Float'Image(Pu));
   end Atest;

   At1, At2, At3, At4: Atest;
begin
   Nn := Integer'Value(Argument(1));
   Xx := Float'Value(Argument(2));
   if Argument_Count>2 then
      Fact := Integer'Value(Argument(3));
   end if;
   for L in 1..Fact loop
      At1.Compute(Nn,Xx);
      At2.Compute(Nn,Xx);
      At3.Compute(Nn,Xx);
      At4.Compute(Nn,Xx);
   end loop;
   At1.Compute(0,0.0);
   At2.Compute(0,0.0);
   At3.Compute(0,0.0);
   At4.Compute(0,0.0);
end;

Compile and run with:

gcc -c -O6 patest.adb
gnatmake patest
time ./patest 125000 0.2 1

The third argument gives the 'intensity of communication'. If you use arguments like 125 0.2 1000 it will perform 1000 entry calls to each task.
tespol.f
      program tespol
      dimension pol(100)
      real pol
      integer i,j,n
      real su,pu,mu
      real x
      n = 500000
      x = 0.2
      mu = 10.0
      pu = 0.0
      do i = 1,n
         do j=1,100
            mu = (mu + 2.0) / 2.0
            pol(j) = mu
         enddo
         su = 0.0
         do j=1,100
            su = x * su + pol(j)
         enddo
         pu = pu + su
      enddo
      write (*,*) pu
      end

Compile with:

g77 -o tespol -O6 tespol.for ...
ifort -o tespol tespol.for ...
gfortran-4.0 -o tespol -O6 tespol.f
tespol8.f
      program tespol
      dimension pol(100,8)
      real pol
      integer i,j,n
      real pu,mu
      real x
      dimension su(8)
      real su
      n = 61250
      x = 0.2
      mu = 10.0
      pu = 0.0
      do i = 1,n
         do j=1,100
            mu = (mu + 2.0) / 2.0
            pol(j,:) = mu
         enddo
         su = 0.0
         do j=1,100
            su(:) = x * su(:) + pol(j,:)
         enddo
         pu = pu + sum(su)
      enddo
      write (*,*) pu
      end

Compile with:

ifort -o tespol tespol.for ...
gfortran-4.0 -o tespol -O6 tespol.f
C with loops vectorized by the compiler
tepol8.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  float mu = 10.0;
  float x, s[8];
  float pu = 0.0;
  int i, j, n, c;
  float pol[100][8];

  if (argc < 3) return 1;   /* expects: loops x */
  n = atol(argv[1]);
  x = atof(argv[2]);
  for (i=0; i<n; i++) {
    for (j=0; j<100; j++) {
      mu = (mu + 2.0) / 2.0;
      for (c=0; c<8; c++) /* vectorized by icc */
        pol[j][c] = mu;
    }
    for (c=0; c<8; c++) /* vectorized by icc */
      s[c] = 0.0;
    for (j=0; j<100; j++) {
      for (c=0; c<8; c++) /* vectorized by icc */
        s[c] = x*s[c] + pol[j][c];
    }
    for (c=0; c<8; c++)
      pu += s[c];
  }
  printf("%f\n",pu);
  return 0;
}

Compile and run with:

icc -o tepol8 tepol8.c
time ./tepol8 500000 0.2

or

gcc -O6 -o tepol8 tepol8.c
time ./tepol8 500000 0.2
C using xmm intrinsics
intepol8.c
#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>

float pol[800];
float pubuf[8];

int main(int argc, char **argv)
{
  __m128 mu, x, s1, s2, polj1, polj2, two, pu;
  int i, j, n;

  if (argc < 3) return 1;   /* expects: loops x */
  n = atol(argv[1]);
  x = _mm_set1_ps(atof(argv[2]));
  /* same starting values as the other versions: mu = 10, pu = 0 */
  mu = _mm_set1_ps(10.0);
  two = _mm_set1_ps(2.0);
  pu = _mm_set1_ps(0.0);
  for (i=0; i<n; i++) {
    for (j=0; j<100; j++) {
      mu = _mm_div_ps(_mm_add_ps(mu, two), two);
      _mm_storeu_ps(pol+8*j,mu);     /* lanes 0..3 of coefficient j */
      _mm_storeu_ps(pol+8*j+4,mu);   /* lanes 4..7 of coefficient j */
    }
    s1 = _mm_set1_ps(0.0);
    s2 = _mm_set1_ps(0.0);
    for (j=0; j<100; j++) {
      polj1 = _mm_loadu_ps(pol+8*j);
      polj2 = _mm_loadu_ps(pol+8*j+4);
      s1 = _mm_add_ps(_mm_mul_ps(x,s1), polj1);
      s2 = _mm_add_ps(_mm_mul_ps(x,s2), polj2);
    }
    pu = _mm_add_ps(pu,s1);
    pu = _mm_add_ps(pu,s2);
  }
  _mm_storeu_ps(pubuf,pu);
  printf("%f\n",pubuf[0]+pubuf[1]+pubuf[2]+pubuf[3]);
  return 0;
}

Compile and run with:

icc -o intepol8 intepol8.c
time ./intepol8 62500 0.2

or

gcc -O6 -o intepol8 intepol8.c
time ./intepol8 62500 0.2
Vectorizable Fortran and Ada multitasking
bipol.adb
with Ada.Text_Io; use Ada.Text_Io;

procedure Bipol is
   pragma Linker_Options("stespol8.o");
   pragma Linker_Options("ztezpol8.o");
   pragma Linker_Options("-ldl");

   function Stespol8 return Float;
   pragma Import(Fortran, Stespol8,"stespol8_");

   function ztezpol8 return Float;
   pragma Import(Fortran, ztezpol8,"ztezpol8_");

   task Ctas1;
   task body Ctas1 is
      Sum: Float:= 0.0;
   begin
      for I in 1..100 loop
         Sum := Sum+Stespol8;
      end loop;
      Put_Line(Float'Image(sum));
   end Ctas1;

   task Ctas2;
   task body Ctas2 is
      Sum: Float:= 0.0;
   begin
      for I in 1..100 loop
         Sum := Sum+Ztezpol8;
      end loop;
      Put_Line(Float'Image(Sum));
   end Ctas2;
begin
   null;
end Bipol;

stespol8.f
      function stespol8()
      dimension pol(100,8)
      real pol
      integer i,j,n
      real pu,mu
      real x
      dimension su(8)
      real su
      n = 3125000
      x = 0.2
      mu = 10.0
      pu = 0.0
      do i = 1,n
         do j=1,100
            mu = (mu + 2.0) / 2.0
            pol(j,:) = mu
         enddo
         su = 0.0
         do j=1,100
            su(:) = x * su(:) + pol(j,:)
         enddo
         pu = pu + sum(su)
      enddo
      stespol8 = pu
      return
      end

ztezpol8.f
      function ztezpol8()
      dimension pol(100,8)
      real pol
      integer i,j,n
      real pu,mu
      real x
      dimension su(8)
      real su
      n = 3125000
      x = 0.2
      mu = 10.0
      pu = 0.0
      do i = 1,n
         do j=1,100
            mu = (mu + 2.0) / 2.0
            pol(j,:) = mu
         enddo
         su = 0.0
         do j=1,100
            su(:) = x * su(:) + pol(j,:)
         enddo
         pu = pu + sum(su)
      enddo
      ztezpol8 = pu
      return
      end

Compile with:

ifort -c -o stespol8.o stespol8.f
ifort -c -o ztezpol8.o ztezpol8.f
gnatmake bipol
Copyright (c) 2003-2007 Alexandru Dan Corlan et al.