Re: comparing CPUs running blowfish [Re: fastest blowfish.asm?]

Eric Young (eay@cryptsoft.com)
Wed, 25 Mar 1998 15:41:27 +1000 (EST)


On Wed, 25 Mar 1998, Adam Back wrote:
> This could be read as a story about the vagaries of attempting to tune
> assembler language on the various pentiums and clones...

:-), you think that is bad, try playing with the bf_opts.c program.
It tweaks the different build options available for the inner loop in the C
code. For my current version (0.8.2?), on a pentium 133, NT,

options        BF ecb/s
<nothing>     243567.89   100.0%
ptr           220833.43    90.7%
ptr2          120301.05    49.4%

For a pentium pro 200, linux
options        BF ecb/s
<nothing>     886146.89   100.0%
ptr           837613.43    94.5%
ptr2          538408.93    60.8%
bf-686.pl     907446.56   102.4%

I strongly suspect I have the default C options for gcc-x86 stuffed up in
the 0.8.2 release. I probably did not check them much because of the asm
availability. As you can see, the C code is damn close to the asm on the
pentium pro, and this was just compiled with gcc -O3 -fomit-frame-pointer.

The different inner loop variations help different architectures
quite a bit, but since I'm now in an x86/sparc2 environment, it is rather hard
to test these things :-).
Hmm... looking at bf_locl.h, I turn on BF_PTR2 for all x86 boxes, very very
bad :-(. I think this is a cut&paste error from the DES code, where
the DES_PTR2 inner loop variant was a big win for x86.
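
For readers without the source to hand, here is a rough sketch of what "inner loop variants" can look like for a Blowfish-style round function. All three forms compute the same value and differ only in how the S-box addresses are formed. The names (sk_f_index, sk_f_scaled, sk_f_flat), the flat key layout and the junk table contents are assumptions made for illustration; they are loosely modeled on what the option names ptr and ptr2 suggest, and the real macros in bf_locl.h may well differ.

```c
#include <assert.h>
#include <stdint.h>

#define SK_NSUB     18                    /* 16 rounds + 2 subkeys */
#define SK_KEYWORDS (SK_NSUB + 4 * 256)   /* subkeys, then four S-boxes */

/* Fill the key array with deterministic junk (NOT real Blowfish tables). */
static void sk_fill(uint32_t key[SK_KEYWORDS])
{
    uint32_t x = 0x12345678u;
    for (int i = 0; i < SK_KEYWORDS; i++) {
        x = x * 1664525u + 1013904223u;
        key[i] = x;
    }
}

/* Variant 1: plain array subscripts into the S-box area. */
static uint32_t sk_f_index(const uint32_t *key, uint32_t r)
{
    const uint32_t *s = key + SK_NSUB;
    return ((s[(r >> 24) & 0xff] + s[0x100 + ((r >> 16) & 0xff)])
            ^ s[0x200 + ((r >> 8) & 0xff)])
           + s[0x300 + (r & 0xff)];
}

/* Load a word at a byte offset that is already a multiple of 4. */
static uint32_t sk_load(const uint32_t *tbl, uint32_t byte_off)
{
    return *(const uint32_t *)((const unsigned char *)tbl + byte_off);
}

/* Variant 2: shift and mask so each byte arrives pre-scaled by 4. */
static uint32_t sk_f_scaled(const uint32_t *key, uint32_t r)
{
    const uint32_t *s = key + SK_NSUB;
    return ((sk_load(s, (r >> 22) & 0x3fc)
             + sk_load(s + 0x100, (r >> 14) & 0x3fc))
            ^ sk_load(s + 0x200, (r >> 6) & 0x3fc))
           + sk_load(s + 0x300, (r << 2) & 0x3fc);
}

/* Variant 3: one base pointer for the whole key, offsets folded in. */
static uint32_t sk_f_flat(const uint32_t *key, uint32_t r)
{
    uint32_t t;
    t  = key[SK_NSUB +   0 + ((r >> 24) & 0xff)];
    t += key[SK_NSUB + 256 + ((r >> 16) & 0xff)];
    t ^= key[SK_NSUB + 512 + ((r >>  8) & 0xff)];
    t += key[SK_NSUB + 768 + (r & 0xff)];
    return t;
}

/* Check that the three variants agree on a stream of inputs. */
static void sk_selftest(void)
{
    uint32_t key[SK_KEYWORDS];
    uint32_t r = 0xdeadbeefu;
    sk_fill(key);
    for (int i = 0; i < 1000; i++) {
        assert(sk_f_index(key, r) == sk_f_scaled(key, r));
        assert(sk_f_index(key, r) == sk_f_flat(key, r));
        r = r * 1664525u + 1013904223u;
    }
}
```

Calling sk_selftest() from a main() confirms the variants are interchangeable, so which one is faster on a given CPU and compiler is purely a matter of measurement.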

> Follows is timings of a AMD K6 MMX, AMD K5 and Intel MMX all clocked
> at 166 Mhz.

I can do pentium 100/133 and ppro 200. What program are you using for timing?
For the fast ciphers, the byte pack/unpack overhead can cost a lot, so the
specific function being used matters quite a bit. I assume the speed program?
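
To make the pack/unpack point concrete, here is a minimal sketch of the extra work a byte-oriented entry point does around a word-oriented core. The function names are hypothetical, big-endian byte order is assumed, and sk_words_encrypt is a trivial stand-in rather than a cipher.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack 4 bytes (big-endian, an assumption here) into one 32-bit word. */
static uint32_t sk_pack32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Unpack one 32-bit word back into 4 bytes, big-endian. */
static void sk_unpack32(uint32_t v, unsigned char *p)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Stand-in for a word-level block routine (NOT a real cipher):
 * swaps the halves and XORs one of them with a constant. */
static void sk_words_encrypt(uint32_t d[2])
{
    uint32_t t = d[0];
    d[0] = d[1] ^ 0xA5A5A5A5u;
    d[1] = t;
}

/* Byte-level wrapper: 8 byte loads, shifts and ORs going in,
 * 8 shifts and byte stores coming out, on top of the core. */
static void sk_bytes_encrypt(const unsigned char in[8], unsigned char out[8])
{
    uint32_t d[2];
    d[0] = sk_pack32(in);
    d[1] = sk_pack32(in + 4);
    sk_words_encrypt(d);
    sk_unpack32(d[0], out);
    sk_unpack32(d[1], out + 4);
}
```

When the core is only a few dozen instructions, timing sk_bytes_encrypt and timing sk_words_encrypt give noticeably different numbers, so figures are only comparable when the same entry point is measured.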

> The Intel MMX really wins with Eric's VTuned code as compared to the
> other CPUs. Interestingly the AMD k6 seems to do generally better on
> the C code version. I am left wondering how an AMD k6 would perform
> if there were a hand tuned version targetted for it. (I guess VTune
> doesn't know about AMD k6 specific coding tricks).

The problem for the pentium is avoiding delays caused by using a register just
loaded as an index register (Address Generation Interlock stalls). If things
can be diddled around this problem, things really cook. Obviously there are
also other issues, like which instruction goes in which execution unit.
VTune is really really good if you are into this kind of thing. It shows all
the gory details in nice colours and gives explanations of what causes the
stalls. If intel has VTune for the Merced (which I expect them to), things
will be really nice (assuming the first generation or 2 will have some weird
things re: timings). If intel is doing the 'push all the smarts into the
compiler' thing again, expect there to be some strange things going on :-).

For the pentium pro, most things are hidden, and you
just have to avoid doing byte loads into registers followed by a read of the
32bit register. VTune does not help anywhere near as much. With register
renaming, all kinds of weird things can be happening under the covers, so it
then comes down to using common sense about data dependencies.

> None of the CPUs seem to like the gcc (C only) code in libbf-0.8.2b
> (at least under linux with plain gcc-2.7.2). the libbf-0.7.2m C code
> is performing much faster. Unless I am doing something dumb, or there
> is a bug in the timing code for one of the versions.

Try editing bf_locl.h to remove BF_PTR2 being defined for x86 as mentioned
above. I put in my (well, Peter Gutmann's actually :-) 'pick best option for
CPU/compiler' and did not actually test it :-(.
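
One way to keep a 'pick best option' header honest is to default to the plain variant everywhere and let the build override it, so that any per-CPU choice has to be made deliberately. The sketch below is only an illustration of that idea: the SK_VARIANT names are hypothetical and are not the macros used in bf_locl.h.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical variant identifiers (not the library's own macros). */
#define SK_VARIANT_INDEX 0   /* plain array indexing, "<nothing>" above */
#define SK_VARIANT_PTR   1
#define SK_VARIANT_PTR2  2

/* The build can force a choice with e.g. -DSK_VARIANT=SK_VARIANT_PTR. */
#ifndef SK_VARIANT
# if defined(__i386__) || defined(_M_IX86)
   /* x86: plain indexing was the fastest C option in the timings above. */
#  define SK_VARIANT SK_VARIANT_INDEX
# else
   /* Unmeasured platforms: stay with the plain default until tested. */
#  define SK_VARIANT SK_VARIANT_INDEX
# endif
#endif

#if SK_VARIANT != SK_VARIANT_INDEX && \
    SK_VARIANT != SK_VARIANT_PTR && \
    SK_VARIANT != SK_VARIANT_PTR2
# error "SK_VARIANT must be one of the SK_VARIANT_* values"
#endif

/* Report the variant that was compiled in, handy for a speed program. */
static const char *sk_variant_name(void)
{
#if SK_VARIANT == SK_VARIANT_PTR
    return "ptr";
#elif SK_VARIANT == SK_VARIANT_PTR2
    return "ptr2";
#else
    return "index";
#endif
}
```

Printing sk_variant_name() alongside the timings makes it obvious which inner loop a given set of numbers actually came from.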

> tests. Similar perhaps to comparing Pentium Pro with Pentium which I
> think may performs worse than a Pentium on pentium specific code.

It depends. For things like RC5, the pentium pro screams because of faster
instructions.
The one thing about intel is that at least there are only really 2 CPUs at this
point in time (586, ppro/pen2). For things like sparc, there are so many
variations (umpteen thousand million) that doing specific assembler for
different chips (lets not even talk about v7 vs v8 vs v9) is truly horrifying :-).

> gcc -O3 -fomit-frame-pointer -DBF_PTR2 -m486 -DCPU=586 -c *.c
                                 ^^^^^^^
                bad bad bad, evil (hands up in the sign of the cross...)
> gcc -o bfspeed bf_cbc.o bf_cfb64.o bf_ecb.o bf_enc.o bf_ofb64.o bf_skey.o bfspeed.o

eric



The following archive was created by hippie-mail 7.98617-22 on Fri Aug 21 1998 - 17:16:14 ADT