This is a quickly hacked up explanation of the changes made to the DES
algorithm layout.  It was written very quickly one morning 
in response to an email message and so may (probably :-) contains
errors but if you are interested, it gives a quick overview of the
changes from the FIPS version of DES that most of the fast DES
implementations make.  You had better have my source code and the
FIPS paper in front of you before you start reading :-).


On Tue, 12 Oct 1993, AAA ZZZZZZZ wrote:
> I have the book "Numerical Recipes in C" and am trying to match your algorithm
> (in fcrypt.c) with that of the DES algorithm in the book.  It seems that you 
> have replaced a some of the code with table lookups. I was wondering if your 
> could explain how you derived the entries for the arrays "skb" and "SPtrans".
> (I believe that SPtrans is the s-box with some sort of bit-fiddling.)  Also,
> could you possibly map the remainder of your algorithm to the generic DES
> algorithm just to make sure that I do not miss anything ?
Ah, yes, my version has had a few things changed :-).
Ok, first I'll attempt to describe the D_ENCRYPT macro (from des_locl.h)
#define D_ENCRYPT(L,R,S)        \
        u=(R^s[S  ]); \
        t=R^s[S+1]; \
        t=((t>>4)+(t<<28)); \
        L^=     des_SPtrans[1][(t    )&0x3f]| \
                des_SPtrans[3][(t>> 8)&0x3f]| \
                des_SPtrans[5][(t>>16)&0x3f]| \
                des_SPtrans[7][(t>>24)&0x3f]| \
                des_SPtrans[0][(u    )&0x3f]| \
                des_SPtrans[2][(u>> 8)&0x3f]| \
                des_SPtrans[4][(u>>16)&0x3f]| \
                des_SPtrans[6][(u>>24)&0x3f];
R contains 32 bits that are input into the E function.  If you look at the
E function,
31  1  2  3  4  5
 4  5  6  7  8  9
 8  9 10 11 12 13
12 13 14 15 16 17
16 17 18 19 20 21
20 21 22 23 24 25
24 25 26 27 28 29
28 29 30 31 32  1
If everything is first rotated up by one before we begin, the E
translation becomes
 1  2  3  4  5  6
 5  6  7  8  9 10
 9 10 11 12 13 14
13 14 15 16 17 18
17 18 19 20 21 22
21 22 23 24 25 26
25 26 27 28 29 30
29 30 31 32  1  2
Which you will notice relates to
 1  2  3  4  5  6    7  8
 5  6  7  8  9 10   11 12 
 9 10 11 12 13 14   15 16
13 14 15 16 17 18   19 20
17 18 19 20 21 22   23 24
21 22 23 24 25 26   27 28
25 26 27 28 29 30   31 32
29 30 31 32  1  2    3  4
So if we organise the 48 bits in the K table to occupy 2 32 bit words with
the 1st, 3rd, 5th and 7th 6bit groups padded with 2 bits so they are on 8 byte
boundaries, and the second word containing the even 6bit groups aligned
in the same way, except that the K data for the second word is rotated up
4 bits, so
u=R^K[word0]; t=R^K[word1]; produces
 1  2  3  4  5  6  *  *
 9 10 11 12 13 14  *  *
17 18 19 20 21 22  *  *
25 26 27 28 29 30  *  *
and
 1  2  *  *  5  6  7  8  
 9 10  *  * 13 14 15 16
17 18  *  * 21 22 23 24
25 26  *  * 29 30 31 32
which is then rotated by 4 to produce the required result
 5  6  7  8  9 10  *  *
13 14 15 16 17 18  *  *
21 22 23 24 25 26  *  *
29 30 31 32  1  2  *  *
These 8 'chunks' are used for looking up SPtrans (the input to the S
tables).
You will notice that because crypt uses a modification in the E table, this
section is slightly different.  crypt swaps bits between the top and
bottom 16 bits in R, I achieve this via masks and xor trick I will
explain later.
SPtrans is a table that maps the S boxes directly to the output 32bit by
incorporating the P function.  So, for example, the first S box, 6bits is
used as an index so for the value 0, SPtrans[0][0] gives
0x00820200 or 100000100000001000000000 or bits 10, 18 and 24
the S table in FIPS gives 14 or 2, 3 and 4.  If I am remembering
correctly, the DES paper always labels the right most bit #1 so we really
have bits 1, 2 and 3, and this maps via the P function to 9, 17 and 23.
Note that an input 6bits of value 23, 010111, maps to row 2, column 13,
not 1, 11 as I would expect (from my little endian upbringing).
You will notice that these values are out by 1 from what is actually kept
in SPtrans, thats because the first 1 bit rotation we did is now built
into SPtrans.  So inside the inner loop, we actually work with data that
is rotated by 1 bit so things work more easily in the E function.  The
SPtrans table was generated via the above translation from the original S
and P tables.  I did it quite some time ago and have lost the perl script
I used to do this but it would not be hard to redo this work.

You will notice that after the D_ENCRYPT functions are finished we rotate
the results back into the correct form.

Regarding the PERM_OP operation.  I'll copy the comments from des_local.h
        /* IP and FP
        * The problem is more of a geometric problem that random bit fiddling.
         0  1  2  3  4  5  6  7      62 54 46 38 30 22 14  6
         8  9 10 11 12 13 14 15      60 52 44 36 28 20 12  4
        16 17 18 19 20 21 22 23      58 50 42 34 26 18 10  2
        24 25 26 27 28 29 30 31  to  56 48 40 32 24 16  8  0

        32 33 34 35 36 37 38 39      63 55 47 39 31 23 15  7
        40 41 42 43 44 45 46 47      61 53 45 37 29 21 13  5
        48 49 50 51 52 53 54 55      59 51 43 35 27 19 11  3
        56 57 58 59 60 61 62 63      57 49 41 33 25 17  9  1

        The output has been subject to swaps of the form
        0 1 -> 3 1 but the odd and even bits have been put into
        2 3    2 0
        different words.  The main trick is to remember that
        t=((l>>size)^r)&(mask);
        r^=t;
        l^=(t<<size);
        can be used to swap and move bits between words.
        So l =  0  1  2  3  r = 16 17 18 19
                4  5  6  7      20 21 22 23
                8  9 10 11      24 25 26 27
               12 13 14 15      28 29 30 31
        becomes (for size == 2 and mask == 0x3333)
           t =   2^16  3^17 -- --   l =  0  1 16 17  r =  2  3 18 19
                 6^20  7^21 -- --        4  5 20 21       6  7 22 23
                10^24 11^25 -- --        8  9 24 25      10 11 24 25
                14^28 15^29 -- --       12 13 28 29      14 15 28 29

        Thanks for hints from Richard Outerbridge - he told me IP&FP
        could be done in 15 xor, 10 shifts and 5 ands.
        When I finally started to think of the problem in 2D
        I first got ~42 operations without xors.  When I remembered
        how to use xors :-) I got it to its final state.
        */
It is really quite evil but fast :-).

For the creation of the key_schedule (skb), similar optimising techniques
have been used.
PC1 has been done with the old xor trick, some-one has told me it
is possible to get it a few instructions faster but I don't feel like
beating my brain against a wall to work out how when it will only speed
things up 1%-2% and only in the key generation.
skb is basically the PC2 table with indexes 0..3 for C and 4..7 for D.
You will notice if you look at PC2 in FIPS that the bottom 24 bits are
made from C and the top 24 from D, it is sort of 2 table mashed together.
This lets us generate a direct mapping.  The lookup is a little strange
since some bits are just not used and so I did some ugly shifts etc to
remove the holes to help keep the table size down.
The output from the skb lookups is also aligned so that it fits into the
xor in DES easily.
eg, the output from des_skb is byte aligned for the S tables 0213 and 4657.
We convert this to 0246 and 1357 and rotate the 1357 data by 4 bits.

> The reason I am asking this is that I was given the task of taking your DES
> algorithm and porting it to the Cray and modifying it to take advantage
> of the vectorisation in an attempt to make the algorithm faster.  Also, I
> have been attempting to optimise the algorithm to squeeze as much performance
> out of the algorithm as possible.  To do this, it would be extremely helpful
> if you would be so kind as to help me.  I will gladly share my results with
> you.
I believe my code still currently works on a Cray.  To do the port I just
made sure I cleared the top 32bits of words when needed and continued to
operate on 32 bit quantities so I believe most of the structures on the
Cray will be only half full.  The problem with the algorithm is that it is
not vectorisable in this form.  It is mostly a random lookup problem.
One speedup that can be done is to combine S tables so you only perform 4
12 bit (or 14bit if you don't want to do shifting) instead of 8 lookups.
The ufc package does this but my code is often faster because having only
small tables it often fits into on chip caches whereas big random access
tables do not.
There may be other ways to implement the algorithm that will speed it up
but the only obvious optimisation that springs to mind is to take
advantage of the 64bit word size.
BTW it is probably not worth even looking at speeding up set_key since it
is only being run once for each 25 DES's when in crypt so the 1% speedup
in DES has 25 times the effect of 1% speedup in set_key :-).

> P.S.  I downloaded your DES library and am going over the code in it.
The most recent version is from 8th of this month (mostly copyright
message changes and addition of triple cbc encryption).

> P.S.S.  Could you explain PERM_OP and HPERM_OP ??
As mentioned above they are an neat way of take advantage of the regularity of
some of the permutation, quite non-intuitive :-).

hope this helps

eric