This patch increases the performance of AES x86-64 implementation. The average increment is more than 6.3% and the max increment is more than 10.2% on Intel CORE 2 CPU. The performance increment is gained via the following methods: - Two additional temporary registers are used to hold the subset of the state, so that the dependency between instructions is reduced. - The expanded key is loaded via 2 64bit load instead of 4 32-bit load. This patch is based on 2.6.25-rc8-mm1. The file attached is the test data via: modprobe tcrypt mode=200 - dmesg_1_core-stockn: stock kernel data - dmesg_1_core-op4n: patched kernel data - percent.txt: (time_patched - time_stock) / time_stock * 100 Signed-off-by: Huang Ying <ying.huang@intel.com> --- arch/x86/crypto/aes-x86_64-asm_64.S | 101 ++++++++++++++++++++---------------- include/crypto/aes.h | 1 2 files changed, 58 insertions(+), 44 deletions(-) --- a/arch/x86/crypto/aes-x86_64-asm_64.S +++ b/arch/x86/crypto/aes-x86_64-asm_64.S @@ -46,70 +46,81 @@ #define R7 %rbp #define R7E %ebp #define R8 %r8 +#define R8E %r8d #define R9 %r9 +#define R9E %r9d #define R10 %r10 #define R11 %r11 +#define R12 %r12 +#define R12E %r12d +#define R16 %rsp #define prologue(FUNC,KEY,B128,B192,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11) \ .global FUNC; \ .type FUNC,@function; \ .align 8; \ -FUNC: movq r1,r2; \ - movq r3,r4; \ - leaq BASE+KEY+48+4(r8),r9; \ - movq r10,r11; \ - movl (r7),r5 ## E; \ - movl 4(r7),r1 ## E; \ - movl 8(r7),r6 ## E; \ - movl 12(r7),r7 ## E; \ - movl BASE+0(r8),r10 ## E; \ - xorl -48(r9),r5 ## E; \ - xorl -44(r9),r1 ## E; \ - xorl -40(r9),r6 ## E; \ - xorl -36(r9),r7 ## E; \ - cmpl $24,r10 ## E; \ +FUNC: subq $24,r11; \ + movl (r6),r4 ## E; \ + leaq BASE+KEY+48+8(r7),r8; \ + movq r1,(r11); \ + movq r9,r10; \ + movl 4(r6),r1 ## E; \ + movq r2,8(r11); \ + movl 8(r6),r5 ## E; \ + movq r3,16(r11); \ + movl 12(r6),r6 ## E; \ + movl BASE+0(r7),r9 ...
From your description I would assume that the performance can only increase. However, on my |model name : AMD Athlon(tm) 64 Processor 3200+ the opposite is the case [1], [2]. I dunno why and I didn't mixup patched & unpached :). I checked this patch on |model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz and the performance really increases [3], [4]. [1] http://download.breakpoint.cc/aes_patch/patched.txt [2] http://download.breakpoint.cc/aes_patch/unpatched.txt [3] http://download.breakpoint.cc/aes_patch/perf_patched.txt Sebastian --
En. I have no AMD machine. So I have not tested the patch on it. Maybe there are some pipeline or load/store unit difference between Intel and AMD CPUs. Tomorrow I can split the patch into a set of small patches, with one patch for one small step. Can you help me to test these patches Because the key is loaded in 64bit in this patch, I want to align the Best Regards, Huang Ying --
It would be also quite possible to use two different implementations, one for AMD another for Intel. crypto frame work should have no problems dealing with that. -Andi --
Yes that would definitely be an option. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
cut Alexander Kjeldaas <astor@fast.no> from CC coz his mails bounce. Than this won't work all the time. To make it bulletproof - set .cra_alignmask in the glue code properly - use the attribute aligned thing - retrieve your private struct via crypto_tfm_ctx_aligned() You might want to take a look on padlock-aes.c. The same thing is done there but instead of crypto_tfm_ctx_aligned() a private function is used (to let the compiler optimize most of the code away). Depending on Herbert's mood you might get away with this as well (what would be Sebastian --
On Wed, 2008-04-16 at 20:40 +0200, Sebastian Siewior wrote: As far as I know, the CRYPTO_MINALIGN is defined in include/linux/crypto.h as __alignof__(unsigned long long), and the __crt_ctx in crypto_tfm is aligned in CRYPTO_MINALIGN. So I think adding a pad is sufficient for x86_64 implementation. Best Regards, Huang Ying --
It should be sufficient but it would be better to use an align attribute to better document the intention. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
OK. I will use the align attribute. Best Regards, Huang Ying --
Doesn't this imply that kmalloc() returns memory that is always pointer Sebastian --
Parse error :) kmalloc returns memory that should always be aligned to CRYPTO_MINALIGN and in particular it's always pointer aligned. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Hi, Sebastian, The files attached is the separated patches, from step1 to step 7. Thank you very much for your help. Best Regards, Huang Ying
I've run the following script:
|#!/bin/bash
|check_error()
|{
| r=$?
| if [ ! $r -eq 0 ]
| then
| exit 1
| fi
|}
|
|modprobe tcrypt mode=200
|modprobe tcrypt mode=200
|dmesg -c > step-0.txt
|
|for ((i=1; i<=7; i++))
|do
| quilt push step${i}.patch
| check_error
|
| make
| check_error
|
| rmmod aes_x86_64
| check_error
|
| insmod arch/x86/crypto/aes-x86_64.ko
| check_error
|
| modprobe tcrypt mode=200
| modprobe tcrypt mode=200
| dmesg -c > step-${i}.txt
|done
Sebastian
Hi, Sebastian, Thank you very much for your help. From the result you sent, the biggest performance degradation is between step 4 and step 5. In that step, one more register is saved before and restored after encryption/decryption. So I think the reason maybe the read/write port throughput of CPU. I changed the patches to group the read or write together instead of interleaving. Can you help me to test these new patches? The new patches is attached with the mail. Best Regards, Huang Ying
Ah so it is just a local problem you say? I may get my fingers on Sebastian --
Hi, Sebastian, I mean the read/write port design difference between CPU Thank you very much. Best Regards, Huang Ying --
No, that is what I meant somehow. It is possible that AMD improved this in a later CPU generation. The AMD box I have is a pretty old one in comparison to Intel's dual core. It is also possible that they are Sebastian --
one of the more important details in evaluating these changes would be the family/model/stepping of the processors being microbenchmarked... could you folks include /proc/cpuinfo with the results? also -- please drop the #define for R16 to %rsp ... it obfuscates more than it helps anything. thanks -dean --
Hi, The file attached is /proc/cpuinfo of my testing machine. Best Regards,
Hi, Sebastian, It seems that the performance degradation between step4 to step5 is decreased. But the overall performance degradation between step0 to step7 is still about 5%. I also test the patches on Pentium 4 CPUs, and the performance decreased too. So I think this optimization is CPU micro-architecture dependent. While the dependency between instructions are reduced, more registers (at most 3) are saved/restored before/after encryption/decryption. If the CPU has no extra execution unit for newly independent instructions but more registers are saved/restored, the performance will decrease. We maybe should select different implementation based on micro-architecture. Best Regards, Huang Ying --
