Re: [PATCH -mm crypto] AES: x86_64 asm implementation optimization

Previous thread: ←# 我想申請網路刷卡 ◆∵◢ by zhu air on Tuesday, April 8, 2008 - 11:32 pm. (1 message)

Next thread: [ANNOUNCE] GIT 1.5.5 by Junio C Hamano on Tuesday, April 8, 2008 - 11:51 pm. (1 message)
From: Huang, Ying
Date: Tuesday, April 8, 2008 - 11:41 pm

This patch increases the performance of AES x86-64 implementation. The
average increment is more than 6.3% and the max increment is
more than 10.2% on Intel CORE 2 CPU. The performance increment is
gained via the following methods:

- Two additional temporary registers are used to hold the subset of
  the state, so that the dependency between instructions is reduced.

- The expanded key is loaded via 2 64bit load instead of 4 32-bit load.

This patch is based on 2.6.25-rc8-mm1.

The file attached is the test data via: modprobe tcrypt mode=200

- dmesg_1_core-stockn:	stock kernel data
- dmesg_1_core-op4n:	patched kernel data
- percent.txt:		(time_patched - time_stock) / time_stock * 100

Signed-off-by: Huang Ying <ying.huang@intel.com>

---
 arch/x86/crypto/aes-x86_64-asm_64.S |  101 ++++++++++++++++++++----------------
 include/crypto/aes.h                |    1 
 2 files changed, 58 insertions(+), 44 deletions(-)

--- a/arch/x86/crypto/aes-x86_64-asm_64.S
+++ b/arch/x86/crypto/aes-x86_64-asm_64.S
@@ -46,70 +46,81 @@
 #define R7	%rbp
 #define R7E	%ebp
 #define R8	%r8
+#define R8E	%r8d
 #define R9	%r9
+#define R9E	%r9d
 #define R10	%r10
 #define R11	%r11
+#define R12	%r12
+#define R12E	%r12d
+#define R16	%rsp
 
 #define prologue(FUNC,KEY,B128,B192,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11) \
 	.global	FUNC;			\
 	.type	FUNC,@function;		\
 	.align	8;			\
-FUNC:	movq	r1,r2;			\
-	movq	r3,r4;			\
-	leaq	BASE+KEY+48+4(r8),r9;	\
-	movq	r10,r11;		\
-	movl	(r7),r5 ## E;		\
-	movl	4(r7),r1 ## E;		\
-	movl	8(r7),r6 ## E;		\
-	movl	12(r7),r7 ## E;		\
-	movl	BASE+0(r8),r10 ## E;	\
-	xorl	-48(r9),r5 ## E;	\
-	xorl	-44(r9),r1 ## E;	\
-	xorl	-40(r9),r6 ## E;	\
-	xorl	-36(r9),r7 ## E;	\
-	cmpl	$24,r10 ## E;		\
+FUNC:	subq	$24,r11;		\
+	movl	(r6),r4 ## E;		\
+	leaq	BASE+KEY+48+8(r7),r8;	\
+	movq	r1,(r11);		\
+	movq	r9,r10;			\
+	movl	4(r6),r1 ## E;		\
+	movq	r2,8(r11);		\
+	movl	8(r6),r5 ## E;		\
+	movq	r3,16(r11);		\
+	movl	12(r6),r6 ## E;		\
+	movl	BASE+0(r7),r9 ...
From: Sebastian Siewior
Date: Wednesday, April 16, 2008 - 12:31 am

From your description I would assume that the performance can only
increase. However, on my
|model name      : AMD Athlon(tm) 64 Processor 3200+
the opposite is the case [1], [2]. I dunno why and I didn't mixup
patched & unpached :). I checked this patch on
|model name      : Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
and the performance really increases [3], [4].

[1] http://download.breakpoint.cc/aes_patch/patched.txt
[2] http://download.breakpoint.cc/aes_patch/unpatched.txt
[3] http://download.breakpoint.cc/aes_patch/perf_patched.txt


Sebastian
--

From: Huang, Ying
Date: Wednesday, April 16, 2008 - 1:19 am

En. I have no AMD machine. So I have not tested the patch on it. Maybe
there are some pipeline or load/store unit difference between Intel and
AMD CPUs. Tomorrow I can split the patch into a set of small patches,
with one patch for one small step. Can you help me to test these patches

Because the key is loaded in 64bit in this patch, I want to align the

Best Regards,
Huang Ying

--

From: Andi Kleen
Date: Wednesday, April 16, 2008 - 1:23 am

It would be also quite possible to use two different implementations,
one for AMD another for Intel.  crypto frame work should have no
problems dealing with that.

-Andi
--

From: Herbert Xu
Date: Wednesday, April 16, 2008 - 2:50 am

Yes that would definitely be an option.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Sebastian Siewior
Date: Wednesday, April 16, 2008 - 11:40 am

cut Alexander Kjeldaas <astor@fast.no> from CC coz his mails bounce.


Than this won't work all the time. To make it bulletproof
- set .cra_alignmask in the glue code properly
- use the attribute aligned thing
- retrieve your private struct via crypto_tfm_ctx_aligned()

You might want to take a look on padlock-aes.c. The same thing is done
there but instead of crypto_tfm_ctx_aligned() a private function is
used (to let the compiler optimize most of the code away). Depending on
Herbert's mood you might get away with this as well (what would be

Sebastian
--

From: Huang, Ying
Date: Wednesday, April 16, 2008 - 6:52 pm

On Wed, 2008-04-16 at 20:40 +0200, Sebastian Siewior wrote:

As far as I know, the CRYPTO_MINALIGN is defined in
include/linux/crypto.h as __alignof__(unsigned long long), and the
__crt_ctx in crypto_tfm is aligned in CRYPTO_MINALIGN. So I think adding
a pad is sufficient for x86_64 implementation.

Best Regards,
Huang Ying

--

From: Herbert Xu
Date: Wednesday, April 16, 2008 - 8:34 pm

It should be sufficient but it would be better to use an align
attribute to better document the intention.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Huang, Ying
Date: Wednesday, April 16, 2008 - 9:53 pm

OK. I will use the align attribute.

Best Regards,
Huang Ying
--

From: Sebastian Siewior
Date: Wednesday, April 23, 2008 - 3:28 pm

Doesn't this imply that kmalloc() returns memory that is always pointer

Sebastian
--

From: Herbert Xu
Date: Wednesday, April 23, 2008 - 5:51 pm

Parse error :)

kmalloc returns memory that should always be aligned to
CRYPTO_MINALIGN and in particular it's always pointer aligned.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Huang, Ying
Date: Wednesday, April 16, 2008 - 8:36 pm

Hi, Sebastian,

The files attached is the separated patches, from step1 to step 7. Thank
you very much for your help.

Best Regards,
Huang Ying

From: Sebastian Siewior
Date: Wednesday, April 23, 2008 - 3:32 pm

I've run the following script:

|#!/bin/bash
|check_error()
|{
|        r=$?
|        if [ ! $r -eq 0 ]
|        then
|                exit 1
|        fi
|}
|
|modprobe tcrypt mode=200
|modprobe tcrypt mode=200
|dmesg -c > step-0.txt
|
|for ((i=1; i<=7; i++))
|do
|        quilt push step${i}.patch
|        check_error
|
|        make
|        check_error
|
|        rmmod aes_x86_64
|        check_error
|
|        insmod arch/x86/crypto/aes-x86_64.ko
|        check_error
|
|        modprobe tcrypt mode=200
|        modprobe tcrypt mode=200
|        dmesg -c > step-${i}.txt
|done


Sebastian
From: Huang, Ying
Date: Thursday, April 24, 2008 - 8:11 pm

Hi, Sebastian,

Thank you very much for your help. From the result you sent, the biggest
performance degradation is between step 4 and step 5. In that step, one
more register is saved before and restored after encryption/decryption.
So I think the reason maybe the read/write port throughput of CPU.

I changed the patches to group the read or write together instead of
interleaving. Can you help me to test these new patches? The new patches
is attached with the mail.

Best Regards,
Huang Ying

From: Sebastian Siewior
Date: Friday, April 25, 2008 - 12:12 am

Ah so it is just a local problem you say? I may get my fingers on

Sebastian
--

From: Huang, Ying
Date: Friday, April 25, 2008 - 12:21 am

Hi, Sebastian,


I mean the read/write port design difference between CPU
Thank you very much.

Best Regards,
Huang Ying

--

From: Sebastian Siewior
Date: Friday, April 25, 2008 - 12:37 am

No, that is what I meant somehow. It is possible that AMD improved this
in a later CPU generation. The AMD box I have is a pretty old one in
comparison to Intel's dual core. It is also possible that they are

Sebastian
--

From: Sebastian Siewior
Date: Tuesday, April 29, 2008 - 3:12 pm

Hi Huang,


Sebastian
From: dean gaudet
Date: Saturday, May 3, 2008 - 11:25 pm

one of the more important details in evaluating these changes would be the 
family/model/stepping of the processors being microbenchmarked... could 
you folks include /proc/cpuinfo with the results?

also -- please drop the #define for R16 to %rsp ... it obfuscates more 
than it helps anything.

thanks
-dean

--

From: Huang, Ying
Date: Tuesday, May 6, 2008 - 10:12 pm

Hi,


The file attached is /proc/cpuinfo of my testing machine.

Best Regards,
From: Huang, Ying
Date: Tuesday, May 6, 2008 - 10:26 pm

Hi, Sebastian,


It seems that the performance degradation between step4 to step5 is
decreased. But the overall performance degradation between step0 to
step7 is still about 5%.

I also test the patches on Pentium 4 CPUs, and the performance decreased
too. So I think this optimization is CPU micro-architecture dependent.

While the dependency between instructions are reduced, more registers
(at most 3) are saved/restored before/after encryption/decryption. If
the CPU has no extra execution unit for newly independent instructions
but more registers are saved/restored, the performance will decrease.

We maybe should select different implementation based on
micro-architecture.

Best Regards,
Huang Ying

--

Previous thread: ←# 我想申請網路刷卡 ◆∵◢ by zhu air on Tuesday, April 8, 2008 - 11:32 pm. (1 message)

Next thread: [ANNOUNCE] GIT 1.5.5 by Junio C Hamano on Tuesday, April 8, 2008 - 11:51 pm. (1 message)