OpenSSL is built with the generic Linux settings for most targets,
including aarch64. These generic settings are designed for 32-bit CPUs and
provide no assembler optimization, which is far from optimal for aarch64.
This patch simply switches to the aarch64 settings that are already
available in OpenSSL.
Here is the output of "openssl speed" before the optimization, with
"(...)" representing build flags that didn't change:
OpenSSL 1.0.2l 25 May 2017
options:bn(64,32) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc (...)
After this patch, OpenSSL uses 64-bit mode and assembler optimizations:
OpenSSL 1.0.2l 25 May 2017
options:bn(64,64) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc (...) -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
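The two settings that matter in this header are the bn(...) entry and the *_ASM defines on the compiler line. As a rough sketch only, the check could be automated as below; the function name summarize_build is hypothetical, and the parsing assumes the OpenSSL 1.0.2-style header format quoted above.

```python
import re

# Header lines copied from the "after" output quoted above.
HEADER = """\
OpenSSL 1.0.2l  25 May 2017
options:bn(64,64) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
compiler: aarch64-openwrt-linux-musl-gcc (...) -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
"""


def summarize_build(header):
    """Return (bn_bits, asm_defines) parsed from an OpenSSL 1.0.2-style header.

    bn_bits is the pair of integers inside bn(...), or None if absent.
    asm_defines is the sorted list of -D<NAME>_ASM macros on the compiler line.
    """
    bn = re.search(r"bn\((\d+),(\d+)\)", header)
    bn_bits = (int(bn.group(1)), int(bn.group(2))) if bn else None
    asm_defines = sorted(set(re.findall(r"-D(\w+_ASM)\b", header)))
    return bn_bits, asm_defines


if __name__ == "__main__":
    print(summarize_build(HEADER))
```

Fed the "before" header instead, the same function would report bn(64,32) and an empty list of assembler defines.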
Here are some benchmarks on a pine64+ running the latest LEDE master
(r5142-20d363aed3):
before# openssl speed sha aes blowfish
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha1             3918.89k     9982.43k    19148.03k    24933.03k    27325.78k
sha256           4604.51k    10240.64k    17472.51k    21355.18k    22801.07k
sha512           3662.19k    14539.41k    21443.16k    29544.11k    33177.60k
blowfish cbc    16266.63k    16940.86k    17176.92k    17237.33k    17252.35k
aes-128 cbc     19712.95k    21447.40k    22091.09k    22258.35k    22304.09k
aes-192 cbc     17680.12k    19064.47k    19572.14k    19703.13k    19737.26k
aes-256 cbc     15986.67k    17132.48k    17537.28k    17657.17k    17689.26k
after# openssl speed sha aes blowfish
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha1             6770.87k    26172.80k    86878.38k   205649.58k   345978.20k
sha256          20913.93k    74663.85k   184658.18k   290891.09k   351032.66k
sha512           7633.10k    30110.14k    50083.24k    71883.43k    82485.25k
blowfish cbc    16224.93k    16933.55k    17173.76k    17234.94k    17252.35k
aes-128 cbc     19425.74k    21193.31k    22065.74k    22304.77k    22380.54k
aes-192 cbc     17452.29k    18883.84k    19536.90k    19741.70k    19800.06k
aes-256 cbc     15815.89k    17003.01k    17530.03k    17695.40k    17746.60k
For some reason AES and Blowfish do not benefit in this mode, but SHA
throughput improves by a factor of 1.7x to 15x. SHA256 clearly benefits
the most from the optimization (4.5x on 16-byte blocks, 15x on 8192-byte
blocks!).
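The ratios quoted above can be rechecked with a few lines of arithmetic. This is only an illustrative sketch: the helper name speedups is hypothetical, and the figures are the 16-byte and 8192-byte columns copied from the non-EVP tables above.

```python
# Throughput in 1000s of bytes per second, copied from the non-EVP tables:
# (16-byte column, 8192-byte column) for each algorithm.
BEFORE = {
    "sha1": (3918.89, 27325.78),
    "sha256": (4604.51, 22801.07),
    "sha512": (3662.19, 33177.60),
    "aes-128 cbc": (19712.95, 22304.09),
}
AFTER = {
    "sha1": (6770.87, 345978.20),
    "sha256": (20913.93, 351032.66),
    "sha512": (7633.10, 82485.25),
    "aes-128 cbc": (19425.74, 22380.54),
}


def speedups(before, after):
    """Return {algorithm: (small_block_ratio, large_block_ratio)}.

    Each ratio is after/before throughput, rounded to one decimal place.
    """
    return {
        name: tuple(round(a / b, 1) for a, b in zip(after[name], before[name]))
        for name in before
    }


if __name__ == "__main__":
    for name, (small, large) in speedups(BEFORE, AFTER).items():
        print(f"{name:12s} {small:5.1f}x (16 bytes) {large:5.1f}x (8192 bytes)")
```

The same helper could be applied to the EVP tables by swapping in their columns.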
When using EVP (with "openssl speed -evp <algo>"):
# Before, EVP mode
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha1             3824.46k    10049.66k    19170.56k    24947.03k    27325.78k
sha256           3368.33k     8511.15k    16061.44k    20772.52k    22721.88k
sha512           2845.23k    11381.57k    19467.69k    28512.26k    33008.30k
bf-cbc          15146.74k    16623.83k    17092.01k    17211.39k    17249.62k
aes-128-cbc     17873.03k    20870.61k    21933.65k    22216.36k    22301.35k
aes-192-cbc     16184.18k    18607.15k    19447.13k    19670.02k    19737.26k
aes-256-cbc     14774.06k    16757.25k    17457.58k    17639.42k    17686.53k
# After, EVP mode
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha1             7056.97k    27142.10k    89515.86k   209155.41k   347419.99k
sha256           7745.70k    29750.06k    95341.48k   211001.69k   332376.75k
sha512           4550.47k    18086.06k    39997.10k    65880.75k    81431.21k
bf-cbc          15129.20k    16619.03k    17090.56k    17212.76k    17246.89k
aes-128-cbc     99619.74k   269032.34k   450214.23k   567353.00k   613933.06k
aes-192-cbc     93180.74k   231017.79k   361766.66k   433671.51k   461731.16k
aes-256-cbc     89343.23k   209858.58k   310160.04k   362234.88k   380878.85k
Blowfish does not seem to have any assembler optimization at all. SHA
still benefits (between 1.6x and 14.5x), but is generally slower than in
non-EVP mode.
However, AES performance improves by a factor of 5.5x to 27.5x, which is
really impressive! For aes-128-cbc on large blocks, a Core i7-6600U
@ 2.60GHz is only twice as fast...
Signed-off-by: Baptiste Jonglez <git@bitsofnetworks.org>
PKG_BASE:=1.0.2
PKG_BUGFIX:=l
PKG_VERSION:=$(PKG_BASE)$(PKG_BUGFIX)
-PKG_RELEASE:=1
+PKG_RELEASE:=2
PKG_USE_MIPS16:=0
PKG_BUILD_PARALLEL:=0
OPENSSL_OPTIONS+=no-sse2
ifeq ($(CONFIG_mips)$(CONFIG_mipsel),y)
OPENSSL_TARGET:=linux-mips-openwrt
+ else ifeq ($(CONFIG_aarch64),y)
+ OPENSSL_TARGET:=linux-aarch64-openwrt
else ifeq ($(CONFIG_arm)$(CONFIG_armeb),y)
OPENSSL_TARGET:=linux-armv4-openwrt
else
--- a/Configure
+++ b/Configure
-@@ -470,6 +470,12 @@ my %table=(
+@@ -470,6 +470,13 @@ my %table=(
"linux-alpha-ccc","ccc:-fast -readonly_strings -DL_ENDIAN::-D_REENTRANT:::SIXTY_FOUR_BIT_LONG RC4_CHUNK DES_INT DES_PTR DES_RISC1 DES_UNROLL:${alpha_asm}",
"linux-alpha+bwx-ccc","ccc:-fast -readonly_strings -DL_ENDIAN::-D_REENTRANT:::SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_PTR DES_RISC1 DES_UNROLL:${alpha_asm}",
+# OpenWrt targets
+"linux-armv4-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${armv4_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
++"linux-aarch64-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${aarch64_asm}:linux64:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
+"linux-x86_64-openwrt", "gcc:-m64 -DL_ENDIAN -DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHUNK DES_INT DES_UNROLL:${x86_64_asm}:elf:dlfcn:linux-shared:-fPIC:-m64:.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR):::64",
+"linux-mips-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${mips32_asm}:o32:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
+"linux-generic-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",