Hardware accelerated crypto on AMD Puma cores
Let's start with the good stuff:
# grep -F "model name" /proc/cpuinfo
model name : AMD GX-212ZC SOC with Radeon(TM) R1E Graphics
# modprobe ccp_crypto
# grep -e -ccp /proc/crypto
driver : rsa-ccp
driver : hmac-sha256-ccp
driver : sha256-ccp
driver : hmac-sha224-ccp
driver : sha224-ccp
driver : hmac-sha1-ccp
driver : sha1-ccp
driver : xts-aes-ccp
driver : cmac-aes-ccp
driver : rfc3686-ctr-aes-ccp
driver : ctr-aes-ccp
driver : ofb-aes-ccp
driver : cfb-aes-ccp
driver : cbc-aes-ccp
driver : ecb-aes-ccp
Okay, now what?
Brief history of userspace crypto on Linux
Accessing crypto hardware from userspace has always sucked on Linux. You would have all those wonderful drivers for obscure crypto accelerators with no easy way to leverage them. Sure, the kernel could, but that was often not enough. Some libraries like OpenSSL would implement their own hardware-specific engines, such as one for the VIA PadLock security engine. Generally, though, the range of userspace-supported hardware was spotty and resulted in a lot of duplicated effort.
Then there were hacks. The most notable, Cryptodev, was introduced in 2010 and provided a userspace crypto API in a similar fashion to OpenBSD. It never got upstreamed and received its last update in 2017. In 2011 (2.6.38) Linux developers provided their own interface called AF_ALG, which was fairly low-level and frankly a pain to use. It did not exactly take the world of crypto by storm. As CPUs got faster, the relevance of hardware crypto accelerators slowly faded. OpenSSL introduced partial AF_ALG support in 2016 (1.1.0). Disappointingly, five years later all it can do is accelerate AES CBC. There is, however, a modern alternative.
Kernel developer Stephan Mueller created libkcapi which (while still using AF_ALG under the hood) provides the user with high-level functions of the kind you might find in a regular crypto library like OpenSSL. It also comes with a few utilities, such as sha256sum drop-in replacements, making benchmarking quite easy.
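To give a feel for what "AF_ALG under the hood" means, here is a minimal sketch that hashes a buffer through a raw AF_ALG socket using Python's standard socket module. This is not libkcapi's code, just an illustration of the socket dance such a library hides. The helper names (sha256_af_alg, sha256_any) and the chunk size are my own choices, and the hashlib fallback is only there so the sketch still runs on systems where AF_ALG is unavailable.

```python
import hashlib
import socket

CHUNK = 64 * 1024  # arbitrary chunk size for feeding data to the kernel


def sha256_af_alg(data: bytes) -> bytes:
    """Hash data through the kernel crypto API using a raw AF_ALG socket."""
    with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as alg:
        # Request a hash transform by algorithm name; the kernel picks the
        # registered implementation with the highest priority.
        alg.bind(("hash", "sha256"))
        op, _ = alg.accept()  # each accepted socket is one hash operation
        with op:
            for start in range(0, len(data), CHUNK):
                # MSG_MORE tells the kernel that more input will follow.
                op.sendall(data[start:start + CHUNK], socket.MSG_MORE)
            return op.recv(32)  # reading finalizes the hash (32-byte digest)


def sha256_any(data: bytes):
    """Return (backend, digest); fall back to hashlib if AF_ALG fails."""
    try:
        return "af_alg", sha256_af_alg(data)
    except (AttributeError, OSError):
        return "hashlib", hashlib.sha256(data).digest()


backend, digest = sha256_any(b"abc")
print(backend, digest.hex())
```

Note that the caller only names the algorithm ("sha256"), not a driver, so whether the request actually lands on the hardware unit depends on which implementations the kernel has registered and how they are prioritized.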
So is it even worth it?
As they say: the proof is in the pudding.
For the test I created a 680MB file with random data and placed it in tmpfs. I'll be comparing busybox's 'naive' implementation, openssl's more hand-tuned one, and finally the kcapi utility using the hardware crypto unit. Keep in mind that this is a 1.2GHz embedded CPU from 2014.
Let's start with SHA-256.
# time busybox sha256sum 680MB.dat
7ed754892b3f673a85d18488cd675fa0f9f41d2840846d1bc8601c107713cb7e 680MB.dat
real 0m 20.31s
Oof. A bit slow.
# time openssl dgst -sha256 680MB.dat
SHA256(680MB.dat)= 7ed754892b3f673a85d18488cd675fa0f9f41d2840846d1bc8601c107713cb7e
real 0m 9.42s

# time sha256sum 680MB.dat
7ed754892b3f673a85d18488cd675fa0f9f41d2840846d1bc8601c107713cb7e 680MB.dat
real 0m 4.45s
Nice! The hardware accelerated sha256sum ends up roughly twice as fast as OpenSSL and more than four times as fast as busybox.
And hey, the hashes even match.
SHA-224 results are very similar, but SHA-1 is a bit disappointing: 17.5s for busybox, 4.6s for openssl and 4.7s for the hardware accelerator.
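For easier comparison, the timings above can be turned into throughput and speedup figures. The sketch below only re-uses the numbers already quoted (680MB file, wall-clock seconds per tool); the dictionary layout and the throughput helper are just for illustration.

```python
FILE_MB = 680  # size of the test file used above

# Wall-clock seconds quoted in the text, per algorithm and implementation.
timings = {
    "sha256": {"busybox": 20.31, "openssl": 9.42, "kcapi": 4.45},
    "sha1": {"busybox": 17.5, "openssl": 4.6, "kcapi": 4.7},
}


def throughput(seconds: float) -> float:
    """Megabytes hashed per second for the 680MB test file."""
    return FILE_MB / seconds


for algo, runs in timings.items():
    for impl, seconds in runs.items():
        speedup = runs["busybox"] / seconds  # relative to busybox
        print(f"{algo:7s} {impl:8s} {throughput(seconds):6.1f} MB/s "
              f"{speedup:4.1f}x vs busybox")
```

Seen this way, the accelerator's throughput is nearly the same for SHA-256 and SHA-1 (about 4.5s vs 4.7s for the same file), while OpenSSL's software SHA-1 is about twice as fast as its SHA-256, which is why the gap closes.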
I did not bother testing AES, as the core already supports AES-NI ISA extensions which are much easier to use and usually faster too (especially for smaller blocks of data - some old results from comparable hardware here).
I want to use the board as a media server/torrent box running Transmission. As you might recall, the BitTorrent protocol uses SHA-1 hashes to verify chunks of data. You see where I am going with this.
After several hours of banging my head against a wall, or rather a brief introduction to the CMake build system, I cobbled together a complete patch here.
While raw performance is comparable to the OpenSSL implementation, it does move the heavy lifting off the main CPU thread, making the system much more responsive when hashing.
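One way to check this kind of claim is to compare wall-clock time with the CPU time the process itself consumed. The sketch below is a generic measuring helper, not part of the patch; the measure function and the 16 MiB stand-in buffer are my own assumptions, and it times a plain software SHA-1 from hashlib as a baseline.

```python
import hashlib
import time


def measure(func, *args):
    """Run func(*args) and return (wall_seconds, cpu_seconds)."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    func(*args)
    return time.perf_counter() - wall0, time.process_time() - cpu0


payload = bytes(16 * 1024 * 1024)  # 16 MiB of zeros as stand-in data
wall, cpu = measure(lambda buf: hashlib.sha1(buf).digest(), payload)
print(f"wall {wall:.3f}s, cpu {cpu:.3f}s")
```

A software hash should show CPU time close to wall time. If the work is really offloaded, I would expect the CPU figure to drop well below the wall figure, since the process mostly waits for the accelerator. The same comparison can be made from the shell using the user and sys lines that time prints next to real.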
Alternative with WolfSSL?
WolfSSL is advertised as a small TLS library for embedded devices. Interestingly, it also supports the AF_ALG interface on Linux. Better yet, Transmission can also be built against it. Unfortunately, while Transmission 3.0 builds just fine against WolfSSL 4.6, adding a torrent results in an immediate segmentation fault. I did not investigate further.
A few parting thoughts, in no particular order.
- libkcapi is nice and easy
- your hardware probably has underutilized features
- crypto accelerators still make a difference on embedded boards