Optimize gcm_gf_mult using PCLMULQDQ and PMULL#714

Open
sjlombardo wants to merge 1 commit into libtom:develop from sjlombardo:develop

Conversation

@sjlombardo

The current gcm_gf_mult table optimization is great for large payloads, but it is less well-suited for small ones. For example, encrypting the same total volume of data in 4 KB chunks vs. a single 1 MB message is 3-10x slower with the table-optimized gcm_gf_mult implementation. This creates a bottleneck for programs that process many small messages with different keys.

Unfortunately, disabling LTC_GCM_TABLES introduces the reverse problem: it speeds things up considerably for small messages but degrades performance for larger ones.

Using hardware carry-less multiply solves the problem in both cases. This PR implements multiplication for GCM using Intel PCLMULQDQ on x86/x86_64 and PMULL on ARM AArch64. Benchmarking shows it is consistently 3-10x faster across messages between 1 KB and 4 KB. Hardware acceleration scales better under load, performs predictably regardless of payload size, and makes key setup faster. On balance, this eliminates the performance tradeoff. If there is interest, additional benchmark details can be provided.

Hardware support for this approach is extensive. The required instructions for x86/x86_64 have been around for over 10 years. Similarly, AArch64 support in ARMv8+crypto is available in all reasonably modern devices including most phones, desktops, and server hardware.

The core implementation is based on well-understood and widely referenced papers.

In order to minimize changes, the existing gcm_gf_mult is renamed to gcm_gf_mult_sw internally and a very simple gcm_gf_mult dispatcher sits in front of it. It establishes whether there is hardware support and routes to the appropriate implementation or the software fallback. There are no public API changes to LibTomCrypt.

CPU detection follows the same pattern as the recent AES-NI work on develop using CPUID on x86/x86_64, getauxval on ARM Linux, and sysctlbyname on ARM Apple.

The new functions use __attribute__((target(...))) under GCC and Clang to avoid global -mpclmul or +crypto compiler flag changes to the makefile(s). Compile-time opt-out is possible with LTC_NO_GCM_PCLMUL or LTC_NO_GCM_PMULL. Both are disabled under LTC_NO_ASM. If neither of the previous macros is set then LTC_GCM_TABLES is automatically disabled since the intrinsic path is preferable.

No new tests were added because the existing GCM test suite already covers this path and there are no API changes. The suite was run and passed on Linux x86_64, Linux AArch64, Windows x86_64, Windows ARM64, macOS x86_64, and macOS AArch64.

GCM table setup disproportionately hurts LTC performance
with small messages. Disabling LTC_GCM_TABLES helps for
small payloads but hurts large ones.

This implements hardware carry-less multiplication for GCM that
performs well for both cases using PCLMULQDQ on x86/x86_64 and
PMULL on AArch64.

There are no public API changes.

These features can be disabled with LTC_NO_GCM_PCLMUL,
LTC_NO_GCM_PMULL, or LTC_NO_ASM. LTC_GCM_TABLES is disabled
automatically when no opt-out macro is set.
