Optimize gcm_gf_mult using PCLMULQDQ and PMULL#714

Open
sjlombardo wants to merge 1 commit into libtom:develop from sjlombardo:develop

Conversation

@sjlombardo

The current gcm_gf_mult table optimization is great for large payloads, but it is less well-suited for small ones. For example, encrypting the same total volume of data in 4 KB chunks vs. a single 1 MB message is 3-10x slower with the table-optimized gcm_gf_mult implementation. This creates a bottleneck for programs that process many small messages with different keys.

Unfortunately, disabling LTC_GCM_TABLES introduces the reverse problem: it speeds things up considerably for small messages but degrades performance for larger ones.

Using hardware carry-less multiply solves the problem in both cases. This PR implements multiplication for GCM using Intel PCLMULQDQ on x86/x86_64 and PMULL on ARM AArch64. Benchmarking shows it is consistently 3-10x faster across messages between 1 KB and 4 KB. Hardware acceleration scales better under load, performs predictably regardless of payload size, and makes key setup faster. On balance, this eliminates the performance tradeoff. If there is interest, additional benchmark details can be provided.

Hardware support for this approach is extensive. The required instructions for x86/x86_64 have been around for over 10 years. Similarly, AArch64 support in ARMv8+crypto is available in all reasonably modern devices including most phones, desktops, and server hardware.

The core implementation is based on well-understood and widely referenced papers.

In order to minimize changes, the existing gcm_gf_mult is renamed to gcm_gf_mult_sw internally and a very simple gcm_gf_mult dispatcher sits in front of it. It establishes whether there is hardware support and routes to the appropriate implementation or the software fallback. There are no public API changes to LibTomCrypt.

CPU detection follows the same pattern as the recent AES-NI work on develop using CPUID on x86/x86_64, getauxval on ARM Linux, and sysctlbyname on ARM Apple.

The new functions use __attribute__((target(...))) under GCC and Clang to avoid global -mpclmul or +crypto compiler flag changes to the makefile(s). Compile-time opt-out is possible with LTC_NO_GCM_PCLMUL or LTC_NO_GCM_PMULL. Both are disabled under LTC_NO_ASM. If neither of the previous macros is set then LTC_GCM_TABLES is automatically disabled since the intrinsic path is preferable.

No new tests were added because the existing GCM test suite already covers this path and there are no API changes. The suite was run and passed on Linux x86_64, Linux AArch64, Windows x86_64, Windows ARM64, macOS x86_64, and macOS AArch64.

GCM table setup disproportionately hurts LTC performance
with small messages. Disabling LTC_GCM_TABLES helps for
small payloads but hurts large ones.

This implements hardware carry-less multiplication for GCM that
performs well for both cases using PCLMULQDQ on x86/x86_64 and
PMULL on AArch64.

There are no public API changes.

These features can be disabled with LTC_NO_GCM_PCLMUL,
LTC_NO_GCM_PMULL, or LTC_NO_ASM. LTC_GCM_TABLES is disabled
automatically when no opt-out macro is set.
