Optimize gcm_gf_mult using PCLMULQDQ and PMULL #714
Open
sjlombardo wants to merge 1 commit into libtom:develop from
GCM table setup disproportionately hurts LTC performance with small messages. Disabling LTC_GCM_TABLES helps for small payloads but hurts large ones. This implements hardware carry-less multiplication for GCM that performs well for both cases using PCLMULQDQ on x86/x86_64 and PMULL on AArch64. There are no public API changes. These features can be disabled with LTC_NO_GCM_PCLMUL, LTC_NO_GCM_PMULL, or LTC_NO_ASM. LTC_GCM_TABLES is disabled automatically when no opt-out macro is set.
The current gcm_gf_mult table optimization is great for large payloads, but it is less well-suited for small ones. For example, encrypting the same total volume of data in 4 KB chunks vs. a single 1 MB message is 3-10x slower with the table-optimized gcm_gf_mult implementation. This creates a bottleneck for programs that process many small messages with different keys.
Unfortunately, disabling LTC_GCM_TABLES introduces the reverse problem. It speeds things up considerably for small messages but degrades performance for larger ones.
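For readers less familiar with the operation being tuned, the following is a minimal bit-by-bit sketch of multiplication in GF(2^128) as the GCM specification defines it. It is not LibTomCrypt's code, and the name ref_gf128_mul is hypothetical; it only illustrates the arithmetic that the table-based, table-free, and hardware variants all have to reproduce.

```c
#include <string.h>

/* Hypothetical reference routine (not LibTomCrypt code): out = x * y in
   GF(2^128) using GCM's bit ordering, where the most significant bit of
   byte 0 is the coefficient of the lowest power. */
static void ref_gf128_mul(const unsigned char x[16],
                          const unsigned char y[16],
                          unsigned char out[16])
{
    unsigned char z[16] = {0};   /* running result */
    unsigned char v[16];         /* y times successive powers of the variable */
    int i, j, lsb;

    memcpy(v, y, 16);
    for (i = 0; i < 128; i++) {
        /* If bit i of x is set (counting from the MSB of byte 0), add v. */
        if ((x[i / 8] >> (7 - (i % 8))) & 1) {
            for (j = 0; j < 16; j++) {
                z[j] ^= v[j];
            }
        }
        /* Multiply v by the variable: shift the block right by one bit and,
           if a bit fell off the end, reduce by XORing 0xE1 into byte 0. */
        lsb = v[15] & 1;
        for (j = 15; j > 0; j--) {
            v[j] = (unsigned char)((v[j] >> 1) | (v[j - 1] << 7));
        }
        v[0] >>= 1;
        if (lsb) {
            v[0] ^= 0xE1;
        }
    }
    memcpy(out, z, 16);   /* written last, so out may alias x or y */
}
```

In this representation the multiplicative identity is the block whose first byte is 0x80 and whose remaining bytes are zero, so multiplying any block by it returns the block unchanged. The loop performs 128 conditional XORs per block, which is the cost that tables or a hardware instruction aim to avoid.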
Using hardware carry-less multiply solves these problems for both cases. This PR implements multiplication for GCM using Intel PCLMULQDQ on x86/x86_64 and PMULL on ARM AArch64. Benchmarking shows it is consistently 3-10x faster across messages between 1K and 4K. Hardware acceleration scales better under load, performs predictably regardless of payload size, and makes key setup faster. On balance, this eliminates the performance tradeoff. If there is interest, additional benchmark details can be provided.
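To make "carry-less multiply" concrete: PCLMULQDQ and the 64-bit form of PMULL each multiply two 64-bit operands as polynomials over GF(2) and return a 128-bit product in a single instruction. The portable model below shows that operation in plain C. The names clmul128 and clmul64 are hypothetical and are not taken from this PR; this is only an illustration of what the instruction computes, not of how the PR uses it.

```c
#include <stdint.h>

/* Hypothetical helper, not part of LibTomCrypt: a portable model of a
   64x64 -> 128-bit carry-less (polynomial) multiply. */
typedef struct {
    uint64_t lo;   /* bits 0..63 of the product */
    uint64_t hi;   /* bits 64..127 of the product */
} clmul128;

static clmul128 clmul64(uint64_t a, uint64_t b)
{
    clmul128 r = {0, 0};
    int i;

    for (i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            /* Add (XOR) a shifted left by i bits; no carries propagate. */
            r.lo ^= a << i;
            if (i != 0) {
                r.hi ^= a >> (64 - i);   /* bits that spill past bit 63 */
            }
        }
    }
    return r;
}
```

For example, clmul64(3, 3) yields 5 rather than 9, because (x + 1)^2 = x^2 + 1 when coefficients are reduced modulo 2. A full GCM multiply would combine several such 64-bit products and then reduce the result modulo the GCM polynomial; that part is omitted here.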
Hardware support for this approach is extensive. The required instructions for x86/x86_64 have been around for over 10 years. Similarly, AArch64 support in ARMv8+crypto is available in all reasonably modern devices including most phones, desktops, and server hardware.
The core implementation is based on well-understood and referenced papers:
In order to minimize changes, the existing gcm_gf_mult is renamed to gcm_gf_mult_sw internally and a very simple gcm_gf_mult dispatcher sits in front of it. It establishes whether there is hardware support and routes to the appropriate implementation or the software fallback. There are no public API changes to LibTomCrypt.
CPU detection follows the same pattern as the recent AES-NI work on develop using CPUID on x86/x86_64, getauxval on ARM Linux, and sysctlbyname on ARM Apple.
The new functions use __attribute__((target(...))) under GCC and Clang to avoid global -mpclmul or +crypto compiler flag changes to the makefile(s). Compile-time opt-out is possible with LTC_NO_GCM_PCLMUL or LTC_NO_GCM_PMULL. Both are disabled under LTC_NO_ASM. If neither of the previous macros is set then LTC_GCM_TABLES is automatically disabled since the intrinsic path is preferable.
It wasn't necessary to add tests because the existing GCM test suite covers this adequately and there is no API change. The test suite was run and passed on Linux x86_64, Linux AArch64, Windows x86_64, Windows ARM64, macOS x86_64, and macOS AArch64.
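The dispatcher arrangement described above could look roughly like the sketch below. It is not the code from this PR. Every sketch_ name is hypothetical, the two multiply routines are placeholders that only count calls, and the runtime check is reduced to a settable flag so the example builds on any platform. Only the LTC_NO_ASM and LTC_NO_GCM_PCLMUL macro names come from the description.

```c
#include <stddef.h>
#include <string.h>

/* Assumed three-pointer shape for the multiply routines (c = a * b). */
typedef void (*sketch_gf_mult_fn)(const unsigned char *a,
                                  const unsigned char *b,
                                  unsigned char *c);

static int sketch_calls_sw, sketch_calls_hw;   /* demonstration counters */

/* Placeholder for the renamed software routine (gcm_gf_mult_sw in the PR).
   It does not compute a real product. */
static void sketch_gf_mult_sw(const unsigned char *a, const unsigned char *b,
                              unsigned char *c)
{
    (void)a; (void)b;
    memset(c, 0, 16);
    sketch_calls_sw++;
}

/* The hardware path is compiled out when an opt-out macro is defined. */
#if !defined(LTC_NO_ASM) && !defined(LTC_NO_GCM_PCLMUL)
#define SKETCH_HAVE_HW_PATH 1
/* Placeholder for an intrinsic-based routine; a real one would carry a
   target attribute and use carry-less multiply instructions. */
static void sketch_gf_mult_hw(const unsigned char *a, const unsigned char *b,
                              unsigned char *c)
{
    (void)a; (void)b;
    memset(c, 0, 16);
    sketch_calls_hw++;
}
#endif

/* Placeholder for runtime detection. A real check would query CPUID,
   getauxval, or sysctlbyname depending on the platform. */
static int sketch_cpu_has_clmul = 0;

static sketch_gf_mult_fn sketch_select(void)
{
#ifdef SKETCH_HAVE_HW_PATH
    if (sketch_cpu_has_clmul) {
        return sketch_gf_mult_hw;
    }
#endif
    return sketch_gf_mult_sw;
}

/* Cached choice; NULL until the first call. Not thread-safe as written. */
static sketch_gf_mult_fn sketch_impl = NULL;

/* Dispatcher: same signature as the routines it fronts, so callers do not
   need to change. */
static void sketch_gf_mult(const unsigned char *a, const unsigned char *b,
                           unsigned char *c)
{
    if (sketch_impl == NULL) {
        sketch_impl = sketch_select();
    }
    sketch_impl(a, b, c);
}
```

Caching the selected function pointer keeps the detection cost off the per-block path, which matters because the multiply runs once per 16-byte block. Real code would also need to make the first-call initialization safe under concurrent use.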