Tentative support for avx512vl extensions to 256 bit registers #1345
serge-sans-paille wants to merge 2 commits into
Conversation
Are we sure that we only need avx512f for this? It seems to me that instructions like https://diamondinoia.com/simdref/#_mm256_cmp_epi32_mask require avx512f + VL. Let me know where I am wrong. Cheers,
You're right, all of this requires avx512f+avx512vl. It turns out most builds have both (see https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512), but we currently don't have anything to model avx512vl, which should be the parent of avx512f_256.
I guess it might be called avx512vl_256 at that point, as we will also have avx512vl_128. What do you think? I agree, most CPUs have 512+extensions.
It looks like avx512vl does not have any 512-bit instructions :-) but it still has avx512 in its name and it implies avx512f, so I agree with you.
Actually, my suggestion might be wrong: there are instructions that require DQ + VL, for example https://diamondinoia.com/simdref/#_mm_maskz_andnot_pd
Based on the graph above, it looks like DQ => VL, so we should still be fine?
It seems like it; also, basically all architectures listed in the chart have DQ, VL, and BW together. So in any case it should be fine.
@DiamonDinoia I've moved the minimal creation of avx512vl to #1350; once merged, I'll rebase this PR.
Sure! Ping me when you need a review. It might be spotty while I'm on holiday.
avx512vl just extends 128- and 256-bit registers with some operations; it does not have any 512-bit instructions, so the description is mostly empty. This is preliminary work for #1345.
Force-pushed from 4b5ae7d to 9d15e04
In addition to missing instructions (e.g. those based on int64_t), this mostly changes the mask representation from vector register to scalar, hence the big diff.
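For context, a minimal illustration (not from the diff) of what that representation change means: an AVX2 comparison yields a 256-bit vector mask, while the same comparison under AVX512F+VL yields a scalar opmask.

```cpp
// Illustration only: vector-register mask (AVX2) vs scalar opmask (AVX512VL).
#include <immintrin.h>

// AVX2: each lane holds all-ones or all-zeros in a full 256-bit register.
__m256i mask_avx2(__m256i a, __m256i b) { return _mm256_cmpeq_epi32(a, b); }

// AVX512F+VL: one bit per lane in a k-register, here 8 bits for 8 int32 lanes.
__mmask8 mask_avx512vl(__m256i a, __m256i b) { return _mm256_cmpeq_epi32_mask(a, b); }
```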
Force-pushed from 9d15e04 to ecdac94
ping ;-)
Just some suggestions, and maybe we could add to the tests:
TEST_CASE_TEMPLATE_DEFINE("batch_bool mask hygiene", B, batch_bool_hygiene_id)
{
using value_type = typename B::value_type;
using batch_type = xsimd::batch<value_type, typename B::arch_type>;
SUBCASE("any(a != a) is false")
{
batch_type a(value_type(1));
CHECK_FALSE(xsimd::any(a != a));
}
SUBCASE("any(~(a == a)) is false")
{
batch_type a(value_type(1));
CHECK_FALSE(xsimd::any(~(a == a)));
}
SUBCASE("eq(false_mask, false_mask) is all-true")
{
auto m0 = (batch_type(value_type(1)) != batch_type(value_type(1)));
CHECK_UNARY(xsimd::all(m0 == m0));
}
SUBCASE("from_mask ignores bits above lane count")
{
constexpr std::size_t N = B::size;
uint64_t valid_all_true = (N == 64) ? ~uint64_t(0)
: ((uint64_t(1) << N) - 1);
uint64_t junk = valid_all_true | (uint64_t(1) << 63);
B clean = B::from_mask(valid_all_true);
B dirty = B::from_mask(junk);
CHECK_EQ(clean.mask(), dirty.mask());
CHECK_UNARY(xsimd::all(clean == dirty));
}
SUBCASE("batch_bool stored to bool[] is canonical 0/1")
{
batch_type a(value_type(1)), b(value_type(2));
auto m = (a == b); // all false
alignas(64) bool buf[B::size + 1] = {true, true}; // sentinel
xsimd::store_aligned(buf, m);
for (std::size_t i = 0; i < B::size; ++i)
{
// bit-level check: must be exactly 0, not just falsy
CHECK_EQ(*reinterpret_cast<uint8_t const*>(&buf[i]), uint8_t(0));
}
}
}
TEST_CASE_TEMPLATE_APPLY(batch_bool_hygiene_id, batch_bool_types);
```cpp
TEST_CASE_TEMPLATE_DEFINE("store_masked respects Mode", B, store_masked_mode_id)
{
    using T = typename B::value_type;
    using A = typename B::arch_type;
    constexpr std::size_t N = B::size;

    // Unaligned-mode + unaligned pointer: must not fault.
    alignas(64) T big[2 * N + 1] = {};
    T* unaligned_p = big + 1; // sizeof(T)-aligned only

    struct AllTrue { static constexpr bool get(std::size_t, std::size_t) { return true; } };
    auto cst = xsimd::make_batch_bool_constant<T, AllTrue, A>();
    B v(T(7));
    xsimd::kernel::store_masked(unaligned_p, v, cst,
                                xsimd::unaligned_mode{},
                                xsimd::kernel::requires_arch<A>{});
    for (std::size_t i = 0; i < N; ++i)
        CHECK_EQ(unaligned_p[i], T(7));

    // Overload resolution: store_masked with a same-typed mask must compile.
    // If C3 regresses, this becomes a compile error.
    auto signed_cst = xsimd::make_batch_bool_constant<T, AllTrue, A>();
    alignas(64) T aligned_buf[N] = {};
    xsimd::kernel::store_masked(aligned_buf, v, signed_cst,
                                xsimd::aligned_mode{},
                                xsimd::kernel::requires_arch<A>{});
    for (std::size_t i = 0; i < N; ++i)
        CHECK_EQ(aligned_buf[i], T(7));
}
TEST_CASE_TEMPLATE_APPLY(store_masked_mode_id, batch_types_for_masked_store);
```

```cpp
template <class A, class T, class = std::enable_if_t<std::is_integral<T>::value>>
XSIMD_INLINE batch_bool<T, A> neq(batch<T, A> const& self, batch<T, A> const& other, requires_arch<avx512vl_256>) noexcept
{
    return ~(self == other);
```
I think this flips junk into the bits above `batch_bool::size`. We might need something like:

```cpp
constexpr auto active_mask = register_type(register_type(-1) >> (sizeof(register_type) * 4));
return ~(self == other) & active_mask;
```
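An alternative sketch (helper name hypothetical, not from the PR) that derives the active mask from the lane count rather than the register width, which also covers element sizes where the opmask is only partially populated:

```cpp
// Hypothetical helper: keep only the opmask bits that correspond to real lanes.
#include <cstddef>
#include <cstdint>

template <class Mask>
constexpr Mask active_lane_mask(std::size_t lanes) noexcept
{
    // All bits valid when every bit of Mask maps to a lane; otherwise
    // keep the low `lanes` bits and drop the junk above them.
    return lanes >= sizeof(Mask) * 8 ? Mask(-1)
                                     : Mask((std::uint64_t(1) << lanes) - 1);
}
// Usage sketch: return ~(self == other) & active_lane_mask<register_type>(batch_bool<T, A>::size);
```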
```cpp
XSIMD_INLINE batch_bool<T, A> eq(batch_bool<T, A> const& self, batch_bool<T, A> const& other, requires_arch<avx512vl_256>) noexcept
{
    using register_type = typename batch_bool<T, A>::register_type;
    return register_type(~self.data ^ other.data);
```

```cpp
XSIMD_INLINE batch_bool<T, A> bitwise_not(batch_bool<T, A> const& self, requires_arch<avx512vl_256>) noexcept
{
    using register_type = typename batch_bool<T, A>::register_type;
    return register_type(~self.data);
```
Here we also have a problem with the content of the high bits; the same active-mask fix as above applies.
```cpp
    return _mm256_mask_loadu_epi32(_mm256_setzero_si256(), imm_mask, mem);
}

// store masked
```
We are ignoring the alignment here. We should use _mm256_mask_storeu_epi32 and so on for batches that are not 32-byte aligned.
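A hedged sketch of what dispatching on the alignment mode could look like for the int32 case (free-standing helpers, not the PR's actual signatures):

```cpp
// Sketch only: pick the aligned or unaligned masked-store intrinsic by mode.
#include <immintrin.h>
#include <cstdint>
#include <type_traits>

inline void store_masked_i32(std::int32_t* mem, __m256i v, __mmask8 k, std::true_type /*aligned*/)
{
    _mm256_mask_store_epi32(mem, k, v); // requires mem to be 32-byte aligned
}

inline void store_masked_i32(std::int32_t* mem, __m256i v, __mmask8 k, std::false_type /*unaligned*/)
{
    _mm256_mask_storeu_epi32(mem, k, v); // no alignment requirement
}
```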
```cpp
// load masked
template <class A, bool... Values, class Mode>
XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...> mask, convert<int32_t>, Mode, requires_arch<avx512vl_256>) noexcept
```
Here, instead, we are always hardcoding the unaligned variant, which is safe but suboptimal.
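The same idea sketched for the load side, under the same assumptions as the store sketch above:

```cpp
// Sketch only: honor the Mode tag instead of hardcoding the unaligned load.
#include <immintrin.h>
#include <cstdint>
#include <type_traits>

inline __m256i load_masked_i32(std::int32_t const* mem, __mmask8 k, std::true_type /*aligned*/)
{
    return _mm256_mask_load_epi32(_mm256_setzero_si256(), k, mem); // 32-byte-aligned path
}

inline __m256i load_masked_i32(std::int32_t const* mem, __mmask8 k, std::false_type /*unaligned*/)
{
    return _mm256_mask_loadu_epi32(_mm256_setzero_si256(), k, mem);
}
```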
```bash
if [[ '${{ matrix.sys.flags }}' == 'avx512' ]]; then
    CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512"
fi
if [[ '${{ matrix.sys.flags }}' == 'avx512vl_256' ]]; then
```
The matrix passes `-DXSIMD_DEFAULT_ARCH=avx512f_256`, but the build system only references that variable inside the emulated path (test/CMakeLists.txt:50). For a non-emulated build CMake emits:

```
CMake Warning:
  Manually-specified variables were not used by the project:
    XSIMD_DEFAULT_ARCH
```
```cpp
inline bool avx512vl() const noexcept { return avx512_enabled() && leaf7().all_bits_set<x86_cpuid_leaf7::ebx::avx512vl>(); }
// ...
inline bool avx512vl_256() const noexcept { return avx_enabled() && leaf7().all_bits_set<x86_cpuid_leaf7::ebx::avx512vl>(); }
```
We should check OPMASK XSAVE. avx_enabled() validates XCR0[1:2] (XMM+YMM), but EVEX-encoded AVX-512 instructions use the opmask register state (XCR0[5]) regardless of register width. Without OS-managed OPMASK XSAVE, k0–k7 are not preserved across context switches, so we get silent corruption.
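A sketch of the extra check, assuming the detection code can call XGETBV (GCC/Clang expose the intrinsic with -mxsave; CPUID must report OSXSAVE first):

```cpp
// Sketch: require OS support for opmask state (XCR0 bit 5) on top of XMM/YMM.
#include <cstdint>
#include <immintrin.h>

inline bool opmask_xsave_enabled() noexcept
{
    std::uint64_t xcr0 = _xgetbv(0); // valid only if CPUID reports OSXSAVE
    constexpr std::uint64_t required = (1u << 1)  // XMM
                                     | (1u << 2)  // YMM
                                     | (1u << 5); // opmask k0-k7
    return (xcr0 & required) == required;
}
```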
```cpp
template <class T, class A>
XSIMD_INLINE batch_bool<T, A> from_mask(batch_bool<T, A> const&, uint64_t mask, requires_arch<avx512vl_256>) noexcept
{
    return static_cast<typename batch_bool<T, A>::register_type>(mask);
```
I think we should AND this with `((uint64_t(1) << batch_bool<T, A>::size) - 1)` to truncate the junk bits.
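A sketch of the fixed overload with that truncation applied (assuming size is below 64 for all 256-bit batch types, so the shift is well defined):

```cpp
// Sketch: drop mask bits above the lane count before converting to the opmask.
template <class T, class A>
XSIMD_INLINE batch_bool<T, A> from_mask(batch_bool<T, A> const&, uint64_t mask, requires_arch<avx512vl_256>) noexcept
{
    constexpr uint64_t valid = (uint64_t(1) << batch_bool<T, A>::size) - 1;
    return static_cast<typename batch_bool<T, A>::register_type>(mask & valid);
}
```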
```cpp
using register_type = typename batch_bool<T, A>::register_type;
constexpr auto size = batch_bool<T, A>::size;
for (std::size_t i = 0; i < size; ++i)
    mem[i] = self.data & (register_type(1) << i);
```
This way we store 0 or 2^i, which is safe because C++ treats 0 as false and everything else as true. But a user doing something like arr[mem[i]] with arr of size 2 might do an out-of-bounds access that is difficult to track down. I would store `(self.data >> i) & 1` to make sure we store 0s and 1s.
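A minimal sketch of that change:

```cpp
// Sketch: shift the lane's bit down so the stored value is exactly 0 or 1.
for (std::size_t i = 0; i < size; ++i)
    mem[i] = (self.data >> i) & 1;
```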
```cpp
{
    return _mm256_rol_epi32(self, count);
}
XSIMD_IF_CONSTEXPR(sizeof(T) == 8)
```
Here we do not use else, but we do below. I do not like using else, but it is better to do so: with -O0 the unused-size intrinsic is still instantiated and may emit code.
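For illustration, the else-chained form (the epi64 branch is assumed from the sizeof(T) == 8 case shown in the diff):

```cpp
// Sketch: chain the width dispatch so only the matching intrinsic is reachable.
XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
{
    return _mm256_rol_epi32(self, count);
}
else XSIMD_IF_CONSTEXPR(sizeof(T) == 8)
{
    return _mm256_rol_epi64(self, count);
}
```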