Tentative support for avx512vl extensions to 256 bit registers #1345
serge-sans-paille wants to merge 2 commits into
Conversation
Are we sure that we only need avx512f for this? It seems to me that instructions like https://diamondinoia.com/simdref/#_mm256_cmp_epi32_mask require avx512f + VL. Let me know where I am wrong. Cheers,
You're right, all of this requires avx512f+avx512vl. It turns out most builds have both (see https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512), but we currently don't have anything to model avx512vl, which should be the parent of avx512f_256.
I guess it might be called avx512vl_256 at that point, as we will also have avx512vl_128. What do you think? I agree, most CPUs have 512+extensions.
It looks like avx512vl does not have any 512-bit instructions :-) but it still has avx512 in its name and it implies avx512f, so I agree with you.
Actually, my suggestion might be wrong: there are instructions that require DQ + VL, for example https://diamondinoia.com/simdref/#_mm_maskz_andnot_pd
Based on the graph above, it looks like DQ => VL, so we should still be fine?
It seems like it; also, basically all architectures listed in the chart have DQ, VL, and BW together. So in any case it should be fine.
@DiamonDinoia I've moved the minimal creation of avx512vl to #1350; once merged, I'll rebase this PR.
Sure! Ping me when you need a review. It might be spotty while I'm on holiday.
avx512vl just extends 128- and 256-bit registers with some operations; it does not have any 512-bit instructions, so the description is mostly empty. This is preliminary work for #1345.
Force-pushed from 4b5ae7d to 9d15e04
In addition to missing instructions (e.g. those based on int64_t), this mostly changes the mask representation from vector register to scalar, hence the big diff.
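For context, a minimal illustration (not from the diff) of what that representation change means: an AVX2 comparison yields a 256-bit vector mask, while the same comparison under AVX512F+VL yields a scalar opmask.

```cpp
// Illustration only: vector-register mask (AVX2) vs scalar opmask (AVX512VL).
#include <immintrin.h>

// AVX2: each lane holds all-ones or all-zeros in a full 256-bit register.
__m256i mask_avx2(__m256i a, __m256i b) { return _mm256_cmpeq_epi32(a, b); }

// AVX512F+VL: one bit per lane in a k-register, here 8 bits for 8 int32 lanes.
__mmask8 mask_avx512vl(__m256i a, __m256i b) { return _mm256_cmpeq_epi32_mask(a, b); }
```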
Force-pushed from 9d15e04 to ecdac94
ping ;-)
Just some suggestions, and maybe we could add to the tests:
TEST_CASE_TEMPLATE_DEFINE("batch_bool mask hygiene", B, batch_bool_hygiene_id)
{
using value_type = typename B::value_type;
using batch_type = xsimd::batch<value_type, typename B::arch_type>;
SUBCASE("any(a != a) is false")
{
batch_type a(value_type(1));
CHECK_FALSE(xsimd::any(a != a));
}
SUBCASE("any(~(a == a)) is false")
{
batch_type a(value_type(1));
CHECK_FALSE(xsimd::any(~(a == a)));
}
SUBCASE("eq(false_mask, false_mask) is all-true")
{
auto m0 = (batch_type(value_type(1)) != batch_type(value_type(1)));
CHECK_UNARY(xsimd::all(m0 == m0));
}
SUBCASE("from_mask ignores bits above lane count")
{
constexpr std::size_t N = B::size;
uint64_t valid_all_true = (N == 64) ? ~uint64_t(0)
: ((uint64_t(1) << N) - 1);
uint64_t junk = valid_all_true | (uint64_t(1) << 63);
B clean = B::from_mask(valid_all_true);
B dirty = B::from_mask(junk);
CHECK_EQ(clean.mask(), dirty.mask());
CHECK_UNARY(xsimd::all(clean == dirty));
}
SUBCASE("batch_bool stored to bool[] is canonical 0/1")
{
batch_type a(value_type(1)), b(value_type(2));
auto m = (a == b); // all false
alignas(64) bool buf[B::size + 1] = {true, true}; // sentinel
xsimd::store_aligned(buf, m);
for (std::size_t i = 0; i < B::size; ++i)
{
// bit-level check: must be exactly 0, not just falsy
CHECK_EQ(*reinterpret_cast<uint8_t const*>(&buf[i]), uint8_t(0));
}
}
}
TEST_CASE_TEMPLATE_APPLY(batch_bool_hygiene_id, batch_bool_types);
```cpp
TEST_CASE_TEMPLATE_DEFINE("store_masked respects Mode", B, store_masked_mode_id)
{
    using T = typename B::value_type;
    using A = typename B::arch_type;
    constexpr std::size_t N = B::size;

    // Unaligned-mode + unaligned pointer: must not fault.
    alignas(64) T big[2 * N + 1] = {};
    T* unaligned_p = big + 1; // sizeof(T)-aligned only

    struct AllTrue { static constexpr bool get(std::size_t, std::size_t) { return true; } };
    auto cst = xsimd::make_batch_bool_constant<T, AllTrue, A>();
    B v(T(7));
    xsimd::kernel::store_masked(unaligned_p, v, cst,
                                xsimd::unaligned_mode{},
                                xsimd::kernel::requires_arch<A>{});
    for (std::size_t i = 0; i < N; ++i)
        CHECK_EQ(unaligned_p[i], T(7));

    // Overload resolution: store_masked with a same-typed mask must compile.
    // If C3 regresses, this becomes a compile error.
    auto signed_cst = xsimd::make_batch_bool_constant<T, AllTrue, A>();
    alignas(64) T aligned_buf[N] = {};
    xsimd::kernel::store_masked(aligned_buf, v, signed_cst,
                                xsimd::aligned_mode{},
                                xsimd::kernel::requires_arch<A>{});
    for (std::size_t i = 0; i < N; ++i)
        CHECK_EQ(aligned_buf[i], T(7));
}
TEST_CASE_TEMPLATE_APPLY(store_masked_mode_id, batch_types_for_masked_store);
```

```cpp
template <class A, class T, class = std::enable_if_t<std::is_integral<T>::value>>
XSIMD_INLINE batch_bool<T, A> neq(batch<T, A> const& self, batch<T, A> const& other, requires_arch<avx512vl_256>) noexcept
{
    return ~(self == other);
```
I think this flips junk into the bits above `batch_bool::size`. We might need something like:

```cpp
constexpr auto active_mask = register_type(register_type(-1) >> (sizeof(register_type) * 4));
return ~(self == other) & active_mask;
```
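An alternative sketch (helper name hypothetical, not from the PR) that derives the active mask from the lane count rather than the register width, which also covers element sizes where the opmask is only partially populated:

```cpp
// Hypothetical helper: keep only the opmask bits that correspond to real lanes.
#include <cstddef>
#include <cstdint>

template <class Mask>
constexpr Mask active_lane_mask(std::size_t lanes) noexcept
{
    // All bits valid when every bit of Mask maps to a lane; otherwise
    // keep the low `lanes` bits and drop the junk above them.
    return lanes >= sizeof(Mask) * 8 ? Mask(-1)
                                     : Mask((std::uint64_t(1) << lanes) - 1);
}
// Usage sketch: return ~(self == other) & active_lane_mask<register_type>(batch_bool<T, A>::size);
```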
```cpp
XSIMD_INLINE batch_bool<T, A> eq(batch_bool<T, A> const& self, batch_bool<T, A> const& other, requires_arch<avx512vl_256>) noexcept
{
    using register_type = typename batch_bool<T, A>::register_type;
    return register_type(~self.data ^ other.data);
```

```cpp
XSIMD_INLINE batch_bool<T, A> bitwise_not(batch_bool<T, A> const& self, requires_arch<avx512vl_256>) noexcept
{
    using register_type = typename batch_bool<T, A>::register_type;
    return register_type(~self.data);
```
Here we also have a problem with the content of the high bits; the same active-mask fix as above applies.
```cpp
    return _mm256_mask_loadu_epi32(_mm256_setzero_si256(), imm_mask, mem);
}

// store masked
```
We are ignoring the alignment here. We should use _mm256_mask_storeu_epi32 and so on for batches that are not 32-byte aligned.
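A hedged sketch of what dispatching on the alignment mode could look like for the int32 case (free-standing helpers, not the PR's actual signatures):

```cpp
// Sketch only: pick the aligned or unaligned masked-store intrinsic by mode.
#include <immintrin.h>
#include <cstdint>
#include <type_traits>

inline void store_masked_i32(std::int32_t* mem, __m256i v, __mmask8 k, std::true_type /*aligned*/)
{
    _mm256_mask_store_epi32(mem, k, v); // requires mem to be 32-byte aligned
}

inline void store_masked_i32(std::int32_t* mem, __m256i v, __mmask8 k, std::false_type /*unaligned*/)
{
    _mm256_mask_storeu_epi32(mem, k, v); // no alignment requirement
}
```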
```cpp
// load masked
template <class A, bool... Values, class Mode>
XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...> mask, convert<int32_t>, Mode, requires_arch<avx512vl_256>) noexcept
```
Here, instead, we are always hardcoding the unaligned variant, which is safe but suboptimal.
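The same idea sketched for the load side, under the same assumptions as the store sketch above:

```cpp
// Sketch only: honor the Mode tag instead of hardcoding the unaligned load.
#include <immintrin.h>
#include <cstdint>
#include <type_traits>

inline __m256i load_masked_i32(std::int32_t const* mem, __mmask8 k, std::true_type /*aligned*/)
{
    return _mm256_mask_load_epi32(_mm256_setzero_si256(), k, mem); // 32-byte-aligned path
}

inline __m256i load_masked_i32(std::int32_t const* mem, __mmask8 k, std::false_type /*unaligned*/)
{
    return _mm256_mask_loadu_epi32(_mm256_setzero_si256(), k, mem);
}
```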
```bash
if [[ '${{ matrix.sys.flags }}' == 'avx512' ]]; then
    CMAKE_EXTRA_ARGS="$CMAKE_EXTRA_ARGS -DTARGET_ARCH=skylake-avx512"
fi
if [[ '${{ matrix.sys.flags }}' == 'avx512vl_256' ]]; then
```
The matrix passes `-DXSIMD_DEFAULT_ARCH=avx512f_256`, but the build system only references that variable inside the emulated path (test/CMakeLists.txt:50). For a non-emulated build CMake emits:

```
CMake Warning:
  Manually-specified variables were not used by the project:
    XSIMD_DEFAULT_ARCH
```
```cpp
inline bool avx512vl() const noexcept { return avx512_enabled() && leaf7().all_bits_set<x86_cpuid_leaf7::ebx::avx512vl>(); }
// ...
inline bool avx512vl_256() const noexcept { return avx_enabled() && leaf7().all_bits_set<x86_cpuid_leaf7::ebx::avx512vl>(); }
```
We should check OPMASK XSAVE. avx_enabled() validates XCR0[1:2] (XMM+YMM), but EVEX-encoded AVX-512 instructions use the opmask register state (XCR0[5]) regardless of register width. Without OS-managed OPMASK XSAVE, k0–k7 are not preserved across context switches, so we get silent corruption.
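A sketch of the extra check, assuming the detection code can call XGETBV (GCC/Clang expose the intrinsic with -mxsave; CPUID must report OSXSAVE first):

```cpp
// Sketch: require OS support for opmask state (XCR0 bit 5) on top of XMM/YMM.
#include <cstdint>
#include <immintrin.h>

inline bool opmask_xsave_enabled() noexcept
{
    std::uint64_t xcr0 = _xgetbv(0); // valid only if CPUID reports OSXSAVE
    constexpr std::uint64_t required = (1u << 1)  // XMM
                                     | (1u << 2)  // YMM
                                     | (1u << 5); // opmask k0-k7
    return (xcr0 & required) == required;
}
```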
```cpp
template <class T, class A>
XSIMD_INLINE batch_bool<T, A> from_mask(batch_bool<T, A> const&, uint64_t mask, requires_arch<avx512vl_256>) noexcept
{
    return static_cast<typename batch_bool<T, A>::register_type>(mask);
```
I think we should AND this with `((uint64_t(1) << batch_bool<T, A>::size) - 1)` to truncate the junk bits.
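A sketch of the fixed overload with that truncation applied (assuming size is below 64 for all 256-bit batch types, so the shift is well defined):

```cpp
// Sketch: drop mask bits above the lane count before converting to the opmask.
template <class T, class A>
XSIMD_INLINE batch_bool<T, A> from_mask(batch_bool<T, A> const&, uint64_t mask, requires_arch<avx512vl_256>) noexcept
{
    constexpr uint64_t valid = (uint64_t(1) << batch_bool<T, A>::size) - 1;
    return static_cast<typename batch_bool<T, A>::register_type>(mask & valid);
}
```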
```cpp
using register_type = typename batch_bool<T, A>::register_type;
constexpr auto size = batch_bool<T, A>::size;
for (std::size_t i = 0; i < size; ++i)
    mem[i] = self.data & (register_type(1) << i);
```
This way we store 0 or 2^i, which is safe because C++ treats 0 as false and everything else as true. But a user doing something like arr[mem[i]] with arr of size 2 might do an out-of-bounds access that is difficult to track down. I would store `(self.data >> i) & 1` to make sure we store 0s and 1s.
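A minimal sketch of that change:

```cpp
// Sketch: shift the lane's bit down so the stored value is exactly 0 or 1.
for (std::size_t i = 0; i < size; ++i)
    mem[i] = (self.data >> i) & 1;
```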
```cpp
{
    return _mm256_rol_epi32(self, count);
}
XSIMD_IF_CONSTEXPR(sizeof(T) == 8)
```
Here we do not use else, but we do below. I do not like using else, but it is better to do so: with -O0 the unused-size intrinsic is still instantiated and may emit code.
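For illustration, the else-chained form (the epi64 branch is assumed from the sizeof(T) == 8 case shown in the diff):

```cpp
// Sketch: chain the width dispatch so only the matching intrinsic is reachable.
XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
{
    return _mm256_rol_epi32(self, count);
}
else XSIMD_IF_CONSTEXPR(sizeof(T) == 8)
{
    return _mm256_rol_epi64(self, count);
}
```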