[Debian-med-packaging] Bug#1088545: simde: Cherry-pick LSX support from simde upstream
zhangdandan
zhangdandan at loongson.cn
Thu Nov 28 08:54:46 GMT 2024
Source: simde
Version: 0.8.2-1
Severity: wishlist
Tags: patch FTBFS
User: debian-loongarch at lists.debian.org
Usertags: loong64
Dear maintainers,
Compiling obs-studio (which depends on simde, arch=all) failed for loong64 in
the Debian Package Auto-Building environment.
The errors are caused by simde lacking LSX support; the relevant log excerpts are as follows.
1. The first error is related to simde's x86/sse.h.
```
/usr/include/simde/x86/sse.h: In function ‘simde_x_mm_round_ps’:
/usr/include/simde/x86/sse.h:632:22: error: incompatible types when
assigning to type ‘v2i64’ from type ‘__m128’ {aka ‘__vector(4) float’}
632 | r_.lsx_i64 = __lsx_vfrintrne_s(a_.lsx_f32);
| ^~~~~~~~~~~~~~~~~
/usr/include/simde/x86/sse.h:651:22: error: incompatible types when
assigning to type ‘v2i64’ from type ‘__m128’ {aka ‘__vector(4) float’}
651 | r_.lsx_i64 = __lsx_vfrintrm_s(a_.lsx_f32);
| ^~~~~~~~~~~~~~~~
/usr/include/simde/x86/sse.h:670:22: error: incompatible types when
assigning to type ‘v2i64’ from type ‘__m128’ {aka ‘__vector(4) float’}
670 | r_.lsx_i64 = __lsx_vfrintrp_s(a_.lsx_f32);
| ^~~~~~~~~~~~~~~~
/usr/include/simde/x86/sse.h:689:22: error: incompatible types when
assigning to type ‘v2i64’ from type ‘__m128’ {aka ‘__vector(4) float’}
689 | r_.lsx_i64 = __lsx_vfrintrz_s(a_.lsx_f32);
```
Cherry-picking the LSX changes for x86/sse.h from upstream resolves this
first error; a minimal sketch of the underlying type mismatch follows.
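For illustration, here is a small standalone sketch of that mismatch
(hypothetical union and function names; it assumes a LoongArch toolchain with
LSX enabled and only shows the general shape of the fix; the exact change is
in upstream PR #1215):
```
/* __lsx_vfrintrne_s() returns a float vector (v4f32 / __m128), so storing
 * its result into a v2i64 union member needs a reinterpreting cast (or the
 * matching float member); a plain assignment triggers the error above. */
#include <lsxintrin.h>

typedef union {          /* hypothetical stand-in for simde's private union */
  v2i64 lsx_i64;
  v4f32 lsx_f32;
} vec128;

vec128 round_nearest_even(vec128 a) {
  vec128 r;
  /* r.lsx_i64 = __lsx_vfrintrne_s(a.lsx_f32);      <- incompatible types    */
  r.lsx_i64 = (v2i64)__lsx_vfrintrne_s(a.lsx_f32);  /* reinterpret, same bits */
  return r;
}
```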
2. The second error is related to simde's x86/sse2.h.
```
In file included from
/home/11.25/obs-studio/obs-studio-30.2.3+dfsg/libobs/graphics/../util/sse-intrin.h:31,
from
/home/11.25/obs-studio/obs-studio-30.2.3+dfsg/libobs/graphics/vec4.h:23,
from
/home/11.25/obs-studio/obs-studio-30.2.3+dfsg/libobs/graphics/image-file.c:22:
/usr/include/simde/x86/sse2.h:260:24: error: conflicting types for
‘__m128i’; have ‘simde__m128i’
260 | typedef simde__m128i __m128i;
```
Cherry-picking the LSX support for x86/sse2.h from upstream resolves this
second error; a simplified sketch of why the typedef conflicts is shown below.
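As a rough illustration of that conflict (the vector typedefs below are
simplified stand-ins, not the real definitions from lsxintrin.h or sse2.h):
C only accepts the repeated typedef of __m128i when both declarations name
exactly the same type, which is what the cherry-picked hunk arranges by
defining simde__m128i as v2i64 under SIMDE_LOONGARCH_LSX_NATIVE.
```
/* Simplified sketch; the real types live in lsxintrin.h and simde/x86/sse2.h.
 * C11 (6.7p3) allows repeating a typedef only when it names the same type. */
typedef long long lsx_vec   __attribute__((__vector_size__(16))); /* like LSX's v2i64          */
typedef long      other_vec __attribute__((__vector_size__(16))); /* a distinct 128-bit vector */

typedef lsx_vec __m128i;   /* stand-in for the __m128i that lsxintrin.h already provides */

/* Before the patch, simde__m128i came from simde's generic vector fallback,
 * which is not the same type as the compiler's __m128i, so sse2.h:260
 * re-declared __m128i with a different type:
 *
 *   typedef other_vec __m128i;   // -> error: conflicting types for '__m128i'
 *
 * The cherry-picked hunk instead defines simde__m128i as v2i64 (and
 * simde__m128d as v2f64) for SIMDE_LOONGARCH_LSX_NATIVE, so the alias
 * repeats an identical typedef and compiles: */
typedef lsx_vec __m128i;   /* same type again: accepted */
```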
In summary, the LSX support for sse.h and sse2.h has already been merged
into simde's master branch.
The details are as follows:
```
* x86/sse.h: Fix type convert error for LSX.
- Applied-Upstream: master,
https://github.com/simd-everywhere/simde/pull/1215
* x86/sse2.h: Add LSX support for sse2.h.
- Applied-Upstream: master,
https://github.com/simd-everywhere/simde/pull/1236
```
The upstream simde code that supports LSX has been merged into the master
branch, but no upstream release contains it yet; the latest version packaged
in Debian is 0.8.2-1.
So I have added LSX support to Debian simde 0.8.2-1.
Please consider the attached patch (cherry-picking LSX support from simde
upstream).
1. The complete change is in the debdiff
simde-cherry-pick-LSX-support-from-upstream.debdiff.
2. You can also refer to the two independent patches:
simde-fix-type-convert-error-for-LSX.patch
simde-add-LSX-support-for-sse2-header-file.patch
With the attached patch applied, I built simde and installed
libsimde-dev_0.8.2-1+loong64_all.deb.
obs-studio then built successfully in my local environment; the two
compilation errors above are gone:
```
dh_md5sums
-O--builddirectory=/home/obs-studio/obs-studio-30.2.3\+dfsg/obj-loongarch64-linux-gnu
dh_builddeb
-O--builddirectory=/home/obs-studio/obs-studio-30.2.3\+dfsg/obj-loongarch64-linux-gnu
dpkg-deb: building package 'obs-studio-dbgsym' in
'../obs-studio-dbgsym_30.2.3+dfsg-2_loong64.deb'.
dpkg-deb: building package 'obs-studio' in
'../obs-studio_30.2.3+dfsg-2_loong64.deb'.
dpkg-deb: building package 'obs-plugins' in
'../obs-plugins_30.2.3+dfsg-2_loong64.deb'.
dpkg-deb: building package 'libobs0t64' in
'../libobs0t64_30.2.3+dfsg-2_loong64.deb'.
dpkg-deb: building package 'obs-plugins-dbgsym' in
'../obs-plugins-dbgsym_30.2.3+dfsg-2_loong64.deb'.
dpkg-deb: building package 'libobs-dev' in
'../libobs-dev_30.2.3+dfsg-2_loong64.deb'.
dpkg-deb: building package 'libobs0t64-dbgsym' in
'../libobs0t64-dbgsym_30.2.3+dfsg-2_loong64.deb'.
dpkg-genbuildinfo --build=binary
-O../obs-studio_30.2.3+dfsg-2_loong64.buildinfo
dpkg-genchanges --build=binary
-O../obs-studio_30.2.3+dfsg-2_loong64.changes
```
Could you add LSX support to simde in the next upload?
Your opinions are welcome.
Best regards,
Dandan Zhang
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simde-add-LSX-support-for-sse2-header-file.patch
Type: text/x-patch
Size: 112954 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/debian-med-packaging/attachments/20241128/dfbb2aa5/attachment-0002.bin>
-------------- next part --------------
diff -Nru simde-0.8.2/debian/changelog simde-0.8.2/debian/changelog
--- simde-0.8.2/debian/changelog 2024-05-02 11:04:35.000000000 +0000
+++ simde-0.8.2/debian/changelog 2024-11-27 09:59:12.000000000 +0000
@@ -1,3 +1,12 @@
+simde (0.8.2-1+loong64) unreleased; urgency=medium
+
+ * x86/sse.h: Fix type convert error for LSX.
+ - Applied-Upstream: master, https://github.com/simd-everywhere/simde/pull/1215
+ * x86/sse2.h: Add LSX support for sse2.h.
+ - Applied-Upstream: master, https://github.com/simd-everywhere/simde/pull/1236
+
+ -- Dandan Zhang <zhangdandan at loongson.cn> Wed, 27 Nov 2024 17:59:12 +0800
+
simde (0.8.2-1) unstable; urgency=medium
* New upstream version
diff -Nru simde-0.8.2/debian/patches/series simde-0.8.2/debian/patches/series
--- simde-0.8.2/debian/patches/series 2024-04-16 08:46:36.000000000 +0000
+++ simde-0.8.2/debian/patches/series 2024-11-27 09:50:14.000000000 +0000
@@ -1,2 +1,4 @@
munit
pkgconfig
+simde-fix-type-convert-error-for-LSX.patch
+simde-add-LSX-support-for-sse2-header-file.patch
diff -Nru simde-0.8.2/debian/patches/simde-add-LSX-support-for-sse2-header-file.patch simde-0.8.2/debian/patches/simde-add-LSX-support-for-sse2-header-file.patch
--- simde-0.8.2/debian/patches/simde-add-LSX-support-for-sse2-header-file.patch 1970-01-01 00:00:00.000000000 +0000
+++ simde-0.8.2/debian/patches/simde-add-LSX-support-for-sse2-header-file.patch 2024-11-27 09:59:12.000000000 +0000
@@ -0,0 +1,2284 @@
+Description: loongarch: add lsx support for sse2.h
+ .
+ simde (0.8.2-1+loong64) unreleased; urgency=medium
+ .
+ * loongarch: add lsx support for sse2.h.
+ - Applied-Upstream: master, https://github.com/simd-everywhere/simde/pull/1236
+Author: Dandan Zhang <zhangdandan at loongson.cn>
+---
+The information above should follow the Patch Tagging Guidelines, please
+checkout https://dep.debian.net/deps/dep3/ to learn about the format. Here
+are templates for supplementary fields that you might want to add:
+
+Applied-Upstream: master, https://github.com/simd-everywhere/simde/pull/1236
+Signed-Off-By: HecaiYuan
+Last-Update: 2024-11-27
+
+--- simde-0.8.2.orig/simde/x86/sse2.h
++++ simde-0.8.2/simde/x86/sse2.h
+@@ -139,6 +139,17 @@ typedef union {
+ SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long) altivec_u64;
+ SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(double) altivec_f64;
+ #endif
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ v16i8 lsx_i8;
++ v8i16 lsx_i16;
++ v4i32 lsx_i32;
++ v2i64 lsx_i64;
++ v16u8 lsx_u8;
++ v8u16 lsx_u16;
++ v4u32 lsx_u32;
++ v2u64 lsx_u64;
++ v4f32 lsx_f32;
++ v2f64 lsx_f64;
+ #endif
+ } simde__m128i_private;
+
+@@ -223,6 +234,17 @@ typedef union {
+ SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long) altivec_u64;
+ SIMDE_ALIGN_TO_16 SIMDE_POWER_ALTIVEC_VECTOR(double) altivec_f64;
+ #endif
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ v16i8 lsx_i8;
++ v8i16 lsx_i16;
++ v4i32 lsx_i32;
++ v2i64 lsx_i64;
++ v16u8 lsx_u8;
++ v8u16 lsx_u16;
++ v4u32 lsx_u32;
++ v2u64 lsx_u64;
++ v4f32 lsx_f32;
++ v2f64 lsx_f64;
+ #endif
+ } simde__m128d_private;
+
+@@ -248,6 +270,9 @@ typedef union {
+ #else
+ typedef simde__m128d_private simde__m128d;
+ #endif
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ typedef v2i64 simde__m128i;
++ typedef v2f64 simde__m128d;
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT)
+ typedef int64_t simde__m128i SIMDE_ALIGN_TO_16 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS;
+ typedef simde_float64 simde__m128d SIMDE_ALIGN_TO_16 SIMDE_VECTOR(16) SIMDE_MAY_ALIAS;
+@@ -328,6 +353,17 @@ simde__m128d_to_private(simde__m128d v)
+ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, SIMDE_POWER_ALTIVEC_VECTOR(unsigned long long), altivec, u64)
+ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, SIMDE_POWER_ALTIVEC_VECTOR(signed long long), altivec, i64)
+ #endif
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v16i8, lsx, i8)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v8i16, lsx, i16)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v4i32, lsx, i32)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v2i64, lsx, i64)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v16u8, lsx, u8)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v8u16, lsx, u16)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v4u32, lsx, u32)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v2u64, lsx, u64)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v4f32, lsx, f32)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v2f64, lsx, f64)
+ #endif /* defined(SIMDE_ARM_NEON_A32V7_NATIVE) */
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+@@ -375,6 +411,17 @@ simde__m128d_to_private(simde__m128d v)
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v128_t, wasm, v128);
+ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128i, v128_t, wasm, v128);
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v16i8, lsx, i8)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v8i16, lsx, i16)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v4i32, lsx, i32)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v2i64, lsx, i64)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v16u8, lsx, u8)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v8u16, lsx, u16)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v4u32, lsx, u32)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v2u64, lsx, u64)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v4f32, lsx, f32)
++ SIMDE_X86_GENERATE_CONVERSION_FUNCTION(m128d, v2f64, lsx, f64)
+ #endif /* defined(SIMDE_ARM_NEON_A32V7_NATIVE) */
+
+ SIMDE_FUNCTION_ATTRIBUTES
+@@ -390,6 +437,9 @@ simde_mm_set_pd (simde_float64 e1, simde
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ SIMDE_ALIGN_TO_16 simde_float64 data[2] = { e0, e1 };
+ r_.neon_f64 = vld1q_f64(data);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_TO_16 simde_float64 data[2] = { e0, e1 };
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.f64[0] = e0;
+ r_.f64[1] = e1;
+@@ -416,6 +466,8 @@ simde_mm_set1_pd (simde_float64 a) {
+ r_.neon_f64 = vdupq_n_f64(a);
+ #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_f64 = vec_splats(HEDLEY_STATIC_CAST(double, a));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vldrepl_d(&a, 0);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) {
+@@ -451,6 +503,9 @@ simde_x_mm_abs_pd(simde__m128d a) {
+ r_.altivec_f64 = vec_abs(a_.altivec_f64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_abs(a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ uint64_t u64_ = UINT64_C(0x7FFFFFFFFFFFFFFF);
++ r_.lsx_i64 = __lsx_vand_v(__lsx_vldrepl_d(&u64_, 0), a_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -481,6 +536,8 @@ simde_x_mm_not_pd(simde__m128d a) {
+ r_.altivec_i32 = vec_nor(a_.altivec_i32, a_.altivec_i32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_not(a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vnor_v(a_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32f = ~a_.i32f;
+ #else
+@@ -518,6 +575,8 @@ simde_x_mm_select_pd(simde__m128d a, sim
+ r_.i64 = a_.i64 ^ ((a_.i64 ^ b_.i64) & mask_.i64);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i64 = vbslq_s64(mask_.neon_u64, b_.neon_i64, a_.neon_i64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vbitsel_v(a_.lsx_i64, b_.lsx_i64, mask_.lsx_u64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) {
+@@ -546,6 +605,8 @@ simde_mm_add_epi8 (simde__m128i a, simde
+ r_.altivec_i8 = vec_add(a_.altivec_i8, b_.altivec_i8);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i8x16_add(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vadd_b(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i8 = a_.i8 + b_.i8;
+ #else
+@@ -579,6 +640,8 @@ simde_mm_add_epi16 (simde__m128i a, simd
+ r_.altivec_i16 = vec_add(a_.altivec_i16, b_.altivec_i16);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_add(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vadd_h(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i16 = a_.i16 + b_.i16;
+ #else
+@@ -612,6 +675,8 @@ simde_mm_add_epi32 (simde__m128i a, simd
+ r_.altivec_i32 = vec_add(a_.altivec_i32, b_.altivec_i32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_add(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vadd_w(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32 = a_.i32 + b_.i32;
+ #else
+@@ -643,6 +708,8 @@ simde_mm_add_epi64 (simde__m128i a, simd
+ r_.neon_i64 = vaddq_s64(a_.neon_i64, b_.neon_i64);
+ #elif defined(SIMDE_POWER_ALTIVEC_P8_NATIVE)
+ r_.altivec_i64 = vec_add(a_.altivec_i64, b_.altivec_i64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vadd_d(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_add(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+@@ -680,6 +747,8 @@ simde_mm_add_pd (simde__m128d a, simde__
+ r_.altivec_f64 = vec_add(a_.altivec_f64, b_.altivec_f64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_add(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vfadd_d(a_.lsx_f64, b_.lsx_f64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.f64 = a_.f64 + b_.f64;
+ #else
+@@ -717,6 +786,8 @@ simde_mm_move_sd (simde__m128d a, simde_
+ #endif
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_shuffle(a_.wasm_v128, b_.wasm_v128, 2, 1);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(b_.lsx_i64, a_.lsx_i64, 0b00010001);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.f64 = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.f64, b_.f64, 2, 1);
+ #else
+@@ -751,6 +822,8 @@ simde_x_mm_broadcastlow_pd(simde__m128d
+ r_.altivec_f64 = vec_splat(a_.altivec_f64, 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_splat(a_.f64[0]);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vreplvei_d(a_.lsx_i64, 0);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.f64 = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.f64, a_.f64, 0, 0);
+ #else
+@@ -778,10 +851,12 @@ simde_mm_add_sd (simde__m128d a, simde__
+ r_,
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+-
+- r_.f64[0] = a_.f64[0] + b_.f64[0];
+- r_.f64[1] = a_.f64[1];
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfadd_d(b_.lsx_f64, a_.lsx_f64), 0);
++ #else
++ r_.f64[0] = a_.f64[0] + b_.f64[0];
++ r_.f64[1] = a_.f64[1];
++ #endif
+ return simde__m128d_from_private(r_);
+ #endif
+ }
+@@ -830,6 +905,8 @@ simde_mm_adds_epi8 (simde__m128i a, simd
+ r_.wasm_v128 = wasm_i8x16_add_sat(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ r_.altivec_i8 = vec_adds(a_.altivec_i8, b_.altivec_i8);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsadd_b(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) {
+@@ -861,6 +938,8 @@ simde_mm_adds_epi16 (simde__m128i a, sim
+ r_.wasm_v128 = wasm_i16x8_add_sat(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ r_.altivec_i16 = vec_adds(a_.altivec_i16, b_.altivec_i16);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsadd_h(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+@@ -892,6 +971,8 @@ simde_mm_adds_epu8 (simde__m128i a, simd
+ r_.wasm_v128 = wasm_u8x16_add_sat(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+ r_.altivec_u8 = vec_adds(a_.altivec_u8, b_.altivec_u8);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsadd_bu(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) {
+@@ -923,6 +1004,8 @@ simde_mm_adds_epu16 (simde__m128i a, sim
+ r_.wasm_v128 = wasm_u16x8_add_sat(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ r_.altivec_u16 = vec_adds(a_.altivec_u16, b_.altivec_u16);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsadd_hu(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) {
+@@ -954,6 +1037,8 @@ simde_mm_and_pd (simde__m128d a, simde__
+ r_.wasm_v128 = wasm_v128_and(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE)
+ r_.altivec_f64 = vec_and(a_.altivec_f64, b_.altivec_f64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vand_v(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32f = a_.i32f & b_.i32f;
+ #else
+@@ -987,6 +1072,8 @@ simde_mm_and_si128 (simde__m128i a, simd
+ r_.altivec_u32f = vec_and(a_.altivec_u32f, b_.altivec_u32f);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_and(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vand_v(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32f = a_.i32f & b_.i32f;
+ #else
+@@ -1022,6 +1109,8 @@ simde_mm_andnot_pd (simde__m128d a, simd
+ r_.altivec_f64 = vec_andc(b_.altivec_f64, a_.altivec_f64);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ r_.altivec_i32f = vec_andc(b_.altivec_i32f, a_.altivec_i32f);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vandn_v(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32f = ~a_.i32f & b_.i32f;
+ #else
+@@ -1055,6 +1144,8 @@ simde_mm_andnot_si128 (simde__m128i a, s
+ r_.altivec_i32 = vec_andc(b_.altivec_i32, a_.altivec_i32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_andnot(b_.wasm_v128, a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vandn_v(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32f = ~a_.i32f & b_.i32f;
+ #else
+@@ -1088,6 +1179,8 @@ simde_mm_xor_pd (simde__m128d a, simde__
+ r_.wasm_v128 = wasm_v128_xor(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i64 = veorq_s64(a_.neon_i64, b_.neon_i64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vxor_v(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i32f) / sizeof(r_.i32f[0])) ; i++) {
+@@ -1119,6 +1212,8 @@ simde_mm_avg_epu8 (simde__m128i a, simde
+ r_.wasm_v128 = wasm_u8x16_avgr(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_u8 = vec_avg(a_.altivec_u8, b_.altivec_u8);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vavgr_bu(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_)
+ uint16_t wa SIMDE_VECTOR(32);
+ uint16_t wb SIMDE_VECTOR(32);
+@@ -1158,6 +1253,8 @@ simde_mm_avg_epu16 (simde__m128i a, simd
+ r_.wasm_v128 = wasm_u16x8_avgr(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_u16 = vec_avg(a_.altivec_u16, b_.altivec_u16);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vavgr_hu(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) && defined(SIMDE_CONVERT_VECTOR_)
+ uint32_t wa SIMDE_VECTOR(32);
+ uint32_t wb SIMDE_VECTOR(32);
+@@ -1194,6 +1291,8 @@ simde_mm_setzero_si128 (void) {
+ r_.altivec_i32 = vec_splats(HEDLEY_STATIC_CAST(signed int, 0));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_splat(INT32_C(0));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vreplgr2vr_w(0);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT)
+ r_.i32 = __extension__ (__typeof__(r_.i32)) { 0, 0, 0, 0 };
+ #else
+@@ -1245,6 +1344,9 @@ simde_mm_bslli_si128 (simde__m128i a, co
+ }
+ #if defined(SIMDE_X86_SSE2_NATIVE) && !defined(__PGI)
+ #define simde_mm_bslli_si128(a, imm8) _mm_slli_si128(a, imm8)
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_bslli_si128(a, imm8) \
++ (((imm8)<=0) ? (a) : (((imm8)>15) ? simde_mm_setzero_si128() : simde__m128i_from_lsx_i8((v16i8)__lsx_vbsll_v(simde__m128i_to_private(a).lsx_i64, (imm8)))))
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__)
+ #define simde_mm_bslli_si128(a, imm8) \
+ simde__m128i_from_neon_i8(((imm8) <= 0) ? simde__m128i_to_neon_i8(a) : (((imm8) > 15) ? (vdupq_n_s8(0)) : (vextq_s8(vdupq_n_s8(0), simde__m128i_to_neon_i8(a), 16 - (imm8)))))
+@@ -1340,6 +1442,9 @@ simde_mm_bsrli_si128 (simde__m128i a, co
+ }
+ #if defined(SIMDE_X86_SSE2_NATIVE) && !defined(__PGI)
+ #define simde_mm_bsrli_si128(a, imm8) _mm_srli_si128(a, imm8)
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_bsrli_si128(a, imm8) \
++ (((imm8)<=0) ? (a) : (((imm8)>15) ? simde_mm_setzero_si128() : simde__m128i_from_lsx_i8((v16i8)__lsx_vbsrl_v(simde__m128i_to_private(a).lsx_i64, (imm8)))))
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && !defined(__clang__)
+ #define simde_mm_bsrli_si128(a, imm8) \
+ simde__m128i_from_neon_i8(((imm8 < 0) || (imm8 > 15)) ? vdupq_n_s8(0) : (vextq_s8(simde__m128i_to_private(a).neon_i8, vdupq_n_s8(0), ((imm8 & 15) != 0) ? imm8 : (imm8 & 15))))
+@@ -1436,6 +1541,8 @@ simde_mm_comieq_sd (simde__m128d a, simd
+ return !!vgetq_lane_u64(vceqq_f64(a_.neon_f64, b_.neon_f64), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) == wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return !!__lsx_vpickve2gr_d(__lsx_vfcmp_ceq_d(b_.lsx_f64, a_.lsx_f64), 0);
+ #else
+ return a_.f64[0] == b_.f64[0];
+ #endif
+@@ -1458,6 +1565,8 @@ simde_mm_comige_sd (simde__m128d a, simd
+ return !!vgetq_lane_u64(vcgeq_f64(a_.neon_f64, b_.neon_f64), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) >= wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return !!__lsx_vpickve2gr_d(__lsx_vfcmp_cle_d(b_.lsx_f64, a_.lsx_f64), 0);
+ #else
+ return a_.f64[0] >= b_.f64[0];
+ #endif
+@@ -1480,6 +1589,8 @@ simde_mm_comigt_sd (simde__m128d a, simd
+ return !!vgetq_lane_u64(vcgtq_f64(a_.neon_f64, b_.neon_f64), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) > wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return !!__lsx_vpickve2gr_d(__lsx_vfcmp_clt_d(b_.lsx_f64, a_.lsx_f64), 0);
+ #else
+ return a_.f64[0] > b_.f64[0];
+ #endif
+@@ -1502,6 +1613,8 @@ simde_mm_comile_sd (simde__m128d a, simd
+ return !!vgetq_lane_u64(vcleq_f64(a_.neon_f64, b_.neon_f64), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) <= wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return !!__lsx_vpickve2gr_d(__lsx_vfcmp_cle_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #else
+ return a_.f64[0] <= b_.f64[0];
+ #endif
+@@ -1524,6 +1637,8 @@ simde_mm_comilt_sd (simde__m128d a, simd
+ return !!vgetq_lane_u64(vcltq_f64(a_.neon_f64, b_.neon_f64), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) < wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return !!__lsx_vpickve2gr_d(__lsx_vfcmp_clt_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #else
+ return a_.f64[0] < b_.f64[0];
+ #endif
+@@ -1546,6 +1661,8 @@ simde_mm_comineq_sd (simde__m128d a, sim
+ return !vgetq_lane_u64(vceqq_f64(a_.neon_f64, b_.neon_f64), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) != wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return !__lsx_vpickve2gr_d(__lsx_vfcmp_ceq_d(b_.lsx_f64, a_.lsx_f64), 0);
+ #else
+ return a_.f64[0] != b_.f64[0];
+ #endif
+@@ -1579,6 +1696,9 @@ simde_x_mm_copysign_pd(simde__m128d dest
+ #else
+ r_.altivec_f64 = vec_cpsgn(src_.altivec_f64, dest_.altivec_f64);
+ #endif
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ const v2f64 sign_pos = {-0.0f, -0.0f};
++ r_.lsx_i64 = __lsx_vbitsel_v(dest_.lsx_i64, src_.lsx_i64, (v2i64)sign_pos);
+ #elif defined(simde_math_copysign)
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -1605,6 +1725,8 @@ simde_mm_castpd_ps (simde__m128d a) {
+ return _mm_castpd_ps(a);
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ return vreinterpretq_f32_f64(a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return (simde__m128)a;
+ #else
+ simde__m128 r;
+ simde_memcpy(&r, &a, sizeof(a));
+@@ -1622,6 +1744,8 @@ simde_mm_castpd_si128 (simde__m128d a) {
+ return _mm_castpd_si128(a);
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ return vreinterpretq_s64_f64(a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return (simde__m128i)a;
+ #else
+ simde__m128i r;
+ simde_memcpy(&r, &a, sizeof(a));
+@@ -1639,6 +1763,8 @@ simde_mm_castps_pd (simde__m128 a) {
+ return _mm_castps_pd(a);
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ return vreinterpretq_f64_f32(a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return (simde__m128d)a;
+ #else
+ simde__m128d r;
+ simde_memcpy(&r, &a, sizeof(a));
+@@ -1656,6 +1782,8 @@ simde_mm_castps_si128 (simde__m128 a) {
+ return _mm_castps_si128(a);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ return simde__m128i_from_neon_i32(simde__m128_to_private(a).neon_i32);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return (simde__m128i)a;
+ #else
+ simde__m128i r;
+ simde_memcpy(&r, &a, sizeof(a));
+@@ -1673,6 +1801,8 @@ simde_mm_castsi128_pd (simde__m128i a) {
+ return _mm_castsi128_pd(a);
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ return vreinterpretq_f64_s64(a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return (simde__m128d)a;
+ #else
+ simde__m128d r;
+ simde_memcpy(&r, &a, sizeof(a));
+@@ -1692,6 +1822,8 @@ simde_mm_castsi128_ps (simde__m128i a) {
+ return HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(float), a);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ return simde__m128_from_neon_i32(simde__m128i_to_private(a).neon_i32);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return HEDLEY_REINTERPRET_CAST(__m128, a);
+ #else
+ simde__m128 r;
+ simde_memcpy(&r, &a, sizeof(a));
+@@ -1719,6 +1851,8 @@ simde_mm_cmpeq_epi8 (simde__m128i a, sim
+ r_.wasm_v128 = wasm_i8x16_eq(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), vec_cmpeq(a_.altivec_i8, b_.altivec_i8));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vseq_b(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (a_.i8 == b_.i8));
+ #else
+@@ -1752,6 +1886,8 @@ simde_mm_cmpeq_epi16 (simde__m128i a, si
+ r_.wasm_v128 = wasm_i16x8_eq(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed short), vec_cmpeq(a_.altivec_i16, b_.altivec_i16));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vseq_h(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i16 = (a_.i16 == b_.i16);
+ #else
+@@ -1785,6 +1921,8 @@ simde_mm_cmpeq_epi32 (simde__m128i a, si
+ r_.wasm_v128 = wasm_i32x4_eq(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmpeq(a_.altivec_i32, b_.altivec_i32));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vseq_w(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), a_.i32 == b_.i32);
+ #else
+@@ -1820,6 +1958,8 @@ simde_mm_cmpeq_pd (simde__m128d a, simde
+ r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpeq(a_.altivec_f64, b_.altivec_f64));
+ #elif defined(SIMDE_MIPS_MSA_NATIVE)
+ r_.msa_i32 = __msa_addv_w(a_.msa_i32, b_.msa_i32);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vfcmp_ceq_d(a_.lsx_f64, b_.lsx_f64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 == b_.f64));
+ #else
+@@ -1851,9 +1991,12 @@ simde_mm_cmpeq_sd (simde__m128d a, simde
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+
+- r_.u64[0] = (a_.u64[0] == b_.u64[0]) ? ~UINT64_C(0) : 0;
+- r_.u64[1] = a_.u64[1];
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfcmp_ceq_d(a_.lsx_f64, b_.lsx_f64), 0);
++ #else
++ r_.u64[0] = (a_.u64[0] == b_.u64[0]) ? ~UINT64_C(0) : 0;
++ r_.u64[1] = a_.u64[1];
++ #endif
+ return simde__m128d_from_private(r_);
+ #endif
+ }
+@@ -1876,6 +2019,8 @@ simde_mm_cmpneq_pd (simde__m128d a, simd
+ r_.neon_u32 = vmvnq_u32(vreinterpretq_u32_u64(vceqq_f64(b_.neon_f64, a_.neon_f64)));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_ne(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vfcmp_cune_d(a_.lsx_f64, b_.lsx_f64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 != b_.f64));
+ #else
+@@ -1906,11 +2051,12 @@ simde_mm_cmpneq_sd (simde__m128d a, simd
+ r_,
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+-
+- r_.u64[0] = (a_.f64[0] != b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
+- r_.u64[1] = a_.u64[1];
+-
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfcmp_cune_d(a_.lsx_f64, b_.lsx_f64), 0);
++ #else
++ r_.u64[0] = (a_.f64[0] != b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
++ r_.u64[1] = a_.u64[1];
++ #endif
+ return simde__m128d_from_private(r_);
+ #endif
+ }
+@@ -1935,6 +2081,8 @@ simde_mm_cmplt_epi8 (simde__m128i a, sim
+ r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char),vec_cmplt(a_.altivec_i8, b_.altivec_i8));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i8x16_lt(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vslt_b(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (a_.i8 < b_.i8));
+ #else
+@@ -1968,6 +2116,8 @@ simde_mm_cmplt_epi16 (simde__m128i a, si
+ r_.altivec_i16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed short), vec_cmplt(a_.altivec_i16, b_.altivec_i16));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_lt(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vslt_h(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), (a_.i16 < b_.i16));
+ #else
+@@ -2001,6 +2151,8 @@ simde_mm_cmplt_epi32 (simde__m128i a, si
+ r_.altivec_i32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmplt(a_.altivec_i32, b_.altivec_i32));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_lt(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vslt_w(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.i32 < b_.i32));
+ #else
+@@ -2034,6 +2186,8 @@ simde_mm_cmplt_pd (simde__m128d a, simde
+ r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmplt(a_.altivec_f64, b_.altivec_f64));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_lt(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vfcmp_clt_d(a_.lsx_f64, b_.lsx_f64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i64 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i64), (a_.f64 < b_.f64));
+ #else
+@@ -2065,9 +2219,12 @@ simde_mm_cmplt_sd (simde__m128d a, simde
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+
+- r_.u64[0] = (a_.f64[0] < b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
+- r_.u64[1] = a_.u64[1];
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfcmp_clt_d(a_.lsx_f64, b_.lsx_f64), 0);
++ #else
++ r_.u64[0] = (a_.f64[0] < b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
++ r_.u64[1] = a_.u64[1];
++ #endif
+ return simde__m128d_from_private(r_);
+ #endif
+ }
+@@ -2094,6 +2251,8 @@ simde_mm_cmple_pd (simde__m128d a, simde
+ r_.wasm_v128 = wasm_f64x2_le(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmple(a_.altivec_f64, b_.altivec_f64));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vfcmp_cle_d(a_.lsx_f64, b_.lsx_f64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -2122,10 +2281,12 @@ simde_mm_cmple_sd (simde__m128d a, simde
+ r_,
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+-
+- r_.u64[0] = (a_.f64[0] <= b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
+- r_.u64[1] = a_.u64[1];
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfcmp_cle_d(a_.lsx_f64, b_.lsx_f64), 0);
++ #else
++ r_.u64[0] = (a_.f64[0] <= b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
++ r_.u64[1] = a_.u64[1];
++ #endif
+ return simde__m128d_from_private(r_);
+ #endif
+ }
+@@ -2150,6 +2311,8 @@ simde_mm_cmpgt_epi8 (simde__m128i a, sim
+ r_.wasm_v128 = wasm_i8x16_gt(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i8 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed char), vec_cmpgt(a_.altivec_i8, b_.altivec_i8));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vslt_b(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i8 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i8), (a_.i8 > b_.i8));
+ #else
+@@ -2183,6 +2346,8 @@ simde_mm_cmpgt_epi16 (simde__m128i a, si
+ r_.wasm_v128 = wasm_i16x8_gt(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i16 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed short), vec_cmpgt(a_.altivec_i16, b_.altivec_i16));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vslt_h(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i16 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i16), (a_.i16 > b_.i16));
+ #else
+@@ -2216,6 +2381,8 @@ simde_mm_cmpgt_epi32 (simde__m128i a, si
+ r_.wasm_v128 = wasm_i32x4_gt(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i32 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), vec_cmpgt(a_.altivec_i32, b_.altivec_i32));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vslt_w(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32 = HEDLEY_REINTERPRET_CAST(__typeof__(r_.i32), (a_.i32 > b_.i32));
+ #else
+@@ -2251,6 +2418,8 @@ simde_mm_cmpgt_pd (simde__m128d a, simde
+ r_.wasm_v128 = wasm_f64x2_gt(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpgt(a_.altivec_f64, b_.altivec_f64));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vfcmp_clt_d(b_.lsx_f64, a_.lsx_f64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -2279,10 +2448,12 @@ simde_mm_cmpgt_sd (simde__m128d a, simde
+ r_,
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+-
+- r_.u64[0] = (a_.f64[0] > b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
+- r_.u64[1] = a_.u64[1];
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfcmp_clt_d(b_.lsx_f64, a_.lsx_f64), 0);
++ #else
++ r_.u64[0] = (a_.f64[0] > b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
++ r_.u64[1] = a_.u64[1];
++ #endif
+ return simde__m128d_from_private(r_);
+ #endif
+ }
+@@ -2309,6 +2480,8 @@ simde_mm_cmpge_pd (simde__m128d a, simde
+ r_.wasm_v128 = wasm_f64x2_ge(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_f64 = HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(double), vec_cmpge(a_.altivec_f64, b_.altivec_f64));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vfcmp_cle_d(b_.lsx_f64, a_.lsx_f64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -2338,9 +2511,12 @@ simde_mm_cmpge_sd (simde__m128d a, simde
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+
+- r_.u64[0] = (a_.f64[0] >= b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
+- r_.u64[1] = a_.u64[1];
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfcmp_cle_d(b_.lsx_f64, a_.lsx_f64), 0);
++ #else
++ r_.u64[0] = (a_.f64[0] >= b_.f64[0]) ? ~UINT64_C(0) : UINT64_C(0);
++ r_.u64[1] = a_.u64[1];
++ #endif
+ return simde__m128d_from_private(r_);
+ #endif
+ }
+@@ -2473,6 +2649,9 @@ simde_mm_cmpord_pd (simde__m128d a, simd
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_and(wasm_f64x2_eq(a_.wasm_v128, a_.wasm_v128),
+ wasm_f64x2_eq(b_.wasm_v128, b_.wasm_v128));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vand_v(__lsx_vfcmp_ceq_d(a_.lsx_f64, a_.lsx_f64),
++ __lsx_vfcmp_ceq_d(b_.lsx_f64, b_.lsx_f64));
+ #elif defined(simde_math_isnan)
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -2500,6 +2679,9 @@ simde_mm_cvtsd_f64 (simde__m128d a) {
+ return HEDLEY_STATIC_CAST(simde_float64, vgetq_lane_f64(a_.neon_f64, 0));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return HEDLEY_STATIC_CAST(simde_float64, wasm_f64x2_extract_lane(a_.wasm_v128, 0));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vstelm_d(a_.lsx_i64, &a_.f64, 0, 0);
++ return a_.f64[0];
+ #else
+ return a_.f64[0];
+ #endif
+@@ -2524,7 +2706,10 @@ simde_mm_cmpord_sd (simde__m128d a, simd
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+
+- #if defined(simde_math_isnan)
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, __lsx_vand_v(__lsx_vfcmp_ceq_d(a_.lsx_f64,
++ a_.lsx_f64), __lsx_vfcmp_ceq_d(b_.lsx_f64, b_.lsx_f64)), 0);
++ #elif defined(simde_math_isnan)
+ r_.u64[0] = (!simde_math_isnan(a_.f64[0]) && !simde_math_isnan(b_.f64[0])) ? ~UINT64_C(0) : UINT64_C(0);
+ r_.u64[1] = a_.u64[1];
+ #else
+@@ -2556,6 +2741,9 @@ simde_mm_cmpunord_pd (simde__m128d a, si
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_or(wasm_f64x2_ne(a_.wasm_v128, a_.wasm_v128),
+ wasm_f64x2_ne(b_.wasm_v128, b_.wasm_v128));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vor_v(__lsx_vfcmp_cune_d(a_.lsx_f64, a_.lsx_f64),
++ __lsx_vfcmp_cune_d(b_.lsx_f64, b_.lsx_f64));
+ #elif defined(simde_math_isnan)
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -2587,7 +2775,9 @@ simde_mm_cmpunord_sd (simde__m128d a, si
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+
+- #if defined(simde_math_isnan)
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, __lsx_vor_v(__lsx_vfcmp_cune_d(a_.lsx_f64, a_.lsx_f64), __lsx_vfcmp_cune_d(b_.lsx_f64, b_.lsx_f64)), 0);
++ #elif defined(simde_math_isnan)
+ r_.u64[0] = (simde_math_isnan(a_.f64[0]) || simde_math_isnan(b_.f64[0])) ? ~UINT64_C(0) : UINT64_C(0);
+ r_.u64[1] = a_.u64[1];
+ #else
+@@ -2612,6 +2802,8 @@ simde_mm_cvtepi32_pd (simde__m128i a) {
+
+ #if defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_convert_low_i32x4(a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vffintl_d_w(a_.lsx_i64);
+ #elif defined(SIMDE_CONVERT_VECTOR_)
+ SIMDE_CONVERT_VECTOR_(r_.f64, a_.m64_private[0].i32);
+ #else
+@@ -2648,6 +2840,8 @@ simde_mm_cvtepi32_ps (simde__m128i a) {
+ #endif
+ r_.altivec_f32 = vec_ctf(a_.altivec_i32, 0);
+ HEDLEY_DIAGNOSTIC_POP
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f32 = __lsx_vffint_s_w(a_.lsx_i64);
+ #elif defined(SIMDE_CONVERT_VECTOR_)
+ SIMDE_CONVERT_VECTOR_(r_.f32, a_.i32);
+ #else
+@@ -2699,9 +2893,13 @@ simde_mm_cvtpd_epi32 (simde__m128d a) {
+ #else
+ simde__m128i_private r_;
+
+- r_.m64[0] = simde_mm_cvtpd_pi32(a);
+- r_.m64[1] = simde_mm_setzero_si64();
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE) && defined(SIMDE_FAST_NANS)
++ const v2f64 zero_f64 = {-0.0f, -0.0f};
++ r_.lsx_i64 = __lsx_vftintrne_w_d(zero_f64, simde__m128d_to_private(a).lsx_f64);
++ #else
++ r_.m64[0] = simde_mm_cvtpd_pi32(a);
++ r_.m64[1] = simde_mm_setzero_si64();
++ #endif
+ return simde__m128i_from_private(r_);
+ #endif
+ }
+@@ -2724,6 +2922,9 @@ simde_mm_cvtpd_ps (simde__m128d a) {
+ r_.altivec_f32 = vec_float2(a_.altivec_f64, vec_splats(0.0));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f32x4_demote_f64x2_zero(a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ const v2f64 zero_f64 = {-0.0f, -0.0f};
++ r_.lsx_f32 = __lsx_vfcvt_s_d(zero_f64, a_.lsx_f64);
+ #elif HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && HEDLEY_HAS_BUILTIN(__builtin_convertvector)
+ float __attribute__((__vector_size__(8))) z = { 0.0f, 0.0f };
+ r_.f32 =
+@@ -2792,6 +2993,9 @@ simde_mm_cvtps_epi32 (simde__m128 a) {
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FAST_ROUND_TIES)
+ a_ = simde__m128_to_private(a);
+ r_.wasm_v128 = wasm_i32x4_trunc_sat_f32x4(a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE) && defined(SIMDE_FAST_ROUND_TIES)
++ a_ = simde__m128_to_private(a);
++ r_.lsx_i32 = __lsx_vftintrne_w_s(a_.lsx_f32);
+ #else
+ a_ = simde__m128_to_private(simde_x_mm_round_ps(a, SIMDE_MM_FROUND_TO_NEAREST_INT, 1));
+ SIMDE_VECTORIZE
+@@ -2828,6 +3032,8 @@ simde_mm_cvtps_pd (simde__m128 a) {
+ SIMDE_CONVERT_VECTOR_(r_.f64, a_.m64_private[0].f32);
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ r_.neon_f64 = vcvt_f64_f32(vget_low_f32(a_.neon_f32));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vfcvtl_d_s(a_.lsx_f32);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -2847,6 +3053,9 @@ int32_t
+ simde_mm_cvtsd_si32 (simde__m128d a) {
+ #if defined(SIMDE_X86_SSE2_NATIVE)
+ return _mm_cvtsd_si32(a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE)
++ simde__m128d_private a_ = simde__m128d_to_private(a);
++ return __lsx_vpickve2gr_w(__lsx_vftintrne_w_d(a_.lsx_f64, a_.lsx_f64), 0);
+ #else
+ simde__m128d_private a_ = simde__m128d_to_private(a);
+
+@@ -2874,7 +3083,11 @@ simde_mm_cvtsd_si64 (simde__m128d a) {
+ #endif
+ #else
+ simde__m128d_private a_ = simde__m128d_to_private(a);
+- return SIMDE_CONVERT_FTOI(int64_t, simde_math_round(a_.f64[0]));
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return __lsx_vpickve2gr_d(__lsx_vftintrne_l_d(a_.lsx_f64), 0);
++ #else
++ return SIMDE_CONVERT_FTOI(int64_t, simde_math_round(a_.f64[0]));
++ #endif
+ #endif
+ }
+ #define simde_mm_cvtsd_si64x(a) simde_mm_cvtsd_si64(a)
+@@ -2896,6 +3109,8 @@ simde_mm_cvtsd_ss (simde__m128 a, simde_
+
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ r_.neon_f32 = vsetq_lane_f32(vcvtxd_f32_f64(vgetq_lane_f64(b_.neon_f64, 0)), a_.neon_f32, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_w(a_.lsx_i64, __lsx_vfcvt_s_d(b_.lsx_f64, b_.lsx_f64), 0);
+ #else
+ r_.f32[0] = HEDLEY_STATIC_CAST(simde_float32, b_.f64[0]);
+
+@@ -2926,6 +3141,8 @@ simde_x_mm_cvtsi128_si16 (simde__m128i a
+ (void) a_;
+ #endif
+ return vec_extract(a_.altivec_i16, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return __lsx_vpickve2gr_h(a_.lsx_i64, 0);
+ #else
+ return a_.i16[0];
+ #endif
+@@ -2949,6 +3166,8 @@ simde_mm_cvtsi128_si32 (simde__m128i a)
+ (void) a_;
+ #endif
+ return vec_extract(a_.altivec_i32, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return __lsx_vpickve2gr_w(a_.lsx_i64, 0);
+ #else
+ return a_.i32[0];
+ #endif
+@@ -2973,6 +3192,8 @@ simde_mm_cvtsi128_si64 (simde__m128i a)
+ return vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed long long), a_.i64), 0);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ return vgetq_lane_s64(a_.neon_i64, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return __lsx_vpickve2gr_d(a_.lsx_i64, 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return HEDLEY_STATIC_CAST(int64_t, wasm_i64x2_extract_lane(a_.wasm_v128, 0));
+ #endif
+@@ -2996,6 +3217,9 @@ simde_mm_cvtsi32_sd (simde__m128d a, int
+
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ r_.neon_f64 = vsetq_lane_f64(HEDLEY_STATIC_CAST(float64_t, b), a_.neon_f64, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ simde_float64 b_float64 = (simde_float64)b;
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, __lsx_vldrepl_d(&(b_float64), 0), 0);
+ #else
+ r_.f64[0] = HEDLEY_STATIC_CAST(simde_float64, b);
+ r_.i64[1] = a_.i64[1];
+@@ -3017,6 +3241,8 @@ simde_x_mm_cvtsi16_si128 (int16_t a) {
+ r_.neon_i16 = vsetq_lane_s16(a, vdupq_n_s16(0), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_make(a, 0, 0, 0, 0, 0, 0, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vinsgr2vr_h(__lsx_vreplgr2vr_h(0), a, 0);
+ #else
+ r_.i16[0] = a;
+ r_.i16[1] = 0;
+@@ -3043,6 +3269,8 @@ simde_mm_cvtsi32_si128 (int32_t a) {
+ r_.neon_i32 = vsetq_lane_s32(a, vdupq_n_s32(0), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_make(a, 0, 0, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vinsgr2vr_w(__lsx_vreplgr2vr_w(0), a, 0);
+ #else
+ r_.i32[0] = a;
+ r_.i32[1] = 0;
+@@ -3073,6 +3301,9 @@ simde_mm_cvtsi64_sd (simde__m128d a, int
+
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ r_.neon_f64 = vsetq_lane_f64(HEDLEY_STATIC_CAST(float64_t, b), a_.neon_f64, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ simde_float64 b_float64 = (simde_float64)b;
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, __lsx_vldrepl_d(&(b_float64), 0), 0);
+ #else
+ r_.f64[0] = HEDLEY_STATIC_CAST(simde_float64, b);
+ r_.f64[1] = a_.f64[1];
+@@ -3103,6 +3334,8 @@ simde_mm_cvtsi64_si128 (int64_t a) {
+ r_.neon_i64 = vsetq_lane_s64(a, vdupq_n_s64(0), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_make(a, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vinsgr2vr_d(__lsx_vreplgr2vr_d(0), a, 0);
+ #else
+ r_.i64[0] = a;
+ r_.i64[1] = 0;
+@@ -3130,8 +3363,11 @@ simde_mm_cvtss_sd (simde__m128d a, simde
+ a_ = simde__m128d_to_private(a);
+ simde__m128_private b_ = simde__m128_to_private(b);
+
+- a_.f64[0] = HEDLEY_STATIC_CAST(simde_float64, b_.f32[0]);
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ a_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfcvtl_d_s(b_.lsx_f32), 0);
++ #else
++ a_.f64[0] = HEDLEY_STATIC_CAST(simde_float64, b_.f32[0]);
++ #endif
+ return simde__m128d_from_private(a_);
+ #endif
+ }
+@@ -3177,9 +3413,13 @@ simde_mm_cvttpd_epi32 (simde__m128d a) {
+ #else
+ simde__m128i_private r_;
+
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE) && defined(SIMDE_FAST_NANS)
++ const v2f64 zero_f64 = {-0.0f, -0.0f};
++ r_.lsx_i64 = __lsx_vftintrz_w_d(zero_f64, simde__m128d_to_private(a).lsx_f64);
++ #else
+ r_.m64[0] = simde_mm_cvttpd_pi32(a);
+ r_.m64[1] = simde_mm_setzero_si64();
+-
++ #endif
+ return simde__m128i_from_private(r_);
+ #endif
+ }
+@@ -3234,6 +3474,25 @@ simde_mm_cvttps_epi32 (simde__m128 a) {
+
+ r_.wasm_v128 = wasm_v128_bitselect(r_.wasm_v128, wasm_i32x4_splat(INT32_MIN), valid_input);
+ #endif
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __m128i temp = __lsx_vftintrz_w_s(a_.lsx_f32);
++ #if !defined(SIMDE_FAST_CONVERSION_RANGE) || !defined(SIMDE_FAST_NANS)
++ #if !defined(SIMDE_FAST_CONVERSION_RANGE) && !defined(SIMDE_FAST_NANS)
++ simde_float32 f1 = 2147483648.0f;
++ __m128i valid_input =
++ __lsx_vand_v(
++ __lsx_vfcmp_clt_s(a_.lsx_f32, (__m128)__lsx_vldrepl_w(&f1, 0)),
++ __lsx_vfcmp_ceq_s(a_.lsx_f32, a_.lsx_f32)
++ );
++ #elif !defined(SIMDE_FAST_CONVERSION_RANGE)
++ simde_float32 f1 = 2147483648.0f;
++ __m128i valid_input = __lsx_vfcmp_clt_s(a_.lsx_f32, (__m128)__lsx_vldrepl_w(&f1, 0));
++ #elif !defined(SIMDE_FAST_NANS)
++ __m128i valid_input = __lsx_vfcmp_ceq_s(a_.lsx_f32, a_.lsx_f32);
++ #endif
++
++ r_.lsx_i64 = __lsx_vbitsel_v(__lsx_vreplgr2vr_w(INT32_MIN), temp, valid_input);
++ #endif
+ #elif defined(SIMDE_CONVERT_VECTOR_) && !defined(SIMDE_ARCH_POWER)
+ SIMDE_CONVERT_VECTOR_(r_.i32, a_.f32);
+
+@@ -3277,6 +3536,9 @@ int32_t
+ simde_mm_cvttsd_si32 (simde__m128d a) {
+ #if defined(SIMDE_X86_SSE2_NATIVE)
+ return _mm_cvttsd_si32(a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE) && defined(SIMDE_FAST_CONVERSION_RANGE)
++ simde__m128d_private a_ = simde__m128d_to_private(a);
++ return __lsx_vpickve2gr_w(__lsx_vftintrz_w_d(a_.lsx_f64, a_.lsx_f64), 0);
+ #else
+ simde__m128d_private a_ = simde__m128d_to_private(a);
+ simde_float64 v = a_.f64[0];
+@@ -3301,6 +3563,9 @@ simde_mm_cvttsd_si64 (simde__m128d a) {
+ #else
+ return _mm_cvttsd_si64x(a);
+ #endif
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ simde__m128d_private a_ = simde__m128d_to_private(a);
++ return __lsx_vpickve2gr_d(__lsx_vftintrz_l_d(a_.lsx_f64), 0);
+ #else
+ simde__m128d_private a_ = simde__m128d_to_private(a);
+ return SIMDE_CONVERT_FTOI(int64_t, a_.f64[0]);
+@@ -3329,6 +3594,8 @@ simde_mm_div_pd (simde__m128d a, simde__
+ r_.neon_f64 = vdivq_f64(a_.neon_f64, b_.neon_f64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_div(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vfdiv_d(b_.lsx_f64, a_.lsx_f64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -3361,6 +3628,9 @@ simde_mm_div_sd (simde__m128d a, simde__
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ float64x2_t temp = vdivq_f64(a_.neon_f64, b_.neon_f64);
+ r_.neon_f64 = vsetq_lane_f64(vgetq_lane(a_.neon_f64, 1), temp, 1);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __m128d temp = __lsx_vfdiv_d(a_.lsx_f64, b_.lsx_f64);
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)temp, 0);
+ #else
+ r_.f64[0] = a_.f64[0] / b_.f64[0];
+ r_.f64[1] = a_.f64[1];
+@@ -3398,6 +3668,8 @@ simde_mm_extract_epi16 (simde__m128i a,
+ #define simde_mm_extract_epi16(a, imm8) (HEDLEY_STATIC_CAST(int32_t, vgetq_lane_s16(simde__m128i_to_private(a).neon_i16, (imm8))) & (INT32_C(0x0000ffff)))
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ #define simde_mm_extract_epi16(a, imm8) HEDLEY_STATIC_CAST(int32_t, wasm_u16x8_extract_lane(simde__m128i_to_wasm_v128((a)), (imm8) & 7))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_extract_epi16(a, imm8) HEDLEY_STATIC_CAST(int32_t, __lsx_vpickve2gr_hu(simde__m128i_to_private(a).lsx_i64, imm8))
+ #endif
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+ #define _mm_extract_epi16(a, imm8) simde_mm_extract_epi16(a, imm8)
+@@ -3417,6 +3689,8 @@ simde_mm_insert_epi16 (simde__m128i a, i
+ #define simde_mm_insert_epi16(a, i, imm8) simde__m128i_from_neon_i16(vsetq_lane_s16((i), simde__m128i_to_neon_i16(a), (imm8)))
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ #define simde_mm_insert_epi16(a, i, imm8) wasm_i16x8_replace_lane(simde__m128i_to_wasm_v128((a)), (imm8) & 7, (i))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_insert_epi16(a, i, imm8) simde__m128i_from_lsx_i16((v8i16)__lsx_vinsgr2vr_h(simde__m128i_to_private(a).lsx_i64, i, imm8))
+ #endif
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+ #define _mm_insert_epi16(a, i, imm8) simde_mm_insert_epi16(a, i, imm8)
+@@ -3436,6 +3710,8 @@ simde_mm_load_pd (simde_float64 const me
+ r_.neon_u32 = vld1q_u32(HEDLEY_REINTERPRET_CAST(uint32_t const*, mem_addr));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_load(mem_addr);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vld(mem_addr, 0);
+ #else
+ simde_memcpy(&r_, SIMDE_ALIGN_ASSUME_LIKE(mem_addr, simde__m128d), sizeof(r_));
+ #endif
+@@ -3456,6 +3732,8 @@ simde_mm_load1_pd (simde_float64 const*
+ return simde__m128d_from_neon_f64(vld1q_dup_f64(mem_addr));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return simde__m128d_from_wasm_v128(wasm_v128_load64_splat(mem_addr));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return (simde__m128d)__lsx_vldrepl_d(mem_addr, 0);
+ #else
+ return simde_mm_set1_pd(*mem_addr);
+ #endif
+@@ -3478,6 +3756,8 @@ simde_mm_load_sd (simde_float64 const* m
+ r_.neon_f64 = vsetq_lane_f64(*mem_addr, vdupq_n_f64(0), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_load64_zero(HEDLEY_REINTERPRET_CAST(const void*, mem_addr));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(__lsx_vreplgr2vr_d(0), __lsx_vldrepl_d(mem_addr, 0), 0);
+ #else
+ r_.f64[0] = *mem_addr;
+ r_.u64[1] = UINT64_C(0);
+@@ -3497,6 +3777,8 @@ simde_mm_load_si128 (simde__m128i const*
+ return _mm_load_si128(HEDLEY_REINTERPRET_CAST(__m128i const*, mem_addr));
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ return vld1q_s64(HEDLEY_REINTERPRET_CAST(int64_t const*, mem_addr));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return (simde__m128i)__lsx_vld(mem_addr, 0);
+ #else
+ simde__m128i_private r_;
+
+@@ -3527,6 +3809,8 @@ simde_mm_loadh_pd (simde__m128d a, simde
+ r_.neon_f64 = vcombine_f64(vget_low_f64(a_.neon_f64), vld1_f64(HEDLEY_REINTERPRET_CAST(const float64_t*, mem_addr)));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_load64_lane(HEDLEY_REINTERPRET_CAST(const void*, mem_addr), a_.wasm_v128, 1);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvl_d(__lsx_vldrepl_d(mem_addr, 0), a_.lsx_i64);
+ #else
+ simde_float64 t;
+
+@@ -3555,6 +3839,8 @@ simde_mm_loadl_epi64 (simde__m128i const
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i64 = vcombine_s64(vld1_s64(HEDLEY_REINTERPRET_CAST(int64_t const *, mem_addr)), vdup_n_s64(0));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vinsgr2vr_d(__lsx_vldrepl_d(mem_addr, 0), 0, 1);
+ #else
+ r_.i64[0] = value;
+ r_.i64[1] = 0;
+@@ -3582,6 +3868,8 @@ simde_mm_loadl_pd (simde__m128d a, simde
+ HEDLEY_REINTERPRET_CAST(const float64_t*, mem_addr)), vget_high_f64(a_.neon_f64));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_load64_lane(HEDLEY_REINTERPRET_CAST(const void*, mem_addr), a_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvh_d(a_.lsx_i64, __lsx_vldrepl_d(mem_addr, 0));
+ #else
+ r_.f64[0] = *mem_addr;
+ r_.u64[1] = a_.u64[1];
+@@ -3612,6 +3900,9 @@ simde_mm_loadr_pd (simde_float64 const m
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ v128_t tmp = wasm_v128_load(mem_addr);
+ r_.wasm_v128 = wasm_i64x2_shuffle(tmp, tmp, 1, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __m128i temp = __lsx_vld(mem_addr, 0);
++ r_.lsx_i64 = __lsx_vshuf4i_d(temp, temp, 0b0001);
+ #else
+ r_.f64[0] = mem_addr[1];
+ r_.f64[1] = mem_addr[0];
+@@ -3631,6 +3922,8 @@ simde_mm_loadu_pd (simde_float64 const m
+ return _mm_loadu_pd(mem_addr);
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ return vld1q_f64(mem_addr);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return (simde__m128d)__lsx_vld(mem_addr, 0);
+ #else
+ simde__m128d_private r_;
+
+@@ -3658,6 +3951,8 @@ simde_mm_loadu_epi8(void const * mem_add
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i8 = vld1q_s8(HEDLEY_REINTERPRET_CAST(int8_t const*, mem_addr));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vld(mem_addr, 0);
+ #else
+ simde_memcpy(&r_, mem_addr, sizeof(r_));
+ #endif
+@@ -3687,6 +3982,8 @@ simde_mm_loadu_epi16(void const * mem_ad
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i16 = vreinterpretq_s16_s8(vld1q_s8(HEDLEY_REINTERPRET_CAST(int8_t const*, mem_addr)));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vld(mem_addr, 0);
+ #else
+ simde_memcpy(&r_, mem_addr, sizeof(r_));
+ #endif
+@@ -3715,6 +4012,8 @@ simde_mm_loadu_epi32(void const * mem_ad
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i32 = vreinterpretq_s32_s8(vld1q_s8(HEDLEY_REINTERPRET_CAST(int8_t const*, mem_addr)));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vld(HEDLEY_REINTERPRET_CAST(int8_t const*, mem_addr), 0);
+ #else
+ simde_memcpy(&r_, mem_addr, sizeof(r_));
+ #endif
+@@ -3744,6 +4043,8 @@ simde_mm_loadu_epi64(void const * mem_ad
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i64 = vreinterpretq_s64_s8(vld1q_s8(HEDLEY_REINTERPRET_CAST(int8_t const*, mem_addr)));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vld(mem_addr, 0);
+ #else
+ simde_memcpy(&r_, mem_addr, sizeof(r_));
+ #endif
+@@ -3776,6 +4077,8 @@ simde_mm_loadu_si128 (void const* mem_ad
+ HEDLEY_DIAGNOSTIC_POP
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i8 = vld1q_s8(HEDLEY_REINTERPRET_CAST(int8_t const*, mem_addr));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vld(mem_addr, 0);
+ #else
+ simde_memcpy(&r_, mem_addr, sizeof(r_));
+ #endif
+@@ -3822,6 +4125,9 @@ simde_mm_madd_epi16 (simde__m128i a, sim
+ r_.i32 =
+ __builtin_shufflevector(p32, p32, 0, 2, 4, 6) +
+ __builtin_shufflevector(p32, p32, 1, 3, 5, 7);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __m128i temp_ev = __lsx_vmulwev_w_h(a_.lsx_i64, b_.lsx_i64);
++ r_.lsx_i64 = __lsx_vmaddwod_w_h(temp_ev, a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i16[0])) ; i += 2) {
+@@ -3846,11 +4152,17 @@ simde_mm_maskmoveu_si128 (simde__m128i a
+ a_ = simde__m128i_to_private(a),
+ mask_ = simde__m128i_to_private(mask);
+
+- for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) {
+- if (mask_.u8[i] & 0x80) {
+- mem_addr[i] = a_.i8[i];
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __m128i temp = __lsx_vld(mem_addr, 0);
++ __m128i temp1 = __lsx_vbitsel_v(temp, a_.lsx_i64, __lsx_vslti_b(mask_.lsx_i64, 0));
++ __lsx_vst(temp1, mem_addr, 0);
++ #else
++ for (size_t i = 0 ; i < (sizeof(a_.i8) / sizeof(a_.i8[0])) ; i++) {
++ if (mask_.u8[i] & 0x80) {
++ mem_addr[i] = a_.i8[i];
++ }
+ }
+- }
++ #endif
+ #endif
+ }
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+@@ -3899,6 +4211,8 @@ simde_mm_movemask_epi8 (simde__m128i a)
+ r = HEDLEY_STATIC_CAST(int32_t, vec_extract(vec_vbpermq(a_.altivec_u8, perm), 14));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r = HEDLEY_STATIC_CAST(int32_t, wasm_i8x16_bitmask(a_.wasm_v128));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r = __lsx_vpickve2gr_w(__lsx_vmskltz_b(a_.lsx_i64), 0);
+ #else
+ SIMDE_VECTORIZE_REDUCTION(|:r)
+ for (size_t i = 0 ; i < (sizeof(a_.u8) / sizeof(a_.u8[0])) ; i++) {
+@@ -3940,6 +4254,8 @@ simde_mm_movemask_pd (simde__m128d a) {
+ r = HEDLEY_STATIC_CAST(int32_t, vec_extract(HEDLEY_REINTERPRET_CAST(SIMDE_POWER_ALTIVEC_VECTOR(signed int), res), 2));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r = HEDLEY_STATIC_CAST(int32_t, wasm_i64x2_bitmask(a_.wasm_v128));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r = __lsx_vpickve2gr_w(__lsx_vmskltz_d(a_.lsx_i64), 0);
+ #else
+ SIMDE_VECTORIZE_REDUCTION(|:r)
+ for (size_t i = 0 ; i < (sizeof(a_.u64) / sizeof(a_.u64[0])) ; i++) {
+@@ -3965,6 +4281,8 @@ simde_mm_movepi64_pi64 (simde__m128i a)
+
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ r_.neon_i64 = vget_low_s64(a_.neon_i64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.i64[0] = __lsx_vpickve2gr_d(a_.lsx_i64, 0);
+ #else
+ r_.i64[0] = a_.i64[0];
+ #endif
+@@ -3987,6 +4305,8 @@ simde_mm_movpi64_epi64 (simde__m64 a) {
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i64 = vcombine_s64(a_.neon_i64, vdup_n_s64(0));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vinsgr2vr_d(__lsx_vreplgr2vr_d(0), a_.i64[0], 0);
+ #else
+ r_.i64[0] = a_.i64[0];
+ r_.i64[1] = 0;
+@@ -4016,6 +4336,8 @@ simde_mm_min_epi16 (simde__m128i a, simd
+ r_.wasm_v128 = wasm_i16x8_min(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i16 = vec_min(a_.altivec_i16, b_.altivec_i16);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmin_h(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+@@ -4047,6 +4369,8 @@ simde_mm_min_epu8 (simde__m128i a, simde
+ r_.wasm_v128 = wasm_u8x16_min(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_u8 = vec_min(a_.altivec_u8, b_.altivec_u8);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmin_bu(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) {
+@@ -4078,6 +4402,8 @@ simde_mm_min_pd (simde__m128d a, simde__
+ r_.neon_f64 = vminq_f64(a_.neon_f64, b_.neon_f64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_min(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vfmin_d(a_.lsx_f64, b_.lsx_f64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -4110,6 +4436,8 @@ simde_mm_min_sd (simde__m128d a, simde__
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ float64x2_t temp = vminq_f64(a_.neon_f64, b_.neon_f64);
+ r_.neon_f64 = vsetq_lane_f64(vgetq_lane(a_.neon_f64, 1), temp, 1);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfmin_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #else
+ r_.f64[0] = (a_.f64[0] < b_.f64[0]) ? a_.f64[0] : b_.f64[0];
+ r_.f64[1] = a_.f64[1];
+@@ -4139,6 +4467,8 @@ simde_mm_max_epi16 (simde__m128i a, simd
+ r_.wasm_v128 = wasm_i16x8_max(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i16 = vec_max(a_.altivec_i16, b_.altivec_i16);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmax_h(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+@@ -4170,6 +4500,8 @@ simde_mm_max_epu8 (simde__m128i a, simde
+ r_.wasm_v128 = wasm_u8x16_max(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_u8 = vec_max(a_.altivec_u8, b_.altivec_u8);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmax_bu(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.u8) / sizeof(r_.u8[0])) ; i++) {
+@@ -4201,6 +4533,8 @@ simde_mm_max_pd (simde__m128d a, simde__
+ r_.wasm_v128 = wasm_f64x2_max(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ r_.neon_f64 = vmaxq_f64(a_.neon_f64, b_.neon_f64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vfmax_d(a_.lsx_f64, b_.lsx_f64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -4233,6 +4567,8 @@ simde_mm_max_sd (simde__m128d a, simde__
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ float64x2_t temp = vmaxq_f64(a_.neon_f64, b_.neon_f64);
+ r_.neon_f64 = vsetq_lane_f64(vgetq_lane(a_.neon_f64, 1), temp, 1);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfmax_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #else
+ r_.f64[0] = (a_.f64[0] > b_.f64[0]) ? a_.f64[0] : b_.f64[0];
+ r_.f64[1] = a_.f64[1];
+@@ -4259,6 +4595,8 @@ simde_mm_move_epi64 (simde__m128i a) {
+ r_.neon_i64 = vsetq_lane_s64(0, a_.neon_i64, 1);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_shuffle(a_.wasm_v128, wasm_i64x2_const(0, 0), 0, 2);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvl_d(__lsx_vreplgr2vr_d(0), a_.lsx_i64);
+ #else
+ r_.i64[0] = a_.i64[0];
+ r_.i64[1] = 0;
+@@ -4290,6 +4628,8 @@ simde_mm_mul_epu32 (simde__m128i a, simd
+ r_.wasm_v128 = wasm_u64x2_extmul_low_u32x4(
+ wasm_i32x4_shuffle(a_.wasm_v128, a_.wasm_v128, 0, 2, 0, 2),
+ wasm_i32x4_shuffle(b_.wasm_v128, b_.wasm_v128, 0, 2, 0, 2));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmulwev_d_wu(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_) && (SIMDE_ENDIAN_ORDER == SIMDE_ENDIAN_LITTLE)
+ __typeof__(a_.u32) z = { 0, };
+ a_.u32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.u32, z, 0, 4, 2, 6);
+@@ -4320,6 +4660,8 @@ simde_x_mm_mul_epi64 (simde__m128i a, si
+
+ #if defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_mul(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmul_d(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i64 = a_.i64 * b_.i64;
+ #else
+@@ -4340,7 +4682,9 @@ simde_x_mm_mod_epi64 (simde__m128i a, si
+ a_ = simde__m128i_to_private(a),
+ b_ = simde__m128i_to_private(b);
+
+- #if defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_PGI_30104)
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmod_d(a_.lsx_i64, b_.lsx_i64);
++ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS) && !defined(SIMDE_BUG_PGI_30104)
+ r_.i64 = a_.i64 % b_.i64;
+ #else
+ SIMDE_VECTORIZE
+@@ -4369,6 +4713,8 @@ simde_mm_mul_pd (simde__m128d a, simde__
+ r_.neon_f64 = vmulq_f64(a_.neon_f64, b_.neon_f64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_mul(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vfmul_d(a_.lsx_f64, b_.lsx_f64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -4401,6 +4747,8 @@ simde_mm_mul_sd (simde__m128d a, simde__
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ float64x2_t temp = vmulq_f64(a_.neon_f64, b_.neon_f64);
+ r_.neon_f64 = vsetq_lane_f64(vgetq_lane(a_.neon_f64, 1), temp, 1);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfmul_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #else
+ r_.f64[0] = a_.f64[0] * b_.f64[0];
+ r_.f64[1] = a_.f64[1];
+@@ -4466,6 +4814,8 @@ simde_mm_mulhi_epi16 (simde__m128i a, si
+ const v128_t lo = wasm_i32x4_extmul_low_i16x8(a_.wasm_v128, b_.wasm_v128);
+ const v128_t hi = wasm_i32x4_extmul_high_i16x8(a_.wasm_v128, b_.wasm_v128);
+ r_.wasm_v128 = wasm_i16x8_shuffle(lo, hi, 1, 3, 5, 7, 9, 11, 13, 15);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmuh_h(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+@@ -4509,6 +4859,8 @@ simde_mm_mulhi_epu16 (simde__m128i a, si
+ const v128_t lo = wasm_u32x4_extmul_low_u16x8(a_.wasm_v128, b_.wasm_v128);
+ const v128_t hi = wasm_u32x4_extmul_high_u16x8(a_.wasm_v128, b_.wasm_v128);
+ r_.wasm_v128 = wasm_i16x8_shuffle(lo, hi, 1, 3, 5, 7, 9, 11, 13, 15);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmuh_hu(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) {
+@@ -4542,6 +4894,8 @@ simde_mm_mullo_epi16 (simde__m128i a, si
+ r_.altivec_i16 = vec_mul(a_.altivec_i16, b_.altivec_i16);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_mul(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vmul_h(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+@@ -4573,6 +4927,8 @@ simde_mm_or_pd (simde__m128d a, simde__m
+ r_.wasm_v128 = wasm_v128_or(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i64 = vorrq_s64(a_.neon_i64, b_.neon_i64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vor_v(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i32f) / sizeof(r_.i32f[0])) ; i++) {
+@@ -4604,6 +4960,8 @@ simde_mm_or_si128 (simde__m128i a, simde
+ r_.altivec_i32 = vec_or(a_.altivec_i32, b_.altivec_i32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_or(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vor_v(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32f = a_.i32f | b_.i32f;
+ #else
+@@ -4639,6 +4997,8 @@ simde_mm_packs_epi16 (simde__m128i a, si
+ r_.altivec_i8 = vec_packs(a_.altivec_i16, b_.altivec_i16);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i8x16_narrow_i16x8(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vssrarni_b_h(b_.lsx_i64, a_.lsx_i64, 0);
+ #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+ int16_t SIMDE_VECTOR(32) v = SIMDE_SHUFFLE_VECTOR_(16, 32, a_.i16, b_.i16, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+ const int16_t SIMDE_VECTOR(32) min = { INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN, INT8_MIN };
+@@ -4688,6 +5048,8 @@ simde_mm_packs_epi32 (simde__m128i a, si
+ r_.sse_m128i = _mm_packs_epi32(a_.sse_m128i, b_.sse_m128i);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_narrow_i32x4(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vssrarni_h_w(b_.lsx_i64, a_.lsx_i64, 0);
+ #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector)
+ int32_t SIMDE_VECTOR(32) v = SIMDE_SHUFFLE_VECTOR_(32, 32, a_.i32, b_.i32, 0, 1, 2, 3, 4, 5, 6, 7);
+ const int32_t SIMDE_VECTOR(32) min = { INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN };
+@@ -4743,6 +5105,8 @@ simde_mm_packus_epi16 (simde__m128i a, s
+ r_.altivec_u8 = vec_packsu(a_.altivec_i16, b_.altivec_i16);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_u8x16_narrow_i16x8(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vssrarni_bu_h(b_.lsx_i64, a_.lsx_i64, 0);
+ #elif defined(SIMDE_CONVERT_VECTOR_) && HEDLEY_HAS_BUILTIN(__builtin_shufflevector) && defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR)
+ int16_t v SIMDE_VECTOR(32) = SIMDE_SHUFFLE_VECTOR_(16, 32, a_.i16, b_.i16, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
+
+@@ -4786,6 +5150,8 @@ simde_mm_pause (void) {
+ __asm__ __volatile__ ("or 27,27,27" ::: "memory");
+ #elif defined(SIMDE_ARCH_WASM)
+ __asm__ __volatile__ ("nop");
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __asm__ __volatile ("dbar 0");
+ #elif defined(HEDLEY_GCC_VERSION)
+ #if defined(SIMDE_ARCH_RISCV)
+ __builtin_riscv_pause();
+@@ -4814,6 +5180,19 @@ simde_mm_sad_epu8 (simde__m128i a, simde
+ r_.neon_u64 = vcombine_u64(
+ vpaddl_u32(vpaddl_u16(vget_low_u16(t))),
+ vpaddl_u32(vpaddl_u16(vget_high_u16(t))));
++ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
++ v128_t tmp = wasm_v128_or(wasm_u8x16_sub_sat(a_.wasm_v128, b_.wasm_v128),
++ wasm_u8x16_sub_sat(b_.wasm_v128, a_.wasm_v128));
++ tmp = wasm_i16x8_add(wasm_u16x8_shr(tmp, 8),
++ wasm_v128_and(tmp, wasm_i16x8_splat(0x00FF)));
++ tmp = wasm_i16x8_add(tmp, wasm_i32x4_shl(tmp, 16));
++ tmp = wasm_i16x8_add(tmp, wasm_i64x2_shl(tmp, 32));
++ r_.wasm_v128 = wasm_u64x2_shr(tmp, 48);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __m128i temp = __lsx_vabsd_bu(a_.lsx_i64, b_.lsx_i64);
++ temp = __lsx_vhaddw_hu_bu(temp, temp);
++ temp = __lsx_vhaddw_wu_hu(temp, temp);
++ r_.lsx_i64 = __lsx_vhaddw_du_wu(temp, temp);
+ #else
+ for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) {
+ uint16_t tmp = 0;
+@@ -4858,6 +5237,13 @@ simde_mm_set_epi8 (int8_t e15, int8_t e1
+ e8, e9, e10, e11,
+ e12, e13, e14, e15};
+ r_.neon_i8 = vld1q_s8(data);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_LIKE_16(v16i8) int8_t data[16] = {
++ e0, e1, e2, e3,
++ e4, e5, e6, e7,
++ e8, e9, e10, e11,
++ e12, e13, e14, e15};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.i8[ 0] = e0;
+ r_.i8[ 1] = e1;
+@@ -4898,6 +5284,9 @@ simde_mm_set_epi16 (int16_t e7, int16_t
+ r_.neon_i16 = vld1q_s16(data);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_make(e0, e1, e2, e3, e4, e5, e6, e7);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_LIKE_16(v8i16) int16_t data[8] = {e0, e1, e2, e3, e4, e5, e6, e7};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.i16[0] = e0;
+ r_.i16[1] = e1;
+@@ -4924,6 +5313,8 @@ simde_mm_loadu_si16 (void const* mem_add
+ HEDLEY_INTEL_VERSION_CHECK(20,21,1) || \
+ HEDLEY_GCC_VERSION_CHECK(12,1,0))
+ return _mm_loadu_si16(mem_addr);
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ return __lsx_vld(mem_addr, 0);
+ #else
+ int16_t val;
+ simde_memcpy(&val, mem_addr, sizeof(val));
+@@ -4947,6 +5338,9 @@ simde_mm_set_epi32 (int32_t e3, int32_t
+ r_.neon_i32 = vld1q_s32(data);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_make(e0, e1, e2, e3);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_LIKE_16(v4i32) int32_t data[4] = {e0, e1, e2, e3};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.i32[0] = e0;
+ r_.i32[1] = e1;
+@@ -4975,6 +5369,8 @@ simde_mm_loadu_si32 (void const* mem_add
+ simde__m128i_private r_;
+ r_.neon_i32 = vsetq_lane_s32(* HEDLEY_REINTERPRET_CAST(const int32_t *, mem_addr), vdupq_n_s32(0), 0);
+ return simde__m128i_from_private(r_);
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ return __lsx_vld(mem_addr, 0);
+ #else
+ int32_t val;
+ simde_memcpy(&val, mem_addr, sizeof(val));
+@@ -4995,6 +5391,9 @@ simde_mm_set_epi64 (simde__m64 e1, simde
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_i64 = vcombine_s64(simde__m64_to_neon_i64(e0), simde__m64_to_neon_i64(e1));
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_TO_16 simde__m64 data[2] = {e0, e1};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.m64[0] = e0;
+ r_.m64[1] = e1;
+@@ -5020,6 +5419,9 @@ simde_mm_set_epi64x (int64_t e1, int64_t
+ r_.neon_i64 = vld1q_s64(data);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_make(e0, e1);
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_LIKE_16(v2i64) int64_t data[2] = {e0, e1};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.i64[0] = e0;
+ r_.i64[1] = e1;
+@@ -5040,6 +5442,8 @@ simde_mm_loadu_si64 (void const* mem_add
+ HEDLEY_GCC_VERSION_CHECK(11,0,0) || \
+ HEDLEY_INTEL_VERSION_CHECK(20,21,1))
+ return _mm_loadu_si64(mem_addr);
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ return __lsx_vld(mem_addr, 0);
+ #else
+ int64_t val;
+ simde_memcpy(&val, mem_addr, sizeof(val));
+@@ -5074,6 +5478,13 @@ simde_x_mm_set_epu8 (uint8_t e15, uint8_
+ r_.neon_u8 = vld1q_u8(data);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_u8x16_make(e0, e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e13, e14, e15);
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_LIKE_16(v16u8) uint8_t data[16] = {
++ e0, e1, e2, e3,
++ e4, e5, e6, e7,
++ e8, e9, e10, e11,
++ e12, e13, e14, e15};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.u8[ 0] = e0; r_.u8[ 1] = e1; r_.u8[ 2] = e2; r_.u8[ 3] = e3;
+ r_.u8[ 4] = e4; r_.u8[ 5] = e5; r_.u8[ 6] = e6; r_.u8[ 7] = e7;
+@@ -5101,6 +5512,9 @@ simde_x_mm_set_epu16 (uint16_t e7, uint1
+ r_.neon_u16 = vld1q_u16(data);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_u16x8_make(e0, e1, e2, e3, e4, e5, e6, e7);
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_LIKE_16(v8u16) uint16_t data[8] = {e0, e1, e2, e3, e4, e5, e6, e7};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.u16[0] = e0; r_.u16[1] = e1; r_.u16[2] = e2; r_.u16[3] = e3;
+ r_.u16[4] = e4; r_.u16[5] = e5; r_.u16[6] = e6; r_.u16[7] = e7;
+@@ -5124,6 +5538,9 @@ simde_x_mm_set_epu32 (uint32_t e3, uint3
+ r_.neon_u32 = vld1q_u32(data);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_u32x4_make(e0, e1, e2, e3);
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_LIKE_16(v4u32) uint32_t data[4] = {e0, e1, e2, e3};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.u32[0] = e0;
+ r_.u32[1] = e1;
+@@ -5148,6 +5565,9 @@ simde_x_mm_set_epu64x (uint64_t e1, uint
+ r_.neon_u64 = vld1q_u64(data);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_u64x2_make(e0, e1);
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ SIMDE_ALIGN_LIKE_16(v2u64) uint64_t data[2] = {e0, e1};
++ r_.lsx_i64 = __lsx_vld(data, 0);
+ #else
+ r_.u64[0] = e0;
+ r_.u64[1] = e1;
+@@ -5166,6 +5586,8 @@ simde_mm_set_sd (simde_float64 a) {
+ return vsetq_lane_f64(a, vdupq_n_f64(SIMDE_FLOAT64_C(0.0)), 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return simde__m128d_from_wasm_v128(wasm_f64x2_make(a, 0));
++ #elif defined(SIMD_LOONGARCH_LSX_NATIVE)
++ return (__m128d)__lsx_vinsgr2vr_d(__lsx_vldrepl_d(&a, 0), 0, 1);
+ #else
+ return simde_mm_set_pd(SIMDE_FLOAT64_C(0.0), a);
+ #endif
+@@ -5188,6 +5610,8 @@ simde_mm_set1_epi8 (int8_t a) {
+ r_.wasm_v128 = wasm_i8x16_splat(a);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i8 = vec_splats(HEDLEY_STATIC_CAST(signed char, a));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vreplgr2vr_b(a);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i8) / sizeof(r_.i8[0])) ; i++) {
+@@ -5216,6 +5640,8 @@ simde_mm_set1_epi16 (int16_t a) {
+ r_.wasm_v128 = wasm_i16x8_splat(a);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i16 = vec_splats(HEDLEY_STATIC_CAST(signed short, a));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vreplgr2vr_h(a);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+@@ -5244,6 +5670,8 @@ simde_mm_set1_epi32 (int32_t a) {
+ r_.wasm_v128 = wasm_i32x4_splat(a);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i32 = vec_splats(HEDLEY_STATIC_CAST(signed int, a));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vreplgr2vr_w(a);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) {
+@@ -5272,6 +5700,8 @@ simde_mm_set1_epi64x (int64_t a) {
+ r_.wasm_v128 = wasm_i64x2_splat(a);
+ #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_i64 = vec_splats(HEDLEY_STATIC_CAST(signed long long, a));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vreplgr2vr_d(a);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i64) / sizeof(r_.i64[0])) ; i++) {
+@@ -5507,6 +5937,8 @@ simde_mm_shuffle_epi32 (simde__m128i a,
+ }
+ #if defined(SIMDE_X86_SSE2_NATIVE)
+ #define simde_mm_shuffle_epi32(a, imm8) _mm_shuffle_epi32((a), (imm8))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_shuffle_epi32(a, imm8) (__lsx_vshuf4i_w(simde__m128i_to_private(a).lsx_i64, (imm8)))
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ #define simde_mm_shuffle_epi32(a, imm8) (__extension__ ({ \
+ const simde__m128i_private simde_tmp_a_ = simde__m128i_to_private(a); \
+@@ -5561,6 +5993,21 @@ simde_mm_shuffle_pd (simde__m128d a, sim
+ }
+ #if defined(SIMDE_X86_SSE2_NATIVE) && !defined(__PGI)
+ #define simde_mm_shuffle_pd(a, b, imm8) _mm_shuffle_pd((a), (b), (imm8))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_shuffle_pd(a, b, imm8) \
++ ({ \
++ simde__m128d res; \
++ if ((imm8) & 0x01) { \
++ res = (simde__m128d)__lsx_vshuf4i_d(simde__m128d_to_private(a).lsx_i64, simde__m128d_to_private(b).lsx_i64, 0b1001); \
++ } else if ((imm8) & 0x02) { \
++ res = (simde__m128d)__lsx_vshuf4i_d(simde__m128d_to_private(a).lsx_i64, simde__m128d_to_private(b).lsx_i64, 0b1100); \
++ } else if ((imm8) & 0x03) { \
++ res = (simde__m128d)__lsx_vshuf4i_d(simde__m128d_to_private(a).lsx_i64, simde__m128d_to_private(b).lsx_i64, 0b1101); \
++ } else { \
++ res = (simde__m128d)__lsx_vshuf4i_d(simde__m128d_to_private(a).lsx_i64, simde__m128d_to_private(b).lsx_i64, 0b1000); \
++ } \
++ res; \
++ })
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ #define simde_mm_shuffle_pd(a, b, imm8) (__extension__ ({ \
+ simde__m128d_from_private((simde__m128d_private) { .f64 = \
+@@ -5594,6 +6041,9 @@ simde_mm_shufflehi_epi16 (simde__m128i a
+ }
+ #if defined(SIMDE_X86_SSE2_NATIVE)
+ #define simde_mm_shufflehi_epi16(a, imm8) _mm_shufflehi_epi16((a), (imm8))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_shufflehi_epi16(a, imm8) \
++ ((simde__m128i)__lsx_vextrins_d(__lsx_vshuf4i_h(simde__m128i_to_private(a).lsx_i64, imm8), simde__m128i_to_private(a).lsx_i64, 0x00))
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE) && defined(SIMDE_STATEMENT_EXPR_)
+ #define simde_mm_shufflehi_epi16(a, imm8) \
+ (__extension__ ({ \
+@@ -5654,6 +6104,9 @@ simde_mm_shufflelo_epi16 (simde__m128i a
+ }
+ #if defined(SIMDE_X86_SSE2_NATIVE)
+ #define simde_mm_shufflelo_epi16(a, imm8) _mm_shufflelo_epi16((a), (imm8))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_shufflelo_epi16(a, imm8) \
++ ((simde__m128i)__lsx_vextrins_d(__lsx_vshuf4i_h(simde__m128i_to_private(a).lsx_i64, imm8), simde__m128i_to_private(a).lsx_i64, 0b00010001))
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ #define simde_mm_shufflelo_epi16(a, imm8) \
+ simde__m128i_from_wasm_v128( \
+@@ -5711,6 +6164,8 @@ simde_mm_sll_epi16 (simde__m128i a, simd
+ r_.u16 = (a_.u16 << count_.u64[0]);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_u16 = vshlq_u16(a_.neon_u16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, count_.u64[0])));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vslli_h(a_.lsx_i64, count_.u64[0]);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = ((wasm_i64x2_extract_lane(count_.wasm_v128, 0) < 16) ? wasm_i16x8_shl(a_.wasm_v128, HEDLEY_STATIC_CAST(int32_t, wasm_i64x2_extract_lane(count_.wasm_v128, 0))) : wasm_i16x8_const(0,0,0,0,0,0,0,0));
+ #else
+@@ -5745,6 +6200,8 @@ simde_mm_sll_epi32 (simde__m128i a, simd
+ r_.u32 = (a_.u32 << count_.u64[0]);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_u32 = vshlq_u32(a_.neon_u32, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, count_.u64[0])));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vslli_w(a_.lsx_i64, count_.u64[0]);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = ((wasm_i64x2_extract_lane(count_.wasm_v128, 0) < 32) ? wasm_i32x4_shl(a_.wasm_v128, HEDLEY_STATIC_CAST(int32_t, wasm_i64x2_extract_lane(count_.wasm_v128, 0))) : wasm_i32x4_const(0,0,0,0));
+ #else
+@@ -5780,6 +6237,8 @@ simde_mm_sll_epi64 (simde__m128i a, simd
+ r_.neon_u64 = vshlq_u64(a_.neon_u64, vdupq_n_s64(HEDLEY_STATIC_CAST(int64_t, s)));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = (s < 64) ? wasm_i64x2_shl(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, s)) : wasm_i64x2_const(0,0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsll_d(a_.lsx_i64, __lsx_vreplgr2vr_d(HEDLEY_STATIC_CAST(int64_t, s)));
+ #else
+ #if !defined(SIMDE_BUG_GCC_94488)
+ SIMDE_VECTORIZE
+@@ -5812,6 +6271,8 @@ simde_mm_sqrt_pd (simde__m128d a) {
+ r_.wasm_v128 = wasm_f64x2_sqrt(a_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE)
+ r_.altivec_f64 = vec_sqrt(a_.altivec_f64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vfsqrt_d(a_.lsx_f64);
+ #elif defined(simde_math_sqrt)
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -5843,7 +6304,9 @@ simde_mm_sqrt_sd (simde__m128d a, simde_
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+
+- #if defined(simde_math_sqrt)
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfsqrt_d(b_.lsx_f64), 0);
++ #elif defined(simde_math_sqrt)
+ r_.f64[0] = simde_math_sqrt(b_.f64[0]);
+ r_.f64[1] = a_.f64[1];
+ #else
+@@ -5872,6 +6335,8 @@ simde_mm_srl_epi16 (simde__m128i a, simd
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_u16 = vshlq_u16(a_.neon_u16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, -cnt)));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsrl_h(a_.lsx_i64, __lsx_vreplgr2vr_h(cnt));
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.u16) / sizeof(r_.u16[0])) ; i++) {
+@@ -5903,6 +6368,8 @@ simde_mm_srl_epi32 (simde__m128i a, simd
+ r_.neon_u32 = vshlq_u32(a_.neon_u32, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, -cnt)));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_u32x4_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsrl_w(a_.lsx_i64, __lsx_vreplgr2vr_w(cnt));
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) {
+@@ -5934,6 +6401,8 @@ simde_mm_srl_epi64 (simde__m128i a, simd
+ r_.neon_u64 = vshlq_u64(a_.neon_u64, vdupq_n_s64(HEDLEY_STATIC_CAST(int64_t, -cnt)));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_u64x2_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsrl_d(a_.lsx_i64, __lsx_vreplgr2vr_d(cnt));
+ #else
+ #if !defined(SIMDE_BUG_GCC_94488)
+ SIMDE_VECTORIZE
+@@ -5965,6 +6434,8 @@ simde_mm_srai_epi16 (simde__m128i a, con
+ r_.neon_i16 = vshlq_s16(a_.neon_i16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, -cnt)));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsra_h(a_.lsx_i64, __lsx_vreplgr2vr_h(cnt));
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i16[0])) ; i++) {
+@@ -5996,6 +6467,8 @@ simde_mm_srai_epi32 (simde__m128i a, con
+ r_.neon_i32 = vshlq_s32(a_.neon_i32, vdupq_n_s32(-cnt));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsra_w(a_.lsx_i64, __lsx_vreplgr2vr_w(cnt));
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i32[0])) ; i++) {
+@@ -6029,6 +6502,8 @@ simde_mm_sra_epi16 (simde__m128i a, simd
+ r_.neon_i16 = vshlq_s16(a_.neon_i16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, -cnt)));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsra_h(a_.lsx_i64, __lsx_vreplgr2vr_h(cnt));
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i16) / sizeof(r_.i16[0])) ; i++) {
+@@ -6060,6 +6535,8 @@ simde_mm_sra_epi32 (simde__m128i a, simd
+ r_.neon_i32 = vshlq_s32(a_.neon_i32, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, -cnt)));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_shr(a_.wasm_v128, HEDLEY_STATIC_CAST(uint32_t, cnt));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsra_w(a_.lsx_i64, __lsx_vreplgr2vr_w(cnt));
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.i32) / sizeof(r_.i32[0])) ; i++) {
+@@ -6114,6 +6591,8 @@ simde_mm_slli_epi16 (simde__m128i a, con
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ #define simde_mm_slli_epi16(a, imm8) \
+ ((imm8 & ~15) ? simde_mm_setzero_si128() : simde__m128i_from_altivec_i16(vec_sl(simde__m128i_to_altivec_i16(a), vec_splat_u16(HEDLEY_STATIC_CAST(unsigned short, imm8)))))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_slli_epi16(a, imm8) ((imm8 & ~15) ? simde_mm_setzero_si128() : simde__m128i_from_lsx_i64(__lsx_vslli_h(simde__m128i_to_private(a).lsx_i64, ((imm8) & 15))))
+ #endif
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+ #define _mm_slli_epi16(a, imm8) simde_mm_slli_epi16(a, imm8)
+@@ -6169,6 +6648,8 @@ simde_mm_slli_epi32 (simde__m128i a, con
+ } \
+ ret; \
+ }))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_slli_epi32(a, imm8) ((imm8 & ~31) ? simde_mm_setzero_si128() : simde__m128i_from_lsx_i64(__lsx_vslli_w(simde__m128i_to_private(a).lsx_i64, ((imm8) & 31))))
+ #endif
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+ #define _mm_slli_epi32(a, imm8) simde_mm_slli_epi32(a, imm8)
+@@ -6209,6 +6690,8 @@ simde_mm_slli_epi64 (simde__m128i a, con
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ #define simde_mm_slli_epi64(a, imm8) \
+ ((imm8 < 64) ? wasm_i64x2_shl(simde__m128i_to_private(a).wasm_v128, imm8) : wasm_i64x2_const(0,0))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_slli_epi64(a, imm8) ((imm8 & ~63) ? simde_mm_setzero_si128() : simde__m128i_from_lsx_i64(__lsx_vslli_d(simde__m128i_to_private(a).lsx_i64, ((imm8) & 63))))
+ #endif
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+ #define _mm_slli_epi64(a, imm8) simde_mm_slli_epi64(a, imm8)
+@@ -6252,6 +6735,8 @@ simde_mm_srli_epi16 (simde__m128i a, con
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ #define simde_mm_srli_epi16(a, imm8) \
+ ((imm8 & ~15) ? simde_mm_setzero_si128() : simde__m128i_from_altivec_i16(vec_sr(simde__m128i_to_altivec_i16(a), vec_splat_u16(HEDLEY_STATIC_CAST(unsigned short, imm8)))))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_srli_epi16(a, imm8) ((imm8 & ~15) ? simde_mm_setzero_si128() : simde__m128i_from_lsx_i64(__lsx_vsrli_h(simde__m128i_to_private(a).lsx_i64, ((imm8) & 15))))
+ #endif
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+ #define _mm_srli_epi16(a, imm8) simde_mm_srli_epi16(a, imm8)
+@@ -6307,6 +6792,8 @@ simde_mm_srli_epi32 (simde__m128i a, con
+ } \
+ ret; \
+ }))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_srli_epi32(a, imm8) ((imm8 & ~31) ? simde_mm_setzero_si128() : simde__m128i_from_lsx_i64(__lsx_vsrli_w(simde__m128i_to_private(a).lsx_i64, ((imm8) & 31))))
+ #endif
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+ #define _mm_srli_epi32(a, imm8) simde_mm_srli_epi32(a, imm8)
+@@ -6351,6 +6838,8 @@ simde_mm_srli_epi64 (simde__m128i a, con
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ #define simde_mm_srli_epi64(a, imm8) \
+ ((imm8 < 64) ? wasm_u64x2_shr(simde__m128i_to_private(a).wasm_v128, imm8) : wasm_i64x2_const(0,0))
++#elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ #define simde_mm_srli_epi64(a, imm8) ((imm8 & ~63) ? simde_mm_setzero_si128() : simde__m128i_from_lsx_i64(__lsx_vsrli_d(simde__m128i_to_private(a).lsx_i64, ((imm8) & 63))))
+ #endif
+ #if defined(SIMDE_X86_SSE2_ENABLE_NATIVE_ALIASES)
+ #define _mm_srli_epi64(a, imm8) simde_mm_srli_epi64(a, imm8)
+@@ -6365,6 +6854,8 @@ simde_mm_store_pd (simde_float64 mem_add
+ vst1q_f64(mem_addr, simde__m128d_to_private(a).neon_f64);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ vst1q_s64(HEDLEY_REINTERPRET_CAST(int64_t*, mem_addr), simde__m128d_to_private(a).neon_i64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vst(simde__m128d_to_private(a).lsx_i64, mem_addr, 0);
+ #else
+ simde_memcpy(SIMDE_ALIGN_ASSUME_LIKE(mem_addr, simde__m128d), &a, sizeof(a));
+ #endif
+@@ -6383,6 +6874,8 @@ simde_mm_store1_pd (simde_float64 mem_ad
+
+ #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ vst1q_f64(mem_addr, vdupq_laneq_f64(a_.neon_f64, 0));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vst(__lsx_vilvl_d(a_.lsx_i64, a_.lsx_i64), mem_addr, 0);
+ #else
+ mem_addr[0] = a_.f64[0];
+ mem_addr[1] = a_.f64[0];
+@@ -6411,6 +6904,8 @@ simde_mm_store_sd (simde_float64* mem_ad
+ simde_memcpy(HEDLEY_REINTERPRET_CAST(int64_t*, mem_addr), &v, sizeof(v));
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ wasm_v128_store64_lane(HEDLEY_REINTERPRET_CAST(void*, mem_addr), a_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vstelm_d(a_.lsx_i64, mem_addr, 0, 0);
+ #else
+ simde_float64 v = a_.f64[0];
+ simde_memcpy(mem_addr, &v, sizeof(simde_float64));
+@@ -6431,6 +6926,8 @@ simde_mm_store_si128 (simde__m128i* mem_
+
+ #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ vst1q_s32(HEDLEY_REINTERPRET_CAST(int32_t*, mem_addr), a_.neon_i32);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vst(a_.lsx_i64, mem_addr, 0);
+ #else
+ simde_memcpy(SIMDE_ALIGN_ASSUME_LIKE(mem_addr, simde__m128i), &a_, sizeof(a_));
+ #endif
+@@ -6452,6 +6949,8 @@ void
+ *mem_addr = vgetq_lane_f64(a_.neon_f64, 1);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ wasm_v128_store64_lane(HEDLEY_REINTERPRET_CAST(void*, mem_addr), a_.wasm_v128, 1);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vstelm_d(a_.lsx_i64, mem_addr, 0, 1);
+ #else
+ *mem_addr = a_.f64[1];
+ #endif
+@@ -6466,6 +6965,8 @@ void
+ simde_mm_storel_epi64 (simde__m128i* mem_addr, simde__m128i a) {
+ #if defined(SIMDE_X86_SSE2_NATIVE)
+ _mm_storel_epi64(HEDLEY_STATIC_CAST(__m128i*, mem_addr), a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vstelm_d(simde__m128i_to_private(a).lsx_i64, mem_addr, 0, 0);
+ #else
+ simde__m128i_private a_ = simde__m128i_to_private(a);
+ int64_t tmp;
+@@ -6498,6 +6999,8 @@ simde_mm_storel_pd (simde_float64* mem_a
+ _mm_storel_pd(mem_addr, a);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ wasm_v128_store64_lane(HEDLEY_REINTERPRET_CAST(void*, mem_addr), simde__m128d_to_wasm_v128(a), 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vstelm_d(simde__m128d_to_private(a).lsx_f64, mem_addr, 0, 0);
+ #else
+ simde__m128d_private a_ = simde__m128d_to_private(a);
+
+@@ -6530,6 +7033,9 @@ simde_mm_storer_pd (simde_float64 mem_ad
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ a_.f64 = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.f64, a_.f64, 1, 0);
+ simde_mm_store_pd(mem_addr, simde__m128d_from_private(a_));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __m128i temp = __lsx_vshuf4i_d(a_.lsx_i64, a_.lsx_i64, 0b0001);
++ __lsx_vst(temp, mem_addr, 0);
+ #else
+ mem_addr[0] = a_.f64[1];
+ mem_addr[1] = a_.f64[0];
+@@ -6547,6 +7053,8 @@ simde_mm_storeu_pd (simde_float64* mem_a
+ _mm_storeu_pd(mem_addr, a);
+ #elif defined(SIMDE_ARM_NEON_A64V8_NATIVE)
+ vst1q_f64(mem_addr, simde__m128d_to_private(a).neon_f64);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vst(simde__m128d_to_private(a).lsx_f64, mem_addr, 0);
+ #else
+ simde_memcpy(mem_addr, &a, sizeof(a));
+ #endif
+@@ -6560,6 +7068,8 @@ void
+ simde_mm_storeu_si128 (void* mem_addr, simde__m128i a) {
+ #if defined(SIMDE_X86_SSE2_NATIVE)
+ _mm_storeu_si128(HEDLEY_STATIC_CAST(__m128i*, mem_addr), a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vst(simde__m128i_to_private(a).lsx_i64, mem_addr, 0);
+ #else
+ simde_memcpy(mem_addr, &a, sizeof(a));
+ #endif
+@@ -6576,6 +7086,8 @@ simde_mm_storeu_si16 (void* mem_addr, si
+ HEDLEY_GCC_VERSION_CHECK(11,0,0) || \
+ HEDLEY_INTEL_VERSION_CHECK(20,21,1))
+ _mm_storeu_si16(mem_addr, a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vstelm_h(simde__m128i_to_private(a).lsx_i64, mem_addr, 0, 0);
+ #else
+ int16_t val = simde_x_mm_cvtsi128_si16(a);
+ simde_memcpy(mem_addr, &val, sizeof(val));
+@@ -6595,6 +7107,8 @@ simde_mm_storeu_si32 (void* mem_addr, si
+ _mm_storeu_si32(mem_addr, a);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ wasm_v128_store32_lane(mem_addr, simde__m128i_to_wasm_v128(a), 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vstelm_w(simde__m128i_to_private(a).lsx_i64, mem_addr, 0, 0);
+ #else
+ int32_t val = simde_mm_cvtsi128_si32(a);
+ simde_memcpy(mem_addr, &val, sizeof(val));
+@@ -6612,6 +7126,8 @@ simde_mm_storeu_si64 (void* mem_addr, si
+ HEDLEY_GCC_VERSION_CHECK(11,0,0) || \
+ HEDLEY_INTEL_VERSION_CHECK(20,21,1))
+ _mm_storeu_si64(mem_addr, a);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ __lsx_vstelm_d(simde__m128i_to_private(a).lsx_i64, mem_addr, 0, 0);
+ #else
+ int64_t val = simde_mm_cvtsi128_si64(a);
+ simde_memcpy(mem_addr, &val, sizeof(val));
+@@ -6629,7 +7145,7 @@ simde_mm_stream_pd (simde_float64 mem_ad
+ #elif HEDLEY_HAS_BUILTIN(__builtin_nontemporal_store) && ( \
+ defined(SIMDE_VECTOR_SUBSCRIPT) || defined(SIMDE_ARM_NEON_A64V8_NATIVE) || \
+ defined(SIMDE_WASM_SIMD128_NATIVE) || defined(SIMDE_POWER_ALTIVEC_P7_NATIVE) || \
+- defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE))
++ defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) || defined(SIMDE_LOONGARCH_LSX_NATIVE))
+ __builtin_nontemporal_store(a, SIMDE_ALIGN_CAST(__typeof__(a)*, mem_addr));
+ #else
+ simde_mm_store_pd(mem_addr, a);
+@@ -6647,7 +7163,7 @@ simde_mm_stream_si128 (simde__m128i* mem
+ #elif HEDLEY_HAS_BUILTIN(__builtin_nontemporal_store) && ( \
+ defined(SIMDE_VECTOR_SUBSCRIPT) || defined(SIMDE_ARM_NEON_A32V7_NATIVE) || \
+ defined(SIMDE_WASM_SIMD128_NATIVE) || defined(SIMDE_POWER_ALTIVEC_P6_NATIVE) || \
+- defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE))
++ defined(SIMDE_ZARCH_ZVECTOR_13_NATIVE) || defined(SIMDE_LOONGARCH_LSX_NATIVE))
+ __builtin_nontemporal_store(a, SIMDE_ALIGN_CAST(__typeof__(a)*, mem_addr));
+ #else
+ simde_mm_store_si128(mem_addr, a);
+@@ -6666,6 +7182,8 @@ simde_mm_stream_si32 (int32_t* mem_addr,
+ __builtin_nontemporal_store(a, mem_addr);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ vst1q_lane_s32(mem_addr, vdupq_n_s32(a), 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return __lsx_vstelm_w(__lsx_vreplgr2vr_w(a), mem_addr, 0, 0);
+ #else
+ *mem_addr = a;
+ #endif
+@@ -6683,6 +7201,8 @@ simde_mm_stream_si64 (int64_t* mem_addr,
+ __builtin_nontemporal_store(a, mem_addr);
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ vst1_s64(mem_addr, vdup_n_s64(a));
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ return __lsx_vstelm_d(__lsx_vreplgr2vr_d(a), mem_addr, 0, 0);
+ #else
+ *mem_addr = a;
+ #endif
+@@ -6708,6 +7228,8 @@ simde_mm_sub_epi8 (simde__m128i a, simde
+ r_.neon_i8 = vsubq_s8(a_.neon_i8, b_.neon_i8);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i8x16_sub(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsub_b(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i8 = a_.i8 - b_.i8;
+ #else
+@@ -6739,6 +7261,8 @@ simde_mm_sub_epi16 (simde__m128i a, simd
+ r_.neon_i16 = vsubq_s16(a_.neon_i16, b_.neon_i16);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_sub(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsub_h(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i16 = a_.i16 - b_.i16;
+ #else
+@@ -6770,6 +7294,8 @@ simde_mm_sub_epi32 (simde__m128i a, simd
+ r_.neon_i32 = vsubq_s32(a_.neon_i32, b_.neon_i32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_sub(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsub_w(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32 = a_.i32 - b_.i32;
+ #else
+@@ -6801,6 +7327,8 @@ simde_mm_sub_epi64 (simde__m128i a, simd
+ r_.neon_i64 = vsubq_s64(a_.neon_i64, b_.neon_i64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_sub(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsub_d(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i64 = a_.i64 - b_.i64;
+ #else
+@@ -6829,6 +7357,8 @@ simde_x_mm_sub_epu32 (simde__m128i a, si
+ r_.u32 = a_.u32 - b_.u32;
+ #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
+ r_.neon_u32 = vsubq_u32(a_.neon_u32, b_.neon_u32);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vsub_w(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.u32) / sizeof(r_.u32[0])) ; i++) {
+@@ -6856,6 +7386,8 @@ simde_mm_sub_pd (simde__m128d a, simde__
+ r_.neon_f64 = vsubq_f64(a_.neon_f64, b_.neon_f64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_sub(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_f64 = __lsx_vfsub_d(a_.lsx_f64, b_.lsx_f64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_.f64) / sizeof(r_.f64[0])) ; i++) {
+@@ -6884,10 +7416,12 @@ simde_mm_sub_sd (simde__m128d a, simde__
+ r_,
+ a_ = simde__m128d_to_private(a),
+ b_ = simde__m128d_to_private(b);
+-
+- r_.f64[0] = a_.f64[0] - b_.f64[0];
+- r_.f64[1] = a_.f64[1];
+-
++ #if defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vextrins_d(a_.lsx_i64, (__m128i)__lsx_vfsub_d(a_.lsx_f64, b_.lsx_f64), 0);
++ #else
++ r_.f64[0] = a_.f64[0] - b_.f64[0];
++ r_.f64[1] = a_.f64[1];
++ #endif
+ return simde__m128d_from_private(r_);
+ #endif
+ }
+@@ -6936,6 +7470,8 @@ simde_mm_subs_epi8 (simde__m128i a, simd
+ r_.neon_i8 = vqsubq_s8(a_.neon_i8, b_.neon_i8);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i8x16_sub_sat(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vssub_b(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i8[0])) ; i++) {
+@@ -6965,6 +7501,8 @@ simde_mm_subs_epi16 (simde__m128i a, sim
+ r_.neon_i16 = vqsubq_s16(a_.neon_i16, b_.neon_i16);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_sub_sat(a_.wasm_v128, b_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vssub_h(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.i16[0])) ; i++) {
+@@ -6996,6 +7534,8 @@ simde_mm_subs_epu8 (simde__m128i a, simd
+ r_.wasm_v128 = wasm_u8x16_sub_sat(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ r_.altivec_u8 = vec_subs(a_.altivec_u8, b_.altivec_u8);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vssub_bu(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.u8[0])) ; i++) {
+@@ -7027,6 +7567,8 @@ simde_mm_subs_epu16 (simde__m128i a, sim
+ r_.wasm_v128 = wasm_u16x8_sub_sat(a_.wasm_v128, b_.wasm_v128);
+ #elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
+ r_.altivec_u16 = vec_subs(a_.altivec_u16, b_.altivec_u16);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vssub_hu(a_.lsx_i64, b_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < (sizeof(r_) / sizeof(r_.u16[0])) ; i++) {
+@@ -7060,6 +7602,8 @@ simde_mm_ucomieq_sd (simde__m128d a, sim
+ r = !!(vgetq_lane_u64(vorrq_u64(a_or_b_nan, a_eq_b), 0) != 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) == wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r = !!__lsx_vpickve2gr_d(__lsx_vfcmp_ceq_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #elif defined(SIMDE_HAVE_FENV_H)
+ fenv_t envp;
+ int x = feholdexcept(&envp);
+@@ -7096,6 +7640,8 @@ simde_mm_ucomige_sd (simde__m128d a, sim
+ r = !!(vgetq_lane_u64(vandq_u64(a_and_b_not_nan, a_ge_b), 0) != 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) >= wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r = !!__lsx_vpickve2gr_d(__lsx_vfcmp_cle_d(b_.lsx_f64, a_.lsx_f64), 0);
+ #elif defined(SIMDE_HAVE_FENV_H)
+ fenv_t envp;
+ int x = feholdexcept(&envp);
+@@ -7132,6 +7678,8 @@ simde_mm_ucomigt_sd (simde__m128d a, sim
+ r = !!(vgetq_lane_u64(vandq_u64(a_and_b_not_nan, a_gt_b), 0) != 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) > wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r = !!__lsx_vpickve2gr_d(__lsx_vfcmp_clt_d(b_.lsx_f64, a_.lsx_f64), 0);
+ #elif defined(SIMDE_HAVE_FENV_H)
+ fenv_t envp;
+ int x = feholdexcept(&envp);
+@@ -7168,6 +7716,8 @@ simde_mm_ucomile_sd (simde__m128d a, sim
+ r = !!(vgetq_lane_u64(vorrq_u64(a_or_b_nan, a_le_b), 0) != 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) <= wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r = !!__lsx_vpickve2gr_d(__lsx_vfcmp_cle_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #elif defined(SIMDE_HAVE_FENV_H)
+ fenv_t envp;
+ int x = feholdexcept(&envp);
+@@ -7204,6 +7754,8 @@ simde_mm_ucomilt_sd (simde__m128d a, sim
+ r = !!(vgetq_lane_u64(vorrq_u64(a_or_b_nan, a_lt_b), 0) != 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) < wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r = !!__lsx_vpickve2gr_d(__lsx_vfcmp_clt_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #elif defined(SIMDE_HAVE_FENV_H)
+ fenv_t envp;
+ int x = feholdexcept(&envp);
+@@ -7240,6 +7792,8 @@ simde_mm_ucomineq_sd (simde__m128d a, si
+ r = !!(vgetq_lane_u64(vandq_u64(a_and_b_not_nan, a_neq_b), 0) != 0);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ return wasm_f64x2_extract_lane(a_.wasm_v128, 0) != wasm_f64x2_extract_lane(b_.wasm_v128, 0);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r = !!__lsx_vpickve2gr_d(__lsx_vfcmp_cune_d(a_.lsx_f64, b_.lsx_f64), 0);
+ #elif defined(SIMDE_HAVE_FENV_H)
+ fenv_t envp;
+ int x = feholdexcept(&envp);
+@@ -7303,6 +7857,8 @@ simde_mm_unpackhi_epi8 (simde__m128i a,
+ r_.neon_i8 = vcombine_s8(result.val[0], result.val[1]);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i8x16_shuffle(a_.wasm_v128, b_.wasm_v128, 8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvh_b(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.i8 = SIMDE_SHUFFLE_VECTOR_(8, 16, a_.i8, b_.i8, 8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31);
+ #else
+@@ -7340,6 +7896,8 @@ simde_mm_unpackhi_epi16 (simde__m128i a,
+ r_.neon_i16 = vcombine_s16(result.val[0], result.val[1]);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_shuffle(a_.wasm_v128, b_.wasm_v128, 4, 12, 5, 13, 6, 14, 7, 15);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvh_h(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.i16 = SIMDE_SHUFFLE_VECTOR_(16, 16, a_.i16, b_.i16, 4, 12, 5, 13, 6, 14, 7, 15);
+ #else
+@@ -7377,6 +7935,8 @@ simde_mm_unpackhi_epi32 (simde__m128i a,
+ r_.neon_i32 = vcombine_s32(result.val[0], result.val[1]);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_shuffle(a_.wasm_v128, b_.wasm_v128, 2, 6, 3, 7);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvh_w(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.i32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.i32, b_.i32, 2, 6, 3, 7);
+ #else
+@@ -7411,6 +7971,8 @@ simde_mm_unpackhi_epi64 (simde__m128i a,
+ r_.neon_i64 = vcombine_s64(a_h, b_h);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_shuffle(a_.wasm_v128, b_.wasm_v128, 1, 3);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvh_d(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.i64 = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.i64, b_.i64, 1, 3);
+ #else
+@@ -7445,6 +8007,8 @@ simde_mm_unpackhi_pd (simde__m128d a, si
+ r_.wasm_v128 = wasm_i64x2_shuffle(a_.wasm_v128, b_.wasm_v128, 1, 3);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.f64 = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.f64, b_.f64, 1, 3);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvh_d(b_.lsx_i64, a_.lsx_i64);
+ #else
+ SIMDE_VECTORIZE
+ for (size_t i = 0 ; i < ((sizeof(r_) / sizeof(r_.f64[0])) / 2) ; i++) {
+@@ -7480,6 +8044,8 @@ simde_mm_unpacklo_epi8 (simde__m128i a,
+ r_.neon_i8 = vcombine_s8(result.val[0], result.val[1]);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i8x16_shuffle(a_.wasm_v128, b_.wasm_v128, 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvl_b(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.i8 = SIMDE_SHUFFLE_VECTOR_(8, 16, a_.i8, b_.i8, 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23);
+ #else
+@@ -7517,6 +8083,8 @@ simde_mm_unpacklo_epi16 (simde__m128i a,
+ r_.neon_i16 = vcombine_s16(result.val[0], result.val[1]);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i16x8_shuffle(a_.wasm_v128, b_.wasm_v128, 0, 8, 1, 9, 2, 10, 3, 11);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvl_h(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.i16 = SIMDE_SHUFFLE_VECTOR_(16, 16, a_.i16, b_.i16, 0, 8, 1, 9, 2, 10, 3, 11);
+ #else
+@@ -7554,6 +8122,8 @@ simde_mm_unpacklo_epi32 (simde__m128i a,
+ r_.neon_i32 = vcombine_s32(result.val[0], result.val[1]);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i32x4_shuffle(a_.wasm_v128, b_.wasm_v128, 0, 4, 1, 5);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvl_w(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.i32 = SIMDE_SHUFFLE_VECTOR_(32, 16, a_.i32, b_.i32, 0, 4, 1, 5);
+ #else
+@@ -7588,6 +8158,8 @@ simde_mm_unpacklo_epi64 (simde__m128i a,
+ r_.neon_i64 = vcombine_s64(a_l, b_l);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_shuffle(a_.wasm_v128, b_.wasm_v128, 0, 2);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvl_d(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.i64 = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.i64, b_.i64, 0, 2);
+ #else
+@@ -7620,6 +8192,8 @@ simde_mm_unpacklo_pd (simde__m128d a, si
+ r_.neon_f64 = vzip1q_f64(a_.neon_f64, b_.neon_f64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_i64x2_shuffle(a_.wasm_v128, b_.wasm_v128, 0, 2);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vilvl_d(b_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_SHUFFLE_VECTOR_)
+ r_.f64 = SIMDE_SHUFFLE_VECTOR_(64, 16, a_.f64, b_.f64, 0, 2);
+ #else
+@@ -7654,6 +8228,8 @@ simde_x_mm_negate_pd(simde__m128d a) {
+ r_.neon_f64 = vnegq_f64(a_.neon_f64);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f64x2_neg(a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vneg_d(a_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_NEGATE)
+ r_.f64 = -a_.f64;
+ #else
+@@ -7684,6 +8260,8 @@ simde_mm_xor_si128 (simde__m128i a, simd
+ r_.altivec_i32 = vec_xor(a_.altivec_i32, b_.altivec_i32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_xor(b_.wasm_v128, a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vxor_v(a_.lsx_i64, b_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32f = a_.i32f ^ b_.i32f;
+ #else
+@@ -7716,6 +8294,8 @@ simde_x_mm_not_si128 (simde__m128i a) {
+ r_.altivec_i32 = vec_nor(a_.altivec_i32, a_.altivec_i32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_v128_not(a_.wasm_v128);
++ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
++ r_.lsx_i64 = __lsx_vnor_v(a_.lsx_i64, a_.lsx_i64);
+ #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
+ r_.i32f = ~a_.i32f;
+ #else
diff -Nru simde-0.8.2/debian/patches/simde-fix-type-convert-error-for-LSX.patch simde-0.8.2/debian/patches/simde-fix-type-convert-error-for-LSX.patch
--- simde-0.8.2/debian/patches/simde-fix-type-convert-error-for-LSX.patch 1970-01-01 00:00:00.000000000 +0000
+++ simde-0.8.2/debian/patches/simde-fix-type-convert-error-for-LSX.patch 2024-11-27 09:59:12.000000000 +0000
@@ -0,0 +1,55 @@
+Description: x86/sse.h: Fix type convert error for LSX.
+ .
+ simde (0.8.2-1+loong64) unreleased; urgency=medium
+ .
+ * x86/sse.h: Fix type convert error for LSX.
+ - Applied-Upstream: master, https://github.com/simd-everywhere/simde/pull/1215
+Author: Dandan Zhang <zhangdandan at loongson.cn>
+
+---
+The information above should follow the Patch Tagging Guidelines, please
+checkout https://dep.debian.net/deps/dep3/ to learn about the format. Here
+are templates for supplementary fields that you might want to add:
+
+Applied-Upstream: master, https://github.com/simd-everywhere/simde/pull/1215
+Signed-Off-By: yinshiyou
+Last-Update: 2024-11-25
+
+--- simde-0.8.2.orig/simde/x86/sse.h
++++ simde-0.8.2/simde/x86/sse.h
+@@ -629,7 +629,7 @@ simde_x_mm_round_ps (simde__m128 a, int
+ #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE)
+ r_.neon_f32 = vrndnq_f32(a_.neon_f32);
+ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
+- r_.lsx_i64 = __lsx_vfrintrne_s(a_.lsx_f32);
++ r_.lsx_f32 = __lsx_vfrintrne_s(a_.lsx_f32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f32x4_nearest(a_.wasm_v128);
+ #elif defined(simde_math_roundevenf)
+@@ -648,7 +648,7 @@ simde_x_mm_round_ps (simde__m128 a, int
+ #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE)
+ r_.neon_f32 = vrndmq_f32(a_.neon_f32);
+ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
+- r_.lsx_i64 = __lsx_vfrintrm_s(a_.lsx_f32);
++ r_.lsx_f32 = __lsx_vfrintrm_s(a_.lsx_f32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f32x4_floor(a_.wasm_v128);
+ #elif defined(simde_math_floorf)
+@@ -667,7 +667,7 @@ simde_x_mm_round_ps (simde__m128 a, int
+ #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE)
+ r_.neon_f32 = vrndpq_f32(a_.neon_f32);
+ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
+- r_.lsx_i64 = __lsx_vfrintrp_s(a_.lsx_f32);
++ r_.lsx_f32 = __lsx_vfrintrp_s(a_.lsx_f32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f32x4_ceil(a_.wasm_v128);
+ #elif defined(simde_math_ceilf)
+@@ -686,7 +686,7 @@ simde_x_mm_round_ps (simde__m128 a, int
+ #elif defined(SIMDE_ARM_NEON_A32V8_NATIVE)
+ r_.neon_f32 = vrndq_f32(a_.neon_f32);
+ #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
+- r_.lsx_i64 = __lsx_vfrintrz_s(a_.lsx_f32);
++ r_.lsx_f32 = __lsx_vfrintrz_s(a_.lsx_f32);
+ #elif defined(SIMDE_WASM_SIMD128_NATIVE)
+ r_.wasm_v128 = wasm_f32x4_trunc(a_.wasm_v128);
+ #elif defined(simde_math_truncf)
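For context on the sse.h hunks above: the __lsx_vfrintr*_s rounding intrinsics return a float vector (__m128, i.e. 4 x float), so storing the result into the v2i64 member of simde's private union is rejected by GCC, exactly as in the build log; the fix is to assign to the matching lsx_f32 member instead. Below is a minimal sketch of the failure mode and the fix — it is illustrative only, not simde code, and assumes a loongarch64 toolchain with -mlsx whose <lsxintrin.h> provides __m128 plus the v2i64/v4f32 vector typedefs shown in the error message.
```
/* Minimal sketch (not simde code): reproduces the reported error and the fix.
 * Assumes GCC's <lsxintrin.h> on loongarch64, built with -mlsx.            */
#include <lsxintrin.h>

typedef union {
  v2i64 lsx_i64;   /* integer (2 x i64) view of the 128-bit register */
  v4f32 lsx_f32;   /* float   (4 x f32) view of the same register    */
} demo_m128_private;

__m128 demo_round_nearest(__m128 a) {
  demo_m128_private r;
  /* r.lsx_i64 = __lsx_vfrintrne_s(a);  -- rejected: "incompatible types when
   *   assigning to type 'v2i64' from type '__m128'" (a 4 x float vector).   */
  r.lsx_f32 = __lsx_vfrintrne_s(a);     /* assign to the matching float member */
  return r.lsx_f32;
}
```
The other three hunks in the patch are the same one-member change for the vfrintrm/vfrintrp/vfrintrz variants.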
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simde-fix-type-convert-error-for-LSX.patch
Type: text/x-patch
Size: 2536 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/debian-med-packaging/attachments/20241128/dfbb2aa5/attachment-0003.bin>