[Pkg-opencl-devel] Bug#913141: Bug#913141: beignet: Segmentation fault while running opencv_perf_dnn

Thu Nov 8 13:37:48 GMT 2018

On 07/11/18 23:40, Rebecca N. Palmer wrote:
> That's a crash while trying to compile something.
> 
> Is this bug present in LLVM 7?  LLVM 3.9 has just been removed, so isn't 
> an option.
> 
> Do any of the tests (/usr/lib/x86_64-linux-gnu/beignet/utest_run from 
> the beignet-dev package) also crash?
> 
> Please install libllvm6.0-dbgsym and beignet-opencl-icd-dbgsym from the 
> debug symbols archive ( 
> https://www.debian.org/releases/stable/amd64/release-notes/ch-whats-new.en.html#debug-archive 
> ) and run the trace again.  If possible, also get the source it is 
> trying to compile (program->source in clBuildProgram, it will probably 
> be long enough that you need to adjust gdb's print settings).
> 

Hi Rebecca,

First of all, the tests from /usr/lib/x86_64-linux-gnu/beignet/utest_run, all run fine. 

Here is the output of an opencv_perf_dnn run with libllvm6.0-dbgsym and beignet-opencl-icd-dbgsym installed.
I added some code to dump the buildflags, source etc.

The interesting part is that the first test of the same source code (same source hash 2cdd81c1843105011ecb613a4f6f9e26) 
compiles ok. It's the second test that crashes. The difference is in the buildflags.

Working:
 -D TYPE=1 -D Dtype=float -D Dtype2=float2 -D Dtype4=float4 -D Dtype8=float8 -D Dtype16=float16 -D as_Dtype=as_float -D as_Dtype2=as_float2 -D as_Dtype4=as_float4 -D as_Dtype8=as_float8 -D KERNEL_WIDTH=3 -D KERNEL_HEIGHT=3 -D STRIDE_X=1 -D STRIDE_Y=1 -D DILATION_X=1 -D DILATION_Y=1 -D KERNEL_BASIC -cl-fast-relaxed-math -D ConvolveBasic=BASIC_k3x3_cn576_g1_s1x1_d1x1_b1_in64x48_p1x1_num1_M512_activ0_eltwise0_FP32_4_1_1_1 -D CHANNELS=576 -D APPLY_BIAS=1 -D OUTPUT_Z=512 -D ZPAR=1 -D INTEL_DEVICE

Crashing:
 -D TYPE=1 -D Dtype=float -D Dtype2=float2 -D Dtype4=float4 -D Dtype8=float8 -D Dtype16=float16 -D as_Dtype=as_float -D as_Dtype2=as_float2 -D as_Dtype4=as_float4 -D as_Dtype8=as_float8 -D KERNEL_WIDTH=3 -D KERNEL_HEIGHT=3 -D STRIDE_X=1 -D STRIDE_Y=1 -D DILATION_X=1 -D DILATION_Y=1 -D INPUT_PAD_W=1 -D INPUT_PAD_H=1 -D INPUT_PAD_RIGHT=1 -D INPUT_PAD_BOTTOM=1 -cl-fast-relaxed-math  -D GEMM_LIKE_CONV_32_1 -D Conv_Interleaved=U_GEMM_LIKE_CONV_k3x3_cn576_g1_s1x1_d1x1_b1_in64x48_p1x1_num1_M512_activ0_eltwise0_FP32_5_1_8_32_SIMD8 -cl-mad-enable -D KERNEL_GEMM_LIKE -D INPUT_DEPTH=576 -D WIDTH1=512 -D OUT_PADDING_LEFT=0 -D OUT_PADDING_HEIGHT=0 -D OUT_DEPTH=512 -D NUM_BATCHES=1 -D DY=1 -D DX=32 -D KERNEL_WIDTH_DIV2=1 -D KERNEL_SLICE_DIV2=4 -D TILE_N_LAST=0 -D TILE_N_LAST_DIV8=0 -D APPLY_BIAS=1 -D INTEL_DEVICE

Below is a dump from the complete run until the crash in the second test and a backtrace.

--
Sander

root at t440s:/mnt/scratch/src/opencv/build/bin# gdb ./opencv_perf_dnn 
GNU gdb (Debian 8.1-4+b1) 8.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./opencv_perf_dnn...(no debugging symbols found)...done.
(gdb) run
Starting program: /mnt/scratch/src/opencv/build/bin/opencv_perf_dnn 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffe0ada700 (LWP 15404)]
[New Thread 0x7fffe02d9700 (LWP 15405)]
[New Thread 0x7fffddad7700 (LWP 15406)]
[New Thread 0x7fffdd2d6700 (LWP 15407)]
[New Thread 0x7fffdcad5700 (LWP 15408)]
[New Thread 0x7fffdc6d4700 (LWP 15409)]
[New Thread 0x7fffd5ffe700 (LWP 15410)]
Time compensation is 0
[New Thread 0x7fffd53f2700 (LWP 15448)]
[Thread 0x7fffd53f2700 (LWP 15448) exited]
CTEST_FULL_OUTPUT
OpenCV version: 4.0.0-pre
OpenCV VCS version: 4.0.0-alpha-103-ga9c8a526c-dirty
Build type: Release
Compiler: /usr/bin/c++  (ver 8.2.0)
Parallel framework: tbb
CPU features: SSE SSE2 SSE3 *SSE4.1 *SSE4.2 *FP16 *AVX *AVX2 *AVX512-SKX?
Intel(R) IPP version: ippIP AVX2 (l9) 2017.0.3 (-) Jul 29 2017
OpenCL Platforms: 
    Intel Gen OCL Driver
        iGPU: Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile (OpenCL 1.2 beignet 1.3)
Current OpenCL device: 
    Type = iGPU
    Name = Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile
    Version = OpenCL 1.2 beignet 1.3
    Driver version = 1.3
    Address bits = 32
    Compute units = 20
    Max work group size = 512
    Local memory size = 64 KB
    Max memory allocation size = 1 GB 512 MB
    Double support = No
    Host unified memory = Yes
    Device extensions:
        cl_khr_global_int32_base_atomics
        cl_khr_global_int32_extended_atomics
        cl_khr_local_int32_base_atomics
        cl_khr_local_int32_extended_atomics
        cl_khr_byte_addressable_store
        cl_khr_3d_image_writes
        cl_khr_image2d_from_buffer
        cl_khr_depth_images
        cl_khr_spir
        cl_khr_icd
        cl_intel_accelerator
        cl_intel_subgroups
        cl_intel_subgroups_short
        cl_khr_gl_sharing
    Has AMD Blas = No
    Has AMD Fft = No
    Preferred vector width char = 16
    Preferred vector width short = 8
    Preferred vector width int = 4
    Preferred vector width long = 2
    Preferred vector width float = 4
    Preferred vector width double = 0
[==========] Running 363 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 300 tests from Conv
[ RUN      ] Conv.conv/0, where GetParam() = (GFLOPS=10.087, K=[3 x 3], IN={1, 576, 38, 50}, OCN=512, PM=SAME, BIAS, OCV/CPU)
IN=4275 Kb [ 1 576 38 50 ]    OUT=3800 Kb [ 1 512 38 50 ]    Weights(parameters): 10370 Kb    MFLOPS=10087
.
[ PERFSTAT ]    (samples=100   mean=120.74   median=115.05   min=106.14   stddev=12.88 (10.7%))
[       OK ] Conv.conv/0 (12788 ms)
[ RUN      ] Conv.conv/1, where GetParam() = (GFLOPS=10.087, K=[3 x 3], IN={1, 576, 38, 50}, OCN=512, PM=SAME, BIAS, OCV/OCL)
[New Thread 0x7fffd53f2700 (LWP 15686)]
OpenCV(ocl4dnn): consider to specify kernel configuration cache directory 
                 via OPENCV_OCL4DNN_CONFIG_PATH parameter.

########################################
BUILDING PROGRAM:
sourceModule_:
dnn
sourceName_:
conv_layer_spatial
sourceHash_:
2cdd81c1843105011ecb613a4f6f9e26
buildflags:
 -D TYPE=1 -D Dtype=float -D Dtype2=float2 -D Dtype4=float4 -D Dtype8=float8 -D Dtype16=float16 -D as_Dtype=as_float -D as_Dtype2=as_float2 -D as_Dtype4=as_float4 -D as_Dtype8=as_float8 -D KERNEL_WIDTH=3 -D KERNEL_HEIGHT=3 -D STRIDE_X=1 -D STRIDE_Y=1 -D DILATION_X=1 -D DILATION_Y=1 -D KERNEL_BASIC -cl-fast-relaxed-math -D ConvolveBasic=BASIC_k3x3_cn576_g1_s1x1_d1x1_b1_in64x48_p1x1_num1_M512_activ0_eltwise0_FP32_4_1_1_1 -D CHANNELS=576 -D APPLY_BIAS=1 -D OUTPUT_Z=512 -D ZPAR=1 -D INTEL_DEVICE
codeStr_:
(null)
src_->sourceAddr_:
-135676032
Source:
#if defined(cl_khr_fp16)
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#endif
#define KERNEL_ARG_DTYPE float
#define TYPE_FLOAT  1
#define TYPE_HALF   2
#if defined(FUSED_CONV_RELU)
#define ACTIVATION_RELU_FUNCTION(x, c) ((Dtype)(x) > 0 ? (Dtype)(x) : ((Dtype)(x) * (negative_slope)))
#define FUSED_ARG KERNEL_ARG_DTYPE negative_slope,
#elif defined(FUSED_CONV_PRELU)
#define ACTIVATION_RELU_FUNCTION(x, c) ((Dtype)(x) > 0 ? (Dtype)(x) : ((Dtype)(x) * (negative_slope[c])))
#define FUSED_ARG __global const KERNEL_ARG_DTYPE* negative_slope,
#elif defined(FUSED_CONV_POWER)
#define ACTIVATION_RELU_FUNCTION(x, c) pow(x, (Dtype)power)
#define FUSED_ARG KERNEL_ARG_DTYPE power,
#elif defined(FUSED_CONV_TANH)
#define ACTIVATION_RELU_FUNCTION(x, c) tanh(x)
#define FUSED_ARG
#elif defined(FUSED_CONV_RELU6)
#define ACTIVATION_RELU_FUNCTION(x, c) (clamp((Dtype)(x), (Dtype)min_value, (Dtype)max_value))
#define FUSED_ARG KERNEL_ARG_DTYPE min_value, KERNEL_ARG_DTYPE max_value,
#else
#define ACTIVATION_RELU_FUNCTION(x, c) (x)
#define FUSED_ARG
#endif
#ifdef FUSED_CONV_ELTWISE
#define ACTIVATION_FUNCTION(_dst_, _offset_, _data_, _channel_) do { (_dst_)[(_offset_)] = ACTIVATION_RELU_FUNCTION(eltwise_data[(_offset_)] + (_data_), _channel_);} while(0)
#define ELTWISE_DATA_ARG __global Dtype* eltwise_data,
#else
#define ACTIVATION_FUNCTION(_dst_, _offset_, _data_, _channel_) do { (_dst_)[(_offset_)] = ACTIVATION_RELU_FUNCTION(_data_, _channel_);} while(0)
#define ELTWISE_DATA_ARG
#endif
#if APPLY_BIAS
#define BIAS_KERNEL_ARG __global Dtype * biases_base,
#else
#define BIAS_KERNEL_ARG
#endif
#define __CAT(x, y) x##y
#define CAT(x, y) __CAT(x, y)
#define LOOP0(VAR, STMT)
#define LOOP1(VAR, STMT) (STMT); (VAR)++;
#define LOOP2(VAR, STMT) LOOP1(VAR, STMT); (STMT); (VAR)++;
#define LOOP3(VAR, STMT) LOOP2(VAR, STMT); (STMT); (VAR)++;
#define LOOP4(VAR, STMT) LOOP3(VAR, STMT); (STMT); (VAR)++;
#define LOOP5(VAR, STMT) LOOP4(VAR, STMT); (STMT); (VAR)++;
#define LOOP6(VAR, STMT) LOOP5(VAR, STMT); (STMT); (VAR)++;
#define LOOP7(VAR, STMT) LOOP6(VAR, STMT); (STMT); (VAR)++;
#define LOOP8(VAR, STMT) LOOP7(VAR, STMT); (STMT); (VAR)++;
#define LOOP9(VAR, STMT) LOOP8(VAR, STMT); (STMT); (VAR)++;
#define LOOP10(VAR, STMT) LOOP9(VAR, STMT); (STMT); (VAR)++;
#define LOOP11(VAR, STMT) LOOP10(VAR, STMT); (STMT); (VAR)++;
#define LOOP12(VAR, STMT) LOOP11(VAR, STMT); (STMT); (VAR)++;
#define LOOP13(VAR, STMT) LOOP12(VAR, STMT); (STMT); (VAR)++;
#define LOOP14(VAR, STMT) LOOP13(VAR, STMT); (STMT); (VAR)++;
#define LOOP15(VAR, STMT) LOOP14(VAR, STMT); (STMT); (VAR)++;
#define LOOP16(VAR, STMT) LOOP15(VAR, STMT); (STMT); (VAR)++;
#define LOOP(N, VAR, STMT) CAT(LOOP, N)((VAR), (STMT))
#if defined(convolve_simd) || defined(Conv_Interleaved)
#if TYPE == TYPE_HALF
#define INT_TYPE ushort
#define INT_TYPE2 ushort2
#define INT_TYPE4 ushort4
#define INT_TYPE8 ushort8
#define SUB_GROUP_BLOCK_READ2 intel_sub_group_block_read_us2
#define SUB_GROUP_BLOCK_READ4 intel_sub_group_block_read_us4
#define SUB_GROUP_BLOCK_READ8 intel_sub_group_block_read_us8
#define SUB_GROUP_BLOCK_READ intel_sub_group_block_read_us
#else
#define INT_TYPE uint
#define INT_TYPE2 uint2
#define INT_TYPE4 uint4
#define INT_TYPE8 uint8
#define SUB_GROUP_BLOCK_READ2 intel_sub_group_block_read2
#define SUB_GROUP_BLOCK_READ4 intel_sub_group_block_read4
#define SUB_GROUP_BLOCK_READ8 intel_sub_group_block_read8
#define SUB_GROUP_BLOCK_READ intel_sub_group_block_read
#endif
#endif
#ifdef KERNEL_BASIC
__kernel void ConvolveBasic(
ELTWISE_DATA_ARG
FUSED_ARG
__global Dtype* image_data,
int image_offset,
__global Dtype* kernel_data,
int kernel_offset,
__global Dtype* bias,
const int bias_offset,
__global Dtype* convolved_image_base,
const int convolved_image_base_offset,
const int convolved_image_offset,
const ushort input_width,
const ushort input_height,
const ushort output_width,
const ushort output_height,
const ushort pad_w,
const ushort pad_h
)
{
__global Dtype* convolved_image = convolved_image_base + convolved_image_base_offset;
const int outputX = get_global_id(0);
const int outputY = get_global_id(1);
const int kernelNum = get_global_id(2) * ZPAR;
if (outputX < output_width && outputY < output_height)
{
Dtype sum[ZPAR];
for (int kern = 0; kern < ZPAR; kern++)
{
sum[kern] = 0.0f;
}
const int org_y = outputY * STRIDE_Y - pad_h;
const int org_x = outputX * STRIDE_X - pad_w;
const int currentKernelOffset = kernel_offset + kernelNum*KERNEL_HEIGHT*KERNEL_WIDTH*CHANNELS;
#if APPLY_BIAS
const int biasIndex = bias_offset + kernelNum;
#endif
const int local_image_offset = org_y * input_width + org_x;
const int imageSize = input_width * input_height;
__global Dtype* image_dataPtr = (image_data + (image_offset + local_image_offset));
__global Dtype* kernel_dataPtr = (kernel_data + (currentKernelOffset));
for (int c = 0; c < CHANNELS; c++)
{
for (int y = 0; y < KERNEL_HEIGHT; y++)
{
for (int x = 0; x < KERNEL_WIDTH; x++)
{
int y_ = org_y + y * DILATION_Y;
int x_ = org_x + x * DILATION_X;
if (!(y_ >= 0 && y_ < input_height && x_ >= 0 && x_ < input_width))
{
continue;
}
for (int kern = 0; kern < ZPAR; kern++)
{
sum[kern] += image_dataPtr[x * DILATION_X] * kernel_dataPtr[kern*KERNEL_HEIGHT*KERNEL_WIDTH*CHANNELS + x];
}
}
image_dataPtr += input_width * DILATION_Y;
kernel_dataPtr += KERNEL_WIDTH;
}
image_dataPtr += imageSize - input_width*KERNEL_HEIGHT*DILATION_Y;
}
for (int kern = 0; kern < ZPAR; kern++)
{
if (kernelNum + kern < OUTPUT_Z)
{
int offset = convolved_image_offset + (kernelNum+kern)*output_height*output_width + outputY*output_width + outputX;
#if APPLY_BIAS
ACTIVATION_FUNCTION(convolved_image, offset, sum[kern] + bias[biasIndex + kern], biasIndex + kern);
#else
ACTIVATION_FUNCTION(convolved_image, offset, sum[kern], biasIndex + kern);
#endif
}
}
}
}
#elif defined KERNEL_IDLF
__attribute__((reqd_work_group_size(1, 1, SIMD_SIZE)))
__attribute__((intel_reqd_sub_group_size(SIMD_SIZE)))
__kernel void
convolve_simd(
ELTWISE_DATA_ARG
FUSED_ARG
__global Dtype* inputs,
__global Dtype* weights,
BIAS_KERNEL_ARG
__global Dtype* outputs_base,
const int outputs_offset,
const ushort input_width,
const ushort input_height,
const ushort output_width,
const ushort output_height)
{
__global Dtype* outputs = outputs_base + outputs_offset;
unsigned int oc = get_global_id(0) * OUT_BLOCK_WIDTH;
unsigned int or = get_global_id(1) * OUT_BLOCK_HEIGHT;
unsigned int fm = get_global_id(2);
unsigned int fmg = get_group_id(2);
unsigned int lid = get_local_id(2);
Dtype out[OUT_BLOCK_WIDTH * OUT_BLOCK_HEIGHT] = { 0.0f };
unsigned int weight_addr = (fmg % FILTERS_IN_GROUP) *
INPUT_DEPTH * KERNEL_WIDTH * KERNEL_HEIGHT * SIMD_SIZE + lid;
unsigned int num_in_batch = fm / ALIGNED_NUM_FILTERS;
unsigned int input_batch_offset = num_in_batch * INPUT_PITCH * TOTAL_INPUT_DEPTH_SIZE;
int curr_y = or * STRIDE_Y;
int curr_x = oc * STRIDE_X + lid;
int in_addr = input_batch_offset
+  (curr_y - INPUT_PAD_H) * INPUT_WIDTH
+   curr_x - INPUT_PAD_W;
const int in_limit = (get_global_size(2) / ALIGNED_NUM_FILTERS) * TOTAL_INPUT_DEPTH_SIZE * INPUT_PITCH - 1;
Dtype in_buf[INVEC_SIZE];
for(int kd = 0; kd < INPUT_DEPTH; kd++)
{
#if INPUT_PAD_W != 0 || INPUT_PAD_H != 0 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
const bool cx_out_of_range = !(curr_x >= INPUT_PAD_W && curr_x < INPUT_WIDTH + INPUT_PAD_W);
int in_offset = in_addr;
__attribute__((opencl_unroll_hint(INVEC_SIZE)))
for (int reg = 0; reg < INVEC_SIZE; reg++, in_offset += INPUT_WIDTH)
{
Dtype input = inputs[clamp(in_offset, 0, in_limit)];
int cy = curr_y + reg;
in_buf[reg] = (cx_out_of_range || cy < INPUT_PAD_H || cy >= INPUT_HEIGHT + INPUT_PAD_H) ? 0 : input;
}
#else
int in_offset = in_addr;
__attribute__((opencl_unroll_hint(INVEC_SIZE)))
for (int reg = 0; reg < INVEC_SIZE; reg++, in_offset += INPUT_WIDTH)
{
in_buf[reg] = inputs[min(in_offset, in_limit)];
}
#endif
in_addr += INPUT_PITCH;
#define BLOCK_IN(n, c) intel_sub_group_shuffle(in_buf[n], (c))
int kr = 0;
LOOP(KERNEL_HEIGHT, kr,
{
int kc = 0;
LOOP(KERNEL_WIDTH, kc,
{
Dtype weight_value = weights[weight_addr];
weight_addr += SIMD_SIZE;
for (int br=0; br < OUT_BLOCK_HEIGHT; br++)
{
for(int bc=0; bc < OUT_BLOCK_WIDTH; bc++)
{
Dtype input = BLOCK_IN((br * STRIDE_Y + kr * DILATION_Y), bc * STRIDE_X + kc * DILATION_X);
out[br * OUT_BLOCK_WIDTH + bc] = mad(weight_value, input, out[br * OUT_BLOCK_WIDTH + bc]);
}
}
});
});
}
fm = fm % ALIGNED_NUM_FILTERS;
#if LEFT_FILTERS > 0
if (fm < NUM_FILTERS)
#endif
{
unsigned int out_addr = (num_in_batch * TOTAL_OUTPUT_DEPTH + fm) * OUTPUT_PITCH;
out_addr += or * output_width + oc;
#if APPLY_BIAS
Dtype bias = biases_base[fm];
#else
Dtype bias = 0;
#endif
for(unsigned int r = 0; r < OUT_BLOCK_HEIGHT; r++)
{
if (r + or >= output_height) break;
for(unsigned int c = 0; c < OUT_BLOCK_WIDTH; c++)
{
if (c + oc >= output_width) break;
ACTIVATION_FUNCTION(outputs, out_addr + r * output_width + c, bias + out[r * OUT_BLOCK_WIDTH + c], fm);
}
}
}
}
#elif defined KERNEL_GEMM_LIKE
#if APPLY_BIAS
#define SUBGROUP_GET_BIAS(k, i) intel_sub_group_shuffle(bias[k], i)
#else
#define SUBGROUP_GET_BIAS(k, i) ((Dtype)0)
#endif
#ifdef Conv_Interleaved
typedef struct float1 { float s0; } float1;
typedef struct float5 { float s0; float s1; float s2; float s3; float s4; } float5;
typedef struct float6 { float s0; float s1; float s2; float s3; float s4; float s5; } float6;
typedef struct float7 { float s0; float s1; float s2; float s3; float s4; float s5; float s6; } float7;
typedef struct float9 { float s0; float s1; float s2; float s3; float s4; float s5; float s6; float s7; float s8; } float9;
typedef struct float10 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9;} float10;
typedef struct float11 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa;} float11;
typedef struct float12 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa; float sb; } float12;
typedef struct float13 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa; float sb; float sc;} float13;
typedef struct float14 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa; float sb; float sc; float sd; } float14;
typedef struct float15 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa; float sb; float sc; float sd; float se; } float15;
typedef struct float0 { float s0; } float0;
typedef struct half1 { half s0; } half1;
typedef struct half5 { half s0; half s1; half s2; half s3; half s4; } half5;
typedef struct half6 { half s0; half s1; half s2; half s3; half s4; half s5; } half6;
typedef struct half7 { half s0; half s1; half s2; half s3; half s4; half s5; half s6; } half7;
typedef struct half9 { half s0; half s1; half s2; half s3; half s4; half s5; half s6; half s7; half s8; } half9;
typedef struct half10 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; } half10;
typedef struct half11 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; } half11;
typedef struct half12 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; half sb; } half12;
typedef struct half13 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; half sb; half sc; } half13;
typedef struct half14 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; half sb; half sc; half sd; } half14;
typedef struct half15 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; half sb; half sc; half sd; half se; } half15;
typedef struct half0 { half s0; } half0;
#define OUT_PITCH_X output_width
#define ROW_PITCH input_width
#define GEMM_LIKE_KERNEL_ARGS     \
ELTWISE_DATA_ARG              \
FUSED_ARG                     \
const __global Dtype *src0,   \
const __global Dtype *src1,   \
BIAS_KERNEL_ARG               \
__global Dtype *dst_base,     \
const int dst_offset,         \
const ushort input_width,     \
const ushort input_height,    \
const ushort output_width,    \
const ushort output_height,   \
const int out_pitch_y,     \
const int out_pitch_z,     \
const int aligned_input_size, \
const int slice_pitch
#endif
#ifdef GEMM_LIKE_CONV_32_1
#define TILE_M          1
#define TILE_K          KERNEL_WIDTH
#define TILE_N          32
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void Conv_Interleaved(GEMM_LIKE_KERNEL_ARGS)
{
__global Dtype *dst = dst_base + dst_offset;
const int group_x = get_group_id(0);
const int group_y = get_group_id(1);
const int global_x = get_global_id(0);
const int global_y = get_global_id(1);
const int global_z = get_global_id(2);
int interleaved_y;
int kernel_y;
int kernel_idx;
#define DOT_PRODUCT_8( _result, _rowA, colB )    \
{   \
_result.s0 = mad( _rowA, sub_group_broadcast( colB, 0 ), _result.s0 );  \
_result.s1 = mad( _rowA, sub_group_broadcast( colB, 1 ), _result.s1 );  \
_result.s2 = mad( _rowA, sub_group_broadcast( colB, 2 ), _result.s2 );  \
_result.s3 = mad( _rowA, sub_group_broadcast( colB, 3 ), _result.s3 );  \
_result.s4 = mad( _rowA, sub_group_broadcast( colB, 4 ), _result.s4 );  \
_result.s5 = mad( _rowA, sub_group_broadcast( colB, 5 ), _result.s5 );  \
_result.s6 = mad( _rowA, sub_group_broadcast( colB, 6 ), _result.s6 );  \
_result.s7 = mad( _rowA, sub_group_broadcast( colB, 7 ), _result.s7 );  \
}
typedef CAT( Dtype, KERNEL_WIDTH ) Dtype_t;
if( TILE_N_LAST == 0 || global_x < WIDTH1 / TILE_N )
{
Dtype8  blockC00 = 0.f;
Dtype8  blockC10 = 0.f;
Dtype8  blockC20 = 0.f;
Dtype8  blockC30 = 0.f;
int curr_x = ( global_y % output_width ) * STRIDE_X;
int curr_y = ( global_y / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y = curr_y;
#endif
const __global Dtype *src0_read = src0
+ aligned_input_size * global_z
+ (curr_y - INPUT_PAD_H) * ROW_PITCH
+ (curr_x - INPUT_PAD_W);
const __global Dtype *src1_read = src1 + ( global_x * TILE_N  * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y = saved_y;
#endif
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_W == 0 && INPUT_PAD_H == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
#if KERNEL_WIDTH == 3
Dtype_t blockA00 = vload3(0, src0_read);
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#else
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read )[  0  ];
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#endif
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y >= INPUT_PAD_H &&
curr_y < input_height + INPUT_PAD_H &&
curr_x + pos * DILATION_X >= INPUT_PAD_W &&
curr_x + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y += DILATION_Y;
#endif
src0_read += (ROW_PITCH * DILATION_Y);
Dtype blockB00[KERNEL_WIDTH*4];
Dtype8* p8BlockB00 = (Dtype8*)blockB00;
Dtype4* p4BlockB00 = (Dtype4*)blockB00;
Dtype*  pBlockB00 =  (Dtype* )blockB00;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
p8BlockB00[interleaved_y] = as_Dtype8( SUB_GROUP_BLOCK_READ8( (const __global INT_TYPE *)src1_read ) );
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
p4BlockB00[KERNEL_WIDTH - 1] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE *)src1_read ) );
src1_read += WIDTH1 * 2;
}
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
} )
kernel_y = interleaved_y * 2;
if ( kernel_width_is_odd )
{
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
}
}
while( ++patch_row < KERNEL_HEIGHT );
src0_read += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y);
}
while ( ++patch_depth < INPUT_DEPTH );
int out_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M ) / output_width + OUT_PADDING_HEIGHT) * OUT_PITCH_X
+ ( ( global_y * TILE_M ) % output_width ) + OUT_PADDING_LEFT;
__global Dtype *out = dst + out_offset;
#if APPLY_BIAS
Dtype bias[4];
Dtype4 *bias_vec;
bias_vec = (Dtype4*)bias;
*bias_vec = as_Dtype4(SUB_GROUP_BLOCK_READ4((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1] + bias[2] + bias[3];
}
#else
const Dtype bias[4] = {0, 0, 0, 0};
#endif
if (global_y * TILE_M < output_width * output_height )
{
for (int i = 0; i < 8; i++)
{
ACTIVATION_FUNCTION(dst, out_offset + ( 0 + i ) * out_pitch_y, blockC00[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
ACTIVATION_FUNCTION(dst, out_offset + ( 8 + i ) * out_pitch_y, blockC10[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + 8 + i);
ACTIVATION_FUNCTION(dst, out_offset + ( 16 + i ) * out_pitch_y, blockC20[i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + 16 + i);
ACTIVATION_FUNCTION(dst, out_offset + ( 24 + i ) * out_pitch_y, blockC30[i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + 24 + i);
}
}
}
#if TILE_N_LAST > 0
else
{
int i = 0;
Dtype8  blockC[TILE_N_LAST_DIV8];
LOOP(TILE_N_LAST_DIV8, i,
{
blockC[i] = 0.f;
} )
int curr_x = ( global_y % output_width ) * STRIDE_X;
int curr_y = ( global_y / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y = curr_y;
#endif
const __global Dtype *src0_read = src0
+ aligned_input_size * global_z
+ (curr_y - INPUT_PAD_H) * ROW_PITCH
+ (curr_x - INPUT_PAD_W);
const __global Dtype *src1_read = src1 + ( global_x * TILE_N  * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y = saved_y;
#endif
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_W == 0 && INPUT_PAD_H == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read )[  0  ];
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y >= INPUT_PAD_H &&
curr_y < input_height + INPUT_PAD_H &&
curr_x + pos * DILATION_X >= INPUT_PAD_W &&
curr_x + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y += DILATION_Y;
#endif
src0_read += (ROW_PITCH * DILATION_Y);
Dtype blockB[KERNEL_WIDTH * TILE_N_LAST_DIV8];
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
#if TILE_N_LAST_DIV8 == 1
Dtype2* p2BlockB = (Dtype2* )blockB;
p2BlockB[interleaved_y] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 2
Dtype4* p4BlockB = (Dtype4* )blockB;
p4BlockB[interleaved_y] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 3
Dtype6* p6BlockB = (Dtype6* )blockB;
(*((Dtype8*)(&p6BlockB[interleaved_y]))).s0123 = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
(*((Dtype8*)(&p6BlockB[interleaved_y]))).s45 = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)(src1_read + 4 * 8) ) );
#endif
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
#if TILE_N_LAST_DIV8 == 1
Dtype* pBlockB = (Dtype* )blockB;
pBlockB[KERNEL_WIDTH - 1] = as_Dtype( SUB_GROUP_BLOCK_READ( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 2
Dtype2* p2BlockB = (Dtype2* )blockB;
p2BlockB[KERNEL_WIDTH - 1] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 3
Dtype3* p3BlockB = (Dtype3* )blockB;
p3BlockB[KERNEL_WIDTH - 1].s01 = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
p3BlockB[KERNEL_WIDTH - 1].s2 = as_Dtype( SUB_GROUP_BLOCK_READ( (const __global INT_TYPE*) (src1_read + 2 * 8) ) );
#endif
src1_read += WIDTH1 * 2;
}
Dtype* pBlockB = (Dtype*)blockB;
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC[0], pblockA00[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC[0], pblockA00[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 2
DOT_PRODUCT_8( blockC[1], pblockA00[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC[1], pblockA00[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 3
DOT_PRODUCT_8( blockC[2], pblockA00[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC[2], pblockA00[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#endif
#endif
} )
kernel_y = interleaved_y * 2;
if ( kernel_width_is_odd )
{
DOT_PRODUCT_8( blockC[0], pblockA00[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 2
DOT_PRODUCT_8( blockC[1], pblockA00[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 3
DOT_PRODUCT_8( blockC[2], pblockA00[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#endif
#endif
}
}
while( ++patch_row < KERNEL_HEIGHT );
src0_read += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
}
while ( ++patch_depth < INPUT_DEPTH );
int out_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M ) / output_width + OUT_PADDING_HEIGHT) * OUT_PITCH_X
+ ( ( global_y * TILE_M ) % output_width ) + OUT_PADDING_LEFT;
__global Dtype *out = dst + out_offset;
#if APPLY_BIAS
Dtype bias[4];
Dtype4 *bias_vec;
bias_vec = (Dtype4*)bias;
*bias_vec = as_Dtype4(SUB_GROUP_BLOCK_READ4((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1] + bias[2] + bias[3];
}
#else
const Dtype bias[4] = {0, 0, 0, 0};
#endif
if (global_y * TILE_M < output_width * output_height )
{
for (int i = 0; i < 8; i++)
{
if ( TILE_N_LAST_DIV8 > 0 )
{
ACTIVATION_FUNCTION(dst, out_offset + ( 0+i) * out_pitch_y, blockC[0][i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
}
if ( TILE_N_LAST_DIV8 > 1 )
{
ACTIVATION_FUNCTION(dst, out_offset + ( 8+i) * out_pitch_y, blockC[1][i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
}
if ( TILE_N_LAST_DIV8 > 2 )
{
ACTIVATION_FUNCTION(dst, out_offset + (16+i) * out_pitch_y, blockC[2][i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
}
if ( TILE_N_LAST_DIV8 > 3 )
{
ACTIVATION_FUNCTION(dst, out_offset + (24+i) * out_pitch_y, blockC[3][i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
}
}
#endif
}
#endif
#ifdef GEMM_LIKE_CONV_32_2
#define TILE_M          2
#define TILE_K          KERNEL_WIDTH
#define TILE_N          32
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void Conv_Interleaved(GEMM_LIKE_KERNEL_ARGS)
{
__global Dtype *dst = dst_base + dst_offset;
const int group_x = get_group_id(0);
const int group_y = get_group_id(1);
const int global_x = get_global_id(0);
const int global_y = get_global_id(1);
const int global_z = get_global_id(2);
int interleaved_y;
int kernel_y;
int kernel_idx;
#define DOT_PRODUCT_8( _result, _rowA, colB )    \
{   \
_result.s0 = mad( _rowA, sub_group_broadcast( colB, 0 ), _result.s0 );  \
_result.s1 = mad( _rowA, sub_group_broadcast( colB, 1 ), _result.s1 );  \
_result.s2 = mad( _rowA, sub_group_broadcast( colB, 2 ), _result.s2 );  \
_result.s3 = mad( _rowA, sub_group_broadcast( colB, 3 ), _result.s3 );  \
_result.s4 = mad( _rowA, sub_group_broadcast( colB, 4 ), _result.s4 );  \
_result.s5 = mad( _rowA, sub_group_broadcast( colB, 5 ), _result.s5 );  \
_result.s6 = mad( _rowA, sub_group_broadcast( colB, 6 ), _result.s6 );  \
_result.s7 = mad( _rowA, sub_group_broadcast( colB, 7 ), _result.s7 );  \
}
typedef CAT( Dtype, KERNEL_WIDTH ) Dtype_t;
if( TILE_N_LAST == 0 || global_x < WIDTH1 / TILE_N )
{
Dtype8  blockC00 = 0.f;
Dtype8  blockC10 = 0.f;
Dtype8  blockC20 = 0.f;
Dtype8  blockC30 = 0.f;
Dtype8  blockC01 = 0.f;
Dtype8  blockC11 = 0.f;
Dtype8  blockC21 = 0.f;
Dtype8  blockC31 = 0.f;
int curr_x0 = ( ( global_y * TILE_M + 0 ) % output_width ) * STRIDE_X;
int curr_x1 = ( ( global_y * TILE_M + 1 ) % output_width ) * STRIDE_X;
int curr_y0 = ( ( global_y * TILE_M + 0 ) / output_width ) * STRIDE_Y;
int curr_y1 = ( ( global_y * TILE_M + 1 ) / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y0 = curr_y0;
int saved_y1 = curr_y1;
#endif
const __global Dtype *src0_read0 = src0
+ aligned_input_size * global_z
+ (curr_y0 - INPUT_PAD_H) * ROW_PITCH
+ curr_x0 - INPUT_PAD_W;
const __global Dtype *src0_read1 = src0
+ aligned_input_size * global_z
+ (curr_y1 - INPUT_PAD_H) * ROW_PITCH
+ curr_x1 - INPUT_PAD_W;
const __global Dtype *src1_read = src1 + ( global_x * TILE_N * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_H == 0 && INPUT_PAD_W == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
#if KERNEL_WIDTH == 3
Dtype_t blockA00 = vload3(0, src0_read0); src0_read0 += ROW_PITCH;
Dtype_t blockA01 = vload3(0, src0_read1); src0_read1 += ROW_PITCH;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
Dtype*  pblockA01 = (Dtype*)(&blockA01);
#else
Dtype_t blockA00 = { (Dtype)0.f };
Dtype_t blockA01 = { (Dtype)0.f };
Dtype*  pblockA00 = (Dtype*)(&blockA00);
Dtype*  pblockA01 = (Dtype*)(&blockA01);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_x0 + pos < input_width)
pblockA00[pos] = src0_read0[pos];
if (curr_x1 + pos < input_width)
pblockA01[pos] = src0_read1[pos];
})
src0_read0 += ROW_PITCH;
src0_read1 += ROW_PITCH;
#endif
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y0 >= INPUT_PAD_H &&
curr_y0 < input_height + INPUT_PAD_H &&
curr_x0 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x0 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read0[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y0 += DILATION_Y;
Dtype_t blockA01;
Dtype*  pblockA01 = (Dtype*)(&blockA01);
pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y1 >= INPUT_PAD_H &&
curr_y1 < input_height + INPUT_PAD_H &&
curr_x1 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x1 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA01[pos] = src0_read1[pos * DILATION_X];
else
pblockA01[pos] = 0;
})
curr_y1 += DILATION_Y;
src0_read0 += (ROW_PITCH * DILATION_Y);
src0_read1 += (ROW_PITCH * DILATION_Y);
#endif
Dtype blockB00[KERNEL_WIDTH*4];
Dtype8* p8BlockB00 = (Dtype8*)blockB00;
Dtype4* p4BlockB00 = (Dtype4*)blockB00;
Dtype*  pBlockB00 =  (Dtype* )blockB00;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
p8BlockB00[interleaved_y] = as_Dtype8( SUB_GROUP_BLOCK_READ8( (const __global INT_TYPE*)src1_read ) );
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
p4BlockB00[KERNEL_WIDTH - 1] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
src1_read += WIDTH1 * 2;
}
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC01, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC01, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC11, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC11, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC21, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC21, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC31, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC31, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
} )
if ( kernel_width_is_odd )
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC01, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC11, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC21, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC31, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
}
}
while( ++patch_row < KERNEL_HEIGHT );
#if INPUT_PAD_W != 0 || INPUT_PAD_H != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y0 = saved_y0;
curr_y1 = saved_y1;
#endif
src0_read0 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
src0_read1 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
}
while ( ++patch_depth < INPUT_DEPTH );
int out0_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 0 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 0 ) % output_width ) + OUT_PADDING_LEFT;
int out1_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 1 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 1 ) % output_width ) + OUT_PADDING_LEFT;
#if APPLY_BIAS
Dtype bias[4];
Dtype4 *bias_vec;
bias_vec = (Dtype4*)bias;
*bias_vec = as_Dtype4(SUB_GROUP_BLOCK_READ4((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1] + bias[2] + bias[3];
}
#else
const Dtype bias[4] = {0, 0, 0, 0};
#endif
if( global_y * TILE_M < output_width * output_height )
{
for( int i = 0; i < 8; i++ )
{
ACTIVATION_FUNCTION(dst, out0_offset + ( 0+i) * out_pitch_y, blockC00[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
ACTIVATION_FUNCTION(dst, out0_offset + ( 8+i) * out_pitch_y, blockC10[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
ACTIVATION_FUNCTION(dst, out0_offset + (16+i) * out_pitch_y, blockC20[i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
ACTIVATION_FUNCTION(dst, out0_offset + (24+i) * out_pitch_y, blockC30[i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
if( global_y * TILE_M + 1 < output_width * output_height )
{
for( int i = 0; i < 8; i++ )
{
ACTIVATION_FUNCTION(dst, out1_offset + ( 0+i) * out_pitch_y, blockC01[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
ACTIVATION_FUNCTION(dst, out1_offset + ( 8+i) * out_pitch_y, blockC11[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
ACTIVATION_FUNCTION(dst, out1_offset + (16+i) * out_pitch_y, blockC21[i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
ACTIVATION_FUNCTION(dst, out1_offset + (24+i) * out_pitch_y, blockC31[i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
}
#if TILE_N_LAST > 0
else
{
int i = 0;
Dtype8  blockC0[TILE_N_LAST_DIV8];
Dtype8  blockC1[TILE_N_LAST_DIV8];
LOOP(TILE_N_LAST_DIV8, i,
{
blockC0[i] = 0.f;
blockC1[i] = 0.f;
} )
int curr_x0 = ( ( global_y * TILE_M + 0 ) % output_width ) * STRIDE_X;
int curr_x1 = ( ( global_y * TILE_M + 1 ) % output_width ) * STRIDE_X;
int curr_y0 = ( ( global_y * TILE_M + 0 ) / output_width ) * STRIDE_Y;
int curr_y1 = ( ( global_y * TILE_M + 1 ) / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y0 = curr_y0;
int saved_y1 = curr_y1;
#endif
const __global Dtype *src0_read0 = src0
+ aligned_input_size * global_z
+ (curr_y0 - INPUT_PAD_H) * ROW_PITCH
+ curr_x0 - INPUT_PAD_W;
const __global Dtype *src0_read1 = src0
+ aligned_input_size * global_z
+ (curr_y1 - INPUT_PAD_H) * ROW_PITCH
+ curr_x1 - INPUT_PAD_W;
const __global Dtype *src1_read = src1 + ( global_x * TILE_N  * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_H == 0 && INPUT_PAD_W == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read0 )[  0  ]; src0_read0 += ROW_PITCH;
Dtype_t blockA01 = ( (const __global Dtype_t*)src0_read1 )[  0  ]; src0_read1 += ROW_PITCH;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
Dtype*  pblockA01 = (Dtype*)(&blockA01);
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y0 >= INPUT_PAD_H &&
curr_y0 < input_height + INPUT_PAD_H &&
curr_x0 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x0 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read0[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y0 += DILATION_Y;
Dtype_t blockA01;
Dtype*  pblockA01 = (Dtype*)(&blockA01);
pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y1 >= INPUT_PAD_H &&
curr_y1 < input_height + INPUT_PAD_H &&
curr_x1 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x1 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA01[pos] = src0_read1[pos * DILATION_X];
else
pblockA01[pos] = 0;
})
curr_y1 += DILATION_Y;
src0_read0 += (ROW_PITCH * DILATION_Y);
src0_read1 += (ROW_PITCH * DILATION_Y);
#endif
Dtype blockB[KERNEL_WIDTH * TILE_N_LAST_DIV8];
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
#if TILE_N_LAST_DIV8 == 1
Dtype2* p2BlockB = (Dtype2* )blockB;
p2BlockB[interleaved_y] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 2
Dtype4* p4BlockB = (Dtype4* )blockB;
p4BlockB[interleaved_y] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 3
Dtype6* p6BlockB = (Dtype6* )blockB;
(*((Dtype8*)(&p6BlockB[interleaved_y]))).s0123 = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
(*((Dtype8*)(&p6BlockB[interleaved_y]))).s45 = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)(src1_read + 4 * 8) ) );
#endif
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
#if TILE_N_LAST_DIV8 == 1
Dtype* pBlockB = (Dtype* )blockB;
pBlockB[KERNEL_WIDTH - 1] = as_Dtype( SUB_GROUP_BLOCK_READ( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 2
Dtype2* p2BlockB = (Dtype2* )blockB;
p2BlockB[KERNEL_WIDTH - 1] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 3
Dtype3* p3BlockB = (Dtype3* )blockB;
p3BlockB[KERNEL_WIDTH - 1].s01 = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
p3BlockB[KERNEL_WIDTH - 1].s2 = as_Dtype( SUB_GROUP_BLOCK_READ( (const __global INT_TYPE*) (src1_read + 8) ) );
#endif
src1_read += WIDTH1 * 2;
}
Dtype* pBlockB = (Dtype*)blockB;
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC0[0], pblockA00[kernel_y    ], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[0], pblockA01[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC0[0], pblockA00[kernel_y + 1], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[0], pblockA01[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 2
DOT_PRODUCT_8( blockC0[1], pblockA00[kernel_y    ], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[1], pblockA01[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC0[1], pblockA00[kernel_y + 1], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[1], pblockA01[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 3
DOT_PRODUCT_8( blockC0[2], pblockA00[kernel_y    ], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[2], pblockA01[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC0[2], pblockA00[kernel_y + 1], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[2], pblockA01[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#endif
#endif
} )
kernel_y = interleaved_y * 2;
if ( kernel_width_is_odd )
{
DOT_PRODUCT_8( blockC0[0], pblockA00[kernel_y], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[0], pblockA01[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 2
DOT_PRODUCT_8( blockC0[1], pblockA00[kernel_y], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[1], pblockA01[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 3
DOT_PRODUCT_8( blockC0[2], pblockA00[kernel_y], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[2], pblockA01[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#endif
#endif
}
}
while( ++patch_row < KERNEL_HEIGHT );
#if INPUT_PAD_W != 0 || INPUT_PAD_H != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y0 = saved_y0;
curr_y1 = saved_y1;
#endif
src0_read0 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
src0_read1 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
}
while ( ++patch_depth < INPUT_DEPTH );
int out0_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 0 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 0 ) % output_width ) + OUT_PADDING_LEFT;
int out1_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 1 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 1 ) % output_width ) + OUT_PADDING_LEFT;
__global Dtype *out1 = dst + out1_offset;
#if APPLY_BIAS
Dtype bias[4];
Dtype4 *bias_vec;
bias_vec = (Dtype4*)bias;
*bias_vec = as_Dtype4(SUB_GROUP_BLOCK_READ4((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1] + bias[2] + bias[3];
}
#else
const Dtype bias[4] = {0, 0, 0, 0};
#endif
if( global_y * TILE_M < output_width * output_height )
{
for( int i = 0; i < 8; i++ )
{
if ( TILE_N_LAST_DIV8 > 0 )
{
ACTIVATION_FUNCTION(dst, out0_offset + ( 0+i) * out_pitch_y, blockC0[0][i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
}
if ( TILE_N_LAST_DIV8 > 1 )
{
ACTIVATION_FUNCTION(dst, out0_offset + ( 8+i) * out_pitch_y, blockC0[1][i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
}
if ( TILE_N_LAST_DIV8 > 2 )
{
ACTIVATION_FUNCTION(dst, out0_offset + (16+i) * out_pitch_y, blockC0[2][i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
}
if ( TILE_N_LAST_DIV8 > 3 )
{
ACTIVATION_FUNCTION(dst, out0_offset + (24+i) * out_pitch_y, blockC0[3][i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
}
if( global_y * TILE_M + 1 < output_width * output_height )
{
for( int i = 0; i < 8; i++ )
{
if ( TILE_N_LAST_DIV8 > 0 )
{
ACTIVATION_FUNCTION(dst, out1_offset + ( 0+i) * out_pitch_y, blockC1[0][i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
}
if ( TILE_N_LAST_DIV8 > 1 )
{
ACTIVATION_FUNCTION(dst, out1_offset + ( 8+i) * out_pitch_y, blockC1[1][i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
}
if ( TILE_N_LAST_DIV8 > 2 )
{
ACTIVATION_FUNCTION(dst, out1_offset + (16+i) * out_pitch_y, blockC1[2][i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
}
if ( TILE_N_LAST_DIV8 > 3 )
{
ACTIVATION_FUNCTION(dst, out1_offset + (24+i) * out_pitch_y, blockC1[3][i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
}
}
#endif
}
#endif
#if defined(GEMM_LIKE_CONV_32_2_SIMD16) || defined(GEMM_LIKE_CONV_32_1_SIMD16)
#define INTERLEAVED_SIMD16_OUTPUT(_out_, _offset_,  _m_) do {\
if (global_y * TILE_M < output_width * output_height ) \
{ \
if ( ( OUT_DEPTH % TILE_N ) == 0 ) {\
for (int i = 0; i < 16; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_ [i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
ACTIVATION_FUNCTION(_out_, _offset_ + (16+i) * out_pitch_y, blockC1 ##_m_ [i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 16); \
} \
} \
else if( ( OUT_DEPTH % 16 ) == 0 ) { \
if ( ( global_x + 1 ) < get_global_size(0) ) { \
for ( int i = 0; i < 16; i++ ) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_ [i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
ACTIVATION_FUNCTION(_out_, _offset_ + (16+i) * out_pitch_y, blockC1 ##_m_ [i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 16); \
} \
} \
else { \
for (int i = 0; i < 16; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_ [i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
} \
} \
} \
else { \
if ( ( global_x + 1 ) < get_global_size(0) ) \
{ \
for ( int i = 0; i < 16; i++ ) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
ACTIVATION_FUNCTION(_out_, _offset_ + (16+i) * out_pitch_y, blockC1 ##_m_[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 16); \
} \
} \
else { \
if ( (OUT_DEPTH % TILE_N) > 16 ) { \
for (int i = 0; i < 16 ; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
} \
for (int i = 0; i < OUT_DEPTH % 16 ; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + (16+i) * out_pitch_y, blockC1 ##_m_[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 16); \
} \
} \
else { \
for (int i = 0; i < OUT_DEPTH % 16 ; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
} \
} \
} \
} \
} \
}while(0)
#endif
#ifdef GEMM_LIKE_CONV_32_1_SIMD16
#define TILE_M          1
#define TILE_K          KERNEL_WIDTH
#define TILE_N          32
__attribute__((intel_reqd_sub_group_size(16)))
__kernel void Conv_Interleaved(GEMM_LIKE_KERNEL_ARGS)
{
__global Dtype *dst = dst_base + dst_offset;
const int group_x = get_group_id(0);
const int group_y = get_group_id(1);
const int global_x = get_global_id(0);
const int global_y = get_global_id(1);
const int global_z = get_global_id(2);
int interleaved_y;
int kernel_y;
int kernel_idx;
Dtype16  blockC00 = 0.f;
Dtype16  blockC10 = 0.f;
int curr_x = ( global_y % output_width ) * STRIDE_X;
int curr_y = ( global_y / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y = curr_y;
#endif
const __global Dtype *src0_read = src0
+ aligned_input_size * global_z
+ (curr_y - INPUT_PAD_H) * ROW_PITCH
+ curr_x - INPUT_PAD_W;
const __global Dtype *src0_read_orig = src0_read;
const __global Dtype *src1_read = src1 + ( global_x * TILE_N * 2 );
#define DOT_PRODUCT_16( _result, _rowA, colB )    \
{   \
_result.s0 = mad( _rowA, sub_group_broadcast( colB,  0 ), _result.s0 );  \
_result.s1 = mad( _rowA, sub_group_broadcast( colB,  1 ), _result.s1 );  \
_result.s2 = mad( _rowA, sub_group_broadcast( colB,  2 ), _result.s2 );  \
_result.s3 = mad( _rowA, sub_group_broadcast( colB,  3 ), _result.s3 );  \
_result.s4 = mad( _rowA, sub_group_broadcast( colB,  4 ), _result.s4 );  \
_result.s5 = mad( _rowA, sub_group_broadcast( colB,  5 ), _result.s5 );  \
_result.s6 = mad( _rowA, sub_group_broadcast( colB,  6 ), _result.s6 );  \
_result.s7 = mad( _rowA, sub_group_broadcast( colB,  7 ), _result.s7 );  \
_result.s8 = mad( _rowA, sub_group_broadcast( colB,  8 ), _result.s8 );  \
_result.s9 = mad( _rowA, sub_group_broadcast( colB,  9 ), _result.s9 );  \
_result.sa = mad( _rowA, sub_group_broadcast( colB, 10 ), _result.sa );  \
_result.sb = mad( _rowA, sub_group_broadcast( colB, 11 ), _result.sb );  \
_result.sc = mad( _rowA, sub_group_broadcast( colB, 12 ), _result.sc );  \
_result.sd = mad( _rowA, sub_group_broadcast( colB, 13 ), _result.sd );  \
_result.se = mad( _rowA, sub_group_broadcast( colB, 14 ), _result.se );  \
_result.sf = mad( _rowA, sub_group_broadcast( colB, 15 ), _result.sf );  \
}
typedef CAT( Dtype, KERNEL_WIDTH ) Dtype_t;
int patch_depth = 0;
__attribute__((opencl_unroll_hint(1)))
do
{
int patch_row = 0;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y = saved_y;
#endif
__attribute__((opencl_unroll_hint(1)))
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_W == 0 && INPUT_PAD_H == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
#if KERNEL_WIDTH == 3
Dtype_t blockA00 = vload3(0, src0_read);
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#else
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read )[  0  ];
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#endif
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y >= INPUT_PAD_H &&
curr_y < input_height + INPUT_PAD_H &&
curr_x + pos * DILATION_X >= INPUT_PAD_W &&
curr_x + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y += DILATION_Y;
#endif
src0_read += ROW_PITCH * DILATION_Y;
INT_TYPE blockB00[KERNEL_WIDTH * 2];
INT_TYPE4* p4BlockB00 = (INT_TYPE4*)blockB00;
INT_TYPE2* p2BlockB00 = (INT_TYPE2*)blockB00;
Dtype* pBlockB00  = (Dtype*)blockB00;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
p4BlockB00[interleaved_y] = SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read );
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
p2BlockB00[KERNEL_WIDTH - 1] = SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read );
src1_read += WIDTH1 * 2;
}
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
} )
if ( kernel_width_is_odd )
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
}
}
while( ++patch_row < KERNEL_HEIGHT );
src0_read += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
}
while ( ++patch_depth < INPUT_DEPTH );
int out_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M ) / output_width + OUT_PADDING_HEIGHT) * OUT_PITCH_X
+ ( ( global_y * TILE_M ) % output_width ) + OUT_PADDING_LEFT;
__global Dtype *out = dst + out_offset;
#if APPLY_BIAS
Dtype bias[2];
Dtype2 *bias_vec;
bias_vec = (Dtype2*)bias;
*bias_vec = as_Dtype2(SUB_GROUP_BLOCK_READ2((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1];
}
#else
const Dtype bias[2] = {0, 0};
#endif
INTERLEAVED_SIMD16_OUTPUT(dst, out_offset, 0);
}
#endif
#ifdef GEMM_LIKE_CONV_32_2_SIMD16
#define TILE_M          2
#define TILE_K          KERNEL_WIDTH
#define TILE_N          32
__attribute__((intel_reqd_sub_group_size(16)))
__kernel void Conv_Interleaved(GEMM_LIKE_KERNEL_ARGS)
{
__global Dtype *dst = dst_base + dst_offset;
const int group_x = get_group_id(0);
const int group_y = get_group_id(1);
const int global_x = get_global_id(0);
const int global_y = get_global_id(1);
const int global_z = get_global_id(2);
int interleaved_y;
int kernel_y;
int kernel_idx;
#define DOT_PRODUCT_16( _result, _rowA, colB )    \
{   \
_result.s0 = mad( _rowA, sub_group_broadcast( colB, 0 ), _result.s0 );  \
_result.s1 = mad( _rowA, sub_group_broadcast( colB, 1 ), _result.s1 );  \
_result.s2 = mad( _rowA, sub_group_broadcast( colB, 2 ), _result.s2 );  \
_result.s3 = mad( _rowA, sub_group_broadcast( colB, 3 ), _result.s3 );  \
_result.s4 = mad( _rowA, sub_group_broadcast( colB, 4 ), _result.s4 );  \
_result.s5 = mad( _rowA, sub_group_broadcast( colB, 5 ), _result.s5 );  \
_result.s6 = mad( _rowA, sub_group_broadcast( colB, 6 ), _result.s6 );  \
_result.s7 = mad( _rowA, sub_group_broadcast( colB, 7 ), _result.s7 );  \
_result.s8 = mad( _rowA, sub_group_broadcast( colB, 8 ), _result.s8 );  \
_result.s9 = mad( _rowA, sub_group_broadcast( colB, 9 ), _result.s9 );  \
_result.sa = mad( _rowA, sub_group_broadcast( colB, 10 ), _result.sa );  \
_result.sb = mad( _rowA, sub_group_broadcast( colB, 11 ), _result.sb );  \
_result.sc = mad( _rowA, sub_group_broadcast( colB, 12 ), _result.sc );  \
_result.sd = mad( _rowA, sub_group_broadcast( colB, 13 ), _result.sd );  \
_result.se = mad( _rowA, sub_group_broadcast( colB, 14 ), _result.se );  \
_result.sf = mad( _rowA, sub_group_broadcast( colB, 15 ), _result.sf );  \
}
typedef CAT( Dtype, KERNEL_WIDTH ) Dtype_t;
{
Dtype16  blockC00 = 0.f;
Dtype16  blockC10 = 0.f;
Dtype16  blockC01 = 0.f;
Dtype16  blockC11 = 0.f;
int curr_x0 = ( ( global_y * TILE_M + 0 ) % output_width ) * STRIDE_X;
int curr_x1 = ( ( global_y * TILE_M + 1 ) % output_width ) * STRIDE_X;
int curr_y0 = ( ( global_y * TILE_M + 0 ) / output_width ) * STRIDE_Y;
int curr_y1 = ( ( global_y * TILE_M + 1 ) / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y0 = curr_y0;
int saved_y1 = curr_y1;
#endif
const __global Dtype *src0_read0 = src0
+ aligned_input_size * global_z
+ (curr_y0 - INPUT_PAD_H) * ROW_PITCH
+ curr_x0 - INPUT_PAD_W;
const __global Dtype *src0_read1 = src0
+ aligned_input_size * global_z
+ (curr_y1 - INPUT_PAD_H) * ROW_PITCH
+ curr_x1 - INPUT_PAD_W;
const __global Dtype *src1_read = src1 + ( global_x * TILE_N * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_H == 0 && INPUT_PAD_W == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read0 )[  0  ]; src0_read0 += ROW_PITCH;
Dtype_t blockA01 = ( (const __global Dtype_t*)src0_read1 )[  0  ]; src0_read1 += ROW_PITCH;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
Dtype*  pblockA01 = (Dtype*)(&blockA01);
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y0 >= INPUT_PAD_H &&
curr_y0 < input_height + INPUT_PAD_H &&
curr_x0 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x0 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read0[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y0 += DILATION_Y;
Dtype_t blockA01;
Dtype*  pblockA01 = (Dtype*)(&blockA01);
pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y1 >= INPUT_PAD_H &&
curr_y1 < input_height + INPUT_PAD_H &&
curr_x1 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x1 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA01[pos] = src0_read1[pos * DILATION_X];
else
pblockA01[pos] = 0;
})
curr_y1 += DILATION_Y;
src0_read0 += (ROW_PITCH * DILATION_Y);
src0_read1 += (ROW_PITCH * DILATION_Y);
#endif
Dtype blockB00[KERNEL_WIDTH*2];
Dtype4* p4BlockB00 = (Dtype4*)blockB00;
Dtype2* p2BlockB00 = (Dtype2*)blockB00;
Dtype*  pBlockB00 =  (Dtype* )blockB00;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
p4BlockB00[interleaved_y] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
p2BlockB00[KERNEL_WIDTH - 1] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
src1_read += WIDTH1 * 2;
}
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC01, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC01, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC11, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC11, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
} )
if ( kernel_width_is_odd )
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC01, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC11, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
}
}
while( ++patch_row < KERNEL_HEIGHT );
#if INPUT_PAD_W != 0 || INPUT_PAD_H != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y0 = saved_y0;
curr_y1 = saved_y1;
#endif
src0_read0 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y);
src0_read1 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y);
}
while ( ++patch_depth < INPUT_DEPTH );
int out0_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 0 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 0 ) % output_width ) + OUT_PADDING_LEFT;
int out1_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 1 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 1 ) % output_width ) + OUT_PADDING_LEFT;
#if APPLY_BIAS
Dtype bias[2];
Dtype2 *bias_vec;
bias_vec = (Dtype2*)bias;
*bias_vec = as_Dtype2(SUB_GROUP_BLOCK_READ2((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1];
}
#else
const Dtype bias[2] = {0, 0};
#endif
INTERLEAVED_SIMD16_OUTPUT(dst, out0_offset, 0);
INTERLEAVED_SIMD16_OUTPUT(dst, out1_offset, 1);
}
}
#endif
#elif defined KERNEL_DWCONV
__kernel void DWCONV(
ELTWISE_DATA_ARG
FUSED_ARG
__global Dtype* image_data,
__global Dtype* kernel_data,
BIAS_KERNEL_ARG
__global Dtype* convolved_image_base,
const int convolved_image_offset,
const ushort input_width,
const ushort input_height,
const ushort output_width,
const ushort output_height) {
__global Dtype* convolved_image = convolved_image_base + convolved_image_offset;
const int outputX = get_global_id(0);
const int outputY = get_global_id(1);
const int outputZ = get_global_id(2);
if(outputX < output_width && outputY < output_height)
{
Dtype sum = 0.;
const int org_y = outputY * STRIDE_Y - INPUT_PAD_H;
const int org_x = outputX * STRIDE_X - INPUT_PAD_W;
const int currentKernelOffset = KERNEL_SIZE*(outputZ%CHANNELS);
const int biasIndex=outputZ%CHANNELS;
const int local_image_offset = org_y*input_width + org_x;
const int imageSize = input_width*input_height;
__global Dtype* image_dataPtrFloat = (image_data + (imageSize*outputZ + local_image_offset));
__global Dtype* kernel_dataPtrFloat = (kernel_data + (currentKernelOffset));
for(int y = 0; y < KERNEL_H; y++)
{
for(int x = 0; x < KERNEL_W; x++)
{
if(!(org_y + y * DILATION_Y >= 0 && org_y + y * DILATION_Y < input_height && org_x + x * DILATION_X >= 0 && org_x + x * DILATION_X < input_width))
{
continue;
}
sum += image_dataPtrFloat[x * DILATION_X] * kernel_dataPtrFloat[x];
}
image_dataPtrFloat += input_width * DILATION_Y;
kernel_dataPtrFloat += KERNEL_W;
}
#if APPLY_BIAS
int offset = outputZ*output_height*output_width + outputY*output_width + outputX;
ACTIVATION_FUNCTION(convolved_image, offset, sum + biases_base[biasIndex], biasIndex);
#else
int offset = outputZ*output_height*output_width + outputY*output_width + outputX;
ACTIVATION_FUNCTION(convolved_image, offset, sum, biasIndex);
#endif
}
}
#endif

########################################

########################################
BUILDING PROGRAM:
sourceModule_:
dnn
sourceName_:
conv_layer_spatial
sourceHash_:
2cdd81c1843105011ecb613a4f6f9e26
buildflags:
 -D TYPE=1 -D Dtype=float -D Dtype2=float2 -D Dtype4=float4 -D Dtype8=float8 -D Dtype16=float16 -D as_Dtype=as_float -D as_Dtype2=as_float2 -D as_Dtype4=as_float4 -D as_Dtype8=as_float8 -D KERNEL_WIDTH=3 -D KERNEL_HEIGHT=3 -D STRIDE_X=1 -D STRIDE_Y=1 -D DILATION_X=1 -D DILATION_Y=1 -D INPUT_PAD_W=1 -D INPUT_PAD_H=1 -D INPUT_PAD_RIGHT=1 -D INPUT_PAD_BOTTOM=1 -cl-fast-relaxed-math  -D GEMM_LIKE_CONV_32_1 -D Conv_Interleaved=U_GEMM_LIKE_CONV_k3x3_cn576_g1_s1x1_d1x1_b1_in64x48_p1x1_num1_M512_activ0_eltwise0_FP32_5_1_8_32_SIMD8 -cl-mad-enable -D KERNEL_GEMM_LIKE -D INPUT_DEPTH=576 -D WIDTH1=512 -D OUT_PADDING_LEFT=0 -D OUT_PADDING_HEIGHT=0 -D OUT_DEPTH=512 -D NUM_BATCHES=1 -D DY=1 -D DX=32 -D KERNEL_WIDTH_DIV2=1 -D KERNEL_SLICE_DIV2=4 -D TILE_N_LAST=0 -D TILE_N_LAST_DIV8=0 -D APPLY_BIAS=1 -D INTEL_DEVICE
codeStr_:
(null)
src_->sourceAddr_:
-135676032
Source:
#if defined(cl_khr_fp16)
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#endif
#define KERNEL_ARG_DTYPE float
#define TYPE_FLOAT  1
#define TYPE_HALF   2
#if defined(FUSED_CONV_RELU)
#define ACTIVATION_RELU_FUNCTION(x, c) ((Dtype)(x) > 0 ? (Dtype)(x) : ((Dtype)(x) * (negative_slope)))
#define FUSED_ARG KERNEL_ARG_DTYPE negative_slope,
#elif defined(FUSED_CONV_PRELU)
#define ACTIVATION_RELU_FUNCTION(x, c) ((Dtype)(x) > 0 ? (Dtype)(x) : ((Dtype)(x) * (negative_slope[c])))
#define FUSED_ARG __global const KERNEL_ARG_DTYPE* negative_slope,
#elif defined(FUSED_CONV_POWER)
#define ACTIVATION_RELU_FUNCTION(x, c) pow(x, (Dtype)power)
#define FUSED_ARG KERNEL_ARG_DTYPE power,
#elif defined(FUSED_CONV_TANH)
#define ACTIVATION_RELU_FUNCTION(x, c) tanh(x)
#define FUSED_ARG
#elif defined(FUSED_CONV_RELU6)
#define ACTIVATION_RELU_FUNCTION(x, c) (clamp((Dtype)(x), (Dtype)min_value, (Dtype)max_value))
#define FUSED_ARG KERNEL_ARG_DTYPE min_value, KERNEL_ARG_DTYPE max_value,
#else
#define ACTIVATION_RELU_FUNCTION(x, c) (x)
#define FUSED_ARG
#endif
#ifdef FUSED_CONV_ELTWISE
#define ACTIVATION_FUNCTION(_dst_, _offset_, _data_, _channel_) do { (_dst_)[(_offset_)] = ACTIVATION_RELU_FUNCTION(eltwise_data[(_offset_)] + (_data_), _channel_);} while(0)
#define ELTWISE_DATA_ARG __global Dtype* eltwise_data,
#else
#define ACTIVATION_FUNCTION(_dst_, _offset_, _data_, _channel_) do { (_dst_)[(_offset_)] = ACTIVATION_RELU_FUNCTION(_data_, _channel_);} while(0)
#define ELTWISE_DATA_ARG
#endif
#if APPLY_BIAS
#define BIAS_KERNEL_ARG __global Dtype * biases_base,
#else
#define BIAS_KERNEL_ARG
#endif
#define __CAT(x, y) x##y
#define CAT(x, y) __CAT(x, y)
#define LOOP0(VAR, STMT)
#define LOOP1(VAR, STMT) (STMT); (VAR)++;
#define LOOP2(VAR, STMT) LOOP1(VAR, STMT); (STMT); (VAR)++;
#define LOOP3(VAR, STMT) LOOP2(VAR, STMT); (STMT); (VAR)++;
#define LOOP4(VAR, STMT) LOOP3(VAR, STMT); (STMT); (VAR)++;
#define LOOP5(VAR, STMT) LOOP4(VAR, STMT); (STMT); (VAR)++;
#define LOOP6(VAR, STMT) LOOP5(VAR, STMT); (STMT); (VAR)++;
#define LOOP7(VAR, STMT) LOOP6(VAR, STMT); (STMT); (VAR)++;
#define LOOP8(VAR, STMT) LOOP7(VAR, STMT); (STMT); (VAR)++;
#define LOOP9(VAR, STMT) LOOP8(VAR, STMT); (STMT); (VAR)++;
#define LOOP10(VAR, STMT) LOOP9(VAR, STMT); (STMT); (VAR)++;
#define LOOP11(VAR, STMT) LOOP10(VAR, STMT); (STMT); (VAR)++;
#define LOOP12(VAR, STMT) LOOP11(VAR, STMT); (STMT); (VAR)++;
#define LOOP13(VAR, STMT) LOOP12(VAR, STMT); (STMT); (VAR)++;
#define LOOP14(VAR, STMT) LOOP13(VAR, STMT); (STMT); (VAR)++;
#define LOOP15(VAR, STMT) LOOP14(VAR, STMT); (STMT); (VAR)++;
#define LOOP16(VAR, STMT) LOOP15(VAR, STMT); (STMT); (VAR)++;
#define LOOP(N, VAR, STMT) CAT(LOOP, N)((VAR), (STMT))
#if defined(convolve_simd) || defined(Conv_Interleaved)
#if TYPE == TYPE_HALF
#define INT_TYPE ushort
#define INT_TYPE2 ushort2
#define INT_TYPE4 ushort4
#define INT_TYPE8 ushort8
#define SUB_GROUP_BLOCK_READ2 intel_sub_group_block_read_us2
#define SUB_GROUP_BLOCK_READ4 intel_sub_group_block_read_us4
#define SUB_GROUP_BLOCK_READ8 intel_sub_group_block_read_us8
#define SUB_GROUP_BLOCK_READ intel_sub_group_block_read_us
#else
#define INT_TYPE uint
#define INT_TYPE2 uint2
#define INT_TYPE4 uint4
#define INT_TYPE8 uint8
#define SUB_GROUP_BLOCK_READ2 intel_sub_group_block_read2
#define SUB_GROUP_BLOCK_READ4 intel_sub_group_block_read4
#define SUB_GROUP_BLOCK_READ8 intel_sub_group_block_read8
#define SUB_GROUP_BLOCK_READ intel_sub_group_block_read
#endif
#endif
#ifdef KERNEL_BASIC
__kernel void ConvolveBasic(
ELTWISE_DATA_ARG
FUSED_ARG
__global Dtype* image_data,
int image_offset,
__global Dtype* kernel_data,
int kernel_offset,
__global Dtype* bias,
const int bias_offset,
__global Dtype* convolved_image_base,
const int convolved_image_base_offset,
const int convolved_image_offset,
const ushort input_width,
const ushort input_height,
const ushort output_width,
const ushort output_height,
const ushort pad_w,
const ushort pad_h
)
{
__global Dtype* convolved_image = convolved_image_base + convolved_image_base_offset;
const int outputX = get_global_id(0);
const int outputY = get_global_id(1);
const int kernelNum = get_global_id(2) * ZPAR;
if (outputX < output_width && outputY < output_height)
{
Dtype sum[ZPAR];
for (int kern = 0; kern < ZPAR; kern++)
{
sum[kern] = 0.0f;
}
const int org_y = outputY * STRIDE_Y - pad_h;
const int org_x = outputX * STRIDE_X - pad_w;
const int currentKernelOffset = kernel_offset + kernelNum*KERNEL_HEIGHT*KERNEL_WIDTH*CHANNELS;
#if APPLY_BIAS
const int biasIndex = bias_offset + kernelNum;
#endif
const int local_image_offset = org_y * input_width + org_x;
const int imageSize = input_width * input_height;
__global Dtype* image_dataPtr = (image_data + (image_offset + local_image_offset));
__global Dtype* kernel_dataPtr = (kernel_data + (currentKernelOffset));
for (int c = 0; c < CHANNELS; c++)
{
for (int y = 0; y < KERNEL_HEIGHT; y++)
{
for (int x = 0; x < KERNEL_WIDTH; x++)
{
int y_ = org_y + y * DILATION_Y;
int x_ = org_x + x * DILATION_X;
if (!(y_ >= 0 && y_ < input_height && x_ >= 0 && x_ < input_width))
{
continue;
}
for (int kern = 0; kern < ZPAR; kern++)
{
sum[kern] += image_dataPtr[x * DILATION_X] * kernel_dataPtr[kern*KERNEL_HEIGHT*KERNEL_WIDTH*CHANNELS + x];
}
}
image_dataPtr += input_width * DILATION_Y;
kernel_dataPtr += KERNEL_WIDTH;
}
image_dataPtr += imageSize - input_width*KERNEL_HEIGHT*DILATION_Y;
}
for (int kern = 0; kern < ZPAR; kern++)
{
if (kernelNum + kern < OUTPUT_Z)
{
int offset = convolved_image_offset + (kernelNum+kern)*output_height*output_width + outputY*output_width + outputX;
#if APPLY_BIAS
ACTIVATION_FUNCTION(convolved_image, offset, sum[kern] + bias[biasIndex + kern], biasIndex + kern);
#else
ACTIVATION_FUNCTION(convolved_image, offset, sum[kern], biasIndex + kern);
#endif
}
}
}
}
#elif defined KERNEL_IDLF
__attribute__((reqd_work_group_size(1, 1, SIMD_SIZE)))
__attribute__((intel_reqd_sub_group_size(SIMD_SIZE)))
__kernel void
convolve_simd(
ELTWISE_DATA_ARG
FUSED_ARG
__global Dtype* inputs,
__global Dtype* weights,
BIAS_KERNEL_ARG
__global Dtype* outputs_base,
const int outputs_offset,
const ushort input_width,
const ushort input_height,
const ushort output_width,
const ushort output_height)
{
__global Dtype* outputs = outputs_base + outputs_offset;
unsigned int oc = get_global_id(0) * OUT_BLOCK_WIDTH;
unsigned int or = get_global_id(1) * OUT_BLOCK_HEIGHT;
unsigned int fm = get_global_id(2);
unsigned int fmg = get_group_id(2);
unsigned int lid = get_local_id(2);
Dtype out[OUT_BLOCK_WIDTH * OUT_BLOCK_HEIGHT] = { 0.0f };
unsigned int weight_addr = (fmg % FILTERS_IN_GROUP) *
INPUT_DEPTH * KERNEL_WIDTH * KERNEL_HEIGHT * SIMD_SIZE + lid;
unsigned int num_in_batch = fm / ALIGNED_NUM_FILTERS;
unsigned int input_batch_offset = num_in_batch * INPUT_PITCH * TOTAL_INPUT_DEPTH_SIZE;
int curr_y = or * STRIDE_Y;
int curr_x = oc * STRIDE_X + lid;
int in_addr = input_batch_offset
+  (curr_y - INPUT_PAD_H) * INPUT_WIDTH
+   curr_x - INPUT_PAD_W;
const int in_limit = (get_global_size(2) / ALIGNED_NUM_FILTERS) * TOTAL_INPUT_DEPTH_SIZE * INPUT_PITCH - 1;
Dtype in_buf[INVEC_SIZE];
for(int kd = 0; kd < INPUT_DEPTH; kd++)
{
#if INPUT_PAD_W != 0 || INPUT_PAD_H != 0 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
const bool cx_out_of_range = !(curr_x >= INPUT_PAD_W && curr_x < INPUT_WIDTH + INPUT_PAD_W);
int in_offset = in_addr;
__attribute__((opencl_unroll_hint(INVEC_SIZE)))
for (int reg = 0; reg < INVEC_SIZE; reg++, in_offset += INPUT_WIDTH)
{
Dtype input = inputs[clamp(in_offset, 0, in_limit)];
int cy = curr_y + reg;
in_buf[reg] = (cx_out_of_range || cy < INPUT_PAD_H || cy >= INPUT_HEIGHT + INPUT_PAD_H) ? 0 : input;
}
#else
int in_offset = in_addr;
__attribute__((opencl_unroll_hint(INVEC_SIZE)))
for (int reg = 0; reg < INVEC_SIZE; reg++, in_offset += INPUT_WIDTH)
{
in_buf[reg] = inputs[min(in_offset, in_limit)];
}
#endif
in_addr += INPUT_PITCH;
#define BLOCK_IN(n, c) intel_sub_group_shuffle(in_buf[n], (c))
int kr = 0;
LOOP(KERNEL_HEIGHT, kr,
{
int kc = 0;
LOOP(KERNEL_WIDTH, kc,
{
Dtype weight_value = weights[weight_addr];
weight_addr += SIMD_SIZE;
for (int br=0; br < OUT_BLOCK_HEIGHT; br++)
{
for(int bc=0; bc < OUT_BLOCK_WIDTH; bc++)
{
Dtype input = BLOCK_IN((br * STRIDE_Y + kr * DILATION_Y), bc * STRIDE_X + kc * DILATION_X);
out[br * OUT_BLOCK_WIDTH + bc] = mad(weight_value, input, out[br * OUT_BLOCK_WIDTH + bc]);
}
}
});
});
}
fm = fm % ALIGNED_NUM_FILTERS;
#if LEFT_FILTERS > 0
if (fm < NUM_FILTERS)
#endif
{
unsigned int out_addr = (num_in_batch * TOTAL_OUTPUT_DEPTH + fm) * OUTPUT_PITCH;
out_addr += or * output_width + oc;
#if APPLY_BIAS
Dtype bias = biases_base[fm];
#else
Dtype bias = 0;
#endif
for(unsigned int r = 0; r < OUT_BLOCK_HEIGHT; r++)
{
if (r + or >= output_height) break;
for(unsigned int c = 0; c < OUT_BLOCK_WIDTH; c++)
{
if (c + oc >= output_width) break;
ACTIVATION_FUNCTION(outputs, out_addr + r * output_width + c, bias + out[r * OUT_BLOCK_WIDTH + c], fm);
}
}
}
}
#elif defined KERNEL_GEMM_LIKE
#if APPLY_BIAS
#define SUBGROUP_GET_BIAS(k, i) intel_sub_group_shuffle(bias[k], i)
#else
#define SUBGROUP_GET_BIAS(k, i) ((Dtype)0)
#endif
#ifdef Conv_Interleaved
typedef struct float1 { float s0; } float1;
typedef struct float5 { float s0; float s1; float s2; float s3; float s4; } float5;
typedef struct float6 { float s0; float s1; float s2; float s3; float s4; float s5; } float6;
typedef struct float7 { float s0; float s1; float s2; float s3; float s4; float s5; float s6; } float7;
typedef struct float9 { float s0; float s1; float s2; float s3; float s4; float s5; float s6; float s7; float s8; } float9;
typedef struct float10 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9;} float10;
typedef struct float11 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa;} float11;
typedef struct float12 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa; float sb; } float12;
typedef struct float13 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa; float sb; float sc;} float13;
typedef struct float14 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa; float sb; float sc; float sd; } float14;
typedef struct float15 { float s0; float s1; float s2; float s3; float s4; float s5;
float s6; float s7; float s8; float s9; float sa; float sb; float sc; float sd; float se; } float15;
typedef struct float0 { float s0; } float0;
typedef struct half1 { half s0; } half1;
typedef struct half5 { half s0; half s1; half s2; half s3; half s4; } half5;
typedef struct half6 { half s0; half s1; half s2; half s3; half s4; half s5; } half6;
typedef struct half7 { half s0; half s1; half s2; half s3; half s4; half s5; half s6; } half7;
typedef struct half9 { half s0; half s1; half s2; half s3; half s4; half s5; half s6; half s7; half s8; } half9;
typedef struct half10 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; } half10;
typedef struct half11 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; } half11;
typedef struct half12 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; half sb; } half12;
typedef struct half13 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; half sb; half sc; } half13;
typedef struct half14 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; half sb; half sc; half sd; } half14;
typedef struct half15 { half s0; half s1; half s2; half s3; half s4; half s5;
half s6; half s7; half s8; half s9; half sa; half sb; half sc; half sd; half se; } half15;
typedef struct half0 { half s0; } half0;
#define OUT_PITCH_X output_width
#define ROW_PITCH input_width
#define GEMM_LIKE_KERNEL_ARGS     \
ELTWISE_DATA_ARG              \
FUSED_ARG                     \
const __global Dtype *src0,   \
const __global Dtype *src1,   \
BIAS_KERNEL_ARG               \
__global Dtype *dst_base,     \
const int dst_offset,         \
const ushort input_width,     \
const ushort input_height,    \
const ushort output_width,    \
const ushort output_height,   \
const int out_pitch_y,     \
const int out_pitch_z,     \
const int aligned_input_size, \
const int slice_pitch
#endif
#ifdef GEMM_LIKE_CONV_32_1
#define TILE_M          1
#define TILE_K          KERNEL_WIDTH
#define TILE_N          32
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void Conv_Interleaved(GEMM_LIKE_KERNEL_ARGS)
{
__global Dtype *dst = dst_base + dst_offset;
const int group_x = get_group_id(0);
const int group_y = get_group_id(1);
const int global_x = get_global_id(0);
const int global_y = get_global_id(1);
const int global_z = get_global_id(2);
int interleaved_y;
int kernel_y;
int kernel_idx;
#define DOT_PRODUCT_8( _result, _rowA, colB )    \
{   \
_result.s0 = mad( _rowA, sub_group_broadcast( colB, 0 ), _result.s0 );  \
_result.s1 = mad( _rowA, sub_group_broadcast( colB, 1 ), _result.s1 );  \
_result.s2 = mad( _rowA, sub_group_broadcast( colB, 2 ), _result.s2 );  \
_result.s3 = mad( _rowA, sub_group_broadcast( colB, 3 ), _result.s3 );  \
_result.s4 = mad( _rowA, sub_group_broadcast( colB, 4 ), _result.s4 );  \
_result.s5 = mad( _rowA, sub_group_broadcast( colB, 5 ), _result.s5 );  \
_result.s6 = mad( _rowA, sub_group_broadcast( colB, 6 ), _result.s6 );  \
_result.s7 = mad( _rowA, sub_group_broadcast( colB, 7 ), _result.s7 );  \
}
typedef CAT( Dtype, KERNEL_WIDTH ) Dtype_t;
if( TILE_N_LAST == 0 || global_x < WIDTH1 / TILE_N )
{
Dtype8  blockC00 = 0.f;
Dtype8  blockC10 = 0.f;
Dtype8  blockC20 = 0.f;
Dtype8  blockC30 = 0.f;
int curr_x = ( global_y % output_width ) * STRIDE_X;
int curr_y = ( global_y / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y = curr_y;
#endif
const __global Dtype *src0_read = src0
+ aligned_input_size * global_z
+ (curr_y - INPUT_PAD_H) * ROW_PITCH
+ (curr_x - INPUT_PAD_W);
const __global Dtype *src1_read = src1 + ( global_x * TILE_N  * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y = saved_y;
#endif
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_W == 0 && INPUT_PAD_H == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
#if KERNEL_WIDTH == 3
Dtype_t blockA00 = vload3(0, src0_read);
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#else
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read )[  0  ];
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#endif
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y >= INPUT_PAD_H &&
curr_y < input_height + INPUT_PAD_H &&
curr_x + pos * DILATION_X >= INPUT_PAD_W &&
curr_x + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y += DILATION_Y;
#endif
src0_read += (ROW_PITCH * DILATION_Y);
Dtype blockB00[KERNEL_WIDTH*4];
Dtype8* p8BlockB00 = (Dtype8*)blockB00;
Dtype4* p4BlockB00 = (Dtype4*)blockB00;
Dtype*  pBlockB00 =  (Dtype* )blockB00;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
p8BlockB00[interleaved_y] = as_Dtype8( SUB_GROUP_BLOCK_READ8( (const __global INT_TYPE *)src1_read ) );
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
p4BlockB00[KERNEL_WIDTH - 1] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE *)src1_read ) );
src1_read += WIDTH1 * 2;
}
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
} )
kernel_y = interleaved_y * 2;
if ( kernel_width_is_odd )
{
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
}
}
while( ++patch_row < KERNEL_HEIGHT );
src0_read += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y);
}
while ( ++patch_depth < INPUT_DEPTH );
int out_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M ) / output_width + OUT_PADDING_HEIGHT) * OUT_PITCH_X
+ ( ( global_y * TILE_M ) % output_width ) + OUT_PADDING_LEFT;
__global Dtype *out = dst + out_offset;
#if APPLY_BIAS
Dtype bias[4];
Dtype4 *bias_vec;
bias_vec = (Dtype4*)bias;
*bias_vec = as_Dtype4(SUB_GROUP_BLOCK_READ4((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1] + bias[2] + bias[3];
}
#else
const Dtype bias[4] = {0, 0, 0, 0};
#endif
if (global_y * TILE_M < output_width * output_height )
{
for (int i = 0; i < 8; i++)
{
ACTIVATION_FUNCTION(dst, out_offset + ( 0 + i ) * out_pitch_y, blockC00[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
ACTIVATION_FUNCTION(dst, out_offset + ( 8 + i ) * out_pitch_y, blockC10[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + 8 + i);
ACTIVATION_FUNCTION(dst, out_offset + ( 16 + i ) * out_pitch_y, blockC20[i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + 16 + i);
ACTIVATION_FUNCTION(dst, out_offset + ( 24 + i ) * out_pitch_y, blockC30[i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + 24 + i);
}
}
}
#if TILE_N_LAST > 0
else
{
int i = 0;
Dtype8  blockC[TILE_N_LAST_DIV8];
LOOP(TILE_N_LAST_DIV8, i,
{
blockC[i] = 0.f;
} )
int curr_x = ( global_y % output_width ) * STRIDE_X;
int curr_y = ( global_y / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y = curr_y;
#endif
const __global Dtype *src0_read = src0
+ aligned_input_size * global_z
+ (curr_y - INPUT_PAD_H) * ROW_PITCH
+ (curr_x - INPUT_PAD_W);
const __global Dtype *src1_read = src1 + ( global_x * TILE_N  * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y = saved_y;
#endif
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_W == 0 && INPUT_PAD_H == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read )[  0  ];
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y >= INPUT_PAD_H &&
curr_y < input_height + INPUT_PAD_H &&
curr_x + pos * DILATION_X >= INPUT_PAD_W &&
curr_x + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y += DILATION_Y;
#endif
src0_read += (ROW_PITCH * DILATION_Y);
Dtype blockB[KERNEL_WIDTH * TILE_N_LAST_DIV8];
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
#if TILE_N_LAST_DIV8 == 1
Dtype2* p2BlockB = (Dtype2* )blockB;
p2BlockB[interleaved_y] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 2
Dtype4* p4BlockB = (Dtype4* )blockB;
p4BlockB[interleaved_y] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 3
Dtype6* p6BlockB = (Dtype6* )blockB;
(*((Dtype8*)(&p6BlockB[interleaved_y]))).s0123 = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
(*((Dtype8*)(&p6BlockB[interleaved_y]))).s45 = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)(src1_read + 4 * 8) ) );
#endif
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
#if TILE_N_LAST_DIV8 == 1
Dtype* pBlockB = (Dtype* )blockB;
pBlockB[KERNEL_WIDTH - 1] = as_Dtype( SUB_GROUP_BLOCK_READ( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 2
Dtype2* p2BlockB = (Dtype2* )blockB;
p2BlockB[KERNEL_WIDTH - 1] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 3
Dtype3* p3BlockB = (Dtype3* )blockB;
p3BlockB[KERNEL_WIDTH - 1].s01 = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
p3BlockB[KERNEL_WIDTH - 1].s2 = as_Dtype( SUB_GROUP_BLOCK_READ( (const __global INT_TYPE*) (src1_read + 2 * 8) ) );
#endif
src1_read += WIDTH1 * 2;
}
Dtype* pBlockB = (Dtype*)blockB;
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC[0], pblockA00[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC[0], pblockA00[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 2
DOT_PRODUCT_8( blockC[1], pblockA00[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC[1], pblockA00[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 3
DOT_PRODUCT_8( blockC[2], pblockA00[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC[2], pblockA00[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#endif
#endif
} )
kernel_y = interleaved_y * 2;
if ( kernel_width_is_odd )
{
DOT_PRODUCT_8( blockC[0], pblockA00[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 2
DOT_PRODUCT_8( blockC[1], pblockA00[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 3
DOT_PRODUCT_8( blockC[2], pblockA00[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#endif
#endif
}
}
while( ++patch_row < KERNEL_HEIGHT );
src0_read += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
}
while ( ++patch_depth < INPUT_DEPTH );
int out_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M ) / output_width + OUT_PADDING_HEIGHT) * OUT_PITCH_X
+ ( ( global_y * TILE_M ) % output_width ) + OUT_PADDING_LEFT;
__global Dtype *out = dst + out_offset;
#if APPLY_BIAS
Dtype bias[4];
Dtype4 *bias_vec;
bias_vec = (Dtype4*)bias;
*bias_vec = as_Dtype4(SUB_GROUP_BLOCK_READ4((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1] + bias[2] + bias[3];
}
#else
const Dtype bias[4] = {0, 0, 0, 0};
#endif
if (global_y * TILE_M < output_width * output_height )
{
for (int i = 0; i < 8; i++)
{
if ( TILE_N_LAST_DIV8 > 0 )
{
ACTIVATION_FUNCTION(dst, out_offset + ( 0+i) * out_pitch_y, blockC[0][i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
}
if ( TILE_N_LAST_DIV8 > 1 )
{
ACTIVATION_FUNCTION(dst, out_offset + ( 8+i) * out_pitch_y, blockC[1][i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
}
if ( TILE_N_LAST_DIV8 > 2 )
{
ACTIVATION_FUNCTION(dst, out_offset + (16+i) * out_pitch_y, blockC[2][i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
}
if ( TILE_N_LAST_DIV8 > 3 )
{
ACTIVATION_FUNCTION(dst, out_offset + (24+i) * out_pitch_y, blockC[3][i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
}
}
#endif
}
#endif
#ifdef GEMM_LIKE_CONV_32_2
#define TILE_M          2
#define TILE_K          KERNEL_WIDTH
#define TILE_N          32
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void Conv_Interleaved(GEMM_LIKE_KERNEL_ARGS)
{
__global Dtype *dst = dst_base + dst_offset;
const int group_x = get_group_id(0);
const int group_y = get_group_id(1);
const int global_x = get_global_id(0);
const int global_y = get_global_id(1);
const int global_z = get_global_id(2);
int interleaved_y;
int kernel_y;
int kernel_idx;
#define DOT_PRODUCT_8( _result, _rowA, colB )    \
{   \
_result.s0 = mad( _rowA, sub_group_broadcast( colB, 0 ), _result.s0 );  \
_result.s1 = mad( _rowA, sub_group_broadcast( colB, 1 ), _result.s1 );  \
_result.s2 = mad( _rowA, sub_group_broadcast( colB, 2 ), _result.s2 );  \
_result.s3 = mad( _rowA, sub_group_broadcast( colB, 3 ), _result.s3 );  \
_result.s4 = mad( _rowA, sub_group_broadcast( colB, 4 ), _result.s4 );  \
_result.s5 = mad( _rowA, sub_group_broadcast( colB, 5 ), _result.s5 );  \
_result.s6 = mad( _rowA, sub_group_broadcast( colB, 6 ), _result.s6 );  \
_result.s7 = mad( _rowA, sub_group_broadcast( colB, 7 ), _result.s7 );  \
}
typedef CAT( Dtype, KERNEL_WIDTH ) Dtype_t;
if( TILE_N_LAST == 0 || global_x < WIDTH1 / TILE_N )
{
Dtype8  blockC00 = 0.f;
Dtype8  blockC10 = 0.f;
Dtype8  blockC20 = 0.f;
Dtype8  blockC30 = 0.f;
Dtype8  blockC01 = 0.f;
Dtype8  blockC11 = 0.f;
Dtype8  blockC21 = 0.f;
Dtype8  blockC31 = 0.f;
int curr_x0 = ( ( global_y * TILE_M + 0 ) % output_width ) * STRIDE_X;
int curr_x1 = ( ( global_y * TILE_M + 1 ) % output_width ) * STRIDE_X;
int curr_y0 = ( ( global_y * TILE_M + 0 ) / output_width ) * STRIDE_Y;
int curr_y1 = ( ( global_y * TILE_M + 1 ) / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y0 = curr_y0;
int saved_y1 = curr_y1;
#endif
const __global Dtype *src0_read0 = src0
+ aligned_input_size * global_z
+ (curr_y0 - INPUT_PAD_H) * ROW_PITCH
+ curr_x0 - INPUT_PAD_W;
const __global Dtype *src0_read1 = src0
+ aligned_input_size * global_z
+ (curr_y1 - INPUT_PAD_H) * ROW_PITCH
+ curr_x1 - INPUT_PAD_W;
const __global Dtype *src1_read = src1 + ( global_x * TILE_N * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_H == 0 && INPUT_PAD_W == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
#if KERNEL_WIDTH == 3
Dtype_t blockA00 = vload3(0, src0_read0); src0_read0 += ROW_PITCH;
Dtype_t blockA01 = vload3(0, src0_read1); src0_read1 += ROW_PITCH;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
Dtype*  pblockA01 = (Dtype*)(&blockA01);
#else
Dtype_t blockA00 = { (Dtype)0.f };
Dtype_t blockA01 = { (Dtype)0.f };
Dtype*  pblockA00 = (Dtype*)(&blockA00);
Dtype*  pblockA01 = (Dtype*)(&blockA01);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_x0 + pos < input_width)
pblockA00[pos] = src0_read0[pos];
if (curr_x1 + pos < input_width)
pblockA01[pos] = src0_read1[pos];
})
src0_read0 += ROW_PITCH;
src0_read1 += ROW_PITCH;
#endif
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y0 >= INPUT_PAD_H &&
curr_y0 < input_height + INPUT_PAD_H &&
curr_x0 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x0 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read0[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y0 += DILATION_Y;
Dtype_t blockA01;
Dtype*  pblockA01 = (Dtype*)(&blockA01);
pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y1 >= INPUT_PAD_H &&
curr_y1 < input_height + INPUT_PAD_H &&
curr_x1 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x1 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA01[pos] = src0_read1[pos * DILATION_X];
else
pblockA01[pos] = 0;
})
curr_y1 += DILATION_Y;
src0_read0 += (ROW_PITCH * DILATION_Y);
src0_read1 += (ROW_PITCH * DILATION_Y);
#endif
Dtype blockB00[KERNEL_WIDTH*4];
Dtype8* p8BlockB00 = (Dtype8*)blockB00;
Dtype4* p4BlockB00 = (Dtype4*)blockB00;
Dtype*  pBlockB00 =  (Dtype* )blockB00;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
p8BlockB00[interleaved_y] = as_Dtype8( SUB_GROUP_BLOCK_READ8( (const __global INT_TYPE*)src1_read ) );
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
p4BlockB00[KERNEL_WIDTH - 1] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
src1_read += WIDTH1 * 2;
}
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC01, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC01, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC11, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC11, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC21, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC21, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC31, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC31, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
} )
if ( kernel_width_is_odd )
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC00, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC01, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC10, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC11, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC20, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC21, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC30, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_8( blockC31, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
}
}
while( ++patch_row < KERNEL_HEIGHT );
#if INPUT_PAD_W != 0 || INPUT_PAD_H != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y0 = saved_y0;
curr_y1 = saved_y1;
#endif
src0_read0 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
src0_read1 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
}
while ( ++patch_depth < INPUT_DEPTH );
int out0_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 0 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 0 ) % output_width ) + OUT_PADDING_LEFT;
int out1_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 1 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 1 ) % output_width ) + OUT_PADDING_LEFT;
#if APPLY_BIAS
Dtype bias[4];
Dtype4 *bias_vec;
bias_vec = (Dtype4*)bias;
*bias_vec = as_Dtype4(SUB_GROUP_BLOCK_READ4((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1] + bias[2] + bias[3];
}
#else
const Dtype bias[4] = {0, 0, 0, 0};
#endif
if( global_y * TILE_M < output_width * output_height )
{
for( int i = 0; i < 8; i++ )
{
ACTIVATION_FUNCTION(dst, out0_offset + ( 0+i) * out_pitch_y, blockC00[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
ACTIVATION_FUNCTION(dst, out0_offset + ( 8+i) * out_pitch_y, blockC10[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
ACTIVATION_FUNCTION(dst, out0_offset + (16+i) * out_pitch_y, blockC20[i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
ACTIVATION_FUNCTION(dst, out0_offset + (24+i) * out_pitch_y, blockC30[i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
if( global_y * TILE_M + 1 < output_width * output_height )
{
for( int i = 0; i < 8; i++ )
{
ACTIVATION_FUNCTION(dst, out1_offset + ( 0+i) * out_pitch_y, blockC01[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
ACTIVATION_FUNCTION(dst, out1_offset + ( 8+i) * out_pitch_y, blockC11[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
ACTIVATION_FUNCTION(dst, out1_offset + (16+i) * out_pitch_y, blockC21[i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
ACTIVATION_FUNCTION(dst, out1_offset + (24+i) * out_pitch_y, blockC31[i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
}
#if TILE_N_LAST > 0
else
{
int i = 0;
Dtype8  blockC0[TILE_N_LAST_DIV8];
Dtype8  blockC1[TILE_N_LAST_DIV8];
LOOP(TILE_N_LAST_DIV8, i,
{
blockC0[i] = 0.f;
blockC1[i] = 0.f;
} )
int curr_x0 = ( ( global_y * TILE_M + 0 ) % output_width ) * STRIDE_X;
int curr_x1 = ( ( global_y * TILE_M + 1 ) % output_width ) * STRIDE_X;
int curr_y0 = ( ( global_y * TILE_M + 0 ) / output_width ) * STRIDE_Y;
int curr_y1 = ( ( global_y * TILE_M + 1 ) / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y0 = curr_y0;
int saved_y1 = curr_y1;
#endif
const __global Dtype *src0_read0 = src0
+ aligned_input_size * global_z
+ (curr_y0 - INPUT_PAD_H) * ROW_PITCH
+ curr_x0 - INPUT_PAD_W;
const __global Dtype *src0_read1 = src0
+ aligned_input_size * global_z
+ (curr_y1 - INPUT_PAD_H) * ROW_PITCH
+ curr_x1 - INPUT_PAD_W;
const __global Dtype *src1_read = src1 + ( global_x * TILE_N  * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_H == 0 && INPUT_PAD_W == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read0 )[  0  ]; src0_read0 += ROW_PITCH;
Dtype_t blockA01 = ( (const __global Dtype_t*)src0_read1 )[  0  ]; src0_read1 += ROW_PITCH;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
Dtype*  pblockA01 = (Dtype*)(&blockA01);
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y0 >= INPUT_PAD_H &&
curr_y0 < input_height + INPUT_PAD_H &&
curr_x0 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x0 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read0[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y0 += DILATION_Y;
Dtype_t blockA01;
Dtype*  pblockA01 = (Dtype*)(&blockA01);
pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y1 >= INPUT_PAD_H &&
curr_y1 < input_height + INPUT_PAD_H &&
curr_x1 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x1 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA01[pos] = src0_read1[pos * DILATION_X];
else
pblockA01[pos] = 0;
})
curr_y1 += DILATION_Y;
src0_read0 += (ROW_PITCH * DILATION_Y);
src0_read1 += (ROW_PITCH * DILATION_Y);
#endif
Dtype blockB[KERNEL_WIDTH * TILE_N_LAST_DIV8];
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
#if TILE_N_LAST_DIV8 == 1
Dtype2* p2BlockB = (Dtype2* )blockB;
p2BlockB[interleaved_y] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 2
Dtype4* p4BlockB = (Dtype4* )blockB;
p4BlockB[interleaved_y] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 3
Dtype6* p6BlockB = (Dtype6* )blockB;
(*((Dtype8*)(&p6BlockB[interleaved_y]))).s0123 = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
(*((Dtype8*)(&p6BlockB[interleaved_y]))).s45 = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)(src1_read + 4 * 8) ) );
#endif
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
#if TILE_N_LAST_DIV8 == 1
Dtype* pBlockB = (Dtype* )blockB;
pBlockB[KERNEL_WIDTH - 1] = as_Dtype( SUB_GROUP_BLOCK_READ( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 2
Dtype2* p2BlockB = (Dtype2* )blockB;
p2BlockB[KERNEL_WIDTH - 1] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
#elif TILE_N_LAST_DIV8 == 3
Dtype3* p3BlockB = (Dtype3* )blockB;
p3BlockB[KERNEL_WIDTH - 1].s01 = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
p3BlockB[KERNEL_WIDTH - 1].s2 = as_Dtype( SUB_GROUP_BLOCK_READ( (const __global INT_TYPE*) (src1_read + 8) ) );
#endif
src1_read += WIDTH1 * 2;
}
Dtype* pBlockB = (Dtype*)blockB;
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_8( blockC0[0], pblockA00[kernel_y    ], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[0], pblockA01[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC0[0], pblockA00[kernel_y + 1], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[0], pblockA01[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 2
DOT_PRODUCT_8( blockC0[1], pblockA00[kernel_y    ], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[1], pblockA01[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC0[1], pblockA00[kernel_y + 1], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[1], pblockA01[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 3
DOT_PRODUCT_8( blockC0[2], pblockA00[kernel_y    ], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[2], pblockA01[kernel_y    ], pBlockB[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_8( blockC0[2], pblockA00[kernel_y + 1], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[2], pblockA01[kernel_y + 1], pBlockB[kernel_idx] ); kernel_idx++;
#endif
#endif
} )
kernel_y = interleaved_y * 2;
if ( kernel_width_is_odd )
{
DOT_PRODUCT_8( blockC0[0], pblockA00[kernel_y], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[0], pblockA01[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 2
DOT_PRODUCT_8( blockC0[1], pblockA00[kernel_y], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[1], pblockA01[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#if TILE_N_LAST_DIV8 >= 3
DOT_PRODUCT_8( blockC0[2], pblockA00[kernel_y], pBlockB[kernel_idx] );
DOT_PRODUCT_8( blockC1[2], pblockA01[kernel_y], pBlockB[kernel_idx] ); kernel_idx++;
#endif
#endif
}
}
while( ++patch_row < KERNEL_HEIGHT );
#if INPUT_PAD_W != 0 || INPUT_PAD_H != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y0 = saved_y0;
curr_y1 = saved_y1;
#endif
src0_read0 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
src0_read1 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
}
while ( ++patch_depth < INPUT_DEPTH );
int out0_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 0 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 0 ) % output_width ) + OUT_PADDING_LEFT;
int out1_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 1 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 1 ) % output_width ) + OUT_PADDING_LEFT;
__global Dtype *out1 = dst + out1_offset;
#if APPLY_BIAS
Dtype bias[4];
Dtype4 *bias_vec;
bias_vec = (Dtype4*)bias;
*bias_vec = as_Dtype4(SUB_GROUP_BLOCK_READ4((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1] + bias[2] + bias[3];
}
#else
const Dtype bias[4] = {0, 0, 0, 0};
#endif
if( global_y * TILE_M < output_width * output_height )
{
for( int i = 0; i < 8; i++ )
{
if ( TILE_N_LAST_DIV8 > 0 )
{
ACTIVATION_FUNCTION(dst, out0_offset + ( 0+i) * out_pitch_y, blockC0[0][i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
}
if ( TILE_N_LAST_DIV8 > 1 )
{
ACTIVATION_FUNCTION(dst, out0_offset + ( 8+i) * out_pitch_y, blockC0[1][i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
}
if ( TILE_N_LAST_DIV8 > 2 )
{
ACTIVATION_FUNCTION(dst, out0_offset + (16+i) * out_pitch_y, blockC0[2][i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
}
if ( TILE_N_LAST_DIV8 > 3 )
{
ACTIVATION_FUNCTION(dst, out0_offset + (24+i) * out_pitch_y, blockC0[3][i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
}
if( global_y * TILE_M + 1 < output_width * output_height )
{
for( int i = 0; i < 8; i++ )
{
if ( TILE_N_LAST_DIV8 > 0 )
{
ACTIVATION_FUNCTION(dst, out1_offset + ( 0+i) * out_pitch_y, blockC1[0][i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i);
}
if ( TILE_N_LAST_DIV8 > 1 )
{
ACTIVATION_FUNCTION(dst, out1_offset + ( 8+i) * out_pitch_y, blockC1[1][i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 8);
}
if ( TILE_N_LAST_DIV8 > 2 )
{
ACTIVATION_FUNCTION(dst, out1_offset + (16+i) * out_pitch_y, blockC1[2][i] + SUBGROUP_GET_BIAS(2, i), group_x * TILE_N + i + 16);
}
if ( TILE_N_LAST_DIV8 > 3 )
{
ACTIVATION_FUNCTION(dst, out1_offset + (24+i) * out_pitch_y, blockC1[3][i] + SUBGROUP_GET_BIAS(3, i), group_x * TILE_N + i + 24);
}
}
}
}
#endif
}
#endif
#if defined(GEMM_LIKE_CONV_32_2_SIMD16) || defined(GEMM_LIKE_CONV_32_1_SIMD16)
#define INTERLEAVED_SIMD16_OUTPUT(_out_, _offset_,  _m_) do {\
if (global_y * TILE_M < output_width * output_height ) \
{ \
if ( ( OUT_DEPTH % TILE_N ) == 0 ) {\
for (int i = 0; i < 16; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_ [i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
ACTIVATION_FUNCTION(_out_, _offset_ + (16+i) * out_pitch_y, blockC1 ##_m_ [i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 16); \
} \
} \
else if( ( OUT_DEPTH % 16 ) == 0 ) { \
if ( ( global_x + 1 ) < get_global_size(0) ) { \
for ( int i = 0; i < 16; i++ ) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_ [i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
ACTIVATION_FUNCTION(_out_, _offset_ + (16+i) * out_pitch_y, blockC1 ##_m_ [i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 16); \
} \
} \
else { \
for (int i = 0; i < 16; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_ [i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
} \
} \
} \
else { \
if ( ( global_x + 1 ) < get_global_size(0) ) \
{ \
for ( int i = 0; i < 16; i++ ) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
ACTIVATION_FUNCTION(_out_, _offset_ + (16+i) * out_pitch_y, blockC1 ##_m_[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 16); \
} \
} \
else { \
if ( (OUT_DEPTH % TILE_N) > 16 ) { \
for (int i = 0; i < 16 ; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
} \
for (int i = 0; i < OUT_DEPTH % 16 ; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + (16+i) * out_pitch_y, blockC1 ##_m_[i] + SUBGROUP_GET_BIAS(1, i), group_x * TILE_N + i + 16); \
} \
} \
else { \
for (int i = 0; i < OUT_DEPTH % 16 ; i++) \
{ \
ACTIVATION_FUNCTION(_out_, _offset_ + ( 0+i) * out_pitch_y, blockC0 ##_m_[i] + SUBGROUP_GET_BIAS(0, i), group_x * TILE_N + i); \
} \
} \
} \
} \
} \
}while(0)
#endif
#ifdef GEMM_LIKE_CONV_32_1_SIMD16
#define TILE_M          1
#define TILE_K          KERNEL_WIDTH
#define TILE_N          32
__attribute__((intel_reqd_sub_group_size(16)))
__kernel void Conv_Interleaved(GEMM_LIKE_KERNEL_ARGS)
{
__global Dtype *dst = dst_base + dst_offset;
const int group_x = get_group_id(0);
const int group_y = get_group_id(1);
const int global_x = get_global_id(0);
const int global_y = get_global_id(1);
const int global_z = get_global_id(2);
int interleaved_y;
int kernel_y;
int kernel_idx;
Dtype16  blockC00 = 0.f;
Dtype16  blockC10 = 0.f;
int curr_x = ( global_y % output_width ) * STRIDE_X;
int curr_y = ( global_y / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y = curr_y;
#endif
const __global Dtype *src0_read = src0
+ aligned_input_size * global_z
+ (curr_y - INPUT_PAD_H) * ROW_PITCH
+ curr_x - INPUT_PAD_W;
const __global Dtype *src0_read_orig = src0_read;
const __global Dtype *src1_read = src1 + ( global_x * TILE_N * 2 );
#define DOT_PRODUCT_16( _result, _rowA, colB )    \
{   \
_result.s0 = mad( _rowA, sub_group_broadcast( colB,  0 ), _result.s0 );  \
_result.s1 = mad( _rowA, sub_group_broadcast( colB,  1 ), _result.s1 );  \
_result.s2 = mad( _rowA, sub_group_broadcast( colB,  2 ), _result.s2 );  \
_result.s3 = mad( _rowA, sub_group_broadcast( colB,  3 ), _result.s3 );  \
_result.s4 = mad( _rowA, sub_group_broadcast( colB,  4 ), _result.s4 );  \
_result.s5 = mad( _rowA, sub_group_broadcast( colB,  5 ), _result.s5 );  \
_result.s6 = mad( _rowA, sub_group_broadcast( colB,  6 ), _result.s6 );  \
_result.s7 = mad( _rowA, sub_group_broadcast( colB,  7 ), _result.s7 );  \
_result.s8 = mad( _rowA, sub_group_broadcast( colB,  8 ), _result.s8 );  \
_result.s9 = mad( _rowA, sub_group_broadcast( colB,  9 ), _result.s9 );  \
_result.sa = mad( _rowA, sub_group_broadcast( colB, 10 ), _result.sa );  \
_result.sb = mad( _rowA, sub_group_broadcast( colB, 11 ), _result.sb );  \
_result.sc = mad( _rowA, sub_group_broadcast( colB, 12 ), _result.sc );  \
_result.sd = mad( _rowA, sub_group_broadcast( colB, 13 ), _result.sd );  \
_result.se = mad( _rowA, sub_group_broadcast( colB, 14 ), _result.se );  \
_result.sf = mad( _rowA, sub_group_broadcast( colB, 15 ), _result.sf );  \
}
typedef CAT( Dtype, KERNEL_WIDTH ) Dtype_t;
int patch_depth = 0;
__attribute__((opencl_unroll_hint(1)))
do
{
int patch_row = 0;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y = saved_y;
#endif
__attribute__((opencl_unroll_hint(1)))
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_W == 0 && INPUT_PAD_H == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
#if KERNEL_WIDTH == 3
Dtype_t blockA00 = vload3(0, src0_read);
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#else
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read )[  0  ];
Dtype*  pblockA00 = (Dtype*)(&blockA00);
#endif
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y >= INPUT_PAD_H &&
curr_y < input_height + INPUT_PAD_H &&
curr_x + pos * DILATION_X >= INPUT_PAD_W &&
curr_x + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y += DILATION_Y;
#endif
src0_read += ROW_PITCH * DILATION_Y;
INT_TYPE blockB00[KERNEL_WIDTH * 2];
INT_TYPE4* p4BlockB00 = (INT_TYPE4*)blockB00;
INT_TYPE2* p2BlockB00 = (INT_TYPE2*)blockB00;
Dtype* pBlockB00  = (Dtype*)blockB00;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
p4BlockB00[interleaved_y] = SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read );
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
p2BlockB00[KERNEL_WIDTH - 1] = SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read );
src1_read += WIDTH1 * 2;
}
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
} )
if ( kernel_width_is_odd )
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
}
}
while( ++patch_row < KERNEL_HEIGHT );
src0_read += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y );
}
while ( ++patch_depth < INPUT_DEPTH );
int out_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M ) / output_width + OUT_PADDING_HEIGHT) * OUT_PITCH_X
+ ( ( global_y * TILE_M ) % output_width ) + OUT_PADDING_LEFT;
__global Dtype *out = dst + out_offset;
#if APPLY_BIAS
Dtype bias[2];
Dtype2 *bias_vec;
bias_vec = (Dtype2*)bias;
*bias_vec = as_Dtype2(SUB_GROUP_BLOCK_READ2((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1];
}
#else
const Dtype bias[2] = {0, 0};
#endif
INTERLEAVED_SIMD16_OUTPUT(dst, out_offset, 0);
}
#endif
#ifdef GEMM_LIKE_CONV_32_2_SIMD16
#define TILE_M          2
#define TILE_K          KERNEL_WIDTH
#define TILE_N          32
__attribute__((intel_reqd_sub_group_size(16)))
__kernel void Conv_Interleaved(GEMM_LIKE_KERNEL_ARGS)
{
__global Dtype *dst = dst_base + dst_offset;
const int group_x = get_group_id(0);
const int group_y = get_group_id(1);
const int global_x = get_global_id(0);
const int global_y = get_global_id(1);
const int global_z = get_global_id(2);
int interleaved_y;
int kernel_y;
int kernel_idx;
#define DOT_PRODUCT_16( _result, _rowA, colB )    \
{   \
_result.s0 = mad( _rowA, sub_group_broadcast( colB, 0 ), _result.s0 );  \
_result.s1 = mad( _rowA, sub_group_broadcast( colB, 1 ), _result.s1 );  \
_result.s2 = mad( _rowA, sub_group_broadcast( colB, 2 ), _result.s2 );  \
_result.s3 = mad( _rowA, sub_group_broadcast( colB, 3 ), _result.s3 );  \
_result.s4 = mad( _rowA, sub_group_broadcast( colB, 4 ), _result.s4 );  \
_result.s5 = mad( _rowA, sub_group_broadcast( colB, 5 ), _result.s5 );  \
_result.s6 = mad( _rowA, sub_group_broadcast( colB, 6 ), _result.s6 );  \
_result.s7 = mad( _rowA, sub_group_broadcast( colB, 7 ), _result.s7 );  \
_result.s8 = mad( _rowA, sub_group_broadcast( colB, 8 ), _result.s8 );  \
_result.s9 = mad( _rowA, sub_group_broadcast( colB, 9 ), _result.s9 );  \
_result.sa = mad( _rowA, sub_group_broadcast( colB, 10 ), _result.sa );  \
_result.sb = mad( _rowA, sub_group_broadcast( colB, 11 ), _result.sb );  \
_result.sc = mad( _rowA, sub_group_broadcast( colB, 12 ), _result.sc );  \
_result.sd = mad( _rowA, sub_group_broadcast( colB, 13 ), _result.sd );  \
_result.se = mad( _rowA, sub_group_broadcast( colB, 14 ), _result.se );  \
_result.sf = mad( _rowA, sub_group_broadcast( colB, 15 ), _result.sf );  \
}
typedef CAT( Dtype, KERNEL_WIDTH ) Dtype_t;
{
Dtype16  blockC00 = 0.f;
Dtype16  blockC10 = 0.f;
Dtype16  blockC01 = 0.f;
Dtype16  blockC11 = 0.f;
int curr_x0 = ( ( global_y * TILE_M + 0 ) % output_width ) * STRIDE_X;
int curr_x1 = ( ( global_y * TILE_M + 1 ) % output_width ) * STRIDE_X;
int curr_y0 = ( ( global_y * TILE_M + 0 ) / output_width ) * STRIDE_Y;
int curr_y1 = ( ( global_y * TILE_M + 1 ) / output_width ) * STRIDE_Y;
#if INPUT_PAD_H != 0 || INPUT_PAD_W != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
int saved_y0 = curr_y0;
int saved_y1 = curr_y1;
#endif
const __global Dtype *src0_read0 = src0
+ aligned_input_size * global_z
+ (curr_y0 - INPUT_PAD_H) * ROW_PITCH
+ curr_x0 - INPUT_PAD_W;
const __global Dtype *src0_read1 = src0
+ aligned_input_size * global_z
+ (curr_y1 - INPUT_PAD_H) * ROW_PITCH
+ curr_x1 - INPUT_PAD_W;
const __global Dtype *src1_read = src1 + ( global_x * TILE_N * 2);
int patch_depth = 0;
do
{
int patch_row = 0;
do
{
const bool kernel_width_is_odd = KERNEL_WIDTH % 2 == 1;
#if INPUT_PAD_H == 0 && INPUT_PAD_W == 0 && DILATION_X == 1 && DILATION_Y == 1 && INPUT_PAD_BOTTOM == 0 && INPUT_PAD_RIGHT == 0
Dtype_t blockA00 = ( (const __global Dtype_t*)src0_read0 )[  0  ]; src0_read0 += ROW_PITCH;
Dtype_t blockA01 = ( (const __global Dtype_t*)src0_read1 )[  0  ]; src0_read1 += ROW_PITCH;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
Dtype*  pblockA01 = (Dtype*)(&blockA01);
#else
Dtype_t blockA00;
Dtype*  pblockA00 = (Dtype*)(&blockA00);
int pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y0 >= INPUT_PAD_H &&
curr_y0 < input_height + INPUT_PAD_H &&
curr_x0 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x0 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA00[pos] = src0_read0[pos * DILATION_X];
else
pblockA00[pos] = 0;
})
curr_y0 += DILATION_Y;
Dtype_t blockA01;
Dtype*  pblockA01 = (Dtype*)(&blockA01);
pos = 0;
LOOP(KERNEL_WIDTH, pos,
{
if (curr_y1 >= INPUT_PAD_H &&
curr_y1 < input_height + INPUT_PAD_H &&
curr_x1 + pos * DILATION_X >= INPUT_PAD_W &&
curr_x1 + pos * DILATION_X < input_width + INPUT_PAD_W)
pblockA01[pos] = src0_read1[pos * DILATION_X];
else
pblockA01[pos] = 0;
})
curr_y1 += DILATION_Y;
src0_read0 += (ROW_PITCH * DILATION_Y);
src0_read1 += (ROW_PITCH * DILATION_Y);
#endif
Dtype blockB00[KERNEL_WIDTH*2];
Dtype4* p4BlockB00 = (Dtype4*)blockB00;
Dtype2* p2BlockB00 = (Dtype2*)blockB00;
Dtype*  pBlockB00 =  (Dtype* )blockB00;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
p4BlockB00[interleaved_y] = as_Dtype4( SUB_GROUP_BLOCK_READ4( (const __global INT_TYPE*)src1_read ) );
src1_read += WIDTH1 * 2;
} )
if ( kernel_width_is_odd )
{
p2BlockB00[KERNEL_WIDTH - 1] = as_Dtype2( SUB_GROUP_BLOCK_READ2( (const __global INT_TYPE*)src1_read ) );
src1_read += WIDTH1 * 2;
}
kernel_idx = 0;
interleaved_y = 0;
LOOP(KERNEL_WIDTH_DIV2, interleaved_y,
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC01, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC01, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y    ], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC11, pblockA01[kernel_y    ], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y + 1], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC11, pblockA01[kernel_y + 1], pBlockB00[kernel_idx] ); kernel_idx++;
} )
if ( kernel_width_is_odd )
{
kernel_y = interleaved_y * 2;
DOT_PRODUCT_16( blockC00, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC01, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
DOT_PRODUCT_16( blockC10, pblockA00[kernel_y], pBlockB00[kernel_idx] );
DOT_PRODUCT_16( blockC11, pblockA01[kernel_y], pBlockB00[kernel_idx] ); kernel_idx++;
}
}
while( ++patch_row < KERNEL_HEIGHT );
#if INPUT_PAD_W != 0 || INPUT_PAD_H != 0 || DILATION_X != 1 || DILATION_Y != 1 || INPUT_PAD_BOTTOM != 0 || INPUT_PAD_RIGHT != 0
curr_y0 = saved_y0;
curr_y1 = saved_y1;
#endif
src0_read0 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y);
src0_read1 += slice_pitch - ( KERNEL_HEIGHT * ROW_PITCH * DILATION_Y);
}
while ( ++patch_depth < INPUT_DEPTH );
int out0_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 0 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 0 ) % output_width ) + OUT_PADDING_LEFT;
int out1_offset = global_z * out_pitch_z
+ ( group_x * TILE_N ) * out_pitch_y
+ ( ( global_y * TILE_M + 1 ) / output_width + OUT_PADDING_HEIGHT ) * OUT_PITCH_X
+ ( ( global_y * TILE_M + 1 ) % output_width ) + OUT_PADDING_LEFT;
#if APPLY_BIAS
Dtype bias[2];
Dtype2 *bias_vec;
bias_vec = (Dtype2*)bias;
*bias_vec = as_Dtype2(SUB_GROUP_BLOCK_READ2((__global INT_TYPE *)biases_base + group_x * TILE_N));
if (group_x > 0xFFFFFFFEul) {
dst[0] = bias[0] + bias[1];
}
#else
const Dtype bias[2] = {0, 0};
#endif
INTERLEAVED_SIMD16_OUTPUT(dst, out0_offset, 0);
INTERLEAVED_SIMD16_OUTPUT(dst, out1_offset, 1);
}
}
#endif
#elif defined KERNEL_DWCONV
__kernel void DWCONV(
ELTWISE_DATA_ARG
FUSED_ARG
__global Dtype* image_data,
__global Dtype* kernel_data,
BIAS_KERNEL_ARG
__global Dtype* convolved_image_base,
const int convolved_image_offset,
const ushort input_width,
const ushort input_height,
const ushort output_width,
const ushort output_height) {
__global Dtype* convolved_image = convolved_image_base + convolved_image_offset;
const int outputX = get_global_id(0);
const int outputY = get_global_id(1);
const int outputZ = get_global_id(2);
if(outputX < output_width && outputY < output_height)
{
Dtype sum = 0.;
const int org_y = outputY * STRIDE_Y - INPUT_PAD_H;
const int org_x = outputX * STRIDE_X - INPUT_PAD_W;
const int currentKernelOffset = KERNEL_SIZE*(outputZ%CHANNELS);
const int biasIndex=outputZ%CHANNELS;
const int local_image_offset = org_y*input_width + org_x;
const int imageSize = input_width*input_height;
__global Dtype* image_dataPtrFloat = (image_data + (imageSize*outputZ + local_image_offset));
__global Dtype* kernel_dataPtrFloat = (kernel_data + (currentKernelOffset));
for(int y = 0; y < KERNEL_H; y++)
{
for(int x = 0; x < KERNEL_W; x++)
{
if(!(org_y + y * DILATION_Y >= 0 && org_y + y * DILATION_Y < input_height && org_x + x * DILATION_X >= 0 && org_x + x * DILATION_X < input_width))
{
continue;
}
sum += image_dataPtrFloat[x * DILATION_X] * kernel_dataPtrFloat[x];
}
image_dataPtrFloat += input_width * DILATION_Y;
kernel_dataPtrFloat += KERNEL_W;
}
#if APPLY_BIAS
int offset = outputZ*output_height*output_width + outputY*output_width + outputX;
ACTIVATION_FUNCTION(convolved_image, offset, sum + biases_base[biasIndex], biasIndex);
#else
int offset = outputZ*output_height*output_width + outputY*output_width + outputX;
ACTIVATION_FUNCTION(convolved_image, offset, sum, biasIndex);
#endif
}
}
#endif

Thread 1 "opencv_perf_dnn" received signal SIGSEGV, Segmentation fault.
0x00007fffbacf27a9 in llvm::PredIterator<llvm::BasicBlock, llvm::Value::user_iterator_impl<llvm::User> >::advancePastNonTerminators() () from /usr/lib/x86_64-linux-gnu/beignet//libgbe.so
(gdb) bt
#0  0x00007fffbacf27a9 in llvm::PredIterator<llvm::BasicBlock, llvm::Value::user_iterator_impl<llvm::User> >::advancePastNonTerminators() () at /usr/lib/x86_64-linux-gnu/beignet//libgbe.so
#1  0x00007fffbb583ce0 in llvm::LoopBase<llvm::BasicBlock, llvm::Loop>::getLoopLatch() const () at /usr/lib/x86_64-linux-gnu/beignet//libgbe.so
#2  0x00007fffbb583e46 in llvm::Loop::getLoopID() const () at /usr/lib/x86_64-linux-gnu/beignet//libgbe.so
#3  0x00007fffb9bf5573 in gbe::CustomLoopUnroll::GetUnrollMetadataValue(llvm::Loop const*, llvm::StringRef) (Name=..., L=0x555555c5aed0) at ./backend/src/llvm/llvm_unroll.cpp:70
#4  0x00007fffb9bf5573 in gbe::CustomLoopUnroll::runOnLoop(llvm::Loop*, llvm::LPPassManager&) (this=0x555555c54840, L=0x555555c5aed0, LPM=...) at ./backend/src/llvm/llvm_unroll.cpp:227
#5  0x00007fffbb58e8bb in llvm::LPPassManager::runOnFunction(llvm::Function&) () at /usr/lib/x86_64-linux-gnu/beignet//libgbe.so
#6  0x00007fffbb80f358 in llvm::FPPassManager::runOnFunction(llvm::Function&) () at /usr/lib/x86_64-linux-gnu/beignet//libgbe.so
#7  0x00007fffbb4ee4f3 in (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) () at /usr/lib/x86_64-linux-gnu/beignet//libgbe.so
#8  0x00007fffbb80ec96 in llvm::legacy::PassManagerImpl::run(llvm::Module&) () at /usr/lib/x86_64-linux-gnu/beignet//libgbe.so
#9  0x00007fffb9bf1134 in gbe::runModulePass(llvm::Module&, llvm::TargetLibraryInfoImpl*, llvm::DataLayout const&, int, bool) (mod=..., libraryInfo=0x55555582f140, DL=..., optLevel=1, strictMath=false)
    at ./backend/src/llvm/llvm_to_gen.cpp:222
#10 0x00007fffb9bf151d in gbe::llvmToGen(gbe::ir::Unit&, void const*, int, bool, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (unit=..., module=<optimized out>, optLevel=1, strictMath=<optimized out>, profiling=0, errors="") at ./backend/src/llvm/llvm_to_gen.cpp:341
#11 0x00007fffb9b5c99d in gbe::Program::buildFromLLVMModule(void const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int) (this=0x555555c7be40, module=0x555555648e80, error="", optLevel=1)
    at ./backend/src/backend/program.cpp:127
#12 0x00007fffb9cc15f2 in gbe::genProgramNewFromLLVM(unsigned int, void const*, void const*, char const*, unsigned long, char*, unsigned long*, int, char const*) (deviceID=2582, module=0x555555648e80, llvm_ctx=0x5555556721b0, asm_file_name=<optimized out>, stringSize=1048190, err=0x555555994932 "", errSize=0x55555563ab10, optLevel=1, options=0x555555829610 " -D TYPE=1 -D Dtype=float -D Dtype2=float2 -D Dtype4=float4 -D Dtype8=float8 -D Dtype16=float16 -D as_Dtype=as_float -D as_Dtype2=as_float2 -D as_Dtype4=as_float4 -D as_Dtype8=as_float8 -D KERNEL_WIDT"...) at ./backend/src/backend/gen_program.cpp:497
#13 0x00007fffb9b7276f in gbe::programNewFromSource(uint32_t, char const*, size_t, char const*, char*, size_t*) (deviceID=2582, source=<optimized out>, stringSize=1048190, options=0x555555829610 " -D TYPE=1 -D Dtype=float -D Dtype2=float2 -D Dtype4=float4 -D Dtype8=float8 -D Dtype16=float16 -D as_Dtype=as_float -D as_Dtype2=as_float2 -D as_Dtype4=as_float4 -D as_Dtype8=as_float8 -D KERNEL_WIDT"..., err=0x555555994932 "", errSize=0x55555563ab10)
    at /usr/include/c++/8/bits/basic_string.h:1031
#14 0x00007fffd5b8b939 in cl_program_build (p=p at entry=0x55555563aa00, options=options at entry=0x555555829610 " -D TYPE=1 -D Dtype=float -D Dtype2=float2 -D Dtype4=float4 -D Dtype8=float8 -D Dtype16=float16 -D as_Dtype=as_float -D as_Dtype2=as_float2 -D as_Dtype4=as_float4 -D as_Dtype8=as_float8 -D KERNEL_WIDT"...) at ./src/cl_program.c:597
#15 0x00007fffd5b8084b in clBuildProgram (program=0x55555563aa00, num_devices=<optimized out>, device_list=<optimized out>, options=0x555555829610 " -D TYPE=1 -D Dtype=float -D Dtype2=float2 -D Dtype4=float4 -D Dtype8=float8 -D Dtype16=float16 -D as_Dtype=as_float -D as_Dtype2=as_float2 -D as_Dtype4=as_float4 -D as_Dtype8=as_float8 -D KERNEL_WIDT"..., pfn_notify=0x0, user_data=0x0) at ./src/cl_api.c:838
#16 0x00007ffff44aa22e in cv::ocl::Program::Impl::buildFromSources(cv::ocl::Context const&, cv::ocl::ProgramSource::Impl const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) ()
    at /mnt/scratch/src/opencv/build/lib/libopencv_core.so.4.0
#17 0x00007ffff44b08e4 in cv::ocl::Program::Impl::compileWithCache(cv::ocl::Context const&, cv::ocl::ProgramSource::Impl const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) ()
    at /mnt/scratch/src/opencv/build/lib/libopencv_core.so.4.0
#18 0x00007ffff44baf96 in cv::ocl::Program::Impl::Impl(cv::ocl::ProgramSource const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () at /mnt/scratch/src/opencv/build/lib/libopencv_core.so.4.0
#19 0x00007ffff44bb498 in cv::ocl::Program::create(cv::ocl::ProgramSource const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () at /mnt/scratch/src/opencv/build/lib/libopencv_core.so.4.0
#20 0x00007ffff44bd5b0 in cv::ocl::Context::Impl::getProg(cv::ocl::ProgramSource const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () at /mnt/scratch/src/opencv/build/lib/libopencv_core.so.4.0
#21 0x00007ffff44bd791 in cv::ocl::Context::getProg(cv::ocl::ProgramSource const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) () at /mnt/scratch/src/opencv/build/lib/libopencv_core.so.4.0
#22 0x00007ffff7b9b477 in cv::dnn::ocl4dnn::OCL4DNNConvSpatial<float>::compileKernel() () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#23 0x00007ffff7b9c3f4 in cv::dnn::ocl4dnn::OCL4DNNConvSpatial<float>::createGEMMLikeConvKernel(int, int, int) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#24 0x00007ffff7b9c862 in cv::dnn::ocl4dnn::OCL4DNNConvSpatial<float>::createConvolutionKernel(int, int, int, int) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#25 0x00007ffff7b9fed3 in cv::dnn::ocl4dnn::OCL4DNNConvSpatial<float>::useFirstAvailable(cv::UMat const&, cv::UMat&, cv::UMat const&, cv::UMat const&, int, cv::UMat&) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#26 0x00007ffff7ba13c1 in cv::dnn::ocl4dnn::OCL4DNNConvSpatial<float>::prepareKernel(cv::UMat const&, cv::UMat&, cv::UMat const&, cv::UMat const&, int) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#27 0x00007ffff7ba1667 in cv::dnn::ocl4dnn::OCL4DNNConvSpatial<float>::Forward(cv::UMat const&, cv::UMat const&, cv::UMat const&, cv::UMat const&, cv::UMat&, int) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#28 0x00007ffff7af42eb in cv::dnn::ConvolutionLayerImpl::forward(cv::_InputArray const&, cv::_OutputArray const&, cv::_OutputArray const&) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#29 0x00007ffff7ac1f6f in cv::dnn::dnn4_v20180917::Net::Impl::forwardLayer(cv::dnn::dnn4_v20180917::LayerData&) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#30 0x00007ffff7ac2e0a in cv::dnn::dnn4_v20180917::Net::Impl::forwardToLayer(cv::dnn::dnn4_v20180917::LayerData&, bool) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#31 0x00007ffff7ad8dde in cv::dnn::dnn4_v20180917::Net::forward(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () at /mnt/scratch/src/opencv/build/lib/libopencv_dnn.so.4.0
#32 0x0000555555579053 in opencv_test::Conv_conv_Test::PerfTestBody() ()
#33 0x00005555555ad811 in perf::TestBase::RunPerfTestBody() ()
#34 0x000055555556f3fa in opencv_test::Conv_conv_Test::TestBody() ()
#35 0x00005555555aa9ea in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#36 0x00005555555aabfa in testing::Test::Run() ()
#37 0x00005555555aaee0 in testing::TestInfo::Run() ()
#38 0x00005555555aafb5 in testing::TestCase::Run() ()
#39 0x00005555555ab4dc in testing::internal::UnitTestImpl::RunAllTests() ()
#40 0x00005555555ab5fe in testing::UnitTest::Run() ()
#41 0x000055555556e8f2 in main ()
(gdb)