Bug#1016332: librsb: FTBFS: gmake[3]: *** [Makefile:920: qtests] Error 1

Lucas Nussbaum lucas at debian.org
Fri Jul 29 19:37:16 BST 2022


Source: librsb
Version: 1.3.0.1+dfsg-2
Severity: serious
Justification: FTBFS
Tags: bookworm sid ftbfs
User: lucas at debian.org
Usertags: ftbfs-20220728 ftbfs-bookworm

Hi,

During a rebuild of all packages in sid, your package failed to build
on amd64.


Relevant part (hopefully):
> make[2]: Entering directory '/<<PKGBUILDDIR>>'
> gmake  all-recursive
> gmake[3]: Entering directory '/<<PKGBUILDDIR>>'
> Making all in librsbpp
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/librsbpp'
> gmake  all-am
> gmake[5]: Entering directory '/<<PKGBUILDDIR>>/librsbpp'
> gmake[5]: Leaving directory '/<<PKGBUILDDIR>>/librsbpp'
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/librsbpp'
> Making all in .
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>'
> gmake[4]: Nothing to be done for 'all-am'.
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>'
> Making all in examples
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/examples'
> gmake[4]: Nothing to be done for 'all'.
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/examples'
> Making all in scripts
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/scripts'
> gmake[4]: Nothing to be done for 'all'.
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/scripts'
> Making all in bench
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/bench'
> gmake[4]: Nothing to be done for 'all'.
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/bench'
> Making all in blas_sparse
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/blas_sparse'
> gmake[4]: Nothing to be done for 'all'.
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/blas_sparse'
> Making all in doc
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/doc'
> gmake[4]: Nothing to be done for 'all'.
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/doc'
> Making all in m4
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/m4'
> gmake[4]: Nothing to be done for 'all'.
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/m4'
> Making all in rsblib
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/rsblib'
> gmake  all-recursive
> gmake[5]: Entering directory '/<<PKGBUILDDIR>>/rsblib'
> Making all in .
> gmake[6]: Entering directory '/<<PKGBUILDDIR>>/rsblib'
> gmake[6]: Leaving directory '/<<PKGBUILDDIR>>/rsblib'
> Making all in examples
> gmake[6]: Entering directory '/<<PKGBUILDDIR>>/rsblib/examples'
> gmake[6]: Nothing to be done for 'all'.
> gmake[6]: Leaving directory '/<<PKGBUILDDIR>>/rsblib/examples'
> gmake[5]: Leaving directory '/<<PKGBUILDDIR>>/rsblib'
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/rsblib'
> Making all in rsbtest
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/rsbtest'
> gmake  all-am
> gmake[5]: Entering directory '/<<PKGBUILDDIR>>/rsbtest'
> gmake[5]: Leaving directory '/<<PKGBUILDDIR>>/rsbtest'
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/rsbtest'
> gmake[3]: Leaving directory '/<<PKGBUILDDIR>>'
>  [*] beginning quick test...
> gmake gtests
> gmake[3]: Entering directory '/<<PKGBUILDDIR>>'
>  [*] Skipping tests based on Google Test (not detected at configure time)
> gmake[3]: Leaving directory '/<<PKGBUILDDIR>>'
> gmake mtests -C .
> gmake[3]: Entering directory '/<<PKGBUILDDIR>>'
> echo 'if test x"${srcdir}" = x ; then srcdir=. ; fi' > scripts/readme-tests.sh
> LANG=C /bin/grep '^	*\(make \)**\./\(rsbench\|sbtc\|sbtf\)\|\(^	*test\> -f\)' README | /bin/sed 's/#.*$//g;s/$/ || exit 255/g' | /bin/sed 's/A.mtx/${srcdir}\/A.mtx/g' >> scripts/readme-tests.sh
> srcdir="/<<PKGBUILDDIR>>" /bin/bash -ex ./scripts/readme-tests.sh
> + test x/<<PKGBUILDDIR>> = x
> + ./rsbench -oa -Ob --bench -f /<<PKGBUILDDIR>>/A.mtx -qH -R -n1,4 -T z --verbose --nrhs 1,2 --by-rows
> # --bench option implies -qH -R --write-performance-record --want-mkl-autotune --mkl-benchmark --types : --split-experimental 6 --merge-experimental 6 --also-transpose --sort-filenames-list --want-memory-benchmark
> # Passed 0 arguments via autotuning string "" (an empty string requests defaults)
> Will invoke autotuning for ~10.000000 s x 1 rounds, specifying verbosity=0 and threads=0. (>0 means no structure tuning; 0 means only structure tuning, <0 means tuning of both with (negated) thread count suggestion).
> Will try /<<PKGBUILDDIR>>/A.mtx
> Adding matrix file: /<<PKGBUILDDIR>>/A.mtx
> # Sorting matrices list (use --no-sort-filenames-list to prevent this)
> # Using matrices: A.mtx
> # beginning run at 1659086390
> # /<<PKGBUILDDIR>>/.libs/rsbench -oa -Ob --bench -f /<<PKGBUILDDIR>>/A.mtx -qH -R -n1,4 -T z --verbose --nrhs 1,2 --by-rows
> # compiled with: CC=gcc CFLAGS=-g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -O3 -std=c99
> # average timer granularity: 3.37e-08 s
> # Will write a final performance record to file rsbench_pr__1659086390_gcc-12.1-1,4th.rpr and periodic checkpoints to rsbench_pr__1659086390_gcc-12.1-1,4th.rpr.tmp
> # will NOT perform ancillary tests.
> # will flush cache memory:  between each operation measurement series, and NOT between each operation.
> # will keep any zero encountered in the matrix.
> # env: export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
> # env: export LD_LIBRARY_PATH=/<<PKGBUILDDIR>>/.libs
> # env: HOSTNAME is not set
> # env: KMP_AFFINITY is not set
> # env: OMP_AFFINITY_FORMAT is not set
> # env: OMP_ALLOCATOR is not set
> # env: OMP_CANCELLATION is not set
> # env: OMP_DEBUG is not set
> # env: OMP_DEFAULT_DEVICE is not set
> # env: OMP_DISPLAY_ENV is not set
> # env: OMP_DISPLAY_AFFINITY is not set
> # env: OMP_DYNAMIC is not set
> # env: OMP_MAX_ACTIVE_LEVELS is not set
> # env: OMP_MAX_TASK_PRIORITY is not set
> # env: OMP_NESTED is not set
> # env: OMP_NUM_THREADS is not set
> # env: OMP_PLACES is not set
> # env: OMP_PROC_BIND is not set
> # env: OMP_SCHEDULE is not set
> # env: OMP_STACKSIZE is not set
> # env: OMP_TARGET_OFFLOAD is not set
> # env: OMP_THREAD_LIMIT is not set
> # env: OMP_TOOL is not set
> # env: OMP_TOOL_LIBRARIES is not set
> # env: OMP_WAIT_POLICY is not set
> # env: RSB_WANT_RSBPP is not set
> #     using kernels from librsbpp (default).
> # env: SLURM_CLUSTER_NAME is not set
> # env: SLURM_CPUS_ON_NODE is not set
> # env: SLURM_JOB_CPUS_PER_NODE is not set
> # env: SLURM_JOB_ID is not set
> # env: SLURM_JOBID is not set
> # env: SLURM_JOB_NAME is not set
> # env: SLURM_JOB_NUM_NODES is not set
> # env: SLURM_JOB_PARTITION is not set
> # env: SLURM_NPROCS is not set
> # env: SLURM_NTASKS is not set
> # env: SLURM_STEP_TASKS_PER_NODE is not set
> # env: SLURM_TASKS_PER_NODE is not set
> # detected hostname: ip-10-84-234-251
> # user specified a verbosity level of 1 (each --verbose occurrence counts +1)
> # This test will measure times in scanning arrays sized and aligned to fit in caches.
> # 3 cache levels detected
> Will fill struct with 50 samples...
> # Memory benchmark took 11.572s
> # auto-tuning oriented output implies  times==0 iterations and sort-after-load.
> #pr: allocated a performance record for 8 samples (2240 bytes).
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # so far, program took 11.574s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.000s/0.000s .
> # Using 1 threads
> # reading A.mtx (184 bytes / 1 MiB / 6 nnz / 3 rows / 3 columns / 1 MiB COO) as type Z...
> # file input of A.mtx took   0.00 s (6 nnz, 61832 nnz/s ) (1.90 MB/s ) 
> #pre-sorting (6 elements) took 0.000334024 s
> #weeding duplicates (to 6 elements) took 2.14577e-06 s (and check, 1.90735e-06 s )
> # multi-nrhs benchmarking (1,2) -- now using nrhs 1.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 1 threads
> # Constructed matrix (took 0.001s): (3 x 3)[0x562387e0baf0]{Z} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (1 th.) took 4.101e-05s; avg 1.367e-05s ( +/-  79.07/156.40 %); best 2.861e-06s; worst 3.505e-05s; std dev. 1.512e-05 (taking best).
> Reference operation time is 2.86102e-06 s (33.55 Mflops) with 1 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 2.861e-06  Mflops: 33.554)
> Merge (3 -> 1 leaves) took w.c.t. of 1.192e-05s, ~5.007e-06s of computing time (of which 9.537e-07s sorting, 1.907e-06s analysis)
> 3 iterations (1 th.) took 2.408e-05s; avg 8.027e-06s ( +/-  99.65/200.00 %); best 2.81e-08s; worst 2.408e-05s; std dev. 1.135e-05 (taking best).
> Reference operation time is 2.80976e-08 s (3417 Mflops) with 1 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 3416.658   nsubm:1 otn:1
> Applying merge (3 -> 1 leaves, 1 th.) yielded SPEEDUP of 101.824x: 2.861e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 5.889e-05s (of which 1.597e-05s partitioning, 0s I/O); computing times: 5.007e-06s in par. loops, 9.537e-07s sorting, 1.907e-06s analyzing)
> Total merge + benchmarking process took 5.889e-05s, equivalent to 2095.9/20.6 new/old ops (2.432e-05s for 2 clones -- as 865.5/8.5 ops, or 432.8/4.2 ops per clone), SPEEDUP of 101.824x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 1 th.sp.) yielded SPEEDUP of 101.824x (2.861e-06s -> 2.81e-08s), will amortize in       20.8 ops by saving 2.833e-06s per op.
> In 1 tuning rounds (tot. 0.00015s, 2.4e-05s for constructor, 2 clones) obtained a SPEEDUP of 10082.4% (101.8x) (from 33.55 to 3417 Mflops).
> #pr: updating sample at index 1 (0^th of 8), 0^th touch for (0,0,0,0,0,0,0).
> First run of RSB Autotuner took 0.000173092 s  (2.861e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.0018611 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:3 0
> #norm:1.7320508075688772 0
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	1	3	3	6	  0.000000	  0.000482	  0.000013	  0.000495
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	1	3	3	6	  0.000495
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	1	3	3	6	  0.000482
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	1	3	3	6	  0.000013
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	1	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	1	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	1	3	3	6	  0.000495
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	1	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	1	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	1	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	1	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	1	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	1	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	1	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	1	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	1	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	1	3	3	6	       276
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	1	3	3	6	6	6	6
> #
> # Using 4 threads
> # Constructed matrix (took 0.021s): (3 x 3)[0x562387e10840]{Z} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.1 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (4 th.) took 2.289e-05s; avg 7.629e-06s ( +/-  34.38/ 31.25 %); best 5.007e-06s; worst 1.001e-05s; std dev. 2.051e-06 (taking best).
> Reference operation time is 5.00679e-06 s (19.17 Mflops) with 4 threads.
> Starting merge (user-supplied threads) based auto-tuning procedure (transA=N, nrhs=1, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 5.007e-06  Mflops: 19.174)
> Merge (3 -> 1 leaves) took w.c.t. of 7.868e-06s, ~2.861e-06s of computing time (of which 9.537e-07s sorting, 1.907e-06s analysis)
> 3 iterations (4 th.) took 3.099e-06s; avg 1.033e-06s ( +/-  97.28/200.00 %); best 2.81e-08s; worst 3.099e-06s; std dev. 1.461e-06 (taking best).
> Reference operation time is 2.80976e-08 s (3417 Mflops) with 4 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 3416.658   nsubm:1 otn:1
> Applying merge (3 -> 1 leaves, 1 th.) yielded SPEEDUP of 178.193x: 5.007e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 3.505e-05s (of which 1.097e-05s partitioning, 0s I/O); computing times: 2.861e-06s in par. loops, 9.537e-07s sorting, 1.907e-06s analyzing)
> Total merge + benchmarking process took 3.505e-05s, equivalent to 1247.3/7.0 new/old ops (2.003e-05s for 2 clones -- as 712.8/4.0 ops, or 356.4/2.0 ops per clone), SPEEDUP of 178.193x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 1 -> 1 th.sp.) yielded SPEEDUP of 178.193x (5.007e-06s -> 2.81e-08s), will amortize in        7.0 ops by saving 4.979e-06s per op.
> In 1 tuning rounds (tot. 9.6e-05s, 2e-05s for constructor, 2 clones) obtained a SPEEDUP of 17719.3% (178.2x) (from 19.17 to 3417 Mflops).
> #pr: updating sample at index 5 (1^th of 8), 0^th touch for (0,1,0,0,0,0,0).
> First run of RSB Autotuner took 0.000109911 s  (5.007e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.000231981 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:3 0
> #norm:1.7320508075688772 0
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	4	3	3	6	  0.000000	  0.020428	  0.000693	  0.021121
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	4	3	3	6	  0.021121
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	4	3	3	6	  0.020428
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	4	3	3	6	  0.000693
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	4	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	4	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	4	3	3	6	  0.021121
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	4	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	4	3	3	6	      0.02
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	4	3	3	6	      0.02
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	4	3	3	6	      0.02
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	4	3	3	6	      0.02
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	4	3	3	6	      -nan	      0.02	      0.02	      0.02
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	4	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	4	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	4	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	4	3	3	6	       276
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	4	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[1]	SPMV[1]	SPMV[4]
> %operation:A.mtx	0.000527859	1e+09	1e+09
> %constructor:matrix	SORT[1]	SCAN[1]	SHUFFLE[1]	INSERT[1]
> %constructor:A.mtx	0	0.000482082	0	1.3113e-05
> # symmetric matrix --- skipping transposed benchmarking
> # multi-nrhs benchmarking (1,2) -- now using nrhs 2.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 1 threads
> # Constructed matrix (took 0.001s): (3 x 3)[0x562387e10840]{Z} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (1 th.) took 3.409e-05s; avg 1.136e-05s ( +/-  74.83/147.55 %); best 2.861e-06s; worst 2.813e-05s; std dev. 1.186e-05 (taking best).
> Reference operation time is 2.86102e-06 s (67.11 Mflops) with 1 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=2, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 2.861e-06  Mflops: 67.109)
> Merge (3 -> 1 leaves) took w.c.t. of 6.914e-06s, ~2.861e-06s of computing time (of which 9.537e-07s sorting, 1.192e-06s analysis)
> 3 iterations (1 th.) took 1.001e-05s; avg 3.338e-06s ( +/-  99.16/171.43 %); best 2.81e-08s; worst 9.06e-06s; std dev. 4.065e-06 (taking best).
> Reference operation time is 2.80976e-08 s (6833 Mflops) with 1 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 6833.317   nsubm:1 otn:1
> Applying merge (3 -> 1 leaves, 1 th.) yielded SPEEDUP of 101.824x: 2.861e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 3.695e-05s (of which 9.06e-06s partitioning, 0s I/O); computing times: 2.861e-06s in par. loops, 9.537e-07s sorting, 1.192e-06s analyzing)
> Total merge + benchmarking process took 3.695e-05s, equivalent to 1315.2/12.9 new/old ops (1.407e-05s for 2 clones -- as 500.6/4.9 ops, or 250.3/2.5 ops per clone), SPEEDUP of 101.824x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 1 th.sp.) yielded SPEEDUP of 101.824x (2.861e-06s -> 2.81e-08s), will amortize in       13.0 ops by saving 2.833e-06s per op.
> In 1 tuning rounds (tot. 0.00011s, 1.4e-05s for constructor, 2 clones) obtained a SPEEDUP of 10082.4% (101.8x) (from 67.11 to 6833 Mflops).
> #pr: updating sample at index 3 (2^th of 8), 0^th touch for (0,0,0,0,1,0,0).
> First run of RSB Autotuner took 0.000118017 s  (2.861e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.000602007 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:3 0
> #norm:1.7320508075688772 0
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	1	3	3	6	  0.000000	  0.000826	  0.000008	  0.000834
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	1	3	3	6	  0.000834
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	1	3	3	6	  0.000826
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	1	3	3	6	  0.000008
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	1	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	1	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	1	3	3	6	  0.000834
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	1	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	1	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	1	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	1	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	1	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	1	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	1	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	1	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	1	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	1	3	3	6	       276
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	1	3	3	6	6	6	6
> #
> # Using 4 threads
> # Constructed matrix (took 0.010s): (3 x 3)[0x562387e10840]{Z} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.1 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (4 th.) took 3.004e-05s; avg 1.001e-05s ( +/-  59.52/ 59.52 %); best 4.053e-06s; worst 1.597e-05s; std dev. 4.867e-06 (taking best).
> Reference operation time is 4.05312e-06 s (47.37 Mflops) with 4 threads.
> Starting merge (user-supplied threads) based auto-tuning procedure (transA=N, nrhs=2, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 4.053e-06  Mflops: 47.371)
> Merge (3 -> 1 leaves) took w.c.t. of 9.06e-06s, ~4.053e-06s of computing time (of which 9.537e-07s sorting, 9.537e-07s analysis)
> 3 iterations (4 th.) took 4.053e-06s; avg 1.351e-06s ( +/-  97.92/129.41 %); best 2.81e-08s; worst 3.099e-06s; std dev. 1.296e-06 (taking best).
> Reference operation time is 2.80976e-08 s (6833 Mflops) with 4 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 6833.317   nsubm:1 otn:1
> Applying merge (3 -> 1 leaves, 1 th.) yielded SPEEDUP of 144.251x: 4.053e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 3.91e-05s (of which 1.192e-05s partitioning, 0s I/O); computing times: 4.053e-06s in par. loops, 9.537e-07s sorting, 9.537e-07s analyzing)
> Total merge + benchmarking process took 3.91e-05s, equivalent to 1391.6/9.6 new/old ops (2.098e-05s for 2 clones -- as 746.7/5.2 ops, or 373.4/2.6 ops per clone), SPEEDUP of 144.251x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 1 -> 1 th.sp.) yielded SPEEDUP of 144.251x (4.053e-06s -> 2.81e-08s), will amortize in        9.7 ops by saving 4.025e-06s per op.
> In 1 tuning rounds (tot. 0.00011s, 2.1e-05s for constructor, 2 clones) obtained a SPEEDUP of 14325.1% (144.3x) (from 47.37 to 6833 Mflops).
> #pr: updating sample at index 7 (3^th of 8), 0^th touch for (0,1,0,0,1,0,0).
> First run of RSB Autotuner took 0.000118971 s  (4.053e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.000989914 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:3 0
> #norm:1.7320508075688772 0
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	4	3	3	6	  0.000000	  0.003342	  0.006749	  0.010091
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	4	3	3	6	  0.010091
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	4	3	3	6	  0.003342
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	4	3	3	6	  0.006749
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	4	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	4	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	4	3	3	6	  0.010091
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	4	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	4	3	3	6	      0.08
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	4	3	3	6	      0.08
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	4	3	3	6	      0.25
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	4	3	3	6	      0.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	4	3	3	6	      -nan	      0.25	      0.00	      0.08
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	4	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	4	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	4	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	4	3	3	6	       276
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	4	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[1]	SPMV[1]	SPMV[4]
> %operation:A.mtx	0.000849009	1e+09	1e+09
> %constructor:matrix	SORT[1]	SCAN[1]	SHUFFLE[1]	INSERT[1]
> %constructor:A.mtx	0	0.00082612	0	7.86781e-06
> # symmetric matrix --- skipping transposed benchmarking
> # so far, program took 13.172s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.004s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 1.34s (system CPU time used)
> ru_utime : 14.19s (user CPU time used)
> # benchmarking terminated --- finalizing run.
> # ====== BEGIN Total summary record.
> #pr: ========  Limiting to nrhs=1:
> #pr:  2 samples (out of 4) matched the dump limiting criteria.
> #pr: Dump from a base of 4 samples (of max 8) ordered by (1,2,1,1,2,1,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 Z S N  1  1  0 4.0000 4.6667 3 1 3416.66 2.861e-06 0.000e+00 2.810e-08 0.000e+00 1.731e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:    5:R_R  A 3 3 6 1 Z S N  4  1  0 4.0000 4.6667 3 1 3416.66 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.099e-04 9.54e+00 2.29e+00 1 9.60e-05
> #pr:  2 samples (out of 4) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     2 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 13900.8 % faster, avg. sp. ratio 140.008x, max sp. ratio 178.193x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 5036.1/3911.8/6160.4/10072.1   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  41.2/ 22.0/ 60.5/ 82.5 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  41.6, min.  22.1, max.  61.1 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         32/        32/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         96/        96/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       7.830/     7.830/     7.830,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      19.076/     9.538/     9.538,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        2.292/     2.292/     2.292)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 2 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 1 /1 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 3.417e+03,  min 3.417e+03,  max 3.417e+03  (2 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 2.636e+01,  min 1.917e+01,  max 3.355e+01  (2 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 5.620e-08s (2 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 3.934e-06s, min 2.861e-06s, max 5.007e-06s, tot 7.868e-06s (2 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 9.210e-01 9.210e-01
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 2.112e+01
> #pr: ========  Limiting to nrhs=2:
> #pr:  2 samples (out of 4) matched the dump limiting criteria.
> #pr: Dump from a base of 4 samples (of max 8) ordered by (1,2,1,1,2,1,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    3:R_R  A 3 3 6 2 Z S N  1  1  0 4.0000 4.6667 3 1 6833.32 2.861e-06 0.000e+00 2.810e-08 0.000e+00 1.180e-04 1.47e+01 1.65e+00 1 1.92e-04
> pr:    7:R_R  A 3 3 6 2 Z S N  4  1  0 4.0000 4.6667 3 1 6833.32 4.053e-06 0.000e+00 2.810e-08 0.000e+00 1.190e-04 1.47e+01 1.65e+00 1 1.92e-04
> #pr:  2 samples (out of 4) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     2 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 12203.8 % faster, avg. sp. ratio 123.038x, max sp. ratio 144.251x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 4217.2/4200.3/4234.2/8434.5   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  35.3/ 29.4/ 41.2/ 70.6 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  35.6, min.  29.6, max.  41.7 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         32/        32/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         96/        96/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound      11.247/    11.247/    11.247,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      29.326/    14.663/    14.663,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        1.646/     1.646/     1.646)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 2 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 1 /1 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 6.833e+03,  min 6.833e+03,  max 6.833e+03  (2 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 5.724e+01,  min 4.737e+01,  max 6.711e+01  (2 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 5.620e-08s (2 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 3.457e-06s, min 2.861e-06s, max 4.053e-06s, tot 6.914e-06s (2 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 6.412e-01 6.412e-01
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 2.112e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    2.000e+00 x, min 2.000e+00 x, max 2.000e+00 x (2 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to transA=N:
> #pr: Dump from a base of 4 samples (of max 8) ordered by (1,2,1,1,2,1,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 Z S N  1  1  0 4.0000 4.6667 3 1 3416.66 2.861e-06 0.000e+00 2.810e-08 0.000e+00 1.731e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:    3:R_R  A 3 3 6 2 Z S N  1  1  0 4.0000 4.6667 3 1 6833.32 2.861e-06 0.000e+00 2.810e-08 0.000e+00 1.180e-04 1.47e+01 1.65e+00 1 1.92e-04
> pr:    5:R_R  A 3 3 6 1 Z S N  4  1  0 4.0000 4.6667 3 1 3416.66 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.099e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:    7:R_R  A 3 3 6 2 Z S N  4  1  0 4.0000 4.6667 3 1 6833.32 4.053e-06 0.000e+00 2.810e-08 0.000e+00 1.190e-04 1.47e+01 1.65e+00 1 1.92e-04
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     4 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 13052.3 % faster, avg. sp. ratio 131.523x, max sp. ratio 178.193x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 4626.6/3911.8/6160.4/18506.6   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  38.3/ 22.0/ 60.5/153.1 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  38.6, min.  22.1, max.  61.1 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         32/        32/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         96/        96/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       9.538/     7.830/    11.247,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      48.403/     9.538/    14.663,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        1.969/     1.646/     2.292)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 4 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 2 /2 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 5.125e+03,  min 3.417e+03,  max 6.833e+03  (4 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 4.180e+01,  min 1.917e+01,  max 6.711e+01  (4 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 1.124e-07s (4 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 3.695e-06s, min 2.861e-06s, max 5.007e-06s, tot 1.478e-05s (4 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 6.412e-01 9.210e-01
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 2.112e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    2.000e+00 x, min 2.000e+00 x, max 2.000e+00 x (2 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to both transA=N and nrhs=1:
> #pr:  2 samples (out of 4) matched the dump limiting criteria.
> #pr: Dump from a base of 4 samples (of max 8) ordered by (1,2,1,1,2,1,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 Z S N  1  1  0 4.0000 4.6667 3 1 3416.66 2.861e-06 0.000e+00 2.810e-08 0.000e+00 1.731e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:    5:R_R  A 3 3 6 1 Z S N  4  1  0 4.0000 4.6667 3 1 3416.66 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.099e-04 9.54e+00 2.29e+00 1 9.60e-05
> #pr:  2 samples (out of 4) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     2 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 13900.8 % faster, avg. sp. ratio 140.008x, max sp. ratio 178.193x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 5036.1/3911.8/6160.4/10072.1   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  41.2/ 22.0/ 60.5/ 82.5 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  41.6, min.  22.1, max.  61.1 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         32/        32/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         96/        96/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       7.830/     7.830/     7.830,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      19.076/     9.538/     9.538,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        2.292/     2.292/     2.292)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 2 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 1 /1 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 3.417e+03,  min 3.417e+03,  max 3.417e+03  (2 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 2.636e+01,  min 1.917e+01,  max 3.355e+01  (2 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 5.620e-08s (2 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 3.934e-06s, min 2.861e-06s, max 5.007e-06s, tot 7.868e-06s (2 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 9.210e-01 9.210e-01
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 2.112e+01
> #pr: ========  Limiting to both transA=N and nrhs=2:
> #pr:  2 samples (out of 4) matched the dump limiting criteria.
> #pr: Dump from a base of 4 samples (of max 8) ordered by (1,2,1,1,2,1,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    3:R_R  A 3 3 6 2 Z S N  1  1  0 4.0000 4.6667 3 1 6833.32 2.861e-06 0.000e+00 2.810e-08 0.000e+00 1.180e-04 1.47e+01 1.65e+00 1 1.92e-04
> pr:    7:R_R  A 3 3 6 2 Z S N  4  1  0 4.0000 4.6667 3 1 6833.32 4.053e-06 0.000e+00 2.810e-08 0.000e+00 1.190e-04 1.47e+01 1.65e+00 1 1.92e-04
> #pr:  2 samples (out of 4) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     2 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 12203.8 % faster, avg. sp. ratio 123.038x, max sp. ratio 144.251x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 4217.2/4200.3/4234.2/8434.5   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  35.3/ 29.4/ 41.2/ 70.6 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  35.6, min.  29.6, max.  41.7 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         32/        32/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         96/        96/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound      11.247/    11.247/    11.247,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      29.326/    14.663/    14.663,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        1.646/     1.646/     1.646)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 2 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 1 /1 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 6.833e+03,  min 6.833e+03,  max 6.833e+03  (2 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 5.724e+01,  min 4.737e+01,  max 6.711e+01  (2 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 5.620e-08s (2 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 3.457e-06s, min 2.861e-06s, max 4.053e-06s, tot 6.914e-06s (2 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 6.412e-01 6.412e-01
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 2.112e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    2.000e+00 x, min 2.000e+00 x, max 2.000e+00 x (2 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to transA=T:
> #pr:  No sample (out of 4) matched the dump criteria -- skipping dump round.
> #pr: ========  Limiting to both transA=T and nrhs=1:
> #pr:  No sample (out of 4) matched the dump criteria -- skipping dump round.
> #pr: ========  Limiting to both transA=T and nrhs=2:
> #pr:  No sample (out of 4) matched the dump criteria -- skipping dump round.
> #pr: ========  All results (not limiting)
> #pr: Dump from a base of 4 samples (of max 8) ordered by (1,2,1,1,2,1,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 Z S N  1  1  0 4.0000 4.6667 3 1 3416.66 2.861e-06 0.000e+00 2.810e-08 0.000e+00 1.731e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:    3:R_R  A 3 3 6 2 Z S N  1  1  0 4.0000 4.6667 3 1 6833.32 2.861e-06 0.000e+00 2.810e-08 0.000e+00 1.180e-04 1.47e+01 1.65e+00 1 1.92e-04
> pr:    5:R_R  A 3 3 6 1 Z S N  4  1  0 4.0000 4.6667 3 1 3416.66 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.099e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:    7:R_R  A 3 3 6 2 Z S N  4  1  0 4.0000 4.6667 3 1 6833.32 4.053e-06 0.000e+00 2.810e-08 0.000e+00 1.190e-04 1.47e+01 1.65e+00 1 1.92e-04
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     4 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 13052.3 % faster, avg. sp. ratio 131.523x, max sp. ratio 178.193x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 4626.6/3911.8/6160.4/18506.6   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  38.3/ 22.0/ 60.5/153.1 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  38.6, min.  22.1, max.  61.1 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         32/        32/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         96/        96/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       9.538/     7.830/    11.247,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      48.403/     9.538/    14.663,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        1.969/     1.646/     2.292)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 4 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 2 /2 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 5.125e+03,  min 3.417e+03,  max 6.833e+03  (4 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 4.180e+01,  min 1.917e+01,  max 6.711e+01  (4 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 1.124e-07s (4 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 3.695e-06s, min 2.861e-06s, max 5.007e-06s, tot 1.478e-05s (4 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 6.412e-01 9.210e-01
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 2.112e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    2.000e+00 x, min 2.000e+00 x, max 2.000e+00 x (2 samples, the non-min-nrhs ones)
> #pr: Record collection took  1.25 s.
> #pr: Record comprises 50 memory benchmark samples (prepend RSB_PR_MBW=1 to dump this).
> #pr: Record comprises 81 environment variables in 3164 bytes (prepend RSB_PR_ENV=1 to dump this).
> # ======  END  Total summary record.
> #pr: ======== Saved a performance record of 8 samples to rsbench_pr__1659086390_gcc-12.1-1,4th.rpr
> # Removing the temporary record file rsbench_pr__1659086390_gcc-12.1-1,4th.rpr.tmp.
> # terminating run at 1659086403 (after 13.2s of w.c.t.)
> + ./rsbench -oa -Ob --help
> /<<PKGBUILDDIR>>/.libs/rsbench is a swiss army knife for testing the library functionality and performance.
> You can use it to perform sparse matrix - unitary vector multiplication, specifying the blocking parameters, the times to perform multiplication.
> 
> Additional debugging flags (-d, -p) are present.
> 
> Usage : /<<PKGBUILDDIR>>/.libs/rsbench [OPTIONS]
>  where OPTIONS are taken from [ -f filename ] 
> [ -F matrix_storage=[b|c|bc] ] 
> [ -r br ] 
> [ -c bc ] 
> [ -t TIMES ]
> [ -n OPENMP_THREADS ]
> [ -T ( S | D | I | C ) /* float, double, integer, character*/ ] 
> [ -s /* will internally sort out nnzs */ ] 
> [ -p /* will set to 1 nonzeros */ ] 
> [-d /* if debugging on */]: 
> [-A /* for auto-blocking */]: 
> [ -h ] 
> 
> please note that not all of the suggested numerical types could be compiled in right now and/or work well.default is double.
> 
> 
> e.g.: /<<PKGBUILDDIR>>/.libs/rsbench -f raefsky4.mtx -t 10 -T :   # 10 times for each of the supported numerical types
> /<<PKGBUILDDIR>>/.libs/rsbench  where OPTIONS are taken from :
> 	-Q		--all-flags
> 			--all-formats
> 			--all-blas-opts
> 			--all-blas-types
> 			--allow-any-transposition-combination
> 			--alpha <arg>
> 			--alternate-sort <arg>
> 	-A		--auto-blocking
> 	-v		--be-verbose
> 			--bench
> 			--beta <arg>
> 	-c		--block-columnsize <arg>
> 	-r		--block-rowsize <arg>
> 			--cache-blocking <arg>
> 			--chdir <arg>
> 	-k		--column-expand <arg>
> 			--compare-competitors
> 			--no-compare-competitors
> 	-K		--convert
> 	-d		--dense <arg>
> 			--diagonal-dominance-check
> 			--dump-n-lhs-elements <arg>
> 			--echo-arguments
> 			--flush-cache-in-iterations
> 			--impatient
> 			--no-flush-cache-in-iterations
> 			--flush-cache-around-loop
> 			--want-ancillary-execs
> 			--no-want-ancillary-execs
> 			--no-flush-cache-around-loop
> 			--want-no-recursive
> 			--want-memory-benchmark
> 			--want-no-memory-benchmark
> 			--nmb
> 	-G		--guess-blocking
> 	-h		--help
> 			--ilu0
> 			--inc <arg>
> 			--incx <arg>
> 			--incy <arg>
> 			--in-place-assembly-experimental
> 	-i		--in-place-csr
> 	-P		--in-place-permutation
> 	-l		--lower <arg>
> 			--lower-dense <arg>
> 			--generate-lowerband <arg>
> 			--gen-lband <arg>
> 			--generate-spacing <arg>
> 			--matrix-dump
> 			--matrix-dump-graph <arg>
> 			--matrix-dump-internals
> 			--merge-experimental <arg>
> 			--split-experimental <arg>
> 			--ms-experimental <arg>
> 	-f		--matrix-filename <arg>
> 			--matrix-sample-pcnt <arg>
> 	-F		--matrix-storage <arg>
> 	-M		--matrix-time
> 			--mem-hierarchy-info <arg>
> 			--max-runtime <arg>
> 	-N		--no-op
> 			--notranspose
> 			--no-transpose
> 			--nrhs <arg>
> 			--nrhs-by-rows
> 			--by-rows
> 			--nrhs-by-columns
> 			--by-columns
> 			--nrhs-by-cols
> 			--by-cols
> 			--one-nonunit-incx-incy-nrhs-per-type
> 	-n		--nthreads <arg>
> 	-B		--oski-benchmark
> 			--out-lhs
> 			--out-rhs
> 			--override-matrix-name <arg>
> 	-p		--pattern-mark
> 			--pre-transpose
> 	-b		--read-as-binary <arg>
> 			--repeat-constructor <arg>
> 			--reuse-io-arrays
> 			--no-reuse-io-arrays
> 			--reverse-alternate-rows
> 			--generate-upperband <arg>
> 			--gen-uband <arg>
> 			--generate-diagonal <arg>
> 			--gen-diag <arg>
> 			--implicit-diagonal
> 			--also-implicit-diagonal
> 			--also-symmetries
> 			--also-short-idx
> 			--also-coo-csr
> 			--also-recursive
> 			--zig-zag
> 			--subdivision-multiplier <arg>
> 			--bounded-box <arg>
> 	-s		--sort
> 			--no-leaf-multivec
> 			--with-leaf-multivec
> 			--setenv <arg>
> 			--unsetenv <arg>
> 			--sort-after-load
> 			--sort-filenames-list
> 			--no-sort-filenames-list
> 			--skip-loading-symmetric-matrices
> 			--skip-loading-unsymmetric-matrices
> 			--skip-loading-hermitian-matrices
> 			--skip-loading-not-unsymmetric-matrices
> 			--skip-loading-if-more-nnz-matrices <arg>
> 			--skip-loading-if-less-nnz-matrices <arg>
> 			--skip-loading-if-more-filesize-kb-matrices <arg>
> 			--skip-loading-if-matching-regex <arg>
> 			--skip-loading-if-matching-substr <arg>
> 	-t		--times <arg>
> 			--transpose-as <arg>
> 			--transpose
> 			--also-transpose
> 			--all-transposes
> 	-T		--type <arg>
> 	-T		--types <arg>
> 	-U		--update
> 			--as-unsymmetric
> 			--as-symmetric
> 			--expand-symmetry
> 			--as-hermitian
> 			--only-lower-triangle
> 			--only-upper-triangle
> 	-V		--verbose
> 			--less-verbose
> 			--want-io-only
> 			--want-nonzeroes-distplot
> 			--want-accuracy-test
> 			--want-getdiag-bench
> 			--want-getrow-bench
> 			--want-print-per-subm-stats
> 			--want-only-accuracy-test
> 			--want-autotune [=arg]
> 			--want-no-autotune
> 			--want-no-ones-fill
> 			--want-mkl-autotune [=arg]
> 			--want-mkl-one-based-indexing
> 			--mkl-inspector-super-light
> 			--mkl-inspector-light
> 			--mkl-inspector
> 			--mkl-no-inspector
> 			--want-unordered-coo-test
> 	-q		--with-flags <arg>
> 	-w		--write-as-binary <arg>
> 			--write-as-csr <arg>
> 			--write-performance-record [=arg]
> 			--performance-record-name-append <arg>
> 			--performance-record-name-prepend <arg>
> 			--write-no-performance-record
> 			--discard-read-zeros
> 	-z		--z-sorted-coo
> 
> Arguments to --want-autotune of the format "Ss[Xx[Tt[V[V]]]]", where S is the autotuning time in seconds, X is the number of tries, T the number of starting threads, V can be either q for quiet autotuning or v for a verbose one (can be specified twice). Valid examples: 3.0s2x4tv, 3.0s2x0tq, 3.0s, 2.0s10x . See documentation of rsb_tune_spmm for a full explanation of these parameters role in auto-tuning.
> Report bugs to michelemartone_AT_users_DOT_sourceforge_DOT_net.
> + ./rsbench --help
> Usage: rsbench [--bench] [OPTIONS] 
>   or:  rsbench [ -o OPCODE] [ -O {subprogram-code}] [ {subprogram-specific-arguments} ] 
> rsbench is a swiss army knife for testing the library functionality and performance.
> 
> 	
> Choose {subprogram-code} among:
> 
> 	r for the reference benchmark (will produce a machine specific file)
> 
> 	c for the complete benchmark
> 
> 	e for the matrix experimentation code
> 
> 	d for a single matrix dumpout
> 
> 	b for the (current, going to be obsoleted) benchmark
> 
> 	t for some matrix construction tests
> 
> 	o obsolete, will soon be removed
> 
> {subprogram-specific-arguments} will be available from the subprograms.
> 
> 	e.g.: rsbench      -O b -h   will show the current benchmark subprogram's options
> 
> 	e.g.: rsbench -o a -O b -h   will show the spmv     benchmark subprogram's options
> 
> 	e.g.: rsbench -o n -O b -h   will show the negation benchmark subprogram's options
> 
> 
> The default {subprogram-code} is 'b'
> 
> 	With OPCODE among 'actinS'
> 
> rsbench  where OPTIONS are taken from :
> 	-h		--help
> 			--bench
> 	-o		--matrix-operation <arg>
> 	-O		--subprogram-operation <arg>
> 	-I		--information
> 	-C		--configuration
> 	-H		--hardware-counters
> 	-M		--memory-benchmark
> 	-e		--experiments
> 	-v		--version
> 	-B		--blas-testing
> 	-Q		--quick-blas-testing <arg>
> 	-E		--error-testing <arg>
> 	-F		--fp-bench
> 	-t		--transpose-test
> 			--limits-testing
> 	-G		--guess-blocking
> 	-g		--generate-matrix
> 			--plot-matrix
> 			--matrix-ls
> 			--matrix-ls-latex
> 	-P		--matrix-print <arg>
> 			--read-performance-record <arg>
> 			--help-read-performance-record
> 			--setenv <arg>
> 
> Arguments to --want-autotune of the format "Ss[Xx[Tt[V[V]]]]", where S is the autotuning time in seconds, X is the number of tries, T the number of starting threads, V can be either q for quiet autotuning or v for a verbose one (can be specified twice). Valid examples: 3.0s2x4tv, 3.0s2x0tq, 3.0s, 2.0s10x . See documentation of rsb_tune_spmm for a full explanation of these parameters role in auto-tuning.
> Report bugs to michelemartone_AT_users_DOT_sourceforge_DOT_net.
> + ./rsbench --version
> /<<PKGBUILDDIR>>/.libs/rsbench version: 1.3.0
> Copyright (c) 2008-2022 Michele Martone.
> 
> Written by michelemartone_AT_users_DOT_sourceforge_DOT_net.
> + ./rsbench -I
> cache block size		: 1048576 
> hwloc size of cache level 1: 32768
> hwloc size of cache level 2: 1048576
> hwloc size of cache level 3: 34603008
> detected max available cores/threads : 8
> detected max OpenMP procs : 8
> detected max OpenMP procs : 8
> detected max OpenMP procs : 8
> detected max OpenMP procs : 8
> detected max OpenMP procs : 8
> detected max OpenMP procs : 8
> detected max OpenMP procs : 8
> detected max OpenMP procs : 8
> detected 3 levels of cache
> L1 size: 32768 
> L2 size: 1048576 
> L3 size: 34603008 
> sysconf() : 4096 bytes per pagesize
> sysconf() : 8130688 physical pages
> sysconf() : 33303298048 bytes (31760 MB) of physical memory
> sysconf() : 4732104 available (free) physical pages
> sysconf() : 19382697984 available (free) physical memory
> sysconf() , processors : 8
> sysconf() , processors online : 8
> sysconf() : level 1 cache size 32768 
> sysconf() : level 1 cache associativity 8 
> sysconf() : level 1 cache line size 64 
> sysconf() : level 2 cache size 1048576 
> sysconf() : level 2 cache associativity 16 
> sysconf() : level 2 cache line size 64 
> sysconf() : level 3 cache size 34603008 
> sysconf() : level 3 cache associativity 11 
> sysconf() : level 3 cache line size 64 
> sysconf() : no level 4 cache
> 8 bits per byte. Good.
> SHRT_MAX : 32767
> SHRT_MIN : -32768
> USHRT_MAX : 65535
> INT_MIN : -2147483648
> INT_MAX : 2147483647
> UINT_MAX : 4294967295
> LONG_MAX : 9223372036854775807
> LONG_MIN : -9223372036854775808
> ULONG_MAX : 18446744073709551615
> LLONG_MAX : 9223372036854775807
> LLONG_MIN : -9223372036854775808
> ULLONG_MAX : 18446744073709551615
> RSB_MARKER_COO_VALUE : 2147483138
> RSB_MARKER_NNZ_VALUE : 2147483393
> RSB_SUBM_IDX_MARKER : 2147483647
> RSB_MAX_ALLOCATABLE_MEMORY_CHUNK: 18446744073709551615
> timing min delta (if negative, don't complain with us)   : 0 s
> timing granularity : 2.80976e-08 s
> CFLAGS   : -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -O3 -std=c99
> CXXFLAGS : -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -fopenmp
> CC       : gcc
> memhinfo : L3:11/64/33M,L2:16/64/1M,L1:8/64/32K
> detected free  memory : 19382697984
> detected total memory : 33303298048
> for array sized 34603008 elems, took 0.0233049 s for linear search and 9.53674e-07 s for binary search for element 33554431, in 10 tries, for a total of 0.267291 s (ignore this:671088620)
> for array sized 34603008 elems, took 0.0108781 s for linear search and 9.53674e-07 s for binary search for element 16777215, in 10 tries, for a total of 0.123187 s (ignore this:1006632920)
> for array sized 34603008 elems, took 0.00516319 s for linear search and 9.53674e-07 s for binary search for element 8388607, in 18 tries, for a total of 0.105614 s (ignore this:1308622772)
> for array sized 34603008 elems, took 0.00250292 s for linear search and 9.53674e-07 s for binary search for element 4194303, in 34 tries, for a total of 0.100964 s (ignore this:1593835376)
> for array sized 34603008 elems, took 0.000697136 s for linear search and 0 s for binary search for element 2097151, in 107 tries, for a total of 0.100046 s (ignore this:2042625690)
> for array sized 34603008 elems, took 0.000329018 s for linear search and 0 s for binary search for element 1048575, in 204 tries, for a total of 0.100022 s (ignore this:-1824523006)
> for array sized 34603008 elems, took 0.000164986 s for linear search and 0 s for binary search for element 524287, in 450 tries, for a total of 0.100107 s (ignore this:-1352664706)
> for array sized 34603008 elems, took 8.2016e-05 s for linear search and 0 s for binary search for element 262143, in 1032 tries, for a total of 0.100069 s (ignore this:-811601554)
> for array sized 34603008 elems, took 4.19617e-05 s for linear search and 0 s for binary search for element 131071, in 2343 tries, for a total of 0.100032 s (ignore this:-197402848)
> for array sized 34603008 elems, took 2.09808e-05 s for linear search and 0 s for binary search for element 65535, in 4666 tries, for a total of 0.100010 s (ignore this:414169772)
> for array sized 34603008 elems, took 9.77516e-06 s for linear search and 0 s for binary search for element 32767, in 9262 tries, for a total of 0.100001 s (ignore this:1021145680)
> for array sized 34603008 elems, took 4.76837e-06 s for linear search and 0 s for binary search for element 16383, in 18334 tries, for a total of 0.100002 s (ignore this:1621877524)
> for array sized 34603008 elems, took 1.90735e-06 s for linear search and 0 s for binary search for element 8191, in 35651 tries, for a total of 0.100002 s (ignore this:-2089055090)
> for array sized 34603008 elems, took 9.53674e-07 s for linear search and 0 s for binary search for element 4095, in 69332 tries, for a total of 0.100001 s (ignore this:-1521226010)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 2047, in 128633 tries, for a total of 0.100001 s (ignore this:-994602508)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 1023, in 227607 tries, for a total of 0.100001 s (ignore this:-528918586)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 511, in 366267 tries, for a total of 0.100001 s (ignore this:-154593712)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 255, in 531248 tries, for a total of 0.100000 s (ignore this:116342768)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 127, in 684539 tries, for a total of 0.100000 s (ignore this:290215674)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 63, in 910834 tries, for a total of 0.100001 s (ignore this:404980758)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 31, in 1015481 tries, for a total of 0.100001 s (ignore this:467940580)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 15, in 1064698 tries, for a total of 0.100000 s (ignore this:499881520)
> for array sized 34603008 elems, took 0 s for linear search and 0 s for binary search for element 7, in 1086690 tries, for a total of 0.100001 s (ignore this:515095180)
> + ./rsbench -C
> /<<PKGBUILDDIR>>/.libs/rsbench version: 1.3.0
> format switches:br 
> ops:spmv_uaua,spmv_uauz,spmv_uxua,spmv_unua,spmv_sasa,spsv_uxua,spmv_sxsa,spsv_sxsx,infty_norm,rowssums,scale
> types:double,float,float complex,double complex
> type char codes:D S C Z 
> types count:4
> transposition codes:n t c 
> restrict keyword is: on
> row unrolls:1
> column unrolls:1
> reference benchmark sample minimum time (seconds):1
> reference benchmark sample minimum runs:10
> maximal configured block size:1
> sizeof(rsb_nnz_idx_t):4
> sizeof(rsb_coo_idx_t):4
> sizeof(rsb_blk_idx_t):4
> sizeof(size_t):8
> sizeof(struct rsb_mtx_t):272
> sizeof(struct rsb_blas_sparse_matrix_t):144
> sizeof(struct rsb_coo_mtx_t):48
> RSB_MAX_MATRIX_DIM:2147483137
> RSB_MAX_MATRIX_NNZ:2147483392
> RSB_CONST_MAX_SUPPORTED_CORES:128
> RSB_BLAS_MATRICES_MAX:2147482623
> RSB_CONST_MIN_NNZ_PER_ROW_FOR_COO_SWITCH:2
> RSB_USER_SET_MEM_HIERARCHY_INFO:L3:11/64/33792K,L2:16/64/1024K,L1:8/64/32K
> RSB_MAX_VALUE_FOR_TYPE(rsb_half_idx_t):65535
> RSB_IOLEVEL:7
> LIBRSBPP support: on.
> MKL support: off.
> OpenMP support: on.
> ARMPL support: off.
> XDR support: off.
> ZLIB support: on.
> Binary I/O Matrix Market hack: on.
> Assertions: off.
> Internal environment variables: off.
> + ./rsbench -oa -Ob --bench -f /<<PKGBUILDDIR>>/A.mtx --verbose --nrhs 1,4 --by-rows
> # --bench option implies -qH -R --write-performance-record --want-mkl-autotune --mkl-benchmark --types : --split-experimental 6 --merge-experimental 6 --also-transpose --sort-filenames-list --want-memory-benchmark
> # Passed 0 arguments via autotuning string "" (an empty string requests defaults)
> Will invoke autotuning for ~10.000000 s x 1 rounds, specifying verbosity=0 and threads=0. (>0 means no structure tuning; 0 means only structure tuning, <0 means tuning of both with (negated) thread count suggestion).
> Will try /<<PKGBUILDDIR>>/A.mtx
> Adding matrix file: /<<PKGBUILDDIR>>/A.mtx
> # Sorting matrices list (use --no-sort-filenames-list to prevent this)
> # Using matrices: A.mtx
> # beginning run at 1659086407
> # /<<PKGBUILDDIR>>/.libs/rsbench -oa -Ob --bench -f /<<PKGBUILDDIR>>/A.mtx --verbose --nrhs 1,4 --by-rows
> # compiled with: CC=gcc CFLAGS=-g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -O3 -std=c99
> # User did not specify threads; assuming 1. Environment provides max 8 threads; this build supports max 128.
> # User did not specify threads; assuming 1. Environment provides max 8 threads; this build supports max 128.
> # average timer granularity: 3.03e-08 s
> # Will write a final performance record to file rsbench_pr__1659086407_gcc-12.1.rpr and periodic checkpoints to rsbench_pr__1659086407_gcc-12.1.rpr.tmp
> # will NOT perform ancillary tests.
> # will flush cache memory:  between each operation measurement series, and NOT between each operation.
> # will keep any zero encountered in the matrix.
> # env: export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
> # env: export LD_LIBRARY_PATH=/<<PKGBUILDDIR>>/.libs
> # env: HOSTNAME is not set
> # env: KMP_AFFINITY is not set
> # env: OMP_AFFINITY_FORMAT is not set
> # env: OMP_ALLOCATOR is not set
> # env: OMP_CANCELLATION is not set
> # env: OMP_DEBUG is not set
> # env: OMP_DEFAULT_DEVICE is not set
> # env: OMP_DISPLAY_ENV is not set
> # env: OMP_DISPLAY_AFFINITY is not set
> # env: OMP_DYNAMIC is not set
> # env: OMP_MAX_ACTIVE_LEVELS is not set
> # env: OMP_MAX_TASK_PRIORITY is not set
> # env: OMP_NESTED is not set
> # env: OMP_NUM_THREADS is not set
> # env: OMP_PLACES is not set
> # env: OMP_PROC_BIND is not set
> # env: OMP_SCHEDULE is not set
> # env: OMP_STACKSIZE is not set
> # env: OMP_TARGET_OFFLOAD is not set
> # env: OMP_THREAD_LIMIT is not set
> # env: OMP_TOOL is not set
> # env: OMP_TOOL_LIBRARIES is not set
> # env: OMP_WAIT_POLICY is not set
> # env: RSB_WANT_RSBPP is not set
> #     using kernels from librsbpp (default).
> # env: SLURM_CLUSTER_NAME is not set
> # env: SLURM_CPUS_ON_NODE is not set
> # env: SLURM_JOB_CPUS_PER_NODE is not set
> # env: SLURM_JOB_ID is not set
> # env: SLURM_JOBID is not set
> # env: SLURM_JOB_NAME is not set
> # env: SLURM_JOB_NUM_NODES is not set
> # env: SLURM_JOB_PARTITION is not set
> # env: SLURM_NPROCS is not set
> # env: SLURM_NTASKS is not set
> # env: SLURM_STEP_TASKS_PER_NODE is not set
> # env: SLURM_TASKS_PER_NODE is not set
> # detected hostname: ip-10-84-234-251
> # user specified a verbosity level of 1 (each --verbose occurrence counts +1)
> # This test will measure times in scanning arrays sized and aligned to fit in caches.
> # 3 cache levels detected
> Will fill struct with 50 samples...
> # Memory benchmark took 11.603s
> # auto-tuning oriented output implies  times==0 iterations and sort-after-load.
> #pr: allocated a performance record for 16 samples (4480 bytes).
> # multi-type benchmarking (DSCZ) -- now using typecode D (last was D).
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # so far, program took 11.605s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.000s/0.000s .
> # reading A.mtx (184 bytes / 1 MiB / 6 nnz / 3 rows / 3 columns / 1 MiB COO) as type D...
> # file input of A.mtx took   0.00 s (6 nnz, 59919 nnz/s ) (1.84 MB/s ) 
> #pre-sorting (6 elements) took 0.000450134 s
> #weeding duplicates (to 6 elements) took 2.14577e-06 s (and check, 1.90735e-06 s )
> # multi-nrhs benchmarking (1,4) -- now using nrhs 1.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 8 threads
> # Constructed matrix (took 0.001s): (3 x 3)[0x5557beef42d0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type D, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (8 th.) took 8.011e-05s; avg 2.67e-05s ( +/-  74.11/ 72.32 %); best 6.914e-06s; worst 4.601e-05s; std dev. 1.597e-05 (taking best).
> Reference operation time is 6.91414e-06 s (3.471 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type D, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 6.914e-06  Mflops: 3.471)
> Merge (3 -> 1 leaves) took w.c.t. of 1.311e-05s, ~5.007e-06s of computing time (of which 0s sorting, 1.192e-06s analysis)
> 3 iterations (8 th.) took 1.907e-05s; avg 6.358e-06s ( +/-  99.56/200.00 %); best 2.81e-08s; worst 1.907e-05s; std dev. 8.991e-06 (taking best).
> Reference operation time is 2.80976e-08 s (854.2 Mflops) with 8 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 854.165   nsubm:1 otn:8
> Applying merge (3 -> 1 leaves, 8 th.) yielded SPEEDUP of 246.076x: 6.914e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 5.317e-05s (of which 1.502e-05s partitioning, 0s I/O); computing times: 5.007e-06s in par. loops, 0s sorting, 1.192e-06s analyzing)
> Total merge + benchmarking process took 5.317e-05s, equivalent to 1892.2/7.7 new/old ops (2.289e-05s for 2 clones -- as 814.6/3.3 ops, or 407.3/1.7 ops per clone), SPEEDUP of 246.076x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 246.076x (6.914e-06s -> 2.81e-08s), will amortize in        7.7 ops by saving 6.886e-06s per op.
> In 1 tuning rounds (tot. 0.00018s, 2.3e-05s for constructor, 2 clones) obtained a SPEEDUP of 24507.6% (246.1x) (from 3.471 to 854.2 Mflops).
> #pr: updating sample at index 1 (0^th of 16), 0^th touch for (0,0,0,0,0,0,0).
> First run of RSB Autotuner took 0.000200033 s  (6.914e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.000293016 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1
> #max:1
> #sum:3
> #norm:1.7320508075688772
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	8	3	3	6	  0.000000	  0.000689	  0.000013	  0.000702
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.000702
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	8	3	3	6	  0.000689
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	8	3	3	6	  0.000013
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	8	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	8	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.000702
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	8	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	8	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	8	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	8	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	8	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	8	3	3	6	       156
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	8	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[8]	SPMV[8]	SPMV[8]
> %operation:A.mtx	0.000790834	1e+09	1e+09
> %constructor:matrix	SORT[8]	SCAN[8]	SHUFFLE[8]	INSERT[8]
> %constructor:A.mtx	0	0.000688791	0	1.3113e-05
> # symmetric matrix --- skipping transposed benchmarking
> # multi-nrhs benchmarking (1,4) -- now using nrhs 4.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 8 threads
> # Constructed matrix (took 0.008s): (3 x 3)[0x5557beef7800]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type D, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (8 th.) took 3.695e-05s; avg 1.232e-05s ( +/-  51.61/103.23 %); best 5.96e-06s; worst 2.503e-05s; std dev. 8.991e-06 (taking best).
> Reference operation time is 5.96046e-06 s (16.11 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=4, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type D, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 5.96e-06  Mflops: 16.106)
> Merge (3 -> 1 leaves) took w.c.t. of 8.106e-06s, ~3.099e-06s of computing time (of which 9.537e-07s sorting, 1.192e-06s analysis)
> 3 iterations (8 th.) took 4.053e-06s; avg 1.351e-06s ( +/-  97.92/129.41 %); best 2.81e-08s; worst 3.099e-06s; std dev. 1.296e-06 (taking best).
> Reference operation time is 2.80976e-08 s (3417 Mflops) with 8 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 3416.658   nsubm:1 otn:8
> Applying merge (3 -> 1 leaves, 8 th.) yielded SPEEDUP of 212.134x: 5.96e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 3.195e-05s (of which 1.097e-05s partitioning, 0s I/O); computing times: 3.099e-06s in par. loops, 9.537e-07s sorting, 1.192e-06s analyzing)
> Total merge + benchmarking process took 3.195e-05s, equivalent to 1137.0/5.4 new/old ops (1.597e-05s for 2 clones -- as 568.5/2.7 ops, or 284.3/1.3 ops per clone), SPEEDUP of 212.134x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 212.134x (5.96e-06s -> 2.81e-08s), will amortize in        5.4 ops by saving 5.932e-06s per op.
> In 1 tuning rounds (tot. 0.0001s, 1.6e-05s for constructor, 2 clones) obtained a SPEEDUP of 21113.4% (212.1x) (from 16.11 to 3417 Mflops).
> #pr: updating sample at index 9 (1^th of 16), 0^th touch for (0,0,0,0,1,0,0).
> First run of RSB Autotuner took 0.000117064 s  (5.960e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.000330925 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1
> #max:1
> #sum:3
> #norm:1.7320508075688772
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	8	3	3	6	  0.000000	  0.007803	  0.000194	  0.007997
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.007997
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	8	3	3	6	  0.007803
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	8	3	3	6	  0.000194
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	8	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	8	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.007997
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	8	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	8	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	8	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	8	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	8	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	8	3	3	6	       156
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	8	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[8]	SPMV[8]	SPMV[8]
> %operation:A.mtx	0.00800991	1e+09	1e+09
> %constructor:matrix	SORT[8]	SCAN[8]	SHUFFLE[8]	INSERT[8]
> %constructor:A.mtx	0	0.00780296	0	0.000194073
> # symmetric matrix --- skipping transposed benchmarking
> # so far, program took 12.421s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.001s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 0.8688s (system CPU time used)
> ru_utime : 12.56s (user CPU time used)
> # multi-type benchmarking (DSCZ) -- now using typecode S (last was D).
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # so far, program took 12.421s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.001s/0.000s .
> # Reusing type converted (D->S) arrays from last iteration instead of reloading matrix file.
> # multi-nrhs benchmarking (1,4) -- now using nrhs 1.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 8 threads
> # Constructed matrix (took 0.001s): (3 x 3)[0x5557beefa4f0]{S} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type S, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (8 th.) took 6.795e-05s; avg 2.265e-05s ( +/-  77.89/146.32 %); best 5.007e-06s; worst 5.579e-05s; std dev. 2.345e-05 (taking best).
> Reference operation time is 5.00679e-06 s (4.793 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type S, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 5.007e-06  Mflops: 4.793)
> Merge (3 -> 1 leaves) took w.c.t. of 6.914e-06s, ~2.861e-06s of computing time (of which 0s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 4.053e-06s; avg 1.351e-06s ( +/-  97.92/200.00 %); best 2.81e-08s; worst 4.053e-06s; std dev. 1.911e-06 (taking best).
> Reference operation time is 2.80976e-08 s (854.2 Mflops) with 8 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 854.165   nsubm:1 otn:8
> Applying merge (3 -> 1 leaves, 8 th.) yielded SPEEDUP of 178.193x: 5.007e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 6.89e-05s (of which 1.001e-05s partitioning, 0s I/O); computing times: 2.861e-06s in par. loops, 0s sorting, 9.537e-07s analyzing)
> Total merge + benchmarking process took 6.89e-05s, equivalent to 2452.3/13.8 new/old ops (5.889e-05s for 2 clones -- as 2095.9/11.8 ops, or 1047.9/5.9 ops per clone), SPEEDUP of 178.193x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 178.193x (5.007e-06s -> 2.81e-08s), will amortize in       13.8 ops by saving 4.979e-06s per op.
> In 1 tuning rounds (tot. 0.00039s, 5.9e-05s for constructor, 2 clones) obtained a SPEEDUP of 17719.3% (178.2x) (from 4.793 to 854.2 Mflops).
> #pr: updating sample at index 3 (2^th of 16), 0^th touch for (0,0,0,0,0,1,0).
> First run of RSB Autotuner took 0.00040102 s  (5.007e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.00237608 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1
> #max:1
> #sum:3
> #norm:1.73205078
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	8	3	3	6	  0.000000	  0.001082	  0.000008	  0.001090
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.001090
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	8	3	3	6	  0.001082
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	8	3	3	6	  0.000008
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	8	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	8	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.001090
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	8	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	8	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	8	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	8	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	8	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	8	3	3	6	        96
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	8	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[8]	SPMV[8]	SPMV[8]
> %operation:A.mtx	0.00110292	1e+09	1e+09
> %constructor:matrix	SORT[8]	SCAN[8]	SHUFFLE[8]	INSERT[8]
> %constructor:A.mtx	0	0.00108194	0	8.10623e-06
> # symmetric matrix --- skipping transposed benchmarking
> # multi-nrhs benchmarking (1,4) -- now using nrhs 4.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 8 threads
> # Constructed matrix (took 0.000s): (3 x 3)[0x5557beefa4f0]{S} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type S, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (8 th.) took 3.004e-05s; avg 1.001e-05s ( +/-  40.48/ 69.05 %); best 5.96e-06s; worst 1.693e-05s; std dev. 4.913e-06 (taking best).
> Reference operation time is 5.96046e-06 s (16.11 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=4, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type S, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 5.96e-06  Mflops: 16.106)
> Merge (3 -> 1 leaves) took w.c.t. of 6.914e-06s, ~3.099e-06s of computing time (of which 9.537e-07s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 3.099e-06s; avg 1.033e-06s ( +/-  97.28/200.00 %); best 2.81e-08s; worst 3.099e-06s; std dev. 1.461e-06 (taking best).
> Reference operation time is 2.80976e-08 s (3417 Mflops) with 8 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 3416.658   nsubm:1 otn:8
> Applying merge (3 -> 1 leaves, 8 th.) yielded SPEEDUP of 212.134x: 5.96e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 3.004e-05s (of which 9.06e-06s partitioning, 0s I/O); computing times: 3.099e-06s in par. loops, 9.537e-07s sorting, 9.537e-07s analyzing)
> Total merge + benchmarking process took 3.004e-05s, equivalent to 1069.2/5.0 new/old ops (1.502e-05s for 2 clones -- as 534.6/2.5 ops, or 267.3/1.3 ops per clone), SPEEDUP of 212.134x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 212.134x (5.96e-06s -> 2.81e-08s), will amortize in        5.1 ops by saving 5.932e-06s per op.
> In 1 tuning rounds (tot. 0.00039s, 1.5e-05s for constructor, 2 clones) obtained a SPEEDUP of 21113.4% (212.1x) (from 16.11 to 3417 Mflops).
> #pr: updating sample at index 11 (3^th of 16), 0^th touch for (0,0,0,0,1,1,0).
> First run of RSB Autotuner took 0.000404835 s  (5.960e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.00271297 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1
> #max:1
> #sum:3
> #norm:1.73205078
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	8	3	3	6	  0.000000	  0.000310	  0.000008	  0.000318
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.000318
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	8	3	3	6	  0.000310
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	8	3	3	6	  0.000008
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	8	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	8	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.000318
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	8	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	8	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	8	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	8	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	8	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	8	3	3	6	        96
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	8	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[8]	SPMV[8]	SPMV[8]
> %operation:A.mtx	0.000330925	1e+09	1e+09
> %constructor:matrix	SORT[8]	SCAN[8]	SHUFFLE[8]	INSERT[8]
> %constructor:A.mtx	0	0.000310183	0	7.86781e-06
> # symmetric matrix --- skipping transposed benchmarking
> # so far, program took 12.978s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.007s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 1.24s (system CPU time used)
> ru_utime : 13.51s (user CPU time used)
> # multi-type benchmarking (DSCZ) -- now using typecode C (last was S).
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # so far, program took 12.978s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.007s/0.000s .
> # Reusing type converted (S->C) arrays from last iteration instead of reloading matrix file.
> # multi-nrhs benchmarking (1,4) -- now using nrhs 1.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 8 threads
> # Constructed matrix (took 0.000s): (3 x 3)[0x5557beefa4f0]{C} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type C, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (8 th.) took 4.697e-05s; avg 1.566e-05s ( +/-  55.84/ 97.97 %); best 6.914e-06s; worst 3.099e-05s; std dev. 1.088e-05 (taking best).
> Reference operation time is 6.91414e-06 s (13.88 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type C, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 6.914e-06  Mflops: 13.885)
> Merge (3 -> 1 leaves) took w.c.t. of 5.96e-06s, ~3.099e-06s of computing time (of which 1.192e-06s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 1.621e-05s; avg 5.404e-06s ( +/-  99.48/177.94 %); best 2.81e-08s; worst 1.502e-05s; std dev. 6.817e-06 (taking best).
> Reference operation time is 2.80976e-08 s (3417 Mflops) with 8 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 3416.658   nsubm:1 otn:8
> Applying merge (3 -> 1 leaves, 8 th.) yielded SPEEDUP of 246.076x: 6.914e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 4.292e-05s (of which 8.821e-06s partitioning, 0s I/O); computing times: 3.099e-06s in par. loops, 1.192e-06s sorting, 9.537e-07s analyzing)
> Total merge + benchmarking process took 4.292e-05s, equivalent to 1527.4/6.2 new/old ops (0.001432s for 2 clones -- as 50963.1/207.1 ops, or 25481.5/103.6 ops per clone), SPEEDUP of 246.076x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 246.076x (6.914e-06s -> 2.81e-08s), will amortize in        6.2 ops by saving 6.886e-06s per op.
> In 1 tuning rounds (tot. 0.0015s, 0.0014s for constructor, 2 clones) obtained a SPEEDUP of 24507.6% (246.1x) (from 13.88 to 3417 Mflops).
> #pr: updating sample at index 5 (4^th of 16), 0^th touch for (0,0,0,0,0,2,0).
> First run of RSB Autotuner took 0.00183201 s  (6.914e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.000270844 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:3 0
> #norm:1.73205078 0
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	8	3	3	6	  0.000000	  0.000157	  0.000008	  0.000165
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.000165
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	8	3	3	6	  0.000157
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	8	3	3	6	  0.000008
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	8	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	8	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.000165
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	8	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	8	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	8	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	8	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	8	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	8	3	3	6	       156
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	8	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[8]	SPMV[8]	SPMV[8]
> %operation:A.mtx	0.00017786	1e+09	1e+09
> %constructor:matrix	SORT[8]	SCAN[8]	SHUFFLE[8]	INSERT[8]
> %constructor:A.mtx	0	0.000156879	0	8.10623e-06
> # symmetric matrix --- skipping transposed benchmarking
> # multi-nrhs benchmarking (1,4) -- now using nrhs 4.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 8 threads
> # Constructed matrix (took 0.001s): (3 x 3)[0x5557beefa4f0]{C} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type C, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (8 th.) took 4.482e-05s; avg 1.494e-05s ( +/-  66.49/107.45 %); best 5.007e-06s; worst 3.099e-05s; std dev. 1.146e-05 (taking best).
> Reference operation time is 5.00679e-06 s (76.7 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=4, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type C, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 5.007e-06  Mflops: 76.696)
> Merge (3 -> 1 leaves) took w.c.t. of 7.153e-06s, ~1.907e-06s of computing time (of which 9.537e-07s sorting, 0s analysis)
> 3 iterations (8 th.) took 4.053e-06s; avg 1.351e-06s ( +/-  97.92/129.41 %); best 2.81e-08s; worst 3.099e-06s; std dev. 1.296e-06 (taking best).
> Reference operation time is 2.80976e-08 s (1.367e+04 Mflops) with 8 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 13666.633   nsubm:1 otn:8
> Applying merge (3 -> 1 leaves, 8 th.) yielded SPEEDUP of 178.193x: 5.007e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 3.099e-05s (of which 9.06e-06s partitioning, 0s I/O); computing times: 1.907e-06s in par. loops, 9.537e-07s sorting, 0s analyzing)
> Total merge + benchmarking process took 3.099e-05s, equivalent to 1103.1/6.2 new/old ops (1.574e-05s for 2 clones -- as 560.0/3.1 ops, or 280.0/1.6 ops per clone), SPEEDUP of 178.193x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 178.193x (5.007e-06s -> 2.81e-08s), will amortize in        6.2 ops by saving 4.979e-06s per op.
> In 1 tuning rounds (tot. 0.00011s, 1.6e-05s for constructor, 2 clones) obtained a SPEEDUP of 17719.3% (178.2x) (from 76.7 to 1.367e+04 Mflops).
> #pr: updating sample at index 13 (5^th of 16), 0^th touch for (0,0,0,0,1,2,0).
> First run of RSB Autotuner took 0.00012207 s  (5.007e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.000283957 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:3 0
> #norm:1.73205078 0
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	8	3	3	6	  0.000000	  0.001243	  0.000007	  0.001250
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.001250
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	8	3	3	6	  0.001243
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	8	3	3	6	  0.000007
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	8	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	8	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.001250
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	8	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	8	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	8	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	8	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	8	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	8	3	3	6	       156
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	8	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[8]	SPMV[8]	SPMV[8]
> %operation:A.mtx	0.00126219	1e+09	1e+09
> %constructor:matrix	SORT[8]	SCAN[8]	SHUFFLE[8]	INSERT[8]
> %constructor:A.mtx	0	0.00124311	0	6.91414e-06
> # symmetric matrix --- skipping transposed benchmarking
> # so far, program took 13.565s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.009s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 1.604s (system CPU time used)
> ru_utime : 14.55s (user CPU time used)
> # multi-type benchmarking (DSCZ) -- now using typecode Z (last was C).
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # so far, program took 13.565s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.009s/0.000s .
> # Reusing type converted (C->Z) arrays from last iteration instead of reloading matrix file.
> # multi-nrhs benchmarking (1,4) -- now using nrhs 1.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 8 threads
> # Constructed matrix (took 0.001s): (3 x 3)[0x5557beefa4f0]{Z} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (8 th.) took 4.101e-05s; avg 1.367e-05s ( +/-  56.40/105.81 %); best 5.96e-06s; worst 2.813e-05s; std dev. 1.024e-05 (taking best).
> Reference operation time is 5.96046e-06 s (16.11 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 5.96e-06  Mflops: 16.106)
> Merge (3 -> 1 leaves) took w.c.t. of 7.153e-06s, ~2.861e-06s of computing time (of which 9.537e-07s sorting, 1.192e-06s analysis)
> 3 iterations (8 th.) took 1.121e-05s; avg 3.735e-06s ( +/-  99.25/168.09 %); best 2.81e-08s; worst 1.001e-05s; std dev. 4.466e-06 (taking best).
> Reference operation time is 2.80976e-08 s (3417 Mflops) with 8 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 3416.658   nsubm:1 otn:8
> Applying merge (3 -> 1 leaves, 8 th.) yielded SPEEDUP of 212.134x: 5.96e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 4.005e-05s (of which 1.001e-05s partitioning, 0s I/O); computing times: 2.861e-06s in par. loops, 9.537e-07s sorting, 1.192e-06s analyzing)
> Total merge + benchmarking process took 4.005e-05s, equivalent to 1425.5/6.7 new/old ops (1.693e-05s for 2 clones -- as 602.5/2.8 ops, or 301.2/1.4 ops per clone), SPEEDUP of 212.134x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 212.134x (5.96e-06s -> 2.81e-08s), will amortize in        6.8 ops by saving 5.932e-06s per op.
> In 1 tuning rounds (tot. 0.00011s, 1.7e-05s for constructor, 2 clones) obtained a SPEEDUP of 21113.4% (212.1x) (from 16.11 to 3417 Mflops).
> #pr: updating sample at index 7 (6^th of 16), 0^th touch for (0,0,0,0,0,3,0).
> First run of RSB Autotuner took 0.000126839 s  (5.960e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.000293016 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:3 0
> #norm:1.7320508075688772 0
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	8	3	3	6	  0.000000	  0.001158	  0.000008	  0.001166
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.001166
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	8	3	3	6	  0.001158
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	8	3	3	6	  0.000008
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	8	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	8	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.001166
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	8	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	8	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	8	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	8	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	8	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	8	3	3	6	       276
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	8	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[8]	SPMV[8]	SPMV[8]
> %operation:A.mtx	0.00117993	1e+09	1e+09
> %constructor:matrix	SORT[8]	SCAN[8]	SHUFFLE[8]	INSERT[8]
> %constructor:A.mtx	0	0.001158	0	8.10623e-06
> # symmetric matrix --- skipping transposed benchmarking
> # multi-nrhs benchmarking (1,4) -- now using nrhs 4.
> # Using alpha=1 beta=1 order=rows for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # multi-transpose benchmarking -- now using transA = N.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 8 threads
> # Constructed matrix (took 0.001s): (3 x 3)[0x5557beefb300]{Z} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2442186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'S'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> 3 iterations (8 th.) took 4.911e-05s; avg 1.637e-05s ( +/-  57.77/108.25 %); best 6.914e-06s; worst 3.409e-05s; std dev. 1.254e-05 (taking best).
> Reference operation time is 6.91414e-06 s (55.54 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=4, order=rows) (max 6 steps, inclusive 3 grace steps) on: 3 x 3, type Z, 6 nnz, 2 nnz/r, 4 subms, 3 lsubms, 4.0000 bpnz (tpop: 6.914e-06  Mflops: 55.538)
> Merge (3 -> 1 leaves) took w.c.t. of 8.106e-06s, ~2.861e-06s of computing time (of which 9.537e-07s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 5.007e-06s; avg 1.669e-06s ( +/-  98.32/142.86 %); best 2.81e-08s; worst 4.053e-06s; std dev. 1.73e-06 (taking best).
> Reference operation time is 2.80976e-08 s (1.367e+04 Mflops) with 8 threads.
> After merge step 1: tpop: 2.81e-08 s   ~Mflops: 13666.633   nsubm:1 otn:8
> Applying merge (3 -> 1 leaves, 8 th.) yielded SPEEDUP of 246.076x: 6.914e-06s -> 2.81e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 3.29e-05s (of which 1.001e-05s partitioning, 0s I/O); computing times: 2.861e-06s in par. loops, 9.537e-07s sorting, 9.537e-07s analyzing)
> Total merge + benchmarking process took 3.29e-05s, equivalent to 1171.0/4.8 new/old ops (1.502e-05s for 2 clones -- as 534.6/2.2 ops, or 267.3/1.1 ops per clone), SPEEDUP of 246.076x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 246.076x (6.914e-06s -> 2.81e-08s), will amortize in        4.8 ops by saving 6.886e-06s per op.
> In 1 tuning rounds (tot. 0.00012s, 1.5e-05s for constructor, 2 clones) obtained a SPEEDUP of 24507.6% (246.1x) (from 55.54 to 1.367e+04 Mflops).
> #pr: updating sample at index 15 (7^th of 16), 0^th touch for (0,0,0,0,1,3,0).
> First run of RSB Autotuner took 0.000128031 s  (6.914e-06 s -> 2.810e-08 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Second run of RSB Autotuner took 0.00120401 s and estimated a speedup of 1.000000 x (2.810e-08 s -> 2.810e-08 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:3 0
> #norm:1.7320508075688772 0
> #used index storage compared to COO:28 vs 48 bytes (58.33%) ; compared to CSR:28 vs 40 bytes (77.78%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:A.mtx	S	N	8	3	3	6	  0.000000	  0.000912	  0.000009	  0.000921
> %:UNSORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.000921
> %:RSB_SUBDIVISION_TIME:A.mtx	S	N	8	3	3	6	  0.000912
> %:RSB_SHUFFLE_TIME:A.mtx	S	N	8	3	3	6	  0.000009
> %:ROW_MAJOR_SORT_TIME:A.mtx	S	N	8	3	3	6	  0.000000
> %:ROW_MAJOR_SORT_SCALING:A.mtx	S	N	8	3	3	6	      -nan
> %:SORTEDCOO2RSB_TIME:A.mtx	S	N	8	3	3	6	  0.000921
> %:ROW_MAJOR_SORT_TO_MOP:A.mtx	S	N	8	3	3	6	     0.000
> %:UNSORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:SORTEDCOO2RSB_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SUBDIVISION_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:RSB_SHUFFLE_SCALING:A.mtx	S	N	8	3	3	6	      1.00
> %:CONSTRUCTOR_SCALING:A.mtx	S	N	8	3	3	6	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:A.mtx	S	N	8	3	3	6	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:A.mtx	S	N	8	3	3	6	28	48	36
> %:SM_IDXOCCUPATION:A.mtx	S	N	8	3	3	6	28
> %:SM_MEMTRAFFIC:A.mtx	S	N	8	3	3	6	       276
> %:SM_MINMAXAVGNNZ:A.mtx	S	N	8	3	3	6	6	6	6
> #
> %operation:matrix	CONSTRUCTOR[8]	SPMV[8]	SPMV[8]
> %operation:A.mtx	0.000934839	1e+09	1e+09
> %constructor:matrix	SORT[8]	SCAN[8]	SHUFFLE[8]	INSERT[8]
> %constructor:A.mtx	0	0.000911951	0	9.05991e-06
> # symmetric matrix --- skipping transposed benchmarking
> # so far, program took 14.275s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.011s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 2.014s (system CPU time used)
> ru_utime : 15.63s (user CPU time used)
> # benchmarking terminated --- finalizing run.
> # ====== BEGIN Total summary record.
> #pr: ========  Limiting to type D:
> #pr:  2 samples (out of 8) matched the dump limiting criteria.
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 D S N  8  8  0 4.0000 4.6667 3 1 854.16 6.914e-06 0.000e+00 2.810e-08 0.000e+00 2.000e-04 5.27e+00 5.17e+00 1 2.40e-05
> pr:    9:R_R  A 3 3 6 4 D S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.171e-04 1.30e+01 2.79e+00 1 9.60e-05
> #pr:  2 samples (out of 8) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     2 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 22810.5 % faster, avg. sp. ratio 229.105x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 5642.8/4166.3/7119.2/11285.5   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  24.3/ 19.6/ 28.9/ 48.6 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  24.4, min.  19.7, max.  29.0 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         16/        16/        16)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         48/        48/        48)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       6.976/     4.413/     9.538,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      18.222/     5.267/    12.955,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        3.979/     2.792/     5.167)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 2 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /2 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 2.135e+03,  min 8.542e+02,  max 3.417e+03  (2 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 9.789e+00,  min 3.471e+00,  max 1.611e+01  (2 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 5.620e-08s (2 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 6.437e-06s, min 5.960e-06s, max 6.914e-06s, tot 1.287e-05s (2 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 7.529e-01 1.627e+00
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    4.000e+00 x, min 4.000e+00 x, max 4.000e+00 x (1 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to type S:
> #pr:  2 samples (out of 8) matched the dump limiting criteria.
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    3:R_R  A 3 3 6 1 S S N  8  8  0 4.0000 4.6667 3 1 854.16 5.007e-06 0.000e+00 2.810e-08 0.000e+00 4.010e-04 3.13e+00 3.17e+00 1 2.40e-05
> pr:   11:R_R  A 3 3 6 4 S S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 4.048e-04 6.98e+00 1.54e+00 1 9.60e-05
> #pr:  2 samples (out of 8) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     2 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 19416.3 % faster, avg. sp. ratio 195.163x, max sp. ratio 212.134x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 14340.3/14272.4/14408.1/28680.5   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  74.0/ 67.9/ 80.1/148.0 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  74.4, min.  68.2, max.  80.5 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were          8/         8/         8)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         24/        24/        24)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       3.986/     2.705/     5.267,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      10.108/     3.132/     6.976,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        2.354/     1.542/     3.167)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 2 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /2 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 2.135e+03,  min 8.542e+02,  max 3.417e+03  (2 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 1.045e+01,  min 4.793e+00,  max 1.611e+01  (2 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 5.620e-08s (2 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 5.484e-06s, min 5.007e-06s, max 5.960e-06s, tot 1.097e-05s (2 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 1.363e+00 2.655e+00
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    4.000e+00 x, min 4.000e+00 x, max 4.000e+00 x (1 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to type C:
> #pr:  2 samples (out of 8) matched the dump limiting criteria.
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    5:R_R  A 3 3 6 1 C S N  8  8  0 4.0000 4.6667 3 1 3416.66 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.832e-03 5.27e+00 1.29e+00 1 9.60e-05
> pr:   13:R_R  A 3 3 6 4 C S N  8  8  0 4.0000 4.6667 3 1 13666.63 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.221e-04 1.30e+01 6.98e-01 1 3.84e-04
> #pr:  2 samples (out of 8) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     2 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 21113.4 % faster, avg. sp. ratio 212.134x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 34773.0/4344.5/65201.5/69546.0   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 144.7/ 24.4/265.0/289.3 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg. 145.3, min.  24.5, max. 266.0 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         16/        16/        16)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         48/        48/        48)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       6.976/     4.413/     9.538,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      18.222/     5.267/    12.955,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        0.995/     0.698/     1.292)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 2 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /2 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 8.542e+03,  min 3.417e+03,  max 1.367e+04  (2 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 4.529e+01,  min 1.388e+01,  max 7.670e+01  (2 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 5.620e-08s (2 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 5.960e-06s, min 5.007e-06s, max 6.914e-06s, tot 1.192e-05s (2 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 7.529e-01 1.627e+00
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    4.000e+00 x, min 4.000e+00 x, max 4.000e+00 x (1 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to type Z:
> #pr:  2 samples (out of 8) matched the dump limiting criteria.
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    7:R_R  A 3 3 6 1 Z S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.268e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:   15:R_R  A 3 3 6 4 Z S N  8  8  0 4.0000 4.6667 3 1 13666.63 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.280e-04 2.49e+01 1.32e+00 1 3.84e-04
> #pr:  2 samples (out of 8) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     2 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 22810.5 % faster, avg. sp. ratio 229.105x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 4535.4/4514.2/4556.6/9070.9   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  19.9/ 18.5/ 21.3/ 39.8 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  20.0, min.  18.6, max.  21.4 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         32/        32/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         96/        96/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound      12.955/     7.830/    18.080,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      34.451/     9.538/    24.913,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        1.807/     1.323/     2.292)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 2 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /2 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (2 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 8.542e+03,  min 3.417e+03,  max 1.367e+04  (2 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 3.582e+01,  min 1.611e+01,  max 5.554e+01  (2 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 5.620e-08s (2 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 6.437e-06s, min 5.960e-06s, max 6.914e-06s, tot 1.287e-05s (2 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 3.972e-01 9.172e-01
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    4.000e+00 x, min 4.000e+00 x, max 4.000e+00 x (1 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to nrhs=1:
> #pr:  4 samples (out of 8) matched the dump limiting criteria.
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 D S N  8  8  0 4.0000 4.6667 3 1 854.16 6.914e-06 0.000e+00 2.810e-08 0.000e+00 2.000e-04 5.27e+00 5.17e+00 1 2.40e-05
> pr:    3:R_R  A 3 3 6 1 S S N  8  8  0 4.0000 4.6667 3 1 854.16 5.007e-06 0.000e+00 2.810e-08 0.000e+00 4.010e-04 3.13e+00 3.17e+00 1 2.40e-05
> pr:    5:R_R  A 3 3 6 1 C S N  8  8  0 4.0000 4.6667 3 1 3416.66 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.832e-03 5.27e+00 1.29e+00 1 9.60e-05
> pr:    7:R_R  A 3 3 6 1 Z S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.268e-04 9.54e+00 2.29e+00 1 9.60e-05
> #pr:  4 samples (out of 8) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     4 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 21961.9 % faster, avg. sp. ratio 220.619x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 22776.8/4514.2/65201.5/91107.3   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  98.8/ 21.3/265.0/395.3 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  99.3, min.  21.4, max. 266.0 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         18/         8/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         54/        24/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       4.840/     2.705/     7.830,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      23.205/     3.132/     9.538,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        2.979/     1.292/     5.167)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 4 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /4 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 2.135e+03,  min 8.542e+02,  max 3.417e+03  (4 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 9.564e+00,  min 3.471e+00,  max 1.611e+01  (4 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 1.124e-07s (4 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 6.199e-06s, min 5.007e-06s, max 6.914e-06s, tot 2.480e-05s (4 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 9.172e-01 2.655e+00
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr: ========  Limiting to nrhs=4:
> #pr:  4 samples (out of 8) matched the dump limiting criteria.
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    9:R_R  A 3 3 6 4 D S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.171e-04 1.30e+01 2.79e+00 1 9.60e-05
> pr:   11:R_R  A 3 3 6 4 S S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 4.048e-04 6.98e+00 1.54e+00 1 9.60e-05
> pr:   13:R_R  A 3 3 6 4 C S N  8  8  0 4.0000 4.6667 3 1 13666.63 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.221e-04 1.30e+01 6.98e-01 1 3.84e-04
> pr:   15:R_R  A 3 3 6 4 Z S N  8  8  0 4.0000 4.6667 3 1 13666.63 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.280e-04 2.49e+01 1.32e+00 1 3.84e-04
> #pr:  4 samples (out of 8) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     4 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 21113.4 % faster, avg. sp. ratio 212.134x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 6868.9/4166.3/14408.1/27475.6   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  32.6/ 18.5/ 67.9/130.5 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  32.8, min.  18.6, max.  68.2 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         18/         8/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         54/        24/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound      10.606/     5.267/    18.080,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      57.798/     6.976/    24.913,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        1.589/     0.698/     2.792)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 4 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /4 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 8.542e+03,  min 3.417e+03,  max 1.367e+04  (4 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 4.111e+01,  min 1.611e+01,  max 7.670e+01  (4 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 1.124e-07s (4 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 5.960e-06s, min 5.007e-06s, max 6.914e-06s, tot 2.384e-05s (4 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 3.972e-01 1.363e+00
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    4.000e+00 x, min 4.000e+00 x, max 4.000e+00 x (4 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to transA=N:
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 D S N  8  8  0 4.0000 4.6667 3 1 854.16 6.914e-06 0.000e+00 2.810e-08 0.000e+00 2.000e-04 5.27e+00 5.17e+00 1 2.40e-05
> pr:    3:R_R  A 3 3 6 1 S S N  8  8  0 4.0000 4.6667 3 1 854.16 5.007e-06 0.000e+00 2.810e-08 0.000e+00 4.010e-04 3.13e+00 3.17e+00 1 2.40e-05
> pr:    5:R_R  A 3 3 6 1 C S N  8  8  0 4.0000 4.6667 3 1 3416.66 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.832e-03 5.27e+00 1.29e+00 1 9.60e-05
> pr:    7:R_R  A 3 3 6 1 Z S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.268e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:    9:R_R  A 3 3 6 4 D S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.171e-04 1.30e+01 2.79e+00 1 9.60e-05
> pr:   11:R_R  A 3 3 6 4 S S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 4.048e-04 6.98e+00 1.54e+00 1 9.60e-05
> pr:   13:R_R  A 3 3 6 4 C S N  8  8  0 4.0000 4.6667 3 1 13666.63 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.221e-04 1.30e+01 6.98e-01 1 3.84e-04
> pr:   15:R_R  A 3 3 6 4 Z S N  8  8  0 4.0000 4.6667 3 1 13666.63 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.280e-04 2.49e+01 1.32e+00 1 3.84e-04
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     8 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 21537.7 % faster, avg. sp. ratio 216.377x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 14822.9/4166.3/65201.5/118582.9   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  65.7/ 18.5/265.0/525.7 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  66.0, min.  18.6, max. 266.0 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         18/         8/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         54/        24/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       7.723/     2.705/    18.080,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      81.003/     3.132/    24.913,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        2.284/     0.698/     5.167)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 8 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /8 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (8 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (8 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 5.339e+03,  min 8.542e+02,  max 1.367e+04  (8 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 2.534e+01,  min 3.471e+00,  max 7.670e+01  (8 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 2.248e-07s (8 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 6.080e-06s, min 5.007e-06s, max 6.914e-06s, tot 4.864e-05s (8 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 3.972e-01 2.655e+00
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    4.000e+00 x, min 4.000e+00 x, max 4.000e+00 x (4 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to both transA=N and nrhs=1:
> #pr:  4 samples (out of 8) matched the dump limiting criteria.
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 D S N  8  8  0 4.0000 4.6667 3 1 854.16 6.914e-06 0.000e+00 2.810e-08 0.000e+00 2.000e-04 5.27e+00 5.17e+00 1 2.40e-05
> pr:    3:R_R  A 3 3 6 1 S S N  8  8  0 4.0000 4.6667 3 1 854.16 5.007e-06 0.000e+00 2.810e-08 0.000e+00 4.010e-04 3.13e+00 3.17e+00 1 2.40e-05
> pr:    5:R_R  A 3 3 6 1 C S N  8  8  0 4.0000 4.6667 3 1 3416.66 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.832e-03 5.27e+00 1.29e+00 1 9.60e-05
> pr:    7:R_R  A 3 3 6 1 Z S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.268e-04 9.54e+00 2.29e+00 1 9.60e-05
> #pr:  4 samples (out of 8) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     4 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 21961.9 % faster, avg. sp. ratio 220.619x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 22776.8/4514.2/65201.5/91107.3   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  98.8/ 21.3/265.0/395.3 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  99.3, min.  21.4, max. 266.0 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         18/         8/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         54/        24/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       4.840/     2.705/     7.830,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      23.205/     3.132/     9.538,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        2.979/     1.292/     5.167)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 4 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /4 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 2.135e+03,  min 8.542e+02,  max 3.417e+03  (4 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 9.564e+00,  min 3.471e+00,  max 1.611e+01  (4 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 1.124e-07s (4 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 6.199e-06s, min 5.007e-06s, max 6.914e-06s, tot 2.480e-05s (4 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 9.172e-01 2.655e+00
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr: ========  Limiting to both transA=N and nrhs=4:
> #pr:  4 samples (out of 8) matched the dump limiting criteria.
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    9:R_R  A 3 3 6 4 D S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.171e-04 1.30e+01 2.79e+00 1 9.60e-05
> pr:   11:R_R  A 3 3 6 4 S S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 4.048e-04 6.98e+00 1.54e+00 1 9.60e-05
> pr:   13:R_R  A 3 3 6 4 C S N  8  8  0 4.0000 4.6667 3 1 13666.63 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.221e-04 1.30e+01 6.98e-01 1 3.84e-04
> pr:   15:R_R  A 3 3 6 4 Z S N  8  8  0 4.0000 4.6667 3 1 13666.63 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.280e-04 2.49e+01 1.32e+00 1 3.84e-04
> #pr:  4 samples (out of 8) matched the dump limiting criteria.
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     4 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 21113.4 % faster, avg. sp. ratio 212.134x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 6868.9/4166.3/14408.1/27475.6   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  32.6/ 18.5/ 67.9/130.5 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  32.8, min.  18.6, max.  68.2 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         18/         8/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         54/        24/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound      10.606/     5.267/    18.080,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      57.798/     6.976/    24.913,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        1.589/     0.698/     2.792)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 4 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /4 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (4 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 8.542e+03,  min 3.417e+03,  max 1.367e+04  (4 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 4.111e+01,  min 1.611e+01,  max 7.670e+01  (4 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 1.124e-07s (4 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 5.960e-06s, min 5.007e-06s, max 6.914e-06s, tot 2.384e-05s (4 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 3.972e-01 1.363e+00
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    4.000e+00 x, min 4.000e+00 x, max 4.000e+00 x (4 samples, the non-min-nrhs ones)
> #pr: ========  Limiting to transA=T:
> #pr:  No sample (out of 8) matched the dump criteria -- skipping dump round.
> #pr: ========  Limiting to both transA=T and nrhs=1:
> #pr:  No sample (out of 8) matched the dump criteria -- skipping dump round.
> #pr: ========  Limiting to both transA=T and nrhs=4:
> #pr:  No sample (out of 8) matched the dump criteria -- skipping dump round.
> #pr: ========  All results (not limiting)
> #pr: Dump from a base of 8 samples (of max 16) ordered by (1,1,1,1,2,4,2) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  A 3 3 6 1 D S N  8  8  0 4.0000 4.6667 3 1 854.16 6.914e-06 0.000e+00 2.810e-08 0.000e+00 2.000e-04 5.27e+00 5.17e+00 1 2.40e-05
> pr:    3:R_R  A 3 3 6 1 S S N  8  8  0 4.0000 4.6667 3 1 854.16 5.007e-06 0.000e+00 2.810e-08 0.000e+00 4.010e-04 3.13e+00 3.17e+00 1 2.40e-05
> pr:    5:R_R  A 3 3 6 1 C S N  8  8  0 4.0000 4.6667 3 1 3416.66 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.832e-03 5.27e+00 1.29e+00 1 9.60e-05
> pr:    7:R_R  A 3 3 6 1 Z S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.268e-04 9.54e+00 2.29e+00 1 9.60e-05
> pr:    9:R_R  A 3 3 6 4 D S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 1.171e-04 1.30e+01 2.79e+00 1 9.60e-05
> pr:   11:R_R  A 3 3 6 4 S S N  8  8  0 4.0000 4.6667 3 1 3416.66 5.960e-06 0.000e+00 2.810e-08 0.000e+00 4.048e-04 6.98e+00 1.54e+00 1 9.60e-05
> pr:   13:R_R  A 3 3 6 4 C S N  8  8  0 4.0000 4.6667 3 1 13666.63 5.007e-06 0.000e+00 2.810e-08 0.000e+00 1.221e-04 1.30e+01 6.98e-01 1 3.84e-04
> pr:   15:R_R  A 3 3 6 4 Z S N  8  8  0 4.0000 4.6667 3 1 13666.63 6.914e-06 0.000e+00 2.810e-08 0.000e+00 1.280e-04 2.49e+01 1.32e+00 1 3.84e-04
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     8 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg. 21537.7 % faster, avg. sp. ratio 216.377x, max sp. ratio 246.076x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 14822.9/4166.3/65201.5/118582.9   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of:  65.7/ 18.5/265.0/525.7 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg.  66.0, min.  18.6, max. 266.0 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were          2/         2/         2)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were          6/         6/         6)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were         18/         8/        32)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were         54/        24/        96)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      4.000/     4.000/     4.000)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       7.723/     2.705/    18.080,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      81.003/     3.132/    24.913,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        2.284/     0.698/     5.167)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      4.667/     4.667/     4.667)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 8 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /8 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (8 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.00 s, min  0.00 s, max  0.00 s, tot  0.00 s (8 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 5.339e+03,  min 8.542e+02,  max 1.367e+04  (8 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 2.534e+01,  min 3.471e+00,  max 7.670e+01  (8 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.810e-08s, min 2.810e-08s, max 2.810e-08s, tot 2.248e-07s (8 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 6.080e-06s, min 5.007e-06s, max 6.914e-06s, tot 4.864e-05s (8 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 3.972e-01 2.655e+00
> #pr: # Warning: extrapolated memory I/O bandwidth exceeds memory bandwidth --- is this a tiny matrix ?
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 1.676e+01
> #pr:  rsb nrhs-to-overall-min-rhs speed ratio was: on avg.    4.000e+00 x, min 4.000e+00 x, max 4.000e+00 x (4 samples, the non-min-nrhs ones)
> #pr: Record collection took  2.32 s.
> #pr: Record comprises 50 memory benchmark samples (prepend RSB_PR_MBW=1 to dump this).
> #pr: Record comprises 81 environment variables in 3164 bytes (prepend RSB_PR_ENV=1 to dump this).
> # ======  END  Total summary record.
> #pr: ======== Saved a performance record of 16 samples to rsbench_pr__1659086407_gcc-12.1.rpr
> # Removing the temporary record file rsbench_pr__1659086407_gcc-12.1.rpr.tmp.
> # terminating run at 1659086421 (after 14.3s of w.c.t.)
> srcdir="/<<PKGBUILDDIR>>" /bin/bash ./scripts/doc-tests.sh
> + set -o pipefail
> + test x/<<PKGBUILDDIR>> = x
> + cat /<<PKGBUILDDIR>>/examples/autotune.c /<<PKGBUILDDIR>>/examples/backsolve.c /<<PKGBUILDDIR>>/examples/hello-spblas.c /<<PKGBUILDDIR>>/examples/hello.c /<<PKGBUILDDIR>>/examples/io-spblas.c /<<PKGBUILDDIR>>/examples/power.c /<<PKGBUILDDIR>>/examples/snippets.c /<<PKGBUILDDIR>>/examples/transpose.c
> + grep '^.\{71,\}'
> + true
> + cat /<<PKGBUILDDIR>>/README
> + grep '^[^	].\{80,\}'
> + true
> ++ /<<PKGBUILDDIR>>/rsbench -h
> ++ wc -l
> + test 63 -ge 61
> ++ /<<PKGBUILDDIR>>/rsbench -h
> ++ wc -c
> + test 2014 -ge 1966
> ++ /<<PKGBUILDDIR>>/rsbench -oa -Ob -h
> ++ wc -l
> + test 182 -ge 157
> ++ /<<PKGBUILDDIR>>/rsbench -oa -Ob -h
> ++ wc -c
> + test 5353 -ge 4600
> + exit 0
> if ! ./librsb-config --help ; then echo "Problem executing the librsb-config script!"; false; fi;
> Usage: ./librsb-config [OPTION] ...
> 
> Known values for OPTION are:
> 
>   --prefix        print librsb prefix
>   --libdir        print path to directory containing library
>   --libs          print library linking information
>   --extra_libs    print extra linking information (e.g.: dependency libs)
>   --ccopts        print compiler options (no-op)
>   --cc            print C compiler
>   --fc            print Fortran compiler
>   --cxx           print C++ compiler
>   --cppflags      print C pre-processor flags (no-op)
>   --cflags        print preprocessor flags, I_opts, and compiler options
>   --cxxflags      print preprocessor flags, I_opts, and C++ compiler options
>   --fcflags       print Fortran compilation and preprocessor flags
>   --I_opts        print "-I" include options
>   --L_opts        print linker "-L" flags for dynamic linking
>   --R_opts        print dynamic linker "-R" or "-rpath" flags
>   --ldopts        print linker options (no-op)
>   --link          print suggested linker command
>   --ldflags       print linker flags (ldopts, L_opts, R_opts, and libs)
>   --fclibs        print build-time detected fortran libs
>   --static        revise subsequent outputs for static linking
>   --help          print this help and exit
>   --version       print version information
> if which lynx; then for f in doc/html/*.html ; do if lynx -dump $f | /bin/grep '\\\(see\|code\)' ; then echo "Bad Doxygen generated in $f" ; else true; fi ; done; fi
> if ./rsbench  -C | /bin/grep 'type char codes.*:*[SDCZ]' ; then cd examples ; gmake tests ; fi
> type char codes:D S C Z 
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/examples'
> if test /<<PKGBUILDDIR>> != /<<PKGBUILDDIR>> ; then cp /<<PKGBUILDDIR>>/pd.mtx /<<PKGBUILDDIR>>/vf.mtx /<<PKGBUILDDIR>>/examples ; fi
> (                                                       PATH="/<<PKGBUILDDIR>>:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games" /bin/bash /<<PKGBUILDDIR>>/examples/bench.sh;   )
> + which rsbench
> /<<PKGBUILDDIR>>/rsbench
> + BRF=test.rpr
> + rsbench -oa -Ob --bench --lower 100 --as-symmetric --types : -n 1 --notranspose --compare-competitors --verbose --verbose --write-performance-record=test.rpr
> # --bench option implies -qH -R --write-performance-record --want-mkl-autotune --mkl-benchmark --types : --split-experimental 6 --merge-experimental 6 --also-transpose --sort-filenames-list --want-memory-benchmark
> # Passed 0 arguments via autotuning string "" (an empty string requests defaults)
> Will invoke autotuning for ~10.000000 s x 1 rounds, specifying verbosity=0 and threads=0. (>0 means no structure tuning; 0 means only structure tuning, <0 means tuning of both with (negated) thread count suggestion).
> # Requested no transposition.
> # performance record file set to: test.rpr
> # beginning run at 1659086421
> # /<<PKGBUILDDIR>>/.libs/rsbench -oa -Ob --bench --lower 100 --as-symmetric --types : -n 1 --notranspose --compare-competitors --verbose --verbose --write-performance-record=test.rpr
> # compiled with: CC=gcc CFLAGS=-g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -O3 -std=c99
> # average timer granularity: 2.86e-08 s
> # Will write a final performance record to file test.rpr and periodic checkpoints to test.rpr.tmp
> # will NOT perform ancillary tests.
> # will flush cache memory:  between each operation measurement series, and NOT between each operation.
> # will keep any zero encountered in the matrix.
> # env: export PATH=/<<PKGBUILDDIR>>:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
> # env: export LD_LIBRARY_PATH=/<<PKGBUILDDIR>>/.libs
> # env: HOSTNAME is not set
> # env: KMP_AFFINITY is not set
> # env: OMP_AFFINITY_FORMAT is not set
> # env: OMP_ALLOCATOR is not set
> # env: OMP_CANCELLATION is not set
> # env: OMP_DEBUG is not set
> # env: OMP_DEFAULT_DEVICE is not set
> # env: OMP_DISPLAY_ENV is not set
> # env: OMP_DISPLAY_AFFINITY is not set
> # env: OMP_DYNAMIC is not set
> # env: OMP_MAX_ACTIVE_LEVELS is not set
> # env: OMP_MAX_TASK_PRIORITY is not set
> # env: OMP_NESTED is not set
> # env: OMP_NUM_THREADS is not set
> # env: OMP_PLACES is not set
> # env: OMP_PROC_BIND is not set
> # env: OMP_SCHEDULE is not set
> # env: OMP_STACKSIZE is not set
> # env: OMP_TARGET_OFFLOAD is not set
> # env: OMP_THREAD_LIMIT is not set
> # env: OMP_TOOL is not set
> # env: OMP_TOOL_LIBRARIES is not set
> # env: OMP_WAIT_POLICY is not set
> # env: RSB_WANT_RSBPP is not set
> #     using kernels from librsbpp (default).
> # env: SLURM_CLUSTER_NAME is not set
> # env: SLURM_CPUS_ON_NODE is not set
> # env: SLURM_JOB_CPUS_PER_NODE is not set
> # env: SLURM_JOB_ID is not set
> # env: SLURM_JOBID is not set
> # env: SLURM_JOB_NAME is not set
> # env: SLURM_JOB_NUM_NODES is not set
> # env: SLURM_JOB_PARTITION is not set
> # env: SLURM_NPROCS is not set
> # env: SLURM_NTASKS is not set
> # env: SLURM_STEP_TASKS_PER_NODE is not set
> # env: SLURM_TASKS_PER_NODE is not set
> # detected hostname: ip-10-84-234-251
> # user specified a verbosity level of 2 (each --verbose occurrence counts +1)
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # This test will measure times in scanning arrays sized and aligned to fit in caches.
> # 3 cache levels detected
> Will fill struct with 50 samples...
> # Memory benchmark took 11.549s
> # auto-tuning oriented output implies  times==0 iterations and sort-after-load.
> #pr: allocated a performance record for 4 samples (1120 bytes).
> # multi-type benchmarking (DSCZ) -- now using typecode D (last was D).
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # so far, program took 11.550s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.000s/0.000s .
> # Using 1 threads
> # Using alpha=1 beta=1 order=cols for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_LOWER, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 1 threads
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.878e-03 s (100.00 %)
>  analyzed arrays in 1.141e-03 s (60.75 %)
>  cleaned-up arrays in 1.001e-05 s (0.53 %)
>  deduplicated arrays in 8.106e-06 s (0.43 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 7.915e-05 s (4.21 %)
>  memory allocations took 2.980e-05 s (1.59 %)
>  leafs setup took 2.861e-06 s (0.15 %)
>  halfword conversion took 6.042e-04 s (32.17 %)
> Built (100 x 100)[0x5590205aba20]{D} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x2446196 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LS'
> # Constructed matrix (took 0.002s): (100 x 100)[0x5590205aba20]{D} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x2446196 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LS'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 100 x 100, type D, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz.
> Parameters: verbosity:2 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> Saved plot to  test-tuning-lower-100x100-5050nz--D-N-1--base.eps
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.000108s; avg 3.6e-05s ( +/-  50.33/ 78.15 %); best 1.788e-05s; worst 6.413e-05s; std dev. 2.017e-05 (taking best).
> Reference operation time is 1.78814e-05 s (1130 Mflops) with 1 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=cols) (max 6 steps, inclusive 3 grace steps) on: 100 x 100, type D, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz (tpop: 1.788e-05  Mflops: 1129.666)
> Merge (3 -> 1 leaves) took w.c.t. of 0.001101s, ~0.001091s of computing time (of which 0.000133s sorting, 2.146e-06s analysis)
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 3.791e-05s; avg 1.264e-05s ( +/-  20.75/ 26.42 %); best 1.001e-05s; worst 1.597e-05s; std dev. 2.485e-06 (taking best).
> Reference operation time is 1.00136e-05 s (2017 Mflops) with 1 threads.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> After merge step 1: tpop: 1.001e-05 s   ~Mflops: 2017.260   nsubm:1 otn:1
> Applying merge (3 -> 1 leaves, 1 th.) yielded SPEEDUP of  1.786x: 1.788e-05s -> 1.001e-05s, so taking this instance.
> Saved plot to  test-tuning-lower-100x100-5050nz--D-N-1--mv-tuned_merge1_1x1th.eps
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 0.01232s (of which 0.001106s partitioning, 0.009063s I/O); computing times: 0.001091s in par. loops, 0.000133s sorting, 2.146e-06s analyzing)
> Total merge + benchmarking process took 0.01232s, equivalent to 1230.6/689.1 new/old ops (0.003365s for 2 clones -- as 336.0/188.2 ops, or 168.0/94.1 ops per clone), SPEEDUP of  1.786x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 1 th.sp.) yielded SPEEDUP of  1.786x (1.788e-05s -> 1.001e-05s), will amortize in     1566.2 ops by saving 7.868e-06s per op.
> In 1 tuning rounds (tot. 0.014s, 0.0034s for constructor, 2 clones) obtained a SPEEDUP of   78.6% (1.786x) (from 1130 to 2017 Mflops). Employed 0.0073s for I/O of matrix plots.
> #pr: updating sample at index 1 (0^th of 4), 0^th touch for (0,0,0,0,0,0,0).
> First run of RSB Autotuner took 0.021816 s  (1.788e-05 s -> 1.001e-05 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Will autotune matrix: 100 x 100, type D, 5050 nnz, 50 nnz/r, 1 subms, 1 lsubms, 2.0800 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:10
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Started tuning inner round: will search for an optimal matrix instance.
> Starting with requested 0 threads ; current default 1 ; at most 8.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 4.196e-05s; avg 1.399e-05s ( +/-  14.77/ 29.55 %); best 1.192e-05s; worst 1.812e-05s; std dev. 2.922e-06 (taking best).
> Reference operation time is 1.19209e-05 s (1694 Mflops) with 1 threads.
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.089e-02 s (100.00 %)
>  analyzed arrays in 3.809e-03 s (34.97 %)
>  cleaned-up arrays in 2.480e-05 s (0.23 %)
>  deduplicated arrays in 1.717e-05 s (0.16 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 7.024e-03 s (64.49 %)
>  memory allocations took 3.099e-06 s (0.03 %)
>  leafs setup took 5.007e-06 s (0.05 %)
>  halfword conversion took 8.106e-06 s (0.07 %)
> Built (100 x 100)[0x5590205ae5d0]{D} @ (0(0..100),0(0..100)) (5050 nnz, 50 nnz/r) flags 0x42644094 (coo:0, csr:1, hw:0, ic:1, fi:0), storage: 1, subm: 1, symflags:'LS'
> Starting autotuning stage, with subdivision of 1 (current threads=1, requested threads=0, max threads = 8).
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.655e-03 s (100.00 %)
>  analyzed arrays in 1.574e-03 s (95.10 %)
>  cleaned-up arrays in 1.502e-05 s (0.91 %)
>  deduplicated arrays in 2.599e-05 s (1.57 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 2.599e-05 s (1.57 %)
>  memory allocations took 4.053e-06 s (0.24 %)
>  leafs setup took 9.537e-07 s (0.06 %)
>  halfword conversion took 8.106e-06 s (0.49 %)
> Built (100 x 100)[0x55902057f300]{D} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 3, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.004537s; avg 0.001512s ( +/-  98.75/ 78.61 %); best 1.884e-05s; worst 0.002701s; std dev. 0.001116 (taking best).
> Reference operation time is 1.88351e-05 s (1072 Mflops) with 1 threads.
> Challenging best inner round reference (1.19209e-05 s/1 threads) with: subdivision 0.25, 3 leaves, 2.121 bytes/nz, 1.88351e-05 s/0 threads (speedup 0.632911 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type D, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 9.799e-05 s (100.00 %)
>  analyzed arrays in 2.909e-05 s (29.68 %)
>  cleaned-up arrays in 1.502e-05 s (15.33 %)
>  deduplicated arrays in 1.502e-05 s (15.33 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 2.503e-05 s (25.55 %)
>  memory allocations took 3.815e-06 s (3.89 %)
>  leafs setup took 2.146e-06 s (2.19 %)
>  halfword conversion took 6.914e-06 s (7.06 %)
> Built (100 x 100)[0x5590205aec00]{D} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 10, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.002883s; avg 0.000961s ( +/-  80.33/ 78.88 %); best 0.0001891s; worst 0.001719s; std dev. 0.0006247 (taking best).
> Reference operation time is 0.000189066 s (106.8 Mflops) with 1 threads.
> Challenging best inner round reference (1.19209e-05 s/1 threads) with: subdivision 0.5, 10 leaves, 2.206 bytes/nz, 0.000189066 s/0 threads (speedup 0.0630517 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type D, 5050 nnz, 50 nnz/r, 14 subms, 10 lsubms, 2.2059 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 3.943e-03 s (100.00 %)
>  analyzed arrays in 3.000e-03 s (76.09 %)
>  cleaned-up arrays in 1.407e-05 s (0.36 %)
>  deduplicated arrays in 1.407e-05 s (0.36 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 8.981e-04 s (22.78 %)
>  memory allocations took 4.768e-06 s (0.12 %)
>  leafs setup took 2.861e-06 s (0.07 %)
>  halfword conversion took 8.106e-06 s (0.21 %)
> Built (100 x 100)[0x5590205aba20]{D} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 22, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.006833s; avg 0.002278s ( +/-  90.21/115.40 %); best 0.0002229s; worst 0.004906s; std dev. 0.001954 (taking best).
> Reference operation time is 0.000222921 s (90.61 Mflops) with 1 threads.
> Challenging best inner round reference (1.19209e-05 s/1 threads) with: subdivision 1, 22 leaves, 2.295 bytes/nz, 0.000222921 s/0 threads (speedup 0.0534759 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type D, 5050 nnz, 50 nnz/r, 30 subms, 22 lsubms, 2.2947 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.853e-02 s (100.00 %)
>  analyzed arrays in 4.453e-03 s (24.04 %)
>  cleaned-up arrays in 1.407e-05 s (0.08 %)
>  deduplicated arrays in 1.383e-05 s (0.07 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 1.402e-02 s (75.67 %)
>  memory allocations took 7.153e-06 s (0.04 %)
>  leafs setup took 8.106e-06 s (0.04 %)
>  halfword conversion took 1.001e-05 s (0.05 %)
> Built (100 x 100)[0x559020559ff0]{D} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 36, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.005911s; avg 0.00197s ( +/-  98.52/107.93 %); best 2.909e-05s; worst 0.004097s; std dev. 0.001666 (taking best).
> Reference operation time is 2.90871e-05 s (694.5 Mflops) with 1 threads.
> Challenging best inner round reference (1.19209e-05 s/1 threads) with: subdivision 2, 36 leaves, 2.383 bytes/nz, 2.90871e-05 s/0 threads (speedup 0.409836 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type D, 5050 nnz, 50 nnz/r, 50 subms, 36 lsubms, 2.3834 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.819e-04 s (100.00 %)
>  analyzed arrays in 5.794e-05 s (31.85 %)
>  cleaned-up arrays in 1.502e-05 s (8.26 %)
>  deduplicated arrays in 1.407e-05 s (7.73 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 5.484e-05 s (30.14 %)
>  memory allocations took 2.909e-05 s (15.99 %)
>  leafs setup took 2.146e-06 s (1.18 %)
>  halfword conversion took 7.868e-06 s (4.33 %)
> Built (100 x 100)[0x5590205401d0]{D} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 33, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.003264s; avg 0.001088s ( +/-  97.52/ 95.78 %); best 2.694e-05s; worst 0.00213s; std dev. 0.0008587 (taking best).
> Reference operation time is 2.69413e-05 s (749.8 Mflops) with 1 threads.
> Challenging best inner round reference (1.19209e-05 s/1 threads) with: subdivision 4, 33 leaves, 2.361 bytes/nz, 2.69413e-05 s/0 threads (speedup 0.442478 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type D, 5050 nnz, 50 nnz/r, 46 subms, 33 lsubms, 2.3612 bpnz
> Best sparse multiply performance with subdivision multiplier of 1: 1694.5 Mflops.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Last tuner inner round (1 of 1) took 0.0616498 s (eq. to  5e+03/ 5e+03 old/new op.times), gained local/global speedup 1 x (1.19209e-05 : 1.19209e-05) / 1 x (1.19209e-05 : 1.19209e-05). This is not amortizable !
> Auto tuning inner round 1 did not find a configuration better than the original.
> In 1 tuning rounds (tot. 0.062s, 0.038s for constructor, 0 clones) obtained NO speedup (best stays 1694 Mflops).
> Second run of RSB Autotuner took 0.0617108 s and estimated a speedup of 1.000000 x (1.192e-05 s -> 1.192e-05 s per op) in same matrix (1 -> 1 lsubm)
> #min:1
> #max:1
> #sum:100
> #norm:10
> #used index storage compared to COO:10504 vs 40400 bytes (26.00%) ; compared to CSR:10504 vs 20604 bytes (50.99%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000000	  0.001141	  0.000079	  0.001220
> %:UNSORTEDCOO2RSB_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001220
> %:RSB_SUBDIVISION_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001141
> %:RSB_SHUFFLE_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000079
> %:ROW_MAJOR_SORT_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000000
> %:ROW_MAJOR_SORT_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      -nan
> %:SORTEDCOO2RSB_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001220
> %:ROW_MAJOR_SORT_TO_MOP:lower-100x100-5050nz	S	N	1	100	100	5050	     0.000
> %:UNSORTEDCOO2RSB_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:SORTEDCOO2RSB_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:RSB_SUBDIVISION_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:RSB_SHUFFLE_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:CONSTRUCTOR_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:lower-100x100-5050nz	S	N	1	100	100	5050	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:lower-100x100-5050nz	S	N	1	100	100	5050	10504	40400	20600
> %:SM_IDXOCCUPATION:lower-100x100-5050nz	S	N	1	100	100	5050	10504
> %:SM_MEMTRAFFIC:lower-100x100-5050nz	S	N	1	100	100	5050	    102200
> %:SM_MINMAXAVGNNZ:lower-100x100-5050nz	S	N	1	100	100	5050	5050	5050	5050
> #
> %operation:matrix	CONSTRUCTOR[1]	SPMV[1]	SPMV[1]
> %operation:lower-100x100-5050nz	0.00187802	1e+09	1e+09
> %constructor:matrix	SORT[1]	SCAN[1]	SHUFFLE[1]	INSERT[1]
> %constructor:lower-100x100-5050nz	0	0.00114083	0	7.9155e-05
> # so far, program took 11.986s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.084s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 0.6636s (system CPU time used)
> ru_utime : 12.48s (user CPU time used)
> # multi-type benchmarking (DSCZ) -- now using typecode S (last was D).
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # so far, program took 11.986s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.084s/0.000s .
> # Using 1 threads
> # Using alpha=1 beta=1 order=cols for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_LOWER, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 1 threads
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.567e-03 s (100.00 %)
>  analyzed arrays in 1.493e-03 s (95.28 %)
>  cleaned-up arrays in 1.502e-05 s (0.96 %)
>  deduplicated arrays in 1.502e-05 s (0.96 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 2.789e-05 s (1.78 %)
>  memory allocations took 4.768e-06 s (0.30 %)
>  leafs setup took 9.537e-07 s (0.06 %)
>  halfword conversion took 9.060e-06 s (0.58 %)
> Built (100 x 100)[0x5590205aba20]{S} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x2446196 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LS'
> # Constructed matrix (took 0.002s): (100 x 100)[0x5590205aba20]{S} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x2446196 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LS'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 100 x 100, type S, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz.
> Parameters: verbosity:2 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> Saved plot to  test-tuning-lower-100x100-5050nz--S-N-1--base.eps
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 7.915e-05s; avg 2.638e-05s ( +/-  39.46/ 70.78 %); best 1.597e-05s; worst 4.506e-05s; std dev. 1.324e-05 (taking best).
> Reference operation time is 1.5974e-05 s (1265 Mflops) with 1 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=cols) (max 6 steps, inclusive 3 grace steps) on: 100 x 100, type S, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz (tpop: 1.597e-05  Mflops: 1264.551)
> Merge (3 -> 1 leaves) took w.c.t. of 4.792e-05s, ~4.196e-05s of computing time (of which 2.122e-05s sorting, 9.537e-07s analysis)
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 2.694e-05s; avg 8.98e-06s ( +/-   1.77/  0.88 %); best 8.821e-06s; worst 9.06e-06s; std dev. 1.124e-07 (taking best).
> Reference operation time is 8.82149e-06 s (2290 Mflops) with 1 threads.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> After merge step 1: tpop: 8.821e-06 s   ~Mflops: 2289.863   nsubm:1 otn:1
> Applying merge (3 -> 1 leaves, 1 th.) yielded SPEEDUP of  1.811x: 1.597e-05s -> 8.821e-06s, so taking this instance.
> Saved plot to  test-tuning-lower-100x100-5050nz--S-N-1--mv-tuned_merge1_1x1th.eps
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 0.004878s (of which 4.983e-05s partitioning, 0.002702s I/O); computing times: 4.196e-05s in par. loops, 2.122e-05s sorting, 9.537e-07s analyzing)
> Total merge + benchmarking process took 0.004878s, equivalent to 553.0/305.4 new/old ops (0.002959s for 2 clones -- as 335.4/185.2 ops, or 167.7/92.6 ops per clone), SPEEDUP of  1.811x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 1 th.sp.) yielded SPEEDUP of  1.811x (1.597e-05s -> 8.821e-06s), will amortize in      682.0 ops by saving 7.153e-06s per op.
> In 1 tuning rounds (tot. 0.0062s, 0.003s for constructor, 2 clones) obtained a SPEEDUP of   81.1% (1.811x) (from 1265 to 2290 Mflops). Employed 0.0027s for I/O of matrix plots.
> #pr: updating sample at index 2 (1^th of 4), 0^th touch for (0,0,0,0,0,1,0).
> First run of RSB Autotuner took 0.00921893 s  (1.597e-05 s -> 8.821e-06 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Will autotune matrix: 100 x 100, type S, 5050 nnz, 50 nnz/r, 1 subms, 1 lsubms, 2.0800 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:10
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Started tuning inner round: will search for an optimal matrix instance.
> Starting with requested 0 threads ; current default 1 ; at most 8.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 3.123e-05s; avg 1.041e-05s ( +/-  12.98/ 25.95 %); best 9.06e-06s; worst 1.311e-05s; std dev. 1.911e-06 (taking best).
> Reference operation time is 9.05991e-06 s (2230 Mflops) with 1 threads.
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.533e-03 s (100.00 %)
>  analyzed arrays in 1.473e-03 s (96.08 %)
>  cleaned-up arrays in 1.478e-05 s (0.96 %)
>  deduplicated arrays in 2.718e-05 s (1.77 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 1.097e-05 s (0.72 %)
>  memory allocations took 3.099e-06 s (0.20 %)
>  leafs setup took 1.192e-06 s (0.08 %)
>  halfword conversion took 2.861e-06 s (0.19 %)
> Built (100 x 100)[0x5590205ae5d0]{S} @ (0(0..100),0(0..100)) (5050 nnz, 50 nnz/r) flags 0x42644094 (coo:0, csr:1, hw:0, ic:1, fi:0), storage: 1, subm: 1, symflags:'LS'
> Starting autotuning stage, with subdivision of 1 (current threads=1, requested threads=0, max threads = 8).
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.740e-04 s (100.00 %)
>  analyzed arrays in 1.340e-04 s (76.99 %)
>  cleaned-up arrays in 1.383e-05 s (7.95 %)
>  deduplicated arrays in 1.407e-05 s (8.08 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 5.007e-06 s (2.88 %)
>  memory allocations took 3.099e-06 s (1.78 %)
>  leafs setup took 0.000e+00 s (0.00 %)
>  halfword conversion took 4.053e-06 s (2.33 %)
> Built (100 x 100)[0x5590205ac0c0]{S} @ (0(0..100),0(0..100)) (5050 nnz, 50 nnz/r) flags 0x42644096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 1, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 3.791e-05s; avg 1.264e-05s ( +/-  13.21/ 18.87 %); best 1.097e-05s; worst 1.502e-05s; std dev. 1.73e-06 (taking best).
> Reference operation time is 1.09673e-05 s (1842 Mflops) with 1 threads.
> Challenging best inner round reference (9.05991e-06 s/1 threads) with: subdivision 0.25, 1 leaves,  2.08 bytes/nz, 1.09673e-05 s/0 threads (speedup 0.826087 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type S, 5050 nnz, 50 nnz/r, 1 subms, 1 lsubms, 2.0800 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 4.472e-03 s (100.00 %)
>  analyzed arrays in 3.861e-03 s (86.34 %)
>  cleaned-up arrays in 1.407e-05 s (0.31 %)
>  deduplicated arrays in 2.503e-05 s (0.56 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 5.600e-04 s (12.52 %)
>  memory allocations took 2.861e-06 s (0.06 %)
>  leafs setup took 2.146e-06 s (0.05 %)
>  halfword conversion took 6.914e-06 s (0.15 %)
> Built (100 x 100)[0x5590205ac0c0]{S} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 6, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.003623s; avg 0.001208s ( +/-  98.50/ 55.25 %); best 1.812e-05s; worst 0.001875s; std dev. 0.0008432 (taking best).
> Reference operation time is 1.81198e-05 s (1115 Mflops) with 1 threads.
> Challenging best inner round reference (9.05991e-06 s/1 threads) with: subdivision 0.5, 6 leaves, 2.163 bytes/nz, 1.81198e-05 s/0 threads (speedup 0.5 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type S, 5050 nnz, 50 nnz/r, 8 subms, 6 lsubms, 2.1632 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.981e-04 s (100.00 %)
>  analyzed arrays in 2.885e-05 s (14.56 %)
>  cleaned-up arrays in 1.407e-05 s (7.10 %)
>  deduplicated arrays in 1.383e-05 s (6.98 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 3.004e-05 s (15.16 %)
>  memory allocations took 4.292e-06 s (2.17 %)
>  leafs setup took 2.146e-06 s (1.08 %)
>  halfword conversion took 1.040e-04 s (52.47 %)
> Built (100 x 100)[0x55902059a790]{S} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 17, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.006474s; avg 0.002158s ( +/-  98.84/117.70 %); best 2.503e-05s; worst 0.004698s; std dev. 0.001929 (taking best).
> Reference operation time is 2.5034e-05 s (806.9 Mflops) with 1 threads.
> Challenging best inner round reference (9.05991e-06 s/1 threads) with: subdivision 1, 17 leaves, 2.251 bytes/nz, 2.5034e-05 s/0 threads (speedup 0.361905 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type S, 5050 nnz, 50 nnz/r, 23 subms, 17 lsubms, 2.2511 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.788e-04 s (100.00 %)
>  analyzed arrays in 6.509e-05 s (36.40 %)
>  cleaned-up arrays in 1.407e-05 s (7.87 %)
>  deduplicated arrays in 1.407e-05 s (7.87 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 6.795e-05 s (38.00 %)
>  memory allocations took 4.768e-06 s (2.67 %)
>  leafs setup took 3.099e-06 s (1.73 %)
>  halfword conversion took 7.868e-06 s (4.40 %)
> Built (100 x 100)[0x55902059a790]{S} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 36, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.001856s; avg 0.0006187s ( +/-  92.87/184.16 %); best 4.411e-05s; worst 0.001758s; std dev. 0.0008057 (taking best).
> Reference operation time is 4.41074e-05 s (458 Mflops) with 1 threads.
> Challenging best inner round reference (9.05991e-06 s/1 threads) with: subdivision 2, 36 leaves, 2.383 bytes/nz, 4.41074e-05 s/0 threads (speedup 0.205405 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type S, 5050 nnz, 50 nnz/r, 50 subms, 36 lsubms, 2.3834 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 3.691e-04 s (100.00 %)
>  analyzed arrays in 1.199e-04 s (32.49 %)
>  cleaned-up arrays in 1.407e-05 s (3.81 %)
>  deduplicated arrays in 1.502e-05 s (4.07 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 2.038e-04 s (55.23 %)
>  memory allocations took 6.199e-06 s (1.68 %)
>  leafs setup took 3.099e-06 s (0.84 %)
>  halfword conversion took 6.914e-06 s (1.87 %)
> Built (100 x 100)[0x559020550210]{S} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 36, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.001799s; avg 0.0005996s ( +/-  95.83/190.85 %); best 2.503e-05s; worst 0.001744s; std dev. 0.0008092 (taking best).
> Reference operation time is 2.5034e-05 s (806.9 Mflops) with 1 threads.
> Challenging best inner round reference (9.05991e-06 s/1 threads) with: subdivision 4, 36 leaves, 2.383 bytes/nz, 2.5034e-05 s/0 threads (speedup 0.361905 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type S, 5050 nnz, 50 nnz/r, 50 subms, 36 lsubms, 2.3834 bpnz
> Best sparse multiply performance with subdivision multiplier of 1: 2229.6 Mflops.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Last tuner inner round (1 of 1) took 0.0231879 s (eq. to  3e+03/ 3e+03 old/new op.times), gained local/global speedup 1 x (9.05991e-06 : 9.05991e-06) / 1 x (9.05991e-06 : 9.05991e-06). This is not amortizable !
> Auto tuning inner round 1 did not find a configuration better than the original.
> In 1 tuning rounds (tot. 0.023s, 0.0071s for constructor, 0 clones) obtained NO speedup (best stays 2230 Mflops).
> Second run of RSB Autotuner took 0.0232358 s and estimated a speedup of 1.000000 x (9.060e-06 s -> 9.060e-06 s per op) in same matrix (1 -> 1 lsubm)
> #min:1
> #max:1
> #sum:100
> #norm:10
> #used index storage compared to COO:10504 vs 40400 bytes (26.00%) ; compared to CSR:10504 vs 20604 bytes (50.99%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000000	  0.001493	  0.000028	  0.001521
> %:UNSORTEDCOO2RSB_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001521
> %:RSB_SUBDIVISION_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001493
> %:RSB_SHUFFLE_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000028
> %:ROW_MAJOR_SORT_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000000
> %:ROW_MAJOR_SORT_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      -nan
> %:SORTEDCOO2RSB_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001521
> %:ROW_MAJOR_SORT_TO_MOP:lower-100x100-5050nz	S	N	1	100	100	5050	     0.000
> %:UNSORTEDCOO2RSB_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:SORTEDCOO2RSB_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:RSB_SUBDIVISION_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:RSB_SHUFFLE_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:CONSTRUCTOR_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:lower-100x100-5050nz	S	N	1	100	100	5050	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:lower-100x100-5050nz	S	N	1	100	100	5050	10504	40400	20600
> %:SM_IDXOCCUPATION:lower-100x100-5050nz	S	N	1	100	100	5050	10504
> %:SM_MEMTRAFFIC:lower-100x100-5050nz	S	N	1	100	100	5050	     61400
> %:SM_MINMAXAVGNNZ:lower-100x100-5050nz	S	N	1	100	100	5050	5050	5050	5050
> #
> %operation:matrix	CONSTRUCTOR[1]	SPMV[1]	SPMV[1]
> %operation:lower-100x100-5050nz	0.00156689	1e+09	1e+09
> %constructor:matrix	SORT[1]	SCAN[1]	SHUFFLE[1]	INSERT[1]
> %constructor:lower-100x100-5050nz	0	0.00149298	0	2.7895e-05
> # so far, program took 12.296s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.116s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 0.8765s (system CPU time used)
> ru_utime : 13.15s (user CPU time used)
> # multi-type benchmarking (DSCZ) -- now using typecode C (last was D).
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # so far, program took 12.297s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.116s/0.000s .
> # Using 1 threads
> # Using alpha=1 beta=1 order=cols for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_LOWER, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 1 threads
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.306e-03 s (100.00 %)
>  analyzed arrays in 1.228e-03 s (94.03 %)
>  cleaned-up arrays in 1.597e-05 s (1.22 %)
>  deduplicated arrays in 1.502e-05 s (1.15 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 3.099e-05 s (2.37 %)
>  memory allocations took 5.007e-06 s (0.38 %)
>  leafs setup took 9.537e-07 s (0.07 %)
>  halfword conversion took 9.060e-06 s (0.69 %)
> Built (100 x 100)[0x55902059a790]{C} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x2446196 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LS'
> # Constructed matrix (took 0.001s): (100 x 100)[0x55902059a790]{C} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x2446196 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LS'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 100 x 100, type C, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz.
> Parameters: verbosity:2 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> Saved plot to  test-tuning-lower-100x100-5050nz--C-N-1--base.eps
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.0001791s; avg 5.968e-05s ( +/-  11.32/ 20.64 %); best 5.293e-05s; worst 7.2e-05s; std dev. 8.724e-06 (taking best).
> Reference operation time is 5.29289e-05 s (1527 Mflops) with 1 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=cols) (max 6 steps, inclusive 3 grace steps) on: 100 x 100, type C, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz (tpop: 5.293e-05  Mflops: 1526.576)
> Merge (3 -> 1 leaves) took w.c.t. of 5.412e-05s, ~4.506e-05s of computing time (of which 2.313e-05s sorting, 2.146e-06s analysis)
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.0001259s; avg 4.196e-05s ( +/-  11.93/ 18.75 %); best 3.695e-05s; worst 4.983e-05s; std dev. 5.632e-06 (taking best).
> Reference operation time is 3.69549e-05 s (2186 Mflops) with 1 threads.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> After merge step 1: tpop: 3.695e-05 s   ~Mflops: 2186.450   nsubm:1 otn:1
> Applying merge (3 -> 1 leaves, 1 th.) yielded SPEEDUP of  1.432x: 5.293e-05s -> 3.695e-05s, so taking this instance.
> Saved plot to  test-tuning-lower-100x100-5050nz--C-N-1--mv-tuned_merge1_1x1th.eps
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 0.004327s (of which 5.794e-05s partitioning, 0.002726s I/O); computing times: 4.506e-05s in par. loops, 2.313e-05s sorting, 2.146e-06s analyzing)
> Total merge + benchmarking process took 0.004327s, equivalent to 117.1/81.8 new/old ops (0.001407s for 2 clones -- as 38.1/26.6 ops, or 19.0/13.3 ops per clone), SPEEDUP of  1.432x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 1 th.sp.) yielded SPEEDUP of  1.432x (5.293e-05s -> 3.695e-05s), will amortize in      270.9 ops by saving 1.597e-05s per op.
> In 1 tuning rounds (tot. 0.0049s, 0.0014s for constructor, 2 clones) obtained a SPEEDUP of   43.2% (1.432x) (from 1527 to 2186 Mflops). Employed 0.003s for I/O of matrix plots.
> #pr: updating sample at index 3 (2^th of 4), 0^th touch for (0,0,0,0,0,2,0).
> First run of RSB Autotuner took 0.00787091 s  (5.293e-05 s -> 3.695e-05 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Will autotune matrix: 100 x 100, type C, 5050 nnz, 50 nnz/r, 1 subms, 1 lsubms, 2.0800 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:10
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Started tuning inner round: will search for an optimal matrix instance.
> Starting with requested 0 threads ; current default 1 ; at most 8.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.0001171s; avg 3.902e-05s ( +/-   5.30/  7.54 %); best 3.695e-05s; worst 4.196e-05s; std dev. 2.135e-06 (taking best).
> Reference operation time is 3.69549e-05 s (2186 Mflops) with 1 threads.
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.327e-03 s (100.00 %)
>  analyzed arrays in 1.267e-03 s (95.49 %)
>  cleaned-up arrays in 2.503e-05 s (1.89 %)
>  deduplicated arrays in 1.693e-05 s (1.28 %)
>  sorted arrays in 9.537e-07 s (0.07 %)
>  shuffled partitions in 1.001e-05 s (0.75 %)
>  memory allocations took 1.907e-06 s (0.14 %)
>  leafs setup took 9.537e-07 s (0.07 %)
>  halfword conversion took 4.053e-06 s (0.31 %)
> Built (100 x 100)[0x5590205ae5d0]{C} @ (0(0..100),0(0..100)) (5050 nnz, 50 nnz/r) flags 0x42644094 (coo:0, csr:1, hw:0, ic:1, fi:0), storage: 1, subm: 1, symflags:'LS'
> Starting autotuning stage, with subdivision of 1 (current threads=1, requested threads=0, max threads = 8).
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.902e-03 s (100.00 %)
>  analyzed arrays in 1.819e-03 s (95.64 %)
>  cleaned-up arrays in 2.503e-05 s (1.32 %)
>  deduplicated arrays in 1.693e-05 s (0.89 %)
>  sorted arrays in 9.537e-07 s (0.05 %)
>  shuffled partitions in 2.599e-05 s (1.37 %)
>  memory allocations took 3.099e-06 s (0.16 %)
>  leafs setup took 9.537e-07 s (0.05 %)
>  halfword conversion took 9.060e-06 s (0.48 %)
> Built (100 x 100)[0x55902055eee0]{C} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 3, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.0002179s; avg 7.264e-05s ( +/-  28.45/ 56.89 %); best 5.198e-05s; worst 0.000114s; std dev. 2.922e-05 (taking best).
> Reference operation time is 5.19753e-05 s (1555 Mflops) with 1 threads.
> Challenging best inner round reference (3.69549e-05 s/1 threads) with: subdivision 0.25, 3 leaves, 2.121 bytes/nz, 5.19753e-05 s/0 threads (speedup 0.711009 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type C, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.928e-03 s (100.00 %)
>  analyzed arrays in 1.849e-03 s (95.89 %)
>  cleaned-up arrays in 1.693e-05 s (0.88 %)
>  deduplicated arrays in 1.597e-05 s (0.83 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 2.384e-05 s (1.24 %)
>  memory allocations took 1.311e-05 s (0.68 %)
>  leafs setup took 1.192e-06 s (0.06 %)
>  halfword conversion took 6.914e-06 s (0.36 %)
> Built (100 x 100)[0x5590205ac0c0]{C} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 10, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.0001931s; avg 6.437e-05s ( +/-  10.00/ 16.67 %); best 5.794e-05s; worst 7.51e-05s; std dev. 7.637e-06 (taking best).
> Reference operation time is 5.79357e-05 s (1395 Mflops) with 1 threads.
> Challenging best inner round reference (3.69549e-05 s/1 threads) with: subdivision 0.5, 10 leaves, 2.206 bytes/nz, 5.79357e-05 s/0 threads (speedup 0.63786 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type C, 5050 nnz, 50 nnz/r, 14 subms, 10 lsubms, 2.2059 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.924e-03 s (100.00 %)
>  analyzed arrays in 1.798e-03 s (93.44 %)
>  cleaned-up arrays in 1.383e-05 s (0.72 %)
>  deduplicated arrays in 2.718e-05 s (1.41 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 7.105e-05 s (3.69 %)
>  memory allocations took 5.007e-06 s (0.26 %)
>  leafs setup took 9.537e-07 s (0.05 %)
>  halfword conversion took 7.153e-06 s (0.37 %)
> Built (100 x 100)[0x55902059a790]{C} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 22, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.00018s; avg 6e-05s ( +/-   1.85/  3.31 %); best 5.889e-05s; worst 6.199e-05s; std dev. 1.408e-06 (taking best).
> Reference operation time is 5.88894e-05 s (1372 Mflops) with 1 threads.
> Challenging best inner round reference (3.69549e-05 s/1 threads) with: subdivision 1, 22 leaves, 2.295 bytes/nz, 5.88894e-05 s/0 threads (speedup 0.62753 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type C, 5050 nnz, 50 nnz/r, 30 subms, 22 lsubms, 2.2947 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 2.043e-03 s (100.00 %)
>  analyzed arrays in 1.925e-03 s (94.22 %)
>  cleaned-up arrays in 2.694e-05 s (1.32 %)
>  deduplicated arrays in 1.693e-05 s (0.83 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 5.794e-05 s (2.84 %)
>  memory allocations took 5.245e-06 s (0.26 %)
>  leafs setup took 3.099e-06 s (0.15 %)
>  halfword conversion took 7.868e-06 s (0.39 %)
> Built (100 x 100)[0x5590205401d0]{C} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 36, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.0001931s; avg 6.437e-05s ( +/-   0.74/  1.11 %); best 6.39e-05s; worst 6.509e-05s; std dev. 5.15e-07 (taking best).
> Reference operation time is 6.38962e-05 s (1265 Mflops) with 1 threads.
> Challenging best inner round reference (3.69549e-05 s/1 threads) with: subdivision 2, 36 leaves, 2.383 bytes/nz, 6.38962e-05 s/0 threads (speedup 0.578358 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type C, 5050 nnz, 50 nnz/r, 50 subms, 36 lsubms, 2.3834 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.550e-04 s (100.00 %)
>  analyzed arrays in 7.391e-05 s (47.69 %)
>  cleaned-up arrays in 1.407e-05 s (9.08 %)
>  deduplicated arrays in 1.502e-05 s (9.69 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 3.815e-05 s (24.62 %)
>  memory allocations took 1.907e-06 s (1.23 %)
>  leafs setup took 3.815e-06 s (2.46 %)
>  halfword conversion took 7.153e-06 s (4.62 %)
> Built (100 x 100)[0x5590205401d0]{C} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 36, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.000196s; avg 6.533e-05s ( +/-   3.65/  4.01 %); best 6.294e-05s; worst 6.795e-05s; std dev. 2.051e-06 (taking best).
> Reference operation time is 6.29425e-05 s (1284 Mflops) with 1 threads.
> Challenging best inner round reference (3.69549e-05 s/1 threads) with: subdivision 4, 36 leaves, 2.383 bytes/nz, 6.29425e-05 s/0 threads (speedup 0.587121 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type C, 5050 nnz, 50 nnz/r, 50 subms, 36 lsubms, 2.3834 bpnz
> Best sparse multiply performance with subdivision multiplier of 1: 2186.45 Mflops.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Last tuner inner round (1 of 1) took 0.012567 s (eq. to  3e+02/ 3e+02 old/new op.times), gained local/global speedup 1 x (3.69549e-05 : 3.69549e-05) / 1 x (3.69549e-05 : 3.69549e-05). This is not amortizable !
> Auto tuning inner round 1 did not find a configuration better than the original.
> In 1 tuning rounds (tot. 0.013s, 0.0096s for constructor, 0 clones) obtained NO speedup (best stays 2186 Mflops).
> Second run of RSB Autotuner took 0.013191 s and estimated a speedup of 1.000000 x (3.695e-05 s -> 3.695e-05 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:100 0
> #norm:10 0
> #used index storage compared to COO:10504 vs 40400 bytes (26.00%) ; compared to CSR:10504 vs 20604 bytes (50.99%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000000	  0.001228	  0.000031	  0.001259
> %:UNSORTEDCOO2RSB_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001259
> %:RSB_SUBDIVISION_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001228
> %:RSB_SHUFFLE_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000031
> %:ROW_MAJOR_SORT_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000000
> %:ROW_MAJOR_SORT_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      -nan
> %:SORTEDCOO2RSB_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.001259
> %:ROW_MAJOR_SORT_TO_MOP:lower-100x100-5050nz	S	N	1	100	100	5050	     0.000
> %:UNSORTEDCOO2RSB_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:SORTEDCOO2RSB_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:RSB_SUBDIVISION_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:RSB_SHUFFLE_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:CONSTRUCTOR_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:lower-100x100-5050nz	S	N	1	100	100	5050	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:lower-100x100-5050nz	S	N	1	100	100	5050	10504	40400	20600
> %:SM_IDXOCCUPATION:lower-100x100-5050nz	S	N	1	100	100	5050	10504
> %:SM_MEMTRAFFIC:lower-100x100-5050nz	S	N	1	100	100	5050	    102200
> %:SM_MINMAXAVGNNZ:lower-100x100-5050nz	S	N	1	100	100	5050	5050	5050	5050
> #
> %operation:matrix	CONSTRUCTOR[1]	SPMV[1]	SPMV[1]
> %operation:lower-100x100-5050nz	0.00130606	1e+09	1e+09
> %constructor:matrix	SORT[1]	SCAN[1]	SHUFFLE[1]	INSERT[1]
> %constructor:lower-100x100-5050nz	0	0.00122809	0	3.09944e-05
> # so far, program took 12.601s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.137s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 1.058s (system CPU time used)
> ru_utime : 13.76s (user CPU time used)
> # multi-type benchmarking (DSCZ) -- now using typecode Z (last was D).
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # so far, program took 12.601s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.137s/0.000s .
> # Using 1 threads
> # Using alpha=1 beta=1 order=cols for rsb_spmv/rsb_spsv/rsb_spmm/rsb_spsm.
> # will use input matrix flags: RSB_FLAG_USE_HALFWORD_INDICES, RSB_FLAG_SORTED_INPUT, RSB_FLAG_LOWER, RSB_FLAG_QUAD_PARTITIONING, RSB_FLAG_SYMMETRIC, RSB_FLAG_OWN_PARTITIONING_ARRAYS
> # Using 1 threads
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 2.360e-04 s (100.00 %)
>  analyzed arrays in 1.361e-04 s (57.68 %)
>  cleaned-up arrays in 8.821e-06 s (3.74 %)
>  deduplicated arrays in 8.106e-06 s (3.43 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 6.700e-05 s (28.38 %)
>  memory allocations took 5.960e-06 s (2.53 %)
>  leafs setup took 9.537e-07 s (0.40 %)
>  halfword conversion took 8.106e-06 s (3.43 %)
> Built (100 x 100)[0x55902059a790]{Z} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x2446196 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LS'
> # Constructed matrix (took 0.000s): (100 x 100)[0x55902059a790]{Z} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x2446196 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LS'
> # matrix consistency check took 0.000s (ok)
> RSB Sparse Blocks Autotuner invoked requesting max 6 splits and max 6 merges in 1 rounds, threads spec.0 (specify negative values to enable threads tuning).
> Will autotune matrix: 100 x 100, type Z, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz.
> Parameters: verbosity:2 mintimes:3 maxtimes:10 mindt:0 maxdt:3
> Saved plot to  test-tuning-lower-100x100-5050nz--Z-N-1--base.eps
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.000185s; avg 6.167e-05s ( +/-  12.24/ 19.85 %); best 5.412e-05s; worst 7.391e-05s; std dev. 8.733e-06 (taking best).
> Reference operation time is 5.4121e-05 s (1493 Mflops) with 1 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=cols) (max 6 steps, inclusive 3 grace steps) on: 100 x 100, type Z, 5050 nnz, 50 nnz/r, 4 subms, 3 lsubms, 2.1212 bpnz (tpop: 5.412e-05  Mflops: 1492.950)
> Merge (3 -> 1 leaves) took w.c.t. of 5.007e-05s, ~4.387e-05s of computing time (of which 2.408e-05s sorting, 2.146e-06s analysis)
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.000103s; avg 3.433e-05s ( +/-   0.69/  1.39 %); best 3.409e-05s; worst 3.481e-05s; std dev. 3.372e-07 (taking best).
> Reference operation time is 3.40939e-05 s (2370 Mflops) with 1 threads.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> After merge step 1: tpop: 3.409e-05 s   ~Mflops: 2369.928   nsubm:1 otn:1
> Applying merge (3 -> 1 leaves, 1 th.) yielded SPEEDUP of  1.587x: 5.412e-05s -> 3.409e-05s, so taking this instance.
> Saved plot to  test-tuning-lower-100x100-5050nz--Z-N-1--mv-tuned_merge1_1x1th.eps
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 0.005165s (of which 5.317e-05s partitioning, 0.002886s I/O); computing times: 4.387e-05s in par. loops, 2.408e-05s sorting, 2.146e-06s analyzing)
> Total merge + benchmarking process took 0.005165s, equivalent to 151.5/95.4 new/old ops (0.00175s for 2 clones -- as 51.3/32.3 ops, or 25.7/16.2 ops per clone), SPEEDUP of  1.587x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 0 -> 1 th.sp.) yielded SPEEDUP of  1.587x (5.412e-05s -> 3.409e-05s), will amortize in      257.9 ops by saving 2.003e-05s per op.
> In 1 tuning rounds (tot. 0.0054s, 0.0017s for constructor, 2 clones) obtained a SPEEDUP of   58.7% (1.587x) (from 1493 to 2370 Mflops). Employed 0.0039s for I/O of matrix plots.
> #pr: updating sample at index 4 (3^th of 4), 0^th touch for (0,0,0,0,0,3,0).
> First run of RSB Autotuner took 0.00968003 s  (5.412e-05 s -> 3.409e-05 s per spmv_sxsa) (tuned: 3 -> 1 lsubm).
> RSB Sparse Blocks Autotuner invoked requesting max 0 splits and max 0 merges in 1 rounds, auto threads spec.
> Will autotune matrix: 100 x 100, type Z, 5050 nnz, 50 nnz/r, 1 subms, 1 lsubms, 2.0800 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:10
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Started tuning inner round: will search for an optimal matrix instance.
> Starting with requested 0 threads ; current default 1 ; at most 8.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.000123s; avg 4.101e-05s ( +/-   4.65/  6.98 %); best 3.91e-05s; worst 4.387e-05s; std dev. 2.06e-06 (taking best).
> Reference operation time is 3.91006e-05 s (2066 Mflops) with 1 threads.
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.326e-03 s (100.00 %)
>  analyzed arrays in 1.281e-03 s (96.60 %)
>  cleaned-up arrays in 1.597e-05 s (1.20 %)
>  deduplicated arrays in 1.502e-05 s (1.13 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 7.868e-06 s (0.59 %)
>  memory allocations took 1.192e-06 s (0.09 %)
>  leafs setup took 9.537e-07 s (0.07 %)
>  halfword conversion took 3.099e-06 s (0.23 %)
> Built (100 x 100)[0x5590205ae5d0]{Z} @ (0(0..100),0(0..100)) (5050 nnz, 50 nnz/r) flags 0x42644094 (coo:0, csr:1, hw:0, ic:1, fi:0), storage: 1, subm: 1, symflags:'LS'
> Starting autotuning stage, with subdivision of 1 (current threads=1, requested threads=0, max threads = 8).
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.900e-04 s (100.00 %)
>  analyzed arrays in 8.583e-05 s (45.17 %)
>  cleaned-up arrays in 1.287e-05 s (6.78 %)
>  deduplicated arrays in 1.502e-05 s (7.90 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 6.604e-05 s (34.76 %)
>  memory allocations took 2.146e-06 s (1.13 %)
>  leafs setup took 9.537e-07 s (0.50 %)
>  halfword conversion took 5.960e-06 s (3.14 %)
> Built (100 x 100)[0x5590204df120]{Z} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 6, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.25
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.001846s; avg 0.0006153s ( +/-  91.90/182.95 %); best 4.983e-05s; worst 0.001741s; std dev. 0.000796 (taking best).
> Reference operation time is 4.98295e-05 s (1622 Mflops) with 1 threads.
> Challenging best inner round reference (3.91006e-05 s/1 threads) with: subdivision 0.25, 6 leaves, 2.163 bytes/nz, 4.98295e-05 s/0 threads (speedup 0.784689 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type Z, 5050 nnz, 50 nnz/r, 8 subms, 6 lsubms, 2.1632 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 9.894e-05 s (100.00 %)
>  analyzed arrays in 2.503e-05 s (25.30 %)
>  cleaned-up arrays in 1.287e-05 s (13.01 %)
>  deduplicated arrays in 1.407e-05 s (14.22 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 3.314e-05 s (33.49 %)
>  memory allocations took 5.960e-06 s (6.02 %)
>  leafs setup took 1.907e-06 s (1.93 %)
>  halfword conversion took 5.960e-06 s (6.02 %)
> Built (100 x 100)[0x55902059a790]{Z} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 16, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 0.5
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.001998s; avg 0.0006661s ( +/-  90.08/162.45 %); best 6.604e-05s; worst 0.001748s; std dev. 0.0007666 (taking best).
> Reference operation time is 6.60419e-05 s (1223 Mflops) with 1 threads.
> Challenging best inner round reference (3.91006e-05 s/1 threads) with: subdivision 0.5, 16 leaves,  2.25 bytes/nz, 6.60419e-05 s/0 threads (speedup 0.592058 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type Z, 5050 nnz, 50 nnz/r, 22 subms, 16 lsubms, 2.2503 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.211e-04 s (100.00 %)
>  analyzed arrays in 3.505e-05 s (28.94 %)
>  cleaned-up arrays in 1.383e-05 s (11.42 %)
>  deduplicated arrays in 1.407e-05 s (11.61 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 4.411e-05 s (36.42 %)
>  memory allocations took 4.053e-06 s (3.35 %)
>  leafs setup took 2.861e-06 s (2.36 %)
>  halfword conversion took 7.153e-06 s (5.91 %)
> Built (100 x 100)[0x55902059a790]{Z} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 34, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.001899s; avg 0.000633s ( +/-  89.11/177.10 %); best 6.89e-05s; worst 0.001754s; std dev. 0.0007927 (taking best).
> Reference operation time is 6.8903e-05 s (1173 Mflops) with 1 threads.
> Challenging best inner round reference (3.91006e-05 s/1 threads) with: subdivision 1, 34 leaves, 2.372 bytes/nz, 6.8903e-05 s/0 threads (speedup 0.567474 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type Z, 5050 nnz, 50 nnz/r, 47 subms, 34 lsubms, 2.3723 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 2.990e-04 s (100.00 %)
>  analyzed arrays in 4.101e-05 s (13.72 %)
>  cleaned-up arrays in 1.383e-05 s (4.63 %)
>  deduplicated arrays in 1.407e-05 s (4.70 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 2.120e-04 s (70.89 %)
>  memory allocations took 7.153e-06 s (2.39 %)
>  leafs setup took 2.146e-06 s (0.72 %)
>  halfword conversion took 8.821e-06 s (2.95 %)
> Built (100 x 100)[0x5590205702b0]{Z} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 36, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 2
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.001965s; avg 0.000655s ( +/-  89.63/166.26 %); best 6.795e-05s; worst 0.001744s; std dev. 0.0007708 (taking best).
> Reference operation time is 6.79493e-05 s (1189 Mflops) with 1 threads.
> Challenging best inner round reference (3.91006e-05 s/1 threads) with: subdivision 2, 36 leaves, 2.383 bytes/nz, 6.79493e-05 s/0 threads (speedup 0.575439 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type Z, 5050 nnz, 50 nnz/r, 50 subms, 36 lsubms, 2.3834 bpnz
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 4325376 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 8
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Building a matrix with 5050 nnz, 100 x 100
> Duplicates check: 5050 - 0 = 5050
>  converted COO to RSB in 1.299e-04 s (100.00 %)
>  analyzed arrays in 3.815e-05 s (29.36 %)
>  cleaned-up arrays in 1.383e-05 s (10.64 %)
>  deduplicated arrays in 1.502e-05 s (11.56 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 4.506e-05 s (34.68 %)
>  memory allocations took 7.153e-06 s (5.50 %)
>  leafs setup took 2.861e-06 s (2.20 %)
>  halfword conversion took 6.914e-06 s (5.32 %)
> Built (100 x 100)[0x5590205702b0]{Z} @ (0(0..0),0(0..0)) (5050 nnz, 50 nnz/r) flags 0x42646096 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 36, symflags:'LS'
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 4
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> 3 iterations (1 th.) took 0.001878s; avg 0.000626s ( +/-  88.84/177.49 %); best 6.986e-05s; worst 0.001737s; std dev. 0.0007857 (taking best).
> Reference operation time is 6.98566e-05 s (1157 Mflops) with 1 threads.
> Challenging best inner round reference (3.91006e-05 s/1 threads) with: subdivision 4, 36 leaves, 2.383 bytes/nz, 6.98566e-05 s/0 threads (speedup 0.559727 x), same?n.
> New candidate clone performs slowly; discarding it: 100 x 100, type Z, 5050 nnz, 50 nnz/r, 50 subms, 36 lsubms, 2.3834 bpnz
> Best sparse multiply performance with subdivision multiplier of 1: 2066.46 Mflops.
> # librsb version 1.3.0.0 - 202202241108: Initializing
> # Cache block size total 34603008 bytes, per-thread 34603008 bytes
> # RSB_IO_WANT_MEMORY_HIERARCHY_INFO_STRING: unset
> # min_leaf_matrix_bytes : 32768
> # avg_leaf_matrix_bytes : 69206016
> # rsb_g_threads: 8
> # RSB_IO_WANT_EXECUTING_THREADS: 1
> # RSB_WANT_RSBPP: 1
> # RSB_IO_WANT_OUTPUT_STREAM: stdout
> # RSB_IO_WANT_VERBOSE_ERRORS: stderr
> # RSB_IO_WANT_VERBOSE_INIT: stdout
> # RSB_IO_WANT_VERBOSE_EXIT: stdout
> # RSB_IO_WANT_SORT_METHOD: 0
> # RSB_IO_WANT_SUBDIVISION_MULTIPLIER: 1
> # librsb version 1.3.0.0 - 202202241108: Initialization success 
> Last tuner inner round (1 of 1) took 0.014492 s (eq. to  4e+02/ 4e+02 old/new op.times), gained local/global speedup 1 x (3.91006e-05 : 3.91006e-05) / 1 x (3.91006e-05 : 3.91006e-05). This is not amortizable !
> Auto tuning inner round 1 did not find a configuration better than the original.
> In 1 tuning rounds (tot. 0.014s, 0.0026s for constructor, 0 clones) obtained NO speedup (best stays 2066 Mflops).
> Second run of RSB Autotuner took 0.0145378 s and estimated a speedup of 1.000000 x (3.910e-05 s -> 3.910e-05 s per op) in same matrix (1 -> 1 lsubm)
> #min:1 0
> #max:1 0
> #sum:100 0
> #norm:10 0
> #used index storage compared to COO:10504 vs 40400 bytes (26.00%) ; compared to CSR:10504 vs 20604 bytes (50.99%)
> #%:CONSTRUCTOR_*:SORT	SCAN	INSERT	SCAN+INSERT
> %:CONSTRUCTOR_TIMES:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000000	  0.000136	  0.000067	  0.000203
> %:UNSORTEDCOO2RSB_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000203
> %:RSB_SUBDIVISION_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000136
> %:RSB_SHUFFLE_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000067
> %:ROW_MAJOR_SORT_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000000
> %:ROW_MAJOR_SORT_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      -nan
> %:SORTEDCOO2RSB_TIME:lower-100x100-5050nz	S	N	1	100	100	5050	  0.000203
> %:ROW_MAJOR_SORT_TO_MOP:lower-100x100-5050nz	S	N	1	100	100	5050	     0.000
> %:UNSORTEDCOO2RSB_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:SORTEDCOO2RSB_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:RSB_SUBDIVISION_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:RSB_SHUFFLE_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      1.00
> %:CONSTRUCTOR_SCALING:lower-100x100-5050nz	S	N	1	100	100	5050	      -nan	      1.00	      1.00	      1.00
> #%:SM_COUNTS:	Tot	HalfwordCsr	FullwordCsr	HalfwordCoo	FullwordCoo
> %:SM_COUNTS:lower-100x100-5050nz	S	N	1	100	100	5050	1	1	0	0	0
> %:SM_IDXOCCUPATIONRSBVSCOOANDCSR:lower-100x100-5050nz	S	N	1	100	100	5050	10504	40400	20600
> %:SM_IDXOCCUPATION:lower-100x100-5050nz	S	N	1	100	100	5050	10504
> %:SM_MEMTRAFFIC:lower-100x100-5050nz	S	N	1	100	100	5050	    183800
> %:SM_MINMAXAVGNNZ:lower-100x100-5050nz	S	N	1	100	100	5050	5050	5050	5050
> #
> %operation:matrix	CONSTRUCTOR[1]	SPMV[1]	SPMV[1]
> %operation:lower-100x100-5050nz	0.000236034	1e+09	1e+09
> %constructor:matrix	SORT[1]	SCAN[1]	SHUFFLE[1]	INSERT[1]
> %constructor:lower-100x100-5050nz	0	0.000136137	0	6.69956e-05
> # so far, program took 12.920s of wall clock time; ancillary tests 0.000s; I/O 0.000s; checks 0.000s; conversions 0.000s; rsb/mkl tuning 0.161s/0.000s .
> getrusage() stats:
> ru_maxrss: 137 (maximum resident set size -- MB)
> ru_stime : 1.267s (system CPU time used)
> ru_utime : 14.41s (user CPU time used)
> # benchmarking terminated --- finalizing run.
> # ====== BEGIN Total summary record.
> #pr: ========  All results (not limiting)
> #pr: Dump from a base of 4 samples (of max 4) ordered by (1,1,1,1,1,4,1) = (filename x cores x incX x incY x nrhs x typecode x transA).
> pr: BESTCODE MTX NR NC NNZ NRHS TYPE SYM TRANS NT AT-NT AT-MKL-NT BPNZ AT-BPNZ NSUBM AT-SUBM RSBBEST-MFLOPS OPTIME MKL-OPTIME AT-OPTIME AT-MKL-OPTIME AT-TIME RWminBW-GBps CB-bpf AT-MS CMFLOPS
> pr:    1:R_R  lower-100x100-5050nz 100 100 5050 1 D S N  1  1  0 2.1212 2.0800 3 1 2017.26 1.788e-05 0.000e+00 1.001e-05 0.000e+00 2.182e-02 5.32e+00 2.60e+00 1 2.02e-02
> pr:    2:R_R  lower-100x100-5050nz 100 100 5050 1 S S N  1  1  0 2.1212 2.0800 3 1 2289.86 1.597e-05 0.000e+00 8.821e-06 0.000e+00 9.219e-03 3.62e+00 1.56e+00 1 2.02e-02
> pr:    3:R_R  lower-100x100-5050nz 100 100 5050 1 C S N  1  1  0 2.1212 2.0800 3 1 2186.45 5.293e-05 0.000e+00 3.695e-05 0.000e+00 7.871e-03 1.44e+00 6.50e-01 1 8.08e-02
> pr:    4:R_R  lower-100x100-5050nz 100 100 5050 1 Z S N  1  1  0 2.1212 2.0800 3 1 2369.93 5.412e-05 0.000e+00 3.409e-05 0.000e+00 9.680e-03 2.82e+00 1.17e+00 1 8.08e-02
> #pr: below, we define 'successful' autotuning when speedup of 1.010000x is exceeded, and 'tuned' results even the ones which are same as untuned
> #pr: rsb autotuning was successful in     4 cases (100.00 %) and unsuccessful in 0 cases (0.00 %)
> #pr:  (in succ. cases rsb autotuning gave    avg.  65.4 % faster, avg. sp. ratio 1.654x, max sp. ratio 1.811x, avg. ratio 0.000x)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 930.2/213.0/2178.6/3720.6   tuned ops)
> #pr:  (in succ. cases rsb autotuning took an avg/min/max/tot of: 531.2/148.7/1220.0/2124.7 untuned ops)
> #pr:  (and amortizes from untuned rsb in avg. 1259.4, min. 483.3, max. 2772.8 ops)
> #pr:  (avg/min/max (avg) nnz   per subm before successful tuning were       1683/      1683/      1683)
> #pr:  (avg/min/max (avg) nnz   per subm after  successful tuning were       5050/      5050/      5050)
> #pr:  (avg/min/max (avg) bytes per subm before successful tuning were      15150/      6733/     26933)
> #pr:  (avg/min/max (avg) bytes per subm after  successful tuning were      45450/     20200/     80800)
> #pr:  (avg/min/max (avg) bytes per nnz  before successful tuning were      2.121/     2.121/     2.121)
> #pr:  (avg/min/max operands (mtx,lhs,rhs) read bandwidth lower bound       3.252/     1.421/     5.243,GBps)
> #pr:  (avg/min/max operands (mtx,rhs:r;lhs:rw) bandwidth lower bound      13.201/     1.442/     5.323,GBps)
> #pr:  (avg/min/max code balance (bytes read at least once per flop)        1.495/     0.650/     2.599)
> #pr:  (avg/min/max (avg) bytes per nnz  after  successful tuning were      2.080/     2.080/     2.080)
> #pr:  (matrix has been subdivided  more/less/same            in resp.  0 / 4 /0 cases)
> #pr:  (matrix has used             more/less/same    threads in resp.  0 / 0 /4 cases)
> #pr: no unsuccessful rsb autotuning attempt (according to  1.01x threshold) 
> #pr: rsb auto tuning (either succ. or uns.) time was: on avg.:  0.01 s, min  0.01 s, max  0.02 s, tot  0.05 s (4 samples)
> #pr: rsb auto tuning (   only successful  ) time was: on avg.:  0.01 s, min  0.01 s, max  0.02 s, tot  0.05 s (4 samples)
> #pr:  best tun. rsb canon. mflops were: on avg. 2.216e+03,  min 2.017e+03,  max 2.370e+03  (4 samples)
> #pr:  ref. unt. rsb canon. mflops were: on avg. 1.353e+03,  min 1.130e+03,  max 1.527e+03  (4 samples)
> #pr:  best tun. rsb operation time was: on avg. 2.247e-05s, min 8.821e-06s, max 3.695e-05s, tot 8.988e-05s (4 samples)
> #pr:  ref. unt. rsb operation time was: on avg. 3.523e-05s, min 1.597e-05s, max 5.412e-05s, tot 1.409e-04s (4 samples)
> #pr:  min / max ratio of in-memory MEMSET bandwidth to extrapolated read bandwidth ratio: 1.367e+00 5.046e+00
> #pr:  in-cache to in-memory MEMSET bandwidth ratio: 2.463e+01
> #pr: Record collection took  1.06 s.
> #pr: Record comprises 50 memory benchmark samples (prepend RSB_PR_MBW=1 to dump this).
> #pr: Record comprises 81 environment variables in 3214 bytes (prepend RSB_PR_ENV=1 to dump this).
> # ======  END  Total summary record.
> #pr: ======== Saved a performance record of 4 samples to test.rpr
> # Removing the temporary record file test.rpr.tmp.
> # terminating run at 1659086434 (after 12.9s of w.c.t.)
> + ls -ltr test-tuning-lower-100x100-5050nz--C-N-1--base.eps test-tuning-lower-100x100-5050nz--C-N-1--mv-tuned_merge1_1x1th.eps test-tuning-lower-100x100-5050nz--D-N-1--base.eps test-tuning-lower-100x100-5050nz--D-N-1--mv-tuned_merge1_1x1th.eps test-tuning-lower-100x100-5050nz--S-N-1--base.eps test-tuning-lower-100x100-5050nz--S-N-1--mv-tuned_merge1_1x1th.eps test-tuning-lower-100x100-5050nz--Z-N-1--base.eps test-tuning-lower-100x100-5050nz--Z-N-1--mv-tuned_merge1_1x1th.eps
> -rw-r--r-- 1 user42 user42 85638 Jul 29 09:20 test-tuning-lower-100x100-5050nz--D-N-1--base.eps
> -rw-r--r-- 1 user42 user42 84560 Jul 29 09:20 test-tuning-lower-100x100-5050nz--D-N-1--mv-tuned_merge1_1x1th.eps
> -rw-r--r-- 1 user42 user42 85637 Jul 29 09:20 test-tuning-lower-100x100-5050nz--S-N-1--base.eps
> -rw-r--r-- 1 user42 user42 84560 Jul 29 09:20 test-tuning-lower-100x100-5050nz--S-N-1--mv-tuned_merge1_1x1th.eps
> -rw-r--r-- 1 user42 user42 85638 Jul 29 09:20 test-tuning-lower-100x100-5050nz--C-N-1--base.eps
> -rw-r--r-- 1 user42 user42 84559 Jul 29 09:20 test-tuning-lower-100x100-5050nz--C-N-1--mv-tuned_merge1_1x1th.eps
> -rw-r--r-- 1 user42 user42 85638 Jul 29 09:20 test-tuning-lower-100x100-5050nz--Z-N-1--base.eps
> -rw-r--r-- 1 user42 user42 84560 Jul 29 09:20 test-tuning-lower-100x100-5050nz--Z-N-1--mv-tuned_merge1_1x1th.eps
> + rsbench --read-performance-record test.rpr
> + ls -ltr test.txt
> -rw-r--r-- 1 user42 user42 4091 Jul 29 09:20 test.txt
> + RSB_PR_WLTC=2
> + RSB_PR_SR=0
> + rsbench --read-performance-record test.rpr
> + which latex
> /usr/bin/latex
> + which kpsepath
> /usr/bin/kpsepath
> ++ kpsepath tex
> ++ sed 's/!!//g;s/:/\n/g;'
> + find . /sbuild-nonexistent/.texlive2022/texmf-config/tex/kpsewhich// /sbuild-nonexistent/.texlive2022/texmf-var/tex/kpsewhich// /sbuild-nonexistent/texmf/tex/kpsewhich// /usr/local/share/texmf/tex/kpsewhich// /etc/texmf/tex/kpsewhich// /var/lib/texmf/tex/kpsewhich// /usr/share/texmf/tex/kpsewhich// /usr/share/texlive/texmf-dist/tex/kpsewhich// /sbuild-nonexistent/.texlive2022/texmf-config/tex/generic// /sbuild-nonexistent/.texlive2022/texmf-var/tex/generic// /sbuild-nonexistent/texmf/tex/generic// /usr/local/share/texmf/tex/generic// /etc/texmf/tex/generic// /var/lib/texmf/tex/generic// /usr/share/texmf/tex/generic// /usr/share/texlive/texmf-dist/tex/generic// /sbuild-nonexistent/.texlive2022/texmf-config/tex/latex// /sbuild-nonexistent/.texlive2022/texmf-var/tex/latex// /sbuild-nonexistent/texmf/tex/latex// /usr/local/share/texmf/tex/latex// /etc/texmf/tex/latex// /var/lib/texmf/tex/latex// /usr/share/texmf/tex/latex// /usr/share/texlive/texmf-dist/tex/latex// /sbuild-nonexistent/.texlive2022/texmf-config/tex/// /sbuild-nonexistent/.texlive2022/texmf-var/tex/// /sbuild-nonexistent/texmf/tex/// /usr/local/share/texmf/tex/// /etc/texmf/tex/// /var/lib/texmf/tex/// /usr/share/texmf/tex/// /usr/share/texlive/texmf-dist/tex/// -name sciposter.cls
> find: ‘/sbuild-nonexistent/.texlive2022/texmf-config/tex/kpsewhich//’: No such file or directory
> find: ‘/sbuild-nonexistent/.texlive2022/texmf-var/tex/kpsewhich//’: No such file or directory
> find: ‘/sbuild-nonexistent/texmf/tex/kpsewhich//’: No such file or directory
> find: ‘/usr/local/share/texmf/tex/kpsewhich//’: No such file or directory
> find: ‘/etc/texmf/tex/kpsewhich//’: No such file or directory
> find: ‘/var/lib/texmf/tex/kpsewhich//’: No such file or directory
> find: ‘/usr/share/texmf/tex/kpsewhich//’: No such file or directory
> find: ‘/usr/share/texlive/texmf-dist/tex/kpsewhich//’: No such file or directory
> find: ‘/sbuild-nonexistent/.texlive2022/texmf-config/tex/generic//’: No such file or directory
> find: ‘/sbuild-nonexistent/.texlive2022/texmf-var/tex/generic//’: No such file or directory
> find: ‘/sbuild-nonexistent/texmf/tex/generic//’: No such file or directory
> find: ‘/usr/local/share/texmf/tex/generic//’: No such file or directory
> find: ‘/usr/share/texmf/tex/generic//’: No such file or directory
> find: ‘/sbuild-nonexistent/.texlive2022/texmf-config/tex/latex//’: No such file or directory
> find: ‘/sbuild-nonexistent/.texlive2022/texmf-var/tex/latex//’: No such file or directory
> find: ‘/sbuild-nonexistent/texmf/tex/latex//’: No such file or directory
> find: ‘/usr/local/share/texmf/tex/latex//’: No such file or directory
> find: ‘/etc/texmf/tex/latex//’: No such file or directory
> find: ‘/var/lib/texmf/tex/latex//’: No such file or directory
> find: ‘/sbuild-nonexistent/.texlive2022/texmf-config/tex///’: No such file or directory
> find: ‘/sbuild-nonexistent/.texlive2022/texmf-var/tex///’: No such file or directory
> find: ‘/sbuild-nonexistent/texmf/tex///’: No such file or directory
> find: ‘/usr/local/share/texmf/tex///’: No such file or directory
> + exit 0
> for mf in pd.mtx vf.mtx ; do if test -f /<<PKGBUILDDIR>>/examples/$mf ; then true; else cp -p /<<PKGBUILDDIR>>/$mf /<<PKGBUILDDIR>>/examples/$mf ; fi; done
> for ii in hello snippets transpose power autotune backsolve hello-spblas io-spblas fortran fortran_rsb_fi cplusplus ; do echo /<<PKGBUILDDIR>>/examples/$ii ; if  /<<PKGBUILDDIR>>/examples/$ii ; then true ; else exit -1 ;fi ; done
> /<<PKGBUILDDIR>>/examples/hello
> Hello, RSB!
> Initializing the library...
> Correctly initialized the library.
> Attempting to set the RSB_IO_WANT_EXTRA_VERBOSE_INTERFACE library option.
> Failed setting the RSB_IO_WANT_EXTRA_VERBOSE_INTERFACE library option (reason string:
> Output to stream feature has been disabled at configure time.).
> This error may be safely ignored.
> Correctly allocated a matrix.
> Summary information of the matrix:
> (3 x 3)[0x555e01a93030]{D} @ (0(0..3),0(0..3)) (3 nnz, 1 nnz/r) flags 0x2040384 (coo:1, csr:0, hw:0, ic:1, fi:0), storage: 40, subm: 1, symflags:''
> Correctly performed a SPMV.
> Correctly freed the matrix.
> Correctly finalized the library.
> Program terminating with no error.
> /<<PKGBUILDDIR>>/examples/snippets
> Hello, RSB!
> Initializing the library...
> Correctly initialized the library.
> Attempting to set the RSB_IO_WANT_EXTRA_VERBOSE_INTERFACE library option.
> Failed setting the RSB_IO_WANT_EXTRA_VERBOSE_INTERFACE library option (reason string:
> Output to stream feature has been disabled at configure time.).
> This error may be safely ignored.
> Correctly allocated a matrix.
> Summary information of the matrix:
> (3 x 3)[0x55a59421d030]{D} @ (0(0..3),0(0..3)) (3 nnz, 1 nnz/r) flags 0x2040384 (coo:1, csr:0, hw:0, ic:1, fi:0), storage: 40, subm: 1, symflags:''
> Correctly performed a SPMV.
> Correctly freed the matrix.
> Correctly finalized the library.
> Program terminating with no error.
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	11
> 1	3	13
> 2	2	22
> 3	3	33
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	11
> 2	2	22
> 3	1	13
> 3	3	33
> %%MatrixMarket matrix coordinate real general
> 6 6 36
> 1	1	99
> 1	2	12
> 1	3	13
> 1	4	14
> 1	5	15
> 1	6	16
> 2	1	12
> 2	2	99
> 2	3	12
> 2	4	13
> 2	5	14
> 2	6	15
> 3	1	13
> 3	2	12
> 3	3	99
> 3	4	12
> 3	5	13
> 3	6	14
> 4	1	14
> 4	2	13
> 4	3	12
> 4	4	99
> 4	5	12
> 4	6	13
> 5	1	15
> 5	2	14
> 5	3	13
> 5	4	12
> 5	5	99
> 5	6	12
> 6	1	16
> 6	2	15
> 6	3	14
> 6	4	13
> 6	5	12
> 6	6	99
> Creating 5 x 5 matrix with 5 nonzeroes.
> 0/5  0 0 -> 0
> 1/5  1 0 -> 5
> 2/5  2 0 -> 10
> 3/5  3 0 -> 15
> 4/5  4 0 -> 20
> Done.
> Building a matrix with 5 nnz, 5 x 5
> Duplicates check: 5 - 0 = 5
>  converted COO to RSB in 3.408e-03 s (100.00 %)
>  analyzed arrays in 5.958e-04 s (17.48 %)
>  cleaned-up arrays in 0.000e+00 s (0.00 %)
>  deduplicated arrays in 9.537e-07 s (0.03 %)
>  sorted arrays in 1.271e-04 s (3.73 %)
>  shuffled partitions in 2.675e-03 s (78.49 %)
>  memory allocations took 3.099e-06 s (0.09 %)
>  leafs setup took 9.537e-07 s (0.03 %)
>  halfword conversion took 5.007e-06 s (0.15 %)
> Built (5 x 5)[0x55a59421f070]{D} @ (0(0..0),0(0..0)) (5 nnz, 1 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 2, symflags:''
> Allocated matrix of 5 nonzeroes:
> (5 x 5)[0x55a59421f070]{D} @ (0(0..0),0(0..0)) (5 nnz, 1 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 2, symflags:''
> 
> Before auto-tuning, 100 multiplications took 0.001195s.
> Threads autotuning (may take more than 1.500000s)...
> Will use autotuning routine to sample matrix: 5 x 5, type D, 5 nnz, 1 nnz/r, 3 subms, 2 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> Sampling (15 x 0.1 s stages, transA=N, nrhs=2, timer gran.=2.93016e-08), 8 suggested as starting thread count(default).
> 3 iterations (8 th.) took 1.001e-05s; avg 3.338e-06s ( +/-   7.14/ 14.29 %); best 3.099e-06s; worst 3.815e-06s; std dev. 3.372e-07 (taking best).
> Reference operation time is 3.09944e-06 s (6.453 Mflops) with 8 threads.
> 3 iterations (8 th.) took 9.06e-06s; avg 3.02e-06s ( +/-   5.26/  2.63 %); best 2.861e-06s; worst 3.099e-06s; std dev. 1.124e-07 (taking best).
> Reference operation time is 2.86102e-06 s (6.991 Mflops) with 8 threads.
> After 0.000064s, autotuning routine did not find a better threads count configuration.
> (5 x 5)[0x55a59421f070]{D} @ (0(0..0),0(0..0)) (5 nnz, 1 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 2, symflags:''
> After threads auto-tuning, 100 multiplications took 0.000326s  --  effective speedup of 3.66374 x
> Matrix autotuning (may take more than 1.500000s; using 8 threads )...
> Will autotune matrix: 5 x 5, type D, 5 nnz, 1 nnz/r, 3 subms, 2 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> Starting autotuning (15 x 0.1 s stages, transA=N, nrhs=2, timer gran.=2.93016e-08), 8 suggested as starting thread count.
> 3 iterations (8 th.) took 1.407e-05s; avg 4.689e-06s ( +/-  33.90/ 52.54 %); best 3.099e-06s; worst 7.153e-06s; std dev. 1.766e-06 (taking best).
> Reference operation time is 3.09944e-06 s (6.453 Mflops) with 8 threads.
> Starting merge (user-supplied threads) based auto-tuning procedure (transA=N, nrhs=2, order=cols) (max 6 steps, inclusive 3 grace steps) on: 5 x 5, type D, 5 nnz, 1 nnz/r, 3 subms, 2 lsubms, 4.0000 bpnz (tpop: 3.099e-06  Mflops: 6.453)
> Merge (2 -> 1 leaves) took w.c.t. of 1.192e-05s, ~2.861e-06s of computing time (of which 0s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 2.861e-06s; avg 9.537e-07s ( +/-  96.93/200.00 %); best 2.93e-08s; worst 2.861e-06s; std dev. 1.349e-06 (taking best).
> Reference operation time is 2.93016e-08 s (682.6 Mflops) with 8 threads.
> After merge step 1: tpop: 2.93e-08 s   ~Mflops: 682.556   nsubm:1 otn:8
> Applying merge (2 -> 1 leaves, 8 th.) yielded SPEEDUP of 105.777x: 3.099e-06s -> 2.93e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (2 -> 1 subms) took 0.003051s (of which 0.001923s partitioning, 0s I/O); computing times: 2.861e-06s in par. loops, 0s sorting, 1.907e-06s analyzing)
> Total merge + benchmarking process took 0.003051s, equivalent to 104125.3/984.4 new/old ops (0.001132s for 2 clones -- as 38633.0/365.2 ops, or 19316.5/182.6 ops per clone), SPEEDUP of 105.777x
> Applying multi-merge (2 -> 1 leaves, 1 steps, 8 -> 8 th.sp.) yielded SPEEDUP of 105.777x (3.099e-06s -> 2.93e-08s), will amortize in      993.8 ops by saving 3.07e-06s per op.
> In 1 tuning rounds (tot. 0.0031s, 0.0011s for constructor, 2 clones) obtained a SPEEDUP of 10477.7% (105.8x) (from 6.453 to 682.6 Mflops).
> After 0.003123s, autotuning routine declared speedup of 105.777 x, when using threads count of 8.
> (5 x 5)[0x55a594222400]{D} @ (0(0..5),0(0..5)) (5 nnz, 1 nnz/r) flags 0x2040186 (coo:1, csr:0, hw:1, ic:1, fi:0), storage: 40, subm: 1, symflags:''
> After threads auto-tuning, 100 multiplications took 0.000023s  --  further speedup of 14.1031 x
> 0/2  0 0 -> 0
> 1/2  1 0 -> 5
> 0/2  0 3 -> 0
> 1/2  1 3 -> 5
> librsb timer-based profiling is not supported in this build. If you wish to have it, re-configure librsb with its support. So you can safely ignore the error you might just have seen printed out on screen.
> Hello, RSB!
> Initializing the library...
> Correctly initialized the library.
> Correctly allocated a matrix with 7 nonzeroes.
> Summary information of the matrix:
> (6 x 6)[0x55a59421f070]{D} @ (0(1..2),0(5..6)) (1 nnz, 0.17 nnz/r) flags 0x20443ee (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 1, symflags:'UT'
> Matrix printout:
> %%MatrixMarket matrix coordinate real general
> 6 6 1
> 2	6	1
> 
> We have a unitary vector:
> %%MatrixMarket matrix array real general
> 6 1
> 1
> 1
> 1
> 1
> 1
> 1
> 
> Multiplying matrix by unitary vector we get:
> %%MatrixMarket matrix array real general
> 6 1
> 1
> 2
> 1
> 1
> 1
> 1
> 
> Backsolving we should get a unitary vector:
> %%MatrixMarket matrix array real general
> 6 1
> 1
> 1
> 1
> 1
> 1
> 1
> All done.
> Correctly freed the matrix.
> Correctly finalized the library.
> Program terminating with no error.
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	11
> 2	2	22
> 3	1	31
> 3	3	33
> RSB matrix uses 4.000000 bytes per nnz.
> Rows between 2 and 2 have 2 nnz
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	-110
> 2	2	-220
> 3	1	-310
> 3	3	-330
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	12100
> 2	2	48400
> 3	1	136400
> 3	3	108900
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	-220
> 2	2	-440
> 3	1	-620
> 3	3	-660
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	11
> 2	2	22
> 3	1	31
> 3	3	33
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	11
> 2	2	22
> 3	1	31
> 3	3	33
> /<<PKGBUILDDIR>>/examples/transpose
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	11
> 1	3	13
> 2	2	22
> 3	3	33
> %%MatrixMarket matrix coordinate real general
> 3 3 4
> 1	1	11
> 2	2	22
> 3	1	13
> 3	3	33
> %%MatrixMarket matrix coordinate real general
> 6 6 36
> 1	1	99
> 1	2	12
> 1	3	13
> 1	4	14
> 1	5	15
> 1	6	16
> 2	1	12
> 2	2	99
> 2	3	12
> 2	4	13
> 2	5	14
> 2	6	15
> 3	1	13
> 3	2	12
> 3	3	99
> 3	4	12
> 3	5	13
> 3	6	14
> 4	1	14
> 4	2	13
> 4	3	12
> 4	4	99
> 4	5	12
> 4	6	13
> 5	1	15
> 5	2	14
> 5	3	13
> 5	4	12
> 5	5	99
> 5	6	12
> 6	1	16
> 6	2	15
> 6	3	14
> 6	4	13
> 6	5	12
> 6	6	99
> /<<PKGBUILDDIR>>/examples/power
> it:1 norm:46.3573 norm diff:46.3573
> it:2 norm:29.7377 norm diff:-16.6196
> it:3 norm:31.4458 norm diff:1.70813
> it:4 norm:32.3019 norm diff:0.85606
> it:5 norm:32.6947 norm diff:0.392803
> it:6 norm:32.8676 norm diff:0.172871
> it:7 norm:32.9425 norm diff:0.0749779
> it:8 norm:32.975 norm diff:0.0324516
> it:9 norm:32.9891 norm diff:0.0140839
> it:10 norm:32.9952 norm diff:0.00613785
> it:11 norm:32.9979 norm diff:0.00268936
> it:12 norm:32.9991 norm diff:0.00117493
> it:13 norm:32.9996 norm diff:0.000518799
> it:14 norm:32.9998 norm diff:0.000228882
> it:15 norm:32.9999 norm diff:0.000102997
> it:16 norm:33 norm diff:4.19617e-05
> it:17 norm:33 norm diff:2.67029e-05
> it:18 norm:33 norm diff:0
> /<<PKGBUILDDIR>>/examples/autotune
> Creating 500 x 500 matrix with 62500 nonzeroes.
> Building a matrix with 62500 nnz, 500 x 500
> Duplicates check: 62500 - 0 = 62500
>  converted COO to RSB in 2.997e-03 s (100.00 %)
>  analyzed arrays in 5.198e-05 s (1.73 %)
>  cleaned-up arrays in 1.040e-04 s (3.47 %)
>  deduplicated arrays in 1.719e-04 s (5.74 %)
>  sorted arrays in 2.334e-03 s (77.88 %)
>  shuffled partitions in 2.749e-04 s (9.17 %)
>  memory allocations took 2.313e-05 s (0.77 %)
>  leafs setup took 5.007e-06 s (0.17 %)
>  halfword conversion took 2.694e-05 s (0.90 %)
> Built (500 x 500)[0x55ea9c2a7e60]{D} @ (0(0..0),0(0..0)) (62500 nnz, 1.2e+02 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 19, symflags:''
> Allocated matrix of 62500 nonzeroes:
> (500 x 500)[0x55ea9c2a7e60]{D} @ (0(0..0),0(0..0)) (62500 nnz, 1.2e+02 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 19, symflags:''
> 
> Before auto-tuning, 100 multiplications took 0.016107s.
> Threads autotuning (may take more than 1.500000s)...
> Will use autotuning routine to sample matrix: 500 x 500, type D, 62500 nnz, 1.2e+02 nnz/r, 27 subms, 19 lsubms, 2.0653 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> Sampling (15 x 0.1 s stages, transA=N, nrhs=2, timer gran.=3.06964e-08), 8 suggested as starting thread count(default).
> 3 iterations (8 th.) took 0.000319s; avg 0.0001063s ( +/-  28.48/ 43.95 %); best 7.606e-05s; worst 0.0001531s; std dev. 3.352e-05 (taking best).
> Reference operation time is 7.60555e-05 s (3287 Mflops) with 8 threads.
> 3 iterations (8 th.) took 0.000473s; avg 0.0001577s ( +/-  46.17/ 90.37 %); best 8.488e-05s; worst 0.0003002s; std dev. 0.0001008 (taking best).
> Reference operation time is 8.4877e-05 s (2945 Mflops) with 8 threads.
> After 0.000859s, autotuning routine did not find a better threads count configuration.
> (500 x 500)[0x55ea9c2a7e60]{D} @ (0(0..0),0(0..0)) (62500 nnz, 1.2e+02 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 19, symflags:''
> After threads auto-tuning, 100 multiplications took 0.021238s  --  effective speedup of 0.758414 x
> Matrix autotuning (may take more than 1.500000s; using 8 threads )...
> Will autotune matrix: 500 x 500, type D, 62500 nnz, 1.2e+02 nnz/r, 27 subms, 19 lsubms, 2.0653 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> Starting autotuning (15 x 0.1 s stages, transA=N, nrhs=2, timer gran.=3.06964e-08), 8 suggested as starting thread count.
> 3 iterations (8 th.) took 0.000303s; avg 0.000101s ( +/-  23.76/ 43.51 %); best 7.701e-05s; worst 0.000145s; std dev. 3.112e-05 (taking best).
> Reference operation time is 7.70092e-05 s (3246 Mflops) with 8 threads.
> Starting merge (user-supplied threads) based auto-tuning procedure (transA=N, nrhs=2, order=cols) (max 6 steps, inclusive 3 grace steps) on: 500 x 500, type D, 62500 nnz, 1.2e+02 nnz/r, 27 subms, 19 lsubms, 2.0653 bpnz (tpop: 7.701e-05  Mflops: 3246.365)
> Merge (19 -> 16 leaves) took w.c.t. of 7.606e-05s, ~5.817e-05s of computing time (of which 2.408e-05s sorting, 6.914e-06s analysis)
> 3 iterations (8 th.) took 0.000226s; avg 7.534e-05s ( +/-   1.90/  2.22 %); best 7.391e-05s; worst 7.701e-05s; std dev. 1.277e-06 (taking best).
> Reference operation time is 7.39098e-05 s (3383 Mflops) with 8 threads.
> After merge step 1: tpop: 7.391e-05 s   ~Mflops: 3382.503   nsubm:16 otn:8
> Applying merge (19 -> 16 leaves, 8 th.) yielded SPEEDUP of  1.042x: 7.701e-05s -> 7.391e-05s, so taking this instance.
> Merge (16 -> 13 leaves) took w.c.t. of 5.388e-05s, ~4.697e-05s of computing time (of which 2.384e-05s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 0.0004609s; avg 0.0001536s ( +/-  53.75/ 93.22 %); best 7.105e-05s; worst 0.0002968s; std dev. 0.0001017 (taking best).
> Reference operation time is 7.10487e-05 s (3519 Mflops) with 8 threads.
> After merge step 2: tpop: 7.105e-05 s   ~Mflops: 3518.711   nsubm:13 otn:8
> Applying merge (16 -> 13 leaves, 8 th.) yielded SPEEDUP of  1.040x: 7.391e-05s -> 7.105e-05s, so taking this instance.
> Merge (13 -> 10 leaves) took w.c.t. of 0.0001578s, ~0.0001521s of computing time (of which 9.68e-05s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 0.000217s; avg 7.232e-05s ( +/-   8.68/  6.48 %); best 6.604e-05s; worst 7.701e-05s; std dev. 4.616e-06 (taking best).
> Reference operation time is 6.60419e-05 s (3785 Mflops) with 8 threads.
> After merge step 3: tpop: 6.604e-05 s   ~Mflops: 3785.473   nsubm:10 otn:8
> Applying merge (13 -> 10 leaves, 8 th.) yielded SPEEDUP of  1.076x: 7.105e-05s -> 6.604e-05s, so taking this instance.
> Merge (10 -> 7 leaves) took w.c.t. of 0.000104s, ~9.704e-05s of computing time (of which 4.601e-05s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 0.000833s; avg 0.0002777s ( +/-  48.48/ 81.51 %); best 0.0001431s; worst 0.000504s; std dev. 0.000161 (taking best).
> Reference operation time is 0.000143051 s (1748 Mflops) with 8 threads.
> After merge step 4: tpop: 0.0001431 s   ~Mflops: 1747.627   nsubm:7 otn:8
> Applying merge (10 -> 7 leaves, 8 th.) yielded SLOWDOWN (1th of 3 tolerable) of  2.166x: 6.604e-05s -> 0.0001431s.
> Skipping further merge based tests after 1 definite performance degradations in a row (and last exceeding limit).
> A total of 4 merge steps (of max 6) (19 -> 7 subms) took 0.003938s (of which 0.001713s partitioning, 0s I/O); computing times: 0.0003543s in par. loops, 0.0001907s sorting, 1.264e-05s analyzing)
> Total merge + benchmarking process took 0.003938s, equivalent to 59.6/51.1 new/old ops (0.0008938s for 4 clones -- as 13.5/11.6 ops, or 3.4/2.9 ops per clone), SPEEDUP of  1.166x
> Applying multi-merge (19 -> 10 leaves, 3 steps, 8 -> 8 th.sp.) yielded SPEEDUP of  1.166x (7.701e-05s -> 6.604e-05s), will amortize in      359.1 ops by saving 1.097e-05s per op.
> In 1 tuning rounds (tot. 0.0049s, 0.00089s for constructor, 4 clones) obtained a SPEEDUP of   16.6% (1.166x) (from 3246 to 3785 Mflops).
> After 0.004961s, autotuning routine declared speedup of 1.16606 x, when using threads count of 8.
> (500 x 500)[0x55ea9c2ae300]{D} @ (0(0..0),0(0..0)) (62500 nnz, 1.2e+02 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 10, symflags:''
> After threads auto-tuning, 100 multiplications took 0.013883s  --  further speedup of 1.52979 x
> librsb timer-based profiling is not supported in this build. If you wish to have it, re-configure librsb with its support. So you can safely ignore the error you might just have seen printed out on screen.
> /<<PKGBUILDDIR>>/examples/backsolve
> Hello, RSB!
> Initializing the library...
> Correctly initialized the library.
> Building a matrix with 7 nnz, 6 x 6
> Duplicates check: 1 - 0 = 1
>  converted COO to RSB in 4.709e-04 s (100.00 %)
>  analyzed arrays in 3.729e-04 s (79.19 %)
>  cleaned-up arrays in 5.960e-06 s (1.27 %)
>  deduplicated arrays in 0.000e+00 s (0.00 %)
>  sorted arrays in 2.146e-06 s (0.46 %)
>  shuffled partitions in 8.821e-06 s (1.87 %)
>  memory allocations took 6.890e-05 s (14.63 %)
>  leafs setup took 2.146e-06 s (0.46 %)
>  halfword conversion took 7.868e-06 s (1.67 %)
> Built (6 x 6)[0x5610cdb56060]{D} @ (0(0..1),0(5..6)) (1 nnz, 0.17 nnz/r) flags 0x20443ee (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 1, symflags:'UT'
> Correctly allocated a matrix with 7 nonzeroes.
> Summary information of the matrix:
> (6 x 6)[0x5610cdb56060]{D} @ (0(0..1),0(5..6)) (1 nnz, 0.17 nnz/r) flags 0x20443ee (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 1, symflags:'UT'
> Matrix printout:
> %%MatrixMarket matrix coordinate real general
> 6 6 1
> 1	6	1
> 
> We have a unitary vector:
> %%MatrixMarket matrix array real general
> 6 1
> 1
> 1
> 1
> 1
> 1
> 1
> 
> Multiplying matrix by unitary vector we get:
> %%MatrixMarket matrix array real general
> 6 1
> 2
> 1
> 1
> 1
> 1
> 1
> Will autotune matrix: 6 x 6, type D, 1 nnz, 0.17 nnz/r, 1 subms, 1 lsubms, 4.0000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:4.416e-08
> 3 iterations (8 th.) took 1.693e-05s; avg 5.643e-06s ( +/-  99.22/200.00 %); best 4.416e-08s; worst 1.693e-05s; std dev. 7.98e-06 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=cols) (max 6 steps, inclusive 3 grace steps) on: 6 x 6, type D, 1 nnz, 0.17 nnz/r, 1 subms, 1 lsubms, 4.0000 bpnz (tpop: 4.416e-08  Mflops: 45.295)
> Merge (1 -> 1 leaves) took w.c.t. of 0s, ~0s of computing time (of which 0s sorting, 0s analysis)
> 3 iterations (8 th.) took 9.537e-07s; avg 3.179e-07s ( +/-  86.11/200.00 %); best 4.416e-08s; worst 9.537e-07s; std dev. 4.496e-07 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> After merge step 1: tpop: 4.416e-08 s   ~Mflops: 45.295   nsubm:1 otn:8
> Applying merge (1 -> 1 leaves, 8 th.) yielded NEGLIGIBLE change (1th in a row) (old/new=1.00000x): 4.416e-08s -> 4.416e-08s, so IGNORING this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (1 -> 1 subms) took 1.407e-05s (of which 3.099e-06s partitioning, 0s I/O); computing times: 0s in par. loops, 0s sorting, 0s analyzing)
> Total merge + benchmarking process took 1.407e-05s, equivalent to 318.6/318.6 new/old ops (1.407e-05s for 1 clones -- as 318.6/318.6 ops, or 318.6/318.6 ops per clone), SPEEDUP of  1.000x (NO SPEEDUP)
> Merging based autotuning FAILED (=NO SPEEDUP); let's try splitting then...
> 3 iterations (8 th.) took 9.537e-07s; avg 3.179e-07s ( +/-  86.11/200.00 %); best 4.416e-08s; worst 9.537e-07s; std dev. 4.496e-07 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> Starting split (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=cols) (max 6 steps, inclusive 3 grace steps) on: 6 x 6, type D, 1 nnz, 0.17 nnz/r, 1 subms, 1 lsubms, 4.0000 bpnz (tpop: 4.416e-08  Mflops: 45.295)
> Split (1 -> 1 leaves, 1 -> 1 subms) took 1.097e-05s (of which: 9.537e-07s analysis, 0s mem.mgmt); compute time: 0s overall, 0s searches, 0s shuffle, 0s switch, 0s quadrants.
> 3 iterations (8 th.) took 1.192e-06s; avg 3.974e-07s ( +/-  88.89/200.00 %); best 4.416e-08s; worst 1.192e-06s; std dev. 5.62e-07 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> After split step 1: tpop: 4.416e-08 s   ~Mflops: 45.295   nsubm:1 otn:8
> Applying split (1 -> 1 leaves, 8 th.) yielded NEGLIGIBLE change (1th in a row) (old/new=1.00000x): 4.416e-08s -> 4.416e-08s, so IGNORING this instance.
> Split (1 -> 1 leaves, 1 -> 1 subms) took 2.146e-06s (of which: 0s analysis, 0s mem.mgmt); compute time: 0s overall, 0s searches, 0s shuffle, 0s switch, 0s quadrants.
> 3 iterations (8 th.) took 9.537e-07s; avg 3.179e-07s ( +/-  86.11/200.00 %); best 4.416e-08s; worst 9.537e-07s; std dev. 4.496e-07 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> After split step 2: tpop: 4.416e-08 s   ~Mflops: 45.295   nsubm:1 otn:8
> Applying split (1 -> 1 leaves, 8 th.) yielded NEGLIGIBLE change (2th in a row) (old/new=1.00000x): 4.416e-08s -> 4.416e-08s, so IGNORING this instance.
> Split (1 -> 1 leaves, 1 -> 1 subms) took 1.097e-05s (of which: 1.192e-06s analysis, 0s mem.mgmt); compute time: 0s overall, 0s searches, 0s shuffle, 0s switch, 0s quadrants.
> 3 iterations (8 th.) took 4.053e-06s; avg 1.351e-06s ( +/-  96.73/200.00 %); best 4.416e-08s; worst 4.053e-06s; std dev. 1.911e-06 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> After split step 3: tpop: 4.416e-08 s   ~Mflops: 45.295   nsubm:1 otn:8
> Applying split (1 -> 1 leaves, 8 th.) yielded NEGLIGIBLE change (3th in a row) (old/new=1.00000x): 4.416e-08s -> 4.416e-08s, so IGNORING this instance.
> Split (1 -> 1 leaves, 1 -> 1 subms) took 1.907e-06s (of which: 0s analysis, 0s mem.mgmt); compute time: 0s overall, 0s searches, 0s shuffle, 0s switch, 0s quadrants.
> 3 iterations (8 th.) took 0s; avg 0s ( +/-   -inf/  -nan %); best 4.416e-08s; worst 0s; std dev. 0 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> After split step 4: tpop: 4.416e-08 s   ~Mflops: 45.295   nsubm:1 otn:8
> Applying split (1 -> 1 leaves, 8 th.) yielded NEGLIGIBLE change (4th in a row) (old/new=1.00000x): 4.416e-08s -> 4.416e-08s, so IGNORING this instance.
> Split (1 -> 1 leaves, 1 -> 1 subms) took 1.216e-05s (of which: 0s analysis, 0s mem.mgmt); compute time: 0s overall, 0s searches, 0s shuffle, 0s switch, 0s quadrants.
> 3 iterations (8 th.) took 9.537e-07s; avg 3.179e-07s ( +/-  86.11/200.00 %); best 4.416e-08s; worst 9.537e-07s; std dev. 4.496e-07 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> After split step 5: tpop: 4.416e-08 s   ~Mflops: 45.295   nsubm:1 otn:8
> Applying split (1 -> 1 leaves, 8 th.) yielded NEGLIGIBLE change (5th in a row) (old/new=1.00000x): 4.416e-08s -> 4.416e-08s, so IGNORING this instance.
> Split (1 -> 1 leaves, 1 -> 1 subms) took 9.537e-07s (of which: 0s analysis, 0s mem.mgmt); compute time: 0s overall, 0s searches, 0s shuffle, 0s switch, 0s quadrants.
> 3 iterations (8 th.) took 0s; avg 0s ( +/-   -inf/  -nan %); best 4.416e-08s; worst 0s; std dev. 0 (taking best).
> Reference operation time is 4.41551e-08 s (45.29 Mflops) with 8 threads.
> After split step 6: tpop: 4.416e-08 s   ~Mflops: 45.295   nsubm:1 otn:8
> Applying split (1 -> 1 leaves, 8 th.) yielded NEGLIGIBLE change (6th in a row) (old/new=1.00000x): 4.416e-08s -> 4.416e-08s, so IGNORING this instance.
> A total of 6 split steps (of max 6) (1 -> 1 subms) took 0.002006s (of which 5.913e-05s partitioning, 0s I/O); computing times: 0s in par. loops, 0s sorting, 2.146e-06s analyzing)
> Total split + benchmarking process took 0.002006s, equivalent to 45432.0/45432.0 new/old ops (5.96e-06s for 1 clones -- as 135.0/135.0 ops, or 135.0/135.0 ops per clone), SPEEDUP of  1.000x (NO SPEEDUP)
> In 1 tuning rounds (tot. 0.0021s, 2e-05s for constructor, 2 clones) obtained NO speedup (best stays 45.29 Mflops).
> 
> Backsolving we should get a unitary vector:
> %%MatrixMarket matrix array real general
> 6 1
> 1
> 1
> 1
> 1
> 1
> 1
> All done.
> Correctly freed the matrix.
> Correctly finalized the library.
> Program terminating with no error.
> /<<PKGBUILDDIR>>/examples/hello-spblas
> Hello, RSB!
> Correctly initialized the library.
> Correctly allocated a matrix.
> Correctly performed a SPMV.
> Correctly freed the matrix.
> Correctly finalized the library.
> Program terminating with no error.
> /<<PKGBUILDDIR>>/examples/io-spblas
> Hello, RSB!
> Correctly initialized the library.
> Correctly loaded and allocated a matrix from file pd.mtx.
> Now SPMV with NULL vectors will be attempted, resulting in an error (so don't worry).
> Correctly detected an error condition.
> Program correctly recovered from intentional error condition.
> Correctly freed the matrix.
> Correctly finalized the library.
> /<<PKGBUILDDIR>>/examples/fortran
> Building a matrix with 210 nnz, 20 x 20
> Duplicates check: 210 - 0 = 210
>  converted COO to RSB in 7.620e-04 s (100.00 %)
>  analyzed arrays in 3.314e-05 s (4.35 %)
>  cleaned-up arrays in 1.907e-06 s (0.25 %)
>  deduplicated arrays in 2.146e-06 s (0.28 %)
>  sorted arrays in 6.340e-04 s (83.20 %)
>  shuffled partitions in 2.599e-05 s (3.41 %)
>  memory allocations took 3.481e-05 s (4.57 %)
>  leafs setup took 4.053e-06 s (0.53 %)
>  halfword conversion took 1.597e-05 s (2.10 %)
> Built (20 x 20)[0x55f9f71a9580]{D} @ (0(0..0),0(0..0)) (210 nnz, 10 nnz/r) flags 0x2446396 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 22, symflags:'LS'
> Will autotune matrix: 20 x 20, type D, 210 nnz, 10 nnz/r, 30 subms, 22 lsubms, 3.7524 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:4.325e-08
> Starting autotuning (16 x 4.32491e-08 s stages, transA=N, nrhs=1, timer gran.=4.32491e-08), 8 suggested as starting thread count(default).
> 3 iterations (8 th.) took 0.000273s; avg 9.1e-05s ( +/-  49.43/ 87.86 %); best 4.601e-05s; worst 0.0001709s; std dev. 5.668e-05 (taking best).
> Reference operation time is 4.60148e-05 s (18.26 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=1, order=cols) (max 6 steps, inclusive 3 grace steps) on: 20 x 20, type D, 210 nnz, 10 nnz/r, 30 subms, 22 lsubms, 3.7524 bpnz (tpop: 4.601e-05  Mflops: 18.255)
> Merge (22 -> 16 leaves) took w.c.t. of 0.003675s, ~0.003658s of computing time (of which 1.907e-06s sorting, 3.099e-06s analysis)
> 3 iterations (8 th.) took 0.002114s; avg 0.0007046s ( +/-  94.76/184.98 %); best 3.695e-05s; worst 0.002008s; std dev. 0.0009217 (taking best).
> Reference operation time is 3.69549e-05 s (22.73 Mflops) with 8 threads.
> After merge step 1: tpop: 3.695e-05 s   ~Mflops: 22.730   nsubm:16 otn:8
> Applying merge (22 -> 16 leaves, 8 th.) yielded SPEEDUP of  1.245x: 4.601e-05s -> 3.695e-05s, so taking this instance.
> Merge (16 -> 13 leaves) took w.c.t. of 1.216e-05s, ~3.099e-06s of computing time (of which 0s sorting, 3.099e-06s analysis)
> 3 iterations (8 th.) took 9.298e-05s; avg 3.099e-05s ( +/-   6.92/ 10.00 %); best 2.885e-05s; worst 3.409e-05s; std dev. 2.245e-06 (taking best).
> Reference operation time is 2.88486e-05 s (29.12 Mflops) with 8 threads.
> After merge step 2: tpop: 2.885e-05 s   ~Mflops: 29.117   nsubm:13 otn:8
> Applying merge (16 -> 13 leaves, 8 th.) yielded SPEEDUP of  1.281x: 3.695e-05s -> 2.885e-05s, so taking this instance.
> Merge (13 -> 10 leaves) took w.c.t. of 5.96e-06s, ~9.537e-07s of computing time (of which 0s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 9.203e-05s; avg 3.068e-05s ( +/-  21.50/ 36.79 %); best 2.408e-05s; worst 4.196e-05s; std dev. 8.018e-06 (taking best).
> Reference operation time is 2.40803e-05 s (34.88 Mflops) with 8 threads.
> After merge step 3: tpop: 2.408e-05 s   ~Mflops: 34.883   nsubm:10 otn:8
> Applying merge (13 -> 10 leaves, 8 th.) yielded SPEEDUP of  1.198x: 2.885e-05s -> 2.408e-05s, so taking this instance.
> Merge (10 -> 8 leaves) took w.c.t. of 5.007e-06s, ~2.146e-06s of computing time (of which 0s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 6.509e-05s; avg 2.17e-05s ( +/-  21.98/ 34.07 %); best 1.693e-05s; worst 2.909e-05s; std dev. 5.299e-06 (taking best).
> Reference operation time is 1.69277e-05 s (49.62 Mflops) with 8 threads.
> After merge step 4: tpop: 1.693e-05 s   ~Mflops: 49.623   nsubm:8 otn:8
> Applying merge (10 -> 8 leaves, 8 th.) yielded SPEEDUP of  1.423x: 2.408e-05s -> 1.693e-05s, so taking this instance.
> Merge (8 -> 6 leaves) took w.c.t. of 4.768e-06s, ~1.907e-06s of computing time (of which 0s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 5.698e-05s; avg 1.899e-05s ( +/-  10.88/  5.44 %); best 1.693e-05s; worst 2.003e-05s; std dev. 1.461e-06 (taking best).
> Reference operation time is 1.69277e-05 s (49.62 Mflops) with 8 threads.
> After merge step 5: tpop: 1.693e-05 s   ~Mflops: 49.623   nsubm:6 otn:8
> Applying merge (8 -> 6 leaves, 8 th.) yielded NEGLIGIBLE change (1th in a row) (old/new=1.00000x): 1.693e-05s -> 1.693e-05s, so IGNORING this instance.
> Merge (6 -> 3 leaves) took w.c.t. of 5.007e-06s, ~3.099e-06s of computing time (of which 1.192e-06s sorting, 0s analysis)
> 3 iterations (8 th.) took 0.0008221s; avg 0.000274s ( +/-  96.35/192.34 %); best 1.001e-05s; worst 0.0008011s; std dev. 0.0003727 (taking best).
> Reference operation time is 1.00136e-05 s (83.89 Mflops) with 8 threads.
> After merge step 6: tpop: 1.001e-05 s   ~Mflops: 83.886   nsubm:3 otn:8
> Applying merge (6 -> 3 leaves, 8 th.) yielded SPEEDUP of  1.690x: 1.693e-05s -> 1.001e-05s, so taking this instance.
> A total of 6 merge steps (of max 6) (22 -> 3 subms) took 0.008676s (of which 0.005291s partitioning, 0s I/O); computing times: 0.003669s in par. loops, 3.099e-06s sorting, 1.001e-05s analyzing)
> Total merge + benchmarking process took 0.008676s, equivalent to 866.4/188.5 new/old ops (7.844e-05s for 6 clones -- as 7.8/1.7 ops, or 1.3/0.3 ops per clone), SPEEDUP of  4.595x
> Applying multi-merge (22 -> 3 leaves, 6 steps, 0 -> 8 th.sp.) yielded SPEEDUP of  4.595x (4.601e-05s -> 1.001e-05s), will amortize in      241.0 ops by saving 3.6e-05s per op.
> In 1 tuning rounds (tot. 0.009s, 7.8e-05s for constructor, 6 clones) obtained a SPEEDUP of  359.5% (4.595x) (from 18.26 to 83.89 Mflops).
>  autotuner chose            8  threads
> Will autotune matrix: 20 x 20, type D, 210 nnz, 10 nnz/r, 4 subms, 3 lsubms, 2.6286 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:4.325e-08
> Starting autotuning (16 x 4.32491e-08 s stages, transA=N, nrhs=1, timer gran.=4.32491e-08), 8 suggested as starting thread count(default).
> 3 iterations (8 th.) took 2.003e-05s; avg 6.676e-06s ( +/-  25.00/ 50.00 %); best 5.007e-06s; worst 1.001e-05s; std dev. 2.36e-06 (taking best).
> ~ 8 threads: 5.007e-06s  (1.7e+02 Mflops) (0/2 degradations so far)  - 
> 3 iterations (7 th.) took 2.694e-05s; avg 8.98e-06s ( +/-  33.63/ 56.64 %); best 5.96e-06s; worst 1.407e-05s; std dev. 3.618e-06 (taking best).
>   7 threads: 5.96e-06s  (1.4e+02 Mflops) (1/2 degradations so far)  - 
> 3 iterations (6 th.) took 1.884e-05s; avg 6.278e-06s ( +/-  20.25/ 25.32 %); best 5.007e-06s; worst 7.868e-06s; std dev. 1.189e-06 (taking best).
>   6 threads: 5.007e-06s  (1.7e+02 Mflops) (0/2 degradations so far)  - 
> 3 iterations (5 th.) took 1.717e-05s; avg 5.722e-06s ( +/-  12.50/  8.33 %); best 5.007e-06s; worst 6.199e-06s; std dev. 5.15e-07 (taking best).
>   5 threads: 5.007e-06s  (1.7e+02 Mflops) (0/2 degradations so far)  - 
> 3 iterations (4 th.) took 1.907e-05s; avg 6.358e-06s ( +/-  21.25/ 27.50 %); best 5.007e-06s; worst 8.106e-06s; std dev. 1.296e-06 (taking best).
>   4 threads: 5.007e-06s  (1.7e+02 Mflops) (0/2 degradations so far)  - 
> 3 iterations (3 th.) took 0.0006969s; avg 0.0002323s ( +/-  95.69/175.47 %); best 1.001e-05s; worst 0.0006399s; std dev. 0.0002886 (taking best).
>   3 threads: 1.001e-05s  (84 Mflops) (1/2 degradations so far)  - 
> 3 iterations (2 th.) took 1.693e-05s; avg 5.643e-06s ( +/-  11.27/  5.63 %); best 5.007e-06s; worst 5.96e-06s; std dev. 4.496e-07 (taking best).
>   2 threads: 5.007e-06s  (1.7e+02 Mflops) (0/2 degradations so far)  - 
> 3 iterations (1 th.) took 1.097e-05s; avg 3.656e-06s ( +/-  21.74/ 10.87 %); best 2.861e-06s; worst 4.053e-06s; std dev. 5.62e-07 (taking best).
>   1 threads: 2.861e-06s  (2.9e+02 Mflops) (0/2 degradations so far)  - 
> Best threads choice is 1; starting threads were 8; max speed gap is 3.5x; search took 0.00088s.
> Starting merge (and threads) based auto-tuning procedure (transA=N, nrhs=1, order=cols) (max 6 steps, inclusive 3 grace steps) on: 20 x 20, type D, 210 nnz, 10 nnz/r, 4 subms, 3 lsubms, 2.6286 bpnz (tpop: 2.861e-06  Mflops: 293.601)
> Merge (3 -> 1 leaves) took w.c.t. of 1.097e-05s, ~7.153e-06s of computing time (of which 1.907e-06s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 2.861e-06s; avg 9.537e-07s ( +/-   0.00/  0.00 %); best 9.537e-07s; worst 9.537e-07s; std dev. 0 (taking best).
> ~ 8 threads: 9.537e-07s  (8.8e+02 Mflops) (0/2 degradations so far)  - 
> 3 iterations (7 th.) took 2.146e-06s; avg 7.153e-07s ( +/-  93.95/ 66.67 %); best 4.325e-08s; worst 1.192e-06s; std dev. 5.15e-07 (taking best).
>   7 threads: 4.325e-08s  (1.9e+04 Mflops) (0/2 degradations so far)  - 
> 3 iterations (6 th.) took 1.907e-06s; avg 6.358e-07s ( +/-  93.20/ 50.00 %); best 4.325e-08s; worst 9.537e-07s; std dev. 4.496e-07 (taking best).
>   6 threads: 4.325e-08s  (1.9e+04 Mflops) (0/2 degradations so far)  - 
> 3 iterations (5 th.) took 1.907e-06s; avg 6.358e-07s ( +/-  93.20/ 50.00 %); best 4.325e-08s; worst 9.537e-07s; std dev. 4.496e-07 (taking best).
>   5 threads: 4.325e-08s  (1.9e+04 Mflops) (0/2 degradations so far)  - 
> 3 iterations (4 th.) took 1.907e-06s; avg 6.358e-07s ( +/-  93.20/ 50.00 %); best 4.325e-08s; worst 9.537e-07s; std dev. 4.496e-07 (taking best).
>   4 threads: 4.325e-08s  (1.9e+04 Mflops) (0/2 degradations so far)  - 
> 3 iterations (3 th.) took 1.907e-06s; avg 6.358e-07s ( +/-  93.20/ 50.00 %); best 4.325e-08s; worst 9.537e-07s; std dev. 4.496e-07 (taking best).
>   3 threads: 4.325e-08s  (1.9e+04 Mflops) (0/2 degradations so far)  - 
> 3 iterations (2 th.) took 3.099e-06s; avg 1.033e-06s ( +/-   7.69/ 15.38 %); best 9.537e-07s; worst 1.192e-06s; std dev. 1.124e-07 (taking best).
>   2 threads: 9.537e-07s  (8.8e+02 Mflops) (1/2 degradations so far)  - 
> 3 iterations (1 th.) took 2.861e-06s; avg 9.537e-07s ( +/-   0.00/  0.00 %); best 9.537e-07s; worst 9.537e-07s; std dev. 0 (taking best).
>   1 threads: 9.537e-07s  (8.8e+02 Mflops) (2/2 degradations so far)  - 
> Best threads choice is 7; starting threads were 8; max speed gap is 22x; search took 0.0011s.
> After merge step 1: tpop: 4.325e-08 s   ~Mflops: 19422.356   nsubm:1 otn:7
> Applying merge (3 -> 1 leaves, 7 th.) yielded SPEEDUP of 66.152x: 2.861e-06s -> 4.325e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 1 merge steps (of max 6) (3 -> 1 subms) took 0.00243s (of which 1.502e-05s partitioning, 0s I/O); computing times: 7.153e-06s in par. loops, 1.907e-06s sorting, 9.537e-07s analyzing)
> Total merge + benchmarking process took 0.00243s, equivalent to 56185.2/849.3 new/old ops (0.001289s for 2 clones -- as 29807.1/450.6 ops, or 14903.5/225.3 ops per clone), SPEEDUP of 66.152x
> Applying multi-merge (3 -> 1 leaves, 1 steps, 1 -> 7 th.sp.) yielded SPEEDUP of 66.152x (2.861e-06s -> 4.325e-08s), will amortize in      862.4 ops by saving 2.818e-06s per op.
> In 1 tuning rounds (tot. 0.0033s, 0.0013s for constructor, 2 clones) obtained a SPEEDUP of 6515.2% (66.15x) (from 293.6 to 1.942e+04 Mflops).
>  check results are ok
> Building a matrix with 36 nnz, 6 x 6
> Duplicates check: 36 - 0 = 36
>  converted COO to RSB in 2.217e-05 s (100.00 %)
>  analyzed arrays in 5.007e-06 s (22.58 %)
>  cleaned-up arrays in 0.000e+00 s (0.00 %)
>  deduplicated arrays in 9.537e-07 s (4.30 %)
>  sorted arrays in 5.960e-06 s (26.88 %)
>  shuffled partitions in 2.861e-06 s (12.90 %)
>  memory allocations took 4.292e-06 s (19.35 %)
>  leafs setup took 0.000e+00 s (0.00 %)
>  halfword conversion took 2.146e-06 s (9.68 %)
> Built (6 x 6)[0x55f9f71af690]{Z} @ (0(0..6),0(0..6)) (36 nnz, 6 nnz/r) flags 0x20440b4 (coo:0, csr:1, hw:0, ic:1, fi:0), storage: 1, subm: 1, symflags:'UL'
>  Read matrix pd.mtx            6 x           6 :          36
>  Matrix has no symmetry
> Using NRHS=4
> Repeated USMV took   0.3099E-04 s
> A single USMM took   0.1311E-04 s
> USMM-to-USMV speed ratio is is   2.364    x
>  Call auto-tuning routine..
>  Repeat measurement.
> Tuned USMM took   0.2861E-05 s
> Tuned-to-untuned speed ratio is is   4.583    x
>  FAILED:           0
>  PASSED:           2
> /<<PKGBUILDDIR>>/examples/fortran_rsb_fi
> %%MatrixMarket matrix coordinate real general
> 2 2 4
> 1	1	1
> 1	2	1
> 2	1	1
> 2	2	1
>  Optimal number of threads:           0
> %%MatrixMarket matrix coordinate real general
> 2 2 4
> 1	1	1
> 1	2	1
> 2	1	1
> 2	2	1
>  type=d dims=2x2 sym=g diag=g blocks=1x1 usmv alpha= 3 beta= 1 incx=1 incy=1 trans=n is ok
>  rsb module fortran test is ok
>  rsb module fortran test is ok
> %%MatrixMarket matrix coordinate real symmetric
> 2 2 3
> 1	1	11
> 2	1	21
> 2	2	22
>  Optimal number of threads:           0
> %%MatrixMarket matrix coordinate real symmetric
> 2 2 3
> 1	1	11
> 2	1	21
> 2	2	22
>    215.00000000000000        264.00000000000000     
>  type=d dims=2x2 sym=s diag=g blocks=1x1 usmv alpha= 4 beta= 1 incx=1 incy=1 trans=n is ok
>  FAILED:           0
>  PASSED:           3
> /<<PKGBUILDDIR>>/examples/cplusplus
> %%MatrixMarket matrix coordinate real general
> 6 6 7
> 1	1	1
> 2	1	2
> 2	2	1
> 3	3	1
> 4	4	1
> 5	5	1
> 6	6	1
> ./autotune  /<<PKGBUILDDIR>>/pd.mtx
> Loading matrix from file "/<<PKGBUILDDIR>>/pd.mtx".
> Building a matrix with 36 nnz, 6 x 6
> Duplicates check: 36 - 0 = 36
>  converted COO to RSB in 1.948e-03 s (100.00 %)
>  analyzed arrays in 2.789e-05 s (1.43 %)
>  cleaned-up arrays in 9.537e-07 s (0.05 %)
>  deduplicated arrays in 9.537e-07 s (0.05 %)
>  sorted arrays in 1.858e-03 s (95.37 %)
>  shuffled partitions in 2.408e-05 s (1.24 %)
>  memory allocations took 7.153e-06 s (0.37 %)
>  leafs setup took 4.053e-06 s (0.21 %)
>  halfword conversion took 1.287e-05 s (0.66 %)
> Built (6 x 6)[0x55a2e9b455a0]{D} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x42046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 22, symflags:''
> Considering D clone.
> Base matrix:
> (6 x 6)[0x55a2e9b48d00]{D} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 22, symflags:''
> 
> Will use autotuning routine to sample matrix: 6 x 6, type D, 36 nnz, 6 nnz/r, 29 subms, 22 lsubms, 4.6667 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> 3 iterations (8 th.) took 0.0002551s; avg 8.504e-05s ( +/-  55.14/ 36.54 %); best 3.815e-05s; worst 0.0001161s; std dev. 3.374e-05 (taking best).
> Reference operation time is 3.8147e-05 s (3.775 Mflops) with 8 threads.
> After 0.000293s, autotuning routine did not find a better threads count configuration.
> 
> Will autotune matrix: 6 x 6, type D, 36 nnz, 6 nnz/r, 29 subms, 22 lsubms, 4.6667 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> 3 iterations (8 th.) took 0.0001168s; avg 3.894e-05s ( +/-   2.65/  4.69 %); best 3.791e-05s; worst 4.077e-05s; std dev. 1.296e-06 (taking best).
> Reference operation time is 3.79086e-05 s (3.799 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=2, order=cols) (max 6 steps, inclusive 3 grace steps) on: 6 x 6, type D, 36 nnz, 6 nnz/r, 29 subms, 22 lsubms, 4.6667 bpnz (tpop: 3.791e-05  Mflops: 3.799)
> Merge (22 -> 16 leaves) took w.c.t. of 0.0002329s, ~0.0002313s of computing time (of which 9.537e-07s sorting, 3.099e-06s analysis)
> 3 iterations (8 th.) took 0.000504s; avg 0.000168s ( +/-  79.85/ 98.25 %); best 3.386e-05s; worst 0.0003331s; std dev. 0.0001241 (taking best).
> Reference operation time is 3.38554e-05 s (4.253 Mflops) with 8 threads.
> After merge step 1: tpop: 3.386e-05 s   ~Mflops: 4.253   nsubm:16 otn:8
> Applying merge (22 -> 16 leaves, 8 th.) yielded SPEEDUP of  1.120x: 3.791e-05s -> 3.386e-05s, so taking this instance.
> Merge (16 -> 10 leaves) took w.c.t. of 1.812e-05s, ~1.407e-05s of computing time (of which 9.537e-07s sorting, 2.146e-06s analysis)
> 3 iterations (8 th.) took 0.001069s; avg 0.0003564s ( +/-  60.12/104.59 %); best 0.0001421s; worst 0.0007291s; std dev. 0.0002645 (taking best).
> Reference operation time is 0.000142097 s (1.013 Mflops) with 8 threads.
> After merge step 2: tpop: 0.0001421 s   ~Mflops: 1.013   nsubm:10 otn:8
> Applying merge (16 -> 10 leaves, 8 th.) yielded SLOWDOWN (1th of 3 tolerable) of  4.197x: 3.386e-05s -> 0.0001421s.
> Skipping further merge based tests after 1 definite performance degradations in a row (and last exceeding limit).
> A total of 2 merge steps (of max 6) (22 -> 10 subms) took 0.00193s (of which 0.000257s partitioning, 0s I/O); computing times: 0.0002453s in par. loops, 1.907e-06s sorting, 5.245e-06s analyzing)
> Total merge + benchmarking process took 0.00193s, equivalent to 57.0/50.9 new/old ops (2.48e-05s for 2 clones -- as 0.7/0.7 ops, or 0.4/0.3 ops per clone), SPEEDUP of  1.120x
> Applying multi-merge (22 -> 16 leaves, 1 steps, 0 -> 8 th.sp.) yielded SPEEDUP of  1.120x (3.791e-05s -> 3.386e-05s), will amortize in      476.2 ops by saving 4.053e-06s per op.
> In 1 tuning rounds (tot. 0.0021s, 2.5e-05s for constructor, 2 clones) obtained a SPEEDUP of   12.0% (1.12x) (from 3.799 to 4.253 Mflops).
> After 0.002098s, global autotuning declared speedup of 1.11972 x, when using threads count of 8 and a new matrix:
> (6 x 6)[0x55a2e9b4b030]{D} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 16, symflags:''
> 
> Considering S clone.
> Building a matrix with 36 nnz, 6 x 6
> Duplicates check: 36 - 0 = 36
>  converted COO to RSB in 4.506e-05 s (100.00 %)
>  analyzed arrays in 1.907e-05 s (42.33 %)
>  cleaned-up arrays in 9.537e-07 s (2.12 %)
>  deduplicated arrays in 0.000e+00 s (0.00 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 1.407e-05 s (31.22 %)
>  memory allocations took 4.053e-06 s (8.99 %)
>  leafs setup took 9.537e-07 s (2.12 %)
>  halfword conversion took 5.960e-06 s (13.23 %)
> Built (6 x 6)[0x55a2e9b48d00]{S} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x42046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 16, symflags:''
> Base matrix:
> (6 x 6)[0x55a2e9b48d00]{S} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 16, symflags:''
> 
> Will use autotuning routine to sample matrix: 6 x 6, type S, 36 nnz, 6 nnz/r, 21 subms, 16 lsubms, 4.5000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> 3 iterations (8 th.) took 0.0003211s; avg 0.000107s ( +/-  69.93/ 97.10 %); best 3.219e-05s; worst 0.000211s; std dev. 7.584e-05 (taking best).
> Reference operation time is 3.21865e-05 s (4.474 Mflops) with 8 threads.
> After 0.000336s, autotuning routine did not find a better threads count configuration.
> 
> Will autotune matrix: 6 x 6, type S, 36 nnz, 6 nnz/r, 21 subms, 16 lsubms, 4.5000 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> 3 iterations (8 th.) took 8.202e-05s; avg 2.734e-05s ( +/-   1.45/  2.91 %); best 2.694e-05s; worst 2.813e-05s; std dev. 5.62e-07 (taking best).
> Reference operation time is 2.69413e-05 s (5.345 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=2, order=cols) (max 6 steps, inclusive 3 grace steps) on: 6 x 6, type S, 36 nnz, 6 nnz/r, 21 subms, 16 lsubms, 4.5000 bpnz (tpop: 2.694e-05  Mflops: 5.345)
> Merge (16 -> 13 leaves) took w.c.t. of 6.914e-06s, ~2.146e-06s of computing time (of which 0s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 7.606e-05s; avg 2.535e-05s ( +/-   9.72/ 18.50 %); best 2.289e-05s; worst 3.004e-05s; std dev. 3.317e-06 (taking best).
> Reference operation time is 2.28882e-05 s (6.291 Mflops) with 8 threads.
> After merge step 1: tpop: 2.289e-05 s   ~Mflops: 6.291   nsubm:13 otn:8
> Applying merge (16 -> 13 leaves, 8 th.) yielded SPEEDUP of  1.177x: 2.694e-05s -> 2.289e-05s, so taking this instance.
> Merge (13 -> 10 leaves) took w.c.t. of 5.96e-06s, ~9.537e-07s of computing time (of which 0s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 6.199e-05s; avg 2.066e-05s ( +/-   8.85/ 16.54 %); best 1.884e-05s; worst 2.408e-05s; std dev. 2.418e-06 (taking best).
> Reference operation time is 1.88351e-05 s (7.645 Mflops) with 8 threads.
> After merge step 2: tpop: 1.884e-05 s   ~Mflops: 7.645   nsubm:10 otn:8
> Applying merge (13 -> 10 leaves, 8 th.) yielded SPEEDUP of  1.215x: 2.289e-05s -> 1.884e-05s, so taking this instance.
> Merge (10 -> 7 leaves) took w.c.t. of 8.106e-06s, ~1.192e-06s of computing time (of which 0s sorting, 1.192e-06s analysis)
> 3 iterations (8 th.) took 0.0002451s; avg 8.17e-05s ( +/-  82.78/160.89 %); best 1.407e-05s; worst 0.0002131s; std dev. 9.296e-05 (taking best).
> Reference operation time is 1.40667e-05 s (10.24 Mflops) with 8 threads.
> After merge step 3: tpop: 1.407e-05 s   ~Mflops: 10.237   nsubm:7 otn:8
> Applying merge (10 -> 7 leaves, 8 th.) yielded SPEEDUP of  1.339x: 1.884e-05s -> 1.407e-05s, so taking this instance.
> Merge (7 -> 4 leaves) took w.c.t. of 6.199e-06s, ~1.907e-06s of computing time (of which 0s sorting, 1.192e-06s analysis)
> 3 iterations (8 th.) took 2.503e-05s; avg 8.345e-06s ( +/-   5.71/  8.57 %); best 7.868e-06s; worst 9.06e-06s; std dev. 5.15e-07 (taking best).
> Reference operation time is 7.86781e-06 s (18.3 Mflops) with 8 threads.
> After merge step 4: tpop: 7.868e-06 s   ~Mflops: 18.302   nsubm:4 otn:8
> Applying merge (7 -> 4 leaves, 8 th.) yielded SPEEDUP of  1.788x: 1.407e-05s -> 7.868e-06s, so taking this instance.
> Merge (4 -> 1 leaves) took w.c.t. of 4.053e-06s, ~2.146e-06s of computing time (of which 1.192e-06s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 4.053e-06s; avg 1.351e-06s ( +/-  97.67/129.41 %); best 3.145e-08s; worst 3.099e-06s; std dev. 1.296e-06 (taking best).
> Reference operation time is 3.14474e-08 s (4579 Mflops) with 8 threads.
> After merge step 5: tpop: 3.145e-08 s   ~Mflops: 4579.073   nsubm:1 otn:8
> Applying merge (4 -> 1 leaves, 8 th.) yielded SPEEDUP of 250.190x: 7.868e-06s -> 3.145e-08s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 5 merge steps (of max 6) (16 -> 1 subms) took 0.002016s (of which 0.001473s partitioning, 0s I/O); computing times: 8.345e-06s in par. loops, 1.192e-06s sorting, 5.245e-06s analyzing)
> Total merge + benchmarking process took 0.002016s, equivalent to 64101.6/74.8 new/old ops (8.726e-05s for 6 clones -- as 2774.8/3.2 ops, or 462.5/0.5 ops per clone), SPEEDUP of 856.710x
> Applying multi-merge (16 -> 1 leaves, 5 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 856.710x (2.694e-05s -> 3.145e-08s), will amortize in       74.9 ops by saving 2.691e-05s per op.
> In 1 tuning rounds (tot. 0.0021s, 8.7e-05s for constructor, 6 clones) obtained a SPEEDUP of 85571.0% (856.7x) (from 5.345 to 4579 Mflops).
> After 0.002140s, global autotuning declared speedup of 856.71 x, when using threads count of 8 and a new matrix:
> (6 x 6)[0x55a2e9b4f070]{S} @ (0(0..6),0(0..6)) (36 nnz, 6 nnz/r) flags 0x2244086 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 1, symflags:''
> 
> Considering C clone.
> Building a matrix with 36 nnz, 6 x 6
> Duplicates check: 36 - 0 = 36
>  converted COO to RSB in 6.294e-05 s (100.00 %)
>  analyzed arrays in 2.193e-05 s (34.85 %)
>  cleaned-up arrays in 0.000e+00 s (0.00 %)
>  deduplicated arrays in 0.000e+00 s (0.00 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 1.907e-05 s (30.30 %)
>  memory allocations took 1.502e-05 s (23.86 %)
>  leafs setup took 1.907e-06 s (3.03 %)
>  halfword conversion took 5.007e-06 s (7.95 %)
> Built (6 x 6)[0x55a2e9b51270]{C} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x42046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 22, symflags:''
> Base matrix:
> (6 x 6)[0x55a2e9b51270]{C} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 22, symflags:''
> 
> Will use autotuning routine to sample matrix: 6 x 6, type C, 36 nnz, 6 nnz/r, 29 subms, 22 lsubms, 4.6667 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> 3 iterations (8 th.) took 0.00015s; avg 4.999e-05s ( +/-  12.24/ 18.28 %); best 4.387e-05s; worst 5.913e-05s; std dev. 6.585e-06 (taking best).
> Reference operation time is 4.3869e-05 s (13.13 Mflops) with 8 threads.
> After 0.000163s, autotuning routine did not find a better threads count configuration.
> 
> Will autotune matrix: 6 x 6, type C, 36 nnz, 6 nnz/r, 29 subms, 22 lsubms, 4.6667 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> 3 iterations (8 th.) took 0.0001101s; avg 3.672e-05s ( +/-   7.14/  9.09 %); best 3.409e-05s; worst 4.005e-05s; std dev. 2.485e-06 (taking best).
> Reference operation time is 3.40939e-05 s (16.89 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=2, order=cols) (max 6 steps, inclusive 3 grace steps) on: 6 x 6, type C, 36 nnz, 6 nnz/r, 29 subms, 22 lsubms, 4.6667 bpnz (tpop: 3.409e-05  Mflops: 16.895)
> Merge (22 -> 16 leaves) took w.c.t. of 1.001e-05s, ~4.768e-06s of computing time (of which 0s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 0.0005829s; avg 0.0001943s ( +/-  86.13/168.10 %); best 2.694e-05s; worst 0.0005209s; std dev. 0.000231 (taking best).
> Reference operation time is 2.69413e-05 s (21.38 Mflops) with 8 threads.
> After merge step 1: tpop: 2.694e-05 s   ~Mflops: 21.380   nsubm:16 otn:8
> Applying merge (22 -> 16 leaves, 8 th.) yielded SPEEDUP of  1.265x: 3.409e-05s -> 2.694e-05s, so taking this instance.
> Merge (16 -> 10 leaves) took w.c.t. of 1.502e-05s, ~6.914e-06s of computing time (of which 9.537e-07s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 0.0003359s; avg 0.000112s ( +/-  79.56/158.91 %); best 2.289e-05s; worst 0.0002899s; std dev. 0.0001258 (taking best).
> Reference operation time is 2.28882e-05 s (25.17 Mflops) with 8 threads.
> After merge step 2: tpop: 2.289e-05 s   ~Mflops: 25.166   nsubm:10 otn:8
> Applying merge (16 -> 10 leaves, 8 th.) yielded SPEEDUP of  1.177x: 2.694e-05s -> 2.289e-05s, so taking this instance.
> Merge (10 -> 7 leaves) took w.c.t. of 7.868e-06s, ~1.907e-06s of computing time (of which 9.537e-07s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 4.101e-05s; avg 1.367e-05s ( +/-   5.81/  9.88 %); best 1.287e-05s; worst 1.502e-05s; std dev. 9.603e-07 (taking best).
> Reference operation time is 1.28746e-05 s (44.74 Mflops) with 8 threads.
> After merge step 3: tpop: 1.287e-05 s   ~Mflops: 44.739   nsubm:7 otn:8
> Applying merge (10 -> 7 leaves, 8 th.) yielded SPEEDUP of  1.778x: 2.289e-05s -> 1.287e-05s, so taking this instance.
> Merge (7 -> 4 leaves) took w.c.t. of 4.053e-06s, ~2.146e-06s of computing time (of which 1.192e-06s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 2.313e-05s; avg 7.709e-06s ( +/-   7.22/  5.15 %); best 7.153e-06s; worst 8.106e-06s; std dev. 4.052e-07 (taking best).
> Reference operation time is 7.15256e-06 s (80.53 Mflops) with 8 threads.
> After merge step 4: tpop: 7.153e-06 s   ~Mflops: 80.531   nsubm:4 otn:8
> Applying merge (7 -> 4 leaves, 8 th.) yielded SPEEDUP of  1.800x: 1.287e-05s -> 7.153e-06s, so taking this instance.
> Merge (4 -> 1 leaves) took w.c.t. of 3.815e-06s, ~1.907e-06s of computing time (of which 9.537e-07s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 2.861e-06s; avg 9.537e-07s ( +/-   0.00/  0.00 %); best 9.537e-07s; worst 9.537e-07s; std dev. 0 (taking best).
> Reference operation time is 9.53674e-07 s (604 Mflops) with 8 threads.
> After merge step 5: tpop: 9.537e-07 s   ~Mflops: 603.980   nsubm:1 otn:8
> Applying merge (4 -> 1 leaves, 8 th.) yielded SPEEDUP of  7.500x: 7.153e-06s -> 9.537e-07s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 5 merge steps (of max 6) (22 -> 1 subms) took 0.002495s (of which 5.198e-05s partitioning, 0s I/O); computing times: 1.764e-05s in par. loops, 4.053e-06s sorting, 7.629e-06s analyzing)
> Total merge + benchmarking process took 0.002495s, equivalent to 2616.0/73.2 new/old ops (5.674e-05s for 6 clones -- as 59.5/1.7 ops, or 9.9/0.3 ops per clone), SPEEDUP of 35.750x
> Applying multi-merge (22 -> 1 leaves, 5 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 35.750x (3.409e-05s -> 9.537e-07s), will amortize in       75.3 ops by saving 3.314e-05s per op.
> In 1 tuning rounds (tot. 0.0026s, 5.7e-05s for constructor, 6 clones) obtained a SPEEDUP of 3475.0% (35.75x) (from 16.89 to 604 Mflops).
> After 0.002645s, global autotuning declared speedup of 35.75 x, when using threads count of 8 and a new matrix:
> (6 x 6)[0x55a2e9b4f6c0]{C} @ (0(0..6),0(0..6)) (36 nnz, 6 nnz/r) flags 0x2244086 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 1, symflags:''
> 
> Considering Z clone.
> Building a matrix with 36 nnz, 6 x 6
> Duplicates check: 36 - 0 = 36
>  converted COO to RSB in 6.008e-05 s (100.00 %)
>  analyzed arrays in 2.503e-05 s (41.67 %)
>  cleaned-up arrays in 0.000e+00 s (0.00 %)
>  deduplicated arrays in 0.000e+00 s (0.00 %)
>  sorted arrays in 0.000e+00 s (0.00 %)
>  shuffled partitions in 2.384e-05 s (39.68 %)
>  memory allocations took 4.053e-06 s (6.75 %)
>  leafs setup took 2.146e-06 s (3.57 %)
>  halfword conversion took 5.007e-06 s (8.33 %)
> Built (6 x 6)[0x55a2e9b51270]{Z} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x42046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 28, symflags:''
> Base matrix:
> (6 x 6)[0x55a2e9b51270]{Z} @ (0(0..0),0(0..0)) (36 nnz, 6 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 28, symflags:''
> 
> Will use autotuning routine to sample matrix: 6 x 6, type Z, 36 nnz, 6 nnz/r, 37 subms, 28 lsubms, 4.4444 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> 3 iterations (8 th.) took 0.0004289s; avg 0.000143s ( +/-  64.31/ 68.59 %); best 5.102e-05s; worst 0.000241s; std dev. 7.77e-05 (taking best).
> Reference operation time is 5.10216e-05 s (11.29 Mflops) with 8 threads.
> After 0.002052s, autotuning routine did not find a better threads count configuration.
> 
> Will autotune matrix: 6 x 6, type Z, 36 nnz, 6 nnz/r, 37 subms, 28 lsubms, 4.4444 bpnz.
> Parameters: verbosity:1 mintimes:3 maxtimes:10 mindt:0 maxdt:0.1
> 3 iterations (8 th.) took 0.00033s; avg 0.00011s ( +/-  56.43/110.91 %); best 4.792e-05s; worst 0.000232s; std dev. 8.626e-05 (taking best).
> Reference operation time is 4.79221e-05 s (12.02 Mflops) with 8 threads.
> Starting merge (same threads) based auto-tuning procedure (transA=N, nrhs=2, order=cols) (max 6 steps, inclusive 3 grace steps) on: 6 x 6, type Z, 36 nnz, 6 nnz/r, 37 subms, 28 lsubms, 4.4444 bpnz (tpop: 4.792e-05  Mflops: 12.019)
> Merge (28 -> 22 leaves) took w.c.t. of 1.192e-05s, ~5.007e-06s of computing time (of which 1.192e-06s sorting, 3.099e-06s analysis)
> 3 iterations (8 th.) took 0.0004499s; avg 0.00015s ( +/-  74.72/133.39 %); best 3.791e-05s; worst 0.00035s; std dev. 0.0001418 (taking best).
> Reference operation time is 3.79086e-05 s (15.19 Mflops) with 8 threads.
> After merge step 1: tpop: 3.791e-05 s   ~Mflops: 15.194   nsubm:22 otn:8
> Applying merge (28 -> 22 leaves, 8 th.) yielded SPEEDUP of  1.264x: 4.792e-05s -> 3.791e-05s, so taking this instance.
> Merge (22 -> 16 leaves) took w.c.t. of 1.407e-05s, ~4.053e-06s of computing time (of which 9.537e-07s sorting, 3.099e-06s analysis)
> 3 iterations (8 th.) took 0.0003471s; avg 0.0001157s ( +/-  74.04/140.25 %); best 3.004e-05s; worst 0.000278s; std dev. 0.0001148 (taking best).
> Reference operation time is 3.00407e-05 s (19.17 Mflops) with 8 threads.
> After merge step 2: tpop: 3.004e-05 s   ~Mflops: 19.174   nsubm:16 otn:8
> Applying merge (22 -> 16 leaves, 8 th.) yielded SPEEDUP of  1.262x: 3.791e-05s -> 3.004e-05s, so taking this instance.
> Merge (16 -> 10 leaves) took w.c.t. of 1.192e-05s, ~6.199e-06s of computing time (of which 9.537e-07s sorting, 1.907e-06s analysis)
> 3 iterations (8 th.) took 0.0003679s; avg 0.0001226s ( +/-  81.34/106.29 %); best 2.289e-05s; worst 0.000253s; std dev. 9.639e-05 (taking best).
> Reference operation time is 2.28882e-05 s (25.17 Mflops) with 8 threads.
> After merge step 3: tpop: 2.289e-05 s   ~Mflops: 25.166   nsubm:10 otn:8
> Applying merge (16 -> 10 leaves, 8 th.) yielded SPEEDUP of  1.312x: 3.004e-05s -> 2.289e-05s, so taking this instance.
> Merge (10 -> 7 leaves) took w.c.t. of 6.914e-06s, ~1.907e-06s of computing time (of which 9.537e-07s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 0.0001049s; avg 3.497e-05s ( +/-  63.18/120.23 %); best 1.287e-05s; worst 7.701e-05s; std dev. 2.974e-05 (taking best).
> Reference operation time is 1.28746e-05 s (44.74 Mflops) with 8 threads.
> After merge step 4: tpop: 1.287e-05 s   ~Mflops: 44.739   nsubm:7 otn:8
> Applying merge (10 -> 7 leaves, 8 th.) yielded SPEEDUP of  1.778x: 2.289e-05s -> 1.287e-05s, so taking this instance.
> Merge (7 -> 4 leaves) took w.c.t. of 5.96e-06s, ~1.907e-06s of computing time (of which 0s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 2.599e-05s; avg 8.663e-06s ( +/-  20.18/ 15.60 %); best 6.914e-06s; worst 1.001e-05s; std dev. 1.296e-06 (taking best).
> Reference operation time is 6.91414e-06 s (83.31 Mflops) with 8 threads.
> After merge step 5: tpop: 6.914e-06 s   ~Mflops: 83.308   nsubm:4 otn:8
> Applying merge (7 -> 4 leaves, 8 th.) yielded SPEEDUP of  1.862x: 1.287e-05s -> 6.914e-06s, so taking this instance.
> Merge (4 -> 1 leaves) took w.c.t. of 4.053e-06s, ~2.146e-06s of computing time (of which 9.537e-07s sorting, 9.537e-07s analysis)
> 3 iterations (8 th.) took 3.099e-06s; avg 1.033e-06s ( +/-   7.69/ 15.38 %); best 9.537e-07s; worst 1.192e-06s; std dev. 1.124e-07 (taking best).
> Reference operation time is 9.53674e-07 s (604 Mflops) with 8 threads.
> After merge step 6: tpop: 9.537e-07 s   ~Mflops: 603.980   nsubm:1 otn:8
> Applying merge (4 -> 1 leaves, 8 th.) yielded SPEEDUP of  7.250x: 6.914e-06s -> 9.537e-07s, so taking this instance.
> Merged all the matrix leaves: no reason to continue merging.
> A total of 6 merge steps (of max 6) (28 -> 1 subms) took 0.001497s (of which 7.415e-05s partitioning, 0s I/O); computing times: 2.122e-05s in par. loops, 5.007e-06s sorting, 1.097e-05s analyzing)
> Total merge + benchmarking process took 0.001497s, equivalent to 1569.8/31.2 new/old ops (8.44e-05s for 7 clones -- as 88.5/1.8 ops, or 12.6/0.3 ops per clone), SPEEDUP of 50.250x
> Applying multi-merge (28 -> 1 leaves, 6 steps, 0 -> 8 th.sp.) yielded SPEEDUP of 50.250x (4.792e-05s -> 9.537e-07s), will amortize in       31.9 ops by saving 4.697e-05s per op.
> In 1 tuning rounds (tot. 0.0036s, 8.4e-05s for constructor, 7 clones) obtained a SPEEDUP of 4925.0% (50.25x) (from 12.02 to 604 Mflops).
> After 0.003647s, global autotuning declared speedup of 50.25 x, when using threads count of 8 and a new matrix:
> (6 x 6)[0x55a2e9b48140]{Z} @ (0(0..6),0(0..6)) (36 nnz, 6 nnz/r) flags 0x2244086 (coo:0, csr:1, hw:1, ic:1, fi:0), storage: 1, subm: 1, symflags:''
> 
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/examples'
> gmake[3]: Leaving directory '/<<PKGBUILDDIR>>'
> ./rsbench -Q 30.0Q
> ERROR 0xffffffff : An unspecified error occurred.
> 	 vectors  :
> 	 vectors  :
> 	 vectors  :
> 	 vectors  :
> 	 vectors  :
> 	 vectors  :
> 	 vectors  :
> 	 vectors  :
> INTERNALS TEST: BEGIN (IGNORE THE ERROR PRINTOUT HERE BELOW, IT'S PART OF THE TEST)
> INTERNALS TEST: END
> MATRIX SUMS TEST: BEGIN
> MATRIX SUMS TEST: END
> MATRIX ASSEMBLY FLAGS TEST: BEGIN
> MATRIX ASSEMBLY FLAGS TEST: END
> REGRESSION TEST: BEGIN
> REGRESSION TEST: END
> SORT CHECK: BEGIN
> SORT CHECK: END
> MTX PRINT TEST BEGIN
> %%MatrixMarket matrix coordinate real general
> 2 2 2
> 1	1	2
> 2	2	1
> %%MatrixMarket matrix coordinate real general
> 2 2 2
> 1	1	2
> 2	2	1
> %%MatrixMarket matrix coordinate complex general
> 2 2 2
> 1	1	2 0
> 2	2	1 0
> %%MatrixMarket matrix coordinate complex general
> 2 2 2
> 1	1	2 0
> 2	2	1 0
> MTX PRINT TEST END
> DIFF PRINT TEST BEGIN
> 	 vectors diff :
> 4 : 4 0
> 5 : -4 0
> ...(for a total of 2 differing entries)...
> 	 vectors diff :
> 4 : 4 0
> 5 : -4 0
> ...(for a total of 2 differing entries)...
> 	 vectors diff :
> 4 : 4 0 0 0
> ...(for a total of 1 differing entries)...
> 	 vectors diff :
> 4 : 4 0 0 0
> ...(for a total of 1 differing entries)...
> 0 0
> 0 0
> 3 3
> -3 -3
> 4 0
> -4 0
> 0
> 0
> 3
> -3
> 4
> -4
> 0 0
> 0 3
> 3 0
> 0
> 3
> 4
> 0 0
> 0 0
> 3 3
> -3 -3
> 4 0
> -4 0
> 0
> 0
> 3
> -3
> 4
> -4
> 0 0
> 0 3
> 3 0
> 0
> 3
> 4
> 0 0 0 0
> 0 0 0 0
> 3 0 3 0
> -3 0 -3 0
> 4 0 0 0
> -4 0 0 0
> 0 0
> 0 0
> 3 0
> -3 0
> 4 0
> -4 0
> 0 0 0 0
> 0 0 3 0
> 3 0 0 0
> 0 0
> 3 0
> 4 0
> 0 0 0 0
> 0 0 0 0
> 3 0 3 0
> -3 0 -3 0
> 4 0 0 0
> -4 0 0 0
> 0 0
> 0 0
> 3 0
> -3 0
> 4 0
> -4 0
> 0 0 0 0
> 0 0 3 0
> 3 0 0 0
> 0 0
> 3 0
> 4 0
> DIFF PRINT TEST END
> Beginning large binary search test.
> Detected 33303298048 bytes of memory, comprehensive of 19381891072 of free memory.
> On this system, maximal array of coordinates can have 2147483137 elements and occupy 8589932548 bytes.
> Will perform the test using less memory (17592186041895 MB) than on the maximal coordinate indices array (18446744071066100736) allows.
> Skipping test: too little memory.
> Skipping large binary search test.
> BASIC SPARSE BLAS TEST: BEGIN
> INIT INTERFACE TEST: BEGIN
> got RSB_IO_WANT_EXTRA_VERBOSE_INTERFACE: -1
> got RSB_IO_WANT_IS_INITIALIZED_MARKER: 1
> INIT INTERFACE TEST: END (SUCCESS)
> DEVEL PRINT TEST: BEGIN
> (4 x 4)[0x55ee9b1578e0]{S} @ (0(0..0),0(0..0)) (4 nnz, 1 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 2, symflags:''
> RSB_FLAG_USE_HALFWORD_INDICES |
> RSB_FLAG_SORTED_INPUT |
> RSB_FLAG_WANT_COO_STORAGE |
> RSB_FLAG_QUAD_PARTITIONING |
> RSB_FLAG_WANT_BCSS_STORAGE |
> RSB_FLAG_ASSEMBLED_IN_COO_ARRAYS |
> RSB_FLAG_OWN_PARTITIONING_ARRAYS |
> RSB_FLAG_SORT_INPUT
> (2 x 2)[0x55ee9b1579f0]{S} @ (0(0..2),0(0..2)) (2 nnz, 1 nnz/r) flags 0x2144386 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 0, symflags:''
> (2 x 2)[0x55ee9b157b00]{S} @ (2(2..4),2(2..4)) (2 nnz, 1 nnz/r) flags 0x2144386 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 0, symflags:''
> #R 4 x 4, 4 nnz (16 bytes), 16 index space for bytes, 544 bytes for 2 structs (2 of which are on the diagonal) (1e+02% of nnz are on the diagonal) 
> #N at 0 0, 4 x 4, 4 nnz ( 25%)
> #T at 0 0, 2 x 2, 2 nnz ( 50%)
> #T at 2 2, 2 x 2, 2 nnz ( 50%)
> ( 0x2046186 = { rec:1 coo:1 css:1 hw:1 ic:1 fi:0 symflags: } )
> DEVEL PRINT TEST: END
> PRINT TEST: BEGIN [QUIET]
> (2 x 2)[0x55ee9b1579f0]{S} @ (0(0..2),0(0..2)) (2 nnz, 1 nnz/r) flags 0x2144386 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 0, symflags:''
> (2 x 2)[0x55ee9b157b00]{S} @ (2(2..4),2(2..4)) (2 nnz, 1 nnz/r) flags 0x2144386 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 0, symflags:''
> (4 x 4)[0x55ee9b1578e0]{S} @ (0(0..0),0(0..0)) (4 nnz, 1 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 2, symflags:''
> RSB_FLAG_USE_HALFWORD_INDICES |
> RSB_FLAG_SORTED_INPUT |
> RSB_FLAG_WANT_COO_STORAGE |
> RSB_FLAG_QUAD_PARTITIONING |
> RSB_FLAG_WANT_BCSS_STORAGE |
> RSB_FLAG_ASSEMBLED_IN_COO_ARRAYS |
> RSB_FLAG_OWN_PARTITIONING_ARRAYS |
> RSB_FLAG_SORT_INPUT
> 0000000000000000
> PRINT TEST: END (SUCCESS)
> BASIC SPARSE BLAS TEST: END (SUCCESS)
> STRESS SPARSE BLAS TEST: BEGIN
> STRESS SPARSE BLAS TEST: END (SUCCESS)
> SPARSE BLAS TESTS: END (SUCCESS)
> BASIC PRIMITIVES TEST: BEGIN
> BASIC PRIMITIVES TEST: END (SUCCESS)
> ADVANCED SPARSE BLAS TEST: BEGIN [limit 30.000000s] [QUIET]
> Terminating testing earlier due to user timeout request: test took 30.033412 s, max allowed was 30.000000.
> 	PASSED:43059
> 	FAILED:0
> ADVANCED SPARSE BLAS TEST: END (SUCCESS)
> gmake qtests -C librsbpp
> gmake[3]: Entering directory '/<<PKGBUILDDIR>>/librsbpp'
> gmake  all-am
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/librsbpp'
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/librsbpp'
> ./rsbtt
> if ! test -f G.mtx ; then cp -p /<<PKGBUILDDIR>>/librsbpp/G.mtx . ; fi ; /bin/bash /<<PKGBUILDDIR>>/librsbpp/test.sh
> ++ ./rsbpp Td,s G.mtx
> ++ grep Z-sort
> ++ wc -l
> + test 54 = 54
> ++ ./rsbpp Td G.mtx
> ++ grep Z-sort
> ++ wc -l
> + test 27 = 27
> ++ ./rsbpp Td,z G.mtx
> ++ grep Z-sort
> ++ wc -l
> + test 54 = 54
> ++ ./rsbpp vTd,z G.mtx
> ++ grep Z-sort
> ++ wc -l
> + test 54 = 54
> ++ ./rsbpp vTd,z G.mtx
> ++ grep Z-sort
> ++ wc -l
> + test 54 = 54
> ++ ./rsbpp vvvTd,z G.mtx
> ++ grep Zorted
> ++ wc -l
> + test 8 = 8
> ++ ./rsbpp vvTd,z G.mtx
> ++ grep Z-sort
> ++ wc -l
> + test 54 = 54
> ++ ./rsbpp vvTd,z G.mtx
> ++ grep Range
> ++ wc -l
> + test 0 = 0
> ++ ./rsbpp vvvTd,z G.mtx
> ++ grep Range
> ++ wc -l
> + test 258 -gt 0
> ++ ./rsbpp vvvTd,z S.mtx
> ++ grep Range
> ++ wc -l
> + test 0 -eq 0
> ++ ./rsbpp vvvTd,z G.mtx
> ++ grep Range
> ++ wc -l
> + test 258 = 258
> ++ OMP_NUM_THREADS=1
> ++ ./rsbpp m10M10I1r1,4,8sFv
> ++ grep spmm-
> ++ wc -l
> + test 9 = 9
> ++ OMP_NUM_THREADS=1
> ++ grep spmm-
> ++ ./rsbpp C1000m100M100I1r1,4,8sFv
> ++ wc -l
> + test 9 = 9
> ++ OMP_NUM_THREADS=1
> ++ ./rsbpp C1000m100M100I1r1sFvtN,T
> ++ grep spmm-
> ++ wc -l
> + test 3 = 3
> ++ OMP_NUM_THREADS=1
> ++ ./rsbpp C1000m100M100I1r1vtN,TsF
> ++ grep spmm-
> ++ wc -l
> + test 2 = 2
> ++ OMP_NUM_THREADS=1
> ++ ./rsbpp C1000m100M100I1r0vtN,TsF
> ++ grep spmm-
> ++ wc -l
> + test 0 = 0
> ++ OMP_NUM_THREADS=1
> ++ RSB_NUM_THREADS=1
> ++ ./rsbpp vvvC1000m100M100I1r1vtN,TorsF
> ++ grep Recursing
> ++ wc -l
> + test 4 = 4
> ++ OMP_NUM_THREADS=2
> ++ RSB_NUM_THREADS=2
> ++ ./rsbpp vvvC1000m100M100I1r1vtN,TorsF
> ++ grep Recursing
> ++ wc -l
> + test 4 = 4
> ++ OMP_NUM_THREADS=1
> ++ RSB_NUM_THREADS=1
> ++ ./rsbpp vvvC1000m100M100I1r1vtN,ToRsF
> ++ grep Recursing
> ++ wc -l
> + test 208 = 208
> ++ OMP_NUM_THREADS=2
> ++ RSB_NUM_THREADS=2
> ++ ./rsbpp vvvC1000m100M100I1r1vtN,ToRsF
> ++ grep Recursing
> ++ wc -l
> + test 410 = 410
> echo "Skipping tests based on Google Test (not detected at configure time)"
> Skipping tests based on Google Test (not detected at configure time)
> gmake[3]: Leaving directory '/<<PKGBUILDDIR>>/librsbpp'
> gmake qtests -C rsblib
> gmake[3]: Entering directory '/<<PKGBUILDDIR>>/rsblib'
> gmake -C examples tests
> gmake[4]: Entering directory '/<<PKGBUILDDIR>>/rsblib/examples'
> if test ! -f ../A.mtx ; then cp /<<PKGBUILDDIR>>/rsblib/A.mtx ..; fi
> ./mtx2bin ../A.mtx A.mtx.bin '?' ; test $? != 0
> usage: /<<PKGBUILDDIR>>/rsblib/examples/.libs/mtx2bin matrix-input-file [matrix-output-file [type]]
> with [type] among S D C Z  ; default D
> ./mtx2bin ../non-existent.mtx A.mtx.bin 'S' ; test $? != 0
> usage: /<<PKGBUILDDIR>>/rsblib/examples/.libs/mtx2bin matrix-input-file [matrix-output-file [type]]
> with [type] among S D C Z  ; default D
> gmake[4]: Leaving directory '/<<PKGBUILDDIR>>/rsblib/examples'
> RSBP_QUIET=1 ./rsb /<<PKGBUILDDIR>>/rsblib/T.mtx
> RSB constructed
> A:nr:3 nc:3 nnz:6 normOne:6.3 normInf:9.6
> %%MatrixMarket matrix coordinate real general
> 3 3 6
> 1	1	1.1000000000000001
> 2	1	2.1000000000000001
> 2	2	2.2000000000000002
> 3	1	3.1000000000000001
> 3	2	3.2000000000000002
> 3	3	3.2999999999999998
> T:nr:3 nc:3 nnz:6 normOne:6.3 normInf:9.6
> %%MatrixMarket matrix coordinate real general
> 3 3 6
> 1	1	1.1000000000000001
> 2	1	2.1000000000000001
> 2	2	2.2000000000000002
> 3	1	3.1000000000000001
> 3	2	3.2000000000000002
> 3	3	3.2999999999999998
> SPMM:
> B:
> 1.1 1.1 
> 1.1 1.1 
> 1.1 1.1 
> C:
> 0 0 
> 0 0 
> 0 0 
> before tuning for SPMV:
> (3 x 3)[0x55595fb7adc0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x2046186 (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:''
> **
> x:
> 1.1 
> 1.1 
> 1.1 
> y:
> 99 
> 99 
> 99 
> SPMV:
> y <- Rsb_A    * x:
> 1.21 
> 4.73 
> 10.56 
> SPSV:
> y <- Rsb_A    \ y:
> 1.1 
> 1.1 
> 1.1 
> **
> x:
> 1.1 1.1 
> 1.1 1.1 
> 1.1 1.1 
> y:
> 99 99 
> 99 99 
> 99 99 
> SPMM:
> y <- Rsb_A    * x:
> 1.21 1.21 
> 4.73 4.73 
> 10.56 10.56 
> SPSM:
> y <- Rsb_A    \ y:
> 1.1 1.1 
> 1.1 1.1 
> 1.1 1.1 
> SumIntoMyValues:
> A:nr:3 nc:3 nnz:6 normOne:6.3 normInf:9.6
> %%MatrixMarket matrix coordinate real general
> 3 3 6
> 1	1	1.1000000000000001
> 2	1	2.1000000000000001
> 2	2	2.2000000000000002
> 3	1	3.1000000000000001
> 3	2	3.2000000000000002
> 3	3	3.2999999999999998
> Matrix after SumIntoMyValues:
> 31 
> 32 
> 33 
> 
> A:nr:3 nc:3 nnz:6 normOne:37.4 normInf:105.6
> %%MatrixMarket matrix coordinate real general
> 3 3 6
> 1	1	1.1000000000000001
> 2	1	2.1000000000000001
> 2	2	2.2000000000000002
> 3	1	34.100000000000001
> 3	2	35.200000000000003
> 3	3	36.299999999999997
> ReplaceMyValues:
> Adjusted Values:
> 3100 
> 3200 
> 3300 
> A:nr:3 nc:3 nnz:6 normOne:3300 normInf:9600
> %%MatrixMarket matrix coordinate real general
> 3 3 6
> 1	1	1.1000000000000001
> 2	1	2.1000000000000001
> 2	2	2.2000000000000002
> 3	1	3100
> 3	2	3200
> 3	3	3300
> Diagonal Values Before:
> 1.1 
> 2.2 
> 3300 
> Diagonal Values After :
> -1.1 
> -2.2 
> -3300 
>  terminating run with RSBEP_NO_STUB=0, exit code=0
> BEGIN
> Rsb_Matrix_test_multimatrix_ms_mnrhs
> BEGIN
> (3 x 3)[0x55595fb926a0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x204619e (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LT'
> Tuned with speedup factor of 1.19048:
> (3 x 3)[0x55595fbb0bc0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x204619e (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LT'
> (3 x 3)[0x55595fb926a0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x204619e (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LT'
> Tuned with speedup factor of 1.27273:
> (3 x 3)[0x55595fbb48f0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x204619e (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LT'
> (3 x 3)[0x55595fb926a0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x204619e (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LT'
> Tuned with speedup factor of 1.47059:
> (3 x 3)[0x55595fbb0bc0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x204619e (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LT'
> (3 x 3)[0x55595fb926a0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x204619e (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LT'
> Tuned with speedup factor of 1.08696:
> (3 x 3)[0x55595fbb48f0]{D} @ (0(0..0),0(0..0)) (6 nnz, 2 nnz/r) flags 0x204619e (coo:1, csr:1, hw:1, ic:1, fi:0), storage: 40, subm: 3, symflags:'LT'
> END
> OK: terminating with no allocations registered in librsb
>  [*] tests terminated successfully !
> gmake[3]: Leaving directory '/<<PKGBUILDDIR>>/rsblib'
> gmake qtests -C rsbtest
> gmake[3]: Entering directory '/<<PKGBUILDDIR>>/rsbtest'
> if test ! -f A.mtx ; then cp /<<PKGBUILDDIR>>/rsbtest/A.mtx . ; fi
> ./rsbtest --version | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q -i using # diagnostic
> ./rsbtest --version | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q librsb
> ./rsbtest --help | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q help
> ./rsbtest  -V       | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q librsb
> ./rsbtest --types all --types abcd --types '?' --verbose | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q configured.to.support.types
> ./rsbtest --quiet --types all --only-test-case-n  4        | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q '\<1.*success'
> ./rsbtest --no-tune --max_t 0.01 --serial | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q Building
> ./rsbtest --no-tune --max_t 0.01 --max 1 --nrhs 1 --beta 1 --incy 1 --incx 1 --no-trans --alpha 1 --type d --rand  --serial . | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q adding
> ! ./rsbtest --mkl A.mkl
> running on ip-10-84-234-251
> Built without the MKL.
> ( ! ./rsbtest --unrecognized-option-triggers-abort )
> running on ip-10-84-234-251
> /<<PKGBUILDDIR>>/rsbtest/.libs/rsbtest: unrecognized option '--unrecognized-option-triggers-abort'
> unrecognized option, aborting.
> ( ./rsbtest --no-tune --max_t 0.01 --skip-loading-hermitian-matrices --skip-loading-unsymmetric-matrices --tune-maxt 10 --tune-maxr 10 --verbose-tuning --extra-verbose-interface --min_t 0.01 --max_t 0.01 --mintimes 1 --maxtimes 1 --verbose  --skip-loading-symmetric-matrices A.mtx | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q skip )
> ./rsbtest --no-tune --max_t 0.01 A_non_existent.mtx | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q problems.opening
> ( ! ./rsbtest --render-only --skip-loading-hermitian-matrices --skip-loading-unsymmetric-matrices --tune-maxt 10 --tune-maxr 10 --verbose-tuning --extra-verbose-interface --min_t 0.01 --max_t 0.01 --mintimes 1 --maxtimes 1 --verbose  A.mtx | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q Rendering )
> ( ! ./rsbtest --no-tune --max_t 0.01 --quiet --types all --nthreads 1,2 --maxtimes 1 -+ A.mtx | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q 2.threads )
> ( ! ./rsbtest --no-tune --max_t 0.01 --quiet --render-only A.mtx > /dev/null )
> ! ./rsbtest --no-tune --max_t 0.01 --quiet --max 1 --nrhs 1 --beta 1 --incy 1 --incx 1 --render --no-trans --alpha 1 --type all A.mtx
> running on ip-10-84-234-251
> Will not invoke autotuning routine.
> Benchmark will sample for at most 0.01 s
> Built without render support!
> ( ./rsbtest --no-tune --max_t 0.01 --max 1 --nrhs 1 --beta 1 --incy 1 --incx 1 --no-trans --alpha 1 --type d --rand  --quiet --skip-loading-if-less-nnz-matrices 4 A.mtx | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q adding )
> ( ./rsbtest --no-tune --max_t 0.01 --max 1 --nrhs 1 --beta 1 --incy 1 --incx 1 --no-trans --alpha 1 --type d --rand  --quiet --skip-loading-if-less-nnz-matrices 7 A.mtx | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q no.matrix )
> ./rsbtest --no-tune --max_t 0.01 --quiet --skip-loading-not-unsymmetric-matrices A.mtx | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q no.matrix
> (  ./rsbtest  --max 1 --nrhs 1,2 --beta 1 --incy 1,2 --incx 1 --alpha 1 --type all              | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q flop )
> (! ./rsbtest  --max 1 --nrhs 1,2 --beta 1 --incy 1,2 --incx 1 --alpha 1 --type all --no-timings | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q flop )
> if which rsbench && rsbench -v | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q version ; then rsbench --generate-matrix -r 4 -c 4 -b 4 > test.mtx ; fi
> ./rsbtest --max 1 --nrhs 1 --beta 1 --incy 1 --incx 1 --no-rectangular --no-trans --alpha 1 --type d  --transA Q | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q no.valid
> ./rsbtest --quiet --skip-except-every-random-n-test-cases 1000000000 --max 1 | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q 1.compa
> ./rsbtest --quiet --skip-except-every-n-test-cases 1000000000 --max 1 | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q 1.compa
> ./rsbtest --quiet --no-tune --max_t 0.01 --skip-loading-if-more-nnz-matrices 5 A.mtx  | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q no.matri
> timeout 2 ./rsbtest --max-test-time 1 -q                   | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q '.*success'
> ./rsbtest --quiet --types d --only-test-case-n  4        | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q '\<1.*success'
> ./rsbtest --self-test > /dev/null
> ./rsbtest --quiet --types blas --only-test-case-n  4         | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q success
> ./rsbtest --quiet A.mtx --no-tune --max_t 0.01 --nrhs 1,2 --incx 1,2 --incy 1,2 --report /dev/null | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep -q Report
> cp A.mtx A_underscore.mtx # tests LaTeX escaping
> ./rsbtest --quiet A_underscore.mtx --report test.tex --no-basename-render --no-tune --max_t 0.01 --types d --nrhs 1 --incx 1 --incy 1 --no-trans --alpha 1 --beta 1 | dd if=/dev/stdin of=/dev/stdout bs=16M status=none iflag=fullblock  | grep LaTeX > /dev/null && if test latex != 'false' ; then latex -interaction=batchmode test.tex ; fi
> This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022/Debian) (preloaded format=latex)
>  restricted \write18 enabled.
> entering extended mode
> gmake[3]: *** [Makefile:920: qtests] Error 1
> gmake[3]: Leaving directory '/<<PKGBUILDDIR>>/rsbtest'
> make[2]: *** [Makefile:3221: qtests] Error 2


The full build log is available from:
http://qa-logs.debian.net/2022/07/28/librsb_1.3.0.1+dfsg-2_unstable.log

All bugs filed during this archive rebuild are listed at:
https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=ftbfs-20220728;users=lucas@debian.org
or:
https://udd.debian.org/bugs/?release=na&merged=ign&fnewerval=7&flastmodval=7&fusertag=only&fusertagtag=ftbfs-20220728&fusertaguser=lucas@debian.org&allbugs=1&cseverity=1&ctags=1&caffected=1#results

A list of current common problems and possible solutions is available at
http://wiki.debian.org/qa.debian.org/FTBFS . You're welcome to contribute!

If you reassign this bug to another package, please mark it as 'affects'-ing
this package. See https://www.debian.org/Bugs/server-control#affects

If you fail to reproduce this, please provide a build log and diff it with mine
so that we can identify if something relevant changed in the meantime.
