MPI_THREAD_SINGLE

BASELINE: 1 thread

-M multi is important, because we want to use MT version of UCX. This adds overhead
time numactl --physcpubind=0-15 ucx_perftest -t tag_bw -M multi c3-4 -n 5000000 -s 1 -O 1

---------------------------------------------------------------------
niters			5000000
msg_size		1	1	1	1024	1024
inflight  		1	10	1000	1	10
---------------------------------------------------------------------

ucx_perftest		

------------------------------
OpenMPI + HPCX 2.4
------------------------------
mpi_avail (waitany)   	
mpi_avail (testany)   	0.71	0.80	12.1	1.51	1.44
mpi_avail_iter		0.69	0.72	0.90	1.51	1.54

------------------------------
Intel MPI  - even worse than the results below
------------------------------
mpi_avail (waitany)   	3.03	3.06	9.48	3.42	13.2
mpi_avail (testany)   	3.05	3.06	9.57	3.42	13.2
mpi_avail_iter 		3.03	3.05	9.46	3.41	13.2

------------------------------
ghex_futures.cpp
------------------------------
MPI (ompi)  		0.71	0.74	0.94	1.59	1.53
MPI (intel)		5.11	5.23	5.10	5.56	5.64
UCX			0.61	0.62	0.67	1.38	1.38

------------------------------
ghex_msg_cb.cpp
------------------------------
Sending a shared_message using callbacks.
n requests are submited and completed in turns.
Both backends can be used.

MPI (ompi)  		1.28	1.27	1.74	2.20	2.17
MPI (intel)		5.60	5.54	7.82	5.83	6.12
UCX			0.62	0.63	1.01	1.38	1.42
UCX (raw smsg)				0.80
UCX nbr/ghex progress	1.26	1.27	1.30	1.98	2.10

------------------------------
ghex_msg_cb_avail.cpp
------------------------------
Sending a shared_message using callbacks. There is n in-flight requests.
Messages are sent as slots become available, i.e., requests are completed.
Both backends can be used.

MPI (ompi)  		1.31	1.27	2.21	2.00	2.17
MPI (intel)		
UCX			0.62	0.62	1.03	1.25	1.35
UCX (raw smsg)				0.81
UCX nbr/ghex progress	1.20	1.18	1.35	2.10	2.07

------------------------------
ghex_msg_cb_resubmit.cpp
------------------------------
Sending a shared_message using callbacks. There is n in-flight requests.
Messages are sent as slots become available, i.e., requests are completed.
recv requests are resubmited inside the recv callback.
Both backends can be used.

MPI (ompi)		1.30	1.30	1.86	2.22	2.14
MPI (intel)		
UCX			0.62	0.62	1.07	1.52	1.35
UCX (raw smsg)				0.80
UCX nbr/ghex progress	1.25	1.18	1.39	1.99	2.08

------------------------------
ghex_msg_cb_dynamic.cpp
ghex_msg_cb_dynamic_resubmit.cpp
------------------------------
GHEX takes ownership of the message for the duration of the comm,
in user's code the message gets out of scope after posting.

UCX			1.35	1.29	1.67	2.42	2.37
UCX (raw smsg)		1.24	1.19	1.49	2.32	2.40
UCX (pool alloc circ)	1.09	1.05	3.25	1.82	1.84
UCX (pool, raw smsg)	0.97	0.90	3.12	1.69	1.70


------------------------------
ghex_ptr_cb.cpp
------------------------------
Sending buffer directly through pointer, using callbacks.
n requests are submited and completed in turns.
only UCX

UCX			0.62	0.62	0.77	1.47	1.46

------------------------------
ghex_ptr_cb_avail.cpp
------------------------------
Sending buffer directly through pointer, using callbacks.
There is n in-flight requests.
Messages are sent as slots become available, i.e., requests are completed.
only UCX

UCX			0.62	0.62	0.82	1.43	1.44
