1. ghex_futures benchmark is efficient with the UCX communicator only with the request pool
   managed inside communicator_ucx.hpp. Currently this is a circular buffer with pointer
   at the first known `free` request. The buffer is static, and searched linearly. If there
   are no more free requests, it fails.

2. some thought is needed to handle multi-threading. communicator is per-thread, and that means
   that it has to be either declared threadprivate, or allocated dynamically, while the pointer
   is declared thread-private. I need the pointer anyway in the ucx communicator to access the
   communicator from inside the callbacks.

3. pmi interface needs a place. It can be implemented using MPI collectives, of course, but
   that will not scale well for large runs. A pmix implementation is also available. We could add
   other implemnetations - this is not hard.

4. pool allocator seems to be efficient only with absolutely zero 'extras' - see pool_allocator.hpp.
   It's a simple list of free pointers. Each thread has its own list, so no locks when (de)allocate.
   However, the threads can deallocate (put into it's own free list) other threads' buffers:
   in UCX any thread attached to a given worker can be called back. Hence the pool allocator must
   dynamically add / remove allocations, or redistribute them among threads, if there is a large load
   imbalance.

5. the callbacks now take a const message reference
   void recv_callback(int rank, int tag, const MsgType &mesg)
   However, in this form I have to discard the const qualifier when I submit the recv request inside the callback, because
   the recv function in the callback communicator takes MsgType without const.