* CUDA: use mmvq for mul-mat-id for small batch sizes
  * add mmvq too
  * Fix perf issue on Ampere. Use mmvf mm-id only for non-NVIDIA GPUs
  * templatize multi_token_path
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16