This allows disabling the CUDA implementation of ggml_moe_sum at compile
time, so its performance can be compared with ggml_cuda_op_fused_add.
When GGML_DISABLE_MOE_SUM_CUDA is defined:
- moesum.cu becomes empty (no CUDA kernel)
- ggml_moe_sum falls back to the CPU implementation
- Also defining LLAMA_DISABLE_MOE_SUM=1 replaces ggml_moe_sum with a
  loop of ggml_add calls, which triggers ggml_cuda_op_fused_add
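
The two paths being compared can be sketched as below. This is a minimal CPU-only illustration, not the code in this change: moe_sum_fused, moe_sum_add_loop and moe_sum are hypothetical names, and the flat [n_expert][n_embd] layout is an assumption. Only the LLAMA_DISABLE_MOE_SUM macro name comes from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the single-op path (what ggml_moe_sum does
// conceptually): reduce all expert slices into one output in one pass.
// Assumed layout: src holds n_expert contiguous slices of n_embd floats.
static std::vector<float> moe_sum_fused(const std::vector<float> & src,
                                        size_t n_expert, size_t n_embd) {
    assert(src.size() == n_expert * n_embd);
    std::vector<float> dst(n_embd, 0.0f);
    for (size_t e = 0; e < n_expert; ++e) {
        for (size_t i = 0; i < n_embd; ++i) {
            dst[i] += src[e * n_embd + i];
        }
    }
    return dst;
}

// Hypothetical stand-in for the ggml_add loop: start from the first expert
// slice and add the remaining slices one element-wise add at a time.
static std::vector<float> moe_sum_add_loop(const std::vector<float> & src,
                                           size_t n_expert, size_t n_embd) {
    assert(src.size() == n_expert * n_embd);
    if (n_expert == 0) {
        return std::vector<float>(n_embd, 0.0f);
    }
    std::vector<float> acc(src.begin(), src.begin() + n_embd);
    for (size_t e = 1; e < n_expert; ++e) {
        for (size_t i = 0; i < n_embd; ++i) {
            acc[i] += src[e * n_embd + i];  // one "add" per expert
        }
    }
    return acc;
}

// Compile-time selection, mirroring how a -D flag would pick the path.
static std::vector<float> moe_sum(const std::vector<float> & src,
                                  size_t n_expert, size_t n_embd) {
#if defined(LLAMA_DISABLE_MOE_SUM)
    return moe_sum_add_loop(src, n_expert, n_embd);
#else
    return moe_sum_fused(src, n_expert, n_embd);
#endif
}
```

Both functions produce the same sums, so the flag should change only which kernels run, not the result.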
Usage for comparison:
- ggml_moe_sum (CUDA): default (both flags unset)
- ggml_cuda_op_fused_add: -DGGML_DISABLE_MOE_SUM_CUDA=1 -DLLAMA_DISABLE_MOE_SUM=1