Implement GATED_DELTA_NET for the CANN (Ascend NPU) backend using a
batched approach that groups all attention heads into a single 3-D
BatchMatMul per recurrence step, reducing kernel launches from
O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens).
Key design decisions:
- Use aclnnBatchMatMul (rank-3 only) with shape [H, S_v, S_v] to batch
all H heads together for M×k, outer-product, and M×q steps
- Pre-allocate temporary buffers (g_exp, mk, delta, outer) reused
across all time steps to avoid per-step allocations
- Support both scalar gate (g shape [1,H]) and KDA per-dim gate
(g shape [S_v,H]) via appropriate broadcast shapes
- Fall back to naive per-head scalar loop for permuted/GQA/non-F32
inputs that don't meet batched path requirements
- Relax CANN precision tolerance to 1e-6 in tests to account for
different FP32 accumulation order in BatchMatMul vs scalar loops