- Use bfloat16 dtype for the UNet on Blackwell GPUs (compute capability
major >= 12), which have native bf16 tensor core support
- Skip manual_cast for bfloat16 weights to avoid unnecessary casting
- Fix numpy TypeError with bfloat16 tensors in patch.py and
ip_adapter.py by converting to float32 before .numpy() calls
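The dtype selection described in the first bullet could be sketched like this; `pick_unet_dtype` is a hypothetical helper name, and the fallback dtype is an assumption, not the project's actual logic:

```python
import torch

def pick_unet_dtype() -> torch.dtype:
    # Hypothetical helper: prefer bfloat16 when the GPU's compute
    # capability major version is >= 12 (Blackwell, e.g. sm_120);
    # the float16 fallback here is an assumption for illustration.
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        if major >= 12:
            return torch.bfloat16
    return torch.float16
```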
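The numpy fix works because NumPy has no bfloat16 dtype, so calling `.numpy()` on a bf16 tensor raises `TypeError`. A minimal sketch of the conversion (the helper name `to_numpy` is illustrative, not the code in patch.py / ip_adapter.py):

```python
import numpy as np
import torch

def to_numpy(t: torch.Tensor) -> np.ndarray:
    # NumPy cannot represent bfloat16, so .numpy() raises TypeError
    # on bf16 tensors; upcast to float32 before the conversion.
    if t.dtype == torch.bfloat16:
        t = t.float()
    return t.detach().cpu().numpy()
```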
Tested on RTX 5070 (sm_120, CUDA 12.8) with PyTorch nightly (cu128).
Generates images at ~3.2 it/s, including in Image Prompt (IP-Adapter) mode.
Fixes #3862, #4123, #4141