setting up RPC + callback on each split completion
1. start rpc server on local instance on two different ports with 5GB
allocated each.
2. set up another callback on completion of a split. This seems cleaner
than trying to second-guess which tensor is the boundary of a split.
3. run it with 8B model @ 4bit, observe split_done captured at a reasonable place.
Next step - bring back linear speculation and start speculating on another remote
instances.