We only used inner_pool in the prefill FFW function, and there we
can achieve sufficient parallelism on the rows of the matrix-vector
multiplications.
Benchmark results on a 1600-token summarization task:
```
Prefill speed
Num threads BEFORE AFTER
4 9.24 t/s 9.76 t/s
18 31.41 t/s 31.16 t/s
32 31.41 t/s 45.13 t/s
64 31.03 t/s 57.85 t/s
```
Move Path into io.h and use for opening files.
Removes dependency of gemma_lib on args.
Separate Windows codepath instead of emulating POSIX functions.
Plus lint fixes.
PiperOrigin-RevId: 626279004