Commit Graph

353 Commits

Author SHA1 Message Date
Jan Wassenberg 229bd078a1 1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage.
Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs)

PiperOrigin-RevId: 801789922
2025-09-01 06:34:04 -07:00
Jan Wassenberg 0ae8646731 Fix remainder handling for Paligemma
No longer attempt to skip the remainder handling because B might also be a non-padded view.

PiperOrigin-RevId: 800890805
2025-08-29 07:25:52 -07:00
Marie White 973e284ed6 Refactor Matmul to use a policy class for parallelization.
PiperOrigin-RevId: 800864489
2025-08-29 05:40:39 -07:00
Jan Wassenberg 6c39a2dea4 1.01x speedup: More bf16 activations to reduce DecompressA.
Also move observer call into function, format gemma_args.

PiperOrigin-RevId: 800827400
2025-08-29 03:19:01 -07:00
Jan Wassenberg 7288891439 Remove F64 partial storage in matmul.
Also remove no longer used kMaxN; row_ptrs only used for C

PiperOrigin-RevId: 800774757
2025-08-29 00:12:08 -07:00
Jan Wassenberg 98ddc166db Expand ThreadingContext comments
PiperOrigin-RevId: 800479954
2025-08-28 08:32:10 -07:00
Marie White 6128e758ff Change ffw_out from B16 to F32.
PiperOrigin-RevId: 800330411
2025-08-28 00:01:39 -07:00
Jan Wassenberg 5411fd846d Minor: batched NotifyGenerate, fix comment/dep
PiperOrigin-RevId: 799889802
2025-08-26 23:33:17 -07:00
Jan Wassenberg 86afd53076 1.04x speedup: Parallelize SoftCap
Also require opt-in constexpr flag for observer callbacks, update zones

PiperOrigin-RevId: 799655163
2025-08-26 11:55:20 -07:00
Jan Wassenberg ed2f0bd1b0 Fix pos assertions, refs #665
Ensure the streaming func pos matches the number of calls.
Add two arguments that control pos+1 and pos+=1 behavior.
Also cleanup/add comments.
run: use batch_stream_func, add assert, higher verbosity for MM autotune output
PiperOrigin-RevId: 799511163
2025-08-26 04:50:40 -07:00
Jan Wassenberg 9bf0fe4e37 Internal change
PiperOrigin-RevId: 799509375
2025-08-26 04:44:08 -07:00
Jan Wassenberg d3a5ddf657 Merge pull request #663 from junjihashimoto:feature/api-server
PiperOrigin-RevId: 797731089
2025-08-24 11:57:05 +02:00
Rhett Stucki 73f1140dca Fix an off-by-one error after StreamAndUpdateEOS() to remove the MSAN warning about reading an uninitialized variable in the kv_cache.
The logic for choosing whether or not to attend to the last token during prefill wasn't completely consistent with StreamAndUpdateEOS(), causing an off-by-one error that prevented the kv_cache from being fully populated.

PiperOrigin-RevId: 797614310
2025-08-20 22:59:58 -07:00
Junji Hashimoto 41321611fd feature: add API server and client with Google protocol 2025-08-21 11:32:48 +09:00
Phil Culliton 78573b6718 Internal change. Add deduction for 270M.
PiperOrigin-RevId: 795041810
2025-08-14 08:04:38 -07:00
Phil Culliton d044801c1d Internal change
PiperOrigin-RevId: 794620076
2025-08-13 09:47:45 -07:00
Jan Wassenberg 71406cf6d0 More profiler interface fixes: hwy:: plus avoid ADD_ZONE
PiperOrigin-RevId: 794493165
2025-08-13 03:15:48 -07:00
Jan Wassenberg faa4102992 (Resubmit) Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 794461159
2025-08-13 01:38:24 -07:00
The gemma.cpp Authors a2d9133f7d Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 793865287
2025-08-11 17:51:38 -07:00
Jan Wassenberg 4cbf63e6f0 Prepare profiler annotations for new API
Pass hwy::Profiler& to low-level functions.
Used ThreadingContext arg instead of NestedPools.
Use new PROFILER_ZONE3.

PiperOrigin-RevId: 793821255
2025-08-11 15:34:52 -07:00
Jan Wassenberg 4e062d68f7 Update BlobWriter comments, WriteAll->Finalize
PiperOrigin-RevId: 790792133
2025-08-04 10:01:38 -07:00
Jan Wassenberg 701841897b Default to disabling per-socket parallelization
weights: default to Read for small-batch (only look at qbatch, not the larger prefill tbatch)
PiperOrigin-RevId: 790787643
2025-08-04 09:49:14 -07:00
Jan Wassenberg 799c264df3 Pre-tune thread pool before matmul
Also improve profiler annotations - remove near-zero ones and add more for startup

PiperOrigin-RevId: 789352414
2025-07-31 08:45:26 -07:00
Charles Zhao 50ee1a3e92 Write SBS progressively.
(1) Directly write to file in BlobWriter::Add and destruct the MatOwner to release the rams.

(2) Write a fake header to indicate this is V2, and write correct header and directory at the end of the file.

(3) Tested on loading sbs written the old way, and new way, both worked.

PiperOrigin-RevId: 789306837
2025-07-31 06:05:38 -07:00
Jan Wassenberg 8715eda512 Improved layer idx parsing
PiperOrigin-RevId: 788868522
2025-07-30 05:49:45 -07:00
Jan Wassenberg d831ddce5b Fix file mapping: was letting the smart pointer go out of scope
Also save+print the IO mode used.

PiperOrigin-RevId: 788848165
2025-07-30 04:30:10 -07:00
Jan Wassenberg d22ba2ac96 Update layer index parsing and allow tokenizer override
PiperOrigin-RevId: 788797948
2025-07-30 01:22:34 -07:00
Jan Wassenberg d1638587f0 1.14x batch decode speedup: parallelize RMSNorm ops
Activations was over-parallelized, use single pool instead.
Also improve profiler zone annotations,
pass through worker args (for tracking concurrency), now non-optional.

PiperOrigin-RevId: 788790976
2025-07-30 00:55:45 -07:00
Jan Wassenberg ac0d751d20 Rename GetModelConfig->Config
PiperOrigin-RevId: 788506480
2025-07-29 10:18:12 -07:00
Jeremiah Harmsen 33fabd4ed1 Internal change.
PiperOrigin-RevId: 788463042
2025-07-29 08:21:29 -07:00
Jan Wassenberg e76e29ce11 De-singleton ThreadingContext so callers can pass in their own
weights.cc: fix BindB argument for bf16 tensors
threading_test: enable autotune
PiperOrigin-RevId: 785763618
2025-07-22 02:08:46 -07:00
Jan Wassenberg 5474146129 Back to f32 kv_cache, but via typedef
PiperOrigin-RevId: 785422614
2025-07-21 07:05:35 -07:00
Jan Wassenberg 56c9196eb6 Add blob_path to config deduction message
PiperOrigin-RevId: 782188689
2025-07-11 18:58:56 -07:00
Jan Wassenberg 4bc44d5678 Minor: ModelWeightsPtrs -> WeightsPtrs
PiperOrigin-RevId: 781954533
2025-07-11 06:11:51 -07:00
Jan Wassenberg a04cc287b2 Move MatMulEnv out of Gemma to enable concurrent calls
Also update benchmark_helper config print: add profiler, remove free mem

PiperOrigin-RevId: 774662974
2025-06-23 01:20:09 -07:00
Jan Wassenberg 0f70f285e0 1.1x prefill and decode speedup (attention/activations)
Optimizations
- Better load-balancing in attention threading
(Previously, clusters were limited by #heads)
- Add MulByConstTo to avoid zero-init
- Parallel activations

Cleanup
- Prepare for RowPtr in A or B
- Pass through thread_id to ops
- Avoid warning in bench_matmul

PiperOrigin-RevId: 773723423
2025-06-20 08:59:53 -07:00
Jan Wassenberg 4f5785b0fd Update instrumentation for new Highway wall-time profiler
Pass the thread index through and use new zone_id.

PiperOrigin-RevId: 773344242
2025-06-19 07:46:04 -07:00
Jan Wassenberg 7f62c2606e Fix bf16 KV recompression and Rope(), fixes #608
Also add more helpful error message for prompt > seq_len

Also update ops_test, adding coverage for Rope().

PiperOrigin-RevId: 772945644
2025-06-18 09:14:20 -07:00
Biruk Mammo 88284387db Reduce warning noise.
PiperOrigin-RevId: 772941142
2025-06-18 09:01:40 -07:00
Jan Wassenberg 343482c7ef 1.02x batch decode speedup: BF16 KV cache
ops-inl.h: Vectorize Rope(), template
Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax
Fix for DecompressAndZeroPad: ensure second vector filled

PiperOrigin-RevId: 772779163
2025-06-17 23:21:59 -07:00
Jan Wassenberg f2adbfbcab Batch inference fixes: set pos during prefill, fix assert
PiperOrigin-RevId: 772458760
2025-06-17 07:09:44 -07:00
Jan Wassenberg cd80d8b24d Speed up builds by skipping rarely used targets
Centralize previous code into GEMMA_DISABLED_TARGETS

PiperOrigin-RevId: 772433723
2025-06-17 05:44:20 -07:00
Jan Wassenberg 9a02d6be68 Add --prompt_file and testdata for it. Refs #608
Linux terminals truncate input after 4096 chars.
testdata is Frankenstein from project Gutenberg, which are long out of copyright.

Also fix loss of coherence after long context caused by incorrect IsGlobalLayer.
Move that to config.h and use max_seq_len as the initializer to make this clear.

Also avoid dynamic allocation for GriffinActivations.

PiperOrigin-RevId: 772333225
2025-06-16 23:41:07 -07:00
Biruk Mammo 5f3797f6e1 Allow creating empty `AttentionActivations` for experimental code.
PiperOrigin-RevId: 772077675
2025-06-16 10:19:11 -07:00
Jan Wassenberg 6773e4517c Split Activations into Griffin/Attention to reduce memory usage for attention-only tests.
PiperOrigin-RevId: 772025282
2025-06-16 07:52:59 -07:00
RangerUFO 7aac765e96 Add `Append` method to `AllQueries` 2025-06-16 20:39:27 +08:00
Jan Wassenberg e5c81f64a1 Major refactor: clarify query_idx (global) vs qi. Refs #607
Fix missing pos increment for last prefill and check that in gemma_test.
Thanks to @ufownl for pointing this out.

Change argument lists to QBatch with accessors.
Increase default seq_len to 8k.

PiperOrigin-RevId: 771937385
2025-06-16 02:42:02 -07:00
Jan Wassenberg 01cdefeda7 1.64x batch=1 prefill speedup: nested parallelization for Attention
(DotSoftmaxWeightedSum)
Also fix tsan error in matmul (atomic_flag instead of static)

PiperOrigin-RevId: 770241705
2025-06-11 11:28:46 -07:00
Jan Wassenberg c027a45a2e MatPtr-ify KV, shared div_seq_len, --seq_len flag
PiperOrigin-RevId: 770194455
2025-06-11 09:49:38 -07:00
Jan Wassenberg b84149310b Fix paligemma, update its test
Must not pass image tokens to the EmbedMMToken used for text.
Caught by next presubmit test.

paligemma_test: move function bodies into class, regroup variables
PiperOrigin-RevId: 770040014
2025-06-11 02:12:12 -07:00