Jan Wassenberg
1605925d1e
Add int8 quantization stats
...
Compute the L1 error and Shannon SNR (higher is better).
PiperOrigin-RevId: 846832280
2025-12-19 12:43:03 -08:00
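For readers unfamiliar with the two metrics named in the int8 quantization stats commit, the following is a minimal sketch, not the code from that change. It assumes a symmetric per-tensor int8 scheme (scale = max|x| / 127) and takes SNR to mean 10 * log10(signal power / noise power) in dB; the exact definition used in the commit may differ. `QuantStats` and `Int8RoundTripStats` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical result type; not the gemma.cpp stats structure.
struct QuantStats {
  double l1 = 0.0;      // mean absolute round-trip error (lower is better)
  double snr_db = 0.0;  // 10 * log10(signal / noise) (higher is better)
};

// Quantizes `x` to int8 with an assumed symmetric per-tensor scale,
// dequantizes, and accumulates the error statistics.
QuantStats Int8RoundTripStats(const std::vector<float>& x) {
  float max_abs = 0.0f;
  for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
  const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;

  double sum_abs_err = 0.0, signal = 0.0, noise = 0.0;
  for (float v : x) {
    // v / scale lies in [-127, 127], so the cast cannot overflow.
    const int8_t q = static_cast<int8_t>(std::lround(v / scale));
    const double err =
        static_cast<double>(v) - static_cast<double>(q) * scale;
    sum_abs_err += std::fabs(err);
    signal += static_cast<double>(v) * v;
    noise += err * err;
  }

  QuantStats stats;
  if (!x.empty()) stats.l1 = sum_abs_err / static_cast<double>(x.size());
  // A lossless round trip has zero noise; report infinite SNR in that case.
  stats.snr_db = noise > 0.0 ? 10.0 * std::log10(signal / noise)
                             : std::numeric_limits<double>::infinity();
  return stats;
}
```

Reporting both values is useful because L1 is in the units of the tensor, whereas SNR is relative to the signal magnitude and so is comparable across tensors.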
Krzysztof Rymski
08a0760271
Internal changes
...
PiperOrigin-RevId: 846663686
2025-12-19 03:43:15 -08:00
Krzysztof Rymski
b73a9ede8f
Internal changes
...
PiperOrigin-RevId: 846648337
2025-12-19 02:46:18 -08:00
Balazs Racz
0ac55f71ed
Avoid using Row() for unaligned storage.
...
PiperOrigin-RevId: 846214605
2025-12-18 05:10:57 -08:00
Krzysztof Rymski
6661d3a60c
Internal changes
...
PiperOrigin-RevId: 846140314
2025-12-18 01:26:43 -08:00
Phil Culliton
b8a409dbba
Use hn::Sub for vector subtraction in flash attention.
...
PiperOrigin-RevId: 845883321
2025-12-17 12:57:34 -08:00
Balazs Racz
596bdfe5af
Separate monolithic gemma_lib library into more specific cc_library targets.
...
Creates new cc_library targets for :attention, :tensor_stats and :activations. Eliminates cyclic dependencies between these libraries.
PiperOrigin-RevId: 845690136
2025-12-17 03:31:16 -08:00
Balazs Racz
baa69dfb78
Passes the entire runtime_config into the activations constructor.
...
PiperOrigin-RevId: 845153671
2025-12-16 01:56:52 -08:00
Krzysztof Rymski
44dfd69b9b
Internal changes
...
PiperOrigin-RevId: 844759322
2025-12-15 07:14:37 -08:00
Jan Wassenberg
0c64987a96
Abort if args are unrecognized, refactor argument passing
...
This catches typos/incorrect usage.
Refactor: group Loader/Threading/Inference into GemmaArgs.
All *Args ctors now have an extra ConsumedArgs& argument.
PiperOrigin-RevId: 844690553
2025-12-15 03:18:45 -08:00
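The idea behind aborting on unrecognized arguments can be sketched as follows. This is an illustration only and not the gemma.cpp `ConsumedArgs` class: `ConsumedTracker`, `Get` and `Unrecognized` are hypothetical names, and flags are assumed to have the form `--name value`. Each parser marks the tokens it reads, and anything left unmarked afterwards is a typo or misuse.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: remembers which command-line tokens were read.
class ConsumedTracker {
 public:
  explicit ConsumedTracker(std::vector<std::string> args)
      : args_(std::move(args)), consumed_(args_.size(), false) {}

  // Looks up "--name value". On success, stores the value and marks both
  // tokens as consumed.
  bool Get(const std::string& name, std::string& value) {
    const std::string flag = "--" + name;
    for (size_t i = 0; i + 1 < args_.size(); ++i) {
      if (args_[i] == flag) {
        value = args_[i + 1];
        consumed_[i] = true;
        consumed_[i + 1] = true;
        return true;
      }
    }
    return false;
  }

  // Returns every token that no call to Get() consumed.
  std::vector<std::string> Unrecognized() const {
    std::vector<std::string> result;
    for (size_t i = 0; i < args_.size(); ++i) {
      if (!consumed_[i]) result.push_back(args_[i]);
    }
    return result;
  }

 private:
  std::vector<std::string> args_;
  std::vector<bool> consumed_;
};
```

A caller would run all of its argument parsers against one shared tracker and abort with an error message if `Unrecognized()` is non-empty, so that a misspelled flag fails loudly instead of silently falling back to a default.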
Jan Wassenberg
f50550f4ce
Warning fixes (sign mismatch), switch default
...
PiperOrigin-RevId: 844679375
2025-12-15 02:41:19 -08:00
Martin Stolle
506fb22be7
No public description
...
PiperOrigin-RevId: 843665619
2025-12-12 06:37:17 -08:00
Balazs Racz
338cd8a36e
Factors out a new cc_library `:query` from `:gemma-lib`.
...
Moves query-related structs/classes to gemma/query.h.
This refactors PerQuery, AllQueries, and QBatch into a dedicated header file, gemma/query.h, and updates BUILD dependencies accordingly.
PiperOrigin-RevId: 843604293
2025-12-12 02:53:56 -08:00
Jan Wassenberg
73c3627b67
Add tensor stats and output
...
tensor_info: add missing header
io: fix mode
weights.h: add layer_idx to LayerWeightsPtrs
PiperOrigin-RevId: 843531051
2025-12-11 22:52:46 -08:00
Martin Stolle
78deacc357
Make attention configurable on the command line.
...
PiperOrigin-RevId: 842760721
2025-12-10 09:34:06 -08:00
Martin Stolle
2441ff01bf
internal change
...
PiperOrigin-RevId: 842749037
2025-12-10 09:01:15 -08:00
Martin Stolle
9689fc82f9
internal change
...
PiperOrigin-RevId: 842205671
2025-12-09 06:17:08 -08:00
Krzysztof Rymski
64d700cab5
Internal changes
...
PiperOrigin-RevId: 842194766
2025-12-09 05:42:03 -08:00
Martin Stolle
14a9ecf21d
Factor out SumHeads
...
PiperOrigin-RevId: 842138081
2025-12-09 02:23:16 -08:00
Martin Stolle
1014ae9e2a
Adding a simple test for GemmaAttention
...
PiperOrigin-RevId: 842135414
2025-12-09 02:13:03 -08:00
Martin Stolle
b510ba2ab2
Improve clarity of indices II
...
Sorry, didn't see this one before.
PiperOrigin-RevId: 840218378
2025-12-04 06:33:33 -08:00
Martin Stolle
9348048885
Clean up toPtrs to delegate to toPtr
...
PiperOrigin-RevId: 840214969
2025-12-04 06:22:04 -08:00
Martin Stolle
d2090fddf3
Improve clarity of indices
...
PiperOrigin-RevId: 839805634
2025-12-03 10:11:21 -08:00
Jan Wassenberg
a084d33e41
Fix Gemma3 image: ensure A matrix is packed, preallocate
...
Also ignore -2 tokens
PiperOrigin-RevId: 838869988
2025-12-01 11:47:23 -08:00
Krzysztof Rymski
6e5e4123f1
Internal changes
...
PiperOrigin-RevId: 837775282
2025-11-28 02:37:06 -08:00
Krzysztof Rymski
c153d5255b
Internal changes
...
PiperOrigin-RevId: 837001762
2025-11-26 01:05:35 -08:00
Martin Stolle
8696f6dd17
Clarify indices
...
PiperOrigin-RevId: 836235539
2025-11-24 08:27:59 -08:00
Jan Wassenberg
37a25c9ffe
Fix warning (signed vs unsigned)
...
PiperOrigin-RevId: 836106478
2025-11-24 00:51:17 -08:00
Charles Zhao
0e5f4cbf1b
Implement Continuous Batching.
...
(1) Adds a function GenerateTWithContinuousBatching, which uses continuous batching when enabled.
(2) Adds ContinuousQBatch as a subclass of QBatch to manage prefill, insert, and used-kv-cache-collection.
(3) Expands the unit test to cover more diverse cases.
PiperOrigin-RevId: 836090261
2025-11-23 23:54:02 -08:00
Martin Stolle
88a03b7ec4
Added access to softmax attention internals to regular attention
...
PiperOrigin-RevId: 835244205
2025-11-21 09:01:01 -08:00
Martin Stolle
5a500872b8
Internal change
...
PiperOrigin-RevId: 835115693
2025-11-21 01:17:45 -08:00
Martin Stolle
49d420aeaf
Add some comments.
...
PiperOrigin-RevId: 834173319
2025-11-19 01:09:15 -08:00
The gemma.cpp Authors
b8f6be72b1
Improves autodetection of Gemma3-1B.
...
Uses the key_norm and query_norm layers to disambiguate between the Gemma2-2B and Gemma3-1B models.
Since Gemma3-1B is not multimodal, ViT is not an effective disambiguator. KQ normalization is a structural disambiguator between gemma2 and gemma3.
PiperOrigin-RevId: 833213331
2025-11-17 01:12:50 -08:00
Jan Wassenberg
3e18db17f4
Avoid hard-coding kPatchSize. Thanks @Somet2mes for reporting. Fixes #762.
...
PiperOrigin-RevId: 829308896
2025-11-07 00:32:31 -08:00
Charles Zhao
f8131339a7
Refactor for continuous batching. This CL does not change the current behavior of the code. It only extracts two functions that will later be called when adding continuous batching.
...
PiperOrigin-RevId: 829104661
2025-11-06 14:20:17 -08:00
Martin Stolle
35e9f9f05f
Introduce attention implementation configurability.
...
PiperOrigin-RevId: 828971705
2025-11-06 08:43:41 -08:00
Jan Wassenberg
091b4567c9
Minor: ParallelismStrategy->Parallelism
...
PiperOrigin-RevId: 828936578
2025-11-06 06:56:10 -08:00
Jan Wassenberg
a344a70c59
Change (old) attention behavior to disallow wraparound, enforced via assertion.
...
Shared kU64PerLine constant
PiperOrigin-RevId: 828072451
2025-11-04 11:52:40 -08:00
Charles Zhao
3a63a12624
Allow prefill only run by allowing max_prompt_size == seq_len
...
PiperOrigin-RevId: 827415258
2025-11-03 03:17:54 -08:00
Phil Culliton
ab87807a4c
Pre-compress query activations to BF16 before FlashAttention.
...
PiperOrigin-RevId: 826524997
2025-10-31 09:49:44 -07:00
Ray Smith
8a100c1e8d
Added access to flash attention internals to TileFlashAttention4
...
PiperOrigin-RevId: 826011137
2025-10-30 06:50:05 -07:00
Phil Culliton
116cd6eff6
BF16 mixed-mode flash attention
...
PiperOrigin-RevId: 825433929
2025-10-29 01:48:28 -07:00
Jan Wassenberg
4bd465ffd3
Also update attention.h to type-erased query_norm_scale
...
PiperOrigin-RevId: 825014334
2025-10-28 06:48:33 -07:00
Jan Wassenberg
3cc0139ebb
Fix excessive KC/MC from prior change
...
This could lead to stack overflow in B_storage.
Also do not require specific type for query_norm_scale,
update batch sizes for attention tensors,
more verbose Mat shape/type checks.
PiperOrigin-RevId: 824987689
2025-10-28 05:33:01 -07:00
Biruk Mammo
5a05857deb
[Gemma.cpp] Allows non-owned arguments for attention methods.
...
* Adds and uses a new `AttentionActivationPtrs` that holds non-owning `MatPtrs`. Acts as a view into `AttentionActivations`.
* Updates `QBatch` to hold non-owning `MatPtr`s to the kv caches.
* Enables the `MatPtrT` default constructor for simpler initializations.
* Pulls out and passes `LayerWeightsPtrs::query_norm_scale` directly. While `LayerWeightsPtrs` already held non-owning `MatPtr`s, this change avoids the need to find and construct several empty weight tensors just to construct one `query_norm_scale` tensor.
PiperOrigin-RevId: 824584177
2025-10-27 10:43:25 -07:00
Theotime Combes
1bdde1af3c
Add config flag for global timescale & rely on config to deduce wrapping
...
PiperOrigin-RevId: 823512377
2025-10-24 06:54:56 -07:00
Jan Wassenberg
3ed403e287
Major cleanup of profiler zones, add Caller annotation for all pool.Run
...
Pass ThreadingContext instead of Pools/Profiler individually, for access to Zones
Add GCPP_ZONE helper
Add Caller argument to pool.Run to enable new stats
Remove most direct dependencies on ThreadPool, prefer ParallelFor
PiperOrigin-RevId: 822934530
2025-10-23 01:54:24 -07:00
Phil Culliton
503aaddd65
Add 8-bit integer quantization (I8Stream) to Gemma.cpp.
...
PiperOrigin-RevId: 819787856
2025-10-15 09:25:20 -07:00
Ray Smith
ee18916abf
Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead.
...
PiperOrigin-RevId: 819739402
2025-10-15 07:10:04 -07:00
Ray Smith
fb6fa793f4
Added a global (to gemma) zones list so that most call sites of PROFILER_ZONE3 avoid the synchronization required for the static const initialization of the zone handle.
...
Improved flash_attention to enable profiling using the new zones.
PiperOrigin-RevId: 819235421
2025-10-14 08:30:58 -07:00