gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	7263ab8445	MatMul simplification, threading strategy improvements. Remove MatMul f32 special case (smaller code); types: add u32/u64 for use by Activations; move renamed ParallelismStrategy to threading_context so can pass ctx; ensure worker index is unique across clusters; matmul.h: const member functions for renamed policy classes (easier to call). PiperOrigin-RevId: 802848086	2025-09-03 21:45:07 -07:00
Jan Wassenberg	b7b3d353db	Simplify MatMul: remove F32 special case (build time) Also move kMaxM into separate kMaxBatchSize PiperOrigin-RevId: 802086590	2025-09-02 04:29:21 -07:00
Jan Wassenberg	1e3c853e80	Add ParallelFor wrapper function and one new mode. Move ParallelismType from matmul.h to threading.h. Replace SmallParallelFor with ParallelFor and the new mode. PiperOrigin-RevId: 802038452	2025-09-02 01:40:09 -07:00
Jan Wassenberg	229bd078a1	1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage. Also increase dot_test.cc range for Zen4, and matmul_test tolerance (failing in some configs) PiperOrigin-RevId: 801789922	2025-09-01 06:34:04 -07:00
Jan Wassenberg	0ae8646731	Fix remainder handling for Paligemma No longer attempt to skip the remainder handling because B might also be a non-padded view. PiperOrigin-RevId: 800890805	2025-08-29 07:25:52 -07:00
Marie White	973e284ed6	Refactor Matmul to use a policy class for parallelization. PiperOrigin-RevId: 800864489	2025-08-29 05:40:39 -07:00
Jan Wassenberg	6c39a2dea4	1.01x speedup: More bf16 activations to reduce DecompressA. Also move observer call into function, format gemma_args. PiperOrigin-RevId: 800827400	2025-08-29 03:19:01 -07:00
Jan Wassenberg	7288891439	Remove F64 partial storage in matmul. Also remove no longer used kMaxN; row_ptrs only used for C PiperOrigin-RevId: 800774757	2025-08-29 00:12:08 -07:00
Jan Wassenberg	98ddc166db	Expand ThreadingContext comments PiperOrigin-RevId: 800479954	2025-08-28 08:32:10 -07:00
Marie White	6128e758ff	Change ffw_out from B16 to F32. PiperOrigin-RevId: 800330411	2025-08-28 00:01:39 -07:00
Jan Wassenberg	5411fd846d	Minor: batched NotifyGenerate, fix comment/dep PiperOrigin-RevId: 799889802	2025-08-26 23:33:17 -07:00
Jan Wassenberg	86afd53076	1.04x speedup: Parallelize SoftCap Also require opt-in constexpr flag for observer callbacks, update zones PiperOrigin-RevId: 799655163	2025-08-26 11:55:20 -07:00
Jan Wassenberg	ed2f0bd1b0	Fix pos assertions, refs #665 Ensure the streaming func pos matches the number of calls. Add two arguments that control pos+1 and pos+=1 behavior. Also cleanup/add comments. run: use batch_stream_func, add assert, higher verbosity for MM autotune output PiperOrigin-RevId: 799511163	2025-08-26 04:50:40 -07:00
Jan Wassenberg	9bf0fe4e37	Internal change PiperOrigin-RevId: 799509375	2025-08-26 04:44:08 -07:00
Jan Wassenberg	d3a5ddf657	Merge pull request #663 from junjihashimoto:feature/api-server PiperOrigin-RevId: 797731089	2025-08-24 11:57:05 +02:00
Rhett Stucki	73f1140dca	Fix an off-by-one error after StreamAndUpdateEOS() to remove the MSAN warning about reading an uninitialized variable in the kv_cache. The logic for choosing whether or not to attend to the last token during prefill wasn't completely consistent with StreamAndUpdateEOS(), causing an off-by-one error that prevented the kv_cache from being fully populated. PiperOrigin-RevId: 797614310	2025-08-20 22:59:58 -07:00
Junji Hashimoto	41321611fd	feature: add API server and client with Google protocol	2025-08-21 11:32:48 +09:00
Phil Culliton	78573b6718	Internal change. Add deduction for 270M. PiperOrigin-RevId: 795041810	2025-08-14 08:04:38 -07:00
Phil Culliton	d044801c1d	Internal change PiperOrigin-RevId: 794620076	2025-08-13 09:47:45 -07:00
Jan Wassenberg	71406cf6d0	More profiler interface fixes: hwy:: plus avoid ADD_ZONE PiperOrigin-RevId: 794493165	2025-08-13 03:15:48 -07:00
Jan Wassenberg	faa4102992	(Resubmit) Prepare profiler annotations for new API Pass hwy::Profiler& to low-level functions. Used ThreadingContext arg instead of NestedPools. Use new PROFILER_ZONE3. PiperOrigin-RevId: 794461159	2025-08-13 01:38:24 -07:00
The gemma.cpp Authors	a2d9133f7d	Prepare profiler annotations for new API Pass hwy::Profiler& to low-level functions. Used ThreadingContext arg instead of NestedPools. Use new PROFILER_ZONE3. PiperOrigin-RevId: 793865287	2025-08-11 17:51:38 -07:00
Jan Wassenberg	4cbf63e6f0	Prepare profiler annotations for new API Pass hwy::Profiler& to low-level functions. Used ThreadingContext arg instead of NestedPools. Use new PROFILER_ZONE3. PiperOrigin-RevId: 793821255	2025-08-11 15:34:52 -07:00
Jan Wassenberg	4e062d68f7	Update BlobWriter comments, WriteAll->Finalize PiperOrigin-RevId: 790792133	2025-08-04 10:01:38 -07:00
Jan Wassenberg	701841897b	Default to disabling per-socket parallelization weights: default to Read for small-batch (only look at qbatch, not the larger prefill tbatch) PiperOrigin-RevId: 790787643	2025-08-04 09:49:14 -07:00
Jan Wassenberg	799c264df3	Pre-tune thread pool before matmul Also improve profiler annotations - remove near-zero ones and add more for startup PiperOrigin-RevId: 789352414	2025-07-31 08:45:26 -07:00
Charles Zhao	50ee1a3e92	Write SBS progressively. (1) Directly write to file in BlobWriter::Add and destruct the MatOwner to release the RAM. (2) Write a fake header to indicate this is V2, and write the correct header and directory at the end of the file. (3) Tested loading SBS written the old way and the new way; both worked. PiperOrigin-RevId: 789306837	2025-07-31 06:05:38 -07:00
Jan Wassenberg	8715eda512	Improved layer idx parsing PiperOrigin-RevId: 788868522	2025-07-30 05:49:45 -07:00
Jan Wassenberg	d831ddce5b	Fix file mapping: was letting the smart pointer go out of scope Also save+print the IO mode used. PiperOrigin-RevId: 788848165	2025-07-30 04:30:10 -07:00
Jan Wassenberg	d22ba2ac96	Update layer index parsing and allow tokenizer override PiperOrigin-RevId: 788797948	2025-07-30 01:22:34 -07:00
Jan Wassenberg	d1638587f0	1.14x batch decode speedup: parallelize RMSNorm ops Activations was over-parallelized, use single pool instead. Also improve profiler zone annotations, pass through worker args (for tracking concurrency), now non-optional. PiperOrigin-RevId: 788790976	2025-07-30 00:55:45 -07:00
Jan Wassenberg	ac0d751d20	Rename GetModelConfig->Config PiperOrigin-RevId: 788506480	2025-07-29 10:18:12 -07:00
Jeremiah Harmsen	33fabd4ed1	Internal change. PiperOrigin-RevId: 788463042	2025-07-29 08:21:29 -07:00
Jan Wassenberg	e76e29ce11	De-singleton ThreadingContext so callers can pass in their own weights.cc: fix BindB argument for bf16 tensors threading_test: enable autotune PiperOrigin-RevId: 785763618	2025-07-22 02:08:46 -07:00
Jan Wassenberg	5474146129	Back to f32 kv_cache, but via typedef PiperOrigin-RevId: 785422614	2025-07-21 07:05:35 -07:00
Jan Wassenberg	56c9196eb6	Add blob_path to config deduction message PiperOrigin-RevId: 782188689	2025-07-11 18:58:56 -07:00
Jan Wassenberg	4bc44d5678	Minor: ModelWeightsPtrs -> WeightsPtrs PiperOrigin-RevId: 781954533	2025-07-11 06:11:51 -07:00
Jan Wassenberg	a04cc287b2	Move MatMulEnv out of Gemma to enable concurrent calls Also update benchmark_helper config print: add profiler, remove free mem PiperOrigin-RevId: 774662974	2025-06-23 01:20:09 -07:00
Jan Wassenberg	0f70f285e0	1.1x prefill and decode speedup (attention/activations). Optimizations: better load-balancing in attention threading (previously, clusters were limited by #heads); add MulByConstTo to avoid zero-init; parallel activations. Cleanup: prepare for RowPtr in A or B; pass through thread_id to ops; avoid warning in bench_matmul. PiperOrigin-RevId: 773723423	2025-06-20 08:59:53 -07:00
Jan Wassenberg	4f5785b0fd	Update instrumentation for new Highway wall-time profiler Pass the thread index through and use new zone_id. PiperOrigin-RevId: 773344242	2025-06-19 07:46:04 -07:00
Jan Wassenberg	7f62c2606e	Fix bf16 KV recompression and Rope(), fixes #608 Also add more helpful error message for prompt > seq_len Also update ops_test, adding coverage for Rope(). PiperOrigin-RevId: 772945644	2025-06-18 09:14:20 -07:00
Biruk Mammo	88284387db	Reduce warning noise. PiperOrigin-RevId: 772941142	2025-06-18 09:01:40 -07:00
Jan Wassenberg	343482c7ef	1.02x batch decode speedup: BF16 KV cache ops-inl.h: Vectorize Rope(), template Remove unused MulBy, and extra-arg overloads of MulByConst and Softmax Fix for DecompressAndZeroPad: ensure second vector filled PiperOrigin-RevId: 772779163	2025-06-17 23:21:59 -07:00
Jan Wassenberg	f2adbfbcab	Batch inference fixes: set pos during prefill, fix assert PiperOrigin-RevId: 772458760	2025-06-17 07:09:44 -07:00
Jan Wassenberg	cd80d8b24d	Speed up builds by skipping rarely used targets Centralize previous code into GEMMA_DISABLED_TARGETS PiperOrigin-RevId: 772433723	2025-06-17 05:44:20 -07:00
Jan Wassenberg	9a02d6be68	Add --prompt_file and testdata for it. Refs #608. Linux terminals truncate input after 4096 chars. testdata is Frankenstein from Project Gutenberg, which is long out of copyright. Also fix loss of coherence after long context caused by incorrect IsGlobalLayer. Move that to config.h and use max_seq_len as the initializer to make this clear. Also avoid dynamic allocation for GriffinActivations. PiperOrigin-RevId: 772333225	2025-06-16 23:41:07 -07:00
Biruk Mammo	5f3797f6e1	Allow creating empty `AttentionActivations` for experimental code. PiperOrigin-RevId: 772077675	2025-06-16 10:19:11 -07:00
Jan Wassenberg	6773e4517c	Split Activations into Griffin/Attention to reduce memory usage for attention-only tests. PiperOrigin-RevId: 772025282	2025-06-16 07:52:59 -07:00
RangerUFO	7aac765e96	Add `Append` method to `AllQueries`	2025-06-16 20:39:27 +08:00
Jan Wassenberg	e5c81f64a1	Major refactor: clarify query_idx (global) vs qi. Refs #607 Fix missing pos increment for last prefill and check that in gemma_test. Thanks to @ufownl for pointing this out. Change argument lists to QBatch with accessors. Increase default seq_len to 8k. PiperOrigin-RevId: 771937385	2025-06-16 02:42:02 -07:00
Jan Wassenberg	01cdefeda7	1.64x batch=1 prefill speedup: nested parallelization for Attention (DotSoftmaxWeightedSum) Also fix tsan error in matmul (atomic_flag instead of static) PiperOrigin-RevId: 770241705	2025-06-11 11:28:46 -07:00
Jan Wassenberg	c027a45a2e	MatPtr-ify KV, shared div_seq_len, --seq_len flag PiperOrigin-RevId: 770194455	2025-06-11 09:49:38 -07:00
Jan Wassenberg	b84149310b	Fix paligemma, update its test Must not pass image tokens to the EmbedMMToken used for text. Caught by next presubmit test. paligemma_test: move function bodies into class, regroup variables PiperOrigin-RevId: 770040014	2025-06-11 02:12:12 -07:00
Jan Wassenberg	ec02726cf7	6x large-batch, short-prompt prefill speedup. Parallelize over queries instead of tokens; introduce non_eos so we only iterate over not yet EOS queries; remove TokenStreamer. Move RMSNormInplaceBatched out of Transformer to call the latter from prefill. Consistent arg order. Fix gemma_test EOS handling (caught by msan), remove from tokenizer.h. Also add output to gemma_batch_bench, fix name. PiperOrigin-RevId: 769676106	2025-06-10 09:56:20 -07:00
Daniel Keysers	d7b23d532a	Restructure internal initialization. PiperOrigin-RevId: 769507096	2025-06-10 01:25:31 -07:00
Jan Wassenberg	6ee628ba38	Further cleanup: separate MatMulEnv arg move row_ptrs into MatMulEnv Consistent arg order: layer, activations, kv_cache, env PiperOrigin-RevId: 767886386	2025-06-05 20:48:32 -07:00
Jan Wassenberg	0e2cab5187	Avoid warning about inability to map, unless explicitly requested PiperOrigin-RevId: 767633815	2025-06-05 09:10:08 -07:00
Jan Wassenberg	3a266c662c	Split gemma-inl into separate source files weights, mat: zero-initialize padding, required since the MatMul "avoid B decompress" optimization. PiperOrigin-RevId: 767562313	2025-06-05 05:36:44 -07:00
RangerUFO	a82f8d5690	Fix compilation error on G++ 9.4	2025-06-04 17:39:37 +08:00
Jan Wassenberg	6897313080	3x speedup of EmbedImagePatches - GEMM, not GEMV. Required fixes to handling of non-vector aligned A. Also move row ptrs to MatMulEnv. PiperOrigin-RevId: 767029036	2025-06-04 01:18:52 -07:00
Jan Wassenberg	9efdcfd45c	1.07x batch decode speedup: more BF16 weights and activations BF16 att_sums and ffw_out Support BF16 B views without decompression Support arbitrary types in MulByConstAndAdd, AddFrom Also update profiler annotations in ops-inl.h PiperOrigin-RevId: 766995010	2025-06-03 23:30:18 -07:00
Jan Wassenberg	839a642992	Fix paligemma_test, refs #588 Detect PaliGemma models from layer names Remove unused allocator arg from CreateInvTimescale matmul: only warn once about dim divisibility Print config also in tests if --verbosity 2 PiperOrigin-RevId: 766605131	2025-06-03 04:45:22 -07:00
Jan Wassenberg	ad3002a21c	Merge branch 'dev' into bugfix/vit_attn	2025-06-03 09:29:52 +02:00
Jan Wassenberg	794a21a4e6	Major refactor to de-templatize gemma-inl and weights This replaces per-weight instantiations of all code with only per-MatMul/norm. Reduces binary size by 133KiB. WeightsOwner is no longer required for type erasing, hence it is replaced with ModelWeightsPtrs. Also remove unused EmbedToken, replaced with EmbedMMToken. PiperOrigin-RevId: 766497657	2025-06-02 23:01:35 -07:00
RangerUFO	93de2be938	Fix the broken VitAttention	2025-06-03 12:40:13 +08:00
Jan Wassenberg	cf4d7ceb82	1.16x decode speedup: remove last MatVec in Attention Precompute row pointers. Remove no longer used MHA support; QStride -> qkv_dim. Remove RowPtr from MatMul interface, use only MatPtrT. Require opt-in define for NUQ to speed up builds. Also fix io.cc on Windows. PiperOrigin-RevId: 766228108	2025-06-02 09:40:29 -07:00
The gemma.cpp Authors	9c3e089b09	Internal change. PiperOrigin-RevId: 765218260	2025-05-30 09:18:44 -07:00
The gemma.cpp Authors	1e8642f8f4	Internal change. PiperOrigin-RevId: 765037449	2025-05-29 22:51:16 -07:00
Jan Wassenberg	3890eb5412	Remove backprop/ Also remove MatPtrT::Packed(); use PackedScale1 instead where const, or Row(0). PiperOrigin-RevId: 764243198	2025-05-28 07:01:17 -07:00
Jan Wassenberg	627cc04db9	Decouple MatMul from gemma-inl: precompile for all input types Call MatMulStatic instead of MatMul. Also fix build error due to Highway's Lanes not being constexpr. PiperOrigin-RevId: 763777269	2025-05-27 07:08:58 -07:00
Jan Wassenberg	421a2ab8ac	Add comments explaining non-padded tensors, kNoPad -> kPacked PiperOrigin-RevId: 763352173	2025-05-26 03:03:38 -07:00
RangerUFO	2771f463f9	Fix the ViT weights loading	2025-05-22 12:13:29 +08:00
RangerUFO	6debdbe341	Minor fixes for ViT	2025-05-20 22:27:10 +08:00
Jan Wassenberg	cb188d4a0e	Fix RowT issue and improve Griffin (currently still broken) Use type-safe MatPtrT via dynamic_cast, avoid/remove unsafe RowT activations: Griffin tensors are now padded Griffin: add batching support, fix conv1d_cache allocation weights: bundle to TensorToRead, add kNoPad flag, fix SplitW1 const-correct fix for ForEachTensor blob_store: move BlobIO2 to .cc and rename BlobIO PiperOrigin-RevId: 760610094	2025-05-19 07:02:10 -07:00
Jan Wassenberg	e890d46f30	1.31x batch prefill, 1.24x batch decode speedup: NUMA binding Only the weights; binding MatMul output worsens batch=1 prefill. Update gemma_batch_bench to use --decode_qbatch. Fix/remove prefill_activations in gemma-inl.h. Refactor: use BasePageBytes directly when binding Move BindB/C to .cc by de-templatizing Remove MatOwners::AllocateFor because it is weights-specific (binding or not) Disband MatOwners, replace with vector PiperOrigin-RevId: 759610477	2025-05-16 07:42:13 -07:00
Jan Wassenberg	c443adee33	3.8x speedup of weights loading via preadv on Linux Also move BlobReader reading functionality to weights.cc PiperOrigin-RevId: 759240310	2025-05-15 11:55:15 -07:00
Jan Wassenberg	38a08d8095	Replace last ConstMat with MatPtr This is to reduce the number of MatMul overloads in preparation for de-templatizing. PiperOrigin-RevId: 758288589	2025-05-13 10:55:22 -07:00
RangerUFO	30ad625f42	Fix the wrapping field of the deduced model config	2025-05-13 23:02:03 +08:00
Jan Wassenberg	8a312e9b89	Split W1/W2 as a load-time preprocess. Remove kOnlyAllocate - no longer used. Rename ReadOrAllocate -> ReadFromBlobs. Rename Reshape -> Fixup to reflect the new scope. Remove no longer used ShrinkRows. This simplifies gemma-inl and is a prerequisite for removing ConstMat (whose .ofs was previously used for merged tensors) PiperOrigin-RevId: 758214083	2025-05-13 07:39:59 -07:00
Jan Wassenberg	2038dfd9cc	Minor: rename compression/shared -> types.h PiperOrigin-RevId: 758199851	2025-05-13 06:53:21 -07:00
Jan Wassenberg	d538a6d6c6	Cleanup: remove unused kCyclic, remove 2 suffix Also remove now unused allocator arg and fix warnings (cast, struct/class mismatch) PiperOrigin-RevId: 758098495	2025-05-13 01:06:41 -07:00
Biruk Mammo	ba21e3beb4	Adds a `GemmaAttention` constructor that takes an explicit `ThreadingContext`. PiperOrigin-RevId: 757839682	2025-05-12 11:17:05 -07:00
Jan Wassenberg	45ad847a41	Replace RowVectorBatch with MatStorageT KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out optimize_test: larger vocab_size requires more steps shared.h: Remove unused u128 type correctly set Activation matrix rows, avoid passing as arg ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type mat: add OverrideRows, used by SetBatchSize PiperOrigin-RevId: 757790736	2025-05-12 09:16:12 -07:00
Jan Wassenberg	252a4e955e	Remove support for Gemma 1 and PaliGemma 1 models, superseded by (Pali)Gemma 2. PiperOrigin-RevId: 756671308	2025-05-09 02:17:27 -07:00
Biruk Mammo	d834c07042	Exposes `GemmaAttention::DotSoftmaxWeightedSum` for experimentation. Also in this change: * The computation for a single `q` is factored out and exposed. * Strided `ConstMat` views into the KV caches are introduced to enable experimentation with various KV cache layouts. PiperOrigin-RevId: 756339313	2025-05-08 09:19:04 -07:00
The gemma.cpp Authors	20757046db	Cleanup, new conversation methods, bugfixes. chore: unused parameters cleaned up; bugfix: explicitly use hwy::Span in GenerateInternal() to prevent runtime crashes due to memory layout incompatibility; bugfix: explicit nullptr check in LogDebug; chore: length-related parameters renamed for clarity; feature: SaveConversation() can be optionally used to save a copy of a conversation that ResetConversation() will rewind to upon request, rather than just an empty KV cache; feature: GetCurrentConversation() can be used to query the current conversation's name. PiperOrigin-RevId: 755873147	2025-05-07 08:52:44 -07:00
Jan Wassenberg	e9ecb7794d	Fix gcc build error and gemma3 crash, thanks @ufownl, fixes #551 PiperOrigin-RevId: 755729478	2025-05-07 00:59:18 -07:00
Jan Wassenberg	c8d92948f4	Move fields, io* and blob* from compression/ into io/ PiperOrigin-RevId: 755445712	2025-05-06 11:17:19 -07:00
Jan Wassenberg	275135d7e8	Rename-only: remove Allocator2 etc suffixes now that refactoring is complete PiperOrigin-RevId: 755397220	2025-05-06 09:12:43 -07:00
Jan Wassenberg	8d0882b966	Huge refactor of weight handling and model loading. Weight handling: new ModelStore2 supports both pre-2025 multi-file and single-file formats; simpler ForEachTensor with TensorArgs; tensors are constructed with their full suffixed name. I/O: support mmap and stride; simplified SbsWriter, single insert(); add SbsReader. Misc: kMockTokenizer: allow creating with unavailable tokenizer; configs.h: simpler enum validity checks via kSentinel; matmul.h: remove unused enable_bind (now in allocator.h); tensor_info: single TensorInfoRegistry class, rename from tensor_index.h. Frontends: replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&); deduce model/weight type, remove --model and parsing; replace most common.h includes with configs.h; remove --compressed_weights, use --weights instead; remove ModelInfo, replaced by ModelConfig. Backprop: reduce max loss, remove backward_scalar_test (timeout); update thresholds because new RandInit changes rng eval order and thus numerics. PiperOrigin-RevId: 755317484	2025-05-06 04:44:21 -07:00
Jan Wassenberg	160a5824fb	Cleanup: include fixes/comments, fix leak, vector reserve Also remove unused RowSpan configs.cc: Assign prompt wrapping to ModelConfig configs.h: simplify EnumValid via sentinel PiperOrigin-RevId: 750278497	2025-04-22 12:01:46 -07:00
The gemma.cpp Authors	ba10c88a94	Add C API and C# interop files. This change adds a basic C API that allows access to Gemma functionality from other programming languages. The functionality is exposed via a shared library (DLL on Windows), with C++ interfaces and a basic C# interop wrapper included. To build the DLL, use the `windows-dll` preset, which includes the C and C++ sources, as follows: `cmake --preset windows-dll`, then `cmake --build --config Release --preset windows-dll -j 4`. This should generate a `gemma.dll` in `<build-dir>/Release`. To build for non-Windows, the appropriate C++ DLL linking will need to be done to generate a shared library for the target OS. PiperOrigin-RevId: 750246272	2025-04-22 10:35:47 -07:00
prajwalc22	2407150f84	Merge branch 'feature-prompt-flag' of github.com:prajwalc22/gemma.cpp into feature-prompt-flag	2025-04-17 23:54:46 +05:30
prajwalc22	a9e56c27eb	removed unnecessary threading.h import	2025-04-17 23:44:23 +05:30
Prajwal Choudhari	09dfb144c0	Merge branch 'dev' into feature-prompt-flag	2025-04-17 18:53:28 +05:30
prajwalc22	f55c321397	Address review feedback: Fix prefill_tbatch_size and variable placement issues	2025-04-17 10:15:21 +05:30
prajwalc22	27c28cc938	Address review feedback: Fix prefill_tbatch_size and variable placement issues	2025-04-17 10:15:05 +05:30
Jan Wassenberg	87a658b1c6	Minor cleanup, on-demand NUQ buffer allocation threading_context: add profiler compress-inl: add constexpr, on-demand alloc NUQ buffer gemma_py: model->gemma Move ScaleWeights to compress.cc Move PromptWrapping to configs.h PiperOrigin-RevId: 748347896	2025-04-16 10:49:43 -07:00
prajwalc22	8246e49199	Add non-interactive mode support. Added prompt flag to InferenceArgs for non-interactive mode; set user-facing options to verbosity level 1; fixed prompt_size declaration and variable ordering in run.cc; properly set prompt_size after WrapAndTokenize calls; moved kVerboseLogTokens block after prompt_size is set.	2025-04-16 16:26:52 +05:30
prajwalc22	cbf179990f	Add --prompt flag for non-interactive mode	2025-04-16 15:34:43 +05:30

406 Commits