gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	627cc04db9	Decouple MatMul from gemma-inl: precompile for all input types Call MatMulStatic instead of MatMul. Also fix build error due to Highway's Lanes not being constexpr. PiperOrigin-RevId: 763777269	2025-05-27 07:08:58 -07:00
Jan Wassenberg	421a2ab8ac	Add comments explaining non-padded tensors, kNoPad -> kPacked PiperOrigin-RevId: 763352173	2025-05-26 03:03:38 -07:00
RangerUFO	2771f463f9	Fix the ViT weights loading	2025-05-22 12:13:29 +08:00
RangerUFO	6debdbe341	Minor fixes for ViT	2025-05-20 22:27:10 +08:00
Jan Wassenberg	cb188d4a0e	Fix RowT issue and improve Griffin (currently still broken) Use type-safe MatPtrT via dynamic_cast, avoid/remove unsafe RowT activations: Griffin tensors are now padded Griffin: add batching support, fix conv1d_cache allocation weights: bundle to TensorToRead, add kNoPad flag, fix SplitW1 const-correct fix for ForEachTensor blob_store: move BlobIO2 to .cc and rename BlobIO PiperOrigin-RevId: 760610094	2025-05-19 07:02:10 -07:00
Jan Wassenberg	e890d46f30	1.31x batch prefill, 1.24x batch decode speedup: NUMA binding Only the weights; binding MatMul output worsens batch=1 prefill. Update gemma_batch_bench to use --decode_qbatch. Fix/remove prefill_activations in gemma-inl.h. Refactor: use BasePageBytes directly when binding Move BindB/C to .cc by de-templatizing Remove MatOwners::AllocateFor because it is weights-specific (binding or not) Disband MatOwners, replace with vector PiperOrigin-RevId: 759610477	2025-05-16 07:42:13 -07:00
Jan Wassenberg	c443adee33	3.8x speedup of weights loading via preadv on Linux Also move BlobReader reading functionality to weights.cc PiperOrigin-RevId: 759240310	2025-05-15 11:55:15 -07:00
Jan Wassenberg	38a08d8095	Replace last ConstMat with MatPtr This is to reduce the number of MatMul overloads in preparation for de-templatizing. PiperOrigin-RevId: 758288589	2025-05-13 10:55:22 -07:00
RangerUFO	30ad625f42	Fix the wrapping field of the deduced model config	2025-05-13 23:02:03 +08:00
Jan Wassenberg	8a312e9b89	Split W1/W2 as a load-time preprocess. Remove kOnlyAllocate - no longer used. Rename ReadOrAllocate -> ReadFromBlobs. Rename Reshape -> Fixup to reflect the new scope. Remove no longer used ShrinkRows. This simplifies gemma-inl and is a prerequisite for removing ConstMat (whose .ofs was previously used for merged tensors) PiperOrigin-RevId: 758214083	2025-05-13 07:39:59 -07:00
Jan Wassenberg	2038dfd9cc	Minor: rename compression/shared -> types.h PiperOrigin-RevId: 758199851	2025-05-13 06:53:21 -07:00
Jan Wassenberg	d538a6d6c6	Cleanup: remove unused kCyclic, remove 2 suffix Also remove now unused allocator arg and fix warnings (cast, struct/class mismatch) PiperOrigin-RevId: 758098495	2025-05-13 01:06:41 -07:00
Biruk Mammo	ba21e3beb4	Adds a `GemmaAttention` constructor that takes an explicit `ThreadingContext`. PiperOrigin-RevId: 757839682	2025-05-12 11:17:05 -07:00
Jan Wassenberg	45ad847a41	Replace RowVectorBatch with MatStorageT KVCache: add ctor required for MatStorageT, remove Create; bf_pre_ffw_rms_out -> pre_ffw_rms_out optimize_test: larger vocab_size requires more steps shared.h: Remove unused u128 type correctly set Activation matrix rows, avoid passing as arg ops: pass Mat instead of pointers/sizes; vectorize LayerNorm; support any weight type mat: add OverrideRows, used by SetBatchSize PiperOrigin-RevId: 757790736	2025-05-12 09:16:12 -07:00
Jan Wassenberg	252a4e955e	Remove support for Gemma 1 and PaliGemma 1 models, superseded by (Pali)Gemma 2. PiperOrigin-RevId: 756671308	2025-05-09 02:17:27 -07:00
Biruk Mammo	d834c07042	Exposes `GemmaAttention::DotSoftmaxWeightedSum` for experimentation. Also in this change: * The computation for a single `q` is factored out and exposed. * Strided `ConstMat` views into the KV caches are introduced to enable experimentation with various KV cache layouts. PiperOrigin-RevId: 756339313	2025-05-08 09:19:04 -07:00
The gemma.cpp Authors	20757046db	cleanup, new conversation methods, bugfixes - chore: unused parameters cleaned up - bugfix: explicitly use hwy::Span in GenerateInternal() to prevent runtime crashes due to memory layout incompatibility - bugfix: explicit nullptr check in LogDebug - chore: length-related parameters renamed for clarity - feature: SaveConversation() can be optionally used to save copy of a conversation that ResetConversation() will rewind to upon request, rather than just an empty KV cache - feature: GetCurrentConversation() can be used to query the current conversation's name PiperOrigin-RevId: 755873147	2025-05-07 08:52:44 -07:00
Jan Wassenberg	e9ecb7794d	Fix gcc build error and gemma3 crash, thanks @ufownl, fixes #551 PiperOrigin-RevId: 755729478	2025-05-07 00:59:18 -07:00
Jan Wassenberg	c8d92948f4	Move fields, io* and blob* from compression/ into io/ PiperOrigin-RevId: 755445712	2025-05-06 11:17:19 -07:00
Jan Wassenberg	275135d7e8	Rename-only: remove Allocator2 etc suffixes now that refactoring is complete PiperOrigin-RevId: 755397220	2025-05-06 09:12:43 -07:00
Jan Wassenberg	8d0882b966	Huge refactor of weight handling and model loading. Weight handling: - new ModelStore2 supports both pre-2025 multi-file and single-file formats - simpler ForEachTensor with TensorArgs - tensors are constructed with their full suffixed name I/O: - support mmap and stride - Simplified SbsWriter, single insert(); add SbsReader Misc: - kMockTokenizer: allow creating with unavailable tokenizer - configs.h: Simpler enum validity checks via kSentinel - matmul.h: remove unused enable_bind (now in allocator.h) - tensor_info: single TensorInfoRegistry class, rename from tensor_index.h Frontends: - Replace Allocate/CreateGemma with ctor(LoaderArgs, MatMulEnv&) - Deduce model/weight type, remove --model and parsing - Replace most common.h includes with configs.h - Remove --compressed_weights, use --weights instead - Remove ModelInfo, replaced by ModelConfig. Backprop: - Reduce max loss, remove backward_scalar_test (timeout) - Update thresholds because new RandInit changes rng eval order and thus numerics PiperOrigin-RevId: 755317484	2025-05-06 04:44:21 -07:00
Jan Wassenberg	160a5824fb	Cleanup: include fixes/comments, fix leak, vector reserve Also remove unused RowSpan configs.cc: Assign prompt wrapping to ModelConfig configs.h: simplify EnumValid via sentinel PiperOrigin-RevId: 750278497	2025-04-22 12:01:46 -07:00
The gemma.cpp Authors	ba10c88a94	Add C API and C# interop files This change adds a basic C API that allows access to Gemma functionality from other programming languages. The functionality is exposed via a shared library (DLL on Windows), with C++ interfaces and a basic C# interop wrapper included. To build the DLL, use the `windows-dll` preset, which includes the C and C++ sources as follows: `cmake --preset windows-dll`, then `cmake --build --config Release --preset windows-dll -j 4`. This should generate a `gemma.dll` in `<build-dir>/Release`. To build for non-Windows, the appropriate C++ DLL linking will need to be done to generate a shared library for the target OS. PiperOrigin-RevId: 750246272	2025-04-22 10:35:47 -07:00
prajwalc22	2407150f84	Merge branch 'feature-prompt-flag' of github.com:prajwalc22/gemma.cpp into feature-prompt-flag	2025-04-17 23:54:46 +05:30
prajwalc22	a9e56c27eb	removed unnecessary threading.h import	2025-04-17 23:44:23 +05:30
Prajwal Choudhari	09dfb144c0	Merge branch 'dev' into feature-prompt-flag	2025-04-17 18:53:28 +05:30
prajwalc22	f55c321397	Address review feedback: Fix prefill_tbatch_size and variable placement issues	2025-04-17 10:15:21 +05:30
prajwalc22	27c28cc938	Address review feedback: Fix prefill_tbatch_size and variable placement issues	2025-04-17 10:15:05 +05:30
Jan Wassenberg	87a658b1c6	Minor cleanup, on-demand NUQ buffer allocation threading_context: add profiler compress-inl: add constexpr, on-demand alloc NUQ buffer gemma_py: model->gemma Move ScaleWeights to compress.cc Move PromptWrapping to configs.h PiperOrigin-RevId: 748347896	2025-04-16 10:49:43 -07:00
prajwalc22	8246e49199	Add non-interactive mode support - Added prompt flag to InferenceArgs for non-interactive mode - Set user-facing options to verbosity level 1 - Fixed prompt_size declaration and variable ordering in run.cc - Properly set prompt_size after WrapAndTokenize calls - Moved kVerboseLogTokens block after prompt_size is set	2025-04-16 16:26:52 +05:30
prajwalc22	cbf179990f	Add --prompt flag for non-interactive mode	2025-04-16 15:34:43 +05:30
prajwalc22	f3116d2577	Add --prompt flag for non-interactive mode This change adds a --prompt command-line option that allows users to provide prompts directly without entering interactive mode, which is useful for scripting and automation.	2025-04-16 09:45:02 +05:30
The gemma.cpp Authors	7164a5e844	Internal change. PiperOrigin-RevId: 746953110	2025-04-12 20:27:49 -07:00
Jan Wassenberg	8532da47f7	Major refactor of allocator/args: use new ThreadingContext2 instead of monostate/init in each frontend Add ThreadingArgs(replaces AppArgs) backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride compress_weights: remove, moving to py-only exporter instead Move MatPtr to mat.h and revise interface: - Generic MatOwner - rename accessors to Packed* - support stride/row accessors, fix RowPtr stride Add TypeBits(Type) Move GenerateMat to test_util-inl for sharing between matmul test/bench Move internal init to gemma.cc to avoid duplication Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage Remove --compressed_weights, use --weights instead. tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img Allocator: use normal unique_ptr for AllocBytes so users can call directly threading: use -> because AlignedPtr no longer assumes arrays PiperOrigin-RevId: 745918637	2025-04-10 01:29:54 -07:00
Copybara-Service	bef91a3f03	Merge pull request #529 from ufownl:refactor/wrap_and_tokenize PiperOrigin-RevId: 745174371	2025-04-08 09:22:26 -07:00
Jan Wassenberg	4e6aa36e9b	Minor cleanup: enable 0,0 Extents2D, add SerializedSpan typedef, include fixes PiperOrigin-RevId: 745068776	2025-04-08 03:35:55 -07:00
RangerUFO	cc2e14e654	Improve `GemmaChatTemplate` to handle vision prompt wrapping	2025-03-29 11:31:40 +08:00
RangerUFO	c39295f497	Inline the ctor of `GemmaChatTemplate`	2025-03-29 11:31:40 +08:00
RangerUFO	d1615b56b2	Fix the prompt wrapping of gemma3-1b again It seems that the previous fix was changed back due to a merge error.	2025-03-29 11:31:39 +08:00
RangerUFO	ca4ee2b63f	Refactor `WrapAndTokenize` to work properly with Gemma3	2025-03-29 11:31:39 +08:00
RangerUFO	d42deaa27c	Set the secondary EOS for Gemma2 So that we can remove the `<end_of_turn>` filter that was set up specifically for Gemma2.	2025-03-22 01:32:22 +08:00
RangerUFO	2bad79f110	Fix the EOS checking The secondary eos is usually `<end_of_turn>`, which can appear in the prompt, so we can only check it not in the prompt.	2025-03-22 01:32:22 +08:00
Phil Culliton	05b1cce9f7	Add support for a secondary EOS token PiperOrigin-RevId: 738898976	2025-03-20 12:28:31 -07:00
Jan Wassenberg	83219e3c68	Add note on attention length and SFP PiperOrigin-RevId: 738698399	2025-03-20 00:39:06 -07:00
RangerUFO	b16ce9a0b4	Fix the prompt wrapping of gemma3-1b	2025-03-18 16:52:38 +08:00
Jan Wassenberg	1b72c22345	Refactor Gemma ctor and improve pool NUMA support Gemma receives a MatMulEnv arg, with comment on lifetime Split threading into topology so the latter can be used in allocator Add AllocClasses() for non-POD (ThreadPool) Support binding pool to NUMA node Update threading_test with latency measurements Also update Highway version. PiperOrigin-RevId: 736904748	2025-03-14 10:19:00 -07:00
Phil Culliton	1b1b63d560	Fix PaliGemma models. PiperOrigin-RevId: 736483021	2025-03-13 06:28:29 -07:00
Phil Culliton	4ab601da10	Internal change. PiperOrigin-RevId: 736015810	2025-03-11 23:20:20 -07:00
Phil Culliton	9d83ff202e	Internal change. PiperOrigin-RevId: 736014152	2025-03-11 23:10:48 -07:00
Jan Wassenberg	2bdf26d81d	Support bf16 output of Matmul Adds Stride to ConstMat, to support decompression of C output for test matmul_test: add line numbers to output Also ignore "N is not a multiple of nc" when N==nc PiperOrigin-RevId: 731096662	2025-02-25 17:53:20 -08:00
Jan Wassenberg	b3b4b9f92f	With new matmul, much larger batch sizes are advantageous, default to 256. Can still override via command line argument. PiperOrigin-RevId: 730502653	2025-02-24 10:21:58 -08:00
Jan Wassenberg	f9d93e4a42	Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning Remove empty matmul_unit_test. Up to 25 TFLOP/s on 2xZen4 for 512,3072,24576. PiperOrigin-RevId: 729123576	2025-02-20 08:33:46 -08:00
Apoorv Reddy	0e5b59d24d	Implements FusedSoftmaxAndSampleTopK. This computes softmax on the top-K logits, instead of computing softmax first and then getting top-K probs. So we end up avoiding renormalizing too. Additionally, modify softmax to do temperature scaling, if temp != 1.0 PiperOrigin-RevId: 727702149	2025-02-16 21:30:06 -08:00
Copybara-Service	c495b25995	Merge pull request #493 from ufownl:bugfix/compress_weights_le PiperOrigin-RevId: 725585921	2025-02-11 05:10:13 -08:00
Apoorv Reddy	64cf6dfe0a	Using TimingInfo methods and cleaning up args to DecodeStepT PiperOrigin-RevId: 725580125	2025-02-11 04:49:14 -08:00
Apoorv Reddy	780e376023	Add KVCache.DeepCopy(). Will be useful for implementing sampling functionality like beam sampling, parallel sampling, CoT Decoding (à la https://arxiv.org/abs/2402.10200) PiperOrigin-RevId: 725156316	2025-02-10 04:10:29 -08:00
Apoorv Reddy	9b3e7ea8a2	Factor out DecodeStepT from GenerateT into a separate function. This will be useful for adding sampling functionality like beam decoding, parallel sampling, cot decoding (as described in the [Chain-of-Thought Reasoning Without Prompting paper](https://arxiv.org/abs/2402.10200)) PiperOrigin-RevId: 725151530	2025-02-10 03:53:08 -08:00
RangerUFO	3a5a6dbcad	Fix the link error when building `compress_weights` with Clang on macOS	2025-02-09 00:13:25 +08:00
Jan Wassenberg	b18bd781f6	Windows build fixes: struct vs class, unused arg/var, avoid VLA, Deleter arg, casts PiperOrigin-RevId: 724340518	2025-02-07 07:38:55 -08:00
Phil Culliton	7ccc6abe87	Allow conversion, loading and inference with NUQ. PiperOrigin-RevId: 723507890	2025-02-05 07:45:54 -08:00
Daniel Keysers	bcdb0d65bd	Assorted small cleanups. PiperOrigin-RevId: 720548132	2025-01-28 06:09:45 -08:00
Daniel Keysers	e997468496	Apply PositionalEncodingQK always in-place. PiperOrigin-RevId: 718851803	2025-01-23 07:09:30 -08:00
Apoorv Reddy	ce807a31a1	internal change PiperOrigin-RevId: 718824952	2025-01-23 05:31:11 -08:00
Jan Wassenberg	a60b564b88	Infra improvements (2) ops.h: move CreateInvTimescale to allow calling without depending on gemma Pass around MatMulEnv instead of pools to avoid re-creating the env profiler.h can now be used outside SIMD code allocator: add StepBytes and QuantumSteps rename worker thread with package/cluster in the name threading: add Visit* to IndexRange PiperOrigin-RevId: 718766704	2025-01-23 01:55:19 -08:00
Daniel Keysers	f37402da57	Add parameter for base_frequency to CreateInvTimeScale(). Extract a few local variables to make code easier to read (hopefully). PiperOrigin-RevId: 718749053	2025-01-23 00:56:44 -08:00
Phil Culliton	9646edc908	Internal change PiperOrigin-RevId: 717916568	2025-01-21 07:53:49 -08:00
Jan Wassenberg	c4398fc72d	Infra improvements: allocator: support mmap, fixed Bind, add padding bench_matmul: Add PreventElision BUILD: add ops_test build target matmul.h: move ConstMat here; dynamic alloc of MatMulEnv matmul_test: remove benchmarking replace fprintf with HWY_WARN threading.cc: support splitting large clusters (disabled); package_idx->pkg_idx, smaller IndexRangePartition PiperOrigin-RevId: 717512274	2025-01-20 06:22:49 -08:00
Daniel Keysers	493688f6f1	Allow interactive use with new single-file weight format. Add section about new weights format to README.md. Remove model_type_required parameter. Update error handling for flags. PiperOrigin-RevId: 715788822	2025-01-15 07:22:33 -08:00
Ray Smith	b93231a47d	Moved the vit config fields to their own config struct PiperOrigin-RevId: 715692800	2025-01-15 01:09:49 -08:00
Ray Smith	9d40f0117e	Added ability to load/save a complete model file, including tokenizer. PiperOrigin-RevId: 707914366	2024-12-19 07:59:41 -08:00
Daniel Keysers	62c70d6715	Rename ModelTraining to PromptWrapping which is a more accurate name. PiperOrigin-RevId: 705881500	2024-12-13 07:45:59 -08:00
Ray Smith	6254f2e5ca	Removed duplicated tensor sizes from weights.h by changing the constructor used for MatPtrT PiperOrigin-RevId: 705085054	2024-12-11 06:30:28 -08:00
Daniel Keysers	aed17396be	Make prompt wrapping more consistent and fix duplicated tokens for multi-turn. Do not echo <end_of_turn> tokens to the user. Have verbosity=0 only show the dialog. PiperOrigin-RevId: 705021391	2024-12-11 01:52:00 -08:00
Ray Smith	e69bc3bc1c	Added the TensorInfo arg to the compressor so the shape and scale can be output correctly to the file in future. Corrected some errors in the TensorIndex. PiperOrigin-RevId: 705014619	2024-12-11 01:26:35 -08:00
Copybara-Service	d8135e836f	Merge pull request #460 from ericcurtin:common PiperOrigin-RevId: 704684454	2024-12-10 06:33:37 -08:00
Daniel Keysers	331d2ccc02	Add support for 448px resolution to PaliGemma and PaliGemma2. PiperOrigin-RevId: 704361579	2024-12-09 11:38:10 -08:00
Eric Curtin	a971088ac2	Refactor `gemma/common.cc` to improve readability and safety Use `std::size` for array size calculations. Replace C-style string manipulations with `std::string` methods. Simplify `std::transform` usage for case conversion. Signed-off-by: Eric Curtin <ecurtin@redhat.com>	2024-12-09 16:36:25 +00:00
Phil Culliton	9dfe2a76be	Internal change PiperOrigin-RevId: 702961613	2024-12-04 20:41:47 -08:00
Ray Smith	3d1625d8c5	Improved consistency of compressor API, and added a universal method with a target type arg. Moved configs pybind up to root level. PiperOrigin-RevId: 698743417	2024-11-21 05:27:40 -08:00
Ray Smith	73640d2521	Added tensor_index as a single source of truth on tensor shapes/sources and transformations PiperOrigin-RevId: 697903886	2024-11-19 00:25:39 -08:00
Ray Smith	7d685a267f	Added pybind for configs. Added ability to test configs for equality. PiperOrigin-RevId: 697572671	2024-11-18 04:03:51 -08:00
Daniel Keysers	719699f132	Make top_k a runtime argument (instead of a model argument). PiperOrigin-RevId: 696170691	2024-11-13 09:48:59 -08:00
Daniel Keysers	e54d9cbddd	Fix Griffin model: - use HalfRope position encodings - zero-initialize the caches for each Generate at position 0 The lack of the latter made the tests in gemma_test dependent on each other. PiperOrigin-RevId: 694509054	2024-11-08 08:30:53 -08:00
Jan Wassenberg	868b01601f	Simpler MatMul interface, vocab types, Tristate for use_spinning Add Extents2D, Range2D vocab types Matmul uses ConstMat for inputs and RowPtr for output Move RowVectorBatch to basics.h Separate threading.cc Fix topology string: report cores not LPs, and #HT Move QStride/IsMHA into LayerConfig ImageTokens does not require make_unique. matmul_test: no longer require template args PiperOrigin-RevId: 692963605	2024-11-04 07:48:29 -08:00
Daniel Keysers	583bd93e9a	Factor out addition of ViTConfig to a ModelConfig. Use ModelConfig values for ImageTokens. Output timing info for image token generation. Add a method to copy image data into Image class directly. Minor changes: pipe ModelTraining to more places. PiperOrigin-RevId: 690572283	2024-10-28 05:29:33 -07:00
Jan Wassenberg	02ce1e344f	Use NestedPools, add NUMA infra Improved threading.h, fix thread counts for single package/cluster systems Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92. Also fix benchmarks.cc build, update tensor allocator to Allocator PiperOrigin-RevId: 687307167	2024-10-18 08:11:18 -07:00
Daniel Keysers	c6384574db	Fix PaliGemma's GenerateImageTokensT(). Move image related config values from LayerConfig to ModelConfig. Minor changes: Add a few comments, remove gcpp:: qualification where it wasn't needed in a few places, define local constants in VitAttention.DotSoftmaxWeightedSum() PiperOrigin-RevId: 687210519	2024-10-18 01:34:13 -07:00
Ray Smith	0d68555f87	Eliminated TConfig. Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively. Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly. Adjusted WeightsWrapper and ForwardLayer etc to match. The only remaining template arg is the weight type. This enables all the instantiations to be deleted, apart from one per type. It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately. Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively. PiperOrigin-RevId: 686870060	2024-10-17 05:04:22 -07:00
Daniel Keysers	a4d6adbc43	Introduce QueryResult in GemmaEnv and add a shortcut for WrapAndTokenize. Remove max_tokens (and rely on only max_generated_tokens). PiperOrigin-RevId: 685662260	2024-10-14 04:45:21 -07:00
Daniel Keysers	5d0167904d	Fix PaliGemma model loading. PiperOrigin-RevId: 685591935	2024-10-13 23:48:55 -07:00
Jan Wassenberg	6ab3ff5bde	Minor cleanup, Windows+Bazel build fixes add app.h comment compress-inl: remove unused typedef gemma-inl: add missing HWY_ATTR and cast separate sum-inl.h and basics.h headers replace more hwy::bfloat16_t with BF16 update include pragmas update dot_test thresholds update Highway version in Bazel for HWY_RCAST_ALIGNED fix PiperOrigin-RevId: 684464326	2024-10-10 09:05:06 -07:00
Ray Smith	85958f5fd3	Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray. Definition of array size is moved to the constructor. Allocation is separate and parallelized. All users of weights_raw.h migrated to CompressedWeights and weights_raw.h deleted. Replaced all previous ForEachTensor functions with a single unified function. PiperOrigin-RevId: 684451604	2024-10-10 08:22:30 -07:00
Jan Wassenberg	2c28b18eb0	Add NestedPools: one per socket/cluster Use in dot_test app.h: add new flags and rename num_threads to max_threads matmul: Parallelize MatMulSlow and enable spinning, more large/fewer medium test cases PiperOrigin-RevId: 683216386	2024-10-07 09:40:19 -07:00
Jan Wassenberg	bd53b0f7c3	Fix MSAN issue for multiturn. Rewind the prior EOS token. Also move MaybeCheckInitialized to allocator.h PiperOrigin-RevId: 683187458	2024-10-07 08:07:54 -07:00
Ray Smith	895ee4c6ce	Moved Internal code around to simplify PiperOrigin-RevId: 681877329	2024-10-03 07:55:21 -07:00
Krzysztof Ostrowski	12291e1ac0	Internal change. PiperOrigin-RevId: 681583569	2024-10-02 14:03:34 -07:00
Krzysztof Ostrowski	b3239bf509	Internal change. PiperOrigin-RevId: 681530185	2024-10-02 11:33:06 -07:00
Jan Wassenberg	96d2ab7d31	Minor fix to profiler zone and add comment PiperOrigin-RevId: 681350546	2024-10-02 01:37:50 -07:00
Jan Wassenberg	7d9fcda0d8	-467ms startup: parallel Reshape Also split Softmax into Argmax helper, add comments; add profiler zones + fix IDE warning PiperOrigin-RevId: 680954573	2024-10-01 04:11:35 -07:00
Jan Wassenberg	2d14d796e3	1.09x decode speedup for topk=1/temp0: fuse softmax and sample PiperOrigin-RevId: 680589099	2024-09-30 08:37:41 -07:00

337 Commits