gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	c4398fc72d	Infra improvements: allocator: support mmap, fixed Bind, add padding bench_matmul: Add PreventElision BUILD: add ops_test build target matmul.h: move ConstMat here; dynamic alloc of MatMulEnv matmul_test: remove benchmarking replace fprintf with HWY_WARN threading.cc: support splitting large clusters (disabled); package_idx->pkg_idx, smaller IndexRangePartition PiperOrigin-RevId: 717512274	2025-01-20 06:22:49 -08:00
Ray Smith	9d40f0117e	Added ability to load/save a complete model file, including tokenizer. PiperOrigin-RevId: 707914366	2024-12-19 07:59:41 -08:00
Ray Smith	6254f2e5ca	Removed duplicated tensor sizes from weights.h by changing the constructor used for MatPtrT PiperOrigin-RevId: 705085054	2024-12-11 06:30:28 -08:00
Ray Smith	e69bc3bc1c	Added the TensorInfo arg to the compressor so the shape and scale can be output correctly to the file in future. Corrected some errors in the TensorIndex. PiperOrigin-RevId: 705014619	2024-12-11 01:26:35 -08:00
Jan Wassenberg	f74d496879	Threading/infra improvements. * Add ParallelizeRange helpers and partitioning helpers Refactor Pinning class, store original affinity (required to construct another NestedPools after pinning happened) Compress: * prevent Compress printing stats in tests * zero-pad tensors Matmul: * add matmul_unit_test (TODO) and bench_matmul * matmul_test: change norm to row vectors (that is what is added) and include bf16 rounding error * Prepare for L2/L3 retrieval PiperOrigin-RevId: 700603811	2024-11-27 01:12:00 -08:00
Ray Smith	73640d2521	Added tensor_index as a single source of truth on tensor shapes/sources and transformations PiperOrigin-RevId: 697903886	2024-11-19 00:25:39 -08:00
Jan Wassenberg	868b01601f	Simpler MatMul interface, vocab types, Tristate for use_spinning Add Extents2D, Range2D vocab types Matmul uses ConstMat for inputs and RowPtr for output Move RowVectorBatch to basics.h Separate threading.cc Fix topology string: report cores not LPs, and #HT Move QStride/IsMHA into LayerConfig ImageTokens does not require make_unique. matmul_test: no longer require template args PiperOrigin-RevId: 692963605	2024-11-04 07:48:29 -08:00
RangerUFO	7d313aaade	Fix compilation errors of "compress_weights" target	2024-10-19 21:30:30 +08:00
Jan Wassenberg	02ce1e344f	Use NestedPools, add NUMA infra Improved threading.h, fix thread counts for single package/cluster systems Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92. Also fix benchmarks.cc build, update tensor allocator to Allocator PiperOrigin-RevId: 687307167	2024-10-18 08:11:18 -07:00
Ray Smith	0d68555f87	Eliminated TConfig. Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively. Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly. Adjusted WeightsWrapper and ForwardLayer etc to match. The only remaining template arg is the weight type. This enables all the instantiations to be deleted, apart from one per type. It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately. Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively. PiperOrigin-RevId: 686870060	2024-10-17 05:04:22 -07:00
Ray Smith	85958f5fd3	Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray. Definition of array size is moved to the constructor. Allocation is separate and parallelized. All users of weights_raw.h migrated to CompressedWeights and weights_raw.h deleted. Replaced all previous ForEachTensor functions with a single unified function. PiperOrigin-RevId: 684451604	2024-10-10 08:22:30 -07:00
Jan Wassenberg	13a9f76f64	Fix mismatch between blob_store and compress interfaces (bytes) PiperOrigin-RevId: 673027268	2024-09-10 10:59:17 -07:00
Jan Wassenberg	8c0a8834c1	Major compression update, arbitrary-len unpack + new Dot Compression: * Implement {any packed} x {bf16, f32} 'Load2' and DecompressAndZeroPad * New compression test for all packed formats, add to GEMMA_TEST_FILES, remove from sfp/nuq_test * Decompress->DecompressAndZeroPad, use PackedSpan for args with bounds checking * NUQ: support arbitrary-length enc/dec * New compression/shared, remove sfp.h and nuq.h * Move Store2 into Traits and provide Compress2 wrapper * Remove unused Decompress()-with-pool overload * Simplify CompressedArrayLen, rename to CompressedArrayElements * Remove unused DistortionStats b_l1_ Misc: * Add compensated and Kahan dot, support any length * Use same Dot function everywhere * Move exact arithmetic functions into fp_arith * use FloatPtr and MatPtr typedefs in tests; less stack usage * Rename args to packed/raw * Remove Traits::Name, instead TypeName<T>() * Move kMaxSFP and kClusters/kGroupSize into Sfp/NuqStream PiperOrigin-RevId: 672868468	2024-09-10 02:22:19 -07:00
Jan Wassenberg	5c0da8c8c3	Minor cleanup/fixes: - optimize_test simplify prompt check - Fix SFP arg case - Fix includes - Align inputs in test - IsInside: add DASSERT - Fix PerClusterPool NumThreads PiperOrigin-RevId: 672530385	2024-09-09 06:58:09 -07:00
Jan Wassenberg	c29e9752c7	Refactor/cleanup, remove even_odd * New compression/shared.h, remove sfp.h * Remove unused DistortionStats b_l1_ * Move exact arithmetic functions into fp_arith * Remove even_odd optimization for MatVec (mostly unused) * use BF16 typedef more widely * Add kMaxSFP constant PiperOrigin-RevId: 670996386	2024-09-04 09:25:13 -07:00
Jan Wassenberg	8e028632f7	0.98x prefill: refactor in prep for cache blocking. Slower because we now init tiles of C and accumulate into them. Also remove unused var in optimize_test and use BF16 typedef. PiperOrigin-RevId: 662115916	2024-08-12 09:26:29 -07:00
Daniel Keysers	e87e65ca45	Add scale parameter to MatMul. Add accessor to CompressedArray that asserts the scale is 1 and use it. PiperOrigin-RevId: 653604840	2024-07-18 06:58:56 -07:00
Daniel Keysers	ff34370aac	Simplify FFW by using MatMul_4x4_Batch_Add. Affects only the griffin model, where prefill TPS improves by about 70%. PiperOrigin-RevId: 652878176	2024-07-16 09:41:23 -07:00
Jan Wassenberg	a0e808e341	Add compression/ comments, especially on SFP range PiperOrigin-RevId: 642238720	2024-06-11 05:47:49 -07:00
Jan Wassenberg	5c3e5f7038	Remove no longer required stats.h - use Highway version instead PiperOrigin-RevId: 640440379	2024-06-05 01:37:48 -07:00
Paul Chang	c0643577c3	Minor internal refactoring. PiperOrigin-RevId: 635852078	2024-05-21 10:29:59 -07:00
Jan Wassenberg	e9a0caed87	Further improve IO, enable multiple backends without -D. Move Path into io.h and use for opening files. Removes dependency of gemma_lib on args. Separate Windows codepath instead of emulating POSIX functions. Plus lint fixes. PiperOrigin-RevId: 626279004	2024-04-19 00:40:29 -07:00
Jan Wassenberg	a982ec1287	Move code to gemma/ so we can remove error-prone copybara: comments. Also fix includes and Lint warnings. PiperOrigin-RevId: 623127487	2024-04-09 04:45:42 -07:00
Luca Versari	4c23932289	Improve weight handling. - Allow scaling of SFP weights - Allow using uncompressed weights - Do not try to compress weights in the main model calls - Reduce code duplication in weight handling with some macros Co-authored-by: Eugene Kliuchnikov <eustas@google.com> Co-authored-by: Thomas Fischbacher <tfish@google.com> Co-authored-by: Zoltan Szabadka <szabadka@google.com>	2024-04-06 11:08:47 +02:00
Austin Huang	e29cd566cf	initial commit	2024-02-21 03:31:22 +00:00

25 Commits