Commit Graph

  • cc1d256cff Update CMakePresets.json Hitesh K V 2025-10-16 12:08:29 +0530
  • 9b6ed1a58f gemma_batch_bench: generate more unique prompts Jan Wassenberg 2025-10-15 15:45:27 -0700
  • 503aaddd65 Add 8-bit integer quantization (I8Stream) to Gemma.cpp. Phil Culliton 2025-10-15 09:24:38 -0700
  • ee18916abf Removed the PROFILER_ZONE from the most highly called functions to reduce the overhead. Ray Smith 2025-10-15 07:09:32 -0700
  • e3e8511e79 Initialization of profiler zones. Ray Smith 2025-10-15 03:05:30 -0700
  • fb6fa793f4 Added a global (to gemma) zones list to enable most call sites to PROFILER_ZONE3 to avoid the synchronization required for the static const initialization of the zone handle. Improved flash_attention to enable profiling using the new zones. Ray Smith 2025-10-14 08:30:23 -0700
  • 3e9bb7df80 Update README.md Hitesh K V 2025-10-10 11:33:09 +0530
  • 035273c184 tune pool kSpin mode in threading_context Jan Wassenberg 2025-10-07 08:35:44 -0700
  • 9dc802c7aa Add logging to io.cc on failed write and read. Nitin Gangahar 2025-10-06 10:25:07 -0700
  • 684a0444e9 Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines Ray Smith 2025-10-02 08:14:37 -0700
  • 277f396710 Reduced parallelism for TransposeQ, making each thread read and write within its own cache lines Ray Smith 2025-10-02 05:00:19 -0700
  • 14244664c8 Avoid transposing Q when it isn't needed Ray Smith 2025-10-02 05:16:03 -0700
  • fe5a39990e Improve FlashAttention threading: Jan Wassenberg 2025-10-02 02:36:29 -0700
  • 6098a022b3 Increased parallelism for RMSNormAndPositionalEncoding Ray Smith 2025-10-01 07:10:40 -0700
  • 2f6cbde8ff Added a smaller tile size to flash attention for smaller batch sizes Ray Smith 2025-09-30 05:48:50 -0700
  • 4974f24832 Fixed bug with softcap in single flash attention Ray Smith 2025-09-30 02:17:18 -0700
  • 16536996d1 Remove less useful spammy log lines. Nitin Gangahar 2025-09-29 02:28:04 -0700
  • 667a3f117a Utilize multiple cores to read weight batches. Nitin Gangahar 2025-09-26 11:27:56 -0700
  • d15731d201 Used hn::BroadcastLane instead of Set(..., x.raw) Ray Smith 2025-09-25 09:41:30 -0700
  • 4f0c633248 (1) Added QueryResultAndMetrics and BatchQueryModelWithMetrics to also return TimingInfo besides query results. Charles Zhao 2025-09-23 17:01:56 -0700
  • fac8aac4cb Internal change Jan Wassenberg 2025-09-22 05:36:32 -0700
  • 501fdf000e Remove no longer used MatVec Jan Wassenberg 2025-09-19 09:02:44 -0700
  • b603425bf3 Fix batch inference: dangling reference Jan Wassenberg 2025-09-16 08:01:21 -0700
  • f3bc1c17da 1.03x speedup: fused FFN Jan Wassenberg 2025-09-15 10:25:59 -0700
  • 59db30e209 add const restriction for benchmark_helper.cc and paligemma_helper.cc to remove a few unnecessary copies. Charles Zhao 2025-09-14 16:26:55 -0700
  • c9b8479f7d Added zero-initialization to att_out. Re-enabled flash attention when HWY_NATIVE_DOT_BF16 is not available. Ray Smith 2025-09-12 07:47:36 -0700
  • 2695aab5d2 Temporarily disable flash pending msan fix Jan Wassenberg 2025-09-10 07:25:07 -0700
  • ba6131311a Fix gemma_batch_bench for flash attention Jan Wassenberg 2025-09-10 05:32:03 -0700
  • 9457258330 Refactor MatMul to accept views in the kernel functions Jan Wassenberg 2025-09-09 22:09:09 -0700
  • f10ac41a20 Added flash attention, with both a single-q function, and a register-tiled function. The register-tiled version achieves a speed-up by a factor of about 9.7 over the previous attention function on an AVX3-enabled machine. Ray Smith 2025-09-09 08:04:45 -0700
  • 24b1760f03 Refactor: move Worker to ThreadingContext, factor out MMDecompress Jan Wassenberg 2025-09-09 07:55:39 -0700
  • 461a9c7d1b Matmul refactoring towards fusion Jan Wassenberg 2025-09-09 07:13:03 -0700
  • 34ceee6c30 Update MatMul comments, removing mention of partial. Jan Wassenberg 2025-09-09 05:56:57 -0700
  • a5ab99e4ba Memory use reduction: smaller/single MMStorage Jan Wassenberg 2025-09-09 05:32:20 -0700
  • 06e5da1e22 Cleanup: split CacheInfo from Allocator, MatMul helper functions Jan Wassenberg 2025-09-08 02:23:29 -0700
  • 6e52a835c6 Faster startup on tsan: use hierarchical parallelism for BF16 conversion Jan Wassenberg 2025-09-07 22:50:01 -0700
  • cbe24eac51 1.15x speedup: parallel sampling, enabled by new RNG Jan Wassenberg 2025-09-05 07:23:33 -0700
  • ad7d7a2713 Further adjust dot_test threshold (numerics) Jan Wassenberg 2025-09-05 05:49:35 -0700
  • 2b4c16e243 Remove Griffin support Jan Wassenberg 2025-09-05 02:34:54 -0700
  • 56186193c1 Replace mt19937 with new generator to enable parallel sampling Jan Wassenberg 2025-09-04 23:48:37 -0700
  • 5d1693e806 Internal change Jan Wassenberg 2025-09-04 10:30:42 -0700
  • afd82376a5 Add AES-CTR RNG for parallel sampling (not yet used) Jan Wassenberg 2025-09-04 05:58:08 -0700
  • 4be4799727 Remove kMaxPackages and per-package-related code Jan Wassenberg 2025-09-04 03:32:35 -0700
  • 7263ab8445 MatMul simplification, threading strategy improvements Jan Wassenberg 2025-09-03 21:44:39 -0700
  • 74ffe079c4 Create separate MMStorage objects per cluster. Marie White 2025-09-03 09:35:13 -0700
  • c783b82a82 Internal change Phil Culliton 2025-09-03 08:35:20 -0700
  • b7b3d353db Simplify MatMul: remove F32 special case (build time) Jan Wassenberg 2025-09-02 04:28:49 -0700
  • 1e3c853e80 Add ParallelFor wrapper function and one new mode Jan Wassenberg 2025-09-02 01:39:28 -0700
  • 3737224132 Add in-cluster parallel policy. Update policy to include cluster_idx. Marie White 2025-09-02 00:14:05 -0700
  • 27cb8e12d9 Handle non-threading parallel policy. Marie White 2025-09-02 00:02:18 -0700
  • 0d2e74d74a Add MMOptions as an argument to Matmul. Marie White 2025-09-01 23:46:07 -0700
  • 229bd078a1 1.29x speedup: bf16 C1/C2. Extend most ops to any type, expand test coverage. Jan Wassenberg 2025-09-01 06:32:24 -0700
  • bc0c0bac8b Add non-threading parallel policy. Marie White 2025-08-29 08:38:19 -0700
  • 00b70f69c5 Include parallelism type in DoMatMul. Also remove package handling. Marie White 2025-08-29 08:04:05 -0700
  • 0ae8646731 Fix remainder handling for Paligemma Jan Wassenberg 2025-08-29 07:25:14 -0700
  • 973e284ed6 Refactor Matmul to use a policy class for parallelization. Marie White 2025-08-29 05:40:06 -0700
  • 6c39a2dea4 1.01x speedup: More bf16 activations to reduce DecompressA. Jan Wassenberg 2025-08-29 03:18:28 -0700
  • 7288891439 Remove F64 partial storage in matmul. Jan Wassenberg 2025-08-29 00:11:31 -0700
  • 31c09cca4c f32 LoopKC: 1.37x (M=512), 1.19x (M=128) single-K F32,BF16 matmul speedup on SKX Jan Wassenberg 2025-08-28 08:55:15 -0700
  • 98ddc166db Expand ThreadingContext comments Jan Wassenberg 2025-08-28 08:31:25 -0700
  • 6128e758ff Change ffw_out from B16 to F32. Marie White 2025-08-28 00:01:01 -0700
  • 85cc51795c Internal change. The gemma.cpp Authors 2025-08-26 08:07:23 -0700
  • 5411fd846d Minor: batched NotifyGenerate, fix comment/dep Jan Wassenberg 2025-08-26 23:32:43 -0700
  • 86afd53076 1.04x speedup: Parallelize SoftCap Jan Wassenberg 2025-08-26 11:54:48 -0700
  • ed2f0bd1b0 Fix pos assertions, refs #665 Jan Wassenberg 2025-08-26 04:50:06 -0700
  • 9bf0fe4e37 Internal change Jan Wassenberg 2025-08-26 04:43:26 -0700
  • d3a5ddf657 Merge pull request #663 from junjihashimoto:feature/api-server Jan Wassenberg 2025-08-24 11:57:05 +0200
  • 73f1140dca Fix an off-by-one error after StreamAndUpdateEOS() to remove the MSAN warning about reading an uninitialized variable in the kv_cache. Rhett Stucki 2025-08-20 22:59:24 -0700
  • 41321611fd feature: add API server and client with Google protocol Junji Hashimoto 2025-08-20 11:05:09 +0900
  • 41a86d41a9 Fix preadv error: only enable if we have a handle Jan Wassenberg 2025-08-15 06:30:07 -0700
  • 78573b6718 Internal change. Add deduction for 270M. Phil Culliton 2025-08-14 08:04:10 -0700
  • d044801c1d Internal change Phil Culliton 2025-08-13 09:47:05 -0700
  • 71406cf6d0 More profiler interface fixes: hwy:: plus avoid ADD_ZONE Jan Wassenberg 2025-08-13 03:15:07 -0700
  • faa4102992 (Resubmit) Prepare profiler annotations for new API Jan Wassenberg 2025-08-13 01:37:53 -0700
  • a2d9133f7d Prepare profiler annotations for new API The gemma.cpp Authors 2025-08-11 17:51:09 -0700
  • 4cbf63e6f0 Prepare profiler annotations for new API Jan Wassenberg 2025-08-11 15:34:20 -0700
  • eef564e8f0 Prepare profiler annotations for new API Jan Wassenberg 2025-08-08 16:50:54 -0700
  • 2e9c93a609 Merge pull request #649 from KaranocaVe:main Copybara-Service 2025-08-08 10:35:57 -0700
  • 33fbac0880 Exporter updates/fixes Jan Wassenberg 2025-08-04 22:35:59 -0700
  • 4e062d68f7 Update BlobWriter comments, WriteAll->Finalize Jan Wassenberg 2025-08-04 10:00:54 -0700
  • 701841897b Default to disabling per-socket parallelization Jan Wassenberg 2025-08-04 09:48:22 -0700
  • b56b2f05e4 Automated Code Change Ivo Ristovski List 2025-08-01 13:29:16 -0700
  • eaf05cd04e Merge 6dd1cd277f into 799c264df3 copybara-service[bot] 2025-08-01 20:11:15 +0000
  • 6dd1cd277f Automated Code Change The gemma.cpp Authors 2025-07-11 05:32:57 -0700
  • 799c264df3 Pre-tune thread pool before matmul Jan Wassenberg 2025-07-31 08:44:47 -0700
  • 32286f0465 Merge branch 'dev' into main KaranocaVe 2025-07-31 22:40:56 +0800
  • 50ee1a3e92 Write SBS progressively. Charles Zhao 2025-07-31 06:05:02 -0700
  • 0ea118ebbe Update run.cc, CMakeLists and README for incompatible code, dependency changes and argument updates KaranocaVe 2025-07-31 00:59:16 +0800
  • 8715eda512 Improved layer idx parsing Jan Wassenberg 2025-07-30 05:49:13 -0700
  • d831ddce5b Fix file mapping: was letting the smart pointer go out of scope Jan Wassenberg 2025-07-30 04:29:27 -0700
  • 2141d4788d Add IsAppendOnly flag to file and if true, disable parallel writes Jan Wassenberg 2025-07-30 01:51:08 -0700
  • d22ba2ac96 Update layer index parsing and allow tokenizer override Jan Wassenberg 2025-07-30 01:21:54 -0700
  • d1638587f0 1.14x batch decode speedup: parallelize RMSNorm ops Jan Wassenberg 2025-07-30 00:54:55 -0700
  • ac0d751d20 Rename GetModelConfig->Config Jan Wassenberg 2025-07-29 10:17:14 -0700
  • 33fabd4ed1 Internal change. Jeremiah Harmsen 2025-07-29 08:20:36 -0700
  • e76e29ce11 De-singleton ThreadingContext so callers can pass in their own Jan Wassenberg 2025-07-22 02:07:58 -0700
  • 5474146129 Back to f32 kv_cache, but via typedef Jan Wassenberg 2025-07-21 07:04:55 -0700
  • 56c9196eb6 Add blob_path to config deduction message Jan Wassenberg 2025-07-11 18:58:16 -0700
  • 349c86f2d9 Fix bench_matmul perf regression: A input should be padded Jan Wassenberg 2025-07-11 07:35:52 -0700
  • 4bc44d5678 Minor: ModelWeightsPtrs -> WeightsPtrs Jan Wassenberg 2025-07-11 06:10:51 -0700