Commit Graph

11 Commits

Author SHA1 Message Date
netrunnereve ae0b5ea7ae oops 2024-04-29 23:07:52 -04:00
netrunnereve 8916954a82 merge 2024-04-29 22:17:12 -04:00
Justine Tunney 4b1c3c98b4
llamafile : use 64-bit integers in sgemm (#6928) 2024-04-26 17:05:33 +03:00
Eve fb80f13cd4
Update sgemm.cpp 2024-04-25 04:03:29 +00:00
netrunnereve 063a31f7a8 sse load 2024-04-24 23:00:02 -04:00
netrunnereve dee9566dc7 reduce 256 to 128 (and back!) conversions 2024-04-24 00:22:38 -04:00
netrunnereve 9facb0f07a combine denibble with load 2024-04-23 23:46:49 -04:00
netrunnereve 257391aae3 style 2024-04-22 23:48:07 -04:00
netrunnereve 86d1d84642 basic avx implementation 2024-04-22 23:35:02 -04:00
Justine Tunney 192090bae4
llamafile : improve sgemm.cpp (#6796)
* llamafile : improve sgemm.cpp

- Re-enable by default
- Fix issue described in #6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` and `n` edge cases

* Address review comments

* Help clang produce fma instructions

* Address review comments
2024-04-22 22:00:36 +03:00
Justine Tunney 8cc91dc63c
ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leapfrogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alder Lake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla, which has given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 21:55:30 +03:00
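
The "vectorized outer product kernels" mentioned in the 8cc91dc63c message can be illustrated with a scalar sketch. This is not the code from sgemm.cpp: the function name `tiled_gemm`, the tile sizes `RM` and `RN`, and the memory layout (C = Aᵀ·B, with each row of A and each column of B stored as `k` contiguous floats, and C stored column-major) are assumptions chosen for illustration. The idea it shows is that a small tile of C is held in local accumulators, and each step along `k` adds one outer product of a slice of A with a slice of B into that tile, so every loaded value is reused several times. The 64-bit indices echo the later 4b1c3c98b4 commit title, but the exact reasoning behind that commit is not stated in this log.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical tile sizes (assumption): a 4x4 block of C kept in accumulators.
constexpr int64_t RM = 4;
constexpr int64_t RN = 4;

// Illustrative sketch only, not the sgemm.cpp implementation.
// Computes C[ldc*j + i] = sum over l of A[lda*i + l] * B[ldb*j + l]
// for 0 <= i < m and 0 <= j < n.
void tiled_gemm(int64_t m, int64_t n, int64_t k,
                const float *A, int64_t lda,
                const float *B, int64_t ldb,
                float *C, int64_t ldc) {
    assert(lda >= k && ldb >= k && ldc >= m);
    for (int64_t i0 = 0; i0 < m; i0 += RM) {
        for (int64_t j0 = 0; j0 < n; j0 += RN) {
            // Edge tiles are smaller when m or n is not a multiple of the tile size.
            const int64_t mi = std::min(RM, m - i0);
            const int64_t nj = std::min(RN, n - j0);
            float acc[RN][RM] = {};  // accumulator tile, zero-initialized
            for (int64_t l = 0; l < k; ++l) {
                // Rank-1 (outer product) update of the tile for this value of l.
                for (int64_t j = 0; j < nj; ++j) {
                    for (int64_t i = 0; i < mi; ++i) {
                        acc[j][i] += A[lda * (i0 + i) + l] * B[ldb * (j0 + j) + l];
                    }
                }
            }
            // Write the finished tile back to C once.
            for (int64_t j = 0; j < nj; ++j) {
                for (int64_t i = 0; i < mi; ++i) {
                    C[ldc * (j0 + j) + (i0 + i)] = acc[j][i];
                }
            }
        }
    }
}
```

A real kernel would replace the two inner loops with SIMD fused multiply-add instructions and keep the accumulator tile in vector registers; the scalar form above only shows the loop structure and the handling of ragged edge tiles.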