Commit Graph

143 Commits

Author SHA1 Message Date
Clint Herron b7f24dba00 Separating test calls to give more helpful stack traces on failure. Adding verbose messages to give visibility for what is being tested. 2024-04-29 14:05:30 -04:00
Clint Herron 7b04c215eb Fixing build when C++17 is not present. 2024-04-29 14:05:30 -04:00
Clint Herron 8f17815ca8 Add slightly more complex grammar with quantifiers to test references with quantifiers. 2024-04-29 14:05:30 -04:00
Clint Herron ab606c030f Add tests around quantifiers to ensure both matching and non-matching compliance. 2024-04-29 14:05:30 -04:00
Clint Herron 9cd07c2f9d Cleaning up integration tests to share code between tests and make it simpler to add new tests. 2024-04-29 14:05:30 -04:00
Georgi Gerganov f4ab2a4147
llama : fix BPE pre-tokenization (#6920)
* merged the changes from deepseeker models to main branch

* Moved regex patterns to unicode.cpp and updated unicode.h

* Moved header files

* Resolved issues

* added and refactored unicode_regex_split and related functions

* Updated/merged the deepseek coder pr

* Refactored code

* Adding unicode regex mappings

* Adding unicode regex function

* Added needed functionality, testing remains

* Fixed issues

* Fixed issue with gpt2 regex custom preprocessor

* unicode : fix? unicode_wstring_to_utf8

* lint : fix whitespaces

* tests : add tokenizer tests for numbers

* unicode : remove redundant headers

* tests : remove and rename tokenizer test scripts

* tests : add sample usage

* gguf-py : reader prints warnings on duplicate keys

* llama : towards llama3 tokenization support (wip)

* unicode : shot in the dark to fix tests on Windows

* unicode : first try custom implementations

* convert : add "tokenizer.ggml.pre" GGUF KV (wip)

* llama : use new pre-tokenizer type

* convert : fix pre-tokenizer type writing

* lint : fix

* make : add test-tokenizer-0-llama-v3

* wip

* models : add llama v3 vocab file

* llama : adapt punctuation regex + add llama 3 regex

* minor

* unicode : set bomb

* unicode : set bomb

* unicode : always use std::wregex

* unicode : support \p{N}, \p{L} and \p{P} natively

* unicode : try fix windows

* unicode : category support via std::regex

* unicode : clean-up

* unicode : simplify

* convert : add convert-hf-to-gguf-update.py

ggml-ci

* lint : update

* convert : add falcon

ggml-ci

* unicode : normalize signatures

* lint : fix

* lint : fix

* convert : remove unused functions

* convert : add comments

* convert : exercise contractions

ggml-ci

* lint : fix

* cmake : refactor test targets

* tests : refactor vocab tests

ggml-ci

* tests : add more vocabs and tests

ggml-ci

* unicode : cleanup

* scripts : ignore new update script in check-requirements.sh

* models : add phi-3, mpt, gpt-2, starcoder

* tests : disable obsolete

ggml-ci

* tests : use faster bpe test

ggml-ci

* llama : more prominent warning for old BPE models

* tests : disable test-tokenizer-1-bpe due to slowness

ggml-ci

---------

Co-authored-by: Jaggzh <jaggz.h@gmail.com>
Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>
2024-04-29 16:58:41 +03:00
Tristan Druyen abd3314064
llama : add phi 3 chat template (#6857)
* Add phi 3 chat template & tests

* test : fix chat template result

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-24 11:52:37 +03:00
Wouter 7dbdba5690
llama : add llama-3 chat template (#6751)
* Added llama-3 chat template

* Update llama.cpp

Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>

* Update llama.cpp

Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>

* Update tests/test-chat-template.cpp

Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>

* Added EOS stop sequence according to https://github.com/ggerganov/llama.cpp/pull/6751#issuecomment-2065602862

* Removed adding of BOS token before first message

* Removed bos token from expected output from llama-3

* Update tests/test-chat-template.cpp

Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>

* Update tests/test-chat-template.cpp

Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>

* Added <|end_of_text|> as another stop token

* Reverted last change of adding the end_of_text stop word for llama 3

---------

Co-authored-by: Wouter Tichelaar <tichelaarw@spar.net>
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 16:03:39 +03:00
slaren 0d56246f4b
ggml : group all experts in a single ggml_mul_mat_id (#6505)
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy

* cuda : fix bin bcast with non-cont src0

* test-backend-ops : only run all mul mat tests for base types

* llama : disable moe offloading with SYCL

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-18 15:18:48 +02:00
Shijie f4dea7da18
llama : add qwen2moe (#6074)
* support qwen2moe

* fix-review

* metal : support unary ops for nelements % 4 != 0

* metal : require contiguousness for float4 unary kernels

* metal : require contiguousness for float4 unary kernels (cont)

* fix-review

* names : for brevity "SHARED_EXP" -> "SHEXP"

* llama : reuse build_moe_ffn()

* llama : add model type name

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-16 18:40:48 +03:00
Olivier Chafik 7593639ce3
`main`: add --json-schema / -j flag (#6659)
* main: add --json-schema / -j

* json: move json-schema-to-grammar to common lib

* json: fix zig build
2024-04-15 18:35:21 +01:00
Chao Jiang 04fbc5f23e
Add Command R chat template (#6650)
* Add chat template for command-r model series

* Fix indentation

* Add chat template test for command-r models and update the implementation to trim whitespaces

* Remove debug print
2024-04-14 18:16:34 +02:00
Olivier Chafik ab9a3240a9
JSON schema conversion: ️ faster repetitions, min/maxLength for strings, cap number length (#6555)
* json: rename python schema converter to make import easier

* server: skip null json_schema / grammar fields

* json: deps management for primitive rules (+ allow null values)

* json: optimize repetitions for minItems/maxItems and regexps: `a{,3}` goes from `"a"? "a"? "a"?` (explosive combos) to `(a (a (a)?)?)?`

* grammars: add troubleshooting section to readme

* json: cap length of numbers to 15 digits before/after decimal point

(avoids infinite gen, e.g. "one third" -> `0.333333333333...`)

* json: unify all repetition code (w/ or w/o sep)

* json: support string minLength/maxLength

* server+json: update server/README w/ result_format

* nits

* json: fix type error w/ python 3.8

* json: fix server/README (json_schema in /completion vs. result_format in /v1/chat/completions)

* json: simplify DOT `{"type": "string", "pattern": "^.$"}`

* json: remove recursion in opt_repetitions (avoids Python stack overflow)

* json: rm dead code

* json: rm useless assert & ggml.h import
2024-04-12 19:43:38 +01:00
slaren fbbc030ba9
metal : unify mul_mv_id kernels (#6556) 2024-04-12 18:13:20 +02:00
Olivier Chafik cbaadc9294
grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) (#6609)
* grammars: reserve rejects & next candidates

* grammars: reuse new_stacks

* grammars: fix missing sig change in llama.h

* grammars: fix test (api changed)

* grammars: update gbnf-validator.cpp

* grammars: simpler syntax (no swap)
2024-04-11 19:47:34 +01:00
Clint Herron 57dd02c44b
Tests: Added integration tests for GBNF parser (#6472)
* Added integration tests for GBNF parser to validate correctness of parsing, as well as correctness of string matching. Intended for use to pin behavior while working on performance improvements.

* Fixing whitespace errors and cleaning error message alert to be clearer.

* Removing hacky include to llama.cpp from grammar integration test now that needed functions are available via internal API.

* Comment cleanup.

* Reorganizing tests for readability.

* Cleaning up debug message to make a bit more sense.
2024-04-06 10:31:33 -04:00
kaizau 1ff4d9f3d6
Add OpenChat, Alpaca, Vicuna chat templates (#6397)
* Add openchat chat template

* Add chat template test for openchat

* Add chat template for vicuna

* Add chat template for orca-vicuna

* Add EOS for vicuna templates

* Combine vicuna chat templates

* Add tests for openchat and vicuna chat templates

* Add chat template for alpaca

* Add separate template name for vicuna-orca

* Remove alpaca, match deepseek with jinja output

* Regenerate chat template test with add_generation_prompt

* Separate deepseek bos from system message

* Match openchat template with jinja output

* Remove BOS token from templates, unprefix openchat
2024-04-03 17:24:31 +02:00
slaren 08a0c02060
ggml : mul_mat_id use the same tensor for all the experts (#6387)
* ggml : update mul_mat_id to use the same tensor for all the experts

* update cuda

* minor

* update metal

* update test-backend-ops

* fix cuda

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update convert.py

* update convert-hf-to-gguf.py

* update convert.py for mixtral hf models

* Update convert-hf-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cuda : support non-pow-2 number of experts

* allow quantize to work for split and merged experts models in the same way

* cleanup + disable mmap automatically with split tensors models

* update imatrix

* test-backend-ops : test qwen argsort

* update grok model loading

* llama : add merged experts tensors to the grok tensor map

* minor

* gguf : bump version

* fix quantizing of merged experts

* convert-hf-to-gguf.py : update grok (untested)

* make linter happy

* cuda/argsort : use shared memory instead of pool memory

* convert : fix grok tensor names

* metal : add support for non-pow-2 argsort

* llama : more loader cleanup, better error checking

* cuda : fix warning

* llama : still use mmap for loading old models, but copy the data to a host buffer

* add review note

* llama : remove ffn tensor counting + add sanity check

ggml-ci

* convert : fix handling of n_experts == None

ggml-ci

* imatrix : fix ncall counters

* llama : produce error if imatrix size does not match

* quantize : terminate on errors + trace logs

ggml-ci

* metal : pad shared memory to 16 bytes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-03 16:07:05 +03:00
Kawrakow 55c1b2a3bb
IQ1_M: 1.75 bpw quantization (#6302)
* iq1_m: basics

* iq1_m: basics-2

* iq1_m: CUDA dequantize works

Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.

* iq1_m: separate shifts for each group of 8 in a block

We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105

Not bad, but slightly higher than
  sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
 PPL = 9.14 for LLaMA-v2-7B
 PPL = 6.63 for LLaMA-v2-13B

* iq1_m: go to 3-bit scales

There is slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.

We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw

* iq1_m: scalar dot product

* iq1_m: AVX2 dot product

* iq1_m: very slightly faster AVX2 dot product

* iq1_m: ARM_NEON dot product

Works, but very slow (10.5 t/s)

* iq1_m: Metal - dequantize works, dot product does not

* iq1_m: Metal now works

About the same performance as iq1_s.

* iq1_m: minor

* iq1_m: checking pure iq1_m quantization

It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.

* iiq1_m: slightly faster ARM_NEON dot product

10.5 t/s -> 11.65 t/s

* iq1_m: faster ARM_NEON dot product

11.65 t/s -> 14.9 t/s

* iq1_m: another minor ARM_NEON dot product improvement

14.9 -> 15.0 t/s

* iq1_m: small PPL improvement via super-block scale adjustment

After quantizing block scales redo the super-block scale fit.

PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B  ) = 8.1624

* iq1_m: adapt to CUDA refactoring

* iq1_m: remove unused variable

We have progressed to warnings being errors.

* iq1_m: add to backend-ops tests

* iq1_m: fix Windows ARM

* iq1_m: use common definition of iq1m_scale_t

* cuda: assert -> NO_DEVICE_CODE

* iq1_M: PR comments

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-26 15:21:27 +01:00
Kawrakow 1f2fd4e727
tests : include IQ2_XXS and IQ2_XS in test-quantize-fns (#6303)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-25 19:33:15 +02:00
Olivier Chafik f77a8ffd3b
tests : conditional python & node json schema tests (#6207)
* json: only attempt python & node schema conversion tests if their bins are present

Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978
disabled in https://github.com/ggerganov/llama.cpp/pull/6198

* json: orange warnings when tests skipped

* json: ensure py/js schema conv tested on ubuntu-focal-make

* json: print env vars in test
2024-03-22 15:09:07 +02:00
Olivier Chafik 72114edf06
json-schema-to-grammar : fix order of props + non-str const/enum (#6232)
* json: ordered json in server/schema converter to respect orig order

* json: ws nits

* json: support non-string const / enums
2024-03-22 15:07:44 +02:00
Georgi Gerganov 95d576b48e
metal : pad n_ctx by 32 (#6177)
* metal : require ne00 >= 128 for mat-mat kernels

ggml-ci

* llama : pad n_ctx by 32

ggml-ci
2024-03-22 09:36:03 +02:00
Georgi Gerganov 924ce1dce7
tests : disable system() calls (#6198)
ggml-ci
2024-03-21 16:20:05 +02:00
Olivier Chafik 5b7b0ac8df
json-schema-to-grammar improvements (+ added to server) (#5978)
* json: fix arrays (disallow `[,1]`)

* json: support tuple types (`[number, string]`)

* json: support additionalProperties (`{[k: string]: [string,number][]}`)

* json: support required / optional properties

* json: add support for pattern

* json: resolve $ref (and support https schema urls)

* json: fix $ref resolution

* join: support union types (mostly for nullable types I think)

* json: support allOf + nested anyOf

* json: support any (`{}` or `{type: object}`)

* json: fix merge

* json: temp fix for escapes

* json: spaces in output and unrestricted output spaces

* json: add typings

* json:fix typo

* Create ts-type-to-grammar.sh

* json: fix _format_literal (json.dumps already escapes quotes)

* json: merge lit sequences and handle negatives

{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}

* json: handle pattern repetitions

* Update json-schema-to-grammar.mjs

* Create regex-to-grammar.py

* json: extract repeated regexp patterns to subrule

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* json: handle schema from pydantic Optional fields

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* Update ts-type-to-grammar.sh

* Update ts-type-to-grammar.sh

* json: simplify nullable fields handling

* json: accept duplicate identical rules

* json: revert space to 1 at most

* json: reuse regexp pattern subrules

* json: handle uuid string format

* json: fix literal escapes

* json: add --allow-fetch

* json: simplify range escapes

* json: support negative ranges in patterns

* Delete commit.txt

* json: custom regex parser, adds dot support & JS-portable

* json: rm trailing spaces

* Update json-schema-to-grammar.mjs

* json: updated server & chat `( cd examples/server && ./deps.sh )`

* json: port fixes from mjs to python

* Update ts-type-to-grammar.sh

* json: support prefixItems alongside array items

* json: add date format + fix uuid

* json: add date, time, date-time formats

* json: preserve order of props from TS defs

* json: port schema converter to C++, wire in ./server

* json: nits

* Update json-schema-to-grammar.cpp

* Update json-schema-to-grammar.cpp

* Update json-schema-to-grammar.cpp

* json: fix mjs implementation + align outputs

* Update json-schema-to-grammar.mjs.hpp

* json: test C++, JS & Python versions

* json: nits + regen deps

* json: cleanup test

* json: revert from c++17 to 11

* json: nit fixes

* json: dirty include for test

* json: fix zig build

* json: pass static command to std::system in tests (fixed temp files)

* json: fix top-level $refs

* json: don't use c++20 designated initializers

* nit

* json: basic support for reserved names `{number:{number:{root:number}}}`

* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)

* json: re-ran server deps.sh

* json: simplify test

* json: support mix of additional props & required/optional

* json: add tests for some expected failures

* json: fix type=const in c++, add failure expectations for non-str const&enum

* json: test (& simplify output of) empty schema

* json: check parsing in test + fix value & string refs

* json: add server tests for OAI JSON response_format

* json: test/fix top-level anyOf

* json: improve grammar parsing failures

* json: test/fix additional props corner cases

* json: fix string patterns (was missing quotes)

* json: ws nit

* json: fix json handling in server when there's no response_format

* json: catch schema conversion errors in server

* json: don't complain about unknown format type in server if unset

* json: cleaner build of test

* json: create examples/json-schema-pydantic-example.py

* json: fix date pattern

* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common

* json: indent 4 spaces

* json: fix naming of top-level c++ function (+ drop unused one)

* json: avoid using namespace std

* json: fix zig build

* Update server.feature

* json: iostream -> fprintf

* json: space before & refs for consistency

* json: nits
2024-03-21 11:50:43 +00:00
Xuan Son Nguyen aab606a11f
llama : add Orion chat template (#6066) 2024-03-15 10:44:57 +02:00
slaren d8fd0ccf6a
test-backend-ops : skip CPU backend by default (#6028) 2024-03-13 15:58:30 +02:00
Georgi Gerganov 83796e62bc
llama : refactor unicode stuff (#5992)
* llama : refactor unicode stuff

ggml-ci

* unicode : names

* make : fix c++ compiler

* unicode : names

* unicode : straighten tables

* zig : fix build

* unicode : put nfd normalization behind API

ggml-ci

* swift : fix build

* unicode : add BOM

* unicode : add <cstdint>

ggml-ci

* unicode : pass as cpts as const ref
2024-03-11 17:47:47 +02:00
Georgi Gerganov 5b09797321
ggml : remove old quantization functions (#5942)
* ggml : remove old quantization functions

ggml-ci

* ggml : simplify ggml_quantize_chunk

ggml-ci

* ggml : restrict correctness

ggml-ci

* ggml : remove hist data from the quantization API

ggml-ci

* tests : remove hist usage in test-backend-ops

ggml-ci

* vulkan : remove hist and fix typo
2024-03-09 15:53:59 +02:00
Georgi Gerganov 2c4f566c88
tests : gitignore ggml-common.h 2024-03-09 14:17:11 +02:00
leejet 7d43c585dc
add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)
* cuda: fix group_norm

* cuda: add batch inference support for ggml_pad/ggml_upscale

* add ggml_arrange

* add ggml_timestep_embedding

* update ggml_arange/ggml_timestep_embedding tests

* cuda: fix im2col

* add ggml_arange/ggml_timestep_embbeding support for metal backend

* fix some bugs

* fix some bugs

* Update ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* modify according to the review comments

* ggml : fix compile warnings + code style

* ggml : normalize compute_forward calls + fix seg fault in debug

* minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-03-04 10:39:10 +02:00
Kawrakow 0becb22ac0
IQ4_XS: a 4.25 bpw quantization (#5747)
* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-27 16:34:24 +02:00
Kawrakow a33e6a0d2a
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (#5721)
* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-26 18:28:38 +02:00
Georgi Gerganov ab336a9d5e
code : normalize enum names (#5697)
* coda : normalize enum names

ggml-ci

* code : cont

* code : cont
2024-02-25 12:09:09 +02:00
Kawrakow 4c4cb30736
IQ3_S: a much better alternative to Q3_K (#5676)
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* Resurrecting iq3_xs

After all the experimentation, nothing was better than this.

* Minor PPL improvement via a block scale fudge factor

* Minor improvement via 3 neighbours

* iq3_xs: working scalar and AVX2 dot products

* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)

* iq3_xs: working Metal implementation

* Adding IQ3_M - IQ3_XS mix with mostly Q4_K

* iiq3_xs: a 3.4375 bpw variant

* iq3_xs: make CUDA work for new version

* iq3_xs: make scalar and AVX2 work for new version

* iq3_s: make ARM_NEON work with new version

* iq3_xs: make new version work on metal

Performance is very similar to Q3_K_S

* iq3_xs: tiny Metal speed improvement

* iq3_xs: tiny Metal speed improvement

* Fix stupid warning

* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS

* iq3_xs: rename to iq3_s

* iq3_s: make tests pass

* Move Q3_K_XS mix to 3.25 bpw

* Attempt to fix failing tests

* Another attempt to fix the Windows builds

* Attempt to fix ROCm

* ROCm again

* iq3_s: partial fix for QK_K = 64

* iq3_s: make it work on metal for QK_K = 64

Pleasent surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.

* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-24 16:23:52 +02:00
Xuan Son Nguyen 373ee3fbba
Add Gemma chat template (#5665)
* add gemma chat template

* gemma: only apply system_prompt on non-model message
2024-02-22 19:10:21 +01:00
Xuan Son Nguyen a46f50747b
server : fallback to chatml, add AlphaMonarch chat template (#5628)
* server: fallback to chatml

* add new chat template

* server: add AlphaMonarch to test chat template

* server: only check model template if there is no custom tmpl

* remove TODO
2024-02-22 10:33:24 +02:00
Kawrakow a14679cc30
IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* iq4_nl: Fix after merging with master

* iq4_nl: another fix after merging with master

* Use IQ4_NL instead of Q4_K when using k-quants is not possible

* Fix typo that makes several tests fail

* It was the ggml_vdotq thing missed inside the brackets

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-21 11:39:52 +02:00
Xuan Son Nguyen 11b12de39b
llama : add llama_chat_apply_template() (#5538)
* llama: add llama_chat_apply_template

* test-chat-template: remove dedundant vector

* chat_template: do not use std::string for buffer

* add clarification for llama_chat_apply_template

* llama_chat_apply_template: add zephyr template

* llama_chat_apply_template: correct docs

* llama_chat_apply_template: use term "chat" everywhere

* llama_chat_apply_template: change variable name to "tmpl"
2024-02-19 10:23:37 +02:00
Herman Semenov 5d3de51f97
ggml, common, examples, tests : fixed type arguments in printf (#5528) 2024-02-18 18:20:12 +02:00
Kawrakow bd2d4e393b
1.5 bit quantization (#5453)
* iq1_s: WIP basics

* iq1_s: CUDA is working

* iq1_s: scalar CPU dot product

* iq1_s: WIP AVX2 dot product - something is not right

* Fix tests

* Fix shadow warnings

* Fix after merge with latest master

* iq1_s: AVX2 finally works

* iq1_s: ARM_NEON dot product. Works, but not very fast

* iq1_s: better grid

* iq1_s: use IQ2_XXS for attn_output

At a cost of 0.04 extra bpw this gives a big improvement in PPL.

* iq1_s: Metal basics

Dequantize works, but not dot product

* iq1_s: Metal works, but quite slow

As usual, Apple Silicon does not like the code I write.

* iq1_s: Tests

* iq1_s: slightly faster dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-02-18 18:16:55 +02:00
Georgi Gerganov 8f1be0d42f
ggml : add ALiBi support for ggml_soft_max_ext (#5488)
* ggml : avoid recomputing alibi slopes (CPU)

* llama : reuse hparams.f_max_alibi_bias in all cases

ggml-ci

* ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal)

ggml-ci

* ggml : handle all SRCs (do not break on first null)

ggml-ci

* tests : do not use slope for large soft_max

accumulates too much error

ggml-ci

* ggml : alternative ALiBi without extra tensor

We compute the slopes in the kernel

ggml-ci

* cuda : add ALiBi support in ggml_soft_max_ext

ggml-ci

* ggml : deprecate ggml_alibi

* ggml : support multi-sequence ALiBi (Metal)

ggml-ci

* cuda : add multi-seq ALiBi + remote F16 soft_max

ggml-ci

* ggml : update deprecation message

* ggml : fix pos ptr when no ALiBi

ggml-ci

* cuda : fix performance (pow -> powf)

* cuda : precompute ALiBi constants

* metal : pre-compute ALiBi slopes

ggml-ci

* llama : init kq_pos only if needed

ggml-ci

* test-backend-ops : add null pos test to soft_max

test-backend-ops : replace soft_max tests

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-17 23:04:16 +02:00
bmwl f486f6e1e5
ggml : add numa options (#5377)
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverted Makefile

* Fixed include

* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables

* removed trailing whitespace

* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h

* Reverting Makefile

* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet

* Removing MIRROR_MODE code for this PR

* Removing last bit of MIRROR_MODE code for this PR

* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static

* Fixed lingering init_llama_backend() bool calls in tests and examples

* Remote enum llama_numa_strategies

* Revert bad merge with dynatemp flags

* add missing enum ggml_numa_strategies declaration and revert sync problem with master

* add missing enum ggml_numa_strategies declaration

* fixed ggml_init_numa variable

* Update ggml.h

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges

* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples

* Fix up some boolean vs enum comparisons

* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype

* Update ggml.h

Align enum values

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml.c

Remove whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml.c

align paremeters

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/server/server.cpp

remove whitespace and align brace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/common.cpp

Remove whitespace and align brace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example

* Update ggml.c

simplified return for platforms without NUMA support

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* removed redundant else from cli argument processing of --numa

* whitespace

---------

Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2024-02-16 11:31:07 +02:00
Georgi Gerganov cf45252a7c
tests : multi-thread the tokenizer tests (#5474)
* tests : multi-thread the tokenizer tests

ggml-ci

* unicode : fix data race for unidentified codepoints

ggml-ci

* unicode : minor style fixes

ggml-ci
2024-02-13 15:14:22 +02:00
Georgi Gerganov 99b8b43d7b
tests : disable moe test (#5473) 2024-02-13 11:20:24 +02:00
snadampal a07d0fee1f
ggml : add mmla kernels for quantized GEMM (#4966)
* ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q8_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_0_q8_0 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm

armv8.2-a and above supports MMLA instructions that have higher
throughput than DOT. this commit adds mmla kernel for
q4_1_q8_1 gemm. The feature is enabled if the platform supports
"__ARM_FEATURE_MATMUL_INT8"

On AWS Graviton3 processors this kernel resulted up to 1.5x
improvement for prompt evaluation throughput compared to the
default sdot kernel.

* ggml: update unit tests for the new vec_dot interface

* llama.cpp: add MATMUL_INT8 capability to system_info
2024-02-11 15:22:33 +02:00
Johannes Gäßler 26d4efd11e
sampling: fix top_k <= 0 (#5388)
* sampling: fix top_k <= 0

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-08 09:46:30 +01:00
Georgi Gerganov 8504d2d0da
tests : .gitignore obj files 2024-02-08 09:46:47 +02:00
Michael Klimenko 52bb63c708
refactor : switch to emplace_back to avoid extra object (#5291) 2024-02-03 13:23:37 +02:00
JidongZhang-THU 15606309a0
llava : add MobileVLM support (#5132)
* New Feature:
    1. Sum_Rows:
        fix cuda kernel overflow
        fix block shape error when nrows too big
    2. Im2Col:
        Support Batch in cuda
        Support f32 to f32 both in cpu && cuda
    3. DepthWiseConv:
        Support by Im2Col && MulMat
    4. Pool_2d:
        Supoort avg pooling in cuda
    5. HardSigmoid:
        Imp in cuda
    6. HardSwish:
        Imp in cuda

* fix tabs instead of spaces

* code clean

* CUDA POOL2D

* ADD POOL2D test case in test-backend-ops.cpp

* code clean

* fix pool2d_kernel

nits

* fix bug in pool2d kernel

* fix avg pooling, count_include_pad

nits

* test-backend-ops : add more pool_2d tests

* cuda : fix warnings and formatting

* ggml : check types in release builds too in pool_2d

* test-backend-ops : remove f16 pool_2d tests

* cuda : more style fixes

* Add assert in ggml_cuda_op_pool2d

* pool2d float padding fallback

* test-backend-ops : add dst_type to im2col

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-01-31 15:10:15 +02:00