* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
* [CANN] Support ELU and CONV_TRANSPOSE_1D
* [CANN]Modification review comments
* [CANN]Modification review comments
* [CANN]name adjustment
* [CANN]remove lambda used in template
* [CANN]Use std::func instead of template
* [CANN]Modify the code according to the review comments
---------
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
* ci: add visionOS build workflow
Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.
* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs
* ci: remove define hacks for u_xxx system types
---------
Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
This commit adds the --symlinks option to the zip command used to create
the xcframework zip file. This is necessary to create symlinks in the
zip file. Without this option, the Versions symlink is stored as a
regular directory entry in the zip file, rather than as a symlink in the
zip which causes the followig error in xcode:
```console
Couldn't resolve framework symlink for '/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current': readlink(/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current): Invalid argument (22)
```
Refs: https://github.com/ggml-org/llama.cpp/pull/11996#issuecomment-2727026377
This commit adds the fetch-depth: 0 option to the checkout action in the
build.yml workflow file (0 meaning that it fetches the complete
history). The default value is 1 when not specified which only fetches
the latest commit.
This is necessary to ensure that `git rev-list --count HEAD` counts the
total number of commits in the history. Currently because the default is
being used the name of the xcframework artifact is always
llama-b1-xcframework.
The commit add the name parameter to the upload-artifact action to
ensure that the artifact is uploaded with the correct name.
The motivation for this is that currently the uploaded xcframework
is named as llama-b1-xcframework.zip. With this change the name of this
artifact should contain the build number like the other artifacts.
* ci : remove xframework upload
This commit removes the upload of the xframework zip file as an
artifact.
The motivation for this change is that the xframework zip file is
currently being uploaded as part of strategy and will therefore be
attempted to be uploaded multiple times and will fail the build.
The uploading should be moved to somewhere else in the build to avoid
this.
* ci : add xcframework upload to macos-latest job
* llama : add xcframework build script
This commit adds a script to build an XCFramework for Apple
ios, macos, visionos, and tvos platforms.
The generated XCFramework can then be added to a project and used in
the same way as a regular framework. The llama.swiftui example project
has been updated to use the XCFramework and can be started using the
following command:
```console
$ open examples/llama.swiftui/llama.swiftui.xcodeproj/
```
Refs: https://github.com/ggml-org/llama.cpp/issues/10747
* examples : remove llama.cpp (source dir ref) from project.pbxproj
This commit removes the reference to llama.cpp from the project.pbxproj
file since Package.swift has been removed.
* ci : updated build.yml to use build-xcframework.sh
* ci : add xcframework build to github releases
This commit adds the ability to create a GitHub release with the
xcframework build artifact.
* scripts : add apple app validation scripts
This commit adds scripts that can validate the iOS, macOS, tvOS, and
VisionOS applications. The scripts create a simple test app project,
copy the llama.xcframework to the test project, build and archive the
app, create an IPA from the archive, and validate the IPA using altool.
The motivation for this is to provide some basic validation and
hopefully avoid having to manually validate apps in Xcode.
* llama : remove Package.swift
This commit removes the Package.swift file, as we are now building an
XCFramework for the project.
* llama : remove Sources and spm-headers directories
* llama : use TargetConditionals.h for visionOS/tvOS
This commit tries to address/improve an issue with the server tests
which are failing with a timeout. Looking at the logs it seems like
they are timing out after 12 seconds:
```
FAILED unit/test_chat_completion.py::test_completion_with_json_schema[False-json_schema0-6-"42"] - TimeoutError: Server did not start within 12 seconds
```
This is somewhat strange as in utils.py we have the following values:
```python
DEFAULT_HTTP_TIMEOUT = 12
if "LLAMA_SANITIZE" in os.environ or "GITHUB_ACTION" in os.environ:
DEFAULT_HTTP_TIMEOUT = 30
def start(self, timeout_seconds: int | None = DEFAULT_HTTP_TIMEOUT) -> None:
```
It should be the case that a test running in a github action should have
a timeout of 30 seconds. However, it seems like this is not the case.
Inspecting the logs from the CI job we can see the following environment
variables:
```console
Run cd examples/server/tests
2 cd examples/server/tests
3 ./tests.sh
4 shell: /usr/bin/bash -e {0}
5 env:
6 LLAMA_LOG_COLORS: 1
7 LLAMA_LOG_PREFIX: 1
8 LLAMA_LOG_TIMESTAMPS: 1
9 LLAMA_LOG_VERBOSITY: 10
10 pythonLocation: /opt/hostedtoolcache/Python/3.11.11/x64
```
This probably does not address the underlying issue that the servers
that are providing the models to be downloaded occasionally take a
longer time to response but might improve these situations in some
cases.
* vulkan: initial support for IQ1_S and IQ1_M quantizations
* vulkan: define MMV kernels for IQ1 quantizations
* devops: increase timeout of Vulkan tests again
* vulkan: simplify ifdef for init_iq_shmem
* musa: Update MUSA SDK version to rc3.1.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: Remove workaround in PR #10042
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* init version
* fix auto scroll
* bring back copy btn
* bring back thought process
* add lint and format check on CI
* remove lang from html tag
* allow multiple generations at the same time
* lint and format combined
* fix unused var
* improve MarkdownDisplay
* fix more latex
* fix code block cannot be selected while generating
* initial porting of previous LLG patch
* update for new APIs
* build: integrate llguidance as an external project
* use '%llguidance' as marker to enable llg lark syntax
* add some docs
* clarify docs
* code style fixes
* remove llguidance.h from .gitignore
* fix tests when llg is enabled
* pass vocab not model to llama_sampler_init_llg()
* copy test-grammar-integration.cpp to test-llguidance.cpp
* clang fmt
* fix ref-count bug
* build and run test
* gbnf -> lark syntax
* conditionally include llguidance test based on LLAMA_LLGUIDANCE flag
* rename llguidance test file to test-grammar-llguidance.cpp
* add gh action for llg test
* align tests with LLG grammar syntax and JSON Schema spec
* llama_tokenizer() in fact requires valid utf8
* update llg
* format file
* add $LLGUIDANCE_LOG_LEVEL support
* fix whitespace
* fix warning
* include <cmath> for INFINITY
* add final newline
* fail llama_sampler_init_llg() at runtime
* Link gbnf_to_lark.py script; fix links; refer to llg docs for lexemes
* simplify #includes
* improve doc string for LLAMA_LLGUIDANCE
* typo in merge
* bump llguidance to 0.6.12
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* vulkan: initial support for IQ3_S
* vulkan: initial support for IQ3_XXS
* vulkan: initial support for IQ2_XXS
* vulkan: initial support for IQ2_XS
* vulkan: optimize Q3_K by removing branches
* vulkan: implement dequantize variants for coopmat2
* vulkan: initial support for IQ2_S
* vulkan: vertically realign code
* port failing dequant callbacks from mul_mm
* Fix array length mismatches
* vulkan: avoid using workgroup size before it is referenced
* tests: increase timeout for Vulkan llvmpipe backend
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* release : pack /lib and /include in the packages
* cmake : put libs in /bin
* TMP : push artifacts
* Revert "TMP : push artifacts"
This reverts commit 4decf2c4df.
* ci : fix HIP cmake compiler options to be on first line
* ci : restore the original HIP commands
* ci : change ubuntu build from latest to 20.04
* ci : try to fix macos build rpaths
* ci : remove obsolete MacOS build
* TMP : push artifacts
* ci : change back to ubuntu latest
* ci : macos set build rpath to "@loader_path"
* ci : fix typo
* ci : change ubuntu package to 22.04
* Revert "TMP : push artifacts"
This reverts commit 537b09e70f.
This is a fork of linenoise that is C++17 compatible. I intend on
adding it to llama-run so we can do things like traverse prompt
history via the up and down arrows:
https://github.com/ericcurtin/linenoise.cpp
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
* ensure mul mat shaders work on systems with subgroup size less than 32
more fixes
add test
* only s_warptile_mmq needs to be run with 32 threads or more
* [cl][adreno] Add Adreno GPU support
Add new OpenCL backend to support Adreno GPUs
---------
Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
Co-authored-by: Alexander Angus <quic_aangus@quicinc.com>
Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com>
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
* [cl][ci] Add workflow for CL
* [cl][adreno] Fix memory leak for non SMALL_ALLOC path
* opencl: integrate backend dyn.load interface and fix compiler and format warnings
* opencl: remove small-alloc support and fix build errors for non-opencl platforms
* opencl: fixed merge conflict (MUSA added twice in cmake)
* opencl-ci: use RUNNER_TEMP instead of github.workspace
* opencl: fix embed tool invocation with python3
* opencl: CI workflow fixes
* opencl: Clean up small-alloc in CMake files
* opencl: cleanup ggml-opencl2 header file
* opencl: use ulong for offsets and strides in ADD kernel
* opencl: use cl_ulong for all offsets
* opencl: use cl_ulong for sizes and strides
* opencl: use `GGML_LOG_xxx` instead of `fprintf(stderr, ...)`
* opencl: rename backend `opencl2` -> `opencl`
* opencl: rename kernel files `ggml-opencl2` -> `ggml-opencl`
* opencl: make OpenCL required, remove redundant lib and inc directories
* `ggml-base`, `..` and `.` are added by `ggml_add_backend_library`
* opencl: rename backend - funcs, structs, etc `opencl2` -> `opencl`
* opencl: remove copyright marker since main license already covers
* opencl: replace some more OPENCL2 leftovers
* opencl: remove limits on `tensor_extra`
* opencl: use pools for `tensor_extra`
* opencl: fix compiler warnings with GCC and Clang
Still getting the warning about clCreateCmdQueue being obsolete.
Will fix that separately.
* opencl: fail gracefully if opencl devices are not available
Also for unsupported GPUs.
* opencl: fix MSVC builds (string length error)
* opencl: check for various requirements, allow deprecated API
* opencl: update log message for unsupported GPUs
---------
Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
Co-authored-by: Alexander Angus <quic_aangus@quicinc.com>
Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com>
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
* llama : use cmake for swift build
* swift : <> -> ""
* ci : remove make
* ci : disable ios build
* Revert "swift : <> -> """
This reverts commit d39ffd9556.
* ci : try fix ios build
* ci : cont
* ci : cont
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* hide buttons in dropdown menu
* use npm as deps manager and vite as bundler
* fix build
* fix build (2)
* fix responsive on mobile
* fix more problems on mobile
* sync build
* (test) add CI step for verifying build
* fix ci
* force rebuild .hpp files
* cmake: clean up generated files pre build
* server : replace behave with pytest
* fix test on windows
* misc
* add more tests
* more tests
* styling
* log less, fix embd test
* added all sequential tests
* fix coding style
* fix save slot test
* add parallel completion test
* fix parallel test
* remove feature files
* update test docs
* no cache_prompt for some tests
* add test_cache_vs_nocache_prompt
* sycl: Use syclcompat::dp4a
* Using the syclcompat version allow the compiler to optimize the
operation with native function
* Update news section
* Update CI Windows oneAPI version to 2025.0
* Reword doc
* Call syclcompat::dp4a inside dpct::dp4a
This reverts commit 90cb61d692.
* metal : opt-in compile flag for BF16
ggml-ci
* ci : use BF16
ggml-ci
* swift : switch back to v12
* metal : has_float -> use_float
ggml-ci
* metal : fix BF16 check in MSL
ggml-ci
* q6_k instruction reordering attempt
* better subtract method
* should be theoretically faster
small improvement with shuffle lut, likely because all loads are already done at that stage
* optimize bit fiddling
* handle -32 offset separately. bsums exists for a reason!
* use shift
* Update ggml-quants.c
* have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86
* server : added with_pieces functionality to /tokenize endpoint
* server : Add tokenize with pieces tests to server.feature
* Handle case if tokenizer splits along utf8 continuation bytes
* Add example of token splitting
* Remove trailing ws
* Fix trailing ws
* Maybe fix ci
* maybe this fix windows ci?
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* py : type-check all Python scripts with Pyright
* server-tests : use trailing slash in openai base_url
* server-tests : add more type annotations
* server-tests : strip "chat" from base_url in oai_chat_completions
* server-tests : model metadata is a dict
* ci : disable pip cache in type-check workflow
The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.
* py : fix new type errors from master branch
* tests : fix test-tokenizer-random.py
Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.
* ci : only show warnings and errors in python type-check
The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
* try to fix CUDA ci with --allow-unsupported-compiler
* trigger when build.yml changes
* another test
* try exllama/bdashore3 method
* install vs build tools before cuda toolkit
* try win-2019
* ggml: Added OpenMP for multi-threads processing
* ggml : Limit the number of threads used to avoid deadlock
* update shared state n_threads in parallel region
* clear numa affinity for main thread even with openmp
* enable openmp by default
* fix msvc build
* disable openmp on macos
* ci : disable openmp with thread sanitizer
* Update ggml.c
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Supercedes #4024 and #4813.
CMake's native HIP support has become the
recommended way to add HIP code into a project (see
[here](https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/cmake-packages.html#using-hip-in-cmake)).
This PR makes the following changes:
1. The environment variable `HIPCXX` or CMake option
`CMAKE_HIP_COMPILER` should be used to specify the HIP
compiler. Notably this shouldn't be `hipcc`, but ROCm's clang,
which usually resides in `$ROCM_PATH/llvm/bin/clang`. Previously
this was control by `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER`.
Note that since native CMake HIP support is not yet available on
Windows, on Windows we fall back to the old behavior.
2. CMake option `CMAKE_HIP_ARCHITECTURES` is used to control the
GPU architectures to build for. Previously this was controled by
`GPU_TARGETS`.
3. Updated the Nix recipe to account for these new changes.
4. The GPU targets to build against in the Nix recipe is now
consistent with the supported GPU targets in nixpkgs.
5. Added CI checks for HIP on both Linux and Windows. On Linux, we test
both the new and old behavior.
The most important part about this PR is the separation of the
HIP compiler and the C/C++ compiler. This allows users to choose
a different C/C++ compiler if desired, compared to the current
situation where when building for ROCm support, everything must be
compiled with ROCm's clang.
~~Makefile is unchanged. Please let me know if we want to be
consistent on variables' naming because Makefile still uses
`GPU_TARGETS` to control architectures to build for, but I feel
like setting `CMAKE_HIP_ARCHITECTURES` is a bit awkward when you're
calling `make`.~~ Makefile used `GPU_TARGETS` but the README says
to use `AMDGPU_TARGETS`. For consistency with CMake, all usage of
`GPU_TARGETS` in Makefile has been updated to `AMDGPU_TARGETS`.
Thanks to the suggestion of @jin-eld, to maintain backwards
compatibility (and not break too many downstream users' builds), if
`CMAKE_CXX_COMPILER` ends with `hipcc`, then we still compile using
the original behavior and emit a warning that recommends switching
to the new HIP support. Similarly, if `AMDGPU_TARGETS` is set but
`CMAKE_HIP_ARCHITECTURES` is not, then we forward `AMDGPU_TARGETS`
to `CMAKE_HIP_ARCHITECTURES` to ease the transition to the new
HIP support.
Signed-off-by: Gavin Zhao <git@gzgz.dev>
* logging: add proper checks for clang to avoid errors and warnings with VA_ARGS
* build: add CMake Presets and toolchian files for Windows ARM64
* matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings
* ci: add support for optimized Windows ARM64 builds with MSVC and LLVM
* matmul-int8: fixed typos in q8_0_q8_0 matmuls
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* matmul-int8: remove unnecessary casts in q8_0_q8_0
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Disable benchmark on forked repo
* only check owner on schedule event
* check owner on push also
* more readable as multi-line
* ternary won't work
* style++
* test++
* enable actions debug
* test--
* remove debug
* test++
* do debug where we can get logs
* test--
* this is driving me crazy
* correct github.event usage
* remove test condition
* correct github.event usage
* test++
* test--
* event_name is pull_request_target
* test++
* test--
* update ref checks
* convert.py: add python logging instead of print()
* convert.py: verbose flag takes priority over dump flag log suppression
* convert.py: named instance logging
* convert.py: use explicit logger id string
* convert.py: convert extra print() to named logger
* convert.py: sys.stderr.write --> logger.error
* *.py: Convert all python scripts to use logging module
* requirements.txt: remove extra line
* flake8: update flake8 ignore and exclude to match ci settings
* gh-actions: add flake8-no-print to flake8 lint step
* pre-commit: add flake8-no-print to flake8 and also update pre-commit version
* convert-hf-to-gguf.py: print() to logger conversion
* *.py: logging basiconfig refactor to use conditional expression
* *.py: removed commented out logging
* fixup! *.py: logging basiconfig refactor to use conditional expression
* constant.py: logger.error then exit should be a raise exception instead
* *.py: Convert logger error and sys.exit() into a raise exception (for atypical error)
* gguf-convert-endian.py: refactor convert_byteorder() to use tqdm progressbar
* verify-checksum-model.py: This is the result of the program, it should be printed to stdout.
* compare-llama-bench.py: add blank line for readability during missing repo response
* reader.py: read_gguf_file() use print() over logging
* convert.py: warning goes to stderr and won't hurt the dump output
* gguf-dump.py: dump_metadata() should print to stdout
* convert-hf-to-gguf.py: print --> logger.debug or ValueError()
* verify-checksum-models.py: use print() for printing table
* *.py: refactor logging.basicConfig()
* gguf-py/gguf/*.py: use __name__ as logger name
Since they will be imported and not run directly.
* python-lint.yml: use .flake8 file instead
* constants.py: logger no longer required
* convert-hf-to-gguf.py: add additional logging
* convert-hf-to-gguf.py: print() --> logger
* *.py: fix flake8 warnings
* revert changes to convert-hf-to-gguf.py for get_name()
* convert-hf-to-gguf-update.py: use triple quoted f-string instead
* *.py: accidentally corrected the wrong line
* *.py: add compilade warning suggestions and style fixes
* gguf-debug: Example how to use ggml callback for debugging
* gguf-debug: no mutex, verify type, fix stride.
* llama: cv eval: move cb eval field in common gpt_params
* ggml_debug: use common gpt_params to pass cb eval.
Fix get tensor SIGV random.
* ggml_debug: ci: add tests
* ggml_debug: EOL in CMakeLists.txt
* ggml_debug: Remove unused param n_batch, no batching here
* ggml_debug: fix trailing spaces
* ggml_debug: fix trailing spaces
* common: fix cb_eval and user data not initialized
* ci: build revert label
* ggml_debug: add main test label
* doc: add a model: add a link to ggml-debug
* ggml-debug: add to make toolchain
* ggml-debug: tests add the main label
* ggml-debug: ci add test curl label
* common: allow the warmup to be disabled in llama_init_from_gpt_params
* ci: add curl test
* ggml-debug: better tensor type support
* gitignore : ggml-debug
* ggml-debug: printing also the sum of each tensor
* ggml-debug: remove block size
* eval-callback: renamed from ggml-debug
* eval-callback: fix make toolchain
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>