hongruichen
54b3021e0c
Merge branch 'master' into dev-refactoring
2025-05-27 10:21:33 +08:00
Xuan-Son Nguyen
e16c4731c7
ggml : fix the order of ggml_unary_op ( #13718 )
2025-05-23 08:12:48 +02:00
Xuan-Son Nguyen
cf4cb59e64
ggml : add ggml_gelu_erf() ( #13667 )
...
* ggml : add ggml_gelu_na (not approximated)
* fix naming order
* rename na --> erf
* apply review suggesions
* revert naming order
2025-05-21 16:26:33 +02:00
Johannes Gäßler
6c35981a64
mnist: fix segmentation fault (ggml/1227)
2025-05-19 13:29:56 +03:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
David Huang
7f323a589f
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B ( #13386 )
2025-05-11 14:18:39 +02:00
hongruichen
aca70692d3
Merge branch 'master' into dev-refactoring
2025-05-08 10:11:26 +08:00
Johannes Gäßler
2356fb1d53
CUDA: fix bad asserts for partial offload ( #13337 )
2025-05-06 13:58:51 +02:00
Johannes Gäßler
9070365020
CUDA: fix logic for clearing padding with -ngl 0 ( #13320 )
2025-05-05 22:32:13 +02:00
Diego Devesa
4254bb4951
ggml : fix ggml_gallocr_ptr type (ggml/1205)
2025-05-01 09:58:44 +03:00
Johannes Gäßler
69699be48a
CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 ( #13137 )
2025-04-28 09:29:26 +02:00
SXX
77d5e9a76a
ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs ( #13107 )
...
* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion
* move fp converter to ggml-cpu
* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32
2025-04-26 16:05:31 +02:00
Radoslav Gerganov
553a5c3a9f
rpc : do not wait for response when sending RPC_CMD_SET_TENSOR ( #12943 )
...
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.
The performance impact of this change depends on the network latency.
2025-04-25 10:08:08 +03:00
Acly
c6e8cc28c1
ggml : Depthwise 2D convolution (ggml/1152)
...
* ggml-cpu : kernels for faster depthwise 2D convolution
* fix compile: remove static after moving to ops.cpp
* add dilation for depthwise_conv_2d
* review: rename to ggml_conv_2d_dw_direct, remove redundant struct keywords, pass by ref, whitespace
* review: rename depthwise_conv_2d -> conv_2d_dw everywhere
2025-04-24 17:32:47 +03:00
hongruichen
a0e54cfc70
Merge branch 'master' into dev-refactoring
2025-04-24 21:33:23 +08:00
nullname
beff5c4b78
feat: op perf opt ( #38 )
...
* add op define xml
* copy qnn libs in cmake
* fix htp skel path
* add windows copy file list
* wip
* add generated package
* remove unused params
* add cmake list
* set qnn sdk and hexagon sdk path
* wip
* wip
* fix tools version
* fix compiling error
* fix dims calc
* wip
* add mulmat 2d
* wip
* reduction
* wip
* wip
* fix compiling error in x64
* wip
* fix device description in emulator
* wip
* add flag
* copy necessary libs
* wip
* load HtpPrepare first for emulator
* enable custom op for 2d matrix
* verify op config before add to node
* Revert "verify op config before add to node"
This reverts commit 206dec826e560625e053c4c78e023994f993526e.
* wip
* wip
* wip
* revert tool version change
* use hexagon sdk version 5.5.0
https://docs.qualcomm.com/bundle/publicresource/topics/80-77512-2/release-notes-wrapper.html?product=1601111740010422#5.5.0
* wip
* move to sub dir
* add hexagon npu device and server lib
* fix npu lib build
* refactoring: rename QNNBackend enum
* fix compiling error
* wip
* remove qnn/backend.hpp
* add hexagon dsp host layer
* extract rpc_mem from qnn submodule
* fix dsp compiling error
* wip
* wip
* open and lose npu device
* split objects into separated files
* fix linking error
* add npu_tensor
* add host graph
* map rpc buffer before usage
* fix some todos
* add shared module
* split rpc_interface from rpc_mem
* get get_dsp_arch from device
* wip
* rename host classes
* fix hexagon sdk arch getter
* fix device open
* fix linking error
* fix crash
* use tensor_data_type
* fix npu lib crash
* fix debug log print
* skip empty graph
* wip
* add log
* fix unmap fail
* fix tensor set
* remove some logs
* flush back memory after finished
* fix nb
* wip
* wip
* add helper function
* impl add op
* fix some add in test-backend-ops
* add elt wise sub and mul
* fix crash on some inplace op
* wip
* fix elt wise op calc
* wip
* split mul_mat into file
* add caps array
* wip
* wip
* print support/unsupport op
* copy lldb-server for newer android sdk
* add tensor_spec
* add assert
* fix crash when loading model
* rename cmake option
* fix name
* fix device memory and description
* fix compiling error on qnn only build
* fix some potential UBs
* fix comments
2025-04-21 12:06:16 +08:00
Radoslav Gerganov
2db9ba1464
rpc : add RPC_CMD_HELLO ( #12955 )
...
Add RPC_CMD_HELLO for getting the version of the protocol implemend by
the server. Follow the semantic versioning rules at https://semver.org
Hopefully this bring better user experience when we make breaking
changes at the protocol level and avoid issues like #12465
2025-04-18 10:13:42 +03:00
hongruichen
a004951bb9
Merge branch 'master' into dev-refactoring
2025-04-16 00:39:25 +08:00
Diego Devesa
fe92821ea9
ggml : add bilinear upscale support (ggml/1185)
2025-04-11 00:17:47 +03:00
Diego Devesa
459895c326
ggml : add more generic custom op, remove deprecated custom ops (ggml/1183)
...
* ggml : add more generic ggml_custom op
* ggml : remove deprecated custom ops
2025-04-11 00:17:47 +03:00
hongruichen
e4fcdd418f
Merge branch 'master' into dev-refactoring
2025-04-03 23:57:00 +08:00
Georgi Gerganov
b4ae50810e
metal : improve FA + improve MoE ( #12612 )
...
* ggml : FA with different K, V head sizes (CPU)
ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128
ggml-ci
* ggml : restrict op on other backends to equal head sizes
ggml-ci
* metal : optimize FA-vec kernel
ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
ggml-ci
* metal : fix comments + remove unnecessary addition
ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id
ggml-ci
2025-03-28 20:21:59 +02:00
Radoslav Gerganov
ab6ab8f809
rpc : send hash when tensor data is above some fixed threshold ( #12496 )
...
* rpc : send hash when tensor data is above some fixed threshold
ref #10095
* rpc : put cache under $HOME/.cache/llama.cpp
* try to fix win32 build
* another try to fix win32 build
* remove llama as dependency
2025-03-28 08:18:04 +02:00
hongruichen
c2887f0377
Merge branch 'master' into dev-refactoring
2025-03-22 12:36:55 +08:00
nullname
a1ab67478f
[feat] add more op ( #35 )
...
* move op key generate function to kOpCaps
* fix op desc print
* try fix rms_norm
* Revert "try fix rms_norm"
This reverts commit 33b296098012909cb482fc29b52b28098dc971cd.
* add quantization type support by converting them to float
* enable quantization tensor for mulmat in gpu/npu
* fix asan error
* add log and assert
* insert output convert operator after mulmat
* add log
* fix some error in running
* disable permute again
* add log
* add error function
* Revert "add error function"
This reverts commit f92ff47798ac8053fb776c55efbb1a98469c7af1.
* add log
* more log
* disable convert op in graph
* wip
* add f16 config for graph
* set f16 precision for f16 graph
* fix override data type
* add comment
* add config flag to enable quantize type
* add log
* more quantized type for cpu and gpu backend
* enable all quant types for cpu and gpu backend
* rename
* wip
* add log
* remove unused functions
* skip permute
* remove get_qnn_op_input_param_count
* fallback to generic_get_op_desc if no op_desc
* revert 'skip permute'
* Revert "revert 'skip permute'"
This reverts commit 5761e31fd23c69c4cabf6fd9fac1a0d3e5a74968.
* wip
* add log
* print qnn tensor type
* add log
* limit the max size of tensor
* add log
* fix tensor size limiter
* small improve on tensor info printer
* disable sqrt and div to pass test-backend-ops for 8 gen 2
* remove debug log in release build
* add log
* skip permute in src
* wip
* disable reshape
* skip mul at decoder start
* wip
* add log
* add qnn_scoped_timer
* add perf tracker in graph
* add cmake options GGML_QNN_ENABLE_PERFORMANCE_TRACKING
* fix flag name
* use milli-second
* wip
* fix comment string
* add file for profiler
* change qnn-cpu to GGML_BACKEND_DEVICE_TYPE_ACCEL, so that we can run tests on cpu
* wip
* profiler: refactoring
* wip
* add implement for print_profile_events
* set-up profiler for graph
* set profiler to graph execute
* pretty print events
* unified log print prefix
* print event count
* enable optrace
* print duration at event end
* wip
* add more detailed soc information
* wip
* move device caps array into qnn-lib.cpp
* remove lib_name in device_context
* move get_graph_key_from_cgraph to graph.cpp
* add override type for tensor key
* use override_type instead of original data type for graph key
* append op type to tensor name to fix error in qwen
* remove todo
* wip
2025-03-22 12:34:31 +08:00
Molly Sophia
7dfad387e3
llama: Add support for RWKV v7 architecture ( #12412 )
...
* ggml: Add op l2_norm
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: Add op rwkv_wkv7
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: Add support for RWKV7 and ARWKV7 models
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix inference with RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: add more (a)rwkv7 variants in size
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Apply code-format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* fix MUSA build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix shape error with rwkv using llama-parallel
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-18 07:27:50 +08:00
hongruichen
525cd2d641
Merge branch 'master' into dev-refactoring
2025-03-14 10:04:11 +08:00
Rémy O
07d1572347
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions ( #12154 )
...
* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions
* cmake: Add GGML_BMI2 build option
* ggml: enable BMI2 on relevant CPU variants
* ggml-cpu: include BMI2 in backend score
* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features
* ggml-cpu: add __BMI2__ define when using MSVC
2025-03-06 02:26:10 +01:00
hongruichen
27cec63802
Merge branch 'master' into dev-refactoring
2025-03-05 22:22:21 +08:00
mgroeber9110
5bbe6a9fe9
ggml : portability fixes for VS 2017 ( #12150 )
...
* Add include files for std::min/max and std::toupper/tolower
* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined
* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode
* win32: only use __restrict in MSVC if C11/C17 support is not enabled
---------
Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-04 18:53:26 +02:00
William Tambellini
70680c48e5
ggml : upgrade init_tensor API to return a ggml_status ( #11854 )
...
* Upgrade init_tensor API to return a ggml_status
To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.
* misc fixes
---------
Co-authored-by: slaren <slarengh@gmail.com>
2025-02-28 14:41:47 +01:00
hongruichen
84328ffbda
Merge branch 'master' into dev-refactoring
2025-02-24 14:39:12 +08:00
Aaron Teo
af7747c95a
ggml-cpu: Support s390x SIMD Instruction Set ( #12019 )
...
* ggml: add s390x ARCH_FLAGS for compilation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add SIMD for s390x using vector intrinsics
SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16
SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix missing escape character in GGML_F32x4_REDUCE
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix s390x GGML_F32x4_REDUCE
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: full SIMD activation for F32,F16 s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add option to disable s390x VXE/VXE2
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: change vecintrin.h include to ggml-cpu-impl
* add __VXE__ and __VXE2__ macros
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* cmake: add s390x target detection for VX/VXE/VXE2
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x Q8_0 SIMD
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: correct documentation for Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x reduce code complexity Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x bugfix typo Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activated for Q4_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x inline vec_reve
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for Q4_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add VXE backend feature
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: remove test.py
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for quantize_row_q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for quantize_row_q8_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for iq4_xs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: bugfix iq4_xs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for iq4_nl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add float, double, and long vector data type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: clean up iq4_xs SIMD
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix improper use of restrict keyword
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: update warning message for ggml_vec_tbl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: switch to restrict for iq4_nl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: slight dot product speed improvement for q4_1_q8_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for q6_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add missing `_t` to ggml_int8x16x4_t
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix missing `_t` for ggml_vec_xl_s8x4
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix more missing `_t`
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add unroll and prefetch to Q8_0
increase of 3.86% for prompt processing and 32.22% for token generation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: patch Q8_0 to use proper vector sizes
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: optimise Q8_0 dot prod compute kernel further
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: add unroll and prefetch to Q4_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: refactor Q6_K variable naming for readability
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix Q6_K typos
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for Q5_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix wrong char*x16_t naming
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: Q5_K y0 wrong signness
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix Q5_K invalid uchar type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix Q5_K invalid uchar type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: s390x SIMD activation for Q4_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: fix Q4_K invalid vector intrinsics
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: simplify ggml_padd_s16 compute kernel
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: correct ggml-cpu vxe wording
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: change ggml_aligned_malloc alignment to 256
256 is the cache line size for s390x platforms
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: resolve pr merge via cherry-pick 225bbbf
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml : fix LoongArch compile error with 128-bit SIMD (#11701 )
* ggml: resolve pr merge via cherry-pick 4571953
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: cmake remove fork when determining s390x machine type
thank you @ericcurtin
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
2025-02-22 21:39:24 +00:00
Charles Xu
c5d91a7400
ggml-cpu: Add CPU backend support for KleidiAI library ( #11390 )
...
* ggml-cpu: Add CPU backend support for KleidiAI library
* Add environmental variable GGML_KLEIDIAI_SME
* Add support for multithread LHS conversion
* Switch kernel selection order to dotprod and i8mm
* updates for review comments
* More updates for review comments
* Reorganize and rename KleidiAI files
* Move ggml-cpu-traits.h to source file
* Update cmake for SME build and add alignment for SME
* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
2025-02-20 15:06:51 +02:00
Georgi Gerganov
68ff663a04
repo : update links to new url ( #11886 )
...
* repo : update links to new url
ggml-ci
* cont : more urls
ggml-ci
2025-02-15 16:40:57 +02:00
hongruichen
ba324b0252
Merge branch 'master' into dev-refactoring
2025-02-12 23:51:03 +08:00
bandoti
fef0cbeadf
cleanup: fix compile warnings associated with gnu_printf ( #11811 )
2025-02-12 10:06:53 -04:00
Danny Milosavljevic
c2a67efe38
vulkan: Make Vulkan optional at runtime ( #11493 ). ( #11494 )
...
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-10 07:17:21 +01:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention ( #11583 )
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00
Hongrui Chen
3ed9f5b5df
Merge branch 'master' into dev-refactoring
2025-01-18 22:37:56 +08:00
Radoslav Gerganov
667d72846c
rpc : early register backend devices ( #11262 )
...
Early register RPC devices and do not propagate RPC specifics in the
llama model structures.
ref: #10609
2025-01-17 10:57:09 +02:00
Johannes Gäßler
9c8dcefe17
CUDA: backwards pass for misc. ops, add tests ( #11257 )
...
* CUDA: backwards pass for misc. ops, add tests
* remove restrict from pointers
2025-01-16 16:43:38 +01:00
Johannes Gäßler
432df2d5f9
RoPE: fix back, CUDA support for back + noncont. ( #11240 )
...
* RoPE: fix back, CUDA support for back + noncont.
* fix comments reg. non-cont. RoPE support [no-ci]
2025-01-15 12:51:37 +01:00
hongruichen
c410717015
Merge branch 'master' into dev-refactoring
...
# Conflicts:
# ggml/CMakeLists.txt
# src/llama.cpp
2025-01-10 11:20:00 +08:00
Molly Sophia
ee7136c6d1
llama: add support for QRWKV6 model architecture ( #11001 )
...
llama: add support for QRWKV6 model architecture (#11001 )
* WIP: Add support for RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV: Some graph simplification
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Add support for RWKV6Qwen2 with cpu and cuda GLA
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix some typos
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* code format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix wkv test & add gla test
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Fix cuda warning
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update README.md
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Update ggml/src/ggml-cuda/gla.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix fused lerp weights loading with RWKV6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* better sanity check skipping for QRWKV6 in llama-quant
thanks @compilade
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
Johannes Gäßler
53ff6b9b9f
GGUF: C++ refactor, backend support, misc fixes ( #11030 )
...
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types
2025-01-07 18:01:58 +01:00
Hongrui Chen
8f07b3e3f6
Merge branch 'master' into dev-refactoring
...
# Conflicts:
# ggml/src/ggml-backend-reg.cpp
2024-12-26 19:00:11 +08:00
Georgi Gerganov
0bf2d10c55
tts : add OuteTTS support ( #10784 )
...
* server : add "tokens" output
ggml-ci
* server : output embeddings for all tokens when pooling = none
ggml-ci
* server : be explicit about the pooling type in the tests
ggml-ci
* server : do not normalize embeddings when there is no pooling
ggml-ci
* llama : add OuteTTS support (wip)
* wip
* extract features
* first conv
* group norm
* resnet conv
* resnet
* attn
* pos net
* layer norm
* convnext
* head
* hann window
* fix n_embd + remove llama.cpp hacks
* compute hann window
* fft
* spectrum processing
* clean-up
* tts : receive input text and generate codes
* clip : fix new conv name
* tts : minor fix
* tts : add header + minor fixes
ggml-ci
* tts : add matchematical constant
ggml-ci
* tts : fix sampling + cut initial noise
* tts : fixes
* tts : update default samplers
ggml-ci
* tts : text pre-processing
* tts : outetts-voc -> wavtokenizer-dec
* tts : remove hardcoded constants
ggml-ci
* tts : fix tensor shapes
* llama : refactor wavtokenizer tensors
ggml-ci
* cont
ggml-ci
* cont [no ci]
* llama : update WavTokenizer to non-causal attn
* llama : handle no-vocab detokenization
* tts : add Python example for OuteTTS (wip)
* tts : extend python example to generate spectrogram
ggml-ci
* server : fix rebase artifacts
* tts : enable "return_tokens" in Python example
ggml-ci
* tts : minor fixes
* common : support HF download for vocoder
2024-12-18 19:27:21 +02:00
HimariO
ba1cb19cdd
llama : add Qwen2VL support + multimodal RoPE ( #10361 )
...
* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, out dated func args
* update `llama_hparams`
* update to keep up stream changes
* resolve linter, test errors
* add makefile entry, update speical image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixs
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix traililng whitespce
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remote old `rope_section` compare operator
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-14 14:43:46 +02:00
hongruichen
6d3267aad1
Merge branch 'master' into dev-refactoring
...
# Conflicts:
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
2024-12-14 15:38:16 +08:00