* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move stuff to examples. Add patch to not use kv-cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now
Add LLAMA_API to fix a run-time error with llama-cpp-python in a Windows environment:
AttributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?
Although llama_kv_self_seq_div() has been marked deprecated, it still needs to be
exported so that llama-cpp-python keeps working.
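A minimal sketch of the kind of change, assuming the declaration in llama.h follows the usual LLAMA_API/DEPRECATED pattern (the exact parameter list and deprecation hint may differ): the deprecated wrapper keeps the LLAMA_API attribute so the symbol is still exported from the Windows DLL and remains resolvable through ctypes.

```cpp
// Sketch only: keep LLAMA_API on the deprecated wrapper so the symbol is
// exported from llama.dll and llama-cpp-python can still resolve it via ctypes.
DEPRECATED(LLAMA_API void llama_kv_self_seq_div(
                   struct llama_context * ctx,
                           llama_seq_id   seq_id,
                              llama_pos   p0,
                              llama_pos   p1,
                                    int   d),
        "use llama_memory_seq_div instead");
```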
Observed software versions:
OS: Windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
* Add PLaMo-2 model using hybrid memory module
* Fix z shape
* Add cmath to include from llama-vocab.h
* Explicitly dequantize normalization weights before RoPE apply
* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing
* Use ATTN_K/Q_NORM for k,q weights to prevent quantization
* Remove SSM_BCDT that is not used from anywhere
* Do not duplicate embedding weights for output.weight
* Fix tokenizer encoding problem for multibyte strings
* Apply suggestion from @CISC
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up
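For context on the fused path: SwiGLU computes the same thing as the split gate/up version, out = silu(x·W_gate) ⊙ (x·W_up), followed by the down projection. A minimal sketch of the elementwise core (the gate/up/down matrix multiplications are omitted; this is an illustration, not the llama.cpp implementation):

```cpp
#include <cmath>
#include <vector>

// Elementwise core of the SwiGLU FFN: out[i] = silu(gate[i]) * up[i], where
// gate = x*W_gate and up = x*W_up are assumed to be precomputed projections.
std::vector<float> swiglu(const std::vector<float> & gate, const std::vector<float> & up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) {
        const float g    = gate[i];
        const float silu = g / (1.0f + std::exp(-g));  // SiLU activation
        out[i] = silu * up[i];
    }
    return out;
}
```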
* Remove unnecessary part for Grouped Query Attention
* Fix how the special token id is loaded into gguf
* Remove unused tensor mapping
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations
* Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix plamo2 tokenizer session to prevent multiple calls of build()
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Remove unnecessary templates from the class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in the packing functions
2~7% improvement with Q8 models
15~50% improvement with Q4 models
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
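Roughly the shape of the refactor described above, with hypothetical names rather than the actual sgemm/mnpack code: repeated per-tile copy logic moves into a small inline helper so the packing loops stay flat instead of branching per tile size.

```cpp
#include <cstdint>
#include <cstring>

// Pack one rm x kc tile of src (row-major, leading dimension ld) into a
// contiguous destination buffer. Illustrative helper, not the real code.
static inline void pack_tile(float * dst, const float * src,
                             int64_t rm, int64_t kc, int64_t ld) {
    for (int64_t i = 0; i < rm; ++i) {
        std::memcpy(dst + i*kc, src + i*ld, kc*sizeof(float));
    }
}

// One call per tile replaces the previous per-case copy branches.
static void pack_A(float * dst, const float * A,
                   int64_t m, int64_t k, int64_t ld, int64_t rm) {
    for (int64_t i0 = 0; i0 < m; i0 += rm) {
        const int64_t rows = (m - i0) < rm ? (m - i0) : rm;
        pack_tile(dst + i0*k, A + i0*ld, rows, k, ld);
    }
}
```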
* CUDA: add set rows for f32 and f16
* Review: change kernel params, use strides from host
* Use 1-d kernel
* Review: use int64_t for blockDim.x, rename nb->s for clarity
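For reference, the semantics these kernels implement (as I understand the op): each source row is written to the destination row named by a per-row index. A simplified CPU sketch for contiguous f32 tensors; the real op also handles conversion to f16/quantized destinations and broadcasting over the outer dims.

```cpp
#include <cassert>
#include <cstdint>

// Simplified CPU reference for SET_ROWS (f32 -> f32, contiguous 2-D tensors):
// row r of src is scattered into dst at the row index given by row_ids[r].
void set_rows_ref(float * dst, int64_t dst_rows, int64_t ncols,
                  const float * src, int64_t src_rows,
                  const int64_t * row_ids) {
    for (int64_t r = 0; r < src_rows; ++r) {
        const int64_t i = row_ids[r];              // destination row
        assert(i >= 0 && i < dst_rows);
        for (int64_t c = 0; c < ncols; ++c) {
            dst[i*ncols + c] = src[r*ncols + c];
        }
    }
}
```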
* vulkan: support SET_ROWS
Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.
* vulkan: optimize set_rows
Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
* vulkan: increase coopmat2 mul_mat_id tile size
* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
* vulkan: use smaller FA row size when head size is large; applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
* feat: add mixed precision dot product implementation and function declaration
* feat: implement mixed precision vector dot product and conversion functions
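Scalar semantics of the mixed-precision dot product added here, as a sketch only: which operand is f16 versus f32 is illustrative, `_Float16` is a GCC/Clang extension assumed for the example, and the HVX implementation widens f16 lanes in vector registers instead of converting per element.

```cpp
#include <cstdint>

// f16 x f32 dot product accumulated in f32. Reference only; the optimized
// version processes full HVX vectors and handles unaligned heads and tails.
float vec_dot_mixed_ref(const _Float16 * src0, const float * src1, int64_t n) {
    float sum = 0.0f;
    for (int64_t i = 0; i < n; ++i) {
        sum += (float) src0[i] * src1[i];
    }
    return sum;
}
```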
* fix: update data type handling in matrix multiplication implementation
* fix: adjust row count handling in matrix multiplication implementation for accurate slicing
* fix: optimize matrix multiplication implementation by unrolling the loop
* update performance tracking for matrix multiplication implementation
* add fetching
* wip
* fix: support F16 * F32 multiplication in is_mul_mat_supported function
* fix: improve src0 fetching logic in vec_dot_product_mixed_impl for better alignment handling
* fix test failure for row width 67
* try to fix the failing test
* fix: rename aligned_address to align_down for clarity in vector alignment handling
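The rename makes the intent explicit: the helper rounds an address down to a vector boundary so the first load is aligned and the leading partial vector can be handled separately. A hypothetical sketch (the actual HVX code works with its own vector width and types):

```cpp
#include <cstddef>
#include <cstdint>

// Round a pointer down to the nearest multiple of kAlign (a power of two).
template <std::size_t kAlign>
static inline const uint8_t * align_down(const uint8_t * p) {
    static_assert((kAlign & (kAlign - 1)) == 0, "kAlign must be a power of two");
    return reinterpret_cast<const uint8_t *>(
        reinterpret_cast<std::uintptr_t>(p) & ~(std::uintptr_t)(kAlign - 1));
}
```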
* wip
* qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility
* fix test failure at width == 193
* fix: replace zero vector initialization with previous vector in mixed dot product implementation
* wip
* fix: improve handling of last vector in mixed dot product implementation
* wip
* wip
* wip
* wip
* Enhance mul_mat_f32 function to support quantized types and improve static assertions
* rename
* Refactor dequantization functions to use npu_device_fp16_t and improve type handling
* Optimize dequantization in dequantize_row_q8_0 by replacing qf32 multiplication with qf16
* Optimize dequantization in dequantize_row_q4_0 by replacing qf32 multiplication with qf16
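For orientation, the scalar math these kernels optimize: a Q8_0 block stores 32 int8 quants plus one per-block fp16 scale, and dequantization is y = d·q (Q4_0 is analogous, with packed 4-bit quants offset by 8). A simplified sketch with the scale kept as float for self-containment; the real block stores it as fp16, which is why the qf16 multiply applies.

```cpp
#include <cstdint>

constexpr int QK8_0 = 32;

// Simplified block layout: the real ggml block_q8_0 stores the scale as fp16.
struct block_q8_0_sketch {
    float  d;            // per-block scale
    int8_t qs[QK8_0];    // quantized values
};

void dequantize_row_q8_0_ref(const block_q8_0_sketch * x, float * y, int64_t k) {
    const int64_t nb = k / QK8_0;
    for (int64_t i = 0; i < nb; ++i) {
        for (int j = 0; j < QK8_0; ++j) {
            y[i*QK8_0 + j] = x[i].d * x[i].qs[j];  // y = d * q
        }
    }
}
```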
* Add hvx_vsf_convert_vhf function for improved vector conversion
* add perf logs
* Refactor dequantize_row_q4_0 for alignment
* Update logging in supports_op_impl and supports_op to use ggml_op_desc for better clarity
* Add support for ROPE operation in NPU capabilities and related functions
* Implement ROPE operation in tensor and op_rope, including cache initialization and correction dimension calculations
* enable ROPE by adding operation validation
* add support for the case where freq is null
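The base rotation op_rope implements, as a minimal sketch: interleaved pair layout, no per-dimension frequency factors (the freq-is-null case above, i.e. the default theta schedule), and no YaRN correction dims.

```cpp
#include <cmath>

// Rotate one row of n_dims values in place for position pos. Each pair
// (x[2i], x[2i+1]) is rotated by theta_i = pos * freq_base^(-2i/n_dims).
void rope_row_ref(float * x, int n_dims, int pos, float freq_base = 10000.0f) {
    for (int i = 0; i < n_dims; i += 2) {
        const float theta = pos * std::pow(freq_base, -(float) i / n_dims);
        const float c  = std::cos(theta);
        const float s  = std::sin(theta);
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0*c - x1*s;
        x[i + 1] = x0*s + x1*c;
    }
}
```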
* wip
* Refactor rope_f32 to improve indexing by introducing total_planes calculation
* reformat
* Refactor rope_f32 to optimize data access patterns by introducing row and plane pointers
* Add performance tracking to rope_f32 function for enhanced profiling
* Refactor rope_f32 to use a templated implementation
* Refactor rope_impl to replace loop with memcpy for improved performance
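A sketch of the kind of loop-to-memcpy change, with illustrative names only; the assumption here is that the copied region is the contiguous unrotated tail, i.e. dims past n_dims that RoPE passes through unchanged.

```cpp
#include <cstring>

// Copy the pass-through dims (>= n_dims) of a row with one memcpy instead of
// an element-by-element loop; valid because the tail is contiguous.
void copy_unrotated_tail(float * dst, const float * src, int ne0, int n_dims) {
    if (ne0 > n_dims) {
        std::memcpy(dst + n_dims, src + n_dims,
                    (size_t)(ne0 - n_dims) * sizeof(float));
    }
}
```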
* Refactor mul_mat_impl to support quantization as a template parameter
* wip
* wip
* Refactor rope_impl to optimize plane indexing in the processing loop
* Add aligned vector dot product implementation for mixed precision types
* wip
* Enhance matrix multiplication for F32 and F16 types with alignment checks
* Optimize vec_dot_product_mix_aligned_impl for improved performance with additional vector sums
* Add alignment checks for matrix multiplication and vector dot products
* Refactor matrix multiplication to use function pointers for improved readability and maintainability
* Fix alignment check in is_dot_product_aligned to ensure correct vector size handling
* Remove unused f16_to_f32_table parameter from quantization and dequantization functions
* wip
* Add L2 fetch for src1 plane rows in matrix multiplication implementation
* wip
* Refactor hvx_vsf_convert_vhf to accept an additional parameter for flexibility in vector multiplication
* Refactor vec_dot_product_mix_aligned_impl to improve variable naming for clarity
* Refactor load_dual_block_generic and dequantize_row_q4_0 to improve performance
* Refactor vector operation functions to improve clarity and consistency in variable usage
* wip
* wip
* Refactor dequantize_row_q4_0_impl for improved clarity and performance in vector operations
* wip
* Update load_dual_block_generic to use intrinsics
* Refactor load_dual_block_generic and load_qual_block_generic for improved performance and clarity
* wip
* wip
* Optimize dequantize_row_q8_0 for improved performance by unrolling the for loop
* wip
* wip
* fix typo