[[docs:funcstructs:llama.cpp]]
== llama.cpp

[[docs:funcstructs:llama.cpp:llama_model_load]]
=== llama_model_load

Signature:

[.codebit]#`static int llama_model_load(const std::string & fname, std::vector<std::string> & splits, llama_model & model, llama_model_params & params)`#

Loads the model data from the given file using a [.codebit]#`llama_model_loader`#. It is called by [.codebit]#`llama_model_load_from_file_impl(...)`#.

[[docs:funcstructs:llama.cpp:struct-llm_build_context]]
=== struct llm_build_context

This structure helps build the computation graphs ([.codebit]#`struct ggml_cgraph`#) for the various model architectures through its builder methods: [.codebit]#`build_llama()`#, [.codebit]#`build_deci()`#, [.codebit]#`build_baichuan()`#, [.codebit]#`build_bert()`#, etc. Its constructor has the following signature:

[.codebit]#`llm_build_context(llama_context & lctx, const llama_ubatch & ubatch, const llm_build_cb & cb, bool worst_case)`#

Note that its [.codebit]#`init()`# method must be called before using any of the builder methods.
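The resulting usage pattern, as seen in [.codebit]#`llama_build_graph(...)`# below, is roughly the following; this is a sketch of internal llama.cpp code rather than a standalone example:

[source,C++]
----
// sketch: how llama_build_graph(...) drives the builder (see that section below)
struct llm_build_context llm(lctx, ubatch, cb, worst_case);

llm.init();                                  // must come before any build_*() call

struct ggml_cgraph * gf = llm.build_llama(); // or build_deci(), build_bert(), ...
----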
[[docs:funcstructs:llama.cpp:struct-llm_build_context.init]]
=== struct llm_build_context.init

Signature:

[.codebit]#`void init()`#

Through a call to [.codebit]#`ggml_init(...)`#, it creates a [.codebit]#`ggml_context`# that uses the [.codebit]#`buf_compute_meta`# member of the [.codebit]#`llama_context`# the object was constructed with as its working buffer.
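For readers unfamiliar with this ggml idiom, the following self-contained sketch creates a [.codebit]#`ggml_context`# on top of a caller-owned metadata buffer with [.codebit]#`no_alloc`# enabled, which is the same technique; the buffer size here is an arbitrary example value, not the one llama.cpp uses.

[source,C++]
----
#include "ggml.h"

#include <cstdint>
#include <vector>

int main() {
    // caller-owned buffer that only has to hold tensor/graph metadata,
    // playing the role of llama_context::buf_compute_meta
    std::vector<uint8_t> buf_compute_meta(16u*1024*1024);

    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_compute_meta.size(),
        /*.mem_buffer =*/ buf_compute_meta.data(),
        /*.no_alloc   =*/ true, // tensor data lives elsewhere (backend buffers)
    };

    struct ggml_context * ctx = ggml_init(params);

    // tensors created here only store their metadata in buf_compute_meta
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 32);
    (void) t;

    ggml_free(ctx);
    return 0;
}
----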
[[docs:funcstructs:llama.cpp:struct-llm_build_context.build_llama]]
=== struct llm_build_context.build_llama

Signature:

[.codebit]#`struct ggml_cgraph * build_llama()`#

One of [.codebit]#`llm_build_context`#'s graph builder methods. Like all the others, it begins with a call to [.codebit]#`ggml_new_graph_custom(...)`#, continues with a section that creates and connects the tensor operations, and finishes with a call to [.codebit]#`ggml_build_forward_expand(...)`#, which links the tensors to the graph.

NOTE: The builder methods [.codebit]#`build_bert(...)`#, [.codebit]#`build_t5_dec(...)`#, [.codebit]#`build_rwkv6(...)`# and [.codebit]#`build_rwkv6qwen2(...)`# make additional calls to [.codebit]#`ggml_build_forward_expand(...)`# within the tensor-building section.
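The same three-step shape can be illustrated with the public ggml API alone. The following is a self-contained toy sketch of the pattern (graph creation, tensor construction, [.codebit]#`ggml_build_forward_expand(...)`#), not an excerpt from llama.cpp.

[source,C++]
----
#include "ggml.h"

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // 1) create the graph, as the builder methods do via ggml_new_graph_custom(...)
    struct ggml_cgraph * gf = ggml_new_graph_custom(ctx, GGML_DEFAULT_GRAPH_SIZE, /*grads=*/false);

    // 2) create and connect tensor operations (a toy layer instead of a transformer block)
    struct ggml_tensor * inp = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8);
    struct ggml_tensor * w   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * cur = ggml_mul_mat(ctx, w, inp);

    // 3) link the result (and everything it depends on) into the graph
    ggml_build_forward_expand(gf, cur);

    ggml_free(ctx);
    return 0;
}
----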
[[docs:funcstructs:llama.cpp:llama_build_graph]]
=== llama_build_graph

Signature:

[.codebit]#`static struct ggml_cgraph * llama_build_graph(llama_context & lctx, const llama_ubatch & ubatch, bool worst_case)`#

Builds the computation graph ([.codebit]#`struct ggml_cgraph`#).

First, it creates a lambda with the signature [.codebit]#`(struct ggml_tensor * cur, const char * name, int il) -> void`#, where [.codebit]#`il`# is the index of the tensor's layer. This lambda is passed to [.codebit]#`llm_build_context`#'s constructor and used as a callback by the builder functions. The callback first sets the tensor's name to \{name}-\{il} if [.codebit]#`il >= 0`# and to \{name} otherwise (names longer than [.codebit]#`GGML_MAX_NAME`#, currently 64, are truncated; see [.codebit]#`ggml_tensor_format_name(...)`#). It then attempts to offload as many normalization tensors as possible from the CPU backend to the backends of the devices indicated by [.codebit]#`struct llama_model.dev_layer(il)`#, when the following conditions require it:

[source,C++]
----
const bool full_offload = lctx.model.params.n_gpu_layers > (int) lctx.model.hparams.n_layer;

if (ubatch.n_tokens < 32 || full_offload) {
    if (il != -1 && strcmp(name, "norm") == 0) {
        const auto & dev_layer = lctx.model.dev_layer(il);
        for (auto & backend : lctx.backends) {
            if (ggml_backend_get_device(backend.get()) == dev_layer) {
                if (ggml_backend_supports_op(backend.get(), cur)) {
                    ggml_backend_sched_set_tensor_backend(lctx.sched.get(), cur, backend.get());
                }
            }
        }
    }
}
----

NOTE: Normalization tensors are created by calls to [.codebit]#`llm_build_norm(...)`# from [.codebit]#`llm_build_context`#'s builder functions. Through a call to the callback described above, [.codebit]#`llm_build_norm(...)`# sets the tensor's name to "`norm`" (or "`norm_w`", which does not trigger the offload), and the callback then potentially moves the tensor to the desired backend. However, most of the time, after [.codebit]#`llm_build_norm(...)`# creates a normalization tensor, the calling builder function invokes the callback again to change its name to something more specific, such as "`attn_norm`" or "`ffn_norm`". As a result, most normalization tensors stay on the assigned backends while carrying names other than "`norm`".
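For illustration, the pattern inside the builder functions looks roughly like the following fragment; it is paraphrased rather than quoted, and the exact argument list of [.codebit]#`llm_build_norm(...)`# may differ.

[source,C++]
----
// paraphrased fragment from a builder method (argument list approximate)
cur = llm_build_norm(ctx0, inpL, hparams,
        model.layers[il].attn_norm, NULL,
        LLM_NORM_RMS, cb, il);   // internally calls cb(cur, "norm", il) -> may be offloaded
cb(cur, "attn_norm", il);        // the caller then renames it; the backend assignment remains
----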
Secondly, [.codebit]#`llm_build_context`# is instantiated and initialized:

[source,C++]
----
struct llm_build_context llm(lctx, ubatch, cb, worst_case);

llm.init();
----

Lastly, the proper builder function is called based on [.codebit]#`llama_model`#'s [.codebit]#`arch`# member and the result is returned.
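The selection itself is an ordinary switch on the architecture enum. An abridged sketch follows; the real switch covers every supported architecture, and the exact case grouping may differ.

[source,C++]
----
// abridged sketch of the final dispatch in llama_build_graph(...)
struct ggml_cgraph * result = NULL;

switch (model.arch) {
    case LLM_ARCH_LLAMA:
        {
            result = llm.build_llama();
        } break;
    case LLM_ARCH_BERT:
        {
            result = llm.build_bert();
        } break;
    // ... one case per supported architecture ...
    default:
        GGML_ABORT("unknown architecture");
}

return result;
----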
[[docs:funcstructs:llama.cpp:llama_graph_compute]]
=== llama_graph_compute

Signature:

[.codebit]#`static enum ggml_status llama_graph_compute(llama_context & lctx, ggml_cgraph * gf, int n_threads, ggml_threadpool * threadpool)`#

As its name implies, this function computes a [.codebit]#`ggml_cgraph`# in a given [.codebit]#`llama_context`#. It first performs some threadpool management (not analyzed in detail here), then calls [.codebit]#`ggml_backend_sched_graph_compute_async(...)`# for the actual graph computation, after which it logs any failure and returns a [.codebit]#`ggml_status`#.
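Stripped of the threadpool handling, the core of it is roughly the following fragment, assuming [.codebit]#`lctx.sched`# and a graph [.codebit]#`gf`# built by [.codebit]#`llama_build_graph(...)`#; this is a paraphrase, not a verbatim excerpt.

[source,C++]
----
// paraphrased core of llama_graph_compute(...)
enum ggml_status status = ggml_backend_sched_graph_compute_async(lctx.sched.get(), gf);
if (status != GGML_STATUS_SUCCESS) {
    LLAMA_LOG_ERROR("%s: ggml_backend_sched_graph_compute_async failed with error %d\n", __func__, status);
}

return status;
----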
[[docs:funcstructs:llama.cpp:llama_decode_impl]]
=== llama_decode_impl

Signature:

[.codebit]#`static int llama_decode_impl(llama_context & lctx, llama_batch inp_batch)`#

This function handles the inference process. It has the following structure (a caller-level sketch follows the list):

* input batch processing
* inference loop (until the input batch is consumed):
** batch preparation
** [.codebit]#`ggml_backend_sched_reset(...)`#
** [.codebit]#`ggml_backend_sched_set_eval_callback(...)`#: sets the scheduler's callback function to the user-provided one (if any)
** [.codebit]#`llama_build_graph(...)`#
** setting pointers to the output tensors. There are two types of outputs: logits, which are always extracted from the last tensor in the computation graph, and embeddings, which are extracted from the first tensor named "`result_embd_pooled`" (if present)
** [.codebit]#`ggml_backend_sched_alloc_graph`#: this is also called indirectly by [.codebit]#`llama_graph_compute(...)`#, so this call appears to be redundant
** [.codebit]#`llama_set_inputs(...)`#
** [.codebit]#`llama_graph_compute(...)`#
** output extraction
* output processing
* KV cache defragmentation (if needed)
* [.codebit]#`ggml_backend_sched_reset(...)`#
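From the caller's perspective all of this is hidden behind [.codebit]#`llama_decode(...)`#. The sketch below is a minimal, hypothetical helper that feeds one already-tokenized batch to the model and reads the logits of the last position; the helper name and error handling are illustrative, not part of llama.cpp.

[source,C++]
----
#include "llama.h"

#include <cstdio>
#include <vector>

// hypothetical helper: decode a tokenized prompt and look at the last position's logits
static int decode_prompt(llama_context * ctx, std::vector<llama_token> & tokens) {
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());

    const int ret = llama_decode(ctx, batch); // ends up in llama_decode_impl(...)
    if (ret != 0) {
        fprintf(stderr, "llama_decode failed: %d\n", ret);
        return ret;
    }

    // logits of the last token in the batch; sampling would start from here
    const float * logits = llama_get_logits_ith(ctx, -1);
    (void) logits;

    return 0;
}
----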
[[docs:funcstructs:llama.cpp:llama_backend_init]]
=== llama_backend_init

Signature:

[.codebit]#`void llama_backend_init(void)`#

Calls [.codebit]#`ggml_time_init()`#, then [.codebit]#`ggml_init(...)`# and [.codebit]#`ggml_free(...)`# to initialize the f16 tables.
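Typical usage pairs it with [.codebit]#`llama_backend_free()`# at the end of the program; a minimal sketch:

[source,C++]
----
#include "llama.h"

int main() {
    llama_backend_init();   // one-time, process-wide initialization

    // ... load models, create contexts, run inference ...

    llama_backend_free();   // matching teardown
    return 0;
}
----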
[[docs:funcstructs:llama.cpp:llama_model_load_from_file_impl]]
=== llama_model_load_from_file_impl

Signature:

[.codebit]#`static struct llama_model * llama_model_load_from_file_impl(const std::string & path_model, std::vector<std::string> & splits, struct llama_model_params params)`#

Constructs a [.codebit]#`struct llama_model`# and sets its devices (using calls to [.codebit]#`ggml_backend_dev_count()`# and [.codebit]#`ggml_backend_dev_get(...)`#), logs information about their memory, calls [.codebit]#`llama_model_load(...)`# and logs possible errors before returning.
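The device enumeration goes through the ggml backend registry. The following standalone sketch uses the same API to list devices and their memory; it is not llama.cpp's actual logging code.

[source,C++]
----
#include "ggml-backend.h"

#include <cstdio>

int main() {
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);

        size_t free_mem  = 0;
        size_t total_mem = 0;
        ggml_backend_dev_memory(dev, &free_mem, &total_mem);

        printf("device %zu: %s (%zu MiB free / %zu MiB total)\n",
               i, ggml_backend_dev_name(dev),
               free_mem / 1024 / 1024, total_mem / 1024 / 1024);
    }
    return 0;
}
----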
[[docs:funcstructs:llama.cpp:llama_model_load_from_file]]
=== llama_model_load_from_file

Signature:

[.codebit]#`struct llama_model * llama_model_load_from_file(const char * path_model, struct llama_model_params params)`#

Wrapper for [.codebit]#`llama_model_load_from_file_impl`# (calls it with an empty [.codebit]#`splits`# parameter).
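A minimal caller-side sketch of loading a model through this entry point; the path and parameter values are placeholders.

[source,C++]
----
#include "llama.h"

#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0;   // example value: CPU-only load

    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
----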
[[docs:funcstructs:llama.cpp:llama_init_from_model]]
=== llama_init_from_model

Signature:

[.codebit]#`struct llama_context * llama_init_from_model(struct llama_model * model, struct llama_context_params params)`#

Constructs a [.codebit]#`llama_context`# object, sets up its members according to the [.codebit]#`params`# argument, then initializes (through calls to [.codebit]#`ggml_backend_dev_init(...)`#) the backends of the devices set in [.codebit]#`model`# and adds them to [.codebit]#`llama_context.backends`#. The rest of the function has not been analyzed here.
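A caller-side sketch continuing from the loading example above; the parameter values are arbitrary placeholders.

[source,C++]
----
// continuing from a successfully loaded `model` (see llama_model_load_from_file above)
llama_context_params cparams = llama_context_default_params();
cparams.n_ctx     = 4096;   // context size in tokens
cparams.n_threads = 8;      // threads used for single-token generation

llama_context * ctx = llama_init_from_model(model, cparams);
if (ctx == NULL) {
    fprintf(stderr, "failed to create llama_context\n");
}

// ... llama_decode(...) calls go here ...

llama_free(ctx);
----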
[[docs:funcstructs:llama.cpp:llama_decode]]
=== llama_decode

Signature:

[.codebit]#`int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch)`#

Wrapper for [.codebit]#`llama_decode_impl(...)`#.