[[docs:funcstructs:llama-context.h]]
== llama-context.h

[[docs:funcstructs:llama-context.h:struct-llama_context]]
=== struct llama_context

This structure contains most, if not all, of the information crucial for a run. Here are some of its members:

* [.codebit]#`const struct llama_model & model`#: a reference to the model to be used
* [.codebit]#`struct llama_cparams cparams`#: the context parameters, including the eval_callback and eval_callback_user_data (see the [.codebit]#`ggml_backend_sched_compute_splits(...)`# section for more details)
* [.codebit]#`std::vector<ggml_backend_ptr> backends`#: interfaces with functions specialized for each available backend; see [.codebit]#`struct ggml_backend`# for more details
* [.codebit]#`ggml_backend_t backend_cpu`#: same as above, but for the CPU backend
* [.codebit]#`std::vector<uint8_t> buf_compute_meta`#: serves as the buffer for the [.codebit]#`ggml_context`# used to build the [.codebit]#`ggml_cgraph`# in [.codebit]#`struct llm_build_context`#
* [.codebit]#`ggml_backend_sched_ptr sched`#: helps with splitting the computation graph between multiple backends when needed; see [.codebit]#`struct ggml_backend_sched`#
* input tensors of type [.codebit]#`struct ggml_tensor*`#; see below
* [.codebit]#`struct llama_sbatch sbatch`#: helps with input handling
* [.codebit]#`size_t logits_size`#: size of the [.codebit]#`logits`# buffer
* [.codebit]#`float * logits`#: 2-dimensional array of size [.codebit]#`[n_outputs][n_vocab]`# holding the decode output (see the sketch after this list)
* [.codebit]#`size_t embd_size`#: size of the [.codebit]#`embd`# buffer
* [.codebit]#`float * embd`#: 2-dimensional array of size [.codebit]#`[n_outputs][n_embd]`# holding the embeddings output
* [.codebit]#`int32_t n_outputs`#: from the comments, "number of actually-used outputs in the current ubatch or last logical batch"
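
The [.codebit]#`logits`# and [.codebit]#`embd`# buffers are flat [.codebit]#`float`# arrays: each output row holds [.codebit]#`n_vocab`# (respectively [.codebit]#`n_embd`#) consecutive values. A minimal sketch of reading a row through the public API, assuming a context that has already decoded a batch with logits requested for token [.codebit]#`i`# ([.codebit]#`print_logits_for_token`# is a hypothetical helper, not part of llama.cpp):

[source,C++]
----
#include "llama.h"

#include <cstdio>

// Hypothetical helper: print the first few logits produced for token `i`
// of the last decoded batch. The returned pointer points into the flat
// `logits` buffer of the context; the row holds `n_vocab` floats.
static void print_logits_for_token(llama_context * ctx, int32_t i) {
    const llama_model * model   = llama_get_model(ctx);
    const int32_t       n_vocab = llama_n_vocab(model);

    const float * row = llama_get_logits_ith(ctx, i);

    for (int32_t v = 0; v < 10 && v < n_vocab; ++v) {
        printf("logit[%d][%d] = %f\n", i, v, row[v]);
    }
}
----

The [.codebit]#`embd`# buffer is laid out the same way and is exposed through [.codebit]#`llama_get_embeddings_ith(...)`#.
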

Input tensors:

[source,C++]
----
struct ggml_tensor * inp_tokens;        // I32 [n_batch]
struct ggml_tensor * inp_embd;          // F32 [n_embd, n_batch]
struct ggml_tensor * inp_pos;           // I32 [n_batch]
struct ggml_tensor * inp_out_ids;       // I32 [n_outputs]
struct ggml_tensor * inp_KQ_mask;       // F32 [kv_size, n_batch]
struct ggml_tensor * inp_KQ_mask_swa;   // F32 [kv_size, n_batch]
struct ggml_tensor * inp_K_shift;       // I32 [kv_size]
struct ggml_tensor * inp_mean;          // F32 [n_batch, n_batch]
struct ggml_tensor * inp_cls;           // I32 [n_batch]
struct ggml_tensor * inp_s_copy;        // I32 [kv_size]
struct ggml_tensor * inp_s_mask;        // F32 [1, n_kv]
struct ggml_tensor * inp_s_seq;         // I32 [n_kv, n_batch]
struct ggml_tensor * inp_pos_bucket;    // I32 [n_batch|n_kv, n_batch]
struct ggml_tensor * inp_embd_enc;      // F32 [n_embd, n_outputs_enc]
struct ggml_tensor * inp_KQ_mask_cross; // F32 [n_outputs_enc, n_batch]
----
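
These tensors are graph inputs: they are created while the compute graph is being built and are filled from the current batch right before the graph is evaluated, using [.codebit]#`ggml_backend_tensor_set(...)`#. A minimal sketch of that pattern, with a hypothetical [.codebit]#`set_token_inputs`# helper (the real input-setting code in llama.cpp covers all of the tensors above, not just these two):

[source,C++]
----
#include "ggml.h"
#include "ggml-backend.h"
#include "llama.h"

// Hypothetical helper: copy the token ids and positions of the current
// batch into the corresponding graph input tensors. The tensors must
// already have been allocated (e.g. by the backend scheduler).
static void set_token_inputs(
        struct ggml_tensor * inp_tokens,
        struct ggml_tensor * inp_pos,
        const llama_token  * tokens,
        const llama_pos    * pos,
        int32_t              n_tokens) {
    // inp_tokens is an I32 tensor of shape [n_batch]
    ggml_backend_tensor_set(inp_tokens, tokens, 0, n_tokens*ggml_element_size(inp_tokens));
    // inp_pos is an I32 tensor of shape [n_batch]
    ggml_backend_tensor_set(inp_pos,    pos,    0, n_tokens*ggml_element_size(inp_pos));
}
----
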

It has a single constructor that does minimal setup:

[source,C++]
----
llama_context(const llama_model & model)
    : model(model)
    , t_start_us(model.t_start_us)
    , t_load_us(model.t_load_us) {}
----
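
The remaining setup (filling in [.codebit]#`cparams`#, creating the backends, the scheduler, the KV cache, and the output buffers) happens when a context is created through the public API rather than in this constructor. A minimal usage sketch, assuming a model file [.codebit]#`model.gguf`# and the pre-refactor C API this header corresponds to (error handling mostly omitted):

[source,C++]
----
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    // This is where the llama_context members described above are set up:
    // backends, sched, buf_compute_meta, the logits/embd buffers, etc.
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 4096; // text context size; 0 means "use the model's value"
    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        llama_free_model(model);
        return 1;
    }

    // ... build batches, call llama_decode(), read logits/embeddings ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
----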