mirror of https://github.com/google/gemma.cpp.git
add using gemma as a library notes to DEVELOPERS
This commit is contained in:
parent
ae7901c3f4
commit
b841620d6b

@@ -100,3 +100,70 @@ be exposed to the build system):

In the medium term both of these will likely be deprecated in favor of handling
options at runtime - allowing for multiple weight compression schemes in a
single build and dynamically resizing the KV cache as needed.

## Using gemma.cpp as a Library (Advanced)

Unless you are doing lower-level implementation work or research, from an
application standpoint you can think of `gemma.h` and `gemma.cc` as the "core"
of the library.

You can regard `run.cc` as an example application that your own application
substitutes for, so the invocations into `gemma.h` and `gemma.cc` that you see
in `run.cc` are probably the functions you'll be invoking. You can find
examples of the invocations of tokenizer methods and `GenerateGemma` in
`run.cc`.

Keep in mind gemma.cpp is oriented at more experimental / prototype / research
applications. If you're targeting production, there are more standard paths
via JAX / PyTorch / Keras for NN deployments.

### The Gemma struct contains all the state of the inference engine - tokenizer, weights, activations, and KV cache

`Gemma(...)` - the constructor creates a gemma model object, which is a
wrapper around four things - the tokenizer object, the weights, the
activations, and the KV cache.
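
For illustration, a minimal sketch of standing up a model. The constructor
arguments, file names, and model-type enum value below are assumptions -
mirror the setup in `run.cc` and check `gemma.h` for the actual signature:

```cpp
#include "gemma.h"  // gcpp::Gemma, gcpp::Model
#include "hwy/contrib/thread_pool/thread_pool.h"

int main() {
  hwy::ThreadPool pool(/*num_worker_threads=*/8);
  // Paths and model-type argument are assumptions; run.cc populates the
  // equivalents from command-line flags via LoaderArgs.
  gcpp::Gemma model(/*tokenizer=*/"tokenizer.spm",
                    /*compressed_weights=*/"2b-it-sbs.sbs",
                    /*model_type=*/gcpp::Model::GEMMA_2B, pool);
  // `model` now owns the tokenizer, weights, activations, and KV cache.
  return 0;
}
```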

In a standard LLM chat app, you'll probably use a `Gemma` object directly. In
more exotic data processing or research applications, you might work with the
weights, KV cache, and activations more directly (e.g. you might have multiple
KV caches and activations for a single set of weights) rather than only going
through a `Gemma` object.
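
A purely hypothetical sketch of that decomposed style - the structs below are
empty stand-ins for the real types in `gemma.h`, just to show the ownership
pattern:

```cpp
#include <vector>

// Hypothetical stand-ins for the real gemma.cpp types (see gemma.h).
struct Weights {};      // one set of model weights, loaded once
struct KVCache {};      // per-conversation attention cache
struct Activations {};  // per-conversation scratch state

int main() {
  Weights weights;  // shared, read-only during inference
  // Three independent conversations sharing one set of weights, each with
  // its own KV cache and activations:
  std::vector<KVCache> caches(3);
  std::vector<Activations> activations(3);
  // Each (cache, activations) pair can be advanced independently.
  return 0;
}
```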

### Use the tokenizer in the Gemma object (or interact with the Tokenizer object directly)

You pretty much do only two things with the tokenizer: call `Encode()` to go
from string prompts to token id vectors, or `Decode()` to go from the token id
vectors output by the model back to strings.
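
A sketch of both directions - the `model.Tokenizer()` accessor and the
SentencePiece-style `Encode`/`Decode` signatures are assumptions based on the
calls in `run.cc`:

```cpp
#include <string>
#include <vector>

#include "gemma.h"  // gcpp::Gemma

// string -> token ids (accessor and signature assumed; see run.cc).
std::vector<int> Tokenize(gcpp::Gemma& model, const std::string& text) {
  std::vector<int> tokens;
  model.Tokenizer().Encode(text, &tokens);
  return tokens;
}

// token ids -> string.
std::string Detokenize(gcpp::Gemma& model, const std::vector<int>& tokens) {
  std::string text;
  model.Tokenizer().Decode(tokens, &text);
  return text;
}
```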

### The main entrypoint for generation is `GenerateGemma()`

Calling into `GenerateGemma()` with a tokenized prompt will 1) mutate the
activation values in `model` and 2) invoke `StreamFunc` - a lambda callback -
for each generated token.

Your application defines its own `StreamFunc` as a lambda callback to do
something every time a token string is streamed from the engine (e.g. print to
the screen, write data to disk, send the string to a server, etc.). You can
see in `run.cc` that the `StreamFunc` lambda takes care of printing each token
to the screen as it arrives.
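
A hedged sketch of wiring this up - the exact `GenerateGemma` parameter list
(the inference/runtime arguments in particular) is an assumption here, so copy
the real call from `run.cc`:

```cpp
#include <iostream>
#include <random>
#include <string>
#include <vector>

#include "gemma.h"  // gcpp::Gemma, GenerateGemma
#include "hwy/contrib/thread_pool/thread_pool.h"

// Streams each generated token to stdout. The argument order below is an
// assumption; run.cc shows the actual invocation.
void GenerateToStdout(gcpp::Gemma& model, const std::vector<int>& prompt,
                      hwy::ThreadPool& pool) {
  std::mt19937 gen(/*seed=*/42);  // randomness source for sampling

  // StreamFunc: invoked once per generated token; return true to continue.
  auto stream_token = [&model](int token, float /*probability*/) {
    std::string piece;
    model.Tokenizer().Decode(std::vector<int>{token}, &piece);
    std::cout << piece << std::flush;  // print each token as it arrives
    return true;
  };

  gcpp::GenerateGemma(model, prompt, /*start_pos=*/0, pool, stream_token, gen);
}
```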

Optionally, you can define `accept_token` as another lambda - this is mostly
for constrained-decoding use cases where you want to force the generation to
fit a grammar. If you're not doing this, you can pass an empty lambda as a
no-op, which is what `run.cc` does.
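
A toy sketch of such an `accept_token` lambda - the allowed-token set is a
stand-in for a real grammar, and the `(int) -> bool` shape is an assumption
based on the no-op that `run.cc` passes:

```cpp
#include <set>

// Toy stand-in for "token ids permitted by your grammar at this step".
std::set<int> allowed_tokens = {42, 43, 44};

// Returning false asks the sampler to reject the candidate token.
auto accept_token = [](int token) {
  return allowed_tokens.count(token) > 0;
};

// The no-op equivalent of what run.cc passes when unconstrained:
auto accept_all = [](int) { return true; };
```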

### If you want to invoke the neural network forward function directly, call the `Transformer()` function

For high-level applications, you might only call `GenerateGemma()` and never
interact directly with the neural network, but if you're doing something a bit
more custom you can call `Transformer()`, which performs a single inference
operation on a single token and mutates the activations and the KV cache as
the computation proceeds through the network.
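
A hedged sketch of a token-by-token loop over the forward pass -
`Transformer`'s real parameter types are internal to `gemma.cc`, so they are
templated away here and the call shape is an assumption:

```cpp
#include <cstddef>
#include <vector>

// Prefills a prompt one token at a time. Transformer's argument list is an
// assumption; see gemma.cc for the real signature.
template <class Weights, class Activations, class KVCache, class Pool>
void Prefill(const std::vector<int>& prompt, const Weights& weights,
             Activations& activations, KVCache& kv_cache, Pool& pool) {
  size_t pos = 0;
  for (int token : prompt) {
    // Each call runs one forward step, mutating the activations and
    // appending this position's keys/values to the KV cache.
    Transformer(token, pos, weights, activations, kv_cache, pool);
    ++pos;
  }
  // The activations now hold the state from which next-token logits are
  // computed.
}
```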

### For low-level operations, e.g. defining new architectures, call `ops.h` functions directly

You use `ops.h` if you're writing other NN architectures or modifying the
inference path of the Gemma model.
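
As a small illustration, a sketch calling an `ops.h`-style primitive - the
in-place `Softmax(float*, size_t)` shape is an assumption, and the real ops
are built with Highway SIMD, so check `ops.h` for the exact declaration and
namespace:

```cpp
#include <array>
#include <cstdio>

#include "ops.h"  // SIMD math primitives (built with Highway)

int main() {
  std::array<float, 4> logits = {1.0f, 2.0f, 3.0f, 4.0f};
  // Assumed in-place signature; normalizes the array into a distribution.
  Softmax(logits.data(), logits.size());
  for (float p : logits) std::printf("%.3f ", p);
  return 0;
}
```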

## Discord

We're also trying out a Discord server for discussion here -
https://discord.gg/H5jCBAWxAe