add using gemma as a library notes to DEVELOPERS

austinvhuang 2024-02-29 23:52:59 -05:00
parent ae7901c3f4
commit b841620d6b
1 changed files with 67 additions and 0 deletions


@@ -100,3 +100,70 @@ be exposed to the build system):
In the medium term, both of these will likely be deprecated in favor of
handling options at runtime, allowing for multiple weight compression schemes
in a single build and dynamically resizing the KV cache as needed.

## Using gemma.cpp as a Library (Advanced)
Unless you are doing lower-level implementation work or research, you can
think of `gemma.h` and `gemma.cc` as the "core" of the library from an
application standpoint.

You can regard `run.cc` as an example application that your own application
substitutes for, so the invocations into `gemma.h` and `gemma.cc` that you see
in `run.cc` are probably the functions you'll be invoking. You can find
examples of invoking the tokenizer methods and `GenerateGemma` in `run.cc`.

Keep in mind that gemma.cpp is oriented toward experimental / prototype /
research applications. If you're targeting production, there are more standard
paths to NN deployment via JAX / PyTorch / Keras.

### The `Gemma` struct contains all the state of the inference engine: tokenizer, weights, activations, and KV cache
`Gemma(...)` - the constructor creates a Gemma model object, which is a
wrapper around four things: the tokenizer object, the weights, the
activations, and the KV cache.

In a standard LLM chat app, you'll probably use a `Gemma` object directly. In
more exotic data-processing or research applications, you might work with the
weights, KV cache, and activations more directly rather than only through a
`Gemma` object (e.g. you might have multiple KV caches and sets of activations
for a single set of weights).
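
As a rough sketch, constructing the model might look like the following. The
constructor arguments mirror the `--tokenizer`, `--compressed_weights`, and
`--model` flags that `run.cc` accepts; the exact argument list here is an
assumption, so check `gemma.h` for the signature in the version you build
against.

```cpp
#include <thread>

#include "gemma.h"  // gcpp::Gemma and friends
#include "hwy/contrib/thread_pool/thread_pool.h"

int main() {
  // Thread pool used for weight loading and inference.
  hwy::ThreadPool pool(std::thread::hardware_concurrency());

  // Placeholder paths; the argument list is an assumption modeled on
  // run.cc's flags rather than the authoritative signature.
  gcpp::Path tokenizer_path;
  tokenizer_path.path = "tokenizer.spm";
  gcpp::Path weights_path;
  weights_path.path = "2b-it-sfp.sbs";

  gcpp::Gemma model(tokenizer_path, weights_path, gcpp::Model::GEMMA_2B, pool);
  return 0;
}
```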

### Use the tokenizer in the `Gemma` object (or interact with the tokenizer object directly)
You'll pretty much only do two things with the tokenizer: call `Encode()` to
go from string prompts to token id vectors, and `Decode()` to go from the
token id vectors the model outputs back to strings.
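
For example (a sketch assuming `model.Tokenizer()` exposes the
SentencePiece-style `Encode`/`Decode` calls that `run.cc` uses):

```cpp
#include <string>
#include <vector>

// Round-trip a prompt through the tokenizer. Assumes `model` is a
// constructed gcpp::Gemma whose Tokenizer() returns a SentencePiece-style
// processor, as in run.cc.
void TokenizerExample(gcpp::Gemma& model) {
  std::vector<int> tokens;
  model.Tokenizer()->Encode("Write a haiku about geese.", &tokens);

  std::string text;
  model.Tokenizer()->Decode(tokens, &text);
  // `text` should now match the original prompt (up to normalization).
}
```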

### The main entrypoint for generation is `GenerateGemma()`
Calling into `GenerateGemma()` with a tokenized prompt will 1) mutate the
activation values in `model` and 2) invoke `StreamFunc`, a lambda callback,
for each generated token.

Your application defines its own `StreamFunc` lambda callback to do something
every time a token string is streamed from the engine (e.g. print to the
screen, write data to disk, send the string to a server, etc.). You can see in
`run.cc` that the `StreamFunc` lambda takes care of printing each token to the
screen as it arrives.

Optionally, you can define `accept_token` as another lambda. This is mostly
for constrained-decoding use cases where you want to force the generation to
fit a grammar; if you're not doing that, you can pass an empty lambda as a
no-op, which is what `run.cc` does, as in the sketch below.
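
Putting the two callbacks together, a generation call in the style of `run.cc`
might look like this sketch. The parameter list is an assumption; the
authoritative signature is the `GenerateGemma` declaration in `gemma.h`.

```cpp
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Generate from a tokenized prompt, streaming tokens to stdout. Assumes
// `model`, the inference settings `args`, and the thread pools are already
// set up; the parameter order follows run.cc but may differ by version.
void GenerateExample(gcpp::Gemma& model, const gcpp::InferenceArgs& args,
                     const std::vector<int>& prompt_tokens,
                     hwy::ThreadPool& pool, hwy::ThreadPool& inner_pool) {
  std::mt19937 gen(42);  // source of randomness for sampling

  // StreamFunc: invoked once per generated token; return true to continue.
  auto stream_token = [&model](int token, float /*prob*/) {
    std::string piece;
    model.Tokenizer()->Decode(std::vector<int>{token}, &piece);
    std::cout << piece << std::flush;
    return true;
  };

  // accept_token: accept every candidate token (a no-op, as in run.cc).
  // A constrained decoder would return false for tokens that violate
  // its grammar.
  auto accept_token = [](int /*token*/) { return true; };

  gcpp::GenerateGemma(model, args, prompt_tokens, /*start_pos=*/0, pool,
                      inner_pool, stream_token, accept_token, gen,
                      /*verbosity=*/0);
}
```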

### To invoke the neural network forward function directly, call `Transformer()`
For high-level applications, you might only call `GenerateGemma()` and never
interact directly with the neural network, but if you're doing something more
custom you can call `Transformer()`, which performs a single inference
operation on a single token, mutating the activations and the KV cache through
the neural network computation.
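
For example, a prefill loop that steps the network manually might look like
the sketch below. The argument names (weights, activations, KV cache) are
assumptions based on the description above; the actual `Transformer()`
signature lives in `gemma.cc`.

```cpp
#include <vector>

#include "hwy/contrib/thread_pool/thread_pool.h"

// Feed a prompt through the network one token at a time, mutating the
// activations and KV cache at each step. The parameter types are left
// generic here because the concrete types are internal to gemma.cpp.
template <typename Weights, typename Activations, typename Cache>
void PrefillExample(const std::vector<int>& prompt_tokens,
                    const Weights& weights, Activations& activations,
                    Cache& kv_cache, hwy::ThreadPool& pool,
                    hwy::ThreadPool& inner_pool) {
  size_t pos = 0;
  for (int token : prompt_tokens) {
    // Single-token inference step: mutates activations and kv_cache.
    Transformer(token, pos, weights, activations, kv_cache, pool, inner_pool);
    ++pos;
  }
  // After the loop, `activations` holds the state for sampling the next
  // token, and the KV cache covers positions [0, prompt_tokens.size()).
}
```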

### For low-level operations or defining new architectures, call `ops.h` functions directly
You use `ops.h` if you're writing other NN architectures or modifying the
inference path of the Gemma model.
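
As a purely illustrative sketch: `ops.h` provides the SIMD primitives the
inference path is built from (matrix-vector products, softmax, normalization,
and so on). These are compiled through Highway's per-target dispatch, so in
practice calls like the one below live inside the `HWY_NAMESPACE` blocks you
see in `gemma.cc`; the exact signature shown is an assumption.

```cpp
// Inside a HWY_NAMESPACE block, as in gemma.cc (illustrative only):
float logits[4] = {2.0f, 1.0f, 0.5f, -1.0f};
Softmax(logits, 4);  // in-place softmax, assuming a (float*, size_t) signature
// `logits` now holds a probability distribution summing to 1.
```
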
## Discord
We're also trying out a Discord server for discussion here:
https://discord.gg/H5jCBAWxAe