RAG-retrieval-augmened-generation.md

https://learn.deeplearning.ai/courses/retrieval-augmented-generation

by Zain Hassan

Module1

RAG (Retrieval Augmented Generation) gives the LLM direct access to customer-supplied documents - specialized knowledge that was not part of the training data. Not everything is known during model training; the missing links are

  • private databases (stuff that people/enterprises do not make public)
  • hard to access information
  • knowledge of recent events

The process of answering a specific question is divided into stages

  • research / retrieval (here RAG comes in; this step is called Retrieval). This stage adds data to the prompt (this becomes the augmented prompt) that will be considered during the next stage:
  • formulating a response based on the research results (this step is called Generation)
  • The Retriever analyses the meaning of the prompt and derives queries for the knowledge base - which returns matching documents in order of relevance (the relevance score is the ranking). The Retriever takes the results with the top scores and adds them to the augmented prompt.
  • how many documents to add to the prompt? Tradeoff - you don't want too large a prompt (context size limit and processing cost)

Applications of RAG

  • code generation: add relevant code to the augmented prompt from your project, so that the system gets up to speed with your code / coding conventions
  • chat bot: add information specific to your domain to the augmented prompt
  • health/legal applications/personalized assistants: the augmented prompt gets added information from a customer's background / legal case history / personal schedule etc.
  • AI assisted web search

Things you can do with RAG

  • you can update the data in the vector database (you can't retrain your model, so this allows for a good deal of flexibility).
  • the augmented prompt can include references (this allows the system to cite its sources!)
  • schema of the augmented prompt: f"""Respond to the following prompt {user_prompt} using the following information to help you answer {retrieved_documents_with_citations}"""
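
A minimal sketch of that schema in Python, assuming a hypothetical `retrieve(query, top_k)` helper that returns `(text, source)` pairs from the knowledge base; the helper and the prompt wording are illustrative, not the course's exact code:

```python
def build_augmented_prompt(user_prompt, retrieve, top_k=3):
    """Fetch the top-k most relevant chunks and fold them into the augmented prompt."""
    hits = retrieve(user_prompt, top_k)   # hypothetical retriever: [(text, source), ...]
    retrieved = "\n".join(f"[{i+1}] {text} (source: {source})"
                          for i, (text, source) in enumerate(hits))
    return (f"Respond to the following prompt: {user_prompt}\n"
            f"Use the following information to help you answer, and cite it by number:\n"
            f"{retrieved}")
```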

On LLM's.

  • they choose the next most probable word, based on the prompt + words generated up to a given moment

  • The choice of the next word heavily depends on already chosen tokens (this is called auto-regressive behavior - heavily influenced by the previous choices)

    • training of LLM was done on data that does not include company specific knowledge, so the prediction in this context is prone to error (hallucinations are more likely)
    • that's why augmented prompt becomes important to the result of the LLM!
    • the size of the augmented prompt matters: too much text added - you run into problems with the context window size, also processing of longer prompt takes more resources. (me: constructing data stored in / added by RAG is a subtle art, like most things with prompting...)
  • llm api's have two variants of input

    • linear text
    • a json log of the conversation with the following structure
      [
        { "role": "user", "content": "who won the english cup in 1974?" },
        { "role": "assistant", "content": "Liverpool didn't win the english cup in 1974." },
        ...
      ]

Module2 - retrievers in more detail

Retriever architectures:

  • keyword based search (sensitive to the exact wording of the prompt)
  • semantic based search: search for articles that match the meaning of the prompt (more sensitive to the wider meaning)
  • hybrid search: run both keyword based and semantic based search pipelines and combine the results

Search results from the knowledge base have both a ranking and metadata (which includes some classification of who this result is relevant to). The retriever filters out results not relevant to the client, according to this metadata.

metadata filtering

Articles in the document DB usually have some kind of additional information: creation/modification date, title, author, access privileges, region where article appeared in, maybe some kind of additional classification

With metadata filtering you match the search query against this metadata info, not the document text

Says the metadata filtering is usually done by a set of external criteria (for example: if this is a 'free' vs 'paid' account, then exclude all DB entries marked as 'paid', or filter by the region of the client). Says metadata filtering does not look at the document data, and that there is no overlap with the ranking score.

(q: date of article may be very relevant to a specific prompt?)

keyword search

Score documents based on the number of occurrences of a search term in a document

  • store the count of words in each document (excluding common stop-words)
  • also keep an inverted index: for each significant word, keep a list of documents where it appears (to aid the lookup of candidate documents)
  • Possible ways to score (each option improves on the previous one)
    • Frequency based: for each occurrence of a search word/token in a document - add 1 to the search score of that document (not very good: long documents get higher scores just because they are longer). TF(word, doc) = number of occurrences of word in doc
    • Normalize by length of document: divide the frequency score by the number of words in the document (so longer documents no longer win just by having more occurrences). Problem: all keywords are awarded equally - including stop words like 'the'.
    • Weight by tf-idf: score(w, doc) = TF(w, doc) * log(IDF(w)), where IDF(w) = total number of documents / number of documents that contain w. The log awards less frequent words over common words.

TF-IDF Term Frequency-Inverse Document Frequency - wiki
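
A minimal sketch of tf-idf scoring over a tiny in-memory corpus, just to make the formula above concrete; the tokenization and stop-word handling are simplified assumptions, not the course's code:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_scores(query, documents):
    """Score each document as the sum over query terms of TF(term, doc) * IDF(term)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term] / max(len(tokens), 1)        # length-normalized term frequency
            df = sum(1 for t in tokenized if term in t)    # number of documents containing the term
            idf = math.log(n_docs / df) if df else 0.0     # rare terms get a higher weight
            score += tf * idf
        scores.append(score)
    return scores
```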

BM25 - BM for best matching - improvement over TF-IDF (is used more commonly than TF-IDF)

  • what can be improved over tf-idf?
    • term frequency: a document with 20 hits for one of the search terms is not always the better result, need something better ('term frequency saturation' is the term for damping the weight of a very frequent word)
    • longer documents are still treated worse by tf-idf ('length normalization' is the term for the process)

BM25 has parameters for both frequency saturation and length normalization, so you can adjust it to your data (doesn't go into the details of BM25)
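
A minimal sketch of the BM25 scoring formula, with k1 controlling term-frequency saturation and b controlling length normalization; the parameter defaults here are common conventions, not values given in the course:

```python
import math
from collections import Counter

def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score each document with BM25; k1 saturates term frequency, b normalizes by length."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in tokenized_docs if term in d)
            if df == 0:
                continue
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # BM25's smoothed idf
            tf = counts[term]
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores
```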

Semantic search

keyword search problem: if the query does not include the exact keywords used in the documents, then there are no search results.

  • For vector similarity they use cosine similarity or the vector dot product (not Euclidean distance)

  • Semantic search - you can map a keyword to an n dimensional vector. They say that semantically related words get close enough vector values! (just like with Word2Vec or BERT neural networks) - they even compute the vector over a whole sentence - they say it still makes sense!!! There are several search strategies for semantic search

  • compute the vector representation of the search query, find documents where the document produces a close enough vector (computing the vector over the whole document)

  • for each query term, find close enough words in the vector model, then use tf-idf to search for documents
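
A minimal sketch of the first strategy using cosine similarity over embedding vectors; `embed()` is a stand-in for whatever sentence-embedding model is used (e.g. a BERT-based encoder), not a specific API from the course:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query, documents, embed, top_k=3):
    """Embed the query, embed each document, return the top_k closest documents."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(reverse=True, key=lambda pair: pair[0])
    return scored[:top_k]
```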

Intuition about vector models

They train a neural network based on 'positive pairs' - two sentences that have a close meaning - and 'negative pairs' - two sentences that are far apart in meaning. Contrastive training pushes the model to assign close multidimensional vectors to positive pairs and distant vectors to negative pairs.

Hybrid search

  • for example: do both keyword and semantic search; now you get a ranked list of results for each search strategy
  • reduce each result list by applying the metadata filter
  • merge both search results by applying rank fusion:
    • pick the highest ranking results from each search (example: compute the reciprocal of the rank, 1/(k + rank_in_result_list), where k is a parameter and rank_in_result_list is n for the result that occurred nth in the list); if documentA occurs in both result lists, then it gets the sum of the scores from each list (sketched below).
    • important: tune k separately for each search strategy! Each strategy has its tradeoffs
    • merge by score

Later module: says the most commonly used mix is 0.25 - 25% from semantic search, 75% from keyword search.
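
The rank fusion mentioned above, as a minimal sketch; k=60 is a common default from the literature, not a value prescribed by the course:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists: each document gets the sum of 1/(k + rank) over the lists it appears in."""
    fused = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# usage: merge a keyword-search ranking with a semantic-search ranking
keyword_hits = ["doc3", "doc1", "doc7"]
semantic_hits = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```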

Evaluation

Now you have lots of parameters to tune, need a way to evaluate the results (otherwise tuning will not make sense!)

  • Have a list of queries and expected/optimal search result lists

  • measure how actual results differ from the recorded list (precision & recall) /That's the problem here: you need a lot of ground truth - recorded results/

    Precision-measure := (number of documents in the search result that occur in the recorded list, i.e. the relevant search hits) / (total retrieved documents)

    the precision score gets lower if irrelevant documents are returned

! assesses ranking effectiveness, checks the ratio of relevant to irrelevant results !

Recall := (number of relevant documents returned by the search - relevant means they occur in the recorded result list) / (number of relevant documents in the recorded list)

 recall gets lower if relevant documents are missing from the actual results.

! Recall / Recall-K is the most cited metric ! Also: Precision and Recall are tied to the size K of the result list (Precision-K, Recall-K)


Average-Precision-K(prompt) - take Precision-k for each k in [1..K] where position k holds a relevant document returned for the prompt. Sum these values and divide by the number of relevant docs in [1..K].

Mean-Average-Precision-K - compute Average-Precision-K for a large set of prompts and take the mean value (gives a global view).


Reciprocal-rank(prompt) = 1 / rank-of-first-relevant-doc-in-search-results-for-prompt

Mean reciprocal rank - compute the average of Reciprocal-rank(prompt) over many prompts

! shows how good the model is at the top of the ranking !
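
A minimal sketch of Precision-K, Recall-K, and mean reciprocal rank against a recorded ground-truth list; the data structures (plain lists and sets of document ids) are an assumption for illustration:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are in the recorded relevant set."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of the recorded relevant documents that show up in the top-k results."""
    top_k = retrieved[:k]
    return sum(1 for doc in relevant if doc in top_k) / len(relevant)

def mean_reciprocal_rank(results_per_prompt, relevant_per_prompt):
    """Average of 1 / rank-of-first-relevant-document over all prompts."""
    total = 0.0
    for retrieved, relevant in zip(results_per_prompt, relevant_per_prompt):
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(results_per_prompt)
```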

Module3 - production scale retrievers

With lots and lots of data you need a vector database (where the key is the multi-dimensional embedding vector)

K-nearest-neighbors(query) :

  • compute vector for query
  • find distance of query vector to all document vectors
  • return k closest vectors

Problem: lots of computations per query (for each document in DB!)

Approximate-nearest-neighbors(query)

  • compute a neighborhood graph over the vectors in the database.
    • compute the distance for each pair of vectors
    • In the graph: keep connections to other nodes (?) (I understood they keep only the n closest nodes for each node)
  • Now when searching for the nearest neighbor to vector V:
    • start a walk over the graph at some random element.
    • for all linked nodes: compute the distance to V, pick the node with the smallest distance
    • repeat until the nearest node is found (no more improvement)
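
A minimal sketch of that greedy walk over a precomputed neighborhood graph; the graph is assumed to map each node id to its n closest neighbors, which matches my reading above but is a simplification of real ANN implementations:

```python
import random

def greedy_ann_search(query_vec, vectors, neighbors, distance, start=None):
    """Walk the neighborhood graph greedily toward the query vector.

    vectors:   dict node_id -> vector
    neighbors: dict node_id -> list of node_ids (the precomputed n closest nodes)
    distance:  callable(vec_a, vec_b) -> float, smaller means closer
    """
    current = start if start is not None else random.choice(list(vectors))
    current_dist = distance(query_vec, vectors[current])
    while True:
        # look at all linked nodes and move to the closest one, if it improves
        best, best_dist = current, current_dist
        for node in neighbors[current]:
            d = distance(query_vec, vectors[node])
            if d < best_dist:
                best, best_dist = node, d
        if best == current:          # no more improvement: local optimum reached
            return current, current_dist
        current, current_dist = best, best_dist
```

HNSW (next) repeats this walk per layer, starting each layer's walk at the node where the sparser layer above stopped.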

Hierarchical-Navigable-Small-World(query)

  • here they have several neighborhood graphs:
    • layer1 - neighborhood graph over all documents (nodes) in the DB
    • layer2 - neighborhood over smaller subset (number-of-nodes/10)
    • layer3 - neighborhood over smaller subset (number-of-nodes/100)
  • search:
    • start over nodes of layer3, until no improvement found
    • start over nodes of layer2 (beginning with the same node that previous step ended)
    • start over nodes of layer1 (beginning with the same node that previous step ended)

Problems:

  • optimal result is not guaranteed,
  • precomputing the proximity graphs is heavy

Weaviate vector db - open source vector db used in course

/lots of explanation of the interface/

Chunking - limit the size of documents kept in vector db. Why?

  • the context window size of the LLM is limited; if the text added by the RAG step exceeds the token limit, that is no good
  • keep the added text relevant to the query.
  • Chunking strategy:
    • preferred chunk size? Paragraph or page. Says a fixed size like 250-500 characters is a good start (but in the lab he divides into chunks of a hundred words, not characters)
    • also says to allow overlaps between neighboring chunks (of some 10% of the chunk size) - this is good for search relevancy.
    • or you can split the chunk on newlines (split by paragraphs); or split according to the domain (python code by function, html by paragraph tags, etc.)
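
A minimal sketch of fixed-size chunking by word count with roughly 10% overlap, following the lab's word-based splitting; the chunk size and overlap values are just the examples mentioned above:

```python
def chunk_by_words(text, chunk_size=100, overlap=10):
    """Split text into word chunks of chunk_size, overlapping neighbors by `overlap` words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks
```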

Semantic chunking - problem with fixed sized chunks: you cut off sentences in the middle. Semantic chunking deals with this problem:

  • subdivide the text into sentences. Compute the vector of each sentence. If an adjacent sentence's vector is close enough, add it to the same chunk. Create a new chunk once the difference between adjacent sentences crosses a threshold.
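
A minimal sketch of that idea, with `embed()` and `similarity()` again standing in for a sentence-embedding model and a similarity measure; the threshold is chosen arbitrarily for illustration:

```python
def semantic_chunks(sentences, embed, similarity, threshold=0.7):
    """Group consecutive sentences into a chunk while neighbors stay above the similarity threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if similarity(prev_vec, vec) >= threshold:
            current.append(sentence)          # still semantically close: same chunk
        else:
            chunks.append(" ".join(current))  # meaning shifted: start a new chunk
            current = [sentence]
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```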

Context aware chunking - tell an LLM to subdivide the text into chunks (text portions with similar meaning). The LLM is also tasked with creating a summary of the chunk - this summary is appended to the chunk itself, to increase relevance of the text in the chunk. (this costs a lot, but is very effective)

Summary: start with a simple chunking strategy. Then experiment with more expensive advanced techniques on a subset of data - and check the resulting metrics for any improvement (to see if the added cost is worth it)


Query rewriting - make sense of a messy prompt / rewrite it before it gets to retriever.

llm prompt template """The following prompt was submitted by the user. Rewrite the prompt to optimize it for searching the medical database by doing the following:

  • clarify ambiguous phrases
  • use medical terminology where applicable
  • add synonyms that increase odds of finding matching documents
  • remove unnecessary or distracting information

{user_prompt}"""

Named entity recognition - identify/categorize info from the query, to enable 'targeted search + filtering'. Here classification terms are added to the query (a model called GLiNER does that). Example:

"I read the Great Gatsby [book] by F. Scott Fitzgerald [author]" - the marked categories are added by the model. Says retrieval works better this way!

/later: that's the most valuable trick here/

Hypothetical document embedding

  • an LLM is asked to create some answer for the query (a 'hypothetical document' to answer the query)
  • get vector for 'hypothetical document' and use it for vector lookup.

(??? what about wrong guesses by the LLM, never happened ? )

Cross encoder - for all documents in the knowledge base: prepend the prompt to the document, compute the vector over the combined data (if the prompt matches the document, the result is a high score (?)). Problem: you need to process a lot of documents, and this doesn't scale.

ColBERT - here you can use precomputed vectors for each word of the document. For each word of the prompt, compute the max over document words of score(prompt[i], document[j]) - i.e. pair each prompt word with its best-matching document word. Sum these max scores up (the idea is to find words in the document that pair well with words in the prompt). Problem: need to store lots of vectors per document!

Reranking - processes the chunks returned by the lookup and ranks them. The chunks are then sent to the LLM according to the new order. Reranking uses the cross encoder architecture! Now cross encoding makes more sense - it is applied only over the limited set of chunks returned by the lookup. (says that's the first thing to try when trying to improve the relevance of search results (?))
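
A minimal sketch of reranking retrieved chunks with a cross-encoder, assuming the sentence-transformers library and one of its public ms-marco cross-encoder checkpoints; treat the model name as an example, not the course's choice:

```python
from sentence_transformers import CrossEncoder

def rerank(query, chunks, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=5):
    """Score every (query, chunk) pair with a cross-encoder and return the top_k chunks."""
    model = CrossEncoder(model_name)
    scores = model.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```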

Module4 - LLM, how they work and how to deal with them

  1. each token is assigned two vectors
    • an embedding vector embedding[], which stands for the meaning (embedding via BERT, not Word2Vec)
    • a second vector that encodes the position of the word in the context window
  2. the token passes through the first stage of the transformer, which consists of
    • an Attention phase. Attention[i][j] is a 2-d matrix of scores: how much does word[i] influence word[j] (n*n for a context window of size n)
    • a feed forward phase - most parameters are here. embedding[] and Attention[][] are input to feed forward networks; the output is a new embedding2[] vector with a refined meaning for each embedding.
  3. the embedding2[] vector passes through the second stage of the transformer, a different Attention phase followed by a feed forward phase, which produces embedding3[]
  4. embedding3[] passes through the third stage of the transformer, a different Attention phase followed by a feed forward phase, which produces embedding4[] ...
  5. take embeddingN[] to predict the next word; the next word will be added to the context window.
  • actually the model predicts a set of candidate words; each suggestion has a probability called 'confidence'. The language model has to pick one of these suggestions as the next word to be added to the context window
    • 'temperature' is a parameter of the LLM. It determines the strategy of picking the next word from suggestions. Temperature=0 - always pick the candidate with the highest probability (also known as 'greedy decoding' - use that for generating python code!). temperature=1 - have a wider range of candidates for picking the next word at random (that is to be 'more creative'). Too high 'temperature' like temperature=5 will just pick gibberish.
  • Additional sampling tricks (see the sketch after this list): Top-K - limit the choice of the next token to the K most likely suggestions
    • Top-P - limit the choice to the smallest set of the most likely tokens whose cumulative probability reaches P (like P=0.9)
    • Repetition penalty during sampling of next token: to avoid repetitive phrases.
    • Logit bias: for some tokens, assign a score that is to be added to the real score for these tokens (preferential treatment of some tokens that is, like a disincentive for swear words)
    • good combination of settings:
      • temperature: 0.8 # slightly conservative in token choice
      • top-p: 0.9 # avoid choosing from tail of distribution
      • repetition-penalty: 1.2 # lightly penalize repetition
  • One token is called the end-of-completion token; it says when to break out of the loop and present the result to the user.
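
A minimal sketch of passing the sampling settings above through an OpenAI-style chat completions call, using the openai Python client; the model name is a placeholder, and note that this API exposes repetition control as frequency_penalty / presence_penalty rather than a "repetition-penalty" parameter:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",                      # placeholder model name
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what a retriever does in RAG."},
    ],
    temperature=0.8,        # slightly conservative token choice
    top_p=0.9,              # avoid sampling from the tail of the distribution
    frequency_penalty=0.2,  # lightly penalize repetition
)
print(response.choices[0].message.content)
```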

Choosing an LLM

  • small models (1-10 billion params) vs large models (100-500 billion params) - the large ones cost more
  • context window size, very important
  • costs (per token). Sometimes input tokens are priced differently from output tokens.
  • knowledge cutoff date (newer and bigger models cost more)

Metrics - lots of benchmarks on LLM's - use this to choose the right model for the task !

  • Automated benchmark, check LLM's on tasks that can be automatically checked (like multiple choice benchmarks, code with expected results, etc.)
    • MMLU leaderboard, MMLU on wikipedia - Wikipedia says these have been superseded, as the checks are too easy.
    • MMLU-Pro MMLU-ProX, etc. - successor benchmarks of the same type
  • human evaluated benchmarks - have N LLM's do a task and a human evaluate the results - they crowd-source the evaluation task - wikipedia says companies love it, but there are doubts (is vote rigging possible?) LMArena does this
  • llm as a judge benchmarks, cheap and flexible way of evaluating (trick: models as judge will favor models from their own company/family, need to avoid this)
  • code generation benchmarks - lots of them are listed here
    • says: check with developer forums to verify if real world performance is good

What can be problems?

  • data contamination (good performance on benchmark, as the check of the benchmark and its result are part of the training data)
  • saturated benchmarks: all leading models perform near the top of the benchmark (at first they got low ranks, but then they got better) - happens a lot with automated benchmarks

Prompt engineering (the black art...)

  • format of prompt: most often a json with a dialog - that's the OpenAI prompt format. You pass this json via the REST api when asking a question; the last user question in the conversation is appended to the end of the messages array.
{
  "messages": [
    { "role": "system",
      "content": "lalala"
    },
    { "role": "user",
      "content": "lalala"
    }
    ...
  ]
}

This json is turned into a text with tags for the LLM, the LLM is trained to react to/understand these tags.

<|begin of text|><|start_header_id|>system<|end_header_id|>
...system prompt comes here<|eot_id|>

<|start_header_id|>user<|end_header_id|>
what is the capital of canada<|eot_id|>

...
  • system prompts are added to every prompt; this is where you set high level instructions / set the tone and style of the communication.

    • system prompts for LLM's usually contain the knowledge cutoff date + current date ; this helps them to judge if their information is not out of date for the task. (or if they need to seek external tools to look for additional clarifying info)
    • you can tell the LLM to answer in great detail - or answer succinctly and summarize
    • can tell the LLM how to use the retrieved documents
      • use documents to answer the question
      • or judge if a document is relevant - if that is what's relevant for your task
      • tell it to cite documents in the response or not.
  • augmented prompt, in a RAG system this contains lots of information.

    • it looks like this
      • system prompt
      • conversation history
      • retrieved documents
      • most recent user prompt (query) that you want the LLM to reply to
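
A minimal sketch of assembling that message list for a chat API call; the helper names (`history`, `retrieved_docs`) are placeholders for whatever your pipeline provides:

```python
def build_messages(system_prompt, history, retrieved_docs, user_query):
    """Assemble the augmented prompt: system prompt, conversation history,
    retrieved documents, then the most recent user query."""
    docs_block = "\n\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(retrieved_docs))
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # earlier {"role": ..., "content": ...} turns
    messages.append({
        "role": "user",
        "content": f"Use the following documents to answer, and cite them by number:\n"
                   f"{docs_block}\n\nQuestion: {user_query}",
    })
    return messages
```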

Other prompt techniques:

  • In context learning : prompt that includes previous question-response pairs, so that the LLM 'gets' the structure of the response.
    • if this has more than one example this is called 'few shot learning'; with one example this is called 'one shot learning'
    • or better: you can pull in relevant pieces of conversation via RAG, that's a better kind of in context learning.

Encouraged reasoning

  • Tell the LLM to think 'step by step' (chain of thought) or tell it to 'think aloud' before giving an answer.
  • Tell it, as part of the user prompt, that reasoning between <scratchpad></scratchpad> tags is not part of the final answer - so it has a place to organize its thoughts! This also helps to understand the problem when bad answers are given!
    • now many LLM's have these instructions as part of their training, these are called reasoning models. They are already trained to use tags like 'scratchpad' to organize their reasoning. Reasoning models are slower and cost more to run, because of all of the intermediate thinking/processing.

Big thing: many reasoning models don't need all this prompting / in-context learning. (?) Here you need to set specific goals and instructions!

  • LLM providers of reasoning models may provide instructions on how to prompt them!

Context window management gets important, as all of this prompting may fill up the context window too quickly all by itself. You may have to limit the prompt length, like

  • dropping old messages from the prompt
  • summarizing older messages.
  • rag: only include the RAG retrieval response for the last prompt; remove RAG retrieval responses from older questions/prompts
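
A minimal sketch of one trimming strategy - dropping the oldest turns until the message list fits a token budget; it assumes the first message is the system prompt, and the token counting is a crude word-count approximation, not a real tokenizer:

```python
def trim_history(messages, max_tokens=4000):
    """Keep the system prompt and the most recent turns that fit the (approximate) token budget."""
    def approx_tokens(msg):
        return len(msg["content"].split())   # crude stand-in for a real tokenizer

    system, turns = messages[0], messages[1:]
    budget = max_tokens - approx_tokens(system)
    kept = []
    for msg in reversed(turns):              # walk from newest to oldest
        cost = approx_tokens(msg)
        if cost > budget:
            break                            # older messages beyond this point are dropped
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```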

Experiment and measure! otherwise you start to guess too much.

Hallucinations.

LLM's guess the next word, with an element of probability, so they are bound to say wrong stuff - hallucinate. No general fix for that. RAG helps to reduce hallucinations, by grounding the LLM context window in correct data.

Another approach: consistency checks - have the LLM create output for the prompt several times. Check all the output versions against each other. This assumes that hallucinations are produced inconsistently, so the odd answer out of several would be counted as wrong. (costs a lot and is not reliable) ?how do they do the check, with another LLM?

Or rather check the claims of the LLM against a knowledge system:

  • prompt addition: "make factual claims only based on information retrieved from a knowledge base / RAG system "
  • prompt the LLM to cite the sources of its claims. (risk: what if the citations are hallucinated?)
    • At least citations make it easier to verify the claims by Humans.
    • there is an open source system called ContextCite to check, how well a given text is backed by citations (?)

The ALCE benchmark checks, how accurately an LLM is citing sources.

Evaluating the LLM stage in a RAG pipeline. Need to check if

  • LLM finds relevant information from retrieved document set
  • the response based on this info is clear, contains the relevant information, cites sources, and ignores irrelevant info. (how to quantify these subjective criteria?) Somehow they use other LLM's to evaluate the response...
    • There is a library for automated RAG evaluation: Ragas. For its ResponseRelevancy metric: an LLM tries to guess the original prompt, given the output. Then they compute the similarity between the embedding vectors of the original prompt and the guessed prompt - this way they get a score that can be averaged across responses. (and other such tricks, where an LLM is used as part of the process)
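
A minimal hand-rolled sketch of that ResponseRelevancy idea (not the Ragas API itself): ask an LLM to guess questions back from the answer, then average the embedding similarity to the original prompt. `generate_questions`, `embed`, and `similarity` are placeholders for your LLM and embedding calls:

```python
def response_relevancy(original_prompt, answer, generate_questions, embed, similarity, n=3):
    """Average similarity between the original prompt and n prompts guessed back from the answer."""
    guessed = generate_questions(answer, n)          # LLM call: "what question does this answer?"
    prompt_vec = embed(original_prompt)
    sims = [similarity(prompt_vec, embed(q)) for q in guessed]
    return sum(sims) / len(sims)
```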

Agentic workflows with RAG - have different LLM's do a specific task of processing the workflow. Example Agentic workflow

  • router LLM decides if query needs RAG processing or not
  • in case of RAG processing: once documents have been retrieved from the knowledge base, have a special evaluator LLM decide if the retrieved documents are sufficient, if not then retrieve additional documents
  • once the LLM has processed the augmented prompt that includes the retrieved documents + user query: have a special citation LLM add citations to the result.

Several patterns with agentic workflows:

  • sequential workflow: a series of steps (one after the other), where each step is done by an LLM
  • conditional workflow, a router LLM decides, which alternative workflow should process this task
  • iterative workflow forms a loop, then you need an LLM to decide if to continue with the loop or if to break out of it
  • parallel workflows have an orchestrator LLM split up the task into subtasks, send each subtask to its own pipeline, then have a *synthesizer LLM* combine all of the subtask results into the final result
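
A minimal sketch of the conditional (router) pattern; `ask_llm` and `retrieve` are placeholders for whatever chat-completion and retrieval calls you use, and the routing prompt wording is illustrative:

```python
def answer_with_routing(user_query, ask_llm, retrieve, top_k=3):
    """Router LLM decides whether the query needs RAG; if so, retrieved documents are added."""
    decision = ask_llm(
        f"Does the following question require looking up documents in the knowledge base? "
        f"Answer only YES or NO.\n\nQuestion: {user_query}"
    )
    if decision.strip().upper().startswith("YES"):
        docs = retrieve(user_query, top_k)
        context = "\n\n".join(docs)
        return ask_llm(f"Answer using only these documents:\n{context}\n\nQuestion: {user_query}")
    return ask_llm(user_query)   # no retrieval needed: answer directly
```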

Fine Tuning - retrain an LLM with your own data; this is called supervised fine tuning. You do this if you want an LLM to be an expert in an additional domain, such as legal stuff. (can work well, but can also fail spectacularly, like lowering performance). Says this is used to adapt smaller models to particular tasks...

  • This needs a labeled training set (each training example with a recorded best answer)
  • this data is used for an additional supervised training step (supervised, because the best answer is known for each training example)

Is fine tuning a replacement for RAG? Says RAG is good at injecting knowledge, while fine tuning is for domain adaptation (so that the thing becomes an expert in some particular area) ... sometimes you can find a ready-made fine tuned model!

Module5 - RAG systems in production

problems during production:

  • scaling the system to more load
  • latency issues can also come up
  • suddenly processing cost also becomes important,
  • functionality: users can ask stuff that was not prepared for beforehand! Also real data is messier, with more variants, input formats, etc.
  • security issues are suddenly very important (like prompt injection)

Metrics is key during production

  • perf. counters (latency/throughput, resource usage: memory, compute usage)
  • quality metrics: (user feedback, recall)
  • logs: follow how a request is processed
  • Experimentation: can you evaluate how changes deployed during prod affect output quality? Are eval stages part of the production process?

Distinguishes between the following as applied to per component scope vs system-wide scope

  • 'code based eval' - some prometheus counter incremented by the software
  • 'LLM as a judge'
  • 'human eval' - user feedback in the UI (thumbs up, thumbs down) or Human data annotation (costly, but useful for evaluating recall vs precision)

Observability platforms for LLM's

  • Phoenix by Arize, gathers traces for each component stage on each request, as well as input and output of each stage (aggregates/views opentelemetry spans/traces)
  • for system-wide counters (like throughput/latency, memory consumed/tokens used) better use Grafana dashboards.

Custom dataset - collect logs en masse, to create data sets for assessment (?)
