Skip to content

Instantly share code, notes, and snippets.

@celeroncoder
Last active October 23, 2025 15:53
Show Gist options
  • Select an option

  • Save celeroncoder/d166d8a3f5d4e7e77b6dd7ff0736df4e to your computer and use it in GitHub Desktop.

Select an option

Save celeroncoder/d166d8a3f5d4e7e77b6dd7ff0736df4e to your computer and use it in GitHub Desktop.

Offline RAG Assistant: A Comparative Benchmark Report

1. Comparative Benchmark Analysis

A critical weakness is the lack of direct comparative benchmarks against the three most relevant alternative frameworks.

1.1. llama.cpp

llama.cpp is the standard for local LLM inference and serves as a key performance baseline. When it comes to prompt processing, llama.cpp significantly outperforms the assistant, achieving 137-189 tokens/s in batch mode compared to the assistant's 8.10 tokens/s—a roughly 15x performance gap that's likely due to Python/FastAPI overhead compared to llama.cpp's native C++ implementation [1]. However, token generation performance is much more comparable, with the assistant reaching 9.19 tokens/s versus llama.cpp's 9-18 tokens/s range. One advantage of llama.cpp is its minimal deployment overhead through a single binary, making it straightforward to set up and run.

1.2. PrivateGPT

PrivateGPT is a direct competitor in the offline RAG market, and the comparison reveals some interesting trade-offs. The assistant's response times of 1.2-4.2 seconds fall right in line with PrivateGPT's 2-4 second range, and both systems handle roughly the same number of concurrent users at 3-5. Where the assistant really stands out is memory usage—it consumes just 4.8GB compared to PrivateGPT's 12-16GB, representing a 60% improvement in memory efficiency [2].

1.3. LocalAI

LocalAI uses a similar OpenAI-compatible API and has integrated RAG capabilities. Limited public benchmarks are available for direct comparison, but it represents an alternative architecture worth considering [3].

2. Performance Analysis

2.1. Retrieval Performance

The FAISS-based retrieval mechanism has room for improvement. Currently, the system achieves Precision@5 of 0.78 and Recall@5 of 0.65, which falls short of industry standards where production systems typically exceed 0.85 precision and 0.70 recall. There are more advanced embedding models available that could boost these numbers—for instance, Nomic Embed v1 delivers 86.2% top-5 accuracy (an 8.1% increase), while BGE-base-v1.5 reaches 84.7% top-5 accuracy (a 6.6% improvement) [4].

2.2. Embedding Model Performance

The specific sentence-transformer model being used isn't clearly specified, which is a missed opportunity since model choice can have a significant impact on performance. Here's how different options compare:

Model Speed (ms/1K tokens) Top-5 Accuracy Parameters
MiniLM-L6-v2 (Assumed) 14.7 78.1% 22M
E5-base-v2 20.2 83.5% 110M
BGE-base-v1.5 22.5 84.7% 110M
Nomic Embed v1 41.9 86.2% 137M

Source: [4]

3. Architectural and Scalability Analysis

3.1. Scalability Comparison: vLLM

vLLM represents a production-grade reference for optimized architecture. It can handle 64+ concurrent users compared to the assistant's 3-5, and it achieves 35x higher requests per second than llama.cpp at scale. However, there's a significant hardware trade-off: vLLM demands high-end GPUs like the H100 or A100, whereas the assistant is designed to work with consumer CPUs [1].

3.2. Quantization Validation

The use of 4-bit quantization is well-supported by industry research and offers compelling benefits. It allows a 7B parameter model to fit in just 3.5GB (INT4) compared to 28GB in full precision (FP32), and studies show this compression comes with minimal quality degradation [5].

3.3. Docker Performance Overhead

Choosing Docker for deployment is justified by the minimal performance overhead it introduces—typically just 1-3% for CPU and memory usage [6]. This makes containerization a practical choice without significant performance penalties.

4. Key Findings and Competitive Positioning

4.1. Identified Strengths

The assistant demonstrates several noteworthy strengths that make it competitive in this space. Its memory footprint of 4.8GB active memory is a significant advantage over PrivateGPT's 12-16GB usage. Deployment is straightforward—a single Docker command beats managing complex Python environments. The system also shows impressive results in hallucination reduction with a reported 65% improvement thanks to effective RAG integration. Perhaps most importantly, it's proven to work on commodity hardware like an Intel i5 processor with 16GB of RAM, making it accessible for resource-constrained environments [7].

4.2. Estimated Competitive Position

Aspect vs. llama.cpp vs. PrivateGPT vs. LocalAI
Inference speed ~85% slower Comparable Comparable
Memory efficiency Similar 60% better Unknown
RAG quality N/A Slightly lower Unknown
Deployment ease Similar Significantly easier Similar
Scalability Similar Similar Unknown

5. Conclusion

The Offline RAG Assistant's primary advantages are its exceptional memory efficiency and deployment simplicity compared to PrivateGPT. However, the lack of rigorous comparative benchmarking is a critical weakness. Implementing more comprehensive benchmarks would be necessary to empirically validate its strengths and clearly define its value proposition for resource-constrained, privacy-preserving deployments.

6. References

[1] https://developers.redhat.com/articles/2025/09/30/vllm-or-llamacpp-choosing-right-llm-inference-engine-your-use-case

[2] https://abstracta.us/blog/ai/privategpt-testing/

[3] https://www.libhunt.com/compare-privateGPT-vs-LocalAI

[4] https://supermemory.ai/blog/best-open-source-embedding-models-benchmarked-and-ranked/

[5] https://huggingface.co/blog/4bit-transformers-bitsandbytes

[6] https://stackoverflow.com/questions/21889053/what-is-the-runtime-performance-cost-of-a-docker-container

[7] https://drive.google.com/file/d/1MUs0M-jGRiUX9Kg6flrmHLlCGsNuYSQQ/view?usp=sharing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment