
MultiEmbed-RAG-System

github.com/AbdallahAbou/MultiEmbed-RAG-System
Python · FastAPI · vLLM · FAISS · SentenceTransformers · AraBERT
Production-ready Retrieval-Augmented Generation system with hierarchical multi-level embeddings, optimized for both Arabic and English text.
Architecture
                          ┌────────────────────────────────────┐
                          │           FastAPI Server           │
                          │         (Async, CORS, Auth)        │
                          └──────────────┬─────────────────────┘
                                         │
    ┌────────────────────────────────────┼────────────────────────────────────┐
    │                                    │                                    │
    ▼                                    ▼                                    ▼
┌───────────────────┐         ┌───────────────────┐         ┌───────────────────┐
│    Embeddings     │         │     Retrieval     │         │    Generation     │
├───────────────────┤         ├───────────────────┤         ├───────────────────┤
│ • SentenceTransf. │         │ • FAISS (IVF/HNSW)│         │ • vLLM Backend    │
│ • AraBERT v2      │         │ • CrossEncoder    │         │ • Streaming API   │
│ • Multi-level:    │         │ • HybridReranker  │         │ • Chat/Completion │
│   - Document      │         │   - Semantic      │         │ • Context Window  │
│   - Paragraph     │         │   - BM25 Lexical  │         │   Management      │
│   - Sentence      │         │   - Cross-Encoder │         │                   │
│   - Chunk         │         │ • Persistence     │         │                   │
└───────────────────┘         └───────────────────┘         └───────────────────┘
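The HybridReranker above fuses semantic, BM25 lexical, and cross-encoder signals. As a sketch of its lexical leg, here is a minimal Okapi BM25 scorer; the `k1`/`b` values are the usual textbook defaults and the whole implementation is an illustrative assumption, not code from the repository:

```python
import math
from collections import Counter

class BM25:
    """Minimal Okapi BM25 scorer (k1/b are standard defaults, assumed)."""

    def __init__(self, docs: list[list[str]], k1: float = 1.5, b: float = 0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in docs) / len(docs)
        self.tfs = [Counter(d) for d in docs]          # per-doc term frequencies
        df = Counter(t for d in docs for t in set(d))  # document frequencies
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: list[str], idx: int) -> float:
        tf, dl = self.tfs[idx], len(self.docs[idx])
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            num = tf[t] * (self.k1 + 1)
            den = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf.get(t, 0.0) * num / den
        return s

# Toy corpus: shorter matching docs score higher for the same term count.
docs = [["faiss", "index", "search"], ["vllm", "streaming", "api"], ["faiss", "hnsw"]]
bm25 = BM25(docs)
ranked = sorted(range(len(docs)), key=lambda i: bm25.score(["faiss"], i), reverse=True)
```

In the hybrid setting, this lexical score would be normalized and blended with the semantic and cross-encoder scores before the final top-k is returned.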
Key Features
Key Insight: Traditional RAG retrieves at a single chunk granularity, so small chunks are noisy while large chunks dilute or miss relevant content. Solution: embed each document at multiple semantic levels, weight matches by confidence, then fuse the scores.

Score = α₁·sim(q,doc) + α₂·sim(q,para) + α₃·sim(q,sent) + α₄·sim(q,chunk)
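A minimal sketch of this fusion, using cosine similarity for `sim` and example α weights (the actual weight values are an assumption, not taken from the repository):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(query_emb, level_embs, weights=(0.1, 0.2, 0.3, 0.4)):
    """Weighted sum of query similarity at each granularity level.

    level_embs: one embedding per level (document, paragraph, sentence, chunk).
    weights:    example alpha values; the real confidence weights are assumed.
    """
    return sum(w * cosine(query_emb, e) for w, e in zip(weights, level_embs))

# Toy 2-D embeddings: document, paragraph, sentence, chunk.
q = np.array([1.0, 0.0])
levels = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
          np.array([1.0, 1.0]), np.array([1.0, 0.0])]
score = fused_score(q, levels)
```

A strong match at the sentence or chunk level can thus outrank a document whose only match is a diffuse document-level similarity.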
Benchmarks
| Operation              | Latency (p95) | Throughput |
|------------------------|---------------|------------|
| Embedding (batch=32)   | 45 ms         | 710 docs/s |
| FAISS Search (top-100) | 2.3 ms        | 435 QPS    |
| Reranking (top-20)     | 28 ms         | 35 QPS     |
| E2E Query              | 180 ms        | 5.5 QPS    |

Measured on RTX 3090, 500K documents indexed

Project Structure
src/
├── embeddings/
│   ├── models.py          # EmbeddingModel, SentenceTransformer, AraBERT
│   └── multi_level.py     # MultiLevelEmbedder, chunking strategies
├── retrieval/
│   ├── vector_store.py    # FAISSVectorStore (flat, ivf, hnsw)
│   └── reranker.py        # CrossEncoder, HybridReranker
├── generation/
│   └── llm_client.py      # VLLMClient, streaming support
└── api/
    └── main.py            # FastAPI app factory, endpoints