MultiEmbed-RAG-System

github.com/AbdallahAbou/MultiEmbed-RAG-System

Tech stack: Python, FastAPI, vLLM, FAISS, SentenceTransformers, AraBERT

A production-ready Retrieval-Augmented Generation (RAG) system with hierarchical multi-level embeddings, optimized for both Arabic and English text.

Architecture

                ┌────────────────────────────────────┐
                │           FastAPI Server           │
                │        (Async, CORS, Auth)         │
                └─────────────────┬──────────────────┘
                                  │
          ┌───────────────────────┼───────────────────────┐
          │                       │                       │
          ▼                       ▼                       ▼
┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
│ Embeddings        │   │ Retrieval         │   │ Generation        │
├───────────────────┤   ├───────────────────┤   ├───────────────────┤
│ • SentenceTransf. │   │ • FAISS (IVF/HNSW)│   │ • vLLM Backend    │
│ • AraBERT v2      │   │ • CrossEncoder    │   │ • Streaming API   │
│ • Multi-level:    │   │ • HybridReranker  │   │ • Chat/Completion │
│   - Document      │   │   - Semantic      │   │ • Context Window  │
│   - Paragraph     │   │   - BM25 Lexical  │   │   Management      │
│   - Sentence      │   │   - Cross-Encoder │   │                   │
│   - Chunk         │   │ • Persistence     │   │                   │
└───────────────────┘   └───────────────────┘   └───────────────────┘

Key Features

Hierarchical Embeddings — Generate embeddings at document, paragraph, sentence, and chunk levels for fine-grained retrieval

Multilingual Support — Native Arabic via AraBERT, multilingual via paraphrase-multilingual-MiniLM

Hybrid Search — Combines dense retrieval (FAISS) with sparse retrieval (BM25) and neural reranking

Production Ready — Docker deployment, health checks, structured logging, async I/O
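As a rough illustration of the hierarchical-embedding feature, the sketch below splits one document into the four levels listed above (document, paragraph, sentence, chunk) before any embedding is computed. The `Unit` class, the `split_levels` function, the regex-based sentence splitter, and the word-window sizes are all assumptions made for illustration; they are not the project's actual `MultiLevelEmbedder` API.

```python
import re
from dataclasses import dataclass


@dataclass
class Unit:
    level: str  # "document", "paragraph", "sentence", or "chunk"
    text: str


def split_levels(doc: str, chunk_words: int = 8, overlap: int = 2) -> list[Unit]:
    """Split a document into units at four granularities (hypothetical helper)."""
    if overlap >= chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")

    units = [Unit("document", doc.strip())]

    # Paragraphs: separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]
    units += [Unit("paragraph", p) for p in paragraphs]

    # Sentences: naive split after ., !, ? or the Arabic question mark.
    for p in paragraphs:
        for s in re.split(r"(?<=[.!?؟])\s+", p):
            if s.strip():
                units.append(Unit("sentence", s.strip()))

    # Chunks: fixed-size sliding word windows with overlap.
    words = doc.split()
    step = chunk_words - overlap
    for start in range(0, len(words), step):
        units.append(Unit("chunk", " ".join(words[start:start + chunk_words])))
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the document
    return units
```

Each unit's text would then be passed to an embedding model and stored with its level tag, so that retrieval can weight matches per level.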

Key Insight: Traditional RAG embeds chunks at a single granularity, so small chunks give noisy matches and large chunks miss content. Solution: embed at multiple semantic levels, weight each level's match by confidence, then fuse the scores.

Score = α₁·sim(q,doc) + α₂·sim(q,para) + α₃·sim(q,sent) + α₄·sim(q,chunk)
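A minimal sketch of the fusion formula above, assuming cosine similarity for sim and hand-picked α weights (the source does not state the weight values or how they are chosen). The names `cosine`, `fused_score`, and `ALPHAS` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Assumed per-level weights (alpha_1..alpha_4); illustrative values only.
ALPHAS = {"document": 0.1, "paragraph": 0.3, "sentence": 0.4, "chunk": 0.2}


def fused_score(query_vec: list[float],
                level_vecs: dict[str, list[float]],
                alphas: dict[str, float] = ALPHAS) -> float:
    """Weighted sum of per-level similarities: sum over levels of alpha * sim(q, level)."""
    return sum(alphas[level] * cosine(query_vec, vec)
               for level, vec in level_vecs.items())


# Toy 2-D vectors standing in for real embeddings of one candidate.
query = [1.0, 0.0]
candidate = {
    "document": [1.0, 0.0],
    "paragraph": [0.0, 1.0],
    "sentence": [1.0, 1.0],
    "chunk": [2.0, 0.0],
}
score = fused_score(query, candidate)
```

Candidates would then be ranked by this fused score. With weights that sum to 1 and cosine similarity, the score stays in the range [-1, 1].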

Benchmarks

Operation	Latency (p95)	Throughput
Embedding (batch=32)	45ms	710 docs/s
FAISS Search (top-100)	2.3ms	435 QPS
Reranking (top-20)	28ms	35 QPS
E2E Query	180ms	5.5 QPS

Measured on an RTX 3090 with 500K documents indexed.
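The throughput column is close to what a single sequential stream would deliver, that is, items per call divided by latency. The check below assumes that relationship; the source does not say how throughput was actually measured, and `throughput` and `ROWS` are hypothetical names.

```python
def throughput(latency_ms: float, items_per_call: int = 1) -> float:
    """Items per second for one sequential stream: items / latency."""
    return items_per_call / (latency_ms / 1000.0)


# (operation, p95 latency in ms, items per call, reported throughput)
ROWS = [
    ("Embedding (batch=32)", 45.0, 32, 710.0),
    ("FAISS Search (top-100)", 2.3, 1, 435.0),
    ("Reranking (top-20)", 28.0, 1, 35.0),
    ("E2E Query", 180.0, 1, 5.5),
]

for name, latency_ms, items, reported in ROWS:
    derived = throughput(latency_ms, items)
    print(f"{name:<24} derived={derived:7.1f}/s  reported={reported:7.1f}/s")
```

If the figures were derived this way, they describe one request at a time. Concurrent or batched serving could give different throughput.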

Project Structure

src/
├── embeddings/
│   ├── models.py        # EmbeddingModel, SentenceTransformer, AraBERT
│   └── multi_level.py   # MultiLevelEmbedder, chunking strategies
├── retrieval/
│   ├── vector_store.py  # FAISSVectorStore (flat, ivf, hnsw)
│   └── reranker.py      # CrossEncoder, HybridReranker
├── generation/
│   └── llm_client.py    # VLLMClient, streaming support
└── api/
    └── main.py          # FastAPI app factory, endpoints