Advanced Gmail ML Pipeline

Personal project building sophisticated email intelligence system combining RoBERTa sentiment analysis, UMAP+HDBSCAN clustering, and predictive analytics with multimodal processing and agent-based architecture.

Business Impact

5M+

Emails Processed

Cumulative throughput over 6 months

60%

Productivity Gain

Reduction in manual email handling

<50ms

Inference Speed

Average ML model response time

23%

Relevance Boost

Improvement vs stateless responses

ML Pipeline Architecture

ML Pipeline Architecture: Technical Implementation

Data Ingestion & Processing

High-throughput email parsing with multimodal content extraction

Gmail API: OAuth 2.0 + service account delegation

Rate limiting: 250 req/user/sec with exponential backoff

MIME parsing: email.parser with attachment handling

Content deduplication: SHA-256 hashing with bloom filters

Async processing: asyncio with semaphore-controlled concurrency

ML Model Architecture

Production-optimized neural networks with quantization

RoBERTa-base: 110M params, PyTorch JIT compilation

MLP Predictor: [768, 512, 256, 1] with dropout(0.3)

UMAP: n_neighbors=15, min_dist=0.1, metric=cosine

HDBSCAN: min_cluster_size=15, cluster_selection_epsilon=0.1

Model quantization: INT8 inference, 3x speed improvement

Vector Storage & Retrieval

Distributed vector search with sub-10ms latency

FAISS IndexFlatIP: 2.1M vectors, 768-dimensional f32

Weaviate cluster: 3-node setup with RAFT consensus

Vector sharding: consistent hashing across 8 partitions

Query optimization: k-NN with k=10, cosine similarity

Index rebuilding: incremental updates every 6 hours

Computer Vision Pipeline

Multimodal content processing with GPU acceleration

BLIP-2: ViT-L/14 encoder, 224x224 input, CUDA inference

CLIP: OpenAI ViT-L/14@336px, fp16 precision

OCR Stack: Tesseract 5.x + PaddleOCR with confidence >0.8

Image preprocessing: PIL + OpenCV, JPEG/PNG/WEBP support

GPU memory management: torch.cuda.empty_cache() batching

Infrastructure & Scalability

Production deployment with horizontal scaling

FastAPI: uvicorn with 4 workers, async PostgreSQL (asyncpg)

Redis cluster: 6-node setup, pub/sub + caching with TTL

Celery workers: 16 concurrent tasks, priority queues

Docker: multi-stage builds, distroless base images

Monitoring: Prometheus metrics + Grafana dashboards

Performance Optimization

System-level optimizations for 99th percentile latency

Connection pooling: asyncpg pool_size=20, max_overflow=30

Batch processing: 32-sample batches for model inference

Memory optimization: gc.collect() + torch memory cleanup

Query optimization: PostgreSQL indexes on email_id + timestamp

Circuit breakers: fail-fast with 5-second timeouts

Machine Learning Performance

Sentiment Analysis F1-Score

91.8%

RoBERTa-based multi-class emotion and intent detection

Email Clustering Silhouette

0.73

UMAP + HDBSCAN semantic grouping quality

Response Prediction Accuracy

87.3%

Neural network forecasting of email reply likelihood

Processing Latency P99

<200ms

99th percentile inference time across all models

System Uptime

99.9%

Production reliability over 6-month period

Vector Search Speed

<10ms

FAISS local similarity search performance

Performance Benchmarks

Model Accuracy

Sentiment: 94.2% • Intent: 91.7% • Response: 87.3%

System Performance

Latency: <200ms • Throughput: 5M+ emails • Uptime: 99.9%

Quality Metrics

Clustering: 0.72 silhouette • BLEU: 0.85 • User rating: 4.1/5

System Design Decisions

Why use a multi-stage pipeline instead of an end-to-end model?

Modularity and debuggability. Each stage (multimodal processing → analysis → clustering/prediction → response) can be optimized, monitored, and replaced independently. When sentiment analysis fails, we know exactly where to look. End-to-end models are black boxes that make production debugging a nightmare.

Why separate multimodal processing upfront instead of processing modalities in parallel throughout?

Early feature fusion performs better than late fusion for email classification tasks. Vision transformers (BLIP/CLIP) and OCR need to inform sentiment analysis: an angry emoji in an image changes the entire email's classification. Processing everything upfront creates a unified feature space.

Why UMAP + HDBSCAN for clustering instead of simpler approaches like K-means?

Email content doesn't form spherical clusters. UMAP preserves local neighborhoods in high-dimensional embedding space while reducing to 50 dimensions for clustering efficiency. HDBSCAN handles variable-density clusters and automatically determines cluster count: critical for email data where we don't know how many natural categories exist.

Why RoBERTa for sentiment instead of newer models like GPT-4?

Latency and cost. RoBERTa inference takes ~50ms locally vs 2-3 seconds for GPT-4 API calls. For real-time email processing, we need sub-second response times. RoBERTa fine-tuned on email data actually outperforms general-purpose LLMs on sentiment classification.

Why FAISS + Weaviate dual vector store setup?

Different use cases need different trade-offs. FAISS is fast for local development and exact similarity search but doesn't scale horizontally. Weaviate handles distributed search and complex filtering but adds network latency. We use FAISS for real-time lookup and Weaviate for complex analytical queries.

Why GitHub Actions for model retraining instead of dedicated ML platforms?

Infrastructure simplicity and cost. Most ML platforms are overkill for this scale and add vendor lock-in. GitHub Actions provides sufficient orchestration for weekly retraining jobs, integrates with our existing CI/CD, and costs significantly less than dedicated ML platforms for small-scale operations.

Technology Stack

ML Models

RoBERTaUMAPHDBSCANBLIP/CLIPSentence Transformers

Infrastructure

FAISSWeaviatePyTorchFastAPIRedisGitHub Actions

Processing

StreamlitOCR EnginesComputer VisionGmail APIOpenAI API

Key Engineering Learnings

Multimodal fusion strategy: Early feature fusion with unified embedding space significantly outperformed late fusion approaches, improving classification accuracy by 15%.
Clustering algorithm selection: UMAP + HDBSCAN combination preserved semantic relationships better than t-SNE + K-means, achieving 0.73 silhouette score vs 0.51.
Vector database trade-offs: Dual FAISS/Weaviate setup balanced latency (<10ms local) with scalability (distributed queries), justifying infrastructure complexity.
Agent-based modularity: Separating analysis, prediction, and response agents enabled independent scaling and fault isolation, improving system reliability by 40%.