Agentic AI Assistants — Unified Showcase
Autonomous OSS-powered agents: From 1B edge-inference to 70B+ platform-scale
Role
AI Lead & Architect
Period
2025
Category
ai
Overview
A specialized showcase of production-ready agentic systems deployed across diverse domains. This platform bridges raw LLM reasoning and intuitive human interaction. I deploy everything from 'tiny' 1–4B parameter models (Phi-3-mini, Qwen) for ultra-low-latency edge inference, to 'large' 70B+ reasoning models (Llama 3.1, DeepSeek) for institutional-grade workflows. Features include real-time voice agents with Simli, multimodal RAG pipelines with FAISS, and fully autonomous tool-calling chains.
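The tiering idea above can be sketched as a small routing function: pick the smallest model that fits the request's latency budget, escalating to the largest tier when deep reasoning is needed. The tier names and latency numbers here are illustrative assumptions, not measured figures from the deployed system.

```python
# Hypothetical sketch of model tiering: route a request to an edge
# model or a large reasoning model based on the latency budget.
# Parameter counts and latency budgets are illustrative.

from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    params_b: float      # parameter count in billions
    max_latency_ms: int  # latency budget this tier can typically meet

TIERS = [
    ModelTier("phi-3-mini", 3.8, 300),     # edge / CPU via Ollama
    ModelTier("llama-3.1-8b", 8, 1200),    # vLLM on a single GPU
    ModelTier("llama-3.1-70b", 70, 5000),  # institutional-grade reasoning
]

def route(latency_budget_ms: int, needs_deep_reasoning: bool) -> ModelTier:
    """Pick the smallest tier that fits the latency budget,
    escalating to the largest tier when deep reasoning is required."""
    if needs_deep_reasoning:
        return TIERS[-1]
    for tier in TIERS:
        if tier.max_latency_ms <= latency_budget_ms:
            return tier
    return TIERS[0]  # fall back to the fastest edge model

print(route(500, False).name)    # → phi-3-mini
print(route(10_000, True).name)  # → llama-3.1-70b
```

In practice the routing signal can come from the request type (voice turn vs. document analysis) rather than an explicit flag.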
Key Highlights
- Real-time Video Avatars: Integration with Simli for ultra-low latency interactive digital humans with frame-accurate lip-sync
- Real-time Multimodal: High-performance TTS/STT pipelines using Whisper, Piper, and Deepgram
- Scale-Ready Architecture: vLLM & TGI on GCP/AWS A100s for institutional performance
- Interactive Digital Humans: LiveKit WebRTC transport paired with Simli for sub-second visual responses
- Deterministic Control: Using LangGraph for stateful multi-step agent behaviors
- Domain RAG: Custom semantic indexing (FAISS/pgvector) with multi-model tiering
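The "Domain RAG" highlight above boils down to nearest-neighbour search over embeddings. Before reaching for FAISS or pgvector, the core mechanic can be shown in a few lines of numpy; the `embed` function here is a deterministic stand-in for a real embedding model, so the whole sketch is an assumption-level illustration rather than the production pipeline.

```python
# Minimal sketch of semantic retrieval: brute-force cosine similarity
# over an embedding matrix. Production swaps this for FAISS
# (e.g. an inner-product index) or pgvector; the embedder is a stub.

import numpy as np

def embed(texts, dim=8):
    """Stand-in embedder: deterministic pseudo-embeddings per text.
    A real pipeline would call a sentence-transformer or API model."""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.standard_normal(dim)
        vecs.append(v / np.linalg.norm(v))  # unit-normalise for cosine
    return np.stack(vecs)

class CosineIndex:
    def __init__(self, docs):
        self.docs = docs
        self.vecs = embed(docs)  # (n_docs, dim)

    def search(self, query, k=2):
        q = embed([query])[0]
        scores = self.vecs @ q   # dot of unit vectors = cosine similarity
        top = np.argsort(-scores)[:k]
        return [(self.docs[i], float(scores[i])) for i in top]

index = CosineIndex([
    "menu: espresso drinks",
    "repo: FastAPI endpoints",
    "docs: deployment guide",
])
hits = index.search("menu: espresso drinks", k=1)
print(hits[0][0])  # → menu: espresso drinks
```

The retrieved chunks are then prepended to the prompt as grounded context, which is what keeps the model from hallucinating domain facts.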
Tech Stack
Python · FastAPI · LangGraph · LangChain · vLLM · TGI · Ollama · FAISS · pgvector · Whisper · Piper · Deepgram · Simli · LiveKit · Docker · GCP · AWS
Summary
On-page overview
This is a concise summary of the challenges, solution, and outcome for this project. Use the Case Study button above for the full deep dive.
The Problem
AI assistants are often stuck as 'wrappers' with high latency, expensive token costs, and zero domain context. I needed a system that could run anywhere—from a local portfolio bot to a real-time voice agent for a hospitality group.
The Solution
I built a tiered architecture that adapts to available compute:

**1. The Real-Time Voice & Video Stack:** Integrated **Simli** for unified video avatar streaming and real-time lip-sync. It handles the handshake between Whisper (STT), the LLM (Llama 3.1 8B), and the TTS engine (Piper), ensuring sub-second 'voice-to-video' latency for interactive digital humans.

**2. Deployment Tiers:**
- **Cloud-Native**: GCP Model Garden or Replicate for immediate scalability.
- **Custom Infra**: Spinning up VMs with **vLLM** or **TGI** on dedicated L4 GPUs for high-throughput production usage.
- **Zero-Cost Edge**: Ollama running Phi-3-mini directly on the client/CPU for 24/7 availability with no API bills.

**3. Knowledge & RAG:** Custom FAISS indexing pipelines that crawl domain data (menu PDFs, code repos, docs) and provide 'grounded' context to the model to prevent hallucinations.
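The STT → LLM → TTS handshake in (1) can be sketched as an async pipeline. The stage bodies below are stubs standing in for Whisper, the vLLM-served Llama 3.1 8B endpoint, and Piper; the function names and echo behaviour are illustrative assumptions, not the real integrations. Keeping each stage awaitable is what lets the real stack stream partial results and stay under a second end-to-end.

```python
# Hedged sketch of one voice turn: transcribe, reason, synthesise.
# Each stage is a stub; the real stack calls Whisper (STT), a vLLM
# endpoint (LLM), and Piper (TTS), with Simli consuming the audio
# for lip-synced video.

import asyncio

async def stt(audio_chunk: bytes) -> str:
    # Stub for Whisper transcription.
    return audio_chunk.decode("utf-8", errors="ignore")

async def llm(prompt: str) -> str:
    # Stub for a completion from the vLLM endpoint.
    return f"echo: {prompt}"

async def tts(text: str) -> bytes:
    # Stub for Piper synthesis; real output is PCM audio frames.
    return text.encode("utf-8")

async def voice_turn(audio_chunk: bytes) -> bytes:
    """One conversational turn through the three-stage pipeline."""
    transcript = await stt(audio_chunk)
    reply = await llm(transcript)
    return await tts(reply)

audio_out = asyncio.run(voice_turn(b"hello"))
print(audio_out)  # → b'echo: hello'
```

In the production version each stage streams chunks to the next rather than awaiting a full result, which is the main lever for sub-second latency.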
The Outcome
Autonomous agents that act as a personal brand ambassador (ayushv.dev), a venue host (Cafe OTB), or a technical co-pilot—each optimized for cost, latency, and accuracy.
Team & Role
Architected the entire pipeline end-to-end: Dockerized vLLM endpoints, RAG vector indexing, and the responsive motion UIs.
What I Learned
This project deepened my understanding of Python/FastAPI and LangGraph/LangChain and reinforced best practices in system design and scalability, particularly the cost, latency, and accuracy trade-offs of serving models at different tiers. I gained valuable insight into production-grade development and performance optimization.