Agentic AI Assistants — Unified Showcase
Autonomous OSS-powered agents: From 1B edge-inference to 70B+ platform-scale
Role
AI Lead & Architect
Period
2025
Category
ai
Overview
A specialized showcase of production-ready agentic systems deployed across diverse domains. This platform bridges raw LLM reasoning and intuitive human interaction. I deploy everything from 'tiny' 1–4B parameter models (Phi-3-mini, Qwen) for ultra-low-latency edge inference, to 'large' 70B+ reasoning models (Llama 3.1, DeepSeek) for institutional-grade workflows. Features include real-time voice agents with Simli, multimodal RAG pipelines with FAISS, and fully autonomous tool-calling chains.
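The tiering idea above can be sketched as a small routing function: pick the smallest model that fits the request's latency budget, escalating to the largest tier when deep reasoning is needed. The tier names and latency numbers here are illustrative assumptions, not measured figures from the deployed system.

```python
# Hypothetical sketch of model tiering: route a request to an edge
# model or a large reasoning model based on the latency budget.
# Parameter counts and latency budgets are illustrative.

from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    params_b: float      # parameter count in billions
    max_latency_ms: int  # latency budget this tier can typically meet

TIERS = [
    ModelTier("phi-3-mini", 3.8, 300),     # edge / CPU via Ollama
    ModelTier("llama-3.1-8b", 8, 1200),    # vLLM on a single GPU
    ModelTier("llama-3.1-70b", 70, 5000),  # institutional-grade reasoning
]

def route(latency_budget_ms: int, needs_deep_reasoning: bool) -> ModelTier:
    """Pick the smallest tier that fits the latency budget,
    escalating to the largest tier when deep reasoning is required."""
    if needs_deep_reasoning:
        return TIERS[-1]
    for tier in TIERS:
        if tier.max_latency_ms <= latency_budget_ms:
            return tier
    return TIERS[0]  # fall back to the fastest edge model

print(route(500, False).name)    # → phi-3-mini
print(route(10_000, True).name)  # → llama-3.1-70b
```

In practice the routing signal can come from the request type (voice turn vs. document analysis) rather than an explicit flag.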
Key Highlights
- Real-time Video Avatars: Integration with Simli for ultra-low latency interactive digital humans with frame-accurate lip-sync
- Real-time Multimodal: High-performance TTS/STT pipelines using Whisper, Piper, and Deepgram
- Scale-Ready Architecture: vLLM & TGI on GCP/AWS A100s for institutional performance
- Interactive Digital Humans: LiveKit WebRTC transport paired with Simli for sub-second visual responses
- Deterministic Control: Using LangGraph for stateful multi-step agent behaviors
- Domain RAG: Custom semantic indexing (FAISS/pgvector) with multi-model tiering
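The "Domain RAG" highlight above boils down to nearest-neighbour search over embeddings. Before reaching for FAISS or pgvector, the core mechanic can be shown in a few lines of numpy; the `embed` function here is a deterministic stand-in for a real embedding model, so the whole sketch is an assumption-level illustration rather than the production pipeline.

```python
# Minimal sketch of semantic retrieval: brute-force cosine similarity
# over an embedding matrix. Production swaps this for FAISS
# (e.g. an inner-product index) or pgvector; the embedder is a stub.

import numpy as np

def embed(texts, dim=8):
    """Stand-in embedder: deterministic pseudo-embeddings per text.
    A real pipeline would call a sentence-transformer or API model."""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.standard_normal(dim)
        vecs.append(v / np.linalg.norm(v))  # unit-normalise for cosine
    return np.stack(vecs)

class CosineIndex:
    def __init__(self, docs):
        self.docs = docs
        self.vecs = embed(docs)  # (n_docs, dim)

    def search(self, query, k=2):
        q = embed([query])[0]
        scores = self.vecs @ q   # dot of unit vectors = cosine similarity
        top = np.argsort(-scores)[:k]
        return [(self.docs[i], float(scores[i])) for i in top]

index = CosineIndex([
    "menu: espresso drinks",
    "repo: FastAPI endpoints",
    "docs: deployment guide",
])
hits = index.search("menu: espresso drinks", k=1)
print(hits[0][0])  # → menu: espresso drinks
```

The retrieved chunks are then prepended to the prompt as grounded context, which is what keeps the model from hallucinating domain facts.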
Tech Stack
Python · FastAPI · LangGraph · LangChain · vLLM · TGI · Ollama · FAISS · pgvector · Whisper · Piper · Deepgram · Simli · LiveKit · Docker · GCP · AWS
Summary
On-page overview
This is a concise summary of the challenges, solution, and outcome for this project. Use the Case Study button above for the full deep dive.
The Problem
AI assistants are often stuck as 'wrappers' with high latency, expensive token costs, and zero domain context. I needed a system that could run anywhere—from a local portfolio bot to a real-time voice agent for a hospitality group.
The Solution
I built a tiered architecture that adapts to available compute:

**1. The Real-Time Voice & Video Stack:** Integrated **Simli** for unified video avatar streaming and real-time lip-sync. It handles the handshake between Whisper (STT), the LLM (Llama 3.1 8B), and the TTS engine (Piper), ensuring sub-second 'voice-to-video' latency for interactive digital humans.

**2. Deployment Tiers:**
- **Cloud-Native**: GCP Model Garden or Replicate for immediate scalability.
- **Custom Infra**: Spinning up VMs with **vLLM** or **TGI** on dedicated L4 GPUs for high-throughput production usage.
- **Zero-Cost Edge**: Ollama running Phi-3-mini directly on the client/CPU for 24/7 availability with no API bills.

**3. Knowledge & RAG:** Custom FAISS indexing pipelines that crawl domain data (menu PDFs, code repos, docs) and provide 'grounded' context to the model to prevent hallucinations.
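The STT → LLM → TTS handshake in (1) can be sketched as an async pipeline. The stage bodies below are stubs standing in for Whisper, the vLLM-served Llama 3.1 8B endpoint, and Piper; the function names and echo behaviour are illustrative assumptions, not the real integrations. Keeping each stage awaitable is what lets the real stack stream partial results and stay under a second end-to-end.

```python
# Hedged sketch of one voice turn: transcribe, reason, synthesise.
# Each stage is a stub; the real stack calls Whisper (STT), a vLLM
# endpoint (LLM), and Piper (TTS), with Simli consuming the audio
# for lip-synced video.

import asyncio

async def stt(audio_chunk: bytes) -> str:
    # Stub for Whisper transcription.
    return audio_chunk.decode("utf-8", errors="ignore")

async def llm(prompt: str) -> str:
    # Stub for a completion from the vLLM endpoint.
    return f"echo: {prompt}"

async def tts(text: str) -> bytes:
    # Stub for Piper synthesis; real output is PCM audio frames.
    return text.encode("utf-8")

async def voice_turn(audio_chunk: bytes) -> bytes:
    """One conversational turn through the three-stage pipeline."""
    transcript = await stt(audio_chunk)
    reply = await llm(transcript)
    return await tts(reply)

audio_out = asyncio.run(voice_turn(b"hello"))
print(audio_out)  # → b'echo: hello'
```

In the production version each stage streams chunks to the next rather than awaiting a full result, which is the main lever for sub-second latency.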
The Outcome
Autonomous agents that act as a personal brand ambassador (ayushv.dev), a venue host (Cafe OTB), or a technical co-pilot—each optimized for cost, latency, and accuracy.
Team & Role
Architected the entire pipeline end-to-end: Dockerized vLLM endpoints, RAG vector indexing, and the responsive motion UIs.
What I Learned
This project deepened my understanding of Python/FastAPI and LangGraph/LangChain and reinforced best practices in system design and scalability, particularly the cost, latency, and accuracy trade-offs of serving models at different tiers. I gained valuable insight into production-grade development and performance optimization.