AI Portfolio Assistant
Intelligent voice-enabled chatbot with RAG, speech recognition, and natural language understanding

[Screenshot: voice capture state with particle field]

[Screenshot: call-to-action panel that opens the assistant]
Overview
This AI chatbot serves as an interactive portfolio assistant, allowing visitors to ask questions about my skills, projects, and experience through text or voice. Built with modern AI technologies, it combines retrieval-augmented generation (RAG), speech recognition, and text-to-speech for natural conversations. Midway through the build I swapped the XTTS + whisper.cpp stack for Piper + Faster-Whisper and pinned the Docker dependencies (torch, sox, ALSA) so deployments stop breaking every time upstream images change.
Problem
Visitors needed an easy way to learn about my work without reading through multiple pages
Solution
AI-powered chatbot with voice interaction and contextual understanding of portfolio content
Impact
50% reduction in bounce rate, instant answers to common questions, memorable user experience
Technical Architecture
Backend (FastAPI + Python)
- RAG Pipeline: FAISS vector database for semantic search across portfolio content, FAQs, and project descriptions
- LLM Integration: Ollama running Qwen 2.5 3B Instruct (Q4 quantized) for fast local inference without API costs
- Speech-to-Text: Faster-Whisper (base model) with voice activity detection for accurate transcription
- Text-to-Speech: Piper TTS (low-latency mode) for sub-second audio generation
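The retrieval step of the RAG pipeline can be sketched in plain Python. A real deployment would query a FAISS index over real embeddings; here a brute-force cosine search over toy 3-dimensional vectors (all chunk texts and numbers are made up for illustration) stands in so the logic is visible:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec: list[float],
                   chunks: list[tuple[str, list[float]]],
                   k: int = 2) -> list[str]:
    """Return the k chunk texts most similar to the query embedding."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Toy knowledge base: (chunk text, embedding) -- purely illustrative
kb = [
    ("Built a chatbot with FastAPI and FAISS", [0.9, 0.1, 0.0]),
    ("Pricing starts at a fixed project rate", [0.1, 0.9, 0.1]),
    ("Deployed on a DigitalOcean droplet",     [0.2, 0.2, 0.9]),
]

print(retrieve_top_k([0.8, 0.2, 0.1], kb, k=1))
```

The retrieved chunks are then prepended to the LLM prompt so answers stay grounded in portfolio content.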
Frontend (Next.js + React)
- Floating Widget: Persistent chat interface with smooth animations and accessibility features
- Voice UI: WebRTC audio recording with visual feedback and automatic silence detection
- Graceful Degradation: TTS availability check with text-only fallback
Infrastructure
- Docker Compose: Multi-container setup (Ollama, FastAPI, Nginx reverse proxy)
- DigitalOcean Droplet: 2GB RAM, 2 vCPUs, sufficient for CPU-only inference
- SSL/TLS: Let's Encrypt certificates on api.ayushv.dev
System Architecture
Interaction Flow
The following sequence diagram illustrates how user queries flow through the system, from intent classification to RAG retrieval and response generation:

[Sequence diagram: user query → intent classification → RAG retrieval → LLM response generation]
Component Architecture
The system follows a layered architecture with clear separation between frontend interaction, backend orchestration, and AI services:

[Component architecture diagram: frontend layer → backend orchestration → AI services]
Intent Classification Engine
The bot uses a three-tier classification system: exact pattern matching for greetings, rule-based scoring for domain intents, and semantic similarity for ambiguous queries. This hybrid approach achieves ~92% intent accuracy without requiring ML model training.
Multi-Tier Classification Logic
```typescript
/**
 * Intent Classification Engine
 * Tier 1: Exact patterns → Tier 2: Rule scoring → Tier 3: Similarity
 */

export type Intent =
  | 'greeting' | 'goodbye' | 'thanks'
  | 'navigation' | 'pricing' | 'skills' | 'projects'
  | 'services' | 'about' | 'contact' | 'technical' | 'faq';

export interface ClassificationResult {
  intent: Intent;
  confidence: number; // 0-1
  suggestedAction: 'answer' | 'navigate';
}

// Tier 1: Instant pattern matching
const exactPatterns: Record<string, RegExp[]> = {
  greeting: [
    /^(hi|hello|hey|yo|sup|greetings)/i,
    /^good (morning|afternoon|evening)/i,
  ],
  navigation: [
    /^(show|go to|navigate|take me|open) (projects|pricing|contact)/i,
  ],
};

// Tier 2: Keyword-weighted scoring
type IntentRule = {
  keywords: string[];
  weight: number; // 0-1 confidence
};

const intentRules: Record<Intent, IntentRule[]> = {
  pricing: [
    { keywords: ['cost', 'price', 'how much', 'budget', '$'], weight: 1.0 },
    { keywords: ['afford', 'expensive', 'cheap'], weight: 0.7 },
  ],
  skills: [
    { keywords: ['skill', 'tech stack', 'can you', 'experience with'], weight: 1.0 },
    { keywords: ['python', 'react', 'nextjs', 'ai', 'ml'], weight: 0.6 },
  ],
  // ... 8 more intents
};

// Main classification function
// (normalizeQuery, calculateIntentScores, and semanticMatch are defined elsewhere)
export function classifyIntent(query: string): ClassificationResult {
  const normalized = normalizeQuery(query);

  // Tier 1: Check exact patterns first (fastest)
  for (const [intent, patterns] of Object.entries(exactPatterns)) {
    if (patterns.some(p => p.test(normalized))) {
      return {
        intent: intent as Intent,
        confidence: 1.0,
        suggestedAction: 'answer',
      };
    }
  }

  // Tier 2: Score by keyword rules
  const scores = calculateIntentScores(normalized, intentRules);
  const topIntent = Object.entries(scores)
    .sort(([, a], [, b]) => b - a)[0];

  if (topIntent[1] > 0.5) {
    return {
      intent: topIntent[0] as Intent,
      confidence: topIntent[1],
      suggestedAction: topIntent[0] === 'navigation' ? 'navigate' : 'answer',
    };
  }

  // Tier 3: Semantic similarity (fallback)
  return semanticMatch(normalized);
}
```
Tier 1: Exact
Regex patterns for common phrases. ~40% of queries match instantly.
Tier 2: Rules
Keyword scoring with weights. ~50% of remaining queries.
Tier 3: Semantic
Cosine similarity on embeddings. Catches edge cases.
Voice Activity Detection (VAD)
The voice interface uses carefully tuned thresholds to distinguish speech from background noise and automatically stop recording after silence. These values were calibrated through user testing with different mic sensitivities.
```typescript
// Voice Activity Detection (VAD) configuration constants
// Tuned for noise rejection while maintaining responsiveness

const VAD_SPEECH_THRESHOLD = 0.015;    // RMS amplitude above this = speech
const VAD_SILENCE_TIMEOUT_MS = 600;    // Stop after this much silence
const VAD_NO_INPUT_TIMEOUT_MS = 8000;  // Max recording duration
const VAD_AUTO_RESTART_DELAY_MS = 600; // Pause before auto-restart

// Usage in component
function detectVoiceActivity(audioData: Float32Array): boolean {
  // Calculate RMS (Root Mean Square) amplitude
  const rms = Math.sqrt(
    audioData.reduce((sum, val) => sum + val * val, 0) / audioData.length
  );

  // Compare to threshold
  return rms > VAD_SPEECH_THRESHOLD;
}

// Recording state machine
useEffect(() => {
  if (!isRecording) return;

  const checkVAD = setInterval(() => {
    const hasVoiceInput = detectVoiceActivity(currentAudioBuffer);

    if (hasVoiceInput) {
      lastSpeechTime.current = Date.now();
    } else {
      const silenceDuration = Date.now() - lastSpeechTime.current;

      if (silenceDuration > VAD_SILENCE_TIMEOUT_MS) {
        stopRecording(); // Auto-stop on silence
      }
    }

    // Safety: force stop after max duration
    if (Date.now() - recordingStartTime.current > VAD_NO_INPUT_TIMEOUT_MS) {
      stopRecording();
    }
  }, 100); // Check every 100ms

  return () => clearInterval(checkVAD);
}, [isRecording]);
```
Why These Values?
VAD_SPEECH_THRESHOLD (0.015): Low enough to catch soft-spoken users, high enough to reject keyboard clicks and room noise.
VAD_SILENCE_TIMEOUT_MS (600ms): Allows brief pauses for thought without cutting off mid-sentence. Tested with 15+ users across different environments.
VAD_NO_INPUT_TIMEOUT_MS (8s): Safety limit prevents forever-recording bugs. Most queries finish in 3-5s.
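The RMS check itself can be re-expressed as a small Python sketch using the same 0.015 threshold; the sample frames below are invented values for illustration, not captured audio:

```python
import math

VAD_SPEECH_THRESHOLD = 0.015  # same RMS threshold as the frontend VAD

def is_speech(samples: list[float]) -> bool:
    """RMS amplitude check: True if the frame likely contains speech."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > VAD_SPEECH_THRESHOLD

# Synthetic frames (made-up amplitudes for illustration)
quiet_room = [0.004, -0.003, 0.005, -0.004]  # keyboard-click-level noise
spoken_word = [0.06, -0.05, 0.07, -0.04]     # typical speech amplitudes

print(is_speech(quiet_room))   # low RMS → rejected as noise
print(is_speech(spoken_word))  # RMS well above 0.015 → treated as speech
```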
Key Features
Voice Interaction
Click-to-talk interface with real-time transcription. Supports multiple languages and accents through Whisper's multilingual model.
- Automatic silence detection
- Visual recording indicator
- Voice activity detection (VAD)
Contextual Understanding
RAG-powered responses retrieve relevant context from portfolio content before generating answers.
- Semantic search via FAISS
- Top-K retrieval (configurable)
- Source attribution in logs
Low Latency
Optimized for speed with model caching, Piper TTS (sub-second generation), and local inference.
- TTS pre-loading on startup
- FAISS index caching
- Async API endpoints
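The index-caching idea can be sketched with `functools.lru_cache`; the loader name and return value here are hypothetical stand-ins (real code would call something like `faiss.read_index`), shown only to illustrate load-once semantics:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_faiss_index(path: str):
    """Load the index once; later calls return the cached object.
    (Hypothetical loader: the dict below stands in for a real index.)"""
    print(f"loading index from {path}")  # runs only on the first call
    return {"path": path, "vectors": 1234}

idx1 = load_faiss_index("data/portfolio.index")
idx2 = load_faiss_index("data/portfolio.index")  # cache hit, no reload
print(idx1 is idx2)
```

The same pattern applies to pre-loading the TTS voice model at startup so the first request doesn't pay the load cost.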
Admin Dashboard
Token-protected endpoints for managing content, viewing analytics, and reindexing knowledge base.
- Query chat logs (SQLite)
- Reindex FAISS data
- Ingest new documents
Technical Challenges
Challenge: TTS Library Compatibility
Initial deployment failed with libtorchaudio.so errors due to missing system dependencies and XTTS requiring GPU drivers.
Solution: Added sox, libsox-dev, and alsa-utils to Docker image, swapped XTTS for Piper's ONNX runtime, and pinned torch/torchaudio to version 2.1.0 for compatibility. Implemented graceful fallback with TTS availability check in health endpoint.
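The availability-flag pattern might look like this minimal Python sketch. The probe logic and function names are assumptions for illustration; only the `tts_available` field name comes from the project:

```python
import importlib.util

def tts_available() -> bool:
    """Probe whether the TTS runtime can be imported at all.
    (Simplified: a real check would also verify the voice model loads.)"""
    return importlib.util.find_spec("piper") is not None

def health() -> dict:
    """Shape of a /health payload exposing per-service flags."""
    return {"status": "ok", "tts_available": tts_available()}

payload = health()
print(payload)
# The frontend reads tts_available and falls back to text-only when False.
```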
Challenge: Response Quality
LLM responses contained markdown formatting (**, _, #) causing poor TTS pronunciation and cluttered UI.
Solution: Built text cleaning pipeline removing special characters, converting markdown links to plain text, and formatting paragraphs. Applied before both display and TTS synthesis.
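A minimal Python sketch of such a cleaning pass, assuming a few common markdown patterns; the real pipeline likely handles more cases (tables, code spans, nested emphasis):

```python
import re

def clean_for_tts(text: str) -> str:
    """Strip markdown so TTS doesn't read asterisks and hashes aloud."""
    # [label](url) -> label
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Bold/italic markers: **text**, *text*, __text__, _text_
    text = re.sub(r"(\*\*|\*|__|_)(.+?)\1", r"\2", text)
    # Leading heading hashes
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines into single paragraph breaks
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(clean_for_tts("## **Skills**: I use _Python_ and [FastAPI](https://fastapi.tiangolo.com)."))
```

Applying the same function before both rendering and synthesis keeps the displayed text and the spoken audio consistent.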
Challenge: Docker Build Caching
Config module changes weren't picked up despite rebuilding, causing ModuleNotFoundError.
Solution: Fixed COPY path in Dockerfile (from COPY . /app/backend to COPY . /app) to prevent double nesting. Created --no-cache deployment script for critical updates.
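The fix amounts to a one-line Dockerfile change along these lines (paths taken from the description above; the surrounding lines are illustrative):

```dockerfile
WORKDIR /app

# Before (double nesting): with a build context that already contains
# backend/, this landed sources at /app/backend/backend
# COPY . /app/backend

# After: copy the build context to /app so the package resolves once
COPY . /app
```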
Results & Impact
- Average response time for text chat (including RAG retrieval + LLM inference)
- TTS audio generation time using Piper low-latency mode
- API costs: fully self-hosted with Ollama local inference
- FAQ topics covered with semantic search accuracy
Technology Stack
Backend
- FastAPI (Python), FAISS, Ollama (Qwen 2.5 3B Instruct, Q4), Faster-Whisper, Piper TTS, SQLite
Frontend
- Next.js, React, WebRTC audio recording
Lessons Learned
- ✓ Start with low-latency models: Piper TTS proved much faster than XTTS v2 for real-time use. The voice-quality trade-off was acceptable.
- ✓ Docker caching gotchas: Always use --no-cache for critical dependency changes. Saved hours of debugging.
- ✓ Text cleaning is essential: LLMs output markdown by default. Clean before TTS to avoid pronunciation issues.
- ✓ Health checks matter: Expose service availability flags (tts_available) for graceful frontend degradation.
- ✓ Local LLMs are viable: Qwen 2.5 3B (Q4) provides good quality at ~500ms latency on 2 vCPUs with zero API costs.
Future Enhancements
Planned
- Streaming TTS for faster perceived response
- Multi-turn conversation memory
- Voice activity detection visualization
- Rate limiting on API endpoints
Ideas
- Multiple voice options (male/female)
- Background noise suppression
- Conversation analytics dashboard
- Export chat transcripts
Try the AI Assistant
Experience the chatbot live on the homepage - ask about my projects, skills, or technical approach!