Evalzo — Video Assessment Agent

Architecture v2.1 (Recall.ai + Groq Whisper)  |  April 2026  |  CTO + Engineering Brief

What Changed

Recall.ai eliminates two of our original six layers outright and collapses a third (the orchestrator) into a thin webhook handler. Their Meeting Bot API joins the interview, records video, audio, and screen-share as separate streams, delivers per-participant audio (so speaker attribution comes for free), and returns structured participant_events that tell us exactly when screen-sharing started and stopped.

The triage LLM layer is completely eliminated. We don't need AI to figure out which parts are screen-shares — Recall.ai's screenshare_on / screenshare_off events tell us directly, and it captures screen-share video as a separate stream (type: "screenshare"). Transcription uses Groq Whisper V3 Turbo ($0.04/hr, 216x real-time) on Recall.ai's per-participant audio — no diarization needed since each audio stream IS a specific speaker. Total cost: ~$1.64–3.64/candidate.

New Architecture — 4 Layers (plus Transcription Sub-Layer 2b)

1. INPUTS (BUILD)
Meeting URL
Zoom / Meet / Teams link
Job Description
Role requirements & criteria
Interview Config
Type, rubric, weights
Soul File
AI persona + evaluation rules
2. RECALL.AI — RECORDING + EVENTS (BUY, $0.60/hr)

Bot joins the meeting via API. Records everything. Returns structured data post-call. No Recall.ai transcription — we use their per-participant audio with Groq Whisper instead.

Per-Participant Audio
Separate MP3 per speaker — natural diarization, no AI needed
Screen-Share Video
Separate MP4 per participant (type: "screenshare")
Participant Events
screenshare_on/off, speech_on/off, join/leave with timestamps
Mixed Video
720p MP4 gallery/speaker view
Meeting Metadata
Title, platform, participant names/emails
2b. TRANSCRIPTION — GROQ WHISPER V3 TURBO (BUY, $0.04/hr · 216x real-time)
Per-Speaker Transcription
Each participant's audio MP3 from Recall.ai → Groq Whisper API → timestamped transcript per speaker
No diarization needed — each audio file IS a specific participant, speaker identity is structural
Merge chronologically → fully attributed transcript in ~17 seconds for a 60-min call
Why Not Recall.ai Transcription?
Groq: $0.04/hr vs Recall: $0.15/hr
73% cheaper, 216x real-time speed
We own the transcription layer — can swap to Deepgram/AssemblyAI if needed
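Once each participant's audio has been transcribed, the merge step above is a plain sort. A minimal sketch, assuming each Groq Whisper response yields segments shaped like {"start", "end", "text"} (the exact field names depend on the response format you request):

```python
from typing import Dict, List

def merge_transcripts(per_speaker: Dict[str, List[dict]]) -> List[dict]:
    """Merge per-speaker Whisper segments into one chronological transcript.

    `per_speaker` maps a participant name to that participant's segment list,
    e.g. {"Candidate": [{"start": 4.5, "end": 9.8, "text": "..."}]}.
    """
    merged = [
        {"speaker": name, **seg}
        for name, segs in per_speaker.items()
        for seg in segs
    ]
    # Speaker identity is structural: it comes from which audio file the
    # segment belonged to, so sorting by start time is all that's needed.
    merged.sort(key=lambda s: s["start"])
    return merged

interviewer = [{"start": 0.0, "end": 4.2, "text": "Walk me through your solution."}]
candidate = [{"start": 4.5, "end": 9.8, "text": "Sure, I started with a hash map."}]
transcript = merge_transcripts({"Interviewer": interviewer, "Candidate": candidate})
print(transcript[0]["speaker"])  # → Interviewer
```

There is no alignment model anywhere in this step, which is the whole point of requesting per-participant audio.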
3. ASSESSMENT LLM (BUILD)
Assessment + JD Scoring
Receives: full transcript (text) + screen-share clips (video) + JD + soul file
Evaluates: technical skill, problem-solving, communication, code quality
Outputs: per-criterion JD-aligned scores with evidence citations
Swappable: Claude Opus / Sonnet, GPT-4o, Gemini Pro
Why No Triage Needed
Recall.ai participant_events already tells us which segments are screen-shares (with timestamps). We extract only those clips — no LLM classification needed.
4. OUTPUT (BUILD)
Structured Summary
Narrative assessment (JSON + Markdown)
JD-Aligned Scores
Per-criterion ratings with evidence
Evidence Clips
Timestamped links to video moments
Overall Verdict
Score + hire recommendation

Data Flow (One Candidate Assessment)

1. Recruiter schedules interview → Evalzo calls POST /api/v1/bot/ with meeting URL + recording config (no transcription add-on)
2. Recall.ai bot joins meeting, records everything silently
3. Post-call webhook recording.done fires → Evalzo fetches per-participant audio MP3s + participant events + screen-share MP4s
4. Each participant audio MP3 → Groq Whisper V3 Turbo → timestamped transcript (parallel, ~17s total for 60 min)
5. Merge transcripts chronologically (speaker identity is structural — no diarization step)
6. Extract screen-share segments using screenshare_on/off timestamps (zero LLM cost)
7. Evalzo sends the merged transcript + screen-share clips + JD + soul file to the Assessment LLM
8. LLM returns structured assessment → Evalzo formats and delivers to recruiter dashboard
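Step 6 above is pure bookkeeping. A hedged sketch, assuming a simplified event shape ({"code", "timestamp"} with seconds from call start) rather than Recall.ai's exact participant_events schema:

```python
def screenshare_windows(events, call_end=None):
    """Pair screenshare_on / screenshare_off events into (start, end) windows.

    `events` is a simplified stand-in for Recall.ai's participant_events
    payload: [{"code": "screenshare_on", "timestamp": 312.4}, ...].
    An unterminated screenshare_on is closed at `call_end` if one is given.
    """
    windows, open_at = [], None
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if ev["code"] == "screenshare_on" and open_at is None:
            open_at = ev["timestamp"]
        elif ev["code"] == "screenshare_off" and open_at is not None:
            windows.append((open_at, ev["timestamp"]))
            open_at = None
    if open_at is not None and call_end is not None:
        # Screen-share was still running when the call ended
        windows.append((open_at, call_end))
    return windows

events = [
    {"code": "screenshare_on", "timestamp": 312.4},
    {"code": "screenshare_off", "timestamp": 1501.0},
    {"code": "screenshare_on", "timestamp": 2400.0},
]
print(screenshare_windows(events, call_end=3600.0))
# → [(312.4, 1501.0), (2400.0, 3600.0)]
```

These windows are all the assessment layer needs to select screen-share clips; no model call is involved.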

Architecture Comparison

v1 — Build Everything (6 Layers)

BUILD Layer 1: Inputs

BUILD Layer 2: Orchestrator

ELIMINATED Layer 3: Triage LLM

ELIMINATED Layer 4: Processing (transcription + clipping)

BUILD Layer 5: Assessment LLM

BUILD Layer 6: Output

6 layers · $5.25–8.55/candidate · 6+ weeks dev

v2.1 — Recall.ai + Groq Whisper (5 Layers)

BUILD Layer 1: Inputs

BUY Layer 2: Recall.ai (recording + events — no transcription)

BUY Layer 2b: Groq Whisper V3 Turbo ($0.04/hr)

BUILD Layer 3: Assessment LLM

BUILD Layer 4: Output

5 layers · $1.64–3.64/candidate · ~12 dev-days

What Recall.ai Eliminated

| Original Layer | What It Did | Why It's Gone |
|---|---|---|
| Triage LLM (Layer 3) | Light model watched video to classify segments as Q&A vs screen-share | Recall.ai participant_events provides screenshare_on/off timestamps — structured data, no AI needed |
| Transcription Service (Layer 4) | Self-hosted Whisper/Deepgram + diarization pipeline | Recall.ai gives per-participant audio (natural diarization). Groq Whisper V3 Turbo transcribes at $0.04/hr, 216x real-time. No diarization step needed. |
| Video Clipper (Layer 4) | FFmpeg extracted screen-share clips | Recall.ai captures screen-share as separate MP4 (type: "screenshare") — no clipping needed |
| Orchestrator (Layer 2) | Complex prompt routing to triage vs assessment models | Simplified to a webhook handler — no multi-model routing needed |

What We Still Build

| Component | Effort | Why |
|---|---|---|
| Recall.ai integration + Groq Whisper transcription | 3 days | Create bot, handle webhooks, fetch audio/video, transcribe per-participant via Groq, merge chronologically |
| Assessment LLM pipeline | 3–4 days | Prompt engineering, structured output, model abstraction — this is Evalzo's IP |
| JD alignment + scoring | 2–3 days | Map assessment to JD criteria, weighted scoring, evidence linking |
| Output formatting + dashboard integration | 2 days | Present results to recruiter |
| Total | ~12 dev-days | |

Three Approaches Compared (60-min interview)

Approach A: Naive — Full Video to Frontier Model

| Component | Volume | Cost |
|---|---|---|
| Video frames (1 fps × 60 min) | 3,600 frames × ~1,200 tok | $13–22 |
| Audio transcript tokens | ~30K tokens | $0.45 |
| Output (assessment) | ~5K tokens | $0.38 |
| Total per candidate | | $14–23 |

Approach B: Triage-First (v1 Architecture)

| Component | Volume | Cost |
|---|---|---|
| Triage model (full video, 0.2 fps) | 720 frames, light model | $0.40–0.80 |
| Transcription (40 min verbal) | 40 min audio | $0.40 |
| Assessment model (20 min clips only) | 1,200 frames, frontier model | $4.30–7.20 |
| JD scoring pass (text only) | ~10K tokens | $0.15 |
| Total per candidate | | $5.25–8.55 |

Approach C: Recall.ai + Groq Whisper + Assessment LLM (v2.1 Architecture)

| Component | Volume | Cost |
|---|---|---|
| Recall.ai recording (no transcription add-on) | 60 min | $0.50 |
| Recall.ai web_4_core variant (for separate audio/video) | 60 min | $0.10 |
| Groq Whisper V3 Turbo transcription | 60 min | $0.04 |
| Assessment LLM (transcript as text + screen-share clips only) | ~30K text tokens + ~600 frames (20 min @ 0.5 fps) | $1.00–3.00 |
| Total per candidate | | $1.64–3.64 |
Comparing the ranges above, v2.1 saves roughly 57–81% vs v1 and 74–93% vs the naive approach. At 500 candidates/month: naive ≈ $11,500  |  v1 triage-first ≈ $3,750  |  v2.1 Recall+Groq ≈ $1,320. For purely verbal interviews (no screen-share), v2.1's infrastructure cost drops to $0.64/candidate (recording + transcription), plus a small text-only assessment cost.

Why Is v2.1 So Much Cheaper?

1. No triage LLM at all. v1 spent $0.40–0.80 per interview having a model watch the full video just to classify segments. Recall.ai gives us screenshare_on/off events for free as structured data.

2. No diarization needed. Recall.ai captures separate audio per participant (audio_separate_mp3). Each file IS a specific speaker — speaker identity is structural, not algorithmic. No pyannote, no GPU, no Celery workers.

3. Groq Whisper at $0.04/hr is 73% cheaper than Recall.ai's transcription. We skip Recall.ai's $0.15/hr transcription add-on and transcribe each participant's audio independently via Groq. 216x real-time speed means a 60-min call transcribes in ~17 seconds. And we own this layer — can swap to Deepgram or AssemblyAI if Groq pricing changes.

4. Screen-share video comes pre-separated. We don't need FFmpeg clipping. Recall.ai captures screen-share as a separate stream (type: "screenshare"). We send only that to the assessment LLM — no wasted tokens on webcam footage.

5. Lower frame rate is fine. For screen-share assessment (code on screen), we can sample at 0.5fps (one frame every 2 seconds). Code doesn't change that fast. This halves the token cost vs 1fps.

Dev Effort Breakdown — 1 Engineer, Full-Time

Assumes one backend engineer familiar with Python, REST APIs, and LLM integrations. All estimates include testing.

Phase 1: Recall.ai Integration + Groq Whisper (3 days)

Day 1 — Bot lifecycle + webhook handler (8 hrs)

Create bot via API on interview schedule. Handle the recording.done webhook (we skip Recall.ai transcription, so no transcript.done). Implement bot status polling as fallback. Store recording metadata in DB.

Day 2 — Data retrieval + storage (8 hrs)

Fetch per-participant audio MP3s, screen-share MP4s, and participant events via media_shortcuts. Transcribe each participant's audio via Groq Whisper and merge chronologically. Parse screenshare_on/off events to identify visual segments. Download and temporarily store screen-share clips. Handle retry/error cases.

Day 3 — Testing + edge cases (8 hrs)

Test across Zoom, Meet, Teams. Handle: no screen-share interviews, multiple screen-share segments, failed recordings, bot join failures. Wire up with Evalzo's interview scheduling system.

Phase 2: Assessment LLM Pipeline (4 days)

Day 4 — Prompt engineering + soul file (8 hrs)

Design system prompt structure: soul file + JD context + evaluation rubric. Build prompt assembly from interview config + JD. Test with sample transcripts. Define structured output schema (JSON mode).

Day 5 — Multimodal assessment (8 hrs)

Implement video frame extraction from screen-share MP4 (frame sampling at configurable fps). Build the multimodal prompt: transcript sections + video frames interleaved chronologically. Handle context window limits (chunk if needed).
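The configurable-fps sampling above can be as simple as computing frame timestamps and delegating extraction to ffmpeg. A sketch (the output pattern and quality flag are illustrative choices, not project requirements):

```python
def sample_timestamps(duration_s: float, fps: float) -> list:
    """Timestamps (in seconds) at which to grab frames from a clip."""
    step = 1.0 / fps
    return [round(i * step, 3) for i in range(int(duration_s * fps))]

def ffmpeg_frame_cmd(clip_path: str, out_pattern: str, fps: float) -> list:
    """Build an ffmpeg command that writes sampled frames as JPEGs.

    ffmpeg's fps filter does the sampling; -q:v 2 keeps JPEG quality high.
    """
    return ["ffmpeg", "-i", clip_path, "-vf", f"fps={fps}", "-q:v", "2", out_pattern]

# A 20-minute screen-share clip at 0.5 fps yields 600 frames,
# matching the ~600-frame figure in the cost table.
print(len(sample_timestamps(20 * 60, 0.5)))  # → 600
```

Keeping fps a parameter makes the cost/fidelity trade-off a config change rather than a code change.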

Day 6 — Model abstraction layer (8 hrs)

Abstract LLM calls behind a unified interface (assess(transcript, media, jd) → Assessment). Add config-driven model selection (Claude/GPT/Gemini). Implement retry, timeout, cost tracking per assessment.
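One way to shape that Day 6 interface; the registry pattern and the Assessment fields here are illustrative, not the final schema:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Assessment:
    scores: Dict[str, float]
    verdict: str
    evidence: List[str] = field(default_factory=list)

class AssessmentModel:
    """Unified assess() entry point; each provider (Claude, GPT, Gemini)
    registers a callable so model choice becomes a config string."""
    _providers: Dict[str, Callable] = {}

    @classmethod
    def register(cls, name: str, fn: Callable) -> None:
        cls._providers[name] = fn

    @classmethod
    def assess(cls, provider: str, transcript, media, jd) -> Assessment:
        return cls._providers[provider](transcript, media, jd)

# A stub standing in for a real frontier-model call:
def stub_provider(transcript, media, jd):
    return Assessment(scores={"communication": 4.0}, verdict="hire")

AssessmentModel.register("stub", stub_provider)
result = AssessmentModel.assess("stub", "…transcript…", [], "…jd…")
print(result.verdict)  # → hire
```

Retry, timeout, and cost tracking from the task description would wrap the `assess` call, not live inside each provider.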

Day 7 — JD alignment scoring (8 hrs)

Second-pass prompt that maps raw assessment to JD criteria. Weighted scoring per requirement. Evidence extraction (link scores to transcript timestamps + video moments). Structured output: scores array + overall verdict.
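The weighted-scoring step above reduces to a normalized weighted average over JD criteria. A minimal sketch with made-up criteria and weights:

```python
def weighted_score(criterion_scores: dict, weights: dict) -> float:
    """Overall score from per-criterion scores and JD weights.

    Weights are normalized here, so they need not sum exactly to 1.
    """
    total_w = sum(weights[c] for c in criterion_scores)
    return sum(criterion_scores[c] * weights[c] for c in criterion_scores) / total_w

scores = {"technical": 4.5, "communication": 3.5}   # hypothetical criteria
weights = {"technical": 0.7, "communication": 0.3}  # hypothetical JD weights
print(round(weighted_score(scores, weights), 2))  # → 4.2
```

Evidence linking stays separate: each criterion score carries its own list of transcript/video timestamps, and only the scalar scores flow into this aggregation.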

Phase 3: Output + Integration (3 days)

Day 8 — Output formatting (8 hrs)

Transform LLM output into recruiter-facing format. Generate Markdown summary + JSON payload. Map evidence citations to Recall.ai video timestamps (deep-link to specific moments). Build overall score + hire recommendation logic.

Day 9 — Dashboard integration (8 hrs)

Connect output to Evalzo's existing recruiter dashboard. Display assessment alongside existing written test (QGen) results. Add video playback with evidence-linked timestamps. Handle assessment status states (processing, complete, failed).

Day 10 — End-to-end testing (8 hrs)

Full pipeline test: schedule → record → assess → display. Test with 3–5 real interview recordings across different types (technical, behavioral, mixed). Verify cost tracking matches projections. Performance test: ensure assessment completes within 2–3 min of recording completion.

Phase 4: Hardening + QA (2 days)

Day 11 — Error handling + monitoring (8 hrs)

Add comprehensive error handling for each pipeline stage. Implement alerting for failed assessments. Add cost monitoring dashboard (per-assessment + monthly aggregate). Implement assessment retry logic.

Day 12 — Quality validation (8 hrs)

Compare AI assessments against human evaluator benchmarks. Tune prompts based on quality gaps. Document the system for team handoff. Deploy to staging.

| Phase | Days | Hours | Deliverable |
|---|---|---|---|
| Recall.ai Integration + Groq Whisper | 3 | 24 | Bot lifecycle, webhook handler, data retrieval, per-participant transcription working |
| Assessment LLM Pipeline | 4 | 32 | Multimodal assessment with model abstraction + JD scoring |
| Output + Integration | 3 | 24 | Dashboard-ready output, end-to-end tested |
| Hardening + QA | 2 | 16 | Production-ready, monitored, documented |
| Total | 12 days | 96 hrs | Full pipeline, staging-ready |

Implementation Guide — ClickUp-Ready

Detailed technical steps for developer handoff. Each section maps to a ClickUp task.

Task 1: Recall.ai Account + API Setup

Assignee: Backend engineer  |  Estimate: 2 hrs  |  Priority: Urgent

Task 2: Create Bot Service

Assignee: Backend engineer  |  Estimate: 8 hrs  |  Priority: High

Service that creates a Recall.ai bot when an interview is scheduled.

POST https://us-east-1.recall.ai/api/v1/bot/
{
  "meeting_url": "https://zoom.us/j/123456789?pwd=abc",
  "bot_name": "Evalzo Assessment",
  "join_at": "2026-04-25T14:00:00Z",   // scheduled >10min ahead = guaranteed
  "recording_config": {
    // NO transcript provider — we use Groq Whisper on per-participant audio
    "audio_separate_mp3": {},            // per-participant audio (natural diarization)
    "video_mixed_mp4": {},               // 720p gallery view recording
    "video_separate_mp4": {},            // separate webcam + screenshare per participant
    "participant_events": {},            // join/leave/screenshare_on/off events
    "meeting_metadata": {}
  },
  "variant": { "zoom": "web_4_core" },  // required for video_separate_mp4
  "metadata": {
    "evalzo_interview_id": "int_abc123",
    "evalzo_jd_id": "jd_xyz456"
  }
}
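In Python, the bot-creation call above might look like the following stdlib-only sketch. `build_bot_payload` mirrors the JSON shown; `create_bot` is the actual network call, defined but not exercised here:

```python
import json
import urllib.request

RECALL_BASE = "https://us-east-1.recall.ai/api/v1"  # region per the config above

def build_bot_payload(meeting_url: str, interview_id: str,
                      jd_id: str, join_at: str) -> dict:
    """Assemble the bot-creation body from Evalzo's scheduling data."""
    return {
        "meeting_url": meeting_url,
        "bot_name": "Evalzo Assessment",
        "join_at": join_at,
        "recording_config": {
            # No transcript provider: Groq Whisper handles transcription
            "audio_separate_mp3": {},
            "video_mixed_mp4": {},
            "video_separate_mp4": {},
            "participant_events": {},
            "meeting_metadata": {},
        },
        "variant": {"zoom": "web_4_core"},  # required for video_separate_mp4
        "metadata": {"evalzo_interview_id": interview_id, "evalzo_jd_id": jd_id},
    }

def create_bot(api_key: str, payload: dict) -> dict:
    """POST the payload to Recall.ai (network call, not run in this sketch)."""
    req = urllib.request.Request(
        f"{RECALL_BASE}/bot/",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Token {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_bot_payload(
    "https://zoom.us/j/123456789", "int_abc123", "jd_xyz456", "2026-04-25T14:00:00Z"
)
print(sorted(payload["recording_config"]))
```

Production code would add retries and idempotency keyed on the interview ID so a re-run doesn't spawn duplicate bots.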

Task 3: Webhook Handler + Groq Transcription

Assignee: Backend engineer  |  Estimate: 8 hrs  |  Priority: High

Endpoint that receives Recall.ai lifecycle webhooks, transcribes via Groq, and triggers the assessment pipeline.
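A minimal dispatcher sketch for this task. The webhook payload shape and the pipeline methods (`fetch_participant_audio`, `transcribe`, `assess`) are assumed names, and real code would enqueue the work rather than run it inline in the request handler:

```python
class StubPipeline:
    """Stand-in for the real fetch/transcribe/assess services."""
    def fetch_participant_audio(self, recording_id):
        return ["https://example.com/a.mp3", "https://example.com/b.mp3"]

    def transcribe(self, url):
        # Real code: send the audio to Groq Whisper V3 Turbo here
        return [{"start": 0.0, "end": 1.0, "text": "hi"}]

    def assess(self, recording_id, transcripts):
        self.assessed = (recording_id, len(transcripts))

def handle_recall_webhook(event: dict, pipeline) -> str:
    """Dispatch a Recall.ai lifecycle webhook (payload shape is an
    assumption; check the docs for the exact schema)."""
    if event.get("event") != "recording.done":
        return "ignored"
    recording_id = event["data"]["recording_id"]
    audio_urls = pipeline.fetch_participant_audio(recording_id)
    transcripts = [pipeline.transcribe(u) for u in audio_urls]
    pipeline.assess(recording_id, transcripts)
    return "assessed"

p = StubPipeline()
status = handle_recall_webhook(
    {"event": "recording.done", "data": {"recording_id": "rec_1"}}, p
)
print(status)  # → assessed
```

Returning fast and pushing the heavy steps onto a queue keeps the webhook endpoint within Recall.ai's delivery timeout.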

Task 4: Assessment LLM Service

Assignee: Backend engineer  |  Estimate: 16 hrs  |  Priority: High

Core intelligence layer. Consumes Recall.ai data + JD, outputs structured assessment.

Task 5: Output Formatter + Dashboard Integration

Assignee: Backend + Frontend engineer  |  Estimate: 12 hrs  |  Priority: Medium

Task 6: Testing + QA

Assignee: Backend engineer  |  Estimate: 12 hrs  |  Priority: High

Recall.ai API Quick Reference

| Action | Endpoint | When |
|---|---|---|
| Create bot | POST /api/v1/bot/ | Interview scheduled |
| Get bot status | GET /api/v1/bot/{id}/ | Polling / webhook |
| Get per-participant audio | GET /api/v1/audio_separate?recording_id={id} | Post-call → Groq Whisper |
| Get participant events | GET media_shortcuts.participant_events.data.download_url | Post-call |
| Get screen-share video | GET /api/v1/video_separate?recording_id={id} | Post-call |
| Get mixed video | GET media_shortcuts.video_mixed.data.download_url | Post-call |
| Delete recording | DELETE /api/v1/recording/{id}/ | After assessment (privacy) |

API Docs: docs.recall.ai/docs/getting-started  |  Rate limit: 120 requests/min  |  Auth: Authorization: Token {API_KEY}