Architecture v2.1 (Recall.ai + Groq Whisper) | April 2026 | CTO + Engineering Brief
Recall.ai eliminates two of our original six layers outright and reduces a third to a webhook handler. Their Meeting Bot API joins the interview, records video + audio + screen-share separately, and returns structured participant_events that tell us exactly when screen-sharing started and stopped. (It can also provide speaker-attributed transcripts with word-level timestamps, but we skip that add-on in favor of Groq Whisper.)
Its screenshare_on / screenshare_off events tell us directly when sharing happened, and it captures screen-share video as a separate stream (type: "screenshare"). Transcription uses Groq Whisper V3 Turbo ($0.04/hr, 216x real-time) on Recall.ai's per-participant audio — no diarization needed, since each audio stream IS a specific speaker. Total cost: ~$1.64–3.64/candidate.
Bot joins the meeting via API. Records everything. Returns structured data post-call. No Recall.ai transcription — we use their per-participant audio with Groq Whisper instead.
participant_events already tells us which segments are screen-shares (with timestamps). We extract only those clips — no LLM classification needed.
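As a sketch, that extraction is a simple pass over the events. The "action" and "timestamp" field names here are assumptions for illustration — match them to the actual participant_events JSON:

```python
# Sketch: pair screenshare_on/off events into (start, end) segments.
# The "action" / "timestamp" field names are assumptions — check the
# actual participant_events JSON before relying on them.
def screenshare_segments(events: list[dict]) -> list[tuple[float, float]]:
    """Return (start, end) screen-share windows in seconds."""
    segments, start = [], None
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if ev["action"] == "screenshare_on" and start is None:
            start = ev["timestamp"]
        elif ev["action"] == "screenshare_off" and start is not None:
            segments.append((start, ev["timestamp"]))
            start = None
    if start is not None:  # share still active when the call ended
        segments.append((start, float("inf")))
    return segments
```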
1. Recruiter schedules interview → Evalzo calls POST /api/v1/bot/ with meeting URL + recording config (no transcription add-on)
2. Recall.ai bot joins meeting, records everything silently
3. Post-call webhook recording.done fires → Evalzo fetches per-participant audio MP3s + participant events + screen-share MP4s
4. Each participant audio MP3 → Groq Whisper V3 Turbo → timestamped transcript (parallel, ~17s total for 60 min)
5. Merge transcripts chronologically (speaker identity is structural — no diarization step)
6. Extract screen-share segments using screenshare_on/off timestamps (zero LLM cost)
7. Evalzo sends the merged transcript + screen-share clips + JD + soul file to the Assessment LLM
8. LLM returns structured assessment → Evalzo formats and delivers to recruiter dashboard
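Steps 3–5 above can be sketched as follows. `transcribe` is injected as a stand-in for the Groq call, and the {"start", "text"} segment shape is an assumed, simplified schema rather than Groq's exact response format:

```python
# Sketch of steps 3-5: transcribe per-participant audio in parallel, then
# merge chronologically. `transcribe` is injected (the Groq call), and the
# {"start", "text"} segment shape is an assumed, simplified schema.
from concurrent.futures import ThreadPoolExecutor

def merge_transcripts(per_speaker: dict[str, list[dict]]) -> list[dict]:
    # Speaker identity is structural: one audio file per participant,
    # so no diarization step is needed before merging.
    merged = [
        {"speaker": name, "start": seg["start"], "text": seg["text"]}
        for name, segs in per_speaker.items()
        for seg in segs
    ]
    return sorted(merged, key=lambda s: s["start"])

def run_pipeline(audio_by_speaker: dict[str, bytes], transcribe) -> list[dict]:
    # Step 4: one transcription job per participant, run in parallel.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(transcribe, mp3)
                   for name, mp3 in audio_by_speaker.items()}
        per_speaker = {name: f.result() for name, f in futures.items()}
    return merge_transcripts(per_speaker)  # step 5
```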
BUILD Layer 1: Inputs
BUILD Layer 2: Orchestrator
ELIMINATED Layer 3: Triage LLM
ELIMINATED Layer 4: Processing (transcription + clipping)
BUILD Layer 5: Assessment LLM
BUILD Layer 6: Output
6 layers · $5.25–8.55/candidate · 6+ weeks dev
BUILD Layer 1: Inputs
BUY Layer 2: Recall.ai (recording + events — no transcription)
BUY Layer 2b: Groq Whisper V3 Turbo ($0.04/hr)
BUILD Layer 3: Assessment LLM
BUILD Layer 4: Output
5 layers · $1.64–3.64/candidate · ~12 dev-days
| Original Layer | What It Did | Why It's Gone |
|---|---|---|
| Triage LLM (Layer 3) | Light model watched video to classify segments as Q&A vs screen-share | Recall.ai participant_events provides screenshare_on/off timestamps — structured data, no AI needed |
| Transcription Service (Layer 4) | Self-hosted Whisper/Deepgram + diarization pipeline | Recall.ai gives per-participant audio (natural diarization). Groq Whisper V3 Turbo transcribes at $0.04/hr, 216x real-time. No diarization step needed. |
| Video Clipper (Layer 4) | FFmpeg extracted screen-share clips | Recall.ai captures screen-share as separate MP4 (type: "screenshare") — no clipping needed |
| Orchestrator (Layer 2) | Complex prompt routing to triage vs assessment models | Simplified to a webhook handler — no multi-model routing needed |
| Component | Effort | Why |
|---|---|---|
| Recall.ai integration + Groq Whisper transcription | 3 days | Create bot, handle webhooks, fetch audio/video, transcribe per-participant via Groq, merge chronologically |
| Assessment LLM pipeline | 3–4 days | Prompt engineering, structured output, model abstraction — this is Evalzo's IP |
| JD alignment + scoring | 2–3 days | Map assessment to JD criteria, weighted scoring, evidence linking |
| Output formatting + dashboard integration | 2 days | Present results to recruiter |
| Total | ~12 dev-days | |
| Component | Volume | Cost |
|---|---|---|
| Video frames (1fps × 60 min) | 3,600 frames × ~1,200 tok | $13–22 |
| Audio transcript tokens | ~30K tokens | $0.45 |
| Output (assessment) | ~5K tokens | $0.38 |
| Total per candidate | | $14–23 |
| Component | Volume | Cost |
|---|---|---|
| Triage model (full video, 0.2 fps) | 720 frames, light model | $0.40–0.80 |
| Transcription (40 min verbal) | 40 min audio | $0.40 |
| Assessment model (20 min clips only) | 1,200 frames, frontier model | $4.30–7.20 |
| JD scoring pass (text only) | ~10K tokens | $0.15 |
| Total per candidate | | $5.25–8.55 |
| Component | Volume | Cost |
|---|---|---|
| Recall.ai recording (no transcription add-on) | 60 min | $0.50 |
| Recall.ai web_4_core variant (for separate audio/video) | 60 min | $0.10 |
| Groq Whisper V3 Turbo transcription | 60 min | $0.04 |
| Assessment LLM (transcript as text + screen-share clips only) | ~30K text tokens + ~600 frames (20 min @ 0.5fps) | $1.00–3.00 |
| Total per candidate | | $1.64–3.64 |
1. No triage LLM at all. v1 spent $0.40–0.80 per interview having a model watch the full video just to classify segments. Recall.ai gives us screenshare_on/off events for free as structured data.
2. No diarization needed. Recall.ai captures separate audio per participant (audio_separate_mp3). Each file IS a specific speaker — speaker identity is structural, not algorithmic. No pyannote, no GPU, no Celery workers.
3. Groq Whisper at $0.04/hr is 73% cheaper than Recall.ai's transcription. We skip Recall.ai's $0.15/hr transcription add-on and transcribe each participant's audio independently via Groq. 216x real-time speed means a 60-min call transcribes in ~17 seconds. And we own this layer — can swap to Deepgram or AssemblyAI if Groq pricing changes.
4. Screen-share video comes pre-separated. We don't need FFmpeg clipping. Recall.ai captures screen-share as a separate stream (type: "screenshare"). We send only that to the assessment LLM — no wasted tokens on webcam footage.
5. Lower frame rate is fine. For screen-share assessment (code on screen), we can sample at 0.5fps (one frame every 2 seconds). Code doesn't change that fast. This halves the token cost vs 1fps.
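The per-participant Groq call from point 3 might look like this sketch — the endpoint URL and model name come from this brief, while the response_format parameter is assumed to follow the OpenAI-compatible audio API convention:

```python
# Sketch: transcribe one participant's MP3 via Groq's OpenAI-compatible
# endpoint. URL and model name are from this brief; response_format is
# an assumption based on the OpenAI audio API convention.
import requests

GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"

def transcribe_participant(mp3_path: str, api_key: str) -> dict:
    with open(mp3_path, "rb") as f:
        resp = requests.post(
            GROQ_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={
                "model": "whisper-large-v3-turbo",
                "response_format": "verbose_json",  # segment timestamps
            },
            timeout=300,
        )
    resp.raise_for_status()
    return resp.json()
```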
Assumes one backend engineer familiar with Python, REST APIs, and LLM integrations. All estimates include testing.
Day 1 — Bot lifecycle + webhook handler (8 hrs)
Create bot via API on interview schedule. Handle recording.done, transcript.done webhooks. Implement bot status polling as fallback. Store recording metadata in DB.
Day 2 — Data retrieval + storage (8 hrs)
Fetch transcript JSON, screen-share MP4s, participant events via media_shortcuts. Parse screenshare_on/off events to identify visual segments. Download and temporarily store screen-share clips. Handle retry/error cases.
Day 3 — Testing + edge cases (8 hrs)
Test across Zoom, Meet, Teams. Handle: no screen-share interviews, multiple screen-share segments, failed recordings, bot join failures. Wire up with Evalzo's interview scheduling system.
Day 4 — Prompt engineering + soul file (8 hrs)
Design system prompt structure: soul file + JD context + evaluation rubric. Build prompt assembly from interview config + JD. Test with sample transcripts. Define structured output schema (JSON mode).
Day 5 — Multimodal assessment (8 hrs)
Implement video frame extraction from screen-share MP4 (frame sampling at configurable fps). Build the multimodal prompt: transcript sections + video frames interleaved chronologically. Handle context window limits (chunk if needed).
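Configurable-fps frame sampling can be a thin wrapper around ffmpeg (assumed to be on PATH); at 0.5 fps, a 20-minute screen-share yields ~600 frames, matching the cost table above:

```python
# Sketch: configurable-fps frame extraction from a screen-share MP4 by
# shelling out to ffmpeg (assumed to be installed and on PATH).
import subprocess
from pathlib import Path

def ffmpeg_cmd(mp4_path: str, out_dir: str, fps: float = 0.5) -> list[str]:
    # fps=0.5 keeps one frame every 2 seconds — enough for code on screen.
    return ["ffmpeg", "-i", mp4_path, "-vf", f"fps={fps}",
            f"{out_dir}/frame_%05d.jpg"]

def extract_frames(mp4_path: str, out_dir: str, fps: float = 0.5) -> list[Path]:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_cmd(mp4_path, out_dir, fps),
                   check=True, capture_output=True)
    return sorted(Path(out_dir).glob("frame_*.jpg"))
```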
Day 6 — Model abstraction layer (8 hrs)
Abstract LLM calls behind a unified interface (assess(transcript, media, jd) → Assessment). Add config-driven model selection (Claude/GPT/Gemini). Implement retry, timeout, cost tracking per assessment.
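A minimal sketch of that abstraction — the provider classes are stubs standing in for real Anthropic/OpenAI SDK calls, and the Assessment fields are illustrative, not a final schema:

```python
# Sketch: unified assess() interface with config-driven provider selection.
# Provider classes are stubs; real implementations would call the
# Anthropic / OpenAI SDKs. Assessment fields are illustrative.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Assessment:
    summary: str
    scores: dict[str, float]

class Assessor(Protocol):
    def assess(self, transcript: str, media: list, jd: str) -> Assessment: ...

class ClaudeAssessor:
    def assess(self, transcript, media, jd):
        raise NotImplementedError  # would call the Anthropic Messages API

class OpenAIAssessor:
    def assess(self, transcript, media, jd):
        raise NotImplementedError  # backup provider

_PROVIDERS = {"claude": ClaudeAssessor, "openai": OpenAIAssessor}

def assessor_from_config(config: dict) -> Assessor:
    return _PROVIDERS[config["model_provider"]]()
```

Swapping providers then becomes a one-line config change rather than a code change, which also makes per-assessment cost tracking easy to centralize.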
Day 7 — JD alignment scoring (8 hrs)
Second-pass prompt that maps raw assessment to JD criteria. Weighted scoring per requirement. Evidence extraction (link scores to transcript timestamps + video moments). Structured output: scores array + overall verdict.
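The weighted-scoring step might reduce to something like this; the criterion names, weights, and score scale are illustrative assumptions, not a spec from this brief:

```python
# Sketch: weighted average of per-criterion scores, normalized by total
# weight. Criterion names, weights, and scale are illustrative.
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    total_w = sum(weights[c] for c in scores)
    return sum(scores[c] * weights[c] for c in scores) / total_w

# e.g. overall_score({"python": 8.0, "system_design": 6.0},
#                    {"python": 2.0, "system_design": 1.0})  # ≈ 7.33
```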
Day 8 — Output formatting (8 hrs)
Transform LLM output into recruiter-facing format. Generate Markdown summary + JSON payload. Map evidence citations to Recall.ai video timestamps (deep-link to specific moments). Build overall score + hire recommendation logic.
Day 9 — Dashboard integration (8 hrs)
Connect output to Evalzo's existing recruiter dashboard. Display assessment alongside existing written test (QGen) results. Add video playback with evidence-linked timestamps. Handle assessment status states (processing, complete, failed).
Day 10 — End-to-end testing (8 hrs)
Full pipeline test: schedule → record → assess → display. Test with 3–5 real interview recordings across different types (technical, behavioral, mixed). Verify cost tracking matches projections. Performance test: ensure assessment completes within 2–3 min of recording completion.
Day 11 — Error handling + monitoring (8 hrs)
Add comprehensive error handling for each pipeline stage. Implement alerting for failed assessments. Add cost monitoring dashboard (per-assessment + monthly aggregate). Implement assessment retry logic.
Day 12 — Quality validation (8 hrs)
Compare AI assessments against human evaluator benchmarks. Tune prompts based on quality gaps. Document the system for team handoff. Deploy to staging.
| Phase | Days | Hours | Deliverable |
|---|---|---|---|
| Recall.ai Integration + Groq Whisper | 3 | 24 | Bot lifecycle, webhook handler, data retrieval, per-participant transcription working |
| Assessment LLM Pipeline | 4 | 32 | Multimodal assessment with model abstraction + JD scoring |
| Output + Integration | 3 | 24 | Dashboard-ready output, end-to-end tested |
| Hardening + QA | 2 | 16 | Production-ready, monitored, documented |
| Total | 12 days | 96 hrs | Full pipeline, staging-ready |
Detailed technical steps for developer handoff. Each section maps to a ClickUp task.
Assignee: Backend engineer | Estimate: 2 hrs | Priority: Urgent
- Region: us-east-1 (or eu-central-1 if EU data residency needed)
- Webhook subscriptions: recording.done, transcript.done, video_separate.done
- Verify API access: curl -H "Authorization: Token YOUR_KEY" https://us-east-1.recall.ai/api/v1/bot/

Assignee: Backend engineer | Estimate: 8 hrs | Priority: High
Service that creates a Recall.ai bot when an interview is scheduled.
POST https://us-east-1.recall.ai/api/v1/bot/
{
"meeting_url": "https://zoom.us/j/123456789?pwd=abc",
"bot_name": "Evalzo Assessment",
"join_at": "2026-04-25T14:00:00Z", // scheduled >10min ahead = guaranteed
"recording_config": {
// NO transcript provider — we use Groq Whisper on per-participant audio
"audio_separate_mp3": {}, // per-participant audio (natural diarization)
"video_mixed_mp4": {}, // 720p gallery view recording
"video_separate_mp4": {}, // separate webcam + screenshare per participant
"participant_events": {}, // join/leave/screenshare_on/off events
"meeting_metadata": {}
},
"variant": { "zoom": "web_4_core" }, // required for video_separate_mp4
"metadata": {
"evalzo_interview_id": "int_abc123",
"evalzo_jd_id": "jd_xyz456"
}
}
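In Python, the same call might look like this sketch (endpoint, payload shape, and Token auth scheme as shown above; the "id" response field is an assumption — confirm against Recall.ai's actual response):

```python
# Sketch: create a Recall.ai bot for a scheduled interview. Endpoint,
# payload, and Token auth come from the request above; the "id" field in
# the response is an assumption to confirm against the real API.
import requests

def create_bot(api_key: str, meeting_url: str, join_at: str,
               interview_id: str, jd_id: str) -> str:
    resp = requests.post(
        "https://us-east-1.recall.ai/api/v1/bot/",
        headers={"Authorization": f"Token {api_key}"},
        json={
            "meeting_url": meeting_url,
            "bot_name": "Evalzo Assessment",
            "join_at": join_at,
            "recording_config": {
                "audio_separate_mp3": {},
                "video_mixed_mp4": {},
                "video_separate_mp4": {},
                "participant_events": {},
                "meeting_metadata": {},
            },
            "variant": {"zoom": "web_4_core"},
            "metadata": {"evalzo_interview_id": interview_id,
                         "evalzo_jd_id": jd_id},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]  # bot_id to store against the interview record
```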
- Store the returned bot_id against the interview record in Evalzo DB
- Note: the web_4_core variant adds $0.10/hr but is required for separate video capture

Assignee: Backend engineer | Estimate: 8 hrs | Priority: High
Endpoint that receives Recall.ai lifecycle webhooks, transcribes via Groq, and triggers the assessment pipeline.
- Listen for the recording.done event — this fires when all artifacts are ready
- Fetch bot status: GET /api/v1/bot/{bot_id}/
- Fetch per-participant audio: GET /api/v1/audio_separate?recording_id={id} → download each MP3
- Fetch participant events: recordings[0].media_shortcuts.participant_events.data.download_url → events JSON
- Fetch screen-share video: GET /api/v1/video_separate?recording_id={id} → filter for type: "screenshare"
- Transcribe each audio file: POST https://api.groq.com/openai/v1/audio/transcriptions (model: whisper-large-v3-turbo)
- Parse participant_events for screenshare_on / screenshare_off — extract timestamps

Assignee: Backend engineer | Estimate: 16 hrs | Priority: High
Core intelligence layer. Consumes Recall.ai data + JD, outputs structured assessment.
- Define a unified interface: assess(transcript, media, jd, config) → Assessment. Implement for Claude (via Anthropic API) and one backup (OpenAI or Google). Config file selects which.

Assignee: Backend + Frontend engineer | Estimate: 12 hrs | Priority: Medium
Assignee: Backend engineer | Estimate: 12 hrs | Priority: High
| Action | Endpoint | When |
|---|---|---|
| Create bot | POST /api/v1/bot/ | Interview scheduled |
| Get bot status | GET /api/v1/bot/{id}/ | Polling / webhook |
| Get per-participant audio | GET /api/v1/audio_separate?recording_id={id} | Post-call → Groq Whisper |
| Get participant events | GET media_shortcuts.participant_events.data.download_url | Post-call |
| Get screen-share video | GET /api/v1/video_separate?recording_id={id} | Post-call |
| Get mixed video | GET media_shortcuts.video_mixed.data.download_url | Post-call |
| Delete recording | DELETE /api/v1/recording/{id}/ | After assessment (privacy) |
API Docs: docs.recall.ai/docs/getting-started | Rate limit: 120 requests/min | Auth: Authorization: Token {API_KEY}