Architecture v2.1 (Recall.ai + Groq Whisper) | April 2026 | CTO + Engineering Brief
Recall.ai eliminates two of our original six layers outright and reduces a third to a webhook handler. Their Meeting Bot API joins the interview, records video + audio + screen-share separately, and returns structured participant_events that tell us exactly when screen-sharing started and stopped. (It can also provide speaker-attributed transcripts with word-level timestamps, but we skip that add-on in favor of Groq Whisper.)
Its screenshare_on / screenshare_off events tell us directly when sharing happened, and it captures screen-share video as a separate stream (type: "screenshare"). Transcription uses Groq Whisper V3 Turbo ($0.04/hr, 216x real-time) on Recall.ai's per-participant audio — no diarization needed, since each audio stream IS a specific speaker. Total cost: ~$1.64–3.64/candidate.
Bot joins the meeting via API. Records everything. Returns structured data post-call. No Recall.ai transcription — we use their per-participant audio with Groq Whisper instead.
participant_events already tells us which segments are screen-shares (with timestamps). We extract only those clips — no LLM classification needed.
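As a sketch, that extraction is a simple pass over the events. The "action" and "timestamp" field names here are assumptions for illustration — match them to the actual participant_events JSON:

```python
# Sketch: pair screenshare_on/off events into (start, end) segments.
# The "action" / "timestamp" field names are assumptions — check the
# actual participant_events JSON before relying on them.
def screenshare_segments(events: list[dict]) -> list[tuple[float, float]]:
    """Return (start, end) screen-share windows in seconds."""
    segments, start = [], None
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if ev["action"] == "screenshare_on" and start is None:
            start = ev["timestamp"]
        elif ev["action"] == "screenshare_off" and start is not None:
            segments.append((start, ev["timestamp"]))
            start = None
    if start is not None:  # share still active when the call ended
        segments.append((start, float("inf")))
    return segments
```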
1. Recruiter schedules interview → Evalzo calls POST /api/v1/bot/ with meeting URL + recording config (no transcription add-on)
2. Recall.ai bot joins meeting, records everything silently
3. Post-call webhook recording.done fires → Evalzo fetches per-participant audio MP3s + participant events + screen-share MP4s
4. Each participant audio MP3 → Groq Whisper V3 Turbo → timestamped transcript (parallel, ~17s total for 60 min)
5. Merge transcripts chronologically (speaker identity is structural — no diarization step)
6. Extract screen-share segments using screenshare_on/off timestamps (zero LLM cost)
7. Evalzo sends the merged transcript + screen-share clips + JD + soul file to the Assessment LLM
8. LLM returns structured assessment → Evalzo formats and delivers to recruiter dashboard
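Steps 3–5 above can be sketched as follows. `transcribe` is injected as a stand-in for the Groq call, and the {"start", "text"} segment shape is an assumed, simplified schema rather than Groq's exact response format:

```python
# Sketch of steps 3-5: transcribe per-participant audio in parallel, then
# merge chronologically. `transcribe` is injected (the Groq call), and the
# {"start", "text"} segment shape is an assumed, simplified schema.
from concurrent.futures import ThreadPoolExecutor

def merge_transcripts(per_speaker: dict[str, list[dict]]) -> list[dict]:
    # Speaker identity is structural: one audio file per participant,
    # so no diarization step is needed before merging.
    merged = [
        {"speaker": name, "start": seg["start"], "text": seg["text"]}
        for name, segs in per_speaker.items()
        for seg in segs
    ]
    return sorted(merged, key=lambda s: s["start"])

def run_pipeline(audio_by_speaker: dict[str, bytes], transcribe) -> list[dict]:
    # Step 4: one transcription job per participant, run in parallel.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(transcribe, mp3)
                   for name, mp3 in audio_by_speaker.items()}
        per_speaker = {name: f.result() for name, f in futures.items()}
    return merge_transcripts(per_speaker)  # step 5
```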
BUILD Layer 1: Inputs
BUILD Layer 2: Orchestrator
ELIMINATED Layer 3: Triage LLM
ELIMINATED Layer 4: Processing (transcription + clipping)
BUILD Layer 5: Assessment LLM
BUILD Layer 6: Output
6 layers · $5.25–8.55/candidate · 6+ weeks dev
BUILD Layer 1: Inputs
BUY Layer 2: Recall.ai (recording + events — no transcription)
BUY Layer 2b: Groq Whisper V3 Turbo ($0.04/hr)
BUILD Layer 3: Assessment LLM
BUILD Layer 4: Output
5 layers · $1.64–3.64/candidate · ~12 dev-days
| Original Layer | What It Did | Why It's Gone |
|---|---|---|
| Triage LLM (Layer 3) | Light model watched video to classify segments as Q&A vs screen-share | Recall.ai participant_events provides screenshare_on/off timestamps — structured data, no AI needed |
| Transcription Service (Layer 4) | Self-hosted Whisper/Deepgram + diarization pipeline | Recall.ai gives per-participant audio (natural diarization). Groq Whisper V3 Turbo transcribes at $0.04/hr, 216x real-time. No diarization step needed. |
| Video Clipper (Layer 4) | FFmpeg extracted screen-share clips | Recall.ai captures screen-share as separate MP4 (type: "screenshare") — no clipping needed |
| Orchestrator (Layer 2) | Complex prompt routing to triage vs assessment models | Simplified to a webhook handler — no multi-model routing needed |
| Component | Effort | Why |
|---|---|---|
| Recall.ai integration + Groq Whisper transcription | 3 days | Create bot, handle webhooks, fetch audio/video, transcribe per-participant via Groq, merge chronologically |
| Assessment LLM pipeline | 3–4 days | Prompt engineering, structured output, model abstraction — this is Evalzo's IP |
| JD alignment + scoring | 2–3 days | Map assessment to JD criteria, weighted scoring, evidence linking |
| Output formatting + dashboard integration | 2 days | Present results to recruiter |
| Total | ~12 dev-days | |
| Component | Volume | Cost |
|---|---|---|
| Video frames (1fps × 60 min) | 3,600 frames × ~1,200 tok | $13–22 |
| Audio transcript tokens | ~30K tokens | $0.45 |
| Output (assessment) | ~5K tokens | $0.38 |
| Total per candidate | | $14–23 |
| Component | Volume | Cost |
|---|---|---|
| Triage model (full video, 0.2 fps) | 720 frames, light model | $0.40–0.80 |
| Transcription (40 min verbal) | 40 min audio | $0.40 |
| Assessment model (20 min clips only) | 1,200 frames, frontier model | $4.30–7.20 |
| JD scoring pass (text only) | ~10K tokens | $0.15 |
| Total per candidate | | $5.25–8.55 |
| Component | Volume | Cost |
|---|---|---|
| Recall.ai recording (no transcription add-on) | 60 min | $0.50 |
| Recall.ai web_4_core variant (for separate audio/video) | 60 min | $0.10 |
| Groq Whisper V3 Turbo transcription | 60 min | $0.04 |
| Assessment LLM (transcript as text + screen-share clips only) | ~30K text tokens + ~600 frames (20 min @ 0.5fps) | $1.00–3.00 |
| Total per candidate | | $1.64–3.64 |
1. No triage LLM at all. v1 spent $0.40–0.80 per interview having a model watch the full video just to classify segments. Recall.ai gives us screenshare_on/off events for free as structured data.
2. No diarization needed. Recall.ai captures separate audio per participant (audio_separate_mp3). Each file IS a specific speaker — speaker identity is structural, not algorithmic. No pyannote, no GPU, no Celery workers.
3. Groq Whisper at $0.04/hr is 73% cheaper than Recall.ai's transcription. We skip Recall.ai's $0.15/hr transcription add-on and transcribe each participant's audio independently via Groq. 216x real-time speed means a 60-min call transcribes in ~17 seconds. And we own this layer — can swap to Deepgram or AssemblyAI if Groq pricing changes.
4. Screen-share video comes pre-separated. We don't need FFmpeg clipping. Recall.ai captures screen-share as a separate stream (type: "screenshare"). We send only that to the assessment LLM — no wasted tokens on webcam footage.
5. Lower frame rate is fine. For screen-share assessment (code on screen), we can sample at 0.5fps (one frame every 2 seconds). Code doesn't change that fast. This halves the token cost vs 1fps.
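The per-participant Groq call from point 3 might look like this sketch — the endpoint URL and model name come from this brief, while the response_format parameter is assumed to follow the OpenAI-compatible audio API convention:

```python
# Sketch: transcribe one participant's MP3 via Groq's OpenAI-compatible
# endpoint. URL and model name are from this brief; response_format is
# an assumption based on the OpenAI audio API convention.
import requests

GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"

def transcribe_participant(mp3_path: str, api_key: str) -> dict:
    with open(mp3_path, "rb") as f:
        resp = requests.post(
            GROQ_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={
                "model": "whisper-large-v3-turbo",
                "response_format": "verbose_json",  # segment timestamps
            },
            timeout=300,
        )
    resp.raise_for_status()
    return resp.json()
```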
Assumes one backend engineer familiar with Python, REST APIs, and LLM integrations. All estimates include testing.
Day 1 — Bot lifecycle + webhook handler (8 hrs)
Create bot via API on interview schedule. Handle recording.done, transcript.done webhooks. Implement bot status polling as fallback. Store recording metadata in DB.
Day 2 — Data retrieval + storage (8 hrs)
Fetch transcript JSON, screen-share MP4s, participant events via media_shortcuts. Parse screenshare_on/off events to identify visual segments. Download and temporarily store screen-share clips. Handle retry/error cases.
Day 3 — Testing + edge cases (8 hrs)
Test across Zoom, Meet, Teams. Handle: no screen-share interviews, multiple screen-share segments, failed recordings, bot join failures. Wire up with Evalzo's interview scheduling system.
Day 4 — Prompt engineering + soul file (8 hrs)
Design system prompt structure: soul file + JD context + evaluation rubric. Build prompt assembly from interview config + JD. Test with sample transcripts. Define structured output schema (JSON mode).
Day 5 — Multimodal assessment (8 hrs)
Implement video frame extraction from screen-share MP4 (frame sampling at configurable fps). Build the multimodal prompt: transcript sections + video frames interleaved chronologically. Handle context window limits (chunk if needed).
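Configurable-fps frame sampling can be a thin wrapper around ffmpeg (assumed to be on PATH); at 0.5 fps, a 20-minute screen-share yields ~600 frames, matching the cost table above:

```python
# Sketch: configurable-fps frame extraction from a screen-share MP4 by
# shelling out to ffmpeg (assumed to be installed and on PATH).
import subprocess
from pathlib import Path

def ffmpeg_cmd(mp4_path: str, out_dir: str, fps: float = 0.5) -> list[str]:
    # fps=0.5 keeps one frame every 2 seconds — enough for code on screen.
    return ["ffmpeg", "-i", mp4_path, "-vf", f"fps={fps}",
            f"{out_dir}/frame_%05d.jpg"]

def extract_frames(mp4_path: str, out_dir: str, fps: float = 0.5) -> list[Path]:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_cmd(mp4_path, out_dir, fps),
                   check=True, capture_output=True)
    return sorted(Path(out_dir).glob("frame_*.jpg"))
```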
Day 6 — Model abstraction layer (8 hrs)
Abstract LLM calls behind a unified interface (assess(transcript, media, jd) → Assessment). Add config-driven model selection (Claude/GPT/Gemini). Implement retry, timeout, cost tracking per assessment.
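A minimal sketch of that abstraction — the provider classes are stubs standing in for real Anthropic/OpenAI SDK calls, and the Assessment fields are illustrative, not a final schema:

```python
# Sketch: unified assess() interface with config-driven provider selection.
# Provider classes are stubs; real implementations would call the
# Anthropic / OpenAI SDKs. Assessment fields are illustrative.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Assessment:
    summary: str
    scores: dict[str, float]

class Assessor(Protocol):
    def assess(self, transcript: str, media: list, jd: str) -> Assessment: ...

class ClaudeAssessor:
    def assess(self, transcript, media, jd):
        raise NotImplementedError  # would call the Anthropic Messages API

class OpenAIAssessor:
    def assess(self, transcript, media, jd):
        raise NotImplementedError  # backup provider

_PROVIDERS = {"claude": ClaudeAssessor, "openai": OpenAIAssessor}

def assessor_from_config(config: dict) -> Assessor:
    return _PROVIDERS[config["model_provider"]]()
```

Swapping providers then becomes a one-line config change rather than a code change, which also makes per-assessment cost tracking easy to centralize.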
Day 7 — JD alignment scoring (8 hrs)
Second-pass prompt that maps raw assessment to JD criteria. Weighted scoring per requirement. Evidence extraction (link scores to transcript timestamps + video moments). Structured output: scores array + overall verdict.
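The weighted-scoring step might reduce to something like this; the criterion names, weights, and score scale are illustrative assumptions, not a spec from this brief:

```python
# Sketch: weighted average of per-criterion scores, normalized by total
# weight. Criterion names, weights, and scale are illustrative.
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    total_w = sum(weights[c] for c in scores)
    return sum(scores[c] * weights[c] for c in scores) / total_w

# e.g. overall_score({"python": 8.0, "system_design": 6.0},
#                    {"python": 2.0, "system_design": 1.0})  # ≈ 7.33
```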
Day 8 — Output formatting (8 hrs)
Transform LLM output into recruiter-facing format. Generate Markdown summary + JSON payload. Map evidence citations to Recall.ai video timestamps (deep-link to specific moments). Build overall score + hire recommendation logic.
Day 9 — Dashboard integration (8 hrs)
Connect output to Evalzo's existing recruiter dashboard. Display assessment alongside existing written test (QGen) results. Add video playback with evidence-linked timestamps. Handle assessment status states (processing, complete, failed).
Day 10 — End-to-end testing (8 hrs)
Full pipeline test: schedule → record → assess → display. Test with 3–5 real interview recordings across different types (technical, behavioral, mixed). Verify cost tracking matches projections. Performance test: ensure assessment completes within 2–3 min of recording completion.
Day 11 — Error handling + monitoring (8 hrs)
Add comprehensive error handling for each pipeline stage. Implement alerting for failed assessments. Add cost monitoring dashboard (per-assessment + monthly aggregate). Implement assessment retry logic.
Day 12 — Quality validation (8 hrs)
Compare AI assessments against human evaluator benchmarks. Tune prompts based on quality gaps. Document the system for team handoff. Deploy to staging.
| Phase | Days | Hours | Deliverable |
|---|---|---|---|
| Recall.ai Integration + Groq Whisper | 3 | 24 | Bot lifecycle, webhook handler, data retrieval, per-participant transcription working |
| Assessment LLM Pipeline | 4 | 32 | Multimodal assessment with model abstraction + JD scoring |
| Output + Integration | 3 | 24 | Dashboard-ready output, end-to-end tested |
| Hardening + QA | 2 | 16 | Production-ready, monitored, documented |
| Total | 12 days | 96 hrs | Full pipeline, staging-ready |
Detailed technical steps for developer handoff. Each section maps to a ClickUp task.
Assignee: Backend engineer | Estimate: 2 hrs | Priority: Urgent
- Region: us-east-1 (or eu-central-1 if EU data residency needed)
- Webhook subscriptions: recording.done, transcript.done, video_separate.done
- Verify API access: curl -H "Authorization: Token YOUR_KEY" https://us-east-1.recall.ai/api/v1/bot/

Assignee: Backend engineer | Estimate: 8 hrs | Priority: High
Service that creates a Recall.ai bot when an interview is scheduled.
POST https://us-east-1.recall.ai/api/v1/bot/
{
"meeting_url": "https://zoom.us/j/123456789?pwd=abc",
"bot_name": "Evalzo Assessment",
"join_at": "2026-04-25T14:00:00Z", // scheduled >10min ahead = guaranteed
"recording_config": {
// NO transcript provider — we use Groq Whisper on per-participant audio
"audio_separate_mp3": {}, // per-participant audio (natural diarization)
"video_mixed_mp4": {}, // 720p gallery view recording
"video_separate_mp4": {}, // separate webcam + screenshare per participant
"participant_events": {}, // join/leave/screenshare_on/off events
"meeting_metadata": {}
},
"variant": { "zoom": "web_4_core" }, // required for video_separate_mp4
"metadata": {
"evalzo_interview_id": "int_abc123",
"evalzo_jd_id": "jd_xyz456"
}
}
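In Python, the same call might look like this sketch (endpoint, payload shape, and Token auth scheme as shown above; the "id" response field is an assumption — confirm against Recall.ai's actual response):

```python
# Sketch: create a Recall.ai bot for a scheduled interview. Endpoint,
# payload, and Token auth come from the request above; the "id" field in
# the response is an assumption to confirm against the real API.
import requests

def create_bot(api_key: str, meeting_url: str, join_at: str,
               interview_id: str, jd_id: str) -> str:
    resp = requests.post(
        "https://us-east-1.recall.ai/api/v1/bot/",
        headers={"Authorization": f"Token {api_key}"},
        json={
            "meeting_url": meeting_url,
            "bot_name": "Evalzo Assessment",
            "join_at": join_at,
            "recording_config": {
                "audio_separate_mp3": {},
                "video_mixed_mp4": {},
                "video_separate_mp4": {},
                "participant_events": {},
                "meeting_metadata": {},
            },
            "variant": {"zoom": "web_4_core"},
            "metadata": {"evalzo_interview_id": interview_id,
                         "evalzo_jd_id": jd_id},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]  # bot_id to store against the interview record
```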
- Store the returned bot_id against the interview record in Evalzo DB
- Note: the web_4_core variant adds $0.10/hr but is required for separate video capture

Assignee: Backend engineer | Estimate: 8 hrs | Priority: High
Endpoint that receives Recall.ai lifecycle webhooks, transcribes via Groq, and triggers the assessment pipeline.
- Listen for the recording.done event — this fires when all artifacts are ready
- Fetch bot status: GET /api/v1/bot/{bot_id}/
- Fetch per-participant audio: GET /api/v1/audio_separate?recording_id={id} → download each MP3
- Fetch participant events: recordings[0].media_shortcuts.participant_events.data.download_url → events JSON
- Fetch screen-share video: GET /api/v1/video_separate?recording_id={id} → filter for type: "screenshare"
- Transcribe each audio file: POST https://api.groq.com/openai/v1/audio/transcriptions (model: whisper-large-v3-turbo)
- Parse participant_events for screenshare_on / screenshare_off — extract timestamps

Assignee: Backend engineer | Estimate: 16 hrs | Priority: High
Core intelligence layer. Consumes Recall.ai data + JD, outputs structured assessment.
- Define a unified interface: assess(transcript, media, jd, config) → Assessment. Implement for Claude (via Anthropic API) and one backup (OpenAI or Google). Config file selects which.

Assignee: Backend + Frontend engineer | Estimate: 12 hrs | Priority: Medium
Assignee: Backend engineer | Estimate: 12 hrs | Priority: High
| Action | Endpoint | When |
|---|---|---|
| Create bot | POST /api/v1/bot/ | Interview scheduled |
| Get bot status | GET /api/v1/bot/{id}/ | Polling / webhook |
| Get per-participant audio | GET /api/v1/audio_separate?recording_id={id} | Post-call → Groq Whisper |
| Get participant events | GET media_shortcuts.participant_events.data.download_url | Post-call |
| Get screen-share video | GET /api/v1/video_separate?recording_id={id} | Post-call |
| Get mixed video | GET media_shortcuts.video_mixed.data.download_url | Post-call |
| Delete recording | DELETE /api/v1/recording/{id}/ | After assessment (privacy) |
API Docs: docs.recall.ai/docs/getting-started | Rate limit: 120 requests/min | Auth: Authorization: Token {API_KEY}