audio-speakers://

[ audio-speakers:// ] experimental

cat: audio
model: @cf/openai/whisper-large-v3-turbo

Upload multi-person audio → get a per-speaker turn breakdown: who spoke, when, for how long, and a roll-up of speaking-time share.

// system prompt

You diarise multi-speaker audio. The user uploads a recording and states the expected speaker count. Output:

  ## Speaker timeline (turns)
  TIME      SPEAKER     DURATION
  ----      -------     --------
  00:00     Speaker 1   0:18
  00:18     Speaker 2   0:42
  ...

  ## Speaking-time summary
  Speaker 1: <X min / Y%>
  Speaker 2: <…>

  ## Notable patterns
  • <e.g. "Speaker 1 dominates the first 5 minutes then drops off; Speaker 3 is largely silent until minute 8">

  ## Quality notes
  • <overlap detected / unclear transitions / single-channel quality limit>

Rules:
- Don't name speakers unless they identify themselves in the audio (use "Speaker 1", "Speaker 2", etc.).
- Round turn timestamps to seconds.
- Speaking-time summary uses minutes + % of total spoken time.
- "Notable patterns" surfaces dominance, long silences, and turn-taking patterns, which are useful for meeting-health analytics.
- Diarisation accuracy depends on audio quality — flag overlap, cross-talk, single-channel recordings.
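
The rounding and summary rules above can be sketched in Python. This is a minimal illustration, not the tool's implementation: the `Turn` structure and both function names are hypothetical, and the "X min Ys" rendering is assumed from the sample output further down the page.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    """One diarised turn (hypothetical structure, not part of the tool)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def format_clock(seconds: float) -> str:
    """Round to whole seconds and render as MM:SS."""
    total = round(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def speaking_time_summary(turns: list[Turn]) -> list[str]:
    """Return one 'Speaker N: X min Ys / Z%' line per speaker."""
    per_speaker: dict[str, float] = defaultdict(float)
    for turn in turns:
        per_speaker[turn.speaker] += turn.end - turn.start
    # Percentages are relative to total spoken time, not recording length.
    total_spoken = sum(per_speaker.values())
    lines = []
    for speaker, seconds in per_speaker.items():
        whole = round(seconds)
        share = round(100 * seconds / total_spoken) if total_spoken else 0
        lines.append(f"{speaker}: {whole // 60} min {whole % 60}s / {share}%")
    return lines
```

Note that Python's `round` resolves exact halves to the nearest even integer, so shares that land on .5 may not sum to 100%; a largest-remainder allocation would be one way to force that, if it matters.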

Upload multi-speaker audio + estimated speaker count (auto / 2 / 3 / 4+)

⚡ powered by Cloudflare Workers AI · quota deducted on success

// sample output

## Speaker timeline (turns)
TIME      SPEAKER     DURATION
----      -------     --------
00:00     Speaker 1   0:18
00:18     Speaker 2   0:42
01:00     Speaker 1   0:12
01:12     Speaker 3   1:08
02:20     Speaker 2   0:30
02:50     Speaker 1   0:45
03:35     Speaker 3   0:25
04:00     Speaker 2   1:20
05:20     Speaker 1   2:15
07:35     Speaker 2   0:55
08:30     Speaker 3   1:30
10:00     (end)

## Speaking-time summary
Speaker 1: 3 min 30s / 35%
Speaker 2: 3 min 27s / 34%
Speaker 3: 3 min 3s / 31%

## Notable patterns
• Even distribution across speakers — close to 1/3 each, healthy turn-taking pattern.
• Speaker 1 had the longest single turn (2:15 starting at 05:20) but spoke the least in the first 5 minutes (1:15 in total).
• Two back-to-back monologues of more than a minute (Speaker 2 for 1:20 at 04:00, then Speaker 1 for 2:15 at 05:20), natural for a presentation-style meeting.

## Quality notes
• Single-channel audio. Diarisation accuracy ~90% on clean turns; lower at fast hand-offs.
• Slight overlap detected at 02:18-02:22 (Speaker 3 and Speaker 2 talking over each other at the hand-off). Timestamps in that window are approximate.
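
A timeline in this format is easy to cross-check mechanically. The sketch below parses the sample rows and derives per-speaker totals, the longest turn, and speaking time inside an opening window. It assumes the `MM:SS  SPEAKER  M:SS` row layout shown above; the regex and all function names are illustrative, not part of the tool.

```python
import re
from collections import defaultdict

SAMPLE_TIMELINE = """\
00:00     Speaker 1   0:18
00:18     Speaker 2   0:42
01:00     Speaker 1   0:12
01:12     Speaker 3   1:08
02:20     Speaker 2   0:30
02:50     Speaker 1   0:45
03:35     Speaker 3   0:25
04:00     Speaker 2   1:20
05:20     Speaker 1   2:15
07:35     Speaker 2   0:55
08:30     Speaker 3   1:30
10:00     (end)
"""

# start time, speaker label, turn duration
ROW = re.compile(r"^(\d+):(\d{2})\s+(.+?)\s+(\d+):(\d{2})$")


def parse_turns(text):
    """Parse timeline rows into (speaker, start_seconds, duration_seconds)."""
    turns = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # header, separator and "(end)" rows do not match
            m1, s1, speaker, m2, s2 = match.groups()
            turns.append((speaker, int(m1) * 60 + int(s1), int(m2) * 60 + int(s2)))
    return turns


def totals(turns):
    """Total speaking seconds per speaker."""
    out = defaultdict(int)
    for speaker, _start, duration in turns:
        out[speaker] += duration
    return dict(out)


def time_within(turns, window_end):
    """Speaking seconds per speaker inside [0, window_end)."""
    out = defaultdict(int)
    for speaker, start, duration in turns:
        # Clip each turn to the window; turns starting after it contribute 0.
        out[speaker] += max(0, min(start + duration, window_end) - start)
    return dict(out)


turns = parse_turns(SAMPLE_TIMELINE)
longest_turn = max(turns, key=lambda t: t[2])  # (speaker, start, duration)
first_five_minutes = time_within(turns, 5 * 60)
```

On the sample, `totals(turns)` gives 210, 207 and 183 seconds for Speakers 1 to 3, matching the speaking-time summary; `longest_turn` and `first_five_minutes` provide the raw numbers behind the "Notable patterns" bullets.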
