Custom voiceover

Uploading your own narration, the alignment indicator, ElevenLabs voice-clone import, and how the pipeline runs differently.

If you'd rather narrate your own video, the pipeline accepts a custom voiceover upload. The synthesis step is skipped, the clips align against your audio instead of ElevenLabs' output, and the rest of the pipeline runs normally. This page covers when to use it, how to upload, and what happens differently.

When to use it

Custom voiceover is the right call when:

You have a signature voice. Established channel, recognisable narrator, the audience comes for that voice.
The video is for a multilingual audience. ElevenLabs covers English well at launch; if you're publishing in Spanish, French, German, or other languages, your own voice (or a hired narrator) is the path.
You need pronunciation precision. Proper nouns the synthesis stumbles on, technical terminology, names of historical figures with unusual pronunciation. Recording yourself catches these.
You're producing sponsored content. Sponsors sometimes require named-narrator delivery for brand consistency.
The script is in your voice. You wrote it (see Custom script), and hearing it read back in someone else's voice would feel wrong.

It's NOT the right call when:

You're publishing on cadence and the recording step is what would slow you down. The ElevenLabs voices are good; pick one and stay consistent across videos.
You're new to the platform and still figuring out what you want. Use the voice picker until you've shipped a few.

How to upload

In the voice picker on the brief form, scroll past the recommended voices to the Use your own voiceover option. Three formats are accepted:

MP3 (recommended for most). Mono or stereo, any bit rate from 64 kbps up. The pipeline normalises.
WAV. Lossless, useful if you've done significant post-production already.
M4A. Apple's preferred format, fine.

File size cap is 100 MB. A 30-minute MP3 at 192 kbps is about 43 MB, comfortably inside.
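The size arithmetic above can be checked before you upload. This is a minimal sketch, not the pipeline's own validation: the function names are hypothetical, the format check goes by file extension only, and it assumes the 100 MB cap means decimal megabytes.

```python
import os

# Assumptions: extension-based format check and a decimal-megabyte cap.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a"}
MAX_UPLOAD_BYTES = 100 * 1_000_000


def estimated_size_mb(duration_seconds: float, bitrate_kbps: float) -> float:
    """Estimate a constant-bitrate file's size in decimal megabytes."""
    kilobits = duration_seconds * bitrate_kbps
    return kilobits / 8 / 1000  # kilobits -> kilobytes -> megabytes


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'no extension'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 100 MB cap")
    return problems
```

For the example in the text, `estimated_size_mb(30 * 60, 192)` comes out at roughly 43.2, which matches the "about 43 MB" figure.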

Drag and drop the file, or click to open the file picker. The upload runs to Cloudflare R2 in the background, so you can keep filling out the brief while it goes.

The alignment indicator

Once the file lands, the voice picker shows an alignment indicator: how your recording's duration compares to the script length you picked.

The indicator shows three states:

On target. Your recording is within tolerance for the chosen length. The pipeline scales the timeline to your audio exactly. No action needed.
Light. Your recording is shorter than the target length. The pipeline pads with longer holds on footage so the timeline still runs the target length, but the script density drops and the video feels slower. You'll see a "Pad with extra footage" prompt.
Over. Your recording is longer than the target length. The pipeline either compresses footage holds or extends the timeline to your audio length. The indicator suggests trimming the script when the over-target exceeds the tolerance band.

The tolerance is 8 percent of the target length, with a minimum of 15 seconds. For a 12-minute target, that works out at about 58 seconds either side. Within tolerance, the timeline matches your audio exactly. Outside, you'll see actionable suggestions with a rough word-count delta (the pipeline assumes 2.5 words per second of narration).
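The tolerance rule and the three states can be sketched in a few lines. This is an illustration of the rule as described, not the pipeline's code: the names are hypothetical, the band edges are assumed to be inclusive, and the word-count delta is assumed to be measured from the target rather than from the edge of the band.

```python
from dataclasses import dataclass

TOLERANCE_FRACTION = 0.08      # 8 percent of the target length
MIN_TOLERANCE_SECONDS = 15.0   # floor on the tolerance band
WORDS_PER_SECOND = 2.5         # narration rate the pipeline assumes


@dataclass
class Alignment:
    state: str                # "on_target", "light", or "over"
    tolerance_seconds: float
    delta_seconds: float      # recording minus target; negative means short
    word_delta: int           # rough words over (+) or under (-) the target


def classify(target_seconds: float, recording_seconds: float) -> Alignment:
    tolerance = max(TOLERANCE_FRACTION * target_seconds, MIN_TOLERANCE_SECONDS)
    delta = recording_seconds - target_seconds
    if abs(delta) <= tolerance:
        # Inside the band the timeline simply follows the audio.
        return Alignment("on_target", tolerance, delta, 0)
    state = "light" if delta < 0 else "over"
    return Alignment(state, tolerance, delta, round(delta * WORDS_PER_SECOND))
```

For a 12-minute target, `classify(720, 800)` lands in the over state: the band is about 57.6 seconds, the recording is 80 seconds long, and the rough delta is 200 words. For a 60-second target the 15-second minimum applies, because 8 percent would only be 4.8 seconds.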

The waveform preview

After upload, the voice picker shows a small SVG waveform of your recording. Useful for:

Catching silence at the start or end (likely tail-padding from your recording setup; the pipeline doesn't trim this automatically, but the editor has a clip-start handle).
Spotting flat sections (where the audio level dropped, which the pipeline can detect but the indicator doesn't always surface).
Confirming the file uploaded fully (a truncated upload shows a half-waveform).
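If you want to check for leading or trailing silence before uploading, the same idea can be sketched against a list of peak amplitudes. This is not something the pipeline exposes: the function, the peak format (values between 0 and 1, one per fixed time slice), and the 0.02 threshold are all assumptions for illustration.

```python
def silence_bounds(
    peaks: list[float],
    seconds_per_peak: float,
    threshold: float = 0.02,
) -> tuple[float, float]:
    """Return (leading_silence_seconds, trailing_silence_seconds).

    A peak counts as silent when its absolute value is at or below threshold.
    If every peak is silent, the whole duration is reported as leading silence.
    """
    loud = [i for i, peak in enumerate(peaks) if abs(peak) > threshold]
    if not loud:
        return (len(peaks) * seconds_per_peak, 0.0)
    leading = loud[0] * seconds_per_peak
    trailing = (len(peaks) - 1 - loud[-1]) * seconds_per_peak
    return (leading, trailing)
```

A long leading value is a hint to move the clip-start handle in the editor, since the pipeline leaves that silence in place.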

ElevenLabs voice cloning

If you'd rather clone your voice to ElevenLabs once and reuse it across many videos, the voice picker accepts a custom voice ID you've created in your ElevenLabs account. Paste the voice ID and the pipeline uses your cloned voice for the synthesis step.

The first time you import a custom voice, you'll be asked to tick a consent checkbox confirming you have the rights to the voice and have read ElevenLabs' terms. The consent is recorded against your account and carries through every import you make from that point. You can revoke consent at any time in Settings → Account → Revoke ElevenLabs consent (see Account and sign-in). Revoking blocks new imports but doesn't detach voices you've already imported.

What runs differently

When custom voiceover is supplied, the pipeline skips:

ElevenLabs synthesis. No call to the voice API, no per-second synthesis cost.
Voice recommendation. The voice picker's recommended-voices tier is hidden once you've uploaded a custom file.
VO anomaly detection. The vision-evaluator's audio-anomaly check is bypassed; you're trusted to have produced clean audio.

What still runs:

Footage orchestration still picks clips. The orchestrator reads the script and aligns the visuals against your audio's pacing.
Caption generation still runs. The captions come from the script, and the timings come from forced alignment against your audio (so a caption appears exactly when you say it, not when ElevenLabs would have said it).
Music selection still runs, biased by niche and tone.
Render and publish are identical.
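The two lists above amount to filtering a fixed set of steps. A rough sketch, with hypothetical step names and an assumed ordering (the real pipeline's identifiers and sequence aren't documented here):

```python
# Hypothetical step names, in an assumed order.
PIPELINE_STEPS = [
    "voice_recommendation",
    "synthesis",
    "vo_anomaly_detection",
    "footage_orchestration",
    "caption_generation",
    "music_selection",
    "render",
    "publish",
]

# Steps described above as skipped when you supply your own voiceover.
SKIPPED_WITH_CUSTOM_VO = {
    "voice_recommendation",
    "synthesis",
    "vo_anomaly_detection",
}


def plan_steps(custom_voiceover: bool) -> list[str]:
    """Return the steps that run for a job, in order."""
    if not custom_voiceover:
        return list(PIPELINE_STEPS)
    return [step for step in PIPELINE_STEPS if step not in SKIPPED_WITH_CUSTOM_VO]
```

With `custom_voiceover=True`, the plan keeps footage orchestration, caption generation, music selection, render, and publish.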

Cost implications

A custom-voiceover job costs less than a synthesised-voice job because the synthesis step is skipped. The brief form's live cost preview shows the saving before you submit.

For Pro long-form, the synthesis cost is bundled into the base credit cost; using a custom VO removes the synthesis portion. That portion is approximately 20 to 30 percent of the credit cost of a standard job, depending on length.
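As a rough guide to the saving, the sketch below reads the 20 to 30 percent figure as the share of a standard job's credit cost that synthesis accounts for. That reading, the function name, and the example credit figure are assumptions; the live cost preview on the brief form is the number to trust.

```python
def custom_vo_credit_range(
    standard_credits: float,
    low_pct: int = 20,
    high_pct: int = 30,
) -> tuple[float, float]:
    """Return (lowest, highest) estimated credit cost with a custom voiceover.

    Assumes synthesis makes up between low_pct and high_pct percent of the
    standard job's credit cost, and that skipping it removes exactly that share.
    """
    lowest = standard_credits * (100 - high_pct) / 100
    highest = standard_credits * (100 - low_pct) / 100
    return (lowest, highest)
```

For a hypothetical job that would normally cost 100 credits, this gives a range of 70 to 80 credits.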

What's next

Brief writing strategy covers writing the brief itself, which still drives outline + footage + theme even when you supply the voiceover.

Custom script is often paired with custom voiceover: you wrote the script, so you record it.

Editor overview covers the editor's clip-by-clip view, including the alignment indicator and waveform preview that surface on the editor's voice panel.
