Audio Node

An Audio Node holds a single audio clip — a voiceover, a music track, or a sound effect. You can upload an audio file or generate one using an AI audio model. Audio Nodes connect downstream to Video Nodes, letting you attach narration, music, or sound to your video output.

Adding content

Uploading audio

When an Audio Node is empty, click the Upload button to select a file from your computer.

Maximum file size: 50 MB

Generating with a model

Select a generation mode in the generation panel, write a prompt or configure your settings, then click Generate. The audio clip appears in the display area when complete.

The Generate button is disabled when:

There is no prompt input and no upstream Text Node connected
Audio is already being generated on this node
An upstream node is currently generating ("Upstream task in progress.")
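The three conditions above can be read as a single check. The following is a minimal sketch, not the product's actual implementation: the `AudioNodeState` class, its field names, and the function name are hypothetical, and treating a whitespace-only prompt as empty is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioNodeState:
    """Hypothetical snapshot of an Audio Node; field names are illustrative."""
    prompt: str = ""
    has_upstream_text_node: bool = False
    is_generating: bool = False
    upstream_is_generating: bool = False


def generate_disabled_reason(state: AudioNodeState) -> Optional[str]:
    """Return why Generate would be disabled, or None when it is enabled."""
    # Assumption: a prompt containing only whitespace counts as "no prompt".
    if not state.prompt.strip() and not state.has_upstream_text_node:
        return "No prompt input and no upstream Text Node connected."
    if state.is_generating:
        return "Audio is already being generated on this node."
    if state.upstream_is_generating:
        return "Upstream task in progress."
    return None
```

An empty node with no upstream connection returns the first reason; connecting a Text Node or typing a prompt clears it.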

Generation modes

The Audio Node has four generation modes. Select your mode from the tabs in the generation panel.

Text to Speech

Converts a text prompt into spoken audio using a selected voice. Use this to generate character narration, voiceovers, or dialogue.

Supported models:

FishAudio — supports official voices and custom voices you create (see Custom voices below)
ElevenLabs V3 — uses ElevenLabs official voices only; custom voices not supported

How it works: Type or paste your script in the prompt field (up to 5,000 characters), select a voice, and click Generate.

Tip: If an upstream Text Node is connected, its content is passed as the prompt automatically — useful for turning a script node directly into voiceover.
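As a rough illustration of the Text to Speech constraints described above (the 5,000-character prompt limit and which model accepts custom voices), here is a small sketch. The function name, the dictionary, and the wording of the returned messages are hypothetical; only the limit and the model capabilities come from this page.

```python
from typing import List

MAX_PROMPT_CHARS = 5000  # limit stated on this page

# Whether each Text to Speech model accepts custom voices, per the list above.
CUSTOM_VOICE_SUPPORT = {"FishAudio": True, "ElevenLabs V3": False}


def check_tts_request(model: str, prompt: str, voice_is_custom: bool) -> List[str]:
    """Return a list of problems with a request; an empty list means it looks valid."""
    problems = []
    if model not in CUSTOM_VOICE_SUPPORT:
        problems.append(f"unknown model: {model}")
    elif voice_is_custom and not CUSTOM_VOICE_SUPPORT[model]:
        problems.append(f"{model} does not support custom voices")
    if len(prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    return problems
```

Returning a list rather than raising on the first problem is a design choice for the sketch, so that all issues can be shown at once.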

Text to Music

Generates a music track from a text description of the style, mood, and feel you want.

Supported model: Music V3 (Lyria 3 Pro)

Duration options: 5s to 300s, in 30-second increments

How it works: Describe the music you want in the prompt field (e.g. "An energetic and catchy pop song with a driving beat, bright synths"), select a duration, and click Generate.

⚠️ Note: Music V3 charges based on duration, so longer clips cost more credits. Check the credit cost on the Generate button before confirming.

Text to Sound Effect

Generates a sound effect from a text description.

Supported model: ElevenLabs (sound effects API)

Duration options: 0.5s to 30s

How it works: Describe the sound you want (e.g. "Ambient sound of a busy street on a sunny morning"), select a duration, and click Generate.
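The duration range for sound effects can be expressed as a simple bounds check. This is only a sketch: the constant and function names are hypothetical, and treating both ends of the 0.5s to 30s range as inclusive is an assumption.

```python
SFX_MIN_SECONDS = 0.5   # lower bound stated on this page
SFX_MAX_SECONDS = 30.0  # upper bound stated on this page


def is_valid_sfx_duration(seconds: float) -> bool:
    """Check a requested sound effect length against the documented range.

    Assumption: both bounds are inclusive.
    """
    return SFX_MIN_SECONDS <= seconds <= SFX_MAX_SECONDS
```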

⚠️ Note: This model performs best with English prompts. If you write your prompt in another language, it is automatically translated to English before being sent to the model. Your original text remains visible in the input field.

Voice Design

Generates voice samples based on a text description of a character's vocal qualities. Use this to audition and create custom voices before committing to a full voiceover.

Supported model: ElevenLabs (Voice Design API)

How it works: Describe the voice you want (e.g. "An older woman with a thick Southern accent. She is sweet and sarcastic."). The model generates 3 samples of approximately 8 seconds each. The first sample is shown by default.

Once you have a voice sample you like, you can save it as a custom voice for use in Text to Speech via FishAudio (see Custom voices below).

Generation panel

Element	Description
Mode tabs	Text to Speech / Text to Music / Text to Sound Effect / Voice Design
Upstream text icon	Inline icon shown when a Text Node is connected upstream
Prompt input	Your text instructions; up to 5,000 characters
Voice selector	(Text to Speech only) Choose an official or custom voice
Model selector	Choose the AI model for the current mode
Duration	(Music and Sound Effect modes) Select output length
Generate button	Shows credit cost; click to confirm and start generation

Custom voices

Custom voices let you create and reuse a specific voice across your canvas projects. This feature is only available when FishAudio is selected as the model in Text to Speech mode.

Adding a custom voice

In the Audio Node generation panel (Text to Speech mode, FishAudio selected), click Add Voice.
The canvas enters voice selection mode: all non-audio nodes, and any Audio Nodes whose audio is shorter than 5 seconds, are greyed out.
Click any eligible Audio Node (must contain audio longer than 5 seconds) to select it as your voice reference.
A confirmation panel appears below the selected node. Confirm to create the voice.
FishAudio processes the reference audio into a speaker file. The new voice appears in your voice selector.

⚠️ Note: If the current canvas has no eligible Audio Nodes, you'll be prompted to generate one first. Click the cancel button at the bottom of the canvas to exit voice selection mode at any time.
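The eligibility rule used in voice selection mode might look like the filter below. This is a sketch under stated assumptions: the `CanvasNode` class and its fields are hypothetical, and since this page does not say how a clip of exactly 5 seconds is treated, the sketch requires strictly more than 5 seconds.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_REFERENCE_SECONDS = 5.0  # threshold stated on this page


@dataclass
class CanvasNode:
    """Hypothetical canvas node; field names are illustrative."""
    node_id: str
    kind: str  # "audio", "text", "image", or "video"
    duration_seconds: Optional[float] = None  # None when the node holds no audio


def eligible_voice_references(nodes: List[CanvasNode]) -> List[CanvasNode]:
    """Return the nodes that would stay selectable in voice selection mode."""
    return [
        node
        for node in nodes
        if node.kind == "audio"
        and node.duration_seconds is not None
        # Assumption: exactly 5 seconds is not enough; the page says "longer than".
        and node.duration_seconds > MIN_REFERENCE_SECONDS
    ]
```

An empty result corresponds to the case where you are prompted to generate an eligible Audio Node first.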

Managing custom voices

Custom voices appear in the voice selector alongside official voices. You can:

Preview any voice (official or custom) by clicking the play button
Rename a custom voice directly in the selector
Delete a custom voice permanently — a confirmation prompt appears before deletion. Deleting a voice does not affect audio already generated with it.

Custom voices are available across all your canvases, not just the one where they were created.

Connecting to other nodes

Accepted upstream connections

Upstream node	How its content is used
Text Node	Text content is passed as the prompt in the generation panel

Audio Nodes only accept Text Nodes upstream. Image Nodes and Video Nodes cannot connect upstream to an Audio Node.

Supported downstream connections

An Audio Node can connect downstream to a Video Node only. When connected:

The Video Node's generation mode switches to Omni-Reference automatically
The audio option in the Video Node is enabled by default
The Audio Node appears in the Video Node's reference strip
You can use @Sound in the Video Node prompt to reference the connected audio

⚠️ Note: Video models that do not support audio input are greyed out in the Video Node's model selector when an Audio Node is connected upstream.

Audio Nodes cannot connect to other Audio Nodes.
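The connection rules in this section reduce to two small allow-lists. The sketch below is illustrative only: the function name and the lowercase kind strings are hypothetical, and it covers only connections that involve an Audio Node.

```python
# Node kinds allowed directly upstream and downstream of an Audio Node,
# as described in this section.
ALLOWED_UPSTREAM_OF_AUDIO = {"text"}
ALLOWED_DOWNSTREAM_OF_AUDIO = {"video"}


def can_connect(source_kind: str, target_kind: str) -> bool:
    """Check a source -> target connection where at least one end is an Audio Node."""
    if target_kind == "audio":
        return source_kind in ALLOWED_UPSTREAM_OF_AUDIO
    if source_kind == "audio":
        return target_kind in ALLOWED_DOWNSTREAM_OF_AUDIO
    # Rules for other node pairs are outside the scope of this page.
    raise ValueError("this sketch only covers connections involving an Audio Node")
```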

Display area toolbar

State	Toolbar options
Empty / generating / failed	Upload button only
Audio present	Trim, Download, History

Trim

Click Trim to open the trim tool directly on the Audio Node's waveform. Drag the trim handles to select the portion you want to keep. You can preview your selection using the node's playback controls.

When you confirm the trim, the selected clip becomes a new Audio Node on the canvas — the original node is unchanged. The two nodes are not connected.
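Trim behaves as a non-destructive operation: it produces a new node and leaves the source untouched. A minimal sketch of that idea follows; the `AudioClip` class, its fields, and the function name are hypothetical, and real audio data is reduced to a duration for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the original clip cannot be modified in place
class AudioClip:
    """Hypothetical stand-in for an Audio Node's content."""
    node_id: str
    duration_seconds: float


def trim_to_new_node(source: AudioClip, start: float, end: float, new_node_id: str) -> AudioClip:
    """Return a new clip covering [start, end] of the source; the source is unchanged."""
    if not (0 <= start < end <= source.duration_seconds):
        raise ValueError("trim range must lie inside the source clip")
    # The new clip carries no link back to the source, mirroring the
    # statement that the two nodes are not connected.
    return AudioClip(node_id=new_node_id, duration_seconds=end - start)
```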

History

Click History to view all audio clips previously generated on this node. Each entry shows the generation date, model used, duration, and relevant parameters (for example, the voice name for Text to Speech clips).

From the history panel you can:

Play any past clip
Trim or Download a past clip
Set as Main — replaces the current display area content with the selected history clip. Any downstream Video Nodes connected to this node will show an "input updated" notification.
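The Set as Main behavior can be sketched as swapping the displayed clip and notifying downstream nodes. Everything named here is hypothetical (the `AudioNode` class, its fields, and the notification strings); the sketch only mirrors the behavior described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioNode:
    """Hypothetical Audio Node model; clips and nodes are referred to by ID."""
    main_clip: str
    history: List[str]
    downstream_video_nodes: List[str]


def set_as_main(node: AudioNode, clip_id: str) -> List[str]:
    """Make a history clip the displayed clip and return downstream notifications."""
    if clip_id not in node.history:
        raise ValueError(f"{clip_id} is not in this node's history")
    node.main_clip = clip_id
    # One "input updated" notification per connected Video Node.
    return [f"{video_id}: input updated" for video_id in node.downstream_video_nodes]
```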

Content and deletion

When you delete an Audio Node, its audio content is saved to your account assets — it is not lost.
