# Kling Video 3.0 Omni Audio — notes

Status: external reference summary (blog-derived, not contract-verified)
Source type: Kling blog copy provided by user
Captured: 2026-03-29

## Core message
This article positions Kling Video 3.0 Omni audio as a native audiovisual workflow rather than a silent-video-plus-dubbing workflow. The central claims are voice-bound character assets, native lip sync, multilingual dialogue, and prompt-driven ambient soundscapes.

## Main claims
- Native audio is generated together with video inside a unified multimodal framework
- Voice can be bound to character identity through Elements 3.0
- Two primary voice-binding paths are described:
  - Video Extraction: 3–8 second video clip of a single speaking person
  - Multi-Image + Audio Binding: 1–4 images plus separate 5–30 second audio clip
- Voice tags in the `<<<voice_1>>>` style are described for assigning dialogue lines to specific speakers
- `voice_list` is described for ordering or disambiguating multi-speaker dialogue
- Native lip sync is positioned as multilingual across five major languages
- Ambient sound and background music are described as prompt-driven and semantically matched
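The claimed binding constraints (clip duration, image count, audio duration) can be captured as a small client-side pre-check. This is a sketch built only from the blog's stated limits; the function names and structure are ours, not any Kling API.

```python
# Client-side sanity checks for the two claimed voice-binding paths.
# All limits are blog-derived claims, not verified API contracts.

def valid_video_extraction(clip_seconds: float) -> bool:
    """Video Extraction path: a 3-8 second clip of a single speaking person."""
    return 3.0 <= clip_seconds <= 8.0

def valid_multi_image_audio(image_count: int, audio_seconds: float) -> bool:
    """Multi-Image + Audio path: 1-4 images plus a 5-30 second audio clip."""
    return 1 <= image_count <= 4 and 5.0 <= audio_seconds <= 30.0
```

Useful mainly as a place to pin the numbers so they can be corrected once the real contract is extracted.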

## Claimed practical workflow
1. Create character/subject in Element Library
2. Bind visual identity and voice using either video extraction or multi-image + audio
3. Use prompt script with voice tags such as `<<<voice_1>>>`
4. For multiple speakers, use `voice_list`
5. Describe ambient sound / atmosphere in the prompt
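Steps 3-5 above can be sketched as a prompt-assembly helper. The `<<<voice_1>>>` tag style comes from the article; the payload field names (`prompt`, `voice_list`) are placeholders for later field-mapping, not a confirmed request schema.

```python
# Hypothetical prompt assembly mirroring the claimed workflow.
# Field names are placeholders, not a verified Kling request body.

def build_dialogue_prompt(lines: list[tuple[str, str]], ambience: str) -> dict:
    """lines: (voice_tag, dialogue) pairs, e.g. ("voice_1", "Hello.")."""
    script = " ".join(f"<<<{tag}>>> {text}" for tag, text in lines)
    return {
        "prompt": f"{script} Ambient: {ambience}",
        # voice_list is claimed to order/disambiguate multi-speaker dialogue
        "voice_list": [tag for tag, _ in lines],
    }
```

Example: `build_dialogue_prompt([("voice_1", "Hello."), ("voice_2", "Hi.")], "rainy street at night")` yields a tagged script plus an ordered speaker list.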

## Cost and quality claims in article
- Supports up to 15 seconds of continuous video with native audio
- The article cites example pricing:
  - 1080p with native audio: 12 credits per second
  - 720p with native audio: 9 credits per second
  - video with voice control only: 2 credits per second
- Professional mode is framed as the high-quality option for commercial work
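The claimed rates and the 15-second cap make cost estimation simple arithmetic. A minimal sketch, using only the article's numbers (which are unverified blog claims, not confirmed pricing):

```python
# Back-of-envelope credit cost from the article's claimed per-second rates.
# Rates and the 15-second cap are blog claims, not confirmed pricing.

CLAIMED_RATES = {
    "1080p_native_audio": 12,  # credits per second
    "720p_native_audio": 9,
    "voice_control_only": 2,
}

def clip_cost(mode: str, seconds: float) -> float:
    if seconds > 15:
        raise ValueError("article claims a 15-second continuous limit")
    return CLAIMED_RATES[mode] * seconds
```

At the claimed rates, a full 15-second 1080p clip with native audio would be 180 credits.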

## Operational interpretation
- If this article is directionally correct, the strongest audio workflow likely depends on creating reusable character assets first, rather than simply adding dialogue text to a raw generation request.
- The article strongly suggests that audiovisual consistency is tied to Elements 3.0 / character asset workflows.
- Voice tags and `voice_list` may become important payload-level details when we later field-map audio generation.

## Caution / status
- These are blog-derived claims, not yet live-confirmed contracts in our current API work.
- We have not yet verified exact request-body shapes for:
  - audio asset binding
  - `voice_list`
  - voice-tag syntax at API level
  - native-audio generation endpoints/flags
- Treat this note as workflow guidance and a search target for deeper SOT/API extraction, not as final runtime truth.
