# Kling API field-level Tier 1 addendum — audio & elements hints

Status: working addendum
Captured: 2026-03-29
Basis:
- Qingque-derived capture
- blog-note corpus already stored in `docs/current/`

## Purpose
This addendum captures field-level and workflow-level hints that strongly affect next-stage implementation, especially around:
- `video_list`
- `element_list`
- audio / voice
- General Element APIs

---

# 1. General / Element API surface confirmed in Qingque capture
The capture confirms the presence of the following sections:
- General - Create Element
- Create Multi-Image Elements
- Create Video Character Elements
- General - Query Custom Element (Single)
- General - Query Custom Element (List)
- General - Query Presets Element (List)
- General - Delete Custom Element

This is important because the capture does not merely mention elements conceptually; it exposes a full General element-management API surface.

---

# 2. `element_list[]` structure observed directly
Doc-derived structure observed in the Qingque capture:
```json
"element_list": [
  {
    "element_id": long
  },
  {
    "element_id": long
  }
]
```

## Current interpretation
- `element_list[]` is likely the reusable attachment point for assets created through General / Create Element
- this is likely a stronger production path for identity stability than plain inline image references alone

## Status
- field name: doc-derived
- nested field: `element_id` doc-derived
- live generation attachment: confirmed in current workspace testing

## Live-confirmed upgrade
Current workspace testing has now confirmed all of the following:
- reusable elements can be created through the General element surface
- created `element_id` values can be attached through Omni `element_list`
- Omni generation with `element_list` can succeed
- Omni generation with `element_list + sound="on"` can also succeed

The current interpretation should therefore be upgraded from "likely reusable attachment point" to "live-confirmed reusable attachment point for Omni generation, within the current tested scope."
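The live-confirmed flow above can be sketched as a payload builder. Only `element_list`, `element_id`, and `sound` are source-grounded field names here; the function name and the overall body shape are illustrative assumptions, not a confirmed full contract:

```python
def build_omni_element_payload(prompt, element_ids, sound="on"):
    """Sketch: attach live-confirmed reusable elements to an Omni create body.

    Doc-derived pieces: element_list as a list of {"element_id": ...} objects,
    and sound="on" for native audio. Everything else is assumed.
    """
    if not element_ids:
        raise ValueError("at least one element_id is required to attach elements")
    return {
        "prompt": prompt,
        "sound": sound,
        # doc-derived shape: one object per attached element
        "element_list": [{"element_id": eid} for eid in element_ids],
    }

payload = build_omni_element_payload("A recurring mascot waves.", [123456789])
```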

---

# 3. `video_list[]` structure observed directly
Doc-derived structure observed in the Qingque capture:
```json
"video_list": [
  {
    "video_url": "video_url",
    "refer_type": "base",
    "keep_original_sound": "yes"
  }
]
```

## Additional doc-derived note
- only `.mp4` / `.mov` are supported

## Live alignment so far
- `video_url=<base64 video>` -> rejected (`Video URL is invalid`)
- `video_url=<remote url>` -> create/query succeeded

## Interpretation
- `video_url` likely expects an actual reachable remote URL
- `refer_type='base'` appears to define the reference role/class for the incoming video
- `keep_original_sound='yes'` suggests source-audio preservation can be part of the workflow
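The constraints above can be folded into a small validation sketch. The two checks mirror the doc-derived `.mp4`/`.mov` limit and the live-observed rejection of inline base64 data; the helper itself is hypothetical:

```python
def build_video_list_entry(video_url, refer_type="base", keep_original_sound="yes"):
    """Sketch: validate and build one video_list entry per doc-derived limits."""
    # base64 payloads were rejected live ("Video URL is invalid"): require a remote URL
    if not video_url.startswith(("http://", "https://")):
        raise ValueError("video_url must be a reachable remote URL, not inline data")
    # doc-derived: only .mp4 / .mov sources are supported
    if not video_url.lower().endswith((".mp4", ".mov")):
        raise ValueError("only .mp4 / .mov sources are documented as supported")
    return {
        "video_url": video_url,
        "refer_type": refer_type,
        "keep_original_sound": keep_original_sound,
    }

entry = build_video_list_entry("https://example.com/scene1.mp4")
```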

---

# 4. Audio / voice field triage from preserved artifacts
The preserved corpus now supports a cleaner split between **exact standalone audio endpoints** and **still-unresolved native audiovisual binding hints**.

## 4.1 Exact standalone audio surfaces now recoverable
These are doc-derived from preserved extractions, not guesses.

### Text to Speech
Endpoint:
- `POST /v1/audio/tts`

Recovered request fields:
- `text` — required, max 1000 chars
- `voice_id` — required
- `voice_language` — required, default `zh`, enum observed: `zh`, `en`
- `voice_speed` — optional, default `1.0`, range `[0.8, 2.0]`

Recovered result shape:
- `task_result.audios[].id`
- `task_result.audios[].url`
- `task_result.audios[].duration`
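The recovered TTS constraints can be encoded as a request builder. Field names, defaults, and ranges are doc-derived; the helper and its validation order are assumptions:

```python
def build_tts_request(text, voice_id, voice_language="zh", voice_speed=1.0):
    """Sketch: build a POST /v1/audio/tts body per the recovered field rows."""
    if not text or len(text) > 1000:
        raise ValueError("text is required and limited to 1000 characters")
    if voice_language not in ("zh", "en"):
        raise ValueError("observed voice_language enum is zh / en")
    if not 0.8 <= voice_speed <= 2.0:
        raise ValueError("voice_speed must be within [0.8, 2.0]")
    return {
        "text": text,
        "voice_id": voice_id,
        "voice_language": voice_language,
        "voice_speed": voice_speed,
    }
```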

### Text to Audio
Endpoint:
- `POST /v1/audio/text-to-audio`

Recovered request fields:
- `prompt` — required, max 200 chars
- `duration` — required, range `3.0s - 10.0s`, one decimal place
- `external_task_id` — optional
- `callback_url` — optional

Recovered result shape:
- `task_result.audios[].id`
- `task_result.audios[].url_mp3`
- `task_result.audios[].url_wav`
- `task_result.audios[].duration_mp3`
- `task_result.audios[].duration_wav`
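The Text to Audio duration rule (3.0s to 10.0s, one decimal place) can likewise be sketched; only the field names and ranges are doc-derived, the one-decimal check via `round` is an illustrative interpretation:

```python
def build_text_to_audio_request(prompt, duration, external_task_id=None, callback_url=None):
    """Sketch: build a POST /v1/audio/text-to-audio body per recovered rows."""
    if not prompt or len(prompt) > 200:
        raise ValueError("prompt is required and limited to 200 characters")
    if not 3.0 <= duration <= 10.0:
        raise ValueError("duration must be within 3.0s - 10.0s")
    if round(duration, 1) != duration:
        raise ValueError("duration is documented to one decimal place")
    body = {"prompt": prompt, "duration": duration}
    if external_task_id is not None:
        body["external_task_id"] = external_task_id
    if callback_url is not None:
        body["callback_url"] = callback_url
    return body
```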

### Video to Audio
Endpoint:
- `POST /v1/audio/video-to-audio`

Recovered request fields:
- `video_id` — optional
- `video_url` — optional
- rule: exactly one of `video_id` or `video_url` must be supplied; the docs state they cannot both be empty, nor can both carry values
- `video_id` limits: generated by Kling AI, within 30 days, duration `3.0s - 20.0s`
- `video_url` limits: `.mp4/.mov`, max 100MB, duration `3.0s - 20.0s`
- `sound_effect_prompt` — optional, max 200 chars
- `bgm_prompt` — optional, max 200 chars
- `asmr_mode` — optional, default `false`
- `external_task_id` — optional
- `callback_url` — optional

Recovered result shape:
- `task_info.parent_video.id`
- `task_info.parent_video.url`
- `task_info.parent_video.duration`
- `task_result.videos[].id`
- `task_result.videos[].url`
- `task_result.videos[].duration`
- `task_result.audios[].id`
- `task_result.audios[].url_mp3`
- `task_result.audios[].url_wav`
- `task_result.audios[].duration_mp3`
- `task_result.audios[].duration_wav`
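The exactly-one-of rule and the prompt-length limits above can be captured in a builder sketch. The `(video_id is None) == (video_url is None)` check implements the documented "not both empty, not both set" rule; the helper itself is an assumption:

```python
def build_video_to_audio_request(video_id=None, video_url=None,
                                 sound_effect_prompt=None, bgm_prompt=None,
                                 asmr_mode=False):
    """Sketch: build a POST /v1/audio/video-to-audio body per recovered rows."""
    # doc-derived rule: exactly one of video_id / video_url must be supplied
    if (video_id is None) == (video_url is None):
        raise ValueError("supply exactly one of video_id or video_url")
    for name, value in (("sound_effect_prompt", sound_effect_prompt),
                        ("bgm_prompt", bgm_prompt)):
        if value is not None and len(value) > 200:
            raise ValueError(f"{name} is limited to 200 characters")
    body = {"asmr_mode": asmr_mode}
    if video_id is not None:
        body["video_id"] = video_id
    else:
        body["video_url"] = video_url
    if sound_effect_prompt is not None:
        body["sound_effect_prompt"] = sound_effect_prompt
    if bgm_prompt is not None:
        body["bgm_prompt"] = bgm_prompt
    return body
```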

### Custom Voice / preset voice surfaces
Endpoints observed:
- `POST /v1/general/custom-voices`
- `GET /v1/general/custom-voices/{id}`
- `GET /v1/general/custom-voices`
- `GET /v1/general/presets-voices`
- `POST /v1/general/delete-voices`

Recovered creation fields:
- `voice_name` — required, max 20 chars
- `voice_url` — optional; supports `.mp3/.wav/.mp4/.mov`
- `video_id` — optional
- `callback_url` — optional
- `external_task_id` — optional

Recovered source constraints:
- source voice must be clean, one human voice only, 5s-30s
- `video_id` customization path only supports qualifying videos; preserved docs explicitly mention V2.6 + `sound=on`, Avatar API outputs, or Lip-Sync API outputs

Recovered voice result/list fields:
- `voices[].voice_id`
- `voices[].voice_name`
- `voices[].trial_url`
- `voices[].owned_by`
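The custom-voice creation constraints can be sketched the same way. The name limit and the source-file extensions are doc-derived; the helper is hypothetical and does not attempt to verify the 5s-30s single-voice source rule, which is only checkable server-side:

```python
VOICE_SOURCE_EXTENSIONS = (".mp3", ".wav", ".mp4", ".mov")

def build_custom_voice_request(voice_name, voice_url=None, video_id=None):
    """Sketch: build a POST /v1/general/custom-voices body per recovered rows."""
    if not voice_name or len(voice_name) > 20:
        raise ValueError("voice_name is required, max 20 characters")
    if voice_url is not None and not voice_url.lower().endswith(VOICE_SOURCE_EXTENSIONS):
        raise ValueError("voice_url supports .mp3/.wav/.mp4/.mov")
    body = {"voice_name": voice_name}
    if voice_url is not None:
        body["voice_url"] = voice_url
    if video_id is not None:
        body["video_id"] = video_id
    return body
```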

## 4.2 Still blog-derived / not yet field-level grounded in preserved API tables
From stored audio and Omni notes, the following concepts recur, but their grounding level is now split more precisely:
- exact preserved API-table support now exists for `voice_list` on Image-to-Video
- exact preserved prompt-level guidance now exists for voice tags such as `<<<voice_1>>>` on both Text-to-Video and Image-to-Video
- Elements 3.0 voice binding is doc-derived through element surfaces, including `element_voice_id`
- video extraction for character + voice remains workflow-described, not fully reduced into one closed request-body reference here
- multi-image + separate audio binding remains workflow-described, not yet field-level closed here
- native lip sync remains capability-described, not yet reduced to a dedicated request-field row in the preserved video-generation tables used here
- multilingual dialogue remains capability-described
- ambient sound / background music prompt control inside native video-generation payloads remains incomplete at Tier 1 for Series 3 generation

## Exact preserved evidence now available
### Image-to-Video
Preserved `apiReference_model__imageToVideo.txt` shows an exact request-body row:
- `voice_list` — `array`, optional
- documented meaning: list of voices referenced when generating videos
- limit: up to 2 voices
- constraint: when `voice_list` is not empty and the prompt references a voice ID, billing is “with specified voice”
- source restriction: `voice_id` must come from Custom Voices or Presets Voices, and explicitly not from Lip-Sync voice IDs
- exclusivity: `element_list` and `voice_list` are mutually exclusive and cannot coexist
- example shape:
```json
"voice_list": [
  { "voice_id": "voice_id_1" },
  { "voice_id": "voice_id_2" }
]
```

The same preserved Image-to-Video source also contains an invocation example placing both constructs directly in the create payload:
- `prompt: "<<<voice_1>>>Ask the people in the picture to say ..."`
- sibling request-body field `voice_list: [{"voice_id": "..."}]`
- `sound: "on"`

This is the strongest currently preserved request-body evidence for native speaker binding.

## Current field-level interpretation of Image-to-Video native speaker binding
- the strongest currently preserved native speaker-binding contract is:
  - speaker tag inside `prompt`
  - sibling `voice_list`
  - `sound: "on"`
- `voice_list` is not a standalone substitute for prompt speaker placement; the preserved example and notes tie the binding semantics to prompt tags such as `<<<voice_1>>>`
- `voice_id` sources are constrained to Custom Voices / Presets Voices and explicitly exclude Lip-Sync voice IDs
- `element_list` and `voice_list` are preserved as mutually exclusive on Image-to-Video create
- therefore the current safest doc-derived interpretation is:
  - speaker selection = `voice_list`
  - speaker placement/binding inside the scene = `prompt` tag such as `<<<voice_1>>>`
  - audio enablement = `sound: "on"`
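The three-part contract above (speaker tag in `prompt` + sibling `voice_list` + `sound: "on"`) can be sketched as one builder. The field names, the two-voice cap, and the `element_list` exclusivity are doc-derived from the preserved Image-to-Video table; the function and its checks are illustrative:

```python
def build_i2v_voice_payload(prompt, voice_ids, element_list=None):
    """Sketch: Image-to-Video create body for native speaker binding."""
    # preserved exclusivity: element_list and voice_list cannot coexist
    if element_list:
        raise ValueError("element_list and voice_list are mutually exclusive")
    # preserved limit: up to 2 voices
    if not 1 <= len(voice_ids) <= 2:
        raise ValueError("voice_list supports up to 2 voices")
    # binding semantics are tied to a prompt tag such as <<<voice_1>>>
    if "<<<voice_1>>>" not in prompt:
        raise ValueError("prompt must place the speaker via a tag like <<<voice_1>>>")
    return {
        "prompt": prompt,
        "voice_list": [{"voice_id": v} for v in voice_ids],
        "sound": "on",
    }

payload = build_i2v_voice_payload(
    '<<<voice_1>>>Ask the people in the picture to say "hello".', ["voice_abc"])
```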

### Text-to-Video
Preserved `apiReference_model__textToVideo.txt` confirms prompt-level semantics inside the `prompt` field notes:
- `Use <<<voice_1>>> to specify voice, same sequence as voice_list`
- up to 2 voices
- when specifying voice, `sound` must be `on`
- simpler grammar is preferred
- example: `The man <<<voice_1>>> said, "Hello.".`

However, in the currently preserved extraction used here, Text-to-Video has not yet been separately reduced to an exact standalone `voice_list` request-body row with the same confidence level as Image-to-Video.

## Current field-level interpretation of Text-to-Video native speaker binding
- Text-to-Video is now stronger than a pure capability hint because preserved prompt-field notes explicitly mention:
  - `<<<voice_1>>>`
  - up to 2 voices
  - `sound` must be `on` when specifying voice
- however, Text-to-Video still does not yet have a separately transcribed `voice_list` request-body row at the same confidence level as Image-to-Video
- current safest interpretation:
  - prompt-level speaker-tag semantics are source-grounded
  - full request-body parity with Image-to-Video remains unclosed until an exact `voice_list` row is recovered or live-confirmed

## Current unresolved audio questions that still require live verification
These are no longer documentation-discovery questions only; they are now runtime-contract questions.

1. Whether Text-to-Video accepts `voice_list` with the same exact request-body semantics as Image-to-Video
2. Whether `element_voice_id` can coexist with generation-time `voice_list`
3. Whether native speech, ambient sound, and BGM are controlled by separate generation-time fields in Series 3 create payloads, or only through narrower dedicated audio endpoints / prompts
4. Whether native audiovisual binding behaves identically across Text-to-Video, Image-to-Video, and Omni generation paths when speaker-selection is done by explicit `voice_list` rather than by prompt-only dialogue
5. Whether any additional native lip-sync-specific request-body row exists in preserved Series 3 generation tables beyond the currently recovered prompt-tag + `voice_list` + `sound` pattern

## Newly closed audio/runtime points from current live testing
These points were previously softer and are now materially stronger:

- Omni native audio generation is live-confirmed to succeed with `sound="on"` in current workspace testing
- prompt-level dialogue without explicit `voice_list` is live-confirmed to produce native audio in Omni current testing
- `element_list + sound="on" + prompt-level dialogue` is live-confirmed to succeed in Omni current testing
- Korean prompt-level dialogue is live-confirmed to produce audible output in Omni current testing

This does **not** prove full parity with Image-to-Video `voice_list` semantics.
It does prove that current Omni native-audio generation is stronger than a merely hypothetical path.

## Why this matters for API planning
This means the audio state is no longer simply “unknown.”
Instead:
- standalone audio generation is field-level visible
- custom voice management is field-level visible
- Image-to-Video native speaker binding is field-level visible for `voice_list`
- prompt-level `<<<voice_1>>>` placement is preserved inside the `prompt` field notes/examples
- Omni native audiovisual generation is now also live-confirmed at the practical level for prompt-level dialogue and for `element_list + sound="on"`
- native audiovisual binding into Series 3 generation is still only partially closed overall because Text-to-Video parity, explicit `voice_list` parity outside Image-to-Video, element-voice coexistence behavior, and deeper lip-sync/native-audio contract rows are still incomplete

## Current closeout boundary for audio docs
- documentation is now strong enough to distinguish:
  - exact standalone audio endpoints
  - exact voice-management endpoints
  - strongest currently preserved native speaker-binding pattern
  - exact unresolved questions that require live verification
- live testing is now strong enough to distinguish:
  - Image-to-Video-style explicit `voice_list` binding as the strongest preserved field-level contract
  - Omni prompt-level native dialogue generation as a practical live-confirmed path
  - Omni `element_list + sound="on"` as a practical live-confirmed path
- therefore the remaining audio work should now be read as:
  - **documentation parity cleanup** for any still-missing exact rows
  - **runtime semantics verification** for behavior that preserved SOT alone cannot close

---

# 5. Quality-related hint from Qingque capture
The capture contains a quality/pro mode hint equivalent to:
- Professional Mode -> generating 1080P videos -> higher quality video output

## Current interpretation
- `mode='pro'` likely maps to better visual quality
- this should be preferred for premium one-scene outputs once live-confirmed for the relevant endpoint/task combinations

---

# 6. Implementation priority implications
## Highest-value near-term targets
1. Finish exact field extraction for audio/voice payload structures if present in Qingque body tables
2. Clarify Text-to-Video `voice_list` parity vs Image-to-Video using preserved evidence or explicit negative classification
3. Promote the `video_list` remote-URL workflow into a documented, tested recipe
4. Clarify `video_list + element_list` as complementary continuity + recurring-identity path

## Practical production hypothesis
The strongest future production stack likely becomes:
- one scene -> one 15s clip
- `mode='pro'`
- `image_list(first_frame/end_frame)` for frame anchoring when needed
- `video_list(remote url)` for scene-to-scene continuation when needed
- `element_list(element_id)` for strong character/product consistency
- optional explicit `voice_list` binding where field-level confirmed
- prompt-level native dialogue on Omni where practical live-confirmation is sufficient
- stronger native lip-sync/voice-control interpretation only where exact contract is confirmed or explicitly live-verified
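Under the hypothesis above, the one-scene payload could be assembled as follows. The individual field shapes are the doc-derived ones from earlier sections; the combination itself is unverified, and `image_list` is omitted because its entry shape has not been reduced here:

```python
def build_scene_payload(prompt, continue_from_url=None, element_ids=None, voice_ids=None):
    """Sketch: hypothesized premium one-scene create body (unverified combination)."""
    # preserved Image-to-Video rule: element_list and voice_list cannot coexist
    if element_ids and voice_ids:
        raise ValueError("element_list and voice_list are mutually exclusive")
    body = {"prompt": prompt, "mode": "pro", "sound": "on"}
    if continue_from_url:
        # doc-derived video_list entry shape for scene-to-scene continuation
        body["video_list"] = [{
            "video_url": continue_from_url,
            "refer_type": "base",
            "keep_original_sound": "yes",
        }]
    if element_ids:
        body["element_list"] = [{"element_id": e} for e in element_ids]
    if voice_ids:
        body["voice_list"] = [{"voice_id": v} for v in voice_ids]
    return body
```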

---

# 7. What still must be extracted from raw SOT
- exact General - Create Element request-body fields
- exact Create Multi-Image Elements example body
- exact Create Video Character Elements example body
- exact audio-specific request fields (if shown in raw tables)
- exact Text-to-Video `voice_list` row, if separately present in raw capture beyond prompt notes
