# Kling pipeline approved payload patterns

Status: active implementation guide
Captured: 2026-03-29

## Purpose
Turn the current field-level documentation and live findings into a practical implementation guide covering what the pipeline should actively prefer, what it may use with caution, and what it must avoid.

---

## 1. Approved patterns (current best choices)

### Production defaults
- production model pair:
  - non-Omni: `kling-v3`
  - Omni: `kling-v3-omni`
- production asset rule for image inputs:
  - keep source-of-truth assets on our side
  - send the actual image attachment payload we control (raw base64) as the default
  - do not make remote image URLs the default production dependency for image inputs
- preserve endpoint-specific field shapes exactly:
  - `image2video` -> top-level `image` / optional `image_tail`
  - Omni -> `image_list[].image_url`
- critical Omni production limits from SOT:
  - with no reference video: `reference images + reference elements <= 7`
  - with reference video: `reference images + reference elements <= 4`
  - `end_frame` is not supported when there are more than 2 images
- current operational reading for `mode`:
  - `std` remains the default economical baseline
  - `pro` is currently the better choice when higher-resolution output matters
  - `pro` may also be the safer choice for some reference-heavy / appearance-sensitive Omni runs
- current operating policy:
  - keep `std` as the default mode in the pipeline
  - use `pro` selectively when quality/resolution is the goal or when a `std` run shows meaningful weakness
- production-facing route scope in this guide is intentionally narrower than the full docs corpus:
  - keep `text2video`, `image2video`, and Omni in the current production layer
  - verified production-facing routes are now recorded explicitly:
    - `text2video` with `kling-v3` = verified
    - `image2video` with `kling-v3` = verified
    - `omni` with `kling-v3-omni` = verified
  - do not treat `reference2video` as a current production-default route in this guide
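
The Omni limits above can be enforced pipeline-side before dispatch. A minimal sketch, assuming the payload uses the field names shown in this guide (`image_list`, `element_list`, `video_list`, `sound`); the helper name and return shape are ours:

```python
def check_omni_reference_limits(payload: dict) -> list[str]:
    """Validate the SOT Omni limits described above.

    Returns a list of human-readable violations; empty when the
    payload passes.
    """
    errors = []
    images = payload.get("image_list", [])
    elements = payload.get("element_list", [])
    videos = payload.get("video_list", [])

    # reference images + reference elements <= 7 (no reference video)
    # or <= 4 (with reference video)
    limit = 4 if videos else 7
    if len(images) + len(elements) > limit:
        errors.append(
            f"reference images + elements = {len(images) + len(elements)} "
            f"exceeds limit {limit}"
        )

    # end_frame is not supported when there are more than 2 images
    has_end_frame = any(i.get("type") == "end_frame" for i in images)
    if has_end_frame and len(images) > 2:
        errors.append("end_frame is not supported with more than 2 images")

    # with a reference video present, sound can only be 'off'
    if videos and payload.get("sound") == "on":
        errors.append("sound must be 'off' when a reference video is present")
    return errors
```

Running this as a pre-submit gate keeps rejected payloads from ever reaching the API.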

### A. One-scene image-anchored Omni generation
Use when:
- a single scene fits inside 15 seconds
- one clip should contain the whole scene
- reference image should anchor the start state

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "5|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "image_list": [
    {
      "image_url": "<raw-base64-image>",
      "type": "first_frame"
    }
  ]
}
```

### A2. Two-person transformed-scene Omni generation (current best live recipe for 2-person audiovisual probing)
Use when:
- two distinct reference people must appear in one coherent scene
- the target scene should transform away from the source portrait context
- wardrobe should adapt to the target setting rather than preserve the portrait clothing
- dialogue / audio may be added in the same shot

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "... both women visible/distinct ... wardrobe adapted to target setting ... no extra people ... <<<voice_1>>>... <<<voice_2>>>...",
  "duration": "5",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "sound": "on",
  "image_list": [
    {"image_url": "<raw-base64-person-1>"},
    {"image_url": "<raw-base64-person-2>"}
  ]
}
```

Operational reading:
- for portrait-like subject references, avoid `type='first_frame'` when the real goal is transformed-scene generation rather than strict first-frame preservation
- current live evidence suggests `first_frame` can over-anchor the opening frames and make them look like lightly animated reference portraits
- explicit wardrobe-adaptation wording is important when the scene clothing should differ from the source portrait clothing
- current strongest use case for this pattern is 2-person scene grounding first, then dialogue/audio on top

### B. Start/end frame guided Omni generation
Use when:
- the scene should begin and end at controlled visual states
- a one-scene clip still needs stronger beginning/end anchors

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "5|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "image_list": [
    {"image_url": "<raw-base64-start-frame>", "type": "first_frame"},
    {"image_url": "<raw-base64-end-frame>", "type": "end_frame"}
  ]
}
```

### C. One-scene image2video generation
Use when:
- the task is a non-Omni image-to-video path
- production wants the current non-Omni base model rather than Omni
- the image input should come from repo-controlled source-of-truth assets
- optional `image_tail` is used with caution: current live behavior looks more like start/end transition guidance than true multi-reference scene binding

Preferred pattern:
```json
{
  "model_name": "kling-v3",
  "prompt": "...",
  "duration": "5|10",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "image": "<raw-base64-image>",
  "image_tail": "<optional-raw-base64-end-image>"
}
```

### D. Multi-shot Omni generation
Use when:
- one clip contains multiple internal shots
- the scene is still one scene, but needs controlled shot progression

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "multi_shot": true,
  "shot_type": "customize",
  "multi_prompt": [
    {"index": 1, "prompt": "...", "duration": "5"},
    {"index": 2, "prompt": "...", "duration": "5"},
    {"index": 3, "prompt": "...", "duration": "5"}
  ]
}
```
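
An easy mistake with multi-shot payloads is a segment/duration mismatch. A small pre-submit consistency check; the rules that per-segment durations should sum to the top-level `duration` and that `index` should run 1..N are our working assumptions, not a confirmed contract:

```python
def check_multi_shot(payload: dict) -> list[str]:
    """Sanity-check a multi-shot Omni payload before dispatch.

    Assumptions (ours, not SOT-confirmed): multi_prompt durations sum
    to the top-level duration, and index values run 1..N without gaps.
    """
    errors = []
    shots = payload.get("multi_prompt", [])
    if payload.get("multi_shot") and not shots:
        errors.append("multi_shot is true but multi_prompt is empty")
    if sorted(s["index"] for s in shots) != list(range(1, len(shots) + 1)):
        errors.append("multi_prompt index values must run 1..N")
    total = sum(int(s["duration"]) for s in shots)
    if shots and total != int(payload["duration"]):
        errors.append(
            f"segment durations sum to {total}, expected {payload['duration']}"
        )
    return errors
```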

### E. Scene-to-scene continuation via remote video reference
Use when:
- clip B should inherit continuity from clip A
- clips are treated as separate scenes, not fake pieces of one hidden scene
- one scene must be split across multiple clips and then reassembled in edit, while preserving as much continuity as is realistically available

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "5|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "sound": "off",
  "video_list": [
    {
      "video_url": "<remote-url>",
      "refer_type": "feature",
      "keep_original_sound": "yes"
    }
  ]
}
```

Operational reading:
- `video_list` is the current scene-continuity surface, not the reusable-character surface
- current SOT-visible contract is still URL-shaped (`video_url`); do not assume a URL-free direct attachment path here without separate documented support
- the most realistic current single-pipeline implementation is short-window chaining: when clip A completes, immediately feed Kling's returned result URL into clip B as `video_list[].video_url`
- this is a practical fallback for split-scene continuity, not a promise of exact same-clip extension

Code-facing note:
- the current scaffold should default `sound='off'` for this helper path because the recovered Omni docs explicitly say that when a reference video is present, `sound` can only be `off`
- this note is narrow to the `video_list[]` Omni path and should not be over-read as a resolved statement about broader audio semantics
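
The short-window chaining described above can be sketched as a small payload builder that takes clip A's returned result URL and produces clip B's request (the function name is ours; the submit/poll client is assumed and not shown):

```python
def continuation_payload(prompt: str, prev_result_url: str,
                         duration: str = "5") -> dict:
    """Build a clip-B Omni payload chained from clip A's result URL.

    Per the doc-derived rule, sound can only be 'off' when a reference
    video is present, so this helper hardcodes it.
    """
    return {
        "model_name": "kling-v3-omni",
        "prompt": prompt,
        "duration": duration,
        "mode": "std",
        "aspect_ratio": "16:9",
        "sound": "off",  # forced off for the video_list[] Omni path
        "video_list": [
            {
                "video_url": prev_result_url,
                "refer_type": "feature",
                "keep_original_sound": "yes",
            }
        ],
    }
```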

### F. Reusable identity / asset binding (future-leaning approved direction)
Use when:
- stronger character/product consistency is needed
- reusable assets exist
- the intended workflow looks like serialized/drama-style recurring characters rather than one-off scene references

Preferred pattern direction:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "5|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "element_list": [
    {"element_id": 12345}
  ]
}
```

Status note:
- structure is doc-derived
- exact live generation flow still needs fuller element-create contract extraction

### F2. Two-person conversational Omni generation with reusable elements (live-confirmed)
Use when:
- two recurring people should stay identity-stable across runs
- the scene is a spoken dialogue scene, not just a beauty shot
- reusable `element_id` assets already exist
- audio should be generated in the same clip

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "... medium close shot or upper-body dialogue framing ... both faces and both mouths clearly visible ... wardrobe adapted to target scene ... natural Korean conversation timing ... <<<voice_1>>>... <<<voice_2>>>...",
  "duration": "5|10|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "sound": "on",
  "element_list": [
    {"element_id": 12345},
    {"element_id": 67890}
  ]
}
```

Operational reading:
- live testing now confirms that `element_list + sound='on' + dialogue prompt` can create and succeed on Omni
- the main failure mode is not necessarily audio generation failure; it is often framing mismatch, where subjects are too small in frame for lip readability
- for spoken dialogue, prompt the camera distance explicitly (`medium close shot`, `upper-body framing`, `both mouths visible`) instead of letting the model drift into a wide establishing shot
- treat shot-scale instructions as first-class control, not stylistic garnish
- the same rule applies across languages: explicit spoken language + explicit speaker lines + explicit dialogue framing
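
The speaker-marker convention used in these prompts can be assembled programmatically. A sketch; the `<<<voice_N>>>` marker format is taken from the payload examples above, everything else is our scaffolding:

```python
def dialogue_prompt(scene: str, lines: list[str]) -> str:
    """Compose a two-person dialogue prompt with stable voice markers.

    scene carries the framing/wardrobe instructions ('medium close
    shot, both mouths clearly visible, wardrobe adapted to target
    setting'); lines alternate between voice_1 and voice_2.
    """
    turns = [
        f"<<<voice_{(i % 2) + 1}>>>{text}" for i, text in enumerate(lines)
    ]
    return " ".join([scene] + turns)
```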

### F3. Conversational audiovisual prompting rules (generalized from current live tests)
These are not 5-second-only tips. They are reusable prompt rules that should scale across durations, languages, and recurring-character scenes.

#### Character grounding
- number of important on-screen speakers should usually match the number of grounded assets supplied
- for 2-person scenes, prefer either two reference images or two reusable elements; do not strongly ground one person and leave the second person mostly to prompt inference
- use Omni as the default route for multi-person audiovisual scene composition

#### Scene transformation
- if the source assets are portrait-like but the target is a different scene, avoid `first_frame` unless strict opening-frame preservation is actually desired
- explicitly tell the model to adapt wardrobe to the target setting when portrait clothing should not carry through
- if unwanted clothing persistence is likely, say so directly (`do not preserve portrait clothing`)

#### Dialogue framing
- if speech readability matters, explicitly request a dialogue-capable framing: `medium shot`, `medium close shot`, `upper-body framing`, `close enough to clearly see both faces and both mouths`
- avoid relying on a wide environmental shot when the actual goal is turn-taking readability or lip-readability
- if a wide beauty shot is desired, treat it as a different shot objective from a dialogue-coverage shot

#### Dialogue scripting
- provide explicit speaker turns in the prompt rather than leaving the conversation structure implicit
- use stable speaker markers such as `<<<voice_1>>>` and `<<<voice_2>>>`
- keep speech density proportional to clip duration; shorter clips need fewer/simpler lines, while longer clips can support more turn-taking
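
To make "speech density proportional to clip duration" actionable, a rough word budget can be computed. The 2.5 words/second (~150 wpm) pacing figure is a generic conversational-speech estimate, not from the Kling docs:

```python
def max_dialogue_words(duration_s: int, rate_wps: float = 2.5) -> int:
    """Rough word budget for a clip of the given duration.

    2.5 words/second approximates relaxed conversational pacing;
    treat the result as a ceiling, not a target, and leave room for
    pauses and turn-taking.
    """
    return int(duration_s * rate_wps)
```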

#### Audio / language instruction
- use `sound: "on"` when speech generation is desired
- explicitly name the intended spoken language if language matters
- if natural timing is important, say so directly (`natural Korean conversation timing`, `subtle facial expressions`, `gentle head turns`)

#### Evaluation rule
- if sync seems weak in a result where the people are visually small, first diagnose shot scale before concluding that the audio/visual alignment itself failed
- distinguish between: (a) actual sync failure, and (b) unreadable sync caused by distant framing

---

## 2. Allowed-with-caution patterns

### A. Top-level `image=<base64>` on Omni
Status:
- live create/succeed possible
- but reference adherence was weak in observed tests

Rule:
- may be used only as a fallback or exploratory path
- should not be treated as the preferred strong reference route
- production default for image-anchored Omni remains `image_list[].image_url` populated with repo-controlled raw base64 image data

### B. Remote image URL values where the upstream contract allows them
Status:
- doc-derived value form exists for some image fields, especially Omni `image_list[].image_url`
- but production-default dependency on remote image hosting is not preferred for this repo

Rule:
- allowed when there is a specific operational reason
- not the default production asset strategy for image inputs

### C. `std` mode for premium outputs
Status:
- valid and cheaper
- but Qingque hints that Professional Mode is the better-quality path

Rule:
- `std` is acceptable for cheap verification
- `pro` should be preferred for high-quality one-scene outputs once appropriate

---

## 3. Disallowed / deprecated patterns

### A. `image_list[].image`
Reason:
- rejected live for Omni

### B. `image_list[].image + index`
Reason:
- rejected live for Omni

### C. Top-level guessed `video_url=<base64 video>`
Reason:
- not SOT-grounded as a validated contract path
- earlier success should not be treated as proof of proper video-reference semantics

### D. Generated-last-frame chaining as the default answer for one scene
Reason:
- quality degradation accumulates
- conflicts with current single-scene <=15s pipeline policy

---

## 4. Current pipeline defaults

### Single-scene default
- one scene -> one clip
- target <= 15 seconds
- prefer first-frame or start/end-frame anchoring
- prefer `pro` later for premium quality

### Cross-scene default
- use separate clips
- continuity between scenes may use `video_list(remote url)` or later stronger asset workflows

### Identity-consistency default
- future best path likely depends on element workflows, not prompt-only tactics

---

## 5. Implementation rule of thumb
If the goal is:
- **one scene** -> start with `image_list(first_frame/end_frame)` or multi-shot inside one clip
- **scene transition** -> use `video_list(remote url)`
- **strong reusable identity** -> move toward `element_list(element_id)` once Tier 2 create schemas are fully transcribed

---

## 6. What this guide prevents
This guide exists to prevent:
- reusing rejected field names
- confusing create success with correct conditioning semantics
- using cheap but weak fallback paths as if they were the best production route
- drifting back into legacy assumptions after documentation has improved


---

## 7. Clarified video-reference rules (2026-03-29)
For `video_list[]` usage:
- `refer_type='base'` means the incoming video is treated as the video to be edited
- with reference video present, `sound` must be `off` according to the doc-derived Omni rules
- `keep_original_sound='yes'|'no'` should be treated as the source-audio retention control inside the video-list entry
