# Kling pipeline approved payload patterns

Status: active implementation guide
Captured: 2026-03-29

## Purpose
Turn the current field-level documentation and live findings into a practical implementation guide covering what the pipeline should actively prefer, what it may use with caution, and what it must avoid.

---

## 1. Approved patterns (current best choices)

### Production defaults
- production model pair:
  - non-Omni: `kling-v3`
  - Omni: `kling-v3-omni`
- production asset rule for image inputs:
  - keep source-of-truth assets on our side
  - send the actual image attachment payload we control (raw base64) as the default
  - do not make remote image URLs the default production dependency for image inputs
- preserve endpoint-specific field shapes exactly:
  - `image2video` -> top-level `image` / optional `image_tail`
  - Omni -> `image_list[].image_url`
- critical Omni production limits from SOT:
  - with no reference video: `reference images + reference elements <= 7`
  - with reference video: `reference images + reference elements <= 4`
  - `end_frame` is not supported when there are more than 2 images
- current operational reading for `mode`:
  - `std` remains the default economical baseline
  - `pro` is currently the better choice when higher-resolution output matters
  - `pro` may also be the safer choice for some reference-heavy / appearance-sensitive Omni runs
- current operating policy:
  - keep `std` as the default mode in the pipeline
  - use `pro` selectively when quality/resolution is the goal or when a `std` run shows meaningful weakness
- production-facing route scope in this guide is intentionally narrower than the full docs corpus:
  - keep `text2video`, `image2video`, and Omni in the current production layer
  - verified production-facing routes are now recorded explicitly:
    - `text2video` with `kling-v3` = verified
    - `image2video` with `kling-v3` = verified
    - `omni` with `kling-v3-omni` = verified
  - do not treat `reference2video` as a current production-default route in this guide
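
The Omni limits above can be enforced pipeline-side before dispatch. A minimal sketch, assuming the payload uses the field names shown in this guide (`image_list`, `element_list`, `video_list`, `sound`); the helper name and return shape are ours:

```python
def check_omni_reference_limits(payload: dict) -> list[str]:
    """Validate the SOT Omni limits described above.

    Returns a list of human-readable violations; empty when the
    payload passes.
    """
    errors = []
    images = payload.get("image_list", [])
    elements = payload.get("element_list", [])
    videos = payload.get("video_list", [])

    # reference images + reference elements <= 7 (no reference video)
    # or <= 4 (with reference video)
    limit = 4 if videos else 7
    if len(images) + len(elements) > limit:
        errors.append(
            f"reference images + elements = {len(images) + len(elements)} "
            f"exceeds limit {limit}"
        )

    # end_frame is not supported when there are more than 2 images
    has_end_frame = any(i.get("type") == "end_frame" for i in images)
    if has_end_frame and len(images) > 2:
        errors.append("end_frame is not supported with more than 2 images")

    # with a reference video present, sound can only be 'off'
    if videos and payload.get("sound") == "on":
        errors.append("sound must be 'off' when a reference video is present")
    return errors
```

Running this as a pre-submit gate keeps rejected payloads from ever reaching the API.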

### A. One-scene image-anchored Omni generation
Use when:
- a single scene fits inside 15 seconds
- one clip should contain the whole scene
- reference image should anchor the start state

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "5|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "image_list": [
    {
      "image_url": "<raw-base64-image>",
      "type": "first_frame"
    }
  ]
}
```

### A2. Two-person transformed-scene Omni generation (current best live recipe for 2-person audiovisual probing)
Use when:
- two distinct reference people must appear in one coherent scene
- the target scene should transform away from the source portrait context
- wardrobe should adapt to the target setting rather than preserve the portrait clothing
- dialogue / audio may be added in the same shot

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "... both women visible/distinct ... wardrobe adapted to target setting ... no extra people ... <<<voice_1>>>... <<<voice_2>>>...",
  "duration": "5",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "sound": "on",
  "image_list": [
    {"image_url": "<raw-base64-person-1>"},
    {"image_url": "<raw-base64-person-2>"}
  ]
}
```

Operational reading:
- for portrait-like subject references, avoid `type='first_frame'` when the real goal is transformed-scene generation rather than strict first-frame preservation
- current live evidence suggests `first_frame` can over-anchor the opening frames and make them look like lightly animated reference portraits
- explicit wardrobe-adaptation wording is important when the scene clothing should differ from the source portrait clothing
- current strongest use case for this pattern is 2-person scene grounding first, then dialogue/audio on top

### B. Start/end frame guided Omni generation
Use when:
- the scene should begin and end at controlled visual states
- a one-scene clip still needs stronger beginning/end anchors

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "5|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "image_list": [
    {"image_url": "<raw-base64-start-frame>", "type": "first_frame"},
    {"image_url": "<raw-base64-end-frame>", "type": "end_frame"}
  ]
}
```

### C. One-scene image2video generation
Use when:
- the task is a non-Omni image-to-video path
- production wants the current non-Omni base model rather than Omni
- the image input should come from repo-controlled source-of-truth assets
- optional `image_tail` is used with caution: current live behavior looks more like start/end transition guidance than true multi-reference scene binding

Preferred pattern:
```json
{
  "model_name": "kling-v3",
  "prompt": "...",
  "duration": "5|10",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "image": "<raw-base64-image>",
  "image_tail": "<optional-raw-base64-end-image>"
}
```

### D. Multi-shot Omni generation
Use when:
- one clip contains multiple internal shots
- the scene is still one scene, but needs controlled shot progression

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "multi_shot": true,
  "shot_type": "customize",
  "multi_prompt": [
    {"index": 1, "prompt": "...", "duration": "5"},
    {"index": 2, "prompt": "...", "duration": "5"},
    {"index": 3, "prompt": "...", "duration": "5"}
  ]
}
```
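
An easy mistake with multi-shot payloads is a segment/duration mismatch. A small pre-submit consistency check; the rules that per-segment durations should sum to the top-level `duration` and that `index` should run 1..N are our working assumptions, not a confirmed contract:

```python
def check_multi_shot(payload: dict) -> list[str]:
    """Sanity-check a multi-shot Omni payload before dispatch.

    Assumptions (ours, not SOT-confirmed): multi_prompt durations sum
    to the top-level duration, and index values run 1..N without gaps.
    """
    errors = []
    shots = payload.get("multi_prompt", [])
    if payload.get("multi_shot") and not shots:
        errors.append("multi_shot is true but multi_prompt is empty")
    if sorted(s["index"] for s in shots) != list(range(1, len(shots) + 1)):
        errors.append("multi_prompt index values must run 1..N")
    total = sum(int(s["duration"]) for s in shots)
    if shots and total != int(payload["duration"]):
        errors.append(
            f"segment durations sum to {total}, expected {payload['duration']}"
        )
    return errors
```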

### E. Scene-to-scene continuation via remote video reference
Use when:
- clip B should inherit continuity from clip A
- clips are treated as separate scenes, not fake pieces of one hidden scene
- one scene must be split across multiple clips and then reassembled in edit, while preserving as much continuity as is realistically available

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "5|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "sound": "off",
  "video_list": [
    {
      "video_url": "<remote-url>",
      "refer_type": "feature",
      "keep_original_sound": "yes"
    }
  ]
}
```

Operational reading:
- `video_list` is the current scene-continuity surface, not the reusable-character surface
- current SOT-visible contract is still URL-shaped (`video_url`); do not assume a URL-free direct attachment path here without separate documented support
- the most realistic current single-pipeline implementation is short-window chaining: when clip A completes, immediately feed Kling's returned result URL into clip B as `video_list[].video_url`
- this is a practical fallback for split-scene continuity, not a promise of exact same-clip extension

Code-facing note:
- the current scaffold should default `sound='off'` for this helper path because the recovered Omni docs explicitly say that when a reference video is present, `sound` can only be `off`
- this note is narrow to the `video_list[]` Omni path and should not be over-read as a resolved statement about broader audio semantics
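
The short-window chaining described above can be sketched as a small payload builder that takes clip A's returned result URL and produces clip B's request (the function name is ours; the submit/poll client is assumed and not shown):

```python
def continuation_payload(prompt: str, prev_result_url: str,
                         duration: str = "5") -> dict:
    """Build a clip-B Omni payload chained from clip A's result URL.

    Per the doc-derived rule, sound can only be 'off' when a reference
    video is present, so this helper hardcodes it.
    """
    return {
        "model_name": "kling-v3-omni",
        "prompt": prompt,
        "duration": duration,
        "mode": "std",
        "aspect_ratio": "16:9",
        "sound": "off",  # forced off for the video_list[] Omni path
        "video_list": [
            {
                "video_url": prev_result_url,
                "refer_type": "feature",
                "keep_original_sound": "yes",
            }
        ],
    }
```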

### F. Reusable identity / asset binding (future-leaning approved direction)
Use when:
- stronger character/product consistency is needed
- reusable assets exist
- the intended workflow looks like serialized/drama-style recurring characters rather than one-off scene references

Preferred pattern direction:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "...",
  "duration": "5|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "element_list": [
    {"element_id": 12345}
  ]
}
```

Status note:
- structure is doc-derived
- exact live generation flow still needs fuller element-create contract extraction

### F2. Two-person conversational Omni generation with reusable elements (live-confirmed)
Use when:
- two recurring people should stay identity-stable across runs
- the scene is a spoken dialogue scene, not just a beauty shot
- reusable `element_id` assets already exist
- audio should be generated in the same clip

Preferred pattern:
```json
{
  "model_name": "kling-v3-omni",
  "prompt": "... medium close shot or upper-body dialogue framing ... both faces and both mouths clearly visible ... wardrobe adapted to target scene ... natural Korean conversation timing ... <<<voice_1>>>... <<<voice_2>>>...",
  "duration": "5|10|15",
  "mode": "std|pro",
  "aspect_ratio": "16:9",
  "sound": "on",
  "element_list": [
    {"element_id": 12345},
    {"element_id": 67890}
  ]
}
```

Operational reading:
- live testing now confirms that `element_list + sound='on' + dialogue prompt` can create and succeed on Omni
- the main failure mode is not necessarily audio generation failure; it is often framing mismatch, where subjects are too small in frame for lip readability
- for spoken dialogue, prompt the camera distance explicitly (`medium close shot`, `upper-body framing`, `both mouths visible`) instead of letting the model drift into a wide establishing shot
- treat shot-scale instructions as first-class control, not stylistic garnish
- the same rule applies across languages: explicit spoken language + explicit speaker lines + explicit dialogue framing
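
The speaker-marker convention used in these prompts can be assembled programmatically. A sketch; the `<<<voice_N>>>` marker format is taken from the payload examples above, everything else is our scaffolding:

```python
def dialogue_prompt(scene: str, lines: list[str]) -> str:
    """Compose a two-person dialogue prompt with stable voice markers.

    scene carries the framing/wardrobe instructions ('medium close
    shot, both mouths clearly visible, wardrobe adapted to target
    setting'); lines alternate between voice_1 and voice_2.
    """
    turns = [
        f"<<<voice_{(i % 2) + 1}>>>{text}" for i, text in enumerate(lines)
    ]
    return " ".join([scene] + turns)
```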

### F3. Conversational audiovisual prompting rules (generalized from current live tests)
These are not 5-second-only tips. They are reusable prompt rules that should scale across durations, languages, and recurring-character scenes.

#### Character grounding
- number of important on-screen speakers should usually match the number of grounded assets supplied
- for 2-person scenes, prefer either two reference images or two reusable elements; do not strongly ground one person and leave the second person mostly to prompt inference
- use Omni as the default route for multi-person audiovisual scene composition

#### Scene transformation
- if the source assets are portrait-like but the target is a different scene, avoid `first_frame` unless strict opening-frame preservation is actually desired
- explicitly tell the model to adapt wardrobe to the target setting when portrait clothing should not carry through
- if unwanted clothing persistence is likely, say so directly (`do not preserve portrait clothing`)

#### Dialogue framing
- if speech readability matters, explicitly request a dialogue-capable framing: `medium shot`, `medium close shot`, `upper-body framing`, `close enough to clearly see both faces and both mouths`
- avoid relying on a wide environmental shot when the actual goal is turn-taking readability or lip-readability
- if a wide beauty shot is desired, treat it as a different shot objective from a dialogue-coverage shot

#### Dialogue scripting
- provide explicit speaker turns in the prompt rather than leaving the conversation structure implicit
- use stable speaker markers such as `<<<voice_1>>>` and `<<<voice_2>>>`
- keep speech density proportional to clip duration; shorter clips need fewer/simpler lines, while longer clips can support more turn-taking
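
To make "speech density proportional to clip duration" actionable, a rough word budget can be computed. The 2.5 words/second (~150 wpm) pacing figure is a generic conversational-speech estimate, not from the Kling docs:

```python
def max_dialogue_words(duration_s: int, rate_wps: float = 2.5) -> int:
    """Rough word budget for a clip of the given duration.

    2.5 words/second approximates relaxed conversational pacing;
    treat the result as a ceiling, not a target, and leave room for
    pauses and turn-taking.
    """
    return int(duration_s * rate_wps)
```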

#### Audio / language instruction
- use `sound: "on"` when speech generation is desired
- explicitly name the intended spoken language if language matters
- if natural timing is important, say so directly (`natural Korean conversation timing`, `subtle facial expressions`, `gentle head turns`)

#### Evaluation rule
- if sync seems weak in a result where the people are visually small, first diagnose shot scale before concluding that the audio/visual alignment itself failed
- distinguish between: (a) actual sync failure, and (b) unreadable sync caused by distant framing

---

## 2. Allowed-with-caution patterns

### A. Top-level `image=<base64>` on Omni
Status:
- live create/succeed possible
- but reference adherence was weak in observed tests

Rule:
- may be used only as a fallback or exploratory path
- should not be treated as the preferred strong reference route
- production default for image-anchored Omni remains `image_list[].image_url` populated with repo-controlled raw base64 image data

### B. Remote image URL values where the upstream contract allows them
Status:
- doc-derived value form exists for some image fields, especially Omni `image_list[].image_url`
- but production-default dependency on remote image hosting is not preferred for this repo

Rule:
- allowed when there is a specific operational reason
- not the default production asset strategy for image inputs

### C. `std` mode for premium outputs
Status:
- valid and cheaper
- but Qingque hints that Professional Mode is the better-quality path

Rule:
- `std` is acceptable for cheap verification
- `pro` should be preferred for high-quality one-scene outputs once appropriate

---

## 3. Disallowed / deprecated patterns

### A. `image_list[].image`
Reason:
- rejected live for Omni

### B. `image_list[].image + index`
Reason:
- rejected live for Omni

### C. Top-level guessed `video_url=<base64 video>`
Reason:
- not SOT-grounded as a validated contract path
- earlier success should not be treated as proof of proper video-reference semantics

### D. Generated-last-frame chaining as the default answer for one scene
Reason:
- quality degradation accumulates
- conflicts with current single-scene <=15s pipeline policy

---

## 4. Current pipeline defaults

### Single-scene default
- one scene -> one clip
- target <= 15 seconds
- prefer first-frame or start/end-frame anchoring
- prefer `pro` later for premium quality

### Cross-scene default
- use separate clips
- continuity between scenes may use `video_list(remote url)` or later stronger asset workflows

### Identity-consistency default
- future best path likely depends on element workflows, not prompt-only tactics

---

## 5. Implementation rule of thumb
If the goal is:
- **one scene** -> start with `image_list(first_frame/end_frame)` or multi-shot inside one clip
- **scene transition** -> use `video_list(remote url)`
- **strong reusable identity** -> move toward `element_list(element_id)` once Tier 2 create schemas are fully transcribed

---

## 6. What this guide prevents
This guide exists to prevent:
- reusing rejected field names
- confusing create success with correct conditioning semantics
- using cheap but weak fallback paths as if they were the best production route
- drifting back into legacy assumptions after documentation has improved


---

## 7. Clarified video-reference rules (2026-03-29)
For `video_list[]` usage:
- `refer_type='base'` means the incoming video is treated as the video to be edited
- with reference video present, `sound` must be `off` according to the doc-derived Omni rules
- `keep_original_sound='yes'|'no'` should be treated as the source-audio retention control inside the video-list entry
