# Kling current state summary

Status: active quick summary
Captured: 2026-03-29
Updated: 2026-03-31

## Purpose
Provide the shortest useful summary of the current Kling pipeline/documentation state without replacing the detailed references.

---

# 1. Pipeline policy
- one scene = one clip
- single-scene ceiling = 15s
- multi-clip chaining is no longer the default answer for one scene
- multi-clip continuity is still allowed for scene-to-scene transitions

# 1.5 Phase 2 model policy
- production model pair is now treated as:
  - `kling-v3` for non-Omni image/video generation paths
  - `kling-v3-omni` for Omni paths
- `kling-video-o1` should no longer be treated as the current 3.0 base production model
- the non-Omni endpoint/model policy is still documented with less certainty than the full platform surface across the preserved docs, but the repo's production-facing default pair is no longer framed as `kling-video-o1 + kling-v3-omni`
- therefore builders/validators/docs should distinguish two different truths:
  - production default pair for this repo: `kling-v3` + `kling-v3-omni`
  - broader endpoint/model support surface may still evolve with later verification
- stable reading for now:
  - current production-facing video families are `text2video`, `image2video`, and Omni
  - historical passes are useful evidence, but not all documented families should stay in the current production-facing layer
  - `reference2video` remains a documented family in the broader docs corpus, but current evidence now suggests it is more likely a legacy/non-current route than a production-default 3.0 path
  - use current production paths with the documented production pair and endpoint-specific payload shapes, while keeping broader endpoint/model support questions clearly separated from production-default truth
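
The model-pair policy above can be pinned down as a small registry. This is a sketch of how the repo's defaults could be encoded, not an existing module: the model names and the legacy status of `reference2video` come from this document, while the function name and error handling are this sketch's own.

```python
# Repo-level production defaults as recorded in this summary.
PRODUCTION_MODELS = {
    "text2video": "kling-v3",
    "image2video": "kling-v3",
    "omni": "kling-v3-omni",
}

# Families deliberately excluded from the current production-facing layer.
LEGACY_FAMILIES = {"reference2video"}


def production_model_for(family: str) -> str:
    """Return the repo's default production model for a family,
    refusing legacy/non-current families outright."""
    if family in LEGACY_FAMILIES:
        raise ValueError(f"{family} is legacy/non-current, not a production default")
    return PRODUCTION_MODELS[family]
```

Keeping the legacy set separate from the production map preserves the two-truths distinction: a family can remain documented without being a production default.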

# 1.6 Production input policy
- source-of-truth assets should stay on our side
- production requests should prefer sending the actual attachment payloads we control (raw base64 image data) instead of depending on remote image URLs as the default
- endpoint shape differences must stay explicit:
  - `image2video` uses top-level `image` / optional `image_tail`
  - Omni uses `image_list[].image_url`
- remote URLs remain a supported/documented value form where the upstream contract allows them, but they are not the default production asset strategy for image inputs in this repo
- verified production-facing image input methods now recorded explicitly:
  - `image2video` with `kling-v3` -> top-level `image=<raw base64>` works
  - `omni` with `kling-v3-omni` -> `image_list[].image_url=<raw base64>` with `type='first_frame'` works
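
The two verified image-input shapes above differ enough that a payload-builder sketch is worth writing down. The image-carrying fields (`image` top-level for `image2video`, `image_list[].image_url` with `type='first_frame'` for Omni, both raw base64) follow what is recorded in this section; the surrounding field names (`model_name`, `prompt`) are illustrative assumptions, not a verified contract.

```python
import base64


def b64(image_bytes: bytes) -> str:
    """Raw base64 string, no data-URI prefix (the shape this repo's logs record)."""
    return base64.b64encode(image_bytes).decode("ascii")


def image2video_payload(image_bytes: bytes, prompt: str) -> dict:
    # Non-Omni path: top-level `image` with `kling-v3`.
    return {
        "model_name": "kling-v3",  # assumption: exact field name
        "prompt": prompt,
        "image": b64(image_bytes),
    }


def omni_payload(image_bytes: bytes, prompt: str) -> dict:
    # Omni path: `image_list[].image_url` carrying raw base64,
    # `type='first_frame'` when start-frame anchoring is wanted.
    return {
        "model_name": "kling-v3-omni",  # assumption: exact field name
        "prompt": prompt,
        "image_list": [{"image_url": b64(image_bytes), "type": "first_frame"}],
    }
```

Note the asymmetry: the Omni builder never emits a top-level `image`, and the non-Omni builder never emits `image_list`, which keeps the endpoint shape differences explicit in code as well as in docs.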

# 1.7 Currently verified production-facing generation routes
- `text2video` with `kling-v3` = verified
- `image2video` with `kling-v3` = verified
- `omni` with `kling-v3-omni` = verified
- `reference2video` should no longer be listed alongside these as a current production-facing route; it is now treated as a legacy/non-current endpoint candidate pending explicit relocation to historical/legacy docs

# 1.8 Preserved Omni risk-blocked scenario
A recent 10s Omni stress test should be preserved as a future-verification scenario, not treated as a schema failure.

Preserved facts:
- endpoint/model used: Omni with `kling-v3-omni`
- duration: `10`
- input method: three user-provided reference images attached as raw base64 in `image_list[].image_url`
- create phase was accepted and returned a task id
- final query result later returned `task_status='failed'` with `task_status_msg='Failure to pass the risk control system'`
- `final_unit_deduction='0'` in the preserved failure result
- current interpretation: this should be read as a risk-control/moderation block on that scenario, not as evidence that the Omni image-list shape itself was invalid
- important correction: an earlier working reference-set mapping was later found to be wrong, so early inferences that "two host images themselves are safe" or that only the lingerie object caused the block should not be treated as reliable evidence

Exact preserved prompt from the create log:
> Two female hosts stand in a private bedroom and notice a lingerie piece hanging on a rack. They react with subtle surprise and admiration. One host lifts the lingerie from the hanger and holds it against her body, then tries it on in place with visible, natural dressing motion. She adjusts smooths the fit while the second host watches with an impressed meaningful smile. The final scene concludes with a woman trying on the lingerie posing with a meaningful smile, followed by a close-up of the garment. smooth and seamless movement, vivid product images, and realistic body and clothing movements blend together.

# 1.9 Current interpretation of multi-reference behavior from live tests
Use the following as the current working interpretation, not a final upstream contract claim:

## Practical tool-selection rule
- **one-scene reference-guided clip** -> Omni `image_list[].image_url`
- **one-clip storyboard / drama-style shot sequencing** -> Omni `multi_shot` + `multi_prompt[]`
- **reusable recurring character / object identity** -> `element_list[].element_id` via Elements workflows
- **cross-scene continuity between separate clips** -> `video_list` continuation / remote video reference
- **same-clip duration extension** -> no current-production route confirmed; historical `video-extend` exists in the broader docs corpus but is legacy/non-current for this repo
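
The rule above is effectively a decision table, so it can be sketched as one. The goal labels on the left are this sketch's own names; the surfaces on the right are taken from the bullets above.

```python
# Doc-derived tool-selection table; keys are this sketch's own labels.
ROUTE_BY_GOAL = {
    "one_scene_reference_clip": "omni:image_list[].image_url",
    "one_clip_storyboard": "omni:multi_shot + multi_prompt[]",
    "recurring_identity": "element_list[].element_id",
    "cross_scene_continuity": "video_list continuation",
}


def pick_route(goal: str) -> str:
    """Map a production goal to the current tool surface,
    refusing the one goal with no current-production route."""
    if goal == "same_clip_extension":
        raise LookupError(
            "no current-production route; historical `video-extend` is legacy"
        )
    return ROUTE_BY_GOAL[goal]
```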

## 2026-03-31 multi-shot prompt experiments — current reading
Recent 4-person Omni multi-shot tests add these operational findings:
- `<<<element_n>>>` prompt templating plus explicit cast-lock wording materially improves cast control compared with a generic group prompt
- `pro` improved visual polish, but did not automatically solve duplicate-person or missing-person failures
- ending anti-freeze wording helped avoid hard freeze-like endings, but final-shot novelty still requires deliberate end-state design

## Continuity tool interpretation
- `video_list` and `element_list` are not the same tool and should be treated as potentially complementary:
  - `video_list` = previous clip / scene reference
  - `element_list` = recurring identity / asset persistence
- `video_list` is the current doc-derived scene-continuity surface, especially for next/previous-shot continuity and style/camera-motion carryover when `refer_type='feature'`
- `element_list` / Video Character Elements should not be treated as a substitute for scene continuity; they are better read as subject/identity asset workflows, not scene-level continuity workflows
- `element_list` is now live-confirmed at the practical level for creation, query, Omni attachment, and Omni `sound='on'` generation; the remaining open question is not whether elements work at all, but how far they complement `video_list` for recurring identity across continued scenes
- `video_list` is useful for scene-to-scene continuity reference, but it should not be over-read as exact extension of the same clip
- exact clip extension is not currently available in the repo's active production route set; historical `video-extend` exists in docs but is legacy/non-current and not supported for the current production pair
- current SOT-visible `video_list` contract is URL-shaped (`video_url` / "Reference Video, obtained via URL"); do not assume a URL-free direct attachment path for scene continuity unless a separate upload->URL provisioning contract is explicitly documented later
- practical current-production fallback for split scenes: if one scene must be broken into multiple clips, the most realistic implementation is short-window chaining using the freshly returned Kling result URL from the previous clip as the next clip's `video_list[].video_url`
- this chaining strategy should be read as a pragmatic continuity fallback, not a guarantee of frame-exact extension; target edit-usable continuity, not seamless same-clip continuation
- strict-closeout residual for continuity is now narrower: the open runtime question is no longer whether `video_list` exists or whether URL-shaped chaining is plausible, but how strong the practical continuity carryover is under controlled live testing and how it interacts with reusable `element_list` identity persistence
- Omni with `image_list` appears to treat the `type='first_frame'` image as a strong start-frame anchor, while additional images behave more like contextual visual references than explicit role-bound slots
- `kling-v3` `image2video` with top-level `image` + `image_tail` should not be described as `reference2video`; it is still an `image2video` test using tail input
- in current live behavior, `image2video` + `image_tail` can plausibly act more like a start/end or transition signal than a true multi-reference scene-binding contract
- therefore successful `image` + `image_tail` outputs should not be over-read as proof that non-Omni `image2video` understands multiple characters as simultaneous scene references
- the separate `reference2video` / `multi-image2video` family should not be used as current production guidance; current multi-reference-like production behavior is better interpreted through Omni `image_list`
- Omni `image_list` may function as the effective current multi-reference route, but it should still be described conservatively as start-frame anchoring plus additional visual references, not explicit role-slot binding
- current live refinement from the 2-woman poolside tests:
  - when the goal is immediate transformed-scene generation rather than strict first-frame preservation, `type='first_frame'` can be too strong for portrait-like references
  - in that situation, the first frames may look like lightly animated reference portraits rather than a newly composed scene
  - removing `first_frame` materially improved early-frame scene composition, two-person grounding stability, and wardrobe adaptation in the latest Omni tests
  - explicit wardrobe-adaptation wording is important when the target scene clothing should differ from the source portrait clothing
- current strongest live-tested 2-person audiovisual recipe is now:
  - endpoint: Omni
  - model: `kling-v3-omni`
  - two raw-base64 reference images in `image_list[].image_url`
  - no `type='first_frame'` for portrait-like subject references when transformed-scene generation is the goal
  - prompt explicitly requiring both women visible/distinct, no extra people, stable scene coherence, and wardrobe adaptation to the target setting
  - `sound='on'`
  - direct prompt-level dialogue lines for each speaker
  - current result state: visually successful 2-person scene grounding plus audio stream + dialogue-like output; remaining weakness is mostly motion richness / conversational acting quality, not baseline feasibility
- current strongest live-confirmed reusable-identity audiovisual recipe is now:
  - endpoint: Omni
  - model: `kling-v3-omni`
  - `element_list` with reusable element ids created through the General element surface
  - `sound='on'`
  - direct prompt-level dialogue lines
  - practical lesson from current runs: dialogue readability still depends heavily on framing; wide shots can make sync quality unreadable even when native audio generation succeeded
  - additional practical lesson from current reproducibility tests: stronger lip-readability did **not** come from aggressively meta-describing mouth motion; the best recent results came from preserving a natural dialogue-scene structure (speaker tags + readable framing + natural timing) while avoiding over-explicit instructions like `both women should visibly move their lips`
- current comparison reading between dual-image grounding and element grounding:
  - no clearly meaningful quality winner has been established yet
  - the more decisive practical lever in current tests was shot objective / framing scale rather than a strong baseline-vs-element superiority result
  - therefore `element_list` is now better treated as a reusable identity control surface whose complementarity with `video_list` remains the next major runtime question
- current lip-motion reproducibility reading:
  - native audio + mouth motion can succeed in Omni current testing
  - the open question is now reproducibility/strength, not bare feasibility
  - current evidence suggests that natural conversational-scene prompting is more reliable than over-explicit mouth-motion meta-prompting
- additional practical deployment note:
  - for recurring **3D / stylized / cartoon** characters whose costume is part of their signature identity, the simpler `element_list + sound='on'` independent-clip strategy may be more viable than it is for realistic humans
  - this does not eliminate drift risk completely, but stylized character design and signature costume blocks may make soft continuity more acceptable and more stable in practice
- critical SOT limit for production use:
  - with no reference video: `number(reference images) + number(reference elements) <= 7`
  - with reference video: `number(reference images) + number(reference elements) <= 4`
  - `end_frame` is not supported when there are more than 2 images
- current live interpretation of `mode`:
  - `pro` appears modestly better than `std` overall
  - `pro` has a practical advantage for higher-resolution output
  - in some reference-heavy cases, `pro` may also be more likely to pass where `std` fails
  - current wording should stay moderate: this is a strong operational tendency, not yet a universal upstream guarantee
- current operating policy for this repo:
  - default mode = `std`
  - escalate to `pro` only when higher resolution, higher polish, or observed `std` instability makes it worth the extra cost

---

# 2. Cost policy
- every create is billable by default
- no “free probe” assumption
- first successful create ends a probe round
- multiple creates for the same immediate purpose require explicit approval
- SOT must be checked before any new payload shape is tried live
- the **first billable Phase 3 create is approval-gated** even if it is doc-derived
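
The rules above amount to a gate that any create call must pass. A minimal sketch, assuming flag names of its own invention (the repo has no such API recorded here); only the rule logic follows the policy bullets:

```python
def may_send_create(sot_checked: bool, probe_round_open: bool,
                    same_purpose_creates: int, approved: bool,
                    phase3_first_create: bool = False) -> bool:
    """Billing guard sketch: every create is assumed billable by default."""
    if not sot_checked:
        return False  # SOT must be checked before any new payload shape goes live
    if not probe_round_open:
        return False  # the first successful create already ended this probe round
    if same_purpose_creates >= 1 and not approved:
        return False  # repeat creates for the same purpose need explicit approval
    if phase3_first_create and not approved:
        return False  # first billable Phase 3 create is approval-gated
    return True
```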

---

# 3. Strongest current Omni reference paths
## Image-anchored
Preferred:
- `image_list[].image_url + type='first_frame'`
- optional `end_frame`

Rejected legacy:
- `image_list[].image`
- `image_list[].image + index`

## Video-anchored
Preferred current direction:
- `video_list[].video_url=<remote url>`
- `refer_type='base'`
- `keep_original_sound='yes'`

Rejected / weak paths:
- base64 video in `video_url`
- old guessed top-level `video_url`

## Asset-anchored
Doc-derived strong future direction:
- `element_list[].element_id`
- General / Create Element workflows
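
The preferred video-anchored and asset-anchored paths above can be sketched as payload builders. Two hedges apply: whether `refer_type` and `keep_original_sound` sit per-item inside `video_list` or at the top level is not settled by this summary (this sketch assumes per-item), and `model_name`/`prompt` are assumed field names.

```python
def omni_video_anchored(video_url: str, prompt: str) -> dict:
    # Preferred current direction: remote URL only; base64 video
    # in `video_url` is a rejected path.
    return {
        "model_name": "kling-v3-omni",  # assumption: exact field name
        "prompt": prompt,
        "video_list": [{
            "video_url": video_url,
            "refer_type": "base",           # assumption: per-item placement
            "keep_original_sound": "yes",   # assumption: per-item placement
        }],
    }


def omni_element_anchored(element_ids: list[str], prompt: str) -> dict:
    # Doc-derived asset-anchored direction: reusable element ids
    # created through the General / Create Element workflows.
    return {
        "model_name": "kling-v3-omni",  # assumption: exact field name
        "prompt": prompt,
        "element_list": [{"element_id": e} for e in element_ids],
    }
```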

---

# 4. Quality interpretation
- `std` works
- Qingque-derived hints strongly suggest that Professional Mode / 1080p yields higher-quality output
- generated-last-frame chaining is a rejected historical approach for production continuity work and should not remain in the current-production path
- better reference assets and reusable elements likely matter as much as prompt quality

---

# 5. Documentation state
## Strong enough already
- documentation architecture
- policy / guardrails
- main reference layer
- Tier 1 video create/query field-level reference
- Tier 2 element create/query/list/delete top-level field reference
- Tier 3 image-generation create/query field-level reference
- field status register
- approved payload patterns

## Current honest boundary
- top-level Phase 1 field-level documentation closeout has been achieved
- current Omni-first narrow Phase 2 sync closeout has been achieved
- the main remaining open surfaces under the stricter 100%-closeout target are now:
  - deeper nested child-row cleanup
  - capability-map support-range classification
  - audio/voice parity cleanup where preserved evidence is weaker than current live practice
  - `video_list` continuity strength and `video_list + element_list` complementarity runtime verification
  - broader helper retirement / provenance cleanup outside the narrow closeout scope

---

# 6. Expanded Phase 3 meaning
Phase 3 now means an **expanded verification phase**, not just extra Omni poking.

It includes:
1. deeper Omni verification
2. explicit non-Omni verification
3. sharper documentation of what stays provisional vs reusable after those checks

Planned verification families:
- Omni: `pro`, scene-reference continuity, elements, audio/voice
- non-Omni: `text2video`, `image2video`
- legacy/non-current families are tracked separately and should not be read as part of the current-production verification core

Interpretation rule:
- historical PASS is evidence
- current scaffold default policy is a separate question
- Phase 3 exists partly to keep those two truths separate

---

# 7. Approval boundary before Phase 3 live work
Explicit user approval is required before:
- the first new billable create in Phase 3
- any hypothesis-only payload on a billable endpoint
- any 2+ create comparison batch for the same immediate purpose

If approval has not been given yet, documentation/code updates may continue, but billable verification must not start.

---

# 8. Current top priority
Documentation first.
Code sync second.
Then start only the smallest explicitly approved slice of expanded Phase 3.

# 9. Watcher timing
- A minimal non-LLM watcher is desirable, but it is **not** the current top priority.
- Current decision: implement watcher v1 **after Phase 3 closeout**, once the remaining verification boundary is cleaner.
- Why: facts should be watcher-managed later, but the repo should finish the current verification/refinement pass before freezing even a minimal watcher contract.
