# Kling / Omni conversational prompting guide

Status: active working guide
Captured: 2026-04-01
Basis: field-level docs + current live tests

## Purpose
This is the generalized operating guide for multi-person conversational audiovisual generation in Kling, especially on Omni.

It is not a narrow recipe sheet, and it is not a pile of one-off prompt templates.
Its job is to explain:
- what problem type you are solving,
- which control levers matter,
- how to choose between image grounding, reusable elements, and other references,
- how to diagnose failures,
- and which behaviors are already live-confirmed versus still uncertain.

---

# 1. Scope

This guide is for cases where at least one of the following is true:
- spoken dialogue matters
- multiple on-screen people matter
- recurring identity matters
- scene transformation from portrait-like assets matters
- audio + visual coherence matters
- lip-readability or turn-taking readability matters

This guide is not primarily about:
- abstract text-only generation strategy
- single-person silent beauty shots
- scene-to-scene chaining via `video_list`
- one-off endpoint enumeration

---

# 2. Core operating view

## 2.1 The main question is not “what template should I paste?”
The main question is:

> what kind of control problem am I solving?

Examples:
- do I need 2 distinct people to remain stable in one scene?
- do I need recurring characters across multiple scenes?
- do I need a dialogue shot or just an establishing shot?
- do I need transformation away from portrait context?
- do I need readable speech, or only the existence of audio?

Prompt design should follow the control problem, not the other way around.

## 2.2 The most important shift
In current live testing, output quality was often driven less by superficial prompt wording and more by a small set of structural control variables:
- number of grounded people
- grounding type
- framing scale
- first-frame anchoring strength
- wardrobe carryover control
- dialogue density versus clip duration
- explicit speaker markup

These are the real levers.

---

# 3. Problem-type classification

Before writing or revising a prompt, classify the task.

## 3.1 Identity grounding problem
You need the model to keep specific people or products recognizable.

Typical signs:
- “this exact woman should appear again”
- “both people must remain distinct”
- “recurring cast / recurring product”

Primary levers:
- `image_list`
- `element_list`
- subject count matching reference count

## 3.2 Scene transformation problem
You have portrait-like source assets but want a different target scene.

Typical signs:
- portrait -> poolside
- headshot -> office conversation
- studio portrait -> street scene

Primary levers:
- avoid over-anchoring with `first_frame` unless strictly needed
- explicitly request target-scene wardrobe adaptation
- explicitly reject unwanted portrait-clothing carryover when relevant

## 3.3 Dialogue readability problem
You do not merely need audio to exist. You need the scene to read as a conversation.

Typical signs:
- lip-readability matters
- turn-taking matters
- facial reactions matter
- viewer should know who is speaking

Primary levers:
- shot scale
- upper-body / medium-close framing
- visible faces and mouths
- explicit speaker turns

## 3.4 Recurring asset reuse problem
You need the same character identity to be reused across runs or scenes.

Primary levers:
- create reusable elements first
- use `element_list` instead of rebuilding identity from scratch every run

## 3.5 Audio-language problem
You need a specific spoken language and dialogue behavior.

Primary levers:
- `sound: "on"`
- explicit language naming in the prompt
- explicit spoken lines
- duration-aware speech density

---

# 4. Control levers that matter most

## 4.1 Subject count should usually match grounding count
For important on-screen people, do not rely on prompt inference when stable multi-person identity matters.

Operational rule:
- 1 important person -> 1 strong grounded asset
- 2 important people -> usually 2 grounded assets
- if only one person is grounded and the other is left to prompt inference, multi-person stability becomes less reliable

Current reading:
- Omni is the best current route for multi-person grounding among the tested paths

## 4.2 Choose the right grounding type

### `image_list`
Use when:
- you have strong per-run references
- recurring reuse is not the main problem
- you want direct image-based anchoring

### `element_list`
Use when:
- you need recurring identities
- you want reusable asset IDs
- you plan to reuse the same people across runs/scenes

Current live evidence:
- `element_list` generation works
- `element_list + sound="on"` works
- Korean dialogue with `element_list` also worked in current testing

### Additional practical note for non-human recurring characters
For recurring **3D characters / stylized animated characters / cartoon-like characters** whose costume is part of the character identity, the practical tradeoff can be more favorable than for realistic humans.

Why:
- character design tends to act as a stronger identity package
- costume/color blocks are more likely to be interpreted as part of the signature character look
- soft continuity across independently generated clips may still feel acceptable to viewers even without perfect scene-to-scene continuity

Operational reading:
- structure A (independent clip generation with `element_list + sound="on"`, without relying on `video_list`) may be substantially more viable for stylized recurring characters than for realistic-human drama continuity
- do **not** assume perfect automatic outfit persistence
- but if the character's signature outfit should remain unchanged, a light reminder such as `keep the same character design`, `preserve the same outfit`, or `maintain the same signature costume/colors` is often the safer posture than saying nothing at all

### `video_list`
Use when:
- you need scene-to-scene continuity from a prior clip
- you are solving continuity, not reusable identity

Do not confuse `video_list` with `element_list`.
They solve different problems.
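The distinction between the three grounding fields can be summarized as a selector. The field names are taken from this guide; the function itself is an illustrative sketch, not an API.

```python
def choose_grounding(recurring_identity: bool, scene_continuity: bool) -> str:
    """Map the control problem to the grounding field described above.
    - video_list:   scene-to-scene continuity from a prior clip
    - element_list: reusable identities across runs/scenes
    - image_list:   strong per-run references, no reuse needed
    """
    if scene_continuity:
        return "video_list"
    if recurring_identity:
        return "element_list"
    return "image_list"
```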

## 4.3 Treat shot scale as a first-class control
This is one of the most important findings from current testing.

If dialogue readability matters, do not leave framing implicit.

Use explicit language such as:
- medium shot
- medium close shot
- upper-body framing
- close enough to clearly see both faces and both mouths
- not a distant wide establishing shot

Why this matters:
- a scene can have valid generated audio and still feel like “bad sync” simply because mouths are too small to read
- distant framing can create a false diagnosis of audio/visual failure

## 4.4 Use `first_frame` only when you really want strong opening-frame preservation
If the real goal is scene transformation away from the source portrait, `first_frame` can over-anchor the opening frames.

Observed risk:
- the result may look like a lightly animated source portrait rather than a real transformed scene

Operational rule:
- if transformation is the goal, default away from `first_frame`
- use it only when preserving the literal opening visual state is actually desired

## 4.5 Control wardrobe carryover explicitly
Portrait assets often carry strong clothing priors.
If the target scene needs different clothing, say so directly.

Useful instruction types:
- adapt wardrobe to the target setting
- suitable for the target activity
- do not preserve portrait clothing

Without this, the model may over-preserve source clothing context.

## 4.6 Script the conversation explicitly
Do not rely on vague “they talk” wording when dialogue structure matters.

Use explicit speaker turns, for example:
- `<<<voice_1>>> ...`
- `<<<voice_2>>> ...`

Why this helps:
- clearer turn structure
- easier language control
- easier debugging of who should be speaking when
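A minimal sketch of assembling scripted turns with this markup. The `<<<voice_N>>>` tag format comes from this guide; the helper itself is illustrative.

```python
def build_dialogue(turns: list) -> str:
    """Render explicit speaker turns using <<<voice_N>>> markup.
    `turns` is a list of (speaker_index, line) pairs."""
    return "\n".join(f"<<<voice_{i}>>> {line}" for i, line in turns)
```

For example, `build_dialogue([(1, "Did you see the schedule?"), (2, "Yes, it moved to Friday.")])` yields two tagged lines with an unambiguous turn order, which is easier to debug than narrated dialogue.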

### Important practical caution from current live testing
Explicit speaker tags are useful.
Over-explaining lip-sync behavior is not always useful.

Current live testing suggests a meaningful distinction between:
- **good structural dialogue guidance**, and
- **over-explicit meta-instructions about mouth motion**

Good structural dialogue guidance includes:
- explicit speaker turns via `<<<voice_1>>>`, `<<<voice_2>>>`
- natural conversation timing
- subtle facial expressions
- gentle head turns
- dialogue-capable framing with visible faces/mouths

Potentially harmful over-explicit meta-instructions include patterns like:
- `both women should visibly move their lips while speaking`
- `this is a speaking scene, not a silent beauty shot`
- `visible mouth articulation during each spoken line`
- wrapping every line in narration such as `Woman 1 says ... Woman 2 replies ...`

Why this matters:
- these stronger meta-instructions may seem like they should improve lip-sync
- in current testing, they could instead make the scene feel more rigid, over-constrained, or less naturally conversational
- the better results came from restoring natural dialogue-scene priors, not from aggressively describing mouth motion itself

Operational rule:
- keep the speaker structure explicit
- keep the framing readable
- keep the acting instructions natural
- do **not** over-specify lip movement unless future evidence proves a narrower prompt pattern that reliably helps

## 4.7 Speech density must fit duration
This is not a 5-second trick. It is a general rule.

Operational rule:
- shorter clips require lower dialogue density
- longer clips can support more turns and longer lines

If lines are too dense for duration, common symptoms are:
- truncated final phrases
- compressed delivery
- reduced readability
- weaker facial timing coherence
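A rough density check can catch this before generation. The speaking rate and per-turn overhead below are assumed heuristics for planning purposes, not documented Kling constants.

```python
def dialogue_fits(turns: list, clip_seconds: float,
                  words_per_second: float = 2.5,
                  overhead_per_turn: float = 0.5) -> bool:
    """Rough check that dialogue density fits the clip duration.
    words_per_second and overhead_per_turn (pauses, reactions,
    turn-taking beats) are assumed heuristics, not model constants."""
    words = sum(len(t.split()) for t in turns)
    needed = words / words_per_second + overhead_per_turn * len(turns)
    return needed <= clip_seconds
```

If this returns False, shorten the lines, drop a turn, or extend the duration, per the rule above.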

---

# 5. Recommended operating patterns

## 5.1 Multi-person audiovisual scene, non-recurring identities
Prefer:
- Omni
- `image_list` with one strong asset per important person
- `sound: "on"` when dialogue matters
- explicit speaker markup
- explicit shot scale when the scene is conversational
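As a concrete sketch of this pattern: the `image_list` and `sound` fields and the `<<<voice_N>>>` markup come from this guide, while the surrounding request shape (`model`, `prompt` keys) and the asset URLs are illustrative assumptions, not a documented client API.

```python
# Hypothetical request-body sketch for a non-recurring two-person dialogue scene.
request = {
    "model": "omni",          # assumed key/value; adjust to your client
    "image_list": [           # one strong asset per important person
        {"image": "https://example.com/person_a.png"},
        {"image": "https://example.com/person_b.png"},
    ],
    "sound": "on",            # dialogue matters, so audio is on
    "prompt": (
        "Medium close shot, upper-body framing, both faces and mouths "
        "clearly visible. Two colleagues talk at a cafe table.\n"
        "<<<voice_1>>> Did the client approve the draft?\n"
        "<<<voice_2>>> Yes, they signed off this morning."
    ),
}
```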

## 5.2 Multi-person audiovisual scene, recurring identities
Prefer:
- Omni
- `element_list`
- `sound: "on"`
- explicit speaker markup
- explicit dialogue-capable framing
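The recurring-cast variant swaps image grounding for reusable element IDs. Again, `element_list` and `sound` come from this guide; the element IDs and the rest of the request shape are hypothetical.

```python
# Hypothetical request-body sketch for a recurring two-person cast.
request = {
    "model": "omni",                      # assumed key/value
    "element_list": [                     # reusable identities created earlier
        {"element_id": "el_person_a"},    # hypothetical element IDs
        {"element_id": "el_person_b"},
    ],
    "sound": "on",
    "prompt": (
        "Medium shot, dialogue-capable framing, both faces and mouths "
        "visible. The two recurring characters catch up in a hallway.\n"
        "<<<voice_1>>> You're back early.\n"
        "<<<voice_2>>> The meeting wrapped up fast."
    ),
}
```

The point of the element route is that the same IDs can be reused across later runs instead of rebuilding identity from scratch each time.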

## 5.3 Portrait-to-scene transformation
Prefer:
- Omni
- multiple grounded assets if multiple important people appear
- avoid `first_frame` by default
- explicit wardrobe adaptation
- explicit scene objective

## 5.4 Wide beauty shot vs dialogue shot
Treat these as separate objectives.

### Beauty / establishing shot
Optimize for:
- environmental composition
- atmosphere
- spatial context

### Dialogue / coverage shot
Optimize for:
- readable faces
- readable mouths
- who-is-speaking clarity
- upper-body or closer framing

Do not assume one prompt will optimize both equally.

---

# 6. Symptom -> probable cause -> next fix

## 6.1 Second person is weak, merged, or drifts
Probable cause:
- insufficient grounding for the number of important people

Next fix:
- add a second strong grounded asset
- prefer Omni for the multi-person scene
- avoid leaving one speaker mostly to prompt inference

## 6.2 Opening frames look too much like the portrait reference
Probable cause:
- `first_frame` over-anchoring

Next fix:
- remove `first_frame`
- describe transformed target scene more explicitly
- add wardrobe adaptation wording

## 6.3 Clothing stays too close to source portrait clothing
Probable cause:
- weak wardrobe transformation instruction

Next fix:
- explicitly request target-scene wardrobe
- explicitly reject portrait clothing carryover

## 6.4 Audio exists but sync feels weak
Probable cause:
- people are too small in frame to read the mouths
- or genuine sync weakness, but shot scale must be ruled out first

Next fix:
- move to medium / medium-close / upper-body framing
- explicitly ask for both faces and both mouths to be visible
- only after that should you judge deeper sync quality

Additional caution:
- if framing is already good, do not immediately escalate into heavy-handed mouth-motion meta-prompting
- first try restoring a more natural conversational prompt structure
- current live evidence suggests that over-explaining lip movement can underperform simpler conversational-scene wording

## 6.5 Final dialogue is cut off or compressed
Probable cause:
- dialogue density too high for clip duration

Next fix:
- shorten lines
- reduce number of turns
- or increase duration if the scene objective allows it

## 6.6 Scene looks beautiful but not conversational
Probable cause:
- model drifted into beauty-shot logic rather than dialogue-shot logic

Next fix:
- rewrite prompt to prioritize coverage language
- explicitly request facial readability and conversational timing
- remove excessive environmental emphasis if it is taking over the shot

---

# 7. What is live-confirmed now

These points are supported by direct current testing in this workspace context.

## Confirmed
- Omni is currently the strongest tested path for 2-person scene grounding
- two-person transformed-scene generation can work with two image references
- portrait-like references plus `first_frame` can over-anchor the opening frames
- explicit wardrobe adaptation wording materially matters
- `element_list` generation works on Omni
- `element_list + sound="on"` works on Omni
- Korean dialogue generation with `element_list` worked in current testing
- wide framing can make sync quality unreadable even when audio generation itself succeeded
- medium / upper-body dialogue framing improves conversational readability

## Not fully closed yet
- longer conversational clips with denser multi-turn dialogue
- stronger guarantees around per-speaker voice consistency across longer runs
- 3+ important people in one dialogue scene
- exact upper bound of controllable multilingual dialogue complexity in a single clip
- stronger parity judgments between image-grounding and element-grounding across many scene types

---

# 8. Decision flow

## Step 1. How many important people must remain stable?
- if 1 -> one strong grounded asset may be enough
- if 2 -> usually use two grounded assets
- if recurring cast -> favor `element_list`

## Step 2. Is the scene conversational or merely scenic?
- if conversational -> explicitly control shot scale
- if scenic -> wide framing may be acceptable

## Step 3. Are you transforming away from portrait context?
- if yes -> avoid `first_frame` by default
- explicitly control wardrobe carryover

## Step 4. Does spoken language matter?
- if yes -> `sound: "on"`
- explicitly name the language
- explicitly write the lines

## Step 5. Is dialogue density realistic for the chosen duration?
- if no -> shorten lines or increase duration

## Step 6. If the result feels wrong, what failed first?
- identity?
- framing?
- wardrobe transformation?
- dialogue density?
- actual audio/sync?

Diagnose in that order rather than guessing randomly.
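The diagnosis order above can be encoded directly. The layer names and their ordering come from this guide; the helper is an illustrative sketch.

```python
# Recommended diagnosis order from the decision flow above.
DIAGNOSIS_ORDER = [
    "identity",                 # grounding count / grounding type
    "framing",                  # shot scale, face/mouth readability
    "wardrobe transformation",  # portrait-clothing carryover control
    "dialogue density",         # lines versus duration
    "audio/sync",               # judged only after the layers above
]

def first_failure(observed: set) -> str:
    """Return the earliest failing layer in the recommended order,
    or None if nothing failed."""
    for layer in DIAGNOSIS_ORDER:
        if layer in observed:
            return layer
    return None
```

For example, if both framing and sync look wrong, fix framing first: apparent sync failure is often a framing symptom.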

---

# 9. Minimal doctrine summary

If you only remember a few things, remember these:

1. For multi-person dialogue, ground each important person.
2. Use Omni as the default route for complex multi-person audiovisual scenes.
3. For recurring characters, reusable elements are meaningful and live-usable.
4. Treat framing as a control variable, not decoration.
5. Avoid `first_frame` when transforming portraits into a new scene.
6. Explicitly control wardrobe adaptation.
7. Script speaker turns directly.
8. Fit dialogue density to duration.
9. If sync seems weak, check shot scale before blaming the whole generation.

---

# 10. Relationship to templates

Templates may still be useful later, but only as examples derived from this guide.

The guide is the primary asset.
Templates are secondary artifacts.

That distinction matters, because the model behavior changes across:
- scene type
- number of people
- language
- duration
- grounding type
- framing objective

The reusable knowledge is the control logic, not the literal prompt text.
