Plot'n Polish

Plot'n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models

¹Virginia Tech
²Adobe Research

Abstract

Text-to-image diffusion models have demonstrated significant capabilities in generating detailed and diverse visuals across various domains, with story visualization emerging as a particularly promising application. However, as their use in real-world creative domains increases, the need for enhanced control, refinement, and the ability to modify images consistently after generation becomes an important challenge. Existing methods often lack the flexibility to apply fine or coarse edits while maintaining visual and narrative consistency across multiple frames, which prevents creators from seamlessly crafting and refining their visual stories. To address these challenges, we introduce Plot'n Polish, a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.

Method

An overview of Plot'n Polish. Users can provide the story plots and image prompts for each frame, or an LLM can generate them from the story idea. The image prompts are used to create template images that visualize the story. The editing framework takes editing prompts in the form of text or images, along with the initial images to edit and the extracted depth conditions.
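The data flow in this overview can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Frame`, `plan_frames`, `make_templates`, and `edit_story` are hypothetical, images and depth maps are represented as plain strings, and the LLM, text-to-image model, depth estimator, and editor are passed in as stub callables. The sketch also assumes that depth is extracted from the template images and that one editing prompt is applied to every frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Frame:
    plot: str                     # story plot for this frame
    image_prompt: str             # prompt used to create the template image
    image: Optional[str] = None   # stand-in for the image (a real system holds pixels)
    depth: Optional[str] = None   # stand-in for the extracted depth condition


def plan_frames(story_idea: str,
                plots: Optional[List[str]] = None,
                image_prompts: Optional[List[str]] = None,
                llm: Optional[Callable[[str], List[str]]] = None) -> List[Frame]:
    """Use the user's plots and prompts when given; otherwise ask the LLM stub."""
    if (plots is None or image_prompts is None) and llm is None:
        raise ValueError("provide plots and image_prompts, or an llm callable")
    if plots is None:
        plots = llm(f"Write per-frame plots for: {story_idea}")
    if image_prompts is None:
        image_prompts = llm("Write one image prompt per plot:\n" + "\n".join(plots))
    if len(plots) != len(image_prompts):
        raise ValueError("need exactly one image prompt per plot")
    return [Frame(plot, prompt) for plot, prompt in zip(plots, image_prompts)]


def make_templates(frames: List[Frame],
                   text_to_image: Callable[[str], str],
                   depth_estimator: Callable[[str], str]) -> List[Frame]:
    """Create a template image per frame, then extract its depth (assumed order)."""
    for frame in frames:
        frame.image = text_to_image(frame.image_prompt)
        frame.depth = depth_estimator(frame.image)
    return frames


def edit_story(frames: List[Frame],
               edit_prompt: str,
               editor: Callable[[str, str, str], str]) -> List[Frame]:
    """Apply one editing prompt to every frame (assumption of this sketch).

    The editor stub receives the initial image, its depth condition, and the prompt.
    """
    for frame in frames:
        frame.image = editor(frame.image, frame.depth, edit_prompt)
    return frames


# Hypothetical stand-ins so the sketch runs without any model.
def fake_llm(request: str) -> List[str]:
    return ["frame 1", "frame 2"]


frames = plan_frames("a fox explores a city", llm=fake_llm)
frames = make_templates(frames,
                        text_to_image=lambda p: f"img({p})",
                        depth_estimator=lambda im: f"depth({im})")
frames = edit_story(frames, "give the fox a red scarf",
                    editor=lambda im, d, e: f"edit({im} | {d} | {e})")
print(frames[0].image)
```

Passing the models in as callables keeps the sketch independent of any particular diffusion or depth library; a real pipeline would substitute actual model calls for the lambdas.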

Qualitative Results

Qualitative results for Plot'n Polish. The results show that Plot'n Polish produces consistent visual narratives and supports a wide range of edits, including localized edits, character or object replacements, and personalization.

Qualitative Comparison

Qualitative comparison of our method with state-of-the-art story visualization methods, including StoryDiffusion, ConsiStory, AutoStudio, and Intelligent Grimm. Our method outperforms competitors by maintaining consistent visual elements, such as attire and character features, across all panels, ensuring narrative coherence. In contrast, existing methods struggle with inconsistencies and blending errors, which often break the narrative flow and reduce clarity.