How I Made an AI Music Video, Part 2: Video Generation
In this second part of my two-part series, I dive deep into the technical workflow I used to create the animated visuals for "Through the Mystic Green." The biggest challenge with current video generation models is achieving consistency across shots, since each clip is treated as a self-contained video with no knowledge of what came before or after. I initially tried using ChatGPT to create character and style sheets, but the results varied too much from image to image for professional use. Instead, I developed a workflow using Photoshop to composite consistent character poses into different scenes, then used those composites as starting frames for video generation. I ultimately chose Kling 2.1 over other models because its 10-second generation length left enough usable footage even after I cut the initial seconds needed for consistency.
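To give a sense of the trimming step, here is a minimal sketch, not my actual script, that batch-trims the opening seconds from each generated clip with ffmpeg before editing. The folder names, the two-second cut point, and the re-encode settings are assumptions for illustration; adjust them to whatever your clips actually need.

```python
import subprocess
from pathlib import Path

RAW_DIR = Path("kling_raw")          # hypothetical folder of 10-second Kling clips
TRIMMED_DIR = Path("kling_trimmed")  # hypothetical output folder
TRIM_SECONDS = 2                     # assumed number of initial seconds to discard

TRIMMED_DIR.mkdir(exist_ok=True)

for clip in sorted(RAW_DIR.glob("*.mp4")):
    out_path = TRIMMED_DIR / clip.name
    # Re-encode from TRIM_SECONDS onward so the cut lands on an exact frame
    # rather than snapping to the nearest keyframe.
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(TRIM_SECONDS),
            "-i", str(clip),
            "-c:v", "libx264",
            "-c:a", "aac",
            str(out_path),
        ],
        check=True,
    )
```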
What made this project particularly interesting was embracing what I call "happy accidents," a Bob Ross-inspired approach in which I let the AI guide the creative process rather than forcing predetermined outcomes. Current video models don't follow prompts perfectly, so instead of fighting their limitations, I worked with them collaboratively. The AI would sometimes produce unexpected but compelling results that actually enhanced the story: when I simply prompted "the girl is becoming confident," Kling responded by letting her hair flow open, a beautiful visual metaphor. This approach took about 400 iterations on the cost-effective Kling 2.1 fast model, and while it meant abandoning a rigid story structure, it resulted in a music video that effectively demonstrates both the capabilities and the current limitations of AI video generation.