LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing

University of Toronto, Meta Reality Labs – Research, UCSD
ACM IUI 2024


The LAVE system is a video editing tool that offers LLM-powered agent assistance and language-augmented editing features. LAVE represents a step toward human-AI co-creation and suggests vast potential for future work on agent-assisted content editing.

Abstract

Video creation has become increasingly popular, yet the expertise and effort required for editing often pose barriers to beginners. In this paper, we explore the integration of large language models (LLMs) into the video editing workflow to reduce these barriers. Our design vision is embodied in LAVE, a novel system that provides LLM-powered agent assistance and language-augmented editing features. LAVE automatically generates language descriptions for the user's footage, serving as the foundation for enabling the LLM to process videos and assist in editing tasks. When the user provides editing objectives, the agent plans and executes relevant actions to fulfill them. Moreover, LAVE allows users to edit videos through either the agent or direct UI manipulation, providing flexibility and enabling manual refinement of agent actions. Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE's effectiveness. The results also shed light on user perceptions of the proposed LLM-assisted editing paradigm and its impact on users' creativity and sense of co-creation. Based on these findings, we propose design implications to inform the future development of agent-assisted content editing.

Editing Agent


LAVE features a plan-and-execute agent that forms an action plan of editing functions from a user command, such as "make a video for my trip to Paris." Upon receiving the user's editing command, LAVE constructs a planning prompt that includes the planning instruction, the past conversation, and the new command. This prompt is sent to the LLM to produce an action plan that reflects the user's editing goal and outlines actions to help achieve it. Each action is accompanied by a context, which provides additional information relevant to the action, such as a language query for video retrieval. The user reviews and approves the actions one by one; once approved, an action is translated into actual Python function calls and executed. This process continues for all the actions in the plan unless the user provides new instructions to revise or cancel it.
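The listing below is a minimal sketch of this plan-and-execute loop, not LAVE's actual implementation: the llm() helper, the action names, and the planning instruction are placeholders standing in for the system's real prompts and editing functions.

    import json

    def llm(prompt: str) -> str:
        """Stand-in for a real LLM call; returns a canned plan so the sketch runs."""
        return json.dumps([
            {"name": "retrieve_videos", "context": "strolling around the Eiffel Tower"},
            {"name": "storyboard", "context": "from daytime sightseeing to evening"},
        ])

    # Registry mapping action names to executable editing functions (illustrative stubs).
    ACTIONS = {
        "retrieve_videos": lambda ctx: print(f"Searching the gallery for: {ctx}"),
        "storyboard": lambda ctx: print(f"Ordering timeline clips by: {ctx}"),
    }

    PLANNING_INSTRUCTION = (
        "Given the user's editing command and the conversation so far, return a JSON list "
        "of actions. Each action has a 'name' (retrieve_videos or storyboard) and a "
        "'context' string with details such as a retrieval query."
    )

    def plan_and_execute(command: str, history: list[str]) -> None:
        # 1. Build the planning prompt from the instruction, past conversation, and new command.
        prompt = "\n".join([PLANNING_INSTRUCTION, *history, f"User: {command}"])
        # 2. Ask the LLM for an action plan.
        plan = json.loads(llm(prompt))
        # 3. Let the user review each action; approved actions become function calls.
        for action in plan:
            if input(f"Run {action['name']} ({action['context']})? [y/n] ") != "y":
                break  # the user can revise or cancel the rest of the plan
            ACTIONS[action["name"]](action["context"])

    plan_and_execute("make a video for my trip to Paris", history=[])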

LLM-based Video Editing Functions

Footage Overviewing

The agent can generate an overview text that summarizes the videos the user provided in the gallery, categorizing them based on themes or topics. For instance, clips from a road trip to the Grand Canyon might be categorized under themes like "Hiking and Outdoor Adventures" or "Driving on Highways." This feature is particularly helpful when users are not familiar with the footage, such as when editing videos from older collections or dealing with extensive video sets.
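As a rough illustration of how such an overview could be produced, assuming each clip already has an automatically generated language description and using a placeholder llm() helper like the one in the agent sketch above:

    def llm(prompt: str) -> str: return ""  # placeholder: wrap an LLM chat API here

    def overview_footage(descriptions: dict[str, str]) -> str:
        """Summarize the gallery and group clips into themes (illustrative prompt only)."""
        listing = "\n".join(f"{name}: {desc}" for name, desc in descriptions.items())
        prompt = (
            "Here are descriptions of the user's video clips:\n"
            f"{listing}\n"
            "Write a short overview that groups the clips into themes or topics."
        )
        return llm(prompt)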


Idea Brainstorming

The agent can assist in brainstorming video editing ideas based on the gallery videos, suggesting concepts that help spark users' creativity. For example, the agent might suggest using several clips of the user's pet to create a video on the topic, "A Day in the Life of Pets—from Day to Night." Additionally, users can provide the agent with optional creative guidance or constraints to steer its ideation process.
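A hedged sketch of brainstorming with optional creative guidance, again with a placeholder llm() helper rather than LAVE's real prompt:

    def llm(prompt: str) -> str: return ""  # placeholder: wrap an LLM chat API here

    def brainstorm_ideas(descriptions: dict[str, str], guidance: str = "") -> str:
        """Suggest editing ideas from the gallery, optionally steered by user guidance."""
        listing = "\n".join(f"{name}: {desc}" for name, desc in descriptions.items())
        prompt = (
            f"Video clip descriptions:\n{listing}\n"
            "Suggest a few creative video editing ideas based on these clips."
        )
        if guidance:
            prompt += f"\nFollow this creative guidance from the user: {guidance}"
        return llm(prompt)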

Video Retrieval

Searching for relevant footage is a fundamental yet often tedious aspect of video editing. Instead of the user manually searching the gallery, the agent can assist by retrieving videos based on language queries, such as "Strolling around the Eiffel Tower." After completing the retrieval, the agent will present the most relevant videos in the language-augmented video gallery, sorted by relevance.
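One plausible way to implement such language-based retrieval is to embed the query and the clips' language descriptions and rank clips by similarity; the embed() helper below is hypothetical, and this sketch is not LAVE's actual retrieval pipeline.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        return np.ones(3)  # placeholder: a real text-embedding model would go here

    def retrieve_videos(query: str, descriptions: dict[str, str], top_k: int = 5) -> list[str]:
        """Rank gallery clips by cosine similarity between the query and clip descriptions."""
        q = embed(query)
        scores = {}
        for name, desc in descriptions.items():
            d = embed(desc)
            scores[name] = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        return sorted(scores, key=scores.get, reverse=True)[:top_k]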

Storyboarding

The agent can assist users in ordering the clips in the timeline based on a narrative or storyline provided by the users. The narrative can be as concise as "from indoor to outdoor", or more detailed, for example, "starting with city landscapes, transitioning to food and drinks, then moving to the night social gathering." If users do not provide a storyline, the agent will automatically generate one based on the videos already added to the timeline. Once the agent generates a storyboard, the videos in the timeline are re-ordered accordingly, and the agent provides a scene-by-scene description of the storyboard in the chatroom.
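A minimal sketch of how storyboarding could be prompted, assuming per-clip descriptions and a placeholder llm() helper; the reply parsing is intentionally naive and not LAVE's actual logic.

    def llm(prompt: str) -> str: return ""  # placeholder: wrap an LLM chat API here

    def storyboard(timeline: list[str], descriptions: dict[str, str], narrative: str = "") -> list[str]:
        """Ask the LLM to order the timeline clips to fit a storyline (illustrative)."""
        listing = "\n".join(f"{name}: {descriptions[name]}" for name in timeline)
        goal = (f"Order them to follow this storyline: {narrative}"
                if narrative else "Propose a storyline and order the clips to follow it.")
        prompt = (
            f"Clips currently in the timeline:\n{listing}\n{goal}\n"
            "Return the clip names, one per line, in the new order, "
            "followed by a scene-by-scene description."
        )
        reply = llm(prompt)
        # Keep only lines naming a known clip; the rest is the scene-by-scene description.
        return [line.strip() for line in reply.splitlines() if line.strip() in descriptions]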

Clip Trimming

LAVE allows users to input trimming commands to extract video segments based on their specifications. These commands can be free-form. For instance, they might refer to the video's semantic content, such as "keep only the segment focusing on the baseball game", or specify precise trimming details like "give me the last 5 seconds." Commands can also combine both elements, like "get 3 seconds where the dog sits on the chair". For transparency, the LLM also explains its rationale for the trims, detailing how they align with the user's instructions.
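A hedged sketch of how a free-form trimming command could be mapped to start and end times, assuming the system has some per-segment description of the clip (the segments format here is invented for illustration) and using a placeholder llm() helper:

    def llm(prompt: str) -> str: return ""  # placeholder: wrap an LLM chat API here

    def trim_clip(command: str, segments: list[tuple[float, float, str]]) -> str:
        """Choose start/end times for a trim from a free-form command (illustrative only)."""
        # segments holds (start_sec, end_sec, description) windows of the clip.
        listing = "\n".join(f"{s:.1f}-{e:.1f}s: {d}" for s, e, d in segments)
        prompt = (
            f"Clip segments:\n{listing}\n"
            f"Trimming command: {command}\n"
            "Reply with the start and end times (in seconds) to keep, "
            "and briefly explain how they match the command."
        )
        return llm(prompt)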

Future Directions

LAVE points to an exciting research area and opens up more questions than it answers. The field of LLM-based content editing and agent assistance is just starting to showcase its immense potential, promising many interesting opportunities. Below, we highlight several future directions in addition to those outlined in the paper.


Towards Delightful Mixed-Initiative Video Editing

Mixed-Initiative Interaction, a concept made prominent in the 90s and 00s by HCI+AI pioneers like Eric Horvitz and Marti Hearst, advocates for systems that enable both humans and AI to collaborate efficiently. With the advent of LLM-based agents, revisiting and expanding upon these ideas is timely. We build LAVE on the foundation of mixed-initiative interaction principles, yet it still has potential for enhancement. For example, while LAVE's agent contributes to the editing process, it only responds to user prompts and lacks the capability to autonomously initiate edits or monitor editing behaviors for proactive support. Furthermore, the agent can be extended to consider users' GUI manipulation history to provide personalized assistance. However, caution is advised as agents that proactively engage without solicitation can be perceived as intrusive. Thus, future studies are required to balance proactive engagement with user preferences, ensuring a delightful human-AI co-creation experience without overwhelming users.

Transitioning to Multi-Agent Architecture for In-Context Interaction within Functions

LAVE currently supports a single plan-and-execute agent equipped with functions and tools such as brainstorming and storyboarding, providing a unified interface for language interactions. However, this agent design limits its ability to facilitate in-depth, interactive discussions for specific actions. For example, if users are dissatisfied with ideas generated from the "brainstorming" function, in our current design, it is not straightforward to negotiate with the planning agent to refine the generated ideas iteratively. Instead, their feedback may lead the agent to plan for a new brainstorming execution rather than refining the ideas based on the current context. To overcome these limitations, evolving LAVE into a multi-agent system would be beneficial. This system could include a dedicated planning agent focused solely on planning and specialized agents for each specific function like brainstorming or storyboarding. Users could then engage with particular agents for focused discussions within the context of each function.
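One way such a multi-agent design could be wired up, sketched with hypothetical agent classes and a placeholder llm() helper rather than any particular framework:

    def llm(prompt: str) -> str: return ""  # placeholder: wrap an LLM chat API here

    class FunctionAgent:
        """A specialized agent that keeps its own conversation for one editing function."""
        def __init__(self, instruction: str):
            self.history = [instruction]

        def chat(self, message: str) -> str:
            self.history.append(f"User: {message}")
            reply = llm("\n".join(self.history))
            self.history.append(f"Agent: {reply}")
            return reply  # follow-ups refine ideas within this same context

    # One specialist per editing function, plus a routing step played by the planner (illustrative).
    specialists = {
        "brainstorm": FunctionAgent("You brainstorm video editing ideas and refine them on request."),
        "storyboard": FunctionAgent("You order timeline clips to fit a storyline and revise it on request."),
    }

    def route(message: str) -> str:
        choice = llm(f"Answer 'brainstorm' or 'storyboard': which function does this request need? {message}")
        agent = specialists.get(choice.strip(), specialists["brainstorm"])
        return agent.chat(message)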

Integrating Video Editing Knowledge to LLM for Grounded Assistance

LAVE currently utilizes the standard GPT-4 model to facilitate a range of video editing tasks. This method is effective because the pre-trained LLM already possesses a decent understanding of storytelling and video editing techniques, as well as exceptional skills in information extraction and summarization. However, there is room for improvement in integrating domain-specific knowledge. This could be achieved through fine-tuning or few-shot prompting to align suggestions more closely with desired editing styles. For instance, brainstorming functions can be designed to generate video ideas that resemble the styles of certain creators.
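For example, a few-shot prompt could bias brainstorming toward a particular editing style; the exemplars below are made up purely for illustration, with the same placeholder llm() helper as in the sketches above.

    def llm(prompt: str) -> str: return ""  # placeholder: wrap an LLM chat API here

    # Made-up style exemplars pairing footage with the kind of idea a target style would produce.
    STYLE_EXAMPLES = [
        ("clips of street food stalls at night",
         "a fast-cut 'night market crawl' edit with punchy on-screen captions"),
        ("clips of a quiet morning hike",
         "a slow, ambient edit that lingers on wide landscape shots"),
    ]

    def brainstorm_in_style(descriptions: dict[str, str]) -> str:
        """Few-shot prompting: prepend style exemplars before the user's footage."""
        shots = "\n".join(f"Footage: {f}\nIdea: {i}" for f, i in STYLE_EXAMPLES)
        listing = "; ".join(f"{name}: {desc}" for name, desc in descriptions.items())
        return llm(f"{shots}\nFootage: {listing}\nIdea:")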

Weaving AI Video Generation into the LAVE Editing Workflow

LAVE is fundamentally a video editing tool designed for users who already have a collection of videos ready for editing. However, the video creation workflow can vary widely. Sometimes, users may not start with an existing collection of videos, or they might find their current videos insufficient for their desired project. This is where recent advances in video generation models, such as Sora and Pika, become valuable. These models can generate video footage that can serve as B-roll to augment a user's collection or even provide entirely generated footage for editing. While LAVE focuses on editing and these models on generation, they naturally complement each other. Future research could explore the interplay between LAVE and such models to enhance the video creation workflow.

BibTeX

@article{wang2024lave,
  title={LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing},
  author={Wang, Bryan and Li, Yuliang and Lv, Zhaoyang and Xia, Haijun and Xu, Yan and Sodhi, Raj},
  journal={arXiv preprint arXiv:2402.10294},
  year={2024}
}