Imagine This! Scripts to Compositions to Videos
Imagining a scene described in natural language, with a realistic layout and appearance of entities, is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval and Fusion Network (Craft), a model capable of learning this knowledge from video-caption data and applying it when generating videos from novel captions. Craft explicitly predicts a temporal layout of the mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database, and fuses them to generate scene videos. Our contributions include sequential training of Craft's components while jointly modeling layout and appearance, and losses that encourage learning compositional representations for retrieval. Craft outperforms direct pixel generation approaches and generalizes well to unseen captions and to unseen video databases with no text annotations. We demonstrate Craft on Flintstones, a new richly annotated video-caption dataset with over 25,000 videos.
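The three stages named above (layout prediction, retrieval, fusion) can be illustrated with a toy sketch. This is not the Craft model: the layout is supplied by hand instead of being predicted by a network, retrieval is plain cosine similarity over made-up two-dimensional embeddings, fusion is a simple paste onto a single frame, and the temporal dimension is omitted. All names (`Segment`, `retrieve`, `fuse`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple


@dataclass
class Segment:
    """Toy stand-in for an entity segment stored in a video database."""
    name: str
    embedding: List[float]   # made-up appearance embedding
    patch: List[List[int]]   # tiny 2D "appearance" (0 = transparent)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: List[float], database: List[Segment]) -> Segment:
    """Return the database segment whose embedding best matches the query."""
    return max(database, key=lambda seg: cosine(query, seg.embedding))


def fuse(size: Tuple[int, int],
         placements: List[Tuple[Tuple[int, int], Segment]]) -> List[List[int]]:
    """Paste each retrieved patch onto a blank frame at its (top, left) position."""
    height, width = size
    frame = [[0] * width for _ in range(height)]
    for (top, left), seg in placements:
        for dy, row in enumerate(seg.patch):
            for dx, value in enumerate(row):
                y, x = top + dy, left + dx
                if value and 0 <= y < height and 0 <= x < width:
                    frame[y][x] = value
    return frame


# Hypothetical database with two entity segments.
database = [
    Segment("fred", [1.0, 0.0], [[1, 1], [1, 1]]),
    Segment("table", [0.0, 1.0], [[2, 2, 2]]),
]

# Hand-written "predicted" layout: a position and a query embedding per entity.
layout = [((0, 0), [0.9, 0.1]), ((2, 1), [0.2, 0.8])]

placements = [(pos, retrieve(query, database)) for pos, query in layout]
frame = fuse((3, 4), placements)
```

In this sketch each entity is placed independently; the abstract's point about jointly modeling layout and appearance is exactly what such a simplification leaves out.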