Project Omni: A Vision for Multimodal Data Exploration

Data today is inherently multimodal: continuous video, drone imagery, LiDAR sweeps, satellite rasters, sensor time‑series, and traditional relational tables - all mashed together in the same decision loop. Insight hides at the seams between these modalities, yet our tooling still forces us to peel them apart. Omni tackles that mismatch head‑on, giving analysts one coherent canvas for truly multimodal data exploration.

Project Omni reimagines how humans ask questions of rich visual data and consume the answers. Our end goal is a fluid, sub-second loop from a user’s first keystroke to an insight delivered in the same medium: video for video, spatial maps for spatial data, and so on. To get there we attack three tightly coupled problems:


Guiding Multimodal Query Specification

How do we let users formulate rich, expressive queries across video and other complex data modalities?

Guided Querying over Videos using Autocompletion Suggestions – HILDA ’24

Introduces an LLM + VLM pipeline that turns raw video collections into query suggestions ranked for relevance and diversity. As the user types, the system streams a suggestion list that updates with every keystroke, cutting typing effort by ≈35 %. [pdf] [page]
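
As a rough illustration of the interaction pattern (not the paper’s actual pipeline), the sketch below re-ranks a precomputed candidate pool on every keystroke, trading relevance against diversity MMR-style. The Candidate fields, the 0.7/0.3 weighting, and the toy catalog are assumptions for exposition.

```python
# Illustrative sketch only: re-rank a precomputed candidate pool on every
# keystroke, balancing relevance against diversity (MMR-style). The fields,
# weights, and catalog below are assumptions, not the HILDA '24 pipeline.
from dataclasses import dataclass
from math import sqrt


@dataclass
class Candidate:
    text: str               # suggestion surfaced to the user
    relevance: float        # relevance to the video collection, precomputed offline
    embedding: list[float]  # semantic embedding of the suggestion


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def suggest(prefix, candidates, k=5, lam=0.7):
    """Return up to k prefix-matching suggestions, weighting relevance by lam
    and penalizing redundancy against suggestions already picked."""
    pool = [c for c in candidates if c.text.startswith(prefix)]
    picked = []
    while pool and len(picked) < k:
        def score(c):
            redundancy = max((cosine(c.embedding, p.embedding) for p in picked), default=0.0)
            return lam * c.relevance - (1 - lam) * redundancy
        best = max(pool, key=score)
        picked.append(best)
        pool.remove(best)
    return [c.text for c in picked]


# Re-rank on every keystroke as the prefix grows.
catalog = [
    Candidate("zebras drinking at waterhole", 0.9, [0.9, 0.1]),
    Candidate("zebras grazing at dawn", 0.8, [0.85, 0.2]),
    Candidate("zoo enclosure overhead view", 0.6, [0.1, 0.9]),
]
for prefix in ["z", "ze", "zeb"]:
    print(prefix, "->", suggest(prefix, catalog))
```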

Emojis in Autocompletion: Enhancing Video Search with Visual Cues – HILDA ’25

Adds a second modality: each suggestion carries a representative emoji selected by importance‑weighted semantic alignment. A user study shows a 14 % drop in query‑completion time and lower cognitive load, validating that tiny visual cues matter at scale. [pdf] [demo]
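
A minimal sketch of the emoji-selection idea, assuming per-token importance weights and precomputed embeddings for both tokens and emojis; the pick_emoji helper, the toy vectors, and the max-alignment rule are illustrative, not the paper’s implementation.

```python
# Illustrative sketch only: choose the emoji whose embedding best aligns with
# the importance-weighted sum of a suggestion's token embeddings. Weights,
# vectors, and emojis below are toy assumptions, not the HILDA '25 method.
from math import sqrt


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def pick_emoji(tokens, emoji_embeddings):
    """tokens: list of (embedding, importance) pairs for one suggestion.
    Returns the emoji best aligned with the importance-weighted token sum."""
    dim = len(tokens[0][0])
    query = [0.0] * dim
    for emb, weight in tokens:
        for i, value in enumerate(emb):
            query[i] += weight * value
    return max(emoji_embeddings, key=lambda e: cosine(query, emoji_embeddings[e]))


# Toy example for "zebras drinking at waterhole": content words dominate.
tokens = [([0.9, 0.1], 0.7),   # "zebras"
          ([0.4, 0.8], 0.2),   # "waterhole"
          ([0.1, 0.1], 0.1)]   # "drinking at"
emojis = {"🦓": [0.95, 0.05], "💧": [0.2, 0.9], "🎥": [0.5, 0.5]}
print(pick_emoji(tokens, emojis))  # -> 🦓
```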


Delivering Rich, Multimodal Results

How can we return full‑fidelity multimodal answers fast enough to watch in real time, rather than force users to scroll lists or wait for downloads?

Accelerating Video Segment Access via Quality‑Aware Multi‑Source Selection – MMSys ’25

Interactive tasks often need just a slice of a video that is available from multiple sources in several adaptive‑bit‑rate encodings. We formalize the segment‑access problem as a cost/quality trade‑off and build a selector that is 3–6× faster than picking a single encoding. The result: bounded‑latency segment retrieval that keeps up with user thought. [pdf]
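
The selection logic can be pictured roughly as follows; the Option record, the latency-based cost, and the quality-floor policy are simplifying assumptions, not the system’s actual model.

```python
# Illustrative sketch only: for each requested segment, pick the cheapest
# source/encoding that clears a quality floor. The Option fields and the
# fallback rule are assumptions, not the MMSys '25 selector's interface.
from dataclasses import dataclass


@dataclass
class Option:
    source: str            # which mirror / encoding the bytes come from
    quality: float         # e.g. a VMAF- or bitrate-derived quality score
    est_latency_ms: float  # predicted time to fetch this segment


def select_sources(segments, min_quality):
    """segments: dict mapping segment id -> list of Options.
    Returns one Option per segment: the cheapest that clears the quality
    floor, falling back to the highest-quality option if none does."""
    plan = {}
    for seg_id, options in segments.items():
        ok = [o for o in options if o.quality >= min_quality]
        if ok:
            plan[seg_id] = min(ok, key=lambda o: o.est_latency_ms)
        else:
            plan[seg_id] = max(options, key=lambda o: o.quality)
    return plan


segments = {
    "clip42/seg003": [Option("cdn-a/1080p", 0.95, 180),
                      Option("cdn-b/720p", 0.88, 60),
                      Option("edge/480p", 0.70, 25)],
}
print(select_sources(segments, min_quality=0.85))
# cdn-b/720p: the cheapest source that still clears the 0.85 quality floor
```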

V2V: Efficiently Synthesizing Video Results for Video Queries – ICDE ’24

Instead of giving the user a paginated list of clips, V2V assembles the answer set into a single edited video—respecting temporal order, overlap constraints, and bitrate budgets. A cost‑based optimizer rewrites declarative specs into operator pipelines, yielding 3× speed‑ups and watch‑ready answers within seconds. [pdf] [page]
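
The assembly step can be sketched as interval merging plus temporal ordering; the assemble_edit_list helper and the clip tuples below are illustrative assumptions, and the real system plans this through its cost-based optimizer rather than a fixed procedure.

```python
# Illustrative sketch only: merge overlapping answer intervals from the same
# source video and emit a temporally ordered edit list for a renderer.
# Names and the tuple layout are assumptions, not the ICDE '24 system.
def assemble_edit_list(clips):
    """clips: list of (video_id, start_s, end_s) answer intervals.
    Returns a single ordered list of non-overlapping cuts."""
    clips = sorted(clips, key=lambda c: (c[0], c[1]))
    merged = []
    for video, start, end in clips:
        if merged and merged[-1][0] == video and start <= merged[-1][2]:
            # Overlapping answer intervals in the same video: extend the cut.
            prev = merged[-1]
            merged[-1] = (video, prev[1], max(prev[2], end))
        else:
            merged.append((video, start, end))
    # Order the final edit by each cut's start time, preserving temporal order.
    return sorted(merged, key=lambda c: c[1])


cuts = assemble_edit_list([
    ("drone_07.mp4", 12.0, 18.5),
    ("drone_07.mp4", 17.0, 24.0),   # overlaps the previous hit -> merged
    ("camtrap_03.mp4", 5.0, 9.0),
])
for video, start, end in cuts:
    print(f"{video}: {start:.1f}s - {end:.1f}s")
```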


Making Multimodal Data Discoverable

How do we locate and query petabyte‑scale multimodal datasets spread across dozens of silos - without moving the data?

OmniMesh: Addressing Findability Challenges in Distributed Nature Data Repositories – SSDBM ’25

Visual data isn’t confined to corporate clouds; biodiversity researchers, for example, publish images, LiDAR, and drone video in dozens of silos. Using the SMAC specification, OmniMesh layers a federated index + lightweight adapters over these sites, enabling global search, schema discovery, and provenance tracking without moving data. [pdf]
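
The federation pattern looks roughly like the sketch below: each silo keeps its bytes in place and exposes a thin metadata adapter, while a coordinator fans queries out and merges provenance-tagged hits. The Adapter interface and repository names are assumptions for illustration, not the SMAC or OmniMesh APIs.

```python
# Illustrative sketch only: federated search over silos that never move their
# data. Each site exposes a metadata adapter; a coordinator merges the hits.
from abc import ABC, abstractmethod


class Adapter(ABC):
    @abstractmethod
    def search(self, query: str) -> list[dict]:
        """Return metadata records (never raw bytes) matching the query."""


class CameraTrapAdapter(Adapter):
    def __init__(self, records):
        self.records = records  # stands in for a site-local catalog or API

    def search(self, query):
        return [r for r in self.records if query.lower() in r["description"].lower()]


def federated_search(query, adapters):
    """Fan the query out to every silo's adapter and merge the metadata,
    tagging each hit with its origin so provenance is preserved."""
    hits = []
    for name, adapter in adapters.items():
        for record in adapter.search(query):
            hits.append({**record, "repository": name})
    return hits


adapters = {
    "field-station-a": CameraTrapAdapter([{"id": "img_001", "description": "zebra herd at dusk"}]),
    "field-station-b": CameraTrapAdapter([{"id": "vid_417", "description": "drone pass over zebra crossing"}]),
}
print(federated_search("zebra", adapters))
```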


Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant #1910356, the Imageomics Institute, and the Honda Research Institute.

Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.