Author: Shima M. Al Mehmadi
Date of Publication: 30th August 2023
Abstract: In today's digital era, multimedia content such as images, videos, text, and audio is commonplace. As the number of Unmanned Aerial Vehicles (UAVs) in the sky increases, UAV videos have emerged as a new form of communication. Text-to-video retrieval offers an efficient and effective way to find a specific video in a large dataset. In this paper, we present a text-to-event retrieval model for UAV videos. The model comprises two parts: the first extracts frame-level features from the video using a Vision Transformer (ViT), and the second extracts a textual representation of the query using Bidirectional Encoder Representations from Transformers (BERT). Both parts are trained jointly on text-video pairs with a bidirectional contrastive loss. The proposed method was evaluated on the CapERA dataset, an extended version of the Event Recognition in Aerial Videos (ERA) dataset, and the results demonstrate its efficacy.
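As a rough illustration of the bidirectional contrastive objective mentioned in the abstract, the following minimal NumPy sketch computes a symmetric (text-to-video plus video-to-text) contrastive loss over a batch of paired embeddings. It is not the authors' implementation. The function names, the mean pooling of frame-level features into one video embedding, and the temperature value of 0.07 are all assumptions made for this example, and random arrays stand in for the ViT and BERT outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def bidirectional_contrastive_loss(frame_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched text-video pairs.

    frame_feats: (batch, frames, dim) frame-level video features
    text_feats:  (batch, dim) query features
    Pair i in frame_feats is assumed to match pair i in text_feats.
    """
    # Assumption: frame features are mean-pooled into one video embedding.
    video_emb = l2_normalize(frame_feats.mean(axis=1))
    text_emb = l2_normalize(text_feats)

    # sim[i, j] = scaled cosine similarity between query i and video j.
    sim = text_emb @ video_emb.T / temperature
    idx = np.arange(sim.shape[0])

    # Text-to-video: each query should rank its own video first (rows).
    t2v = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Video-to-text: each video should rank its own query first (columns).
    v2t = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (t2v + v2t)


# Stand-in data: 4 videos, 8 frames each, 16-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 16))
aligned_text = frames.mean(axis=1)                # perfectly matched queries
shuffled_text = np.roll(aligned_text, 1, axis=0)  # every query mismatched

loss_aligned = bidirectional_contrastive_loss(frames, aligned_text)
loss_shuffled = bidirectional_contrastive_loss(frames, shuffled_text)
```

With these stand-in inputs, the loss for correctly matched pairs is lower than the loss for mismatched pairs, which is the behaviour that joint training of the two encoders would exploit.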