Without systematic capture and tracking of metadata within the broadcast delivery chain, the life of an audio/video asset is often limited to a particular project or story. Hence, there is a latent need to automate the capture of metadata, as well as the subsequent search and retrieval processes, in audiovisual archives.
Real-time audio indexing
Real-time audio indexing systems use advanced speech and text technologies to extract metadata from an audio stream in real time. They take live or recorded audio feeds from diverse sources and produce indexed XML files as output. Within the XML output, each word is time-tagged to mark the exact point at which it occurs in the audio.
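As a rough illustration of what a time-tagged index makes possible, the sketch below parses a small XML fragment and looks up the moments at which a word was spoken. The element and attribute names (index, word, start, end) are assumptions made for this example and do not reflect the schema of any particular product.

```python
import xml.etree.ElementTree as ET

# Hypothetical index format: the element and attribute names are
# assumptions for illustration only.
SAMPLE_INDEX = """
<index source="evening_news">
  <word start="12.40" end="12.71">prime</word>
  <word start="12.71" end="13.05">minister</word>
  <word start="13.05" end="13.30">said</word>
</index>
"""


def load_time_tagged_words(xml_text):
    """Return a list of (word, start_seconds, end_seconds) tuples."""
    root = ET.fromstring(xml_text.strip())
    return [
        (w.text.strip().lower(), float(w.get("start")), float(w.get("end")))
        for w in root.iter("word")
    ]


def find_word(words, query):
    """Return the start times at which the query word was spoken."""
    return [start for text, start, _ in words if text == query.lower()]


words = load_time_tagged_words(SAMPLE_INDEX)
print(find_word(words, "minister"))
```

Because every word carries its own time tag, a text match can be turned directly into a playback position in the original recording.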
The results are then uploaded to a server for subsequent search and retrieval through a user-friendly GUI.
The first processing phase is speaker change detection (SCD). SCD is done using a phone-level decoder, which employs a set of broad phonetic classes of speech sounds as well as non-speech sounds. Using the information produced by the decoder, SCD separates speech from non-speech components and divides the audio stream into segments, so that each segment contains the speech of a single speaker.
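The speech/non-speech part of this step can be sketched as follows. The example assumes the decoder emits a time-ordered list of (label, start, end) entries and that a sufficiently long non-speech stretch closes the current segment. The label names and the 0.5-second threshold are hypothetical, and splitting on pauses is only a stand-in: a real SCD module would also compare the acoustic characteristics of neighboring segments to decide whether the speaker has actually changed.

```python
# Assumed non-speech labels; a real decoder defines its own class set.
NON_SPEECH = {"silence", "music", "noise"}


def speech_segments(decoded, min_gap=0.5):
    """Group decoder output into speech segments.

    decoded: list of (label, start, end) tuples in time order.
    A non-speech stretch lasting at least min_gap seconds closes the
    current segment; shorter stretches are absorbed into it.
    Returns a list of (start, end) tuples.
    """
    segments = []
    seg_start = None
    seg_end = None
    for label, start, end in decoded:
        if label in NON_SPEECH:
            if seg_start is not None and (end - start) >= min_gap:
                segments.append((seg_start, seg_end))
                seg_start = None
            continue
        if seg_start is None:
            seg_start = start
        seg_end = end
    if seg_start is not None:
        segments.append((seg_start, seg_end))
    return segments


decoded = [
    ("silence", 0.0, 1.0),
    ("vowel", 1.0, 1.2),
    ("fricative", 1.2, 1.5),
    ("silence", 1.5, 1.6),   # short pause, absorbed
    ("nasal", 1.6, 2.0),
    ("music", 2.0, 4.0),     # long non-speech stretch, closes the segment
    ("vowel", 4.0, 4.5),
]
print(speech_segments(decoded))
```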
The output of SCD is used for both automatic speech recognition (ASR) and speaker identification (SID). ASR uses statistical models to determine what was said, and produces an indexed text output. SID is language-independent and identifies speakers on the basis of statistical models mirroring the voice characteristics of specific speakers. If no voice print matches, the system detects the speaker's gender instead.
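The fallback from speaker identity to gender can be sketched as a simple decision rule. The example assumes that each segment has already been scored against every enrolled speaker model and against gender models; the score values, the threshold and the function name are hypothetical.

```python
def label_segment(speaker_scores, gender_scores, threshold=0.0):
    """Pick a label for one audio segment.

    speaker_scores: dict mapping speaker name -> match score
                    (assumed: higher means a better match).
    gender_scores:  dict mapping gender label -> score.
    If no speaker score reaches the threshold, fall back to gender.
    """
    if speaker_scores:
        best = max(speaker_scores, key=speaker_scores.get)
        if speaker_scores[best] >= threshold:
            return best
    gender = max(gender_scores, key=gender_scores.get)
    return f"unknown ({gender})"


# A segment that matches an enrolled voice print.
print(label_segment({"anchor_a": 2.3, "reporter_b": -1.0},
                    {"female": 0.8, "male": 0.2}))

# A segment with no sufficiently close voice print.
print(label_segment({"anchor_a": -0.5},
                    {"female": 0.2, "male": 0.8}))
```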
The integrated output of SID and ASR serves as input for the text technologies such as named entity detection (NED) and topic detection (TD). NED highlights named locations, persons, organizations, dates, times, monetary amounts and percentages in specific colors within the text output.
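To make the idea of tagging entities in a transcript concrete, the sketch below marks only two of the listed categories, percentages and monetary amounts, using regular expressions. This is a deliberately simplified, rule-based stand-in chosen for brevity; it is not how the NED module described here works, and the patterns and category names are assumptions.

```python
import re

# Simplified patterns for two entity types only (assumed for illustration).
PATTERNS = {
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?\s?(?:%|percent)"),
    "MONEY": re.compile(r"[$€£]\d[\d,]*(?:\.\d+)?"),
}


def tag_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), entity_type, match.group()))
    found.sort()
    return [(entity_type, value) for _, entity_type, value in found]


print(tag_entities("Shares rose 4.5% after the bank paid $1,200 in fees."))
```

A display layer could then map each entity type to a color when rendering the transcript.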
In TD, an episode is partitioned into sections of homogeneous stories. Finally, these stories are classified into individual topics after scoring against statistical models. The user obtains the final text output with identified speakers, topics and highlighted named entities, just by clicking on a particular broadcast.
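Scoring a story against statistical topic models can be illustrated with a toy unigram model per topic: the story is assigned to the topic under which its words are most likely. The topics, word probabilities and floor value for unseen words below are invented for the example, and a production system would use far richer models trained on real data.

```python
import math

# Toy unigram models; the probabilities are invented for illustration.
TOPIC_MODELS = {
    "sports": {"match": 0.05, "goal": 0.04, "team": 0.04},
    "finance": {"market": 0.05, "shares": 0.04, "bank": 0.04},
}
UNSEEN = 1e-4  # assumed floor probability for words a model has not seen


def classify_story(words, models=TOPIC_MODELS):
    """Return (best_topic, scores), where scores are log-likelihoods."""
    scores = {
        topic: sum(math.log(model.get(w, UNSEEN)) for w in words)
        for topic, model in models.items()
    }
    return max(scores, key=scores.get), scores


topic, scores = classify_story("the team scored a late goal".split())
print(topic)
```

Working with log-probabilities keeps the scores numerically stable even for long stories, where multiplying raw probabilities would underflow.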
Funded by the European Commission, the Combined Image and Word Spotting (CIMWOS) project integrates audio indexing technologies with video and text technologies to automatically locate and retrieve text, images, video and audio from a multilingual audiovisual database by performing content-based searches. The project is being undertaken by a consortium of six organizations from different parts of Europe, with a focus on English, Greek and French. It is based on an open architecture for subsequent addition of new languages.
The technologies involved are speech recognition, face detection and identification, object recognition, and text detection and recognition. Given an arbitrary image, the goal of face detection is to determine whether the image contains any faces and, if so, to return the location and extent of each face. This serves as input for the face recognition module, which matches the face with a given test set.
The object recognition module handles 3D objects of general shape and tries to identify the same object from a wide range of previously unseen viewpoints.
Text detection aims at finding image blocks that may each contain a single line of text. The speech recognition module uses speech recognition algorithms to create a transcript of the video soundtracks. Because speech tracks are fully characterized by their text transcription, ASR can be done offline, and the audio content of the multimedia database can be referred to through the text transcription.
The indexing mechanism stores the information needed to associate text fragments with audio fragments and keeps the two aligned. This dramatically improves the performance of the word-spotting process, because only text-based retrieval algorithms are involved online.
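The reason text-only retrieval is fast can be seen in a minimal inverted index that maps each transcribed word to its positions and time offsets. The data layout and function names below are assumptions for illustration, not the CIMWOS implementation.

```python
from collections import defaultdict


def build_index(transcript):
    """Build an inverted index from a time-aligned transcript.

    transcript: list of (word, start_seconds) tuples in spoken order.
    Returns a dict mapping word -> list of (position, start_seconds).
    """
    index = defaultdict(list)
    for position, (word, start) in enumerate(transcript):
        index[word.lower()].append((position, start))
    return index


def spot_phrase(index, phrase):
    """Return the start times at which the phrase words occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for position, start in index.get(words[0], []):
        if all(
            any(p == position + offset for p, _ in index.get(w, []))
            for offset, w in enumerate(words[1:], start=1)
        ):
            hits.append(start)
    return hits


transcript = [
    ("the", 0.0), ("prime", 0.4), ("minister", 0.8),
    ("met", 1.3), ("the", 1.6), ("minister", 1.9),
]
index = build_index(transcript)
print(spot_phrase(index, "prime minister"))
```

At query time no audio is decoded: the lookup touches only the index, and the returned offsets point back into the aligned audio.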
The CIMWOS system supports both independent and combined searches of images and speech. Users can search for particular images using a text-based interface. It is also possible to use a graphical interface, asking the system to find images with characteristics similar to those of a reference graphic. To search speech-based material, the user enters keywords that have been associated with the material, and can thus look for a particular keyword or speaker.
Samita Mishra is the marketing manager for SAIL LABS Technology.