Interactive video browsing

Video-summarization software reduces the amount of time analysts spend scanning surveillance data for key observations

Video surveillance provides one of the best sources of data for critical infrastructure protection, border surveillance, and urban law enforcement. However, reviewing these data is a time-consuming process, often requiring multiple analysts to spend hours scanning raw video to locate observations related to persons, vehicles, or events of interest. The capability to view video content at a small fraction of the raw video duration (hours compressed to minutes or even seconds) could allow analysts to comb through more surveillance footage and more quickly identify observations that warrant a closer look. A team of Lincoln Laboratory researchers has developed such a capability: the Video Content Summarization Tool.

"Our tool creates video summary views with time-compression rates up to the 1000s and processes each hour of video data within minutes," says Lincoln Laboratory technical staff member Jason Thornton, who developed the software along with colleagues Marianne DeAngelus, Ronald Duarte, Zach Elko, Kim Lasko, and Timothy Schreiner and former Laboratory staff members Nicholas Gallo, Christine Russ, and Aaron Yahr. "Analysts simply have to select camera feeds or video files relevant to their investigation, specify a time interval in the video over which they want to review content, scan the resulting summarized videos for activity of interest, and then select the activity to examine the original video segment from which it was pulled," he continues.

Video captured from an office lobby was used to create this side-by-side comparison of a long original video (left) to its condensed summary view (right). Note that the condensed view depicts activity from the entire one-hour video sequence within a 17-second clip (a time-compression factor of roughly 200).

The software is based on a novel activity-mapping technique that temporally maps all pixel-level motion detected in long surveillance videos onto a condensed, composite visual clip. Unlike other video-synopsis techniques proposed in the computer-vision literature, this new method eliminates the need to perform object tracking, which can be an error-prone process in scenes that contain partial occlusions or intersecting motion patterns. "We can apply our technique to a broad set of scene types with varying crowdedness levels, object types or sizes, occlusion frequency, and other characteristics, without having to design tracking or association algorithms that work for all situations," explains Duarte.

To begin the video-summarization process, the tool automatically scans an original video segment for motion detected at the pixel level. "Motion could include people walking or running, cars being driven through a parking lot, and objects being added to or removed from locations," says Duarte. The software assigns a motion score to every pixel by applying a standard motion-detection technique based on an adaptive background-subtraction model. The model first learns a representation of a scene's static background by inspecting multiple video frames separated in time and estimating the normal background intensity at each pixel. When applied to a new frame, the model assigns a motion score to each pixel by measuring a normalized distance between the current pixel value and the estimated background value. When there is no motion, the score approaches zero; when there is motion, the pixel value deviates from the background value and the score increases.
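
The Laboratory has not published source code, but the scoring step is easy to sketch. The Python fragment below maintains a running per-pixel estimate of the background mean and variance and returns a normalized distance that behaves as described: near zero for static pixels and growing as a pixel deviates from the background. The class name, the exponential update rule, and the initial variance are illustrative assumptions, not the tool's actual model.

```python
import numpy as np

class BackgroundModel:
    """Adaptive background-subtraction sketch with a per-pixel motion score.

    A minimal illustration of the technique described above; frames are
    2-D grayscale NumPy arrays. The update rule and constants are assumed,
    not taken from the Video Content Summarization Tool.
    """

    def __init__(self, first_frame, learning_rate=0.01):
        self.mean = first_frame.astype(np.float64)   # per-pixel background estimate
        self.var = np.full(first_frame.shape, 25.0)  # assumed initial variance
        self.lr = learning_rate

    def motion_score(self, frame):
        """Return per-pixel scores: near zero when static, larger with motion."""
        frame = frame.astype(np.float64)
        diff = frame - self.mean
        # Deviation from the background, normalized by its local variability.
        score = np.abs(diff) / np.sqrt(self.var + 1e-6)
        # Slowly adapt the background estimate toward the current frame.
        self.mean += self.lr * diff
        self.var += self.lr * (diff ** 2 - self.var)
        return score
```

Thresholding these scores (e.g., `model.motion_score(frame) > threshold`) yields the mask of "active" pixels used in the remapping step described next.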

"Active" pixels from frames in the original video are then rearranged into a temporally compressed timeline through a nonlinear cyclical transformation of the spatial-temporal volume that contains the video. The novel transformation scheme does not require object tracking (because the transformation is performed independently for each active pixel) and has two advantages: (1) continuity of motion is preserved between the original and summary videos, creating an effective visualization of activity; (2) objects that appear together in the raw video sequence also appear together in the summary sequence (along with other activity that has been mapped into the condensed timeline); this co-occurrence preserves interactions that may be relevant to an investigation, such as conversations between individuals or object handoffs.

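Although the exact transformation has not been published, a plain cyclical wrap illustrates the idea: send every active pixel from original frame t to summary frame t mod K, where K is the summary length in frames. The function below is a hypothetical sketch of such a mapping, not the tool's actual scheme.

```python
def summary_index(t, summary_length):
    """Map original frame index t to a summary frame index.

    A simple cyclical wrap (t mod K) is one instance of the kind of
    per-pixel temporal transformation described above; the tool's actual
    nonlinear scheme is not published. Consecutive original frames stay
    consecutive (modulo the wrap), preserving motion continuity, and all
    active pixels from one original frame land in the same summary frame,
    preserving co-occurring activity.
    """
    return t % summary_length
```
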
The Video Content Summarization Tool (VCST) forms a short composite view depicting activity sampled from throughout the original video sequence. To form each frame of the summary video, VCST blends the "active" pixels detected within a subset of original video frames (top row, highlighted in color) into a composite view by overlaying these pixels onto a shared background frame (bottom row). Each pixel value in the final summary frames is computed as a weighted sum of corresponding active pixel values from the original frames; the weights are determined by the deviation of each active pixel from a background model of static scene components. The time-compression ratio that the analyst selects determines the temporal spacing between original frames and the number of original frames from which active content is blended to create each summary frame. The analyst can adjust this ratio to improve review efficiency by factors of 100s or even 1000s, depending on the density of activity in the original video.
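
In code, the blending that the figure describes amounts to a per-pixel weighted average in which an active pixel's weight grows with its motion score. This sketch composites the original frames mapped to one summary index onto the shared background; the function name and the exact weighting are assumptions.

```python
import numpy as np

def blend_summary_frame(background, frames, scores, sensitivity=1.0):
    """Weighted blend of active pixels from several original frames.

    `frames` are the grayscale original frames mapped to this summary
    frame, and `scores` their per-pixel motion scores (e.g., from the
    BackgroundModel sketch above). Each output pixel is a weighted sum in
    which weight grows with deviation from the static background; the
    specific weighting here is an illustrative assumption.
    """
    acc = background.astype(np.float64)              # background gets weight 1
    total = np.ones(background.shape, dtype=np.float64)
    for frame, score in zip(frames, scores):
        w = score * sensitivity                      # strong motion dominates
        acc += w * frame.astype(np.float64)
        total += w
    return (acc / total).astype(np.uint8)            # normalized composite
```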

A graphical user interface displays the resulting summary clip, which contains many instances of foreground motion blended onto a static background. Because the pixel remapping process that generates the summary view can be executed in real time, slider bars can provide users with instantaneous control of two key parameters of the video summary formation: the time-compression ratio (i.e., length of summary video) and motion-sensitivity threshold. By adjusting the time-compression ratio, analysts can move seamlessly from sparse to dense activity representations, even single-frame representations of scenes. The ability to adjust motion sensitivity on the fly allows users to create video summarizations that capture meaningful activity but avoid oversensitivity to motion clutter, such as that resulting from illumination changes. By clicking on a specific piece of activity in the summary clip, analysts can jump to the original video frames containing that activity to examine it more closely. "We designed the software with these customizable settings to support interactive video browsing," says Thornton. 
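
The instantaneous slider response follows naturally from the cyclical mapping sketched earlier: changing the time-compression ratio only changes which original frames stack into each summary frame, and the motion-sensitivity threshold only rescales the per-pixel weights, so neither adjustment requires reprocessing the video. A hypothetical helper:

```python
def source_frames(s, summary_length, original_length):
    """Original frame indices that feed summary frame s.

    Under the assumed t mod K mapping, summary frame s draws from original
    frames s, s + K, s + 2K, ... A higher time-compression ratio means a
    smaller K, so more original frames are blended into each summary frame.
    """
    return list(range(s, original_length, summary_length))
```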


The Video Content Summarization Tool plays a short summary view of a longer video stream captured at a street intersection. Underneath the video are two slider bars that let analysts easily customize how content is displayed. The time-compression bar allows analysts to move instantaneously from long synopsis videos that provide clear views of individual activity components to highly condensed versions that show complete activity patterns. With the other bar, analysts can adjust motion sensitivity, trading off between comprehensive detection and clutter suppression (e.g., the removal of shadow effects). When analysts hover the mouse over objects in the summary view, the corresponding activity is highlighted in blue. By left-clicking this activity, analysts can view the original video segment from which it was pulled.


Currently, the software is designed to connect to either video archives or exported video files to generate requested summary views. It has been deployed in multiple operational settings, and user testing has demonstrated that it can significantly speed up the video review process. Future development will focus on extending the technique beyond fixed cameras to panning cameras. "This capability would allow users to apply the tool to summarize footage from panning cameras covering wide outdoor areas, for example," says Thornton.

Posted December 2015
