Predicting subjective video quality
Feb 1, 2008 12:00 PM, By Kevin Ferguson And Winfried Shultz
Difference mean opinion score: A comprehensive approach to emulating the human vision system to obtain accurate and repeatable results when predicting subjective video quality ratings.
Adaptive (dynamically nonlinear) response class
Figure 3. Mach band effect. The two horizontal gray bars are actually identical, although in the combined picture on the right the gray bar seems to have brighter and darker areas in the regions of large contrast.
One additional element of human perception, whether in the vision system or in another sense, is perceptual contrast. In a comparative situation, humans make decisions by contrasting one stimulus with another.
A good example of this behavior was identified by Sherif, Taub and Hovland in 1958. They asked people to lift different weights. When the test subjects first had to lift a heavy weight, they underestimated the weight of the lighter objects they lifted afterwards. Similar effects occur in the vision system. Some ITU standards account for this effect by introducing training sequences before the actual test, which sets a common baseline for the participants in human viewer trials.
Predicting subjective quality ratings must take into account that perceptual contrast affects the threshold at which differences become noticeable. It is essential to calibrate the system with human vision science data to support an adaptive filtering system.
For the general case, human vision response to video stimuli is adaptive (dynamically nonlinear). Response sensitivities can change by more than an extra order of magnitude beyond the spatiotemporal dynamic range. Spatiotemporal dynamic range represents the human vision system's capability to identify differences in light stimuli over area (resolution) and time (how many different light stimuli occur in a given area). In very simplified terms, the adaptive nature of human vision acts like a magnifying glass for spatial as well as temporal aspects. However, the adaptation itself takes time, governed by its own time constants. Any measurement system for predicting subjective picture quality must take these effects into account.
The approach of the vision science community is to measure stimulus-response pairs in a controlled environment. To determine the contrast sensitivity of the human vision system, for example, a whole series of tests needs to be conducted to take into account the adaptive capabilities of human vision.
First, the spatial content is varied (for example, by increasing detail) at a given ambient luminance (for example, low light to emulate cinema conditions), and a separate curve is traced for each level of temporal change in the video. The viewer's vision adapts to the low light level and to each frequency of temporal change. Spatial content is thus varied for a fixed combination of the other two parameters, and the results are recorded. Then one of those parameters is changed, and the effect on spatial contrast sensitivity is recorded again for the new set of parameters.
This data can then be used to model the human vision system and serve as a parameter set to account for adaptation parameters dynamically determined in the measurement process. Why is this so important? Contrast sensitivities can change by a factor of almost 100 depending on the values of the other parameters, due to the adaptive nature of the vision system. As a consequence, meaningful results can only be obtained from a system that dynamically adapts its filter settings according to the surrounding conditions and the video stimuli observed.
The change in sensitivity with average luminance, such as light and dark adaptation, involves a nonlinearity that is consistent with many visual perception phenomena. One very obvious adaptation is the ability of the human vision system to adapt to ambient luminance. As a consequence, watching a movie in a cinema differs from watching it at home or on a mobile device in bright sunlight not only in screen size but also in perceived quality, due to the adaptive nature of the human vision system. Brightness with flicker, changes in dynamic responses to step increases, afterimages, visual illusions and extreme sensitivity (e.g., photosensitive epilepsy) are consistent with the types of nonlinearity that account for most of the adaptation.
Key human vision stimulus-response data sets
Any analysis of perceived video quality has to identify the threshold at which differences compared to a reference will become noticeable to the viewer. This is comparable to differentiating a trend from noise. The underlying measurements for predicting subjective quality ratings have to be calibrated with human vision science data supporting the ability to detect the differences at the smallest increment. This is called detecting supra-threshold responses of the human vision system to stimuli applied. One effective way to do this is to calibrate the measurement system with stimulus-response data sets based on findings of the human vision science community.
Predicting subjective video quality ratings
The human vision system only responds to light stimuli. Any measurement system must establish a transfer from (electrical) video data to light stimuli emitted from a display. Figure 4 shows the signal flow through the analysis engine. In the display model, a conversion from electrical signals to light stimuli is performed. The viewing model takes into account the viewing distance, ambient light and so forth.
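As a rough illustration of the display and viewing models described above, the following Python sketch converts a video code value to emitted light and expresses display resolution in terms of visual angle. It is a minimal sketch under stated assumptions, not the analysis engine's actual implementation: the power-law (gamma) display response, the default peak and black luminance values, and the function names are all hypothetical choices made for this example.

```python
import math


def code_to_luminance(code, peak_nits=100.0, black_nits=0.1,
                      gamma=2.2, max_code=255):
    """Display model sketch: map a video code value to luminance in cd/m^2.

    Assumes a simple power-law display; real displays and the article's
    display model may behave differently.
    """
    v = min(max(code / max_code, 0.0), 1.0)  # normalize and clamp to [0, 1]
    return black_nits + (peak_nits - black_nits) * v ** gamma


def lines_per_degree(distance_in_picture_heights, lines):
    """Viewing model sketch: display lines subtended by one degree of visual angle.

    One degree covers 2 * d * tan(0.5 deg) picture heights at distance d.
    """
    heights_per_degree = 2.0 * distance_in_picture_heights * math.tan(math.radians(0.5))
    return lines * heights_per_degree


if __name__ == "__main__":
    print(code_to_luminance(128))       # mid-gray code value as emitted light
    print(lines_per_degree(3.0, 1080))  # 1080-line picture viewed at 3 picture heights
```

Ambient light could be approximated in such a sketch by raising `black_nits`, since light reflected from the screen adds to the displayed black level.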
The vision model provides adaptive filtering to effectively simulate the human vision system as described. The difference is obtained from the results of each predicted human vision response. Within the objective maps node, visible impairments are classified and measured objectively, with the ability to then sum each impairment with its corresponding relative annoyance or relative preference. The summary node extracts single summary measures per frame and/or video sequence. The summary node also includes the ITU-R BT.500 training equivalent, which maps the response summary measures to a difference mean opinion score (DMOS).
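The summary step can be pictured with a short Python sketch. The article does not specify the pooling rule or the mapping to DMOS, so both are assumptions here: Minkowski pooling is swapped in as a common way to collapse a difference map into one number, and a saturating curve stands in for the calibrated training equivalent. The function names and the `p`, `midpoint` and `steepness` parameters are hypothetical.

```python
def minkowski_pool(values, p=4.0):
    """Collapse a perceptual difference map into one summary measure.

    Larger p weights the worst impairments more heavily (assumed pooling rule).
    """
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)


def to_dmos(summary, midpoint=1.0, steepness=2.0):
    """Map a summary measure to a 0-100 DMOS-like scale.

    Hypothetical saturating curve: 0 for no visible difference,
    50 at `midpoint`, approaching 100 for very large differences.
    """
    s = abs(summary) ** steepness
    return 100.0 * s / (s + midpoint ** steepness)


if __name__ == "__main__":
    difference_map = [0.0, 0.2, 0.1, 1.5, 0.0, 0.3]  # made-up per-region differences
    frame_summary = minkowski_pool(difference_map)
    print(frame_summary, to_dmos(frame_summary))
```

In a calibrated system, the mapping parameters would be fitted to ratings from human viewer trials instead of being chosen by hand.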
An adaptive integrator (see Figure 5) is used to filter in four spatial directions (right, left, up, down) and temporally. The result is a spatiotemporal filter that is tunable in each dimension. Consistent with previous models taking into account center and surround interaction, two spatiotemporal filters are used: one for the center and one for the surround. The surround spatiotemporal response is used to both subtract from the center and tune the center spatiotemporal response. In addition, the surround spatiotemporal response also alters its own response via feedback to the frequency controls, but much more slowly than for the center, consistent with longer term adaptation such as long term light and dark adaptations, after-images, and other long-term effects.
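The center and surround interaction can be sketched in one spatial dimension. The Python example below is a deliberately simplified, non-adaptive illustration, not the integrator shown in Figure 5: it uses a first-order low-pass filter run left-to-right and right-to-left (two of the four spatial directions), with fixed coefficients, and omits the temporal dimension and the feedback that tunes the filters. The function names and the `center_alpha` and `surround_alpha` values are assumptions made for this example.

```python
def lowpass_bidirectional(signal, alpha):
    """First-order low-pass run in both directions along a line, then averaged.

    alpha in (0, 1]: larger values mean a higher cut-off (less smoothing).
    """
    n = len(signal)
    forward = [0.0] * n
    backward = [0.0] * n
    state = signal[0]
    for i in range(n):                      # left-to-right pass
        state += alpha * (signal[i] - state)
        forward[i] = state
    state = signal[-1]
    for i in range(n - 1, -1, -1):          # right-to-left pass
        state += alpha * (signal[i] - state)
        backward[i] = state
    return [(f + b) / 2.0 for f, b in zip(forward, backward)]


def center_surround(signal, center_alpha=0.8, surround_alpha=0.1):
    """Subtract a wide (surround) low-pass from a narrow (center) low-pass."""
    center = lowpass_bidirectional(signal, center_alpha)
    surround = lowpass_bidirectional(signal, surround_alpha)
    return [c - s for c, s in zip(center, surround)]


if __name__ == "__main__":
    step = [10.0] * 100 + [20.0] * 100      # a luminance edge
    response = center_surround(step)
    print(response[99], response[100])      # undershoot and overshoot at the edge
```

Even this fixed-coefficient version produces an undershoot on the dark side of an edge and an overshoot on the bright side, which resembles the Mach band effect. An adaptive version would let the surround output adjust the filter coefficients over time.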
Calibration
Extensive controls allow calibration of the direct threshold spatiotemporal response, the horizontal and vertical dimensions, and the center and surround. A control for the baseline frequency cut-off (corresponding to integration time or area) is required. Also requiring calibration is the frequency response adaptation sensitivity, which controls the transition between threshold and supra-threshold response (one control for spatial and one for temporal). In addition to the adaptive spatiotemporal filter, other model components are used to take into account Weber's law, perceptual differences between correlated versus uncorrelated images, and other behavior, including types of masking.
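Weber's law, mentioned above, states that the just-noticeable change in a stimulus is roughly proportional to the background level. The Python sketch below illustrates the idea for luminance. The Weber fraction of 0.02 is an assumed, illustrative value, not a calibrated one from the article, and the function names are hypothetical.

```python
def weber_contrast(delta_luminance, background_luminance):
    """Contrast of a luminance change relative to its background."""
    return abs(delta_luminance) / background_luminance


def is_noticeable(delta_luminance, background_luminance, weber_fraction=0.02):
    """True if the change exceeds the assumed just-noticeable Weber fraction."""
    return weber_contrast(delta_luminance, background_luminance) > weber_fraction


if __name__ == "__main__":
    # The same 1 cd/m^2 change on two different backgrounds.
    print(is_noticeable(1.0, 10.0))    # contrast 0.10
    print(is_noticeable(1.0, 100.0))   # contrast 0.01
```

The same absolute error can therefore be visible in a dark region and invisible in a bright one, which is one reason a fixed error metric does not track perceived quality.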
Conclusion
Vision science has developed comprehensive sets of stimulus-response pairs to predict human vision system response and perception. Functional blocks can be implemented in a dynamically nonlinear adaptive system that successfully models the human vision system far beyond existing technical implementations according to ITU-T J.144. This implementation is largely agnostic to video systems in terms of resolution, frame rate or compression algorithm, and is capable of simulating viewing conditions, display types and viewer skills. It ultimately calculates DMOS values based on an adaptive filtering system emulating the human vision system. The objective is to help accelerate the optimization of encoding algorithms, improve bandwidth utilization in distribution systems and foster a better viewing experience for TV consumers by measuring and tracking optimum perceived video quality.
Kevin Ferguson is a principal engineer at Tektronix, responsible for mathematical modelling and algorithm development for automated video measurement and picture quality analysis. Winfried Schultz is video marketing manager, EMEA, for Tektronix's video product line.
References
- Ferguson, Kevin, “An Adaptable Human Vision Model for Subjective Video Quality Rating Prediction Among CIF, SD, HD and E-Cinema,” Tektronix Inc., Whitepaper, June 1, 2007, Lit. no. 25W-21014-0.
- Ferguson, Kevin, “Predicting Subjective Quality Ratings of Video,” US Patent No. 6829005, Issued Dec. 7, 2004.
- W. H. Swanson, T. Ueno, V. C. Smith, J. Pokorny, “Temporal modulation sensitivity and pulse-detection thresholds for chromatic and luminance perturbations,” J. Opt. Soc. Am., Oct. 1987, Vol. 4, No. 10, pp. 1992-2005.
- D. Hubel, “Eye, Brain, and Vision,” Scientific American Library, NY, NY, 1995, pp. 33-136.
- Enroth-Cugell, “The World of Retinal Ganglion Cells,” from Shapley, R., Man-Kit Lam, D., ed., Contrast Sensitivity, MIT Press, 1993, pp. 155,159.
- B. Levitan and G. Buchsbaum, “Signal sampling and propagation through multiple cell layers in the retina: modeling and analysis with multirate filtering,” J. Opt. Soc. Am., July 1993, Vol. 10, No. 7, pp. 1463-1480.