Authors: Pengfei Li*, Qing Fang**, Youzhao Wu**, Honglei Ji**
* Yongjiang Laboratory (Y-LAB), Ningbo 315000, China
** China R&D Center, TCL Electronics Co., Ltd., Shenzhen 518000, China.
Abstract
Vision is the most direct and effective way for humans to obtain information from the outside world: more than 80% of that information reaches the brain through the eyes. As display technology develops, the pictures received by the eyes become brighter, more vivid, and clearer. For such high-quality display images, the objective evaluation system falls short. Describing image quality only with parameters such as brightness, color gamut, and contrast is not fully consistent with what consumers actually perceive. Subjective evaluation based on human vision therefore needs to be added to the image quality evaluation system. This article reviews the domestic and international standards for subjective assessment of image quality from the perspectives of viewing conditions, observers, test preparation, test methods, and result analysis. Based on this review, a subjective evaluation method suitable for high-quality display devices is summarized.
Author Keywords
Image quality, Subjective evaluation, DSCQS, DSCS, SSCQE
1. Introduction
The eyes are the most important sensory organ for human beings. More than 80% of the information we receive from the outside world is obtained through the eyes [1]. Our brain processes visuals 60,000 times faster than text [2]. About 40% of people respond better to visuals than to other materials. Of all the sensory receptors in our body, 70% are in our eyes. To some extent, we perceive the world primarily through our eyes.
With the development of display technology, parameters such as the brightness, color gamut, and resolution of display devices such as TVs have kept rising, and the amount of detailed visual information received by the human eye has increased sharply. For display technology, the most important performance attribute is picture quality. According to data released by NPD Display Search, picture quality is a major consideration for consumers around the world when they decide to buy a new TV. How to evaluate image quality is therefore particularly important.
2. Image quality evaluation system
The evaluation of display image quality can be divided into objective parameter evaluation and subjective evaluation. Objective parameter evaluation covers brightness, color gamut, color volume, viewing angle, uniformity, contrast, and other dimensions. These physical quantities are measured with precise optical instruments. Subjective evaluation mainly covers items that cannot be quantified directly, such as halo and overall display quality. Objective evaluation is based on the CIE 1931 standard. Brightness is determined from the luminous efficiency function of the human eye adopted in the CIE 1931 standard. Similarly, the color point is determined from the XYZ color-matching functions of the CIE 1931 standard. The other objective parameters, such as color gamut, color volume, viewing angle, uniformity, and contrast, are derived from brightness and color point. Nowadays, in addition to the CIE 1931 standard, color and brightness are also evaluated in color space systems such as CIE 1976, Lab, and LCH. These color spaces can be obtained from CIE 1931 by formula conversion.
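The formula conversion mentioned above can be illustrated with a minimal Python sketch. It converts CIE 1931 XYZ tristimulus values to xy chromaticity, CIE 1976 u'v' chromaticity, L*a*b*, and LCh. The function names are hypothetical, and the D65 reference white (2° observer, Y normalized to 100) is an assumption made for this illustration only.

```python
import math

# Assumed reference white: D65, 2-degree observer, Y normalized to 100.
D65_WHITE = (95.047, 100.0, 108.883)


def xyz_to_xy(X, Y, Z):
    """CIE 1931 chromaticity coordinates (x, y)."""
    total = X + Y + Z
    return X / total, Y / total


def xyz_to_uv_prime(X, Y, Z):
    """CIE 1976 UCS chromaticity coordinates (u', v')."""
    denom = X + 15.0 * Y + 3.0 * Z
    return 4.0 * X / denom, 9.0 * Y / denom


def xyz_to_lab(X, Y, Z, white=D65_WHITE):
    """CIE 1976 L*a*b* relative to the given reference white."""
    delta = 6.0 / 29.0

    def f(t):
        # Cube root above the threshold, linear segment below it.
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = (f(v / w) for v, w in zip((X, Y, Z), white))
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def lab_to_lch(L, a, b):
    """Cylindrical form of L*a*b*: lightness, chroma, hue angle in degrees."""
    return L, math.hypot(a, b), math.degrees(math.atan2(b, a)) % 360.0


if __name__ == "__main__":
    print(xyz_to_xy(*D65_WHITE))        # chromaticity of the assumed white
    print(xyz_to_uv_prime(*D65_WHITE))
    print(xyz_to_lab(41.24, 21.26, 1.93))  # example XYZ triple (illustrative)
```

All four spaces are derived from the same measured XYZ values, which is why brightness and color point are the base quantities for the other objective parameters.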
At present, well-known third-party evaluation agencies in the display industry, such as HDTVtest, RTINGS, DisplayMate, and DxoMark, take both objective and subjective evaluation into account. HDTVtest uses precision optical instruments to measure objective parameters of TVs, such as contrast, color, and viewing angle, in a dark room. It also conducts subjective evaluation by playing test videos to assess functions such as local dimming and HDR picture quality. However, it does not publish a detailed report of its evaluation results. RTINGS is an evaluation agency in Canada that purchases the TV products it tests itself, mainly from the North American market. Its test items are more comprehensive than those of HDTVtest, taking different usage scenarios into account, and its results are more open and transparent. The RTINGS assessment consists of both objective quantitative tests and subjective scoring tests. DisplayMate and DxoMark mainly evaluate mobile phone screens, covering color, video, motion, and touch, and consider both objective and subjective tests.
In addition to the above evaluation systems, which are built mainly on objective evaluation, there are also programs built on subjective evaluation, such as 3M's display quality score (DQS). It lets product developers forecast how design decisions affect perceived quality. It is mainly based on display size, resolution, luminance, and color gamut, and it combines objective quantitative tests with subjective scoring tests. However, its evaluation criteria and calculation methods are not open, or have not been continuously updated.
With the development of display technology, parameters such as brightness and color gamut keep increasing. However, the perceived display quality does not increase linearly with these parameter values. Stevens conducted a large number of studies of suprathreshold sensation, using the magnitude estimation method to study the relationship between stimulus intensity and perceived magnitude. He found that the perceived magnitude is a power function of the physical intensity, and proposed Stevens's power law [3], as shown in Equation 1.
$$S = k I^{n} \qquad (1)$$
Here, S is the perceived (sensed) magnitude; I is the physical intensity of the stimulus; and k and n are constants characteristic of the type of experience being rated (k is a scaling constant, and n is the power exponent determined by the sensory channel of the stimulus).
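To make the consequence of Stevens's power law concrete, the following Python sketch evaluates how much the perceived magnitude grows when the physical intensity is multiplied by a given factor. The function names are hypothetical, and the exponent n = 0.33 is only an illustrative value often quoted for brightness; the actual exponent depends on the stimulus and viewing conditions.

```python
def perceived_magnitude(intensity, k=1.0, n=0.33):
    """Stevens's power law, S = k * I**n.

    k and n are assumed illustrative constants, not measured values.
    """
    return k * intensity ** n


def perceived_gain(intensity_ratio, n=0.33):
    """Ratio of perceived magnitudes when intensity is scaled by intensity_ratio.

    The constant k cancels out: (k * (r*I)**n) / (k * I**n) = r**n.
    """
    return intensity_ratio ** n


if __name__ == "__main__":
    # Doubling peak luminance, e.g. from 1000 nit to 2000 nit.
    print(f"perceived gain for 2x luminance: {perceived_gain(2.0):.2f}")
    # A tenfold luminance increase.
    print(f"perceived gain for 10x luminance: {perceived_gain(10.0):.2f}")
```

With an exponent below 1, doubling the luminance yields far less than a doubling of perceived brightness (about 1.26 times under the assumed n), which is consistent with the argument that parameter gains and perceived gains diverge.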
According to perception studies in recent years, the display effect stops improving with increasing display parameters once they reach a certain level [4-7]. Taking brightness as an example, the range of brightness that the human eye can discriminate in a dark room is from 0.32 nit (σ = 0.09 nit) to 1716.3 nit (σ = 427.0 nit) [4]. Beyond this range, the human eye cannot accurately recognize changes in brightness. In fact, the brightness of high-end TVs, such as Samsung's The Terrace TV, Samsung's QN90A, and Sony's Z9F, has essentially reached the upper limit of this range.
According to data from the RTINGS website, the brightness of Samsung's The Terrace TV is close to 2000 nit, as shown in Table 1. When the peak brightness exceeds 6000-7000 nit, the subjective score of image quality decreases [5]. The same is true of the color gamut, where colors outside the natural gamut are perceived as distorted. Pointer's Gamut, which contains essentially all the colors found in nature, is an irregularly shaped color gamut [8], as shown in Figure 1. High-gamut technologies such as quantum dots can display out-of-gamut colors that do not look natural. To sum up, because of these perceptual limits, objective parameters cannot accurately measure the performance of high-color-gamut and high-brightness display devices. In addition, the final display effect is strongly affected by factors such as image processing algorithms. Subjective evaluation is therefore particularly important as a supplement to the evaluation system.
3. Subjective evaluation criteria
We mainly studied the following standards. International standard: ITU-R BT.500, Methodology for the subjective assessment of the quality of television pictures [9]. Chinese national standards: GB/T 7401-1987, Method of subjective assessment of quality of colour TV pictures [10]; GY/T 134-1998, The method for the subjective assessment of the quality of digital television picture [11]; SJ/T 11590-2016, A subjective method for the evaluation of displaying quality of LED displays [12]; T/CSMPTE 3-2018, Subjective assessment method for image quality of ultra high-definition television [13]; GY/T 340-2020, Subjective assessment methods for image quality of ultra high-definition television: Double-stimulus continuous quality-scale [14].
ITU-R BT.500 is a subjective evaluation method for the quality of television images published by the International Telecommunication Union Radiocommunication Sector. It mainly consists of common test methods, rating scales, and viewing conditions. The standard was first published in 1974 and has been supplemented and modified many times to form the existing standard, which is suitable for all mainstream display technologies. ITU-R BT.500 is the most substantial and comprehensive standard, involving a variety of test methods and data processing methods. GB/T 7401 is a subjective evaluation method for color television image quality issued by the National Bureau of Standards of the People's Republic of China in 1987, and is the earliest standard for subjective evaluation in China. The standard mainly targets color TV and involves a variety of evaluation methods and data processing methods. GY/T 134 is a subjective evaluation method for digital TV image quality issued by the State Administration of Radio, Film and Television in 1998. The standard not only prescribes typical evaluation methods and data processing methods, but also describes the testing process in detail. SJ/T 11590 is a subjective evaluation method for LED display image quality released by the Ministry of Industry and Information Technology of the PRC in 2016. This standard focuses on test items and describes performance analysis criteria and scoring criteria for LED displays in detail. T/CSMPTE 3 is a subjective evaluation method for the image quality of UHD TVs or displays released by the China Society of Film and Television Technology in 2018. This standard mainly covers the double-stimulus continuous quality scale method. GY/T 340, issued by the National Radio and Television Administration, is a method for subjective evaluation of ultra-high-definition TV image quality by the double-stimulus continuous quality scale in a laboratory environment.
The above standards emphasize different content, but they are also related to each other, as shown in Figure 2. ITU-R BT.500 is the most comprehensive standard, and GB/T 7401 is the earliest subjective evaluation standard in China. GY/T 134 refers to ITU-R BT.500 and GB/T 7401. SJ/T 11590 refers to GB/T 7401. T/CSMPTE 3 refers to GY/T 134 and ITU-R BT.500. GY/T 340-2020 refers to ITU-R BT.500.
Through in-depth study of the above standards, we found that subjective evaluation has five core elements: viewing conditions, observers, pre-test training, the subjective test, and result analysis. The viewing conditions mainly cover two aspects: the technical parameter requirements for the display and the viewing environment. All current standards constrain the basic parameters of the display, including brightness, color temperature, viewing angle, and so on. The ambient lighting conditions are generally divided into dark room and indoor lighting, and a D65 light source is generally required for the color temperature. The viewing distance is generally related to the resolution. For example, ITU-R BT.500 specifies different viewing distances for different resolutions, as shown in Figure 3. At least 15 observers are generally required, and they can be divided into professional and non-professional observers. The preparation before the test includes the selection of test pictures and videos and the training of the observers. There are two common subjective test methods: the double-stimulus continuous quality scale method and the stimulus comparison method. Each method has its own testing procedure and subjective scale. The results are usually analyzed in terms of mean value and confidence interval, and the analysis should generally include a screening mechanism.
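The mean value and confidence interval analysis mentioned above can be sketched as follows. This is a minimal Python illustration, assuming the common 95% interval of the form mean ± 1.96·S/√N, where S is the sample standard deviation and N the number of observers; the function name and the example scores are hypothetical.

```python
import math
import statistics


def mean_with_confidence_interval(scores, z=1.96):
    """Return (mean, lower, upper) for one test condition.

    scores: one score per observer for the same presentation.
    z: normal quantile; 1.96 corresponds to an assumed 95% interval.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are needed for an interval")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (N - 1)
    half_width = z * std / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Made-up scores from 15 observers on a 0-100 scale.
    example = [72, 68, 75, 80, 66, 71, 77, 69, 74, 70, 73, 78, 65, 76, 72]
    mean, low, high = mean_with_confidence_interval(example)
    print(f"mean = {mean:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

The interval narrows with the square root of the panel size, which is one reason the standards set a minimum number of observers.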
4. Evaluation methods suitable for high performance TVs
With the development of display technology, high resolution (2K and 4K), large size (55-inch, 65-inch, 75-inch and above), and wide color gamut have gradually become the mainstream of TV development. Based on the standards discussed in the previous section, we summarize a set of subjective image quality evaluation methods applicable to mainstream TVs. We describe them from the following six aspects: TV settings, viewing conditions, observers, pre-test training, test method, and result analysis.
TV settings: To ensure the objectivity and accuracy of the experimental results, the TV must be set up correctly before subjective evaluation. Different viewing modes can strongly affect the brightness, color, and color temperature of the displayed image. A warm color temperature is recommended for testing, and default values are recommended for the other settings. The default values are the display state tuned by the TV manufacturer, so this setting shows the TV at its intended best performance.
Viewing conditions: Two viewing scenarios are recommended here: laboratory and indoor lighting, with indoor lighting preferred. The laboratory environment should be a dark room with no ambient light interference. For indoor lighting, a D65 light source is recommended, and the ambient illuminance on the screen (the light incident on the screen from the surrounding environment, measured perpendicular to the screen) should be less than or equal to 200 lux. The recommended viewing distance is 3 times the height of the TV picture (this value applies to mainstream 2K TVs).
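As a worked example of the viewing distance rule, the Python sketch below converts a diagonal size in inches to picture height and then to the recommended distance of 3 picture heights. The 16:9 aspect ratio and the function names are assumptions made for illustration.

```python
import math

INCH_TO_M = 0.0254


def picture_height_m(diagonal_inch, aspect_w=16, aspect_h=9):
    """Picture height in meters from the diagonal, assuming a 16:9 panel."""
    diagonal_m = diagonal_inch * INCH_TO_M
    return diagonal_m * aspect_h / math.hypot(aspect_w, aspect_h)


def viewing_distance_m(diagonal_inch, picture_heights=3.0):
    """Viewing distance as a multiple of picture height (3H for 2K here)."""
    return picture_heights * picture_height_m(diagonal_inch)


if __name__ == "__main__":
    for size in (55, 65, 75):
        print(f"{size}-inch: height {picture_height_m(size):.2f} m, "
              f"distance {viewing_distance_m(size):.2f} m")
```

Under these assumptions a 55-inch TV is viewed from roughly 2.05 m and a 75-inch TV from roughly 2.80 m.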
Observers: Non-professional observers are preferred over professional observers. When precise judgment is required, evaluation and analysis can be performed by professionally trained observers. In general, experimental results obtained with non-professional observers are more objective. The observers should be representative, that is, they should include people of different genders, ages, and educational levels. They should have normal visual acuity (including corrected visual acuity) and normal color vision, and should pass screening tests with a visual acuity chart and a color blindness chart. They should have a reasonable ability to analyze and judge, and should be able to quickly understand and master the evaluation methods and requirements. Data from at least 15 observers are needed for the results to be statistically meaningful.
Pre-test training: Two things need to be done before the test: selecting the test material and training the observers. Test images: Material that is challenging but not overly extreme should be preferred. Ideally, 5-7 test sequences should be used. Pre-test training: At the beginning of each evaluation session, the evaluation method should be explained to the observers correctly and in detail, and a demonstration of the evaluation should be shown. The demonstration should use images or sequences different from those used in the formal test. A test session should last less than half an hour. At the beginning of the first session, about five dummy presentations are played in pseudo-random order to stabilize the observers' opinion.
Test method: Three test methods are recommended here: DSCQS (double stimulus continuous quality scale), DSCS (double stimulus comparison scale), and SSCQE (single stimulus continuous quality evaluation). DSCQS is an alternating method. The observer views a pair of images from the same source, one of which has passed through the process under test while the other comes directly from the signal source, and is asked to rate the quality of both.
Assessment trials use either one monitor or two well-matched monitors, and generally proceed as in the single-stimulus case. If one monitor is used, each trial includes an additional stimulus field of the same duration as the first. In this case, it is good practice to ensure that, across trials, the two members of a pair appear equally often in the first and second positions. If two monitors are used, the stimulus fields are displayed simultaneously. The stimulus comparison method assesses the relationships among conditions most fully when the judgments compare all possible pairs of conditions. If this requires too many observations, the observations can be distributed among the observers, or a sample of all possible pairs can be used. In DSCQS, the two stimuli are usually presented alternately and in random order, and the observer rates them after viewing them at least twice. In contrast, DSCS presents the two stimuli at the same time, and the observer can score the difference after a single viewing.
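The pairing and ordering practice described above can be sketched in Python. The sketch builds all possible pairs of conditions, presents each pair in both orders so that every condition appears equally often in the first and second positions, shuffles the trials pseudo-randomly, and optionally distributes them among observers. The function names and condition labels are hypothetical.

```python
import itertools
import random


def build_comparison_trials(conditions, seed=0):
    """All possible pairs, each shown in both orders, in pseudo-random order."""
    trials = []
    for first, second in itertools.combinations(conditions, 2):
        trials.append((first, second))
        trials.append((second, first))  # counterbalance the position
    random.Random(seed).shuffle(trials)  # seeded for reproducibility
    return trials


def split_among_observers(trials, n_observers):
    """Distribute trials round-robin when one observer cannot see them all."""
    return [trials[i::n_observers] for i in range(n_observers)]


if __name__ == "__main__":
    trials = build_comparison_trials(["TV-A", "TV-B", "TV-C", "TV-D"], seed=42)
    print(len(trials), "trials in total")
    for index, subset in enumerate(split_among_observers(trials, 3)):
        print(f"observer group {index}: {subset}")
```

For k conditions this yields k(k-1) trials, which grows quickly; that growth is the reason for distributing observations or sampling pairs.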
Different test methods use different subjective scales, as shown in Figure 4. The grading scale is used for the double-stimulus continuous quality scale method. Observers are asked to mark a vertical scale to rate the overall image quality of each presentation. For each test condition, the marks for the evaluation pair (reference and test) are converted from the measured length on the scoring sheet to a normalized score in the range of 0 to 100. The difference between the scores for the reference condition and the test condition is then calculated. The comparison scale is used for the stimulus comparison method. Observers are asked to rate the test stimulus against the reference, using the words in Figure 4 that indicate the presence and direction of perceptible differences.
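The score conversion for the grading scale can be sketched as follows. This minimal Python illustration normalizes the position of a mark on the printed scale to the 0 to 100 range and then takes the reference-minus-test difference. The 100 mm scale length is an assumed value, and the function names are hypothetical.

```python
def normalize_mark(mark_mm, scale_length_mm=100.0):
    """Convert a mark position on the scoring sheet to a 0-100 score.

    scale_length_mm is the printed length of the scale (assumed 100 mm).
    """
    if not 0.0 <= mark_mm <= scale_length_mm:
        raise ValueError("mark lies outside the printed scale")
    return 100.0 * mark_mm / scale_length_mm


def difference_score(reference_mark_mm, test_mark_mm, scale_length_mm=100.0):
    """Reference score minus test score; larger means more visible degradation."""
    reference = normalize_mark(reference_mark_mm, scale_length_mm)
    test = normalize_mark(test_mark_mm, scale_length_mm)
    return reference - test


if __name__ == "__main__":
    # One observer marked the reference at 82 mm and the test at 61 mm.
    print(difference_score(82.0, 61.0))
```

Working with differences rather than absolute scores removes much of the variation in how individual observers use the scale.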
The properties of human memory also affect experimental results. The significant memory effect lasts no more than 9 s, so each sequence should be shorter than 9 s. Humans detect declines in quality quickly and improvements slowly [15]. Hiding the sample labels and the presentation order can increase experimental accuracy. Observers quite commonly swap the two scores in DSCQS, so the scores should be double-checked. SSCQE can replace DSCQS when hidden reference removal is used, provided that each observer sees a unique sequence [16]. In this way, the experiment time can be reduced, which also reduces visual fatigue.
Result analysis: The result analysis covers two cases: without screening and with screening. The main difference between the two is whether some of the experimental data are selectively eliminated. Typically, observers can be screened if the number of observers in the test is small and they are non-experts. Otherwise, screening is not required. For the screening mechanism, refer to the method in ITU-R BT.500.
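A screening step of this kind can be sketched in Python. The sketch follows the kurtosis-based (β2) observer screening procedure as it is commonly described for ITU-R BT.500: for each presentation, scores far from the mean are counted per observer, and an observer with many outliers on both sides is rejected. The thresholds (2 or √20 standard deviations, 5%, 0.3) are the commonly quoted values and should be checked against the Recommendation itself; the function name and data layout are hypothetical.

```python
import math


def screen_observers(scores):
    """Return the indices of observers to reject.

    scores[i][p] is the score of observer i for presentation p.
    """
    n_obs = len(scores)
    n_pres = len(scores[0])
    above = [0] * n_obs  # count of scores far above the mean
    below = [0] * n_obs  # count of scores far below the mean

    for p in range(n_pres):
        column = [scores[i][p] for i in range(n_obs)]
        mean = sum(column) / n_obs
        m2 = sum((u - mean) ** 2 for u in column) / n_obs
        m4 = sum((u - mean) ** 4 for u in column) / n_obs
        if m2 == 0:
            continue  # all observers agree, nothing to flag
        std = math.sqrt(m2 * n_obs / (n_obs - 1))  # sample standard deviation
        beta2 = m4 / m2 ** 2  # kurtosis coefficient
        # Roughly normal scores use 2 sigma, otherwise sqrt(20) sigma.
        width = 2.0 if 2.0 <= beta2 <= 4.0 else math.sqrt(20.0)
        for i, u in enumerate(column):
            if u >= mean + width * std:
                above[i] += 1
            elif u <= mean - width * std:
                below[i] += 1

    rejected = []
    for i in range(n_obs):
        total = above[i] + below[i]
        if total == 0:
            continue
        # Reject only observers who are often AND inconsistently off.
        if total / n_pres > 0.05 and abs(above[i] - below[i]) / total < 0.3:
            rejected.append(i)
    return rejected
```

Note that an observer who is consistently higher or lower than the panel is kept by the second condition, since a systematic offset is a valid opinion rather than erratic scoring.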
5. Summary
Display technology is closely tied to all aspects of human society. As the technology continues to develop, the objective evaluation system alone can no longer accurately describe what viewers actually perceive. The mainstream evaluation approaches at this stage, such as HDTVtest and RTINGS, combine objective tests and subjective tests. Through in-depth study of the relevant standards, this paper summarizes a set of subjective evaluation methods suitable for high-quality TVs from six aspects: TV settings, viewing conditions, observers, pre-test training, test method, and result analysis.
6. Acknowledgements
The authors would like to gratefully acknowledge the support of the Guangdong Provincial Key R&D program (2020B0101030008: Development and industrial application of low environmental pollution quantum dot luminescent materials and devices).
7. References
[1] Sanders M S, McCormick E J. Human factors in engineering and design. Industrial Robot: An International Journal, 1998.
[2] Sherman K. How social media changes our thinking and learning. Language Teacher, 2013, 37(4): 9.
[3] Stevens S S. On the psychophysical law. Psychological Review, 1957, 64(3): 153-181.
[4] Daly S, Kunkel T, Sun X, et al. Viewer preferences for shadow, diffuse, specular, and emissive luminance limits of high dynamic range displays. SID Symposium Digest of Technical Papers, 2013, 44(1): 563-566.
[5] Kunkel T, Reinhard E. A reassessment of the simultaneous dynamic range of the human visual system. Proceedings of the 7th Symposium on Applied Perception in Graphics and Visualization. 2010: 17-24.
[6] Seetzen H, Li H, Ye L, et al. Observations of luminance, contrast and amplitude resolution of displays. SID Symposium Digest of Technical Papers. Oxford, UK: Blackwell Publishing Ltd, 2006, 37(1): 1229-1233.
[7] Wang L L, Tu Y, Mou T S, et al. Research Progress on Visual Perception and Vision Health for New Display Technologies. Optoelectronic Technology, 2021, 41(4): 246-253. (in Chinese)
[8] Pointer M R. The gamut of real surface colours. Color Research & Application, 1980, 5(3): 145-155.
[9] Recommendation ITU-R BT.500-14. Methodology for the subjective assessment of the quality of television pictures. 2019.
[10] Recommendation GB/T 7401. Method of subjective assessment of quality of colour TV pictures. 1987. (in Chinese)
[11] Recommendation GY/T 134. The method for the subjective assessment of the quality of digital television picture. 1998. (in Chinese)
[12] Recommendation SJ/T 11590. A subjective method for the evaluation of displaying quality of LED displays. 2016. (in Chinese)
[13] Recommendation T/CSMPTE 3. Subjective assessment method for image quality of ultra high-definition television. 2018. (in Chinese)
[14] Recommendation GY/T 340. Subjective assessment methods for image quality of ultra high-definition television: Double-stimulus continuous quality-scale. 2020. (in Chinese)
[15] Hamberg R, de Ridder H. Time-varying image quality: Modeling the relation between instantaneous and overall quality. SMPTE Journal, 1999, 108(11): 802-811.
[16] Pinson M H, Wolf S. Comparing subjective video quality testing methodologies. Visual Communications and Image Processing 2003. SPIE, 2003, 5150: 573-582.
Original article: Symp Digest of Tech Papers, 2022, Li, 13 2, Review of Subjective evaluation method of high-quality TV images
