MPEG-2 Video Compression: Intra and Inter Frames, Study Guides, Projects, Research of Electrical and Electronics Engineering

Covers the basics of MPEG-2 video compression, focusing on intra and inter frames. Intra-frames (I) are coded independently, while inter-frames (P and B) rely on predictions from other frames. Also discusses the role of reference frames, motion vectors, and quantization tables in the compression process.

Typology: Study Guides, Projects, Research

Pre 2010

Uploaded on 07/29/2009

ECE 434 – Multimedia Networks
Spring 2009
David Swiston
MPEG2 Computer Project
1. Introduction
The following report and accompanying code detail an investigation into the MPEG-2 compression
algorithm. The steps and theory involved in encoding and decoding the MPEG-2 bit stream will be
discussed. A MATLAB based simulation of the encoding and decoding process is provided along with
experimental results using the MATLAB simulation.
2. MPEG-2 Algorithm
1. Caveat
This report essentially ignores the construction and framing of the MPEG-2 bitstream. The project
need not concern itself with these because encoding and decoding are assumed to take place over a
direct communication line, so there is no need to packetize and frame the information. Ignoring this
aspect of MPEG-2 still allows the important concepts of MPEG-2 coding to be analyzed and discussed.
2. Introduction and History
The MPEG-2 algorithm is a lossy compression algorithm intended to remove spatial and temporal
redundancy in a video in order to ease its transmission and storage requirements. As a
simple example, a 1280 horizontal by 720 vertical pixel, 24 bit-per-pixel video at 24 frames per second
would require 530,841,600 bits per second of bandwidth to transmit, and a two-hour movie would require
477,757,440,000 bytes of storage.
\[ 1280\ \text{pixels} \times 720\ \text{pixels} \times 24\ \text{bits/pixel} \times 24\ \text{frames/s} = 530{,}841{,}600\ \text{bits/s} \]
\[ \frac{530{,}841{,}600\ \text{bits/s} \times 60\ \text{s/min} \times 60\ \text{min/hr} \times 2\ \text{hr}}{8\ \text{bits/byte}} = 477{,}757{,}440{,}000\ \text{bytes} \]
Given the very large requirements to transmit a digital video as described above, it is necessary to
compress the video to make storage and transmission practical.
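The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `raw_video_requirements` and its parameters are illustrative and are not part of the report's MATLAB code.

```python
def raw_video_requirements(width, height, bits_per_pixel, fps, duration_s):
    """Return (bits per second, total bytes) for uncompressed video."""
    bits_per_second = width * height * bits_per_pixel * fps
    total_bytes = bits_per_second * duration_s // 8
    return bits_per_second, total_bytes


# The example from the text: 1280x720, 24 bits/pixel, 24 frames/s, two hours.
bps, size_bytes = raw_video_requirements(1280, 720, 24, 24, 2 * 60 * 60)
print(bps)         # 530841600
print(size_bytes)  # 477757440000
```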
MPEG-2 is widely used in many applications and makes video transmissions such as the example
above feasible. Its development began in 1990, it was approved in November of 1994, and the
final standard was published in 1995. The initial application was as a standard for digital broadcast TV,
offering higher target bit rates than the MPEG-1 standard and also supporting interlaced video. It has
since been widely used in applications such as DVDs, ATSC, and DVB.
3. Colorspace
The MPEG-2 standard uses the YCbCr colorspace in order to take advantage of the Human Visual
System (HVS). The Y component in YCbCr represents the luminance information and Cb and Cr
represent chrominance differences. The eye is less sensitive to chrominance than it is to luminance.
MPEG-2 exploits this fact by allowing for chrominance subsampling. Subsampling schemes 4:4:4,
4:2:2, and 4:2:0 are used. The format 4:4:4 keeps all of the chrominance information, 4:2:2 keeps half
of the chrominance information by horizontally subsampling by a factor of two, and 4:2:0 keeps a
quarter of the chrominance information by subsampling both the horizontal and vertical dimensions by
two. By removing chrominance information, either of the latter two formats will automatically result in
a data reduction with little to no perceptible difference to the eye. The 4:2:0 format is the most
pervasive subsampling format. The MATLAB simulation and subsequent discussions assume the 4:2:0
format.
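As a rough illustration of 4:2:0 subsampling, the Python sketch below halves a chroma plane in both dimensions and restores it to full size. It makes two assumptions that the text does not confirm: the plane has even dimensions, and each 2x2 neighborhood is averaged when subsampling (the report does not say which filter the simulation uses). Upsampling is done by nearest-neighbor replication, which is the chroma upsampling method the report lists in its simulation settings. The function names are illustrative.

```python
def subsample_420(plane):
    """Halve a chroma plane (list of rows) in both dimensions.

    Assumption: each 2x2 neighborhood is replaced by its average.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[r][c] + plane[r][c + 1] + plane[r + 1][c] + plane[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def upsample_nearest(plane):
    """Double a chroma plane in both dimensions by repeating each sample."""
    out = []
    for row in plane:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


cb = [[10, 12, 20, 20],
      [14, 16, 20, 20]]
small = subsample_420(cb)
print(small)                    # [[13.0, 20.0]]
print(upsample_nearest(small))  # [[13.0, 13.0, 20.0, 20.0], [13.0, 13.0, 20.0, 20.0]]
```

The subsampled plane holds a quarter of the original samples, which is where the data reduction for each chrominance component comes from.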
  1. Inter/Intra Frames
     At its most basic level, the MPEG-2 bitstream is a sequence of compressed frames sent one after another. MPEG-2 defines three basic frame types: intra-frames (I-frames) and two kinds of inter-frames (P and B-frames). I-frames are independent images and are compressed without exploiting temporal redundancy. P and B-frames exploit temporal redundancy and hence are not independent images; they are coded using predictions from other frames. A P-frame is created by forward prediction, using a previously coded I or P-frame to predict the current frame's pixel values. A B-frame is created by bidirectional prediction, using a previously coded I or P-frame and a future coded I or P-frame to predict the current frame's pixel values. For reference purposes, the frame being predicted is known as the target frame and the frame or frames used in the prediction are known as reference frames.
     P and B-frames record the difference between the target frame and the prediction formed from its reference frame or frames. Since the frame rate of video is relatively high and scenes generally contain little movement, videos contain quite a bit of temporal redundancy. As a result, the difference images in P and B-frames consist mainly of small values and have low entropy, which is good for compression.
     The pattern of I, P, and B-frames in an MPEG-2 video sequence is chosen at encoding time and is embedded into the video's header. Because B-frames require both previous and future frames to be decoded, the coding and transmission order of frames can differ from the display order of a video. See the table below for an example.

     Frames
     Display Order                   I B B P B B P B B I
     Coding & Transmission Order     I P B B P B B I B B

     Note that the MATLAB simulation generates only I and P-frames, as this was all that was required in the project.
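The reordering in the table can be reproduced with a short Python sketch. It assumes a simple rule that is consistent with the example but is not taken from the report's code: every B-frame is held back until the next I or P-frame (its future reference) has been emitted. The function name is illustrative.

```python
def display_to_coding_order(display_order):
    """Reorder frame types from display order to coding/transmission order.

    Assumption: each B-frame is sent right after the next I or P-frame,
    because that frame is needed as its future reference.
    """
    coding, pending_b = [], []
    for frame in display_order:
        if frame == "B":
            pending_b.append(frame)      # wait for the future reference
        else:
            coding.append(frame)         # reference frame goes out first
            coding.extend(pending_b)     # then the B-frames that depend on it
            pending_b = []
    coding.extend(pending_b)             # trailing B-frames, if any, stay last
    return coding


display = "I B B P B B P B B I".split()
print(" ".join(display_to_coding_order(display)))  # I P B B P B B I B B
```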
  2. Block Based Coding
     MPEG-2 utilizes block based coding. Each frame is not coded as a whole but is broken into multiple macroblocks sized NxN. The default macroblock size for luminance images is N=16. When 4:2:0 chroma subsampling is used, N=8 for the chrominance images. Macroblocks are further divided into 8x8 blocks, giving 4 luminance blocks and 2 chrominance blocks (one Cb and one Cr) per macroblock. The block size provides a compromise between accuracy at low frequencies and the computational complexity of the Discrete Cosine Transform (to be discussed later).
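A minimal Python sketch of the block split described above: one 16x16 luminance macroblock is divided into four 8x8 blocks. The function name and the top-left, top-right, bottom-left, bottom-right ordering are assumptions made for illustration. With 4:2:0 subsampling, the two 8x8 chrominance blocks of the macroblock need no further splitting.

```python
def split_macroblock(macroblock, block_size=8):
    """Split a square macroblock (list of rows) into block_size x block_size blocks.

    Blocks are returned in raster order: top-left, top-right,
    bottom-left, bottom-right for a 16x16 macroblock.
    """
    n = len(macroblock)
    blocks = []
    for top in range(0, n, block_size):
        for left in range(0, n, block_size):
            blocks.append(
                [row[left:left + block_size] for row in macroblock[top:top + block_size]]
            )
    return blocks


# A 16x16 macroblock whose sample value encodes its own position.
mb = [[16 * r + c for c in range(16)] for r in range(16)]
luma_blocks = split_macroblock(mb)
print(len(luma_blocks))                         # 4
print([block[0][0] for block in luma_blocks])   # [0, 8, 128, 136]
```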
  3. Motion Compensation
     By using P and B frames, MPEG-2 attempts to exploit temporal redundancy. If frames do not contain quickly moving objects and the camera does not change rapidly between frames, the temporal redundancy is high and a simple frame difference has low entropy. However, this simple approach has difficulty achieving high compression ratios, because most of the differences between frames are due to camera and object movement. MPEG-2 therefore attempts to track motion between frames. If a good estimate of motion can be captured and used to predict the movement of pixels, the difference image can be calculated using the motion predictions, lowering entropy even further.
     To allow for motion prediction, MPEG-2 segments each frame into macroblocks of N by N pixels. The default macroblock size for luminance images is N=16. When 4:2:0 chroma subsampling is used, N=8 for the chrominance images. Motion prediction is performed at this macroblock level. Reference frames are searched, within a window centered on the position of the target frame macroblock, for the macroblock that most closely matches the macroblock in the target frame. The closest match is decided by minimizing the difference between macroblocks. The difference can be measured in many
  1. Zig-Zag Ordering
     After blocks have been transformed using the DCT and quantized, they are arranged into a vector by zig-zag ordering, shown below. Two types of ordering exist, one for progressive video and another for interlaced video. This report assumes progressive input, so interlaced ordering will be ignored.

     Progressive Zig-Zag Ordering
      1   2   6   7  15  16  28  29
      3   5   8  14  17  27  30  43
      4   9  13  18  26  31  42  44
     10  12  19  25  32  41  45  54
     11  20  24  33  40  46  53  55
     21  23  34  39  47  52  56  61
     22  35  38  48  51  57  60  62
     36  37  49  50  58  59  63  64

     This zig-zag ordering attempts to order the DCT coefficients from lowest to highest frequency. Natural scenes tend to contain more low frequency content than high frequency content. With zig-zag ordering, most of the low frequency components, which tend to be non-zero, are at the beginning of the vector. Likewise, the higher frequency terms, which tend to be zero, appear at the end of the vector. This result is exploited by run-length coding to further compress the bitstream.
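The progressive ordering in the table can be generated by walking the anti-diagonals of the block and alternating direction on each one. The Python sketch below is one way to do this; the function names are illustrative and it is not taken from the report's MATLAB code.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag order."""
    order = []
    for diagonal in range(2 * n - 1):            # diagonal index = row + col
        rows = range(max(0, diagonal - n + 1), min(diagonal, n - 1) + 1)
        if diagonal % 2 == 0:
            rows = reversed(rows)                # even diagonals run bottom-left to top-right
        order.extend((r, diagonal - r) for r in rows)
    return order


def zigzag_scan(block):
    """Flatten a square block (list of rows) into a zig-zag ordered vector."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# Rebuild the 1-based rank table from the text and print its first row.
rank = [[0] * 8 for _ in range(8)]
for index, (r, c) in enumerate(zigzag_order(), start=1):
    rank[r][c] = index
print(rank[0])  # [1, 2, 6, 7, 15, 16, 28, 29]
```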
  2. Lossless Compression
     After vectors have been formed by zig-zag ordering, two more compression steps remain. For intra frames, the DC coefficient of each block is differentially coded with respect to the previous block's DC term, and the AC coefficients are run-length encoded (RLE) to exploit the long runs of zeros resulting from the zig-zag ordering. The result is then coded using a variable length code. For inter frames, both the DC and AC coefficients are run-length encoded and then coded using a variable length code. Finally, the motion vectors computed for inter frames are differentially coded in the horizontal and vertical directions, and the result is again coded using a variable length code.
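The two steps that precede the variable length code can be sketched in Python as follows. This is a simplified illustration and not the MPEG-2 syntax: the AC coefficients become (zero run, value) pairs followed by a hypothetical `"EOB"` end-of-block marker, and the DC predictor is assumed to start at 0. The variable length coding stage itself is omitted.

```python
EOB = "EOB"  # hypothetical end-of-block marker used only in this sketch


def run_length_encode(ac_coefficients):
    """Encode a zig-zag ordered AC vector as (zero_run, value) pairs plus EOB."""
    pairs, run = [], 0
    for value in ac_coefficients:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append(EOB)          # trailing zeros are implied by the marker
    return pairs


def run_length_decode(pairs, length):
    """Invert run_length_encode for a vector of the given length."""
    out = []
    for pair in pairs:
        if pair == EOB:
            break
        run, value = pair
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * (length - len(out)))
    return out


def dc_differences(dc_terms, predictor=0):
    """Differentially code DC terms against the previous block's DC term."""
    diffs = []
    for dc in dc_terms:
        diffs.append(dc - predictor)
        predictor = dc
    return diffs


ac = [5, 0, 0, -3, 1] + [0] * 58
encoded = run_length_encode(ac)
print(encoded)                            # [(0, 5), (2, -3), (0, 1), 'EOB']
print(run_length_decode(encoded, 63) == ac)  # True
print(dc_differences([100, 104, 101]))    # [100, 4, -3]
```

The 63-entry AC vector collapses to three pairs and a marker, which is why the long zero runs produced by zig-zag ordering matter.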

3. Source

The source video frames for this experiment come from the movie Wall-E. They were captured with VLC Media Player 0.9.4 using its image video output option, which writes each decoded frame to a lossless PNG file instead of displaying it on the computer screen. The output frames were decimated from 1280 horizontal by 528 vertical pixels to 512 horizontal by 211 (cropped to 208) vertical pixels. The frame rate was also decimated by three, from 23.972615 frames per second to 7.99087 frames per second. The decimation in the spatial and temporal dimensions was done to ease computational and memory requirements while working in MATLAB, and to allow VLC to perform the frame capture without dropping any frames due to my computer hardware limitations.

4. Matlab Simulation

The MPEG2 based encoder and decoder are implemented in MATLAB. The simulation does not produce an exact MPEG bitstream, but it does encompass all of the important features, including colorspace conversion, chroma subsampling, inter/intra framing, block based coding, motion compensation, 2D DCT, DCT coefficient quantization, zig-zag reordering, and run-length and variable length coding. The following table details the default settings used. They were chosen and not varied as they are widely used defaults in MPEG.

Chroma Subsampling    4:2:0
Macroblock Size       16
Block Size            8
Chroma Upsampling     Nearest Neighbor Interpolation

Code execution is structured as follows:

  1. Retrieve a frame (PNG file)
  2. Convert from RGB to YIQ
  3. Chroma Subsample I and Q data using 4:2:0
  4. If working on a P frame, perform motion vector search over all the macroblocks
  5. Encode the frame
     a. Loop over all macroblocks
        1. Encode six blocks using the 8x8 2D DCT
        2. If frame is P frame, code difference between current frame and previously decoded frame
        3. Quantize each of the six blocks
        4. Zig-Zag order each of the six blocks
        5. Perform run-length encoding
     b. Give a bit size estimate of the frame based on the entropy of the coded frame
  6. Decode the frame
     a. Loop over all macroblocks
        1. Perform run-length decoding
        2. Reverse Zig-Zag ordering
        3. Invert the quantization step
        4. Decode six blocks using the 8x8 2D IDCT
        5. If frame is P frame, perform motion compensation
  7. Upsample the chroma I and Q components
  8. Convert from YIQ to RGB
  9. Record decoded frame

The simulation allows many parameters of the encoding process to be configured, including the frame sequence pattern (currently [I P P P I P P P ...], but easily changeable), the motion compensation search method (sequential, logarithmic, or hierarchical) with a configurable search window, the video source, the SAD limit (used to determine if a motion vector should be used or discarded), the quantization tables (separate for I and P frames), and the scaling factor used in quantization.
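Steps 2 and 8 of the list above convert between RGB and YIQ. The Python sketch below shows that round trip for a single 8-bit pixel using the standard library's `colorsys` module. This is an assumption-laden stand-in: `colorsys` uses its own coefficient set, which may differ slightly from the conversion used in the MATLAB simulation, and the helper names `rgb8_to_yiq` and `yiq_to_rgb8` are illustrative.

```python
import colorsys


def rgb8_to_yiq(pixel):
    """Convert an (R, G, B) pixel with 8-bit components to (Y, I, Q) floats."""
    r, g, b = (component / 255.0 for component in pixel)
    return colorsys.rgb_to_yiq(r, g, b)


def yiq_to_rgb8(yiq):
    """Convert (Y, I, Q) floats back to an 8-bit (R, G, B) pixel."""
    r, g, b = colorsys.yiq_to_rgb(*yiq)
    return tuple(int(round(255 * component)) for component in (r, g, b))


pixel = (51, 128, 204)
y, i, q = rgb8_to_yiq(pixel)
print(yiq_to_rgb8((y, i, q)))  # (51, 128, 204)
```

Only the Y plane is kept at full resolution in the pipeline; the I and Q planes are the ones that are subsampled in step 3 and upsampled again in step 7.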

[Figures: Original, Decoded, and Difference images for each frame (panels labeled Frame 2, Frame 3, Frame 4, Frame 7, and Frame 8)]

6. Experiments

  1. Motion Vector Search Method
     A motion vector search can be computationally intensive. The formulas below give the pixel comparisons per motion vector required for each method, where p is the maximum horizontal and vertical displacement and N is the macroblock size. A sequential search will always find the motion vector that yields the lowest possible error; however, it is very slow. The logarithmic and hierarchical methods are quicker but do not guarantee that the motion vector with the lowest resulting prediction error will be found.

     Motion Vector Search    Pixel comparisons per motion vector
     Sequential              \( (2p+1)^2 \, N^2 \)
     Logarithmic             \( 9 \left( \lceil \log_2 p \rceil + 1 \right) N^2 \)
     Hierarchical            \( \left( 2 \lceil p/4 \rceil + 1 \right)^2 \left( N/4 \right)^2 + 9 \left( N/2 \right)^2 + 9 N^2 \)

     Using the MATLAB model, all three search methods were profiled for runtime. Note that the hierarchical search method can use any type of search method to find the initial estimated motion vector. In this simulation, the initial search was done using a sequential search. The following tables provide the average runtime of the motion vector search for the first 10 frames and the MSE of the resulting frames for each search method. A scale factor of 18 was used.

     Hierarchical Search
     Frame   Runtime (seconds)   MSE Red     MSE Green   MSE Blue
     1       N/A (I Frame)       19.322050   16.198571   20.
     2       0.5841              9.084407    18.942899   12.
     3       0.5465              10.910673   23.888249   16.
     4       0.5401              14.708402   24.684232   18.
     5       N/A (I Frame)       15.947172   11.643376   16.
     6       0.5460              11.343562   20.374408   14.
     7       0.5449              14.893874   25.406973   18.
     8       0.5430              17.093994   28.643874   22.
     9       N/A (I Frame)       14.310519   10.771259   16.
     10      0.5421              8.011503    17.069636   12.

     Logarithmic
     Frame   Runtime (seconds)   MSE Red     MSE Green   MSE Blue
     1       N/A (I Frame)       19.322050   16.198571   20.
     2       0.4092              8.938308    18.562040   12.
     3       0.3917              10.980027   23.842989   16.
     4       0.3925              15.429133   25.150879   19.
     5       N/A (I Frame)       15.947172   11.643376   16.
     6       0.3942              10.992563   19.638616   14.
     7       0.4111              14.297307   23.897611   18.
     8       0.3855              16.308415   27.130587   20.
     9       N/A (I Frame)       14.310519   10.771259   16.
     10      0.3923              7.531550    16.487849   12.

     Sequential
     Frame   Runtime   MSE
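To make the sequential method concrete, the Python sketch below performs an exhaustive search over a (2p+1) by (2p+1) window for one macroblock, using the sum of absolute differences (SAD) as the matching cost. It is a sketch under stated assumptions, not the report's MATLAB implementation: the function names, the `sad_limit` parameter, and the toy frame contents are all illustrative. The optional limit mirrors the simulation's SAD limit by flagging whether the best match is good enough to be used.

```python
def sad(target, ref, top, left, ref_top, ref_left, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(
        abs(target[top + i][left + j] - ref[ref_top + i][ref_left + j])
        for i in range(n)
        for j in range(n)
    )


def sequential_search(target, ref, top, left, n=16, p=7, sad_limit=None):
    """Exhaustively search ref within +/-p pixels for the block at (top, left).

    Returns ((dy, dx), best_sad, use_prediction). Candidate positions that
    fall outside the reference frame are skipped (an assumption of this sketch).
    """
    height, width = len(ref), len(ref[0])
    best_vector, best_sad = (0, 0), None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + n > height or c + n > width:
                continue
            cost = sad(target, ref, top, left, r, c, n)
            if best_sad is None or cost < best_sad:
                best_vector, best_sad = (dy, dx), cost
    use_prediction = sad_limit is None or best_sad < sad_limit
    return best_vector, best_sad, use_prediction


# Toy data: the target block at (8, 8) is a copy of the reference block at (10, 5).
size, n, p = 32, 8, 4
ref = [[(3 * r * r + 5 * c * c + 7 * r * c) % 256 for c in range(size)] for r in range(size)]
target = [[0] * size for _ in range(size)]
for i in range(n):
    for j in range(n):
        target[8 + i][8 + j] = ref[10 + i][5 + j]

vector, cost, use_prediction = sequential_search(target, ref, 8, 8, n=n, p=p, sad_limit=500)
print(vector, cost, use_prediction)  # (2, -3) 0 True
```

Each of the up to (2p+1)^2 candidate positions costs N^2 pixel comparisons, which is why the exhaustive search is the slowest of the three methods while still guaranteeing the minimum SAD inside the window.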

[Figure: Motion Vector Prediction Quiver Plot (From/To), window size = 64. MSE: Red 7.75, Green 17.15, Blue 11. Bits: 0.8630e+]
[Figure: Decoded Source Frame (Frame #2) and Decoded Predicted Frame (Frame #3)]
[Figure: Motion Vector Prediction Quiver Plot (From/To), window size = 64. MSE: Red 8.90, Green 21.09, Blue 13. Bits: 1.1999e+]

  1. Motion Vector Error Limits
     The video used in this project contains quickly moving objects. This is partly because the frame rate was decimated by three, which increases the magnitude of a moving object's frame-to-frame motion vector. As a result, the motion vector prediction does not always follow or represent an object's true motion, because the macroblock being searched for in a future frame lies outside of the search window. This forces the search mechanism to settle on a macroblock within the search window that may not truly be a good match. This is especially problematic for the chrominance information: the predictions are based only on the luminance image, and when motion is not tracked correctly, the chrominance components suffer.
     To see the effects of this, the option to include a SAD limit was added to the simulation. When the option is enabled, if no motion vector can be found for a macroblock that results in a SAD less than the limit, that macroblock is coded in the same manner as an I frame macroblock would be, using no motion vector or prediction and utilizing the I frame quantization matrix.
     Below are three quiver plots created from a P frame's motion vector estimates. In the first quiver plot, one can notice the failure of the motion vector search to properly track the object's actual motion in the frame. In the second and third plots, the motion estimates with a SAD above the limit are thrown out and appear as [0,0] vectors in the quiver plot. The MSE values and entropy (estimated bits required) are provided for each configuration. The results show that limiting the allowable SAD when performing the motion vector search can increase image quality at the expense of compression.

     [Figures: Initial I Frame; P Frame Being Predicted; Initial Prediction Without SAD Limits]

[Figure panels: the original frame and the frame coded at each quantization scale, with MSE and estimated bits]
Original    MSE: R 0.00,  G 0.00,  B 0.00     Bits: 2.5559e+
Scale =     MSE: R 0.84,  G 0.56,  B 1.26     Bits: 2.9023e+
Scale =     MSE: R 9.91,  G 7.81,  B 10.99    Bits: 7.8123e+
Scale =     MSE: R 17.56, G 14.88, B 18.47    Bits: 4.6190e+
Scale =     MSE: R 28.21, G 23.32, B 27.29    Bits: 2.6768e+
Scale =     MSE: R 37.07, G 33.83, B 38.52    Bits: 1.5557e+
Scale =     MSE: R 50.09, G 40.82, B 52.87    Bits: 1.0163e+
  1. Transform Type
     As an investigation into the benefits of the DCT versus other transforms, the coding step was changed to use the Discrete Fourier Transform (implemented using the FFT). The DCT is normally preferred over transforms like the DFT for its efficient energy compaction, and this behavior was verified in the experiment: at equivalent MSEs, an 8x8 FFT required more bits than the 8x8 DCT when encoding an I frame. The table below details the results. It is worth noting that, subjectively, the frame coded using the FFT is of worse quality than the one coded using the DCT, even though the MSE values are comparable.

     Transform   MSE Red   MSE Green   MSE Blue   Bits
     FFT         19.36     15.45       18.        8.7026e+
     DCT         19.32     16.19       20.        4.2134e+
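The energy compaction argument can be illustrated on a small scale with the Python sketch below. It is a one-dimensional, 8-point toy comparison on a linear ramp, not a reproduction of the 8x8 experiment, and the function names are illustrative. Both transforms are scaled to preserve energy, and the sketch reports the fraction of the signal energy captured by the two largest coefficients of each.

```python
import cmath
import math


def dct_ii(x):
    """Orthonormal 1D DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(
            scale * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        )
    return out


def dft(x):
    """Unitary 1D DFT (direct O(n^2) evaluation) of a real sequence."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)) / math.sqrt(n)
        for k in range(n)
    ]


def energy_in_largest(coefficients, count):
    """Fraction of total energy held by the `count` largest coefficients."""
    energies = sorted((abs(c) ** 2 for c in coefficients), reverse=True)
    return sum(energies[:count]) / sum(energies)


ramp = [float(i) for i in range(8)]
dct_fraction = energy_in_largest(dct_ii(ramp), 2)
dft_fraction = energy_in_largest(dft(ramp), 2)
print(round(dct_fraction, 3), round(dft_fraction, 3))
```

For this ramp, the two largest DCT coefficients hold about 99.6% of the energy, while the two largest DFT coefficients hold roughly 80%. The DFT implicitly treats the block as periodic, so the jump from the last sample back to the first spreads energy into the higher frequencies, which is consistent with the FFT needing more bits in the experiment above.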
  2. Difference Image
     A further experiment was performed by throwing out the coded difference images, in order to show their importance. In this configuration, the only sources of information for P and B frames are the previous frame and the motion vectors. The results in the following table show that in the presence of very quick motion, where the motion vectors do not represent the true motion of an object, the difference frame is absolutely necessary for satisfactory results. With predictable motion, the difference image is much less important, but the output frames without the difference can still look objectionably blocky in places. Since motion estimation is done in 16x16 blocks and only in the two dimensions available in the frame, simply disregarding the difference image is not viable except in the most predictable circumstances.

     [Figures: Original; Previous Frame; Predicted, No Difference Image]