Nyquist Sampling Theorem - Midterm Exam | CS 598, Exams of Computer Science

Material Type: Exam; Class: Human-in-the-loop Data Mgnt; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Unknown 1989;


Uploaded on 03/11/2009




CS598KN Midterm Solutions

Audio:

1: The Nyquist Sampling Theorem states that “If a signal is sampled at a rate higher than twice the highest significant signal frequency, then the samples contain all the information of the original signal.” To fully reconstruct signals audible to the human ear (frequencies of up to 20 kHz), a sampling rate of at least 2 × 20 kHz = 40 kHz is needed. CD audio is sampled at 44.1 kHz, slightly more than twice the highest audible frequency, to make up for possible imprecisions.
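As a small illustration of why the rate matters, the sketch below computes the frequency at which a pure tone appears after sampling. The function name `apparent_frequency` is hypothetical, and the model assumes an ideal sampler with no anti-aliasing filter.

```python
def apparent_frequency(tone_hz, sample_rate_hz):
    """Frequency at which a pure tone shows up after ideal sampling."""
    # A sampled tone is indistinguishable from one shifted by any
    # multiple of the sampling rate, so fold it onto the nearest multiple.
    nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
    return abs(tone_hz - nearest_multiple)


in_band = apparent_frequency(15_000, 44_100)   # below half the rate: preserved
aliased = apparent_frequency(25_000, 40_000)   # above half the rate: folds down
```

A 15 kHz tone sampled at 44.1 kHz keeps its frequency, while a 25 kHz tone sampled at 40 kHz is misread as 15 kHz.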

2: Hummed songs are converted into a string over the characters U, D, and S (Up, Down, and Same) that represents the sequence of relative differences in pitch. Songs in the audio database are pre-processed in the same way, which converts the problem of audio matching into one of string matching. The string-matching algorithm allows up to k mismatches.
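The conversion and matching can be sketched as follows. This is a minimal illustration, not the course's algorithm: the `tolerance` parameter, the function names, and the brute-force mismatch count (substitutions only, no insertions or deletions) are all assumptions.

```python
def contour(pitches, tolerance=0.5):
    """Map a sequence of pitch values to a string over U, D and S."""
    symbols = []
    for previous, current in zip(pitches, pitches[1:]):
        if current - previous > tolerance:
            symbols.append("U")
        elif previous - current > tolerance:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


def find_with_mismatches(text, pattern, k):
    """Start offsets where pattern matches text with at most k mismatches."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        # zip stops at the end of the pattern, so only the window is compared
        mismatches = sum(1 for a, b in zip(text[start:], pattern) if a != b)
        if mismatches <= k:
            hits.append(start)
    return hits
```

For example, `contour([60, 62, 62, 59])` gives `"USD"`, and `find_with_mismatches("UUDSUD", "UDD", 1)` reports offsets 0 and 1.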

3: The psycho-acoustic model attempts to account for how humans actually perceive sound. As such, it translates the physical properties of sound (frequency, level, and duration) into measures perceived by humans (critical-band rate, loudness, and subjective duration). The critical-band-rate scale follows a linear frequency scale up to 500 Hz, and a logarithmic frequency scale above 500 Hz. The masking effect, which can occur in both the frequency and time domains, is related to the model because it explains how we hear: certain sounds mask others, and the masked sounds therefore become acoustically irrelevant.

4: MPEG audio compression makes heavy use of the psycho-acoustic model to remove the acoustically irrelevant part of the audio signal, in order to achieve a high compression rate. In MPEG audio compression, the audio signal is first divided into 32 frequency sub-bands. Then the psycho-acoustic model analyzes the amount of masking for each sub-band: if the energy in a band is below the masking threshold, it is not encoded. Otherwise, it is allocated a number of bits to represent the coefficient. Bit allocation is determined by the signal-to-mask ratio (SMR, the ratio of the signal energy to the minimum masking threshold): sub-bands with large SMRs are allocated more bits, because their quantization noise must be pushed further down to stay below the masking threshold, and those with small SMRs are allocated fewer bits.

5: Knowledge of the human auditory system (HAS) is heavily used in MPEG audio compression and immersive audio systems. In MPEG audio compression, the psycho-acoustic model is used to identify sound that is inaudible due to masking effects, and to allocate bits judiciously based on the signal-to-mask ratio. The purpose of the psycho-acoustic model is to achieve a high compression rate while preserving the audible quality as much as possible.

In immersive audio systems, the key observation is that humans perceive sound based on multiple cues, including level and time differences and direction-dependent frequency-response effects caused by sound reflection in the outer ear, head, etc. (collectively referred to as the head-related transfer function, HRTF). An immersive audio system tries to include these cues seamlessly (by placing loudspeakers properly, for example) to best achieve real 3D sound effects.

6: Autocorrelation isolates and tracks the peak energy levels of the signal, and is thus a good measure of pitch. Its drawbacks are that it is subject to aliasing (picking an integer multiple of the actual pitch) and that it is computationally expensive. Two alternatives exist: Maximum Likelihood and Cepstrum Analysis. Unfortunately, the computational complexity of Maximum Likelihood is much higher than that of autocorrelation, and Cepstrum Analysis does not give very accurate results for humming.
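A minimal sketch of autocorrelation-based pitch estimation is shown below. The function name, the `fmin`/`fmax` search range, and the test tone are assumptions made for illustration; the nested summation also makes the computational cost mentioned above visible.

```python
import math


def autocorrelation_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate pitch in Hz from the lag with the largest autocorrelation."""
    # Restricting the lag range to plausible pitches limits the chance of
    # locking onto a multiple of the true period.
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    if min_lag > max_lag:
        raise ValueError("signal too short for the requested pitch range")

    def score(lag):
        # Unnormalized autocorrelation at this lag: one pass over the signal
        return sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=score)
    return sample_rate / best_lag


# Assumed test signal: 0.1 s of a 200 Hz sine sampled at 8 kHz
tone = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(800)]
estimate = autocorrelation_pitch(tone, 8000)
```

For the clean sine above the estimate should land close to 200 Hz; real humming is noisier, so a practical detector would normally add normalization and smoothing.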

7: Training in an audio database is done by computing the N-vector (denoted “a”) for each sound entered into the database. When the user supplies a set of example sounds for training, the mean vector and the covariance matrix of the “a” vectors in each class are calculated and stored as the system’s model of the perceptual property being trained by the user. In audio retrieval, the sounds in the database that are closest to the sample sound, in terms of the distance between their “a” vectors, are retrieved. For small audio databases, brute-force search can be used, but for large databases the sounds need to be indexed by their acoustic features to allow faster retrieval.
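The training and brute-force retrieval steps might look like the sketch below. The function names are hypothetical, and plain Euclidean distance is an assumption; the answer above only says "distance", and a real system may weight the distance by the stored covariance.

```python
import math


def class_model(vectors):
    """Mean vector and sample covariance matrix of a class (needs >= 2 vectors)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / (n - 1)
            for j in range(dim)]
           for i in range(dim)]
    return mean, cov


def nearest(database, query, top=3):
    """Brute-force retrieval: names of the `top` sounds closest to the query."""
    # Assumption: Euclidean distance between "a" vectors
    ranked = sorted(database, key=lambda name: math.dist(database[name], query))
    return ranked[:top]
```

Here `database` is assumed to be a dictionary from sound name to its "a" vector. The linear scan in `nearest` is the brute-force case; a large database would replace it with an index over the feature vectors.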

8: Simple algorithm:

  1. The speaker goes through a training phase, in which a structured collection of audio training data and information-bearing audio messages is created.
  2. Create a model from the training data using a hidden Markov model (HMM), a statistical representation of a speech event such as a word. This model is created offline.
  3. All training data are verified and transcribed at the word level. Non-speech events and disfluencies such as partially spoken words, pauses, and hesitations can be transcribed in accordance with the Wall Street Journal data collection procedure, and phonetic transcriptions can be generated automatically from a machine-readable version of the Oxford Learners Dictionary.
  4. The result is a lattice (HMM) that includes sound and transcribed text elements. When a speaker says a word, the lattice is searched for the corresponding transcribed text.

and mono surround) into two channels. In the 1990s, Dolby Stereo Digital (SR-D) was invented, which eliminated matrix-based encoding and decoding and provided five discrete channels (left, center, right, and independent left and right surround) in a configuration known as stereo surround. A sixth channel, the LFE channel, was introduced to add more headroom and prevent the main speakers from overloading at low frequencies. (Describing either one gets credit.)

Challenges: Human hearing is based on time and level differences and the HRTF. Seamlessly incorporating these cues in immersive audio systems (e.g., through proper placement of speakers to account for the time and level differences perceived by the human ear) is a challenging issue.

Video:

1: Two common features: (1) both use DCT coding; (2) both use motion vectors. Two differences: (1) MPEG-2 uses block-based coding while MPEG-4 uses object-based coding; (2) MPEG-4 supports the combination of different objects (graphics, animation, and natural objects), while MPEG-2 does not.

2: MPEG audio compression is based on two mechanisms: (1) the psycho-acoustic model (lossy compression; however, the loss is undetectable by human ears because only the acoustically irrelevant part is eliminated); (2) entropy encoding (lossless encoding, using Huffman coding to reduce the number of bits). MPEG video compression considers three types of frames: I (self-contained frames), P (encoded relative to the past reference frame), and B (encoded relative to the past and future reference frames). The encoding of I frames is similar to that of JPEG: the image is divided into 8×8 blocks; each block is transformed from the spatial domain into the frequency domain using the DCT; the coefficients are then quantized (lossy) and run-length encoded in zigzag order to optimize compression. The encoding of P and B frames additionally uses motion vectors, which relate the current image to the reference frames through motion compensation.
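The zigzag scan and run-length step for an already quantized block can be sketched as below. This is a simplified illustration: the `(zero_run, value)` pair format and the `"EOB"` end marker are assumptions, the DC coefficient is treated like any other coefficient, and the DCT, quantization, and Huffman stages are omitted.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zigzag scan order."""
    def key(rc):
        row, col = rc
        diagonal = row + col
        # Odd diagonals run top-right to bottom-left, even ones the other way
        return (diagonal, row if diagonal % 2 else col)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def run_length_encode(block):
    """Encode a quantized block as (zero_run, value) pairs plus an end marker."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        value = block[r][c]
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the end-of-block marker
    return pairs
```

Because quantization zeroes most high-frequency coefficients, the zigzag order groups them at the end of the scan, where a single end marker replaces them.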

3: Since perceptual quality should not deteriorate, only the least significant bits should be used for storing watermark bits. Suppose we use only the LSB (one bit) of each pixel to store the watermark; it could be computed by some function of the values of the other bits in the current pixel. (For covertness, some pseudo-randomness can also be employed.)
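One possible instance of this idea is sketched below for 8-bit grayscale pixels. The choice of function (parity of the upper seven bits) is an assumption picked for simplicity, and the pseudo-random element mentioned above is left out.

```python
def upper_bits_parity(pixel):
    """Parity (0 or 1) of the seven most significant bits of an 8-bit pixel."""
    return bin(pixel >> 1).count("1") & 1


def embed(pixels):
    """Overwrite each pixel's LSB with the parity of its upper seven bits."""
    return [(p & 0xFE) | upper_bits_parity(p) for p in pixels]


def verify(pixels):
    """True if every pixel's LSB still matches the parity of its upper bits."""
    return all((p & 1) == upper_bits_parity(p) for p in pixels)
```

Embedding changes each pixel value by at most 1, so the visual impact is minimal, while flipping any single upper bit of a marked pixel makes `verify` fail.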

4: R*-trees provide the notion of bounding boxes for clustering data with small distances. The insertion and deletion operations use the bounding boxes from the nodes to ensure that nearby elements are placed in the same leaf node, and the searching operation uses the bounding boxes to decide whether or not to search inside a child node. Such a structure makes retrieval fast because only overlapped bounding boxes (instead of the whole space) need to be searched. In image retrieval, if a parameter value is provided, only those images within the overlapped bounding boxes are searched.
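The search step alone can be sketched as follows. The node layout (dictionaries with `box`, `children`, and `item` keys) is hypothetical, and the R*-tree insertion, split, and reinsertion heuristics are not shown.

```python
def overlaps(a, b):
    """True if axis-aligned boxes a and b, each given as (mins, maxs), intersect."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for lo1, hi1, lo2, hi2 in zip(a[0], a[1], b[0], b[1]))


def search(node, query, found=None):
    """Collect items whose boxes overlap the query, skipping disjoint subtrees."""
    found = [] if found is None else found
    if not overlaps(node["box"], query):
        return found  # the whole subtree is pruned without visiting its children
    if "item" in node:
        found.append(node["item"])
    for child in node.get("children", []):
        search(child, query, found)
    return found
```

The early return is where the speed-up comes from: a child whose bounding box misses the query is never descended into.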

5: Histogram Backprojection answers the question “Where in the image are the colors that belong to the object being looked for (the target)?” It is important because colors that also appear in objects other than the target are de-emphasized, so they are less likely to distract the search mechanism. This helps to locate a target in a crowded image.
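A minimal sketch of the backprojection step is given below, assuming the usual ratio-histogram formulation in which each pixel is replaced by min(model count / image count, 1) for its color bin. The image is assumed to be already quantized to color-bin indices, and the later smoothing and peak-finding steps are omitted.

```python
from collections import Counter


def backproject(image_bins, model_hist):
    """Replace each pixel's color bin by min(model / image, 1) for that bin."""
    image_hist = Counter(b for row in image_bins for b in row)
    # Colors that are common in the image but rare in the target get small
    # ratios, which is how distracting colors are de-emphasized.
    ratio = {b: min(model_hist.get(b, 0) / count, 1.0)
             for b, count in image_hist.items()}
    return [[ratio[b] for b in row] for row in image_bins]
```

High values in the returned map mark pixels whose color is characteristic of the target.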

6: The color histogram of an image represents the frequency of different colors in the image. Two images with similar color histograms are likely to show the same object. Since histograms are invariant to translation and rotation, and change slowly under changes in angle of view, changes in scale, and occlusion, they can be used for identifying objects. The Histogram Intersection algorithm is used to identify matches. Given a pair of histograms, the intersection of the two histograms is computed and then normalized. A high normalized intersection value indicates a high likelihood that the two objects are the same.
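As a sketch, the normalized intersection can be computed as the sum of bin-wise minima divided by the total count of the model histogram. Normalizing by the model histogram is an assumption here; the answer above only says the intersection is "normalized".

```python
def histogram_intersection(image_hist, model_hist):
    """Normalized intersection of two histograms with the same bins."""
    overlap = sum(min(i, m) for i, m in zip(image_hist, model_hist))
    return overlap / sum(model_hist)


score = histogram_intersection([4, 2, 0, 2], [2, 2, 2, 2])  # 6 / 8
```

The score ranges from 0 (no colors in common) to 1 (every model color is fully present in the image).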

7: Advantages: (1) avoids having to uncompress and recompress the video for manipulation; (2) provides flexibility to accommodate dynamic resources and heterogeneous QoS requirements. Disadvantages: (1) degrades the video quality; (2) incurs variable-rate output (unpredictable throughput).

8: As the number of dimensions grows, the area/volume of the overlap of the bounding boxes increases very quickly. This makes the search space (which is the area/volume of the overlaps) large as well. For high dimensions (e.g., 10 or more), the search space can exceed 90% of the total space, which makes the search complexity almost linear (i.e., close to a sequential scan).

There are two ways to deal with high-dimensional feature spaces: (1) replace the high-dimensional feature space with a lower-dimensional one by means of, for example, the K-L (principal component) transform; (2) narrow the search space for each query. An algorithm based on hierarchical Self-Organization Maps (SOM) has been described in the paper “A Scheme for Visual Feature based Image Indexing” by Zhang and Zhong.

9: Algorithm to detect a scene change in video:

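  1. Detect the change between each pair of consecutive frames, i.e., do a pair-wise comparison.
  2. For each pixel position (k, l), compute DP_i(k, l) = 1 if |P_i(k, l) - P_{i+1}(k, l)| > t, and 0 otherwise, where P_i(k, l) is the pixel value in frame i and t is a pixel-difference threshold.
  3. Count the number of pixels that changed from one frame to the next, i.e., sum DP_i(k, l) over all (k, l).
  4. Compare the count against a boundary threshold: if the count is larger than the threshold, declare a scene boundary.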
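The steps above can be sketched as follows, assuming grayscale frames stored as lists of rows of pixel values. The function names and the `boundary_count` parameter name are hypothetical.

```python
def changed_pixels(frame_a, frame_b, t):
    """Number of pixel positions whose values differ by more than t."""
    return sum(1
               for row_a, row_b in zip(frame_a, frame_b)
               for a, b in zip(row_a, row_b)
               if abs(a - b) > t)


def scene_boundaries(frames, t, boundary_count):
    """Indices of frames that start a new scene."""
    return [i + 1
            for i in range(len(frames) - 1)
            if changed_pixels(frames[i], frames[i + 1], t) > boundary_count]
```

Both thresholds have to be tuned: `t` sets how large a change in a single pixel must be to count, and `boundary_count` sets how many changed pixels make a cut.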

Challenges in scene change detection: