Problem

In speech analysis, synthesis, and coding, the speech signal is commonly modeled over a short time interval as the response of an LTI system driven by an excitation that switches between a train of equally spaced pulses for voiced sounds and a wideband random noise source for unvoiced sounds. To use homomorphic deconvolution to separate the components of the speech model, the speech signal s[n] = v[n] ∗ p[n] is multiplied by a window sequence w[n] to obtain x[n] = s[n]w[n]. To simplify the analysis, x[n] is approximated by

x[n] = (v[n] ∗ p[n]) · w[n] ≈ v[n] ∗ (p[n] · w[n]) = v[n] ∗ p_w[n]

where p_w[n] = p[n]w[n], as in Eq. (13.123).
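
The quality of this approximation can be checked numerically. The following is a minimal sketch, not part of the original problem: the helper names, the sequence lengths, the pulse period, and the choice of a Hamming window with a short decaying-exponential v[n] are all illustrative assumptions. It measures the relative energy of the difference between (v[n] ∗ p[n]) · w[n] and v[n] ∗ p_w[n] over the window support.

```python
import math

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def approximation_error(v, p, w):
    """Relative energy of (v*p).w - v*(p.w), evaluated over the window support."""
    n = len(w)
    s = convolve(v, p)[:n]                    # s[n] = v[n] * p[n]
    exact = [s[k] * w[k] for k in range(n)]   # x[n] = s[n] w[n]
    p_w = [p[k] * w[k] for k in range(n)]     # p_w[n] = p[n] w[n]
    approx = convolve(v, p_w)[:n]             # v[n] * p_w[n]
    err = sum((e - a) ** 2 for e, a in zip(exact, approx))
    return err / sum(e ** 2 for e in exact)

# Illustrative values only (assumptions, not taken from the text).
N, PERIOD = 400, 50
p = [1.0 if k % PERIOD == 0 else 0.0 for k in range(N)]                    # pulse train
w = [0.54 - 0.46 * math.cos(2 * math.pi * k / (N - 1)) for k in range(N)]  # Hamming window
v = [0.5 ** k for k in range(20)]   # impulse response much shorter than the window

err = approximation_error(v, p, w)
print(f"relative error energy: {err:.2e}")
```

Because the window changes very little over the effective duration of this v[n], the reported error is small. The same function can be used to try other choices of p[n], v[n], and w[n] when working part (a).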

(a) Give an example of p[n], v[n], and w[n] for which the above assumption may be a poor approximation.

(b) One approach to estimating the excitation parameters (voiced/unvoiced decision and pulse spacing for voiced speech) is to compute the real cepstrum c_x[n] of the windowed segment of speech x[n] as depicted in Figure P13.29-1. For the model of Section 13.10.1, express c_x[n] in terms of the complex cepstrum x̂[n]. How would you use c_x[n] to estimate the excitation parameters?
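
As an aid to experimenting with part (b), here is a minimal sketch of a real-cepstrum computation. It assumes that Figure P13.29-1 is the usual chain of DFT, log magnitude, and inverse DFT (the figure itself is not reproduced here). The function names, the synthetic segment, the quefrency search range, and the threshold are hypothetical choices made for illustration, and peak picking is only one possible way of using c_x[n].

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT; adequate for a short illustrative segment."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def real_cepstrum(x, nfft):
    """Inverse DFT of log|X[k]| for the zero-padded segment x."""
    padded = list(x) + [0.0] * (nfft - len(x))
    log_mag = [math.log(abs(X) + 1e-12) for X in dft(padded)]  # small floor avoids log(0)
    # log|X[k]| is real and even in k, so the inverse DFT reduces to a cosine sum.
    return [sum(log_mag[m] * math.cos(2 * math.pi * m * n / nfft)
                for m in range(nfft)) / nfft for n in range(nfft)]

def estimate_excitation(c, n_min, n_max, threshold):
    """Find the largest cepstral value in [n_min, n_max].

    Returns (voiced, peak_index); `threshold` is a hypothetical tuning value.
    """
    peak = max(range(n_min, n_max + 1), key=lambda n: c[n])
    return c[peak] > threshold, peak

# Synthetic segment x[n] = v[n] * p_w[n] (all values are illustrative assumptions).
PERIOD, NFFT = 40, 256
v = [0.9 ** n * math.cos(0.3 * math.pi * n) for n in range(100)]  # damped resonance
x = [0.0] * 200
for k, weight in enumerate((1.0, 0.6, 0.36)):   # three weighted pulses, PERIOD apart
    for n, vn in enumerate(v):
        x[k * PERIOD + n] += weight * vn

c = real_cepstrum(x, NFFT)
voiced, pitch = estimate_excitation(c, 20, 128, 0.1)
print(voiced, pitch)
```

For this synthetic segment the contribution of v[n] is concentrated at low quefrencies, so restricting the search to larger n isolates the component due to the pulse spacing.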

(c) Suppose that we replace the log operation in Figure P13.29-1 with the “squaring” operation so that the resulting system is as depicted in Figure P13.29-2. Can the new “cepstrum” q_x[n] be used to estimate the excitation parameters? Explain.
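
A small numerical experiment may help with part (c). The sketch below assumes that Figure P13.29-2 is the chain of DFT, squared magnitude, and inverse DFT (the figure is not reproduced here); the function names and the test sequence are hypothetical. It compares the output of that chain with the deterministic autocorrelation of x[n], computed directly in the time domain.

```python
import cmath
import math

def squared_spectrum_transform(x, nfft):
    """Inverse DFT of |X[k]|^2 for the zero-padded sequence x."""
    padded = list(x) + [0.0] * (nfft - len(x))
    power = [abs(sum(padded[k] * cmath.exp(-2j * math.pi * m * k / nfft)
                     for k in range(nfft))) ** 2 for m in range(nfft)]
    # |X[k]|^2 is real and even in k, so the inverse DFT reduces to a cosine sum.
    return [sum(power[m] * math.cos(2 * math.pi * m * n / nfft)
                for m in range(nfft)) / nfft for n in range(nfft)]

def autocorrelation(x, lag):
    """Deterministic autocorrelation sum of x at a non-negative lag."""
    return sum(x[k] * x[k + lag] for k in range(len(x) - lag))

x = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0, 0.5, 0.75]   # arbitrary test sequence
q = squared_spectrum_transform(x, 16)               # nfft >= 2*len(x) - 1 avoids lag aliasing
max_diff = max(abs(q[lag] - autocorrelation(x, lag)) for lag in range(len(x)))
print(max_diff)
```

The two results agree to within floating-point rounding, which identifies what q_x[n] represents and gives a starting point for deciding whether it carries the pulse-spacing information.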