Voice DSP. Part 1: Speech science and what we can learn from it. Part 2: Speech DSP (AGC, VAD, features, echo cancellation). Part 3: Speech compression techniques. Part 4: Speech recognition. Voice DSP - Part 2. Simplest processing: gain, AGC, VAD. More complex processing: pitch tracking, U/V decision, computing LPC, other features.

Voice DSP Processing II Yaakov J. Stein Chief Scientist RAD Data Communications

Voice DSP
Part 1: Speech science and what we can learn from it. Part 2: Speech DSP (AGC, VAD, features, echo cancellation). Part 3: Speech compression techniques. Part 4: Speech recognition.

Voice DSP - Part 2
Simplest processing: gain, AGC, VAD. More complex processing: pitch tracking, U/V decision, computing LPC, other features. Echo cancellation: sources of echo, echo suppression, echo cancellation, adaptive noise cancellation, the LMS algorithm, other adaptive algorithms, the standard LEC.

Voice DSP - Part 2a Simplest voice DSP

Gain (volume) Control
In analog processing (electronics), gain requires an amplifier. Great care must be taken to ensure linearity! In digital processing (DSP), gain requires only multiplication: $y = G x$. Need enough bits!

Automatic Gain Control (AGC)
Can we set the gain automatically? Yes, based on the signal's energy! $E = \int x^2(t)\,dt = \sum_n x_n^2$. All we need to do is apply gain until we attain the desired energy. Assume we want the energy to be $Y$. Then $y = \sqrt{Y/E}\;x = G x$ has exactly this energy.
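As a minimal sketch of the fixed-gain case above: the function below scales a block of samples so that its energy equals a target value. The name `normalize_energy` is hypothetical, and the sketch assumes the gain is the square root of the energy ratio, since scaling the samples by G scales the energy by G squared.

```python
import math

def normalize_energy(x, target_energy):
    """Scale samples x so that the sum of squares equals target_energy."""
    energy = sum(v * v for v in x)          # E = sum of x_n^2
    if energy == 0.0:
        return list(x)                      # silence: no gain can reach the target
    gain = math.sqrt(target_energy / energy)  # G = sqrt(Y / E)
    return [gain * v for v in x]
```

The only costly operations are the ones the slides list later: squaring, accumulation, a square root, and a division.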

AGC - cont.
What if the input isn't stationary (gets stronger and weaker over time)? The energy is defined over all times $-\infty < t < \infty$, so it can't help! So we define "energy in window" $E(t)$ and continuously vary the gain $G(t)$. This is Adaptive Gain Control. We don't want the gain to jump from window to window, so we smooth the instantaneous gain with an IIR filter: $G(t) \leftarrow \alpha\,G(t) + (1-\alpha)\sqrt{Y/E(t)}$.

AGC - cont.
The $\alpha$ coefficient determines how fast $G(t)$ can change. In more complex implementations we may separately control integration time, attack time, and release time. What is involved in the computation of $G(t)$? Squaring of the input value; accumulation; square root (or Pythagorean sum); inversion (division). Square root and inversion are hard for a DSP processor, but algorithmic improvements are possible (and often needed).
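The windowed, smoothed gain update can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the target energy is taken per window, and the window length, the smoothing coefficient `alpha`, and the `max_gain` clamp (added so that near-silent windows do not blow up the gain) are all hypothetical choices.

```python
import math

def adaptive_gain_control(x, target_energy, window=160, alpha=0.9, max_gain=100.0):
    """Apply a smoothed, per-window gain so each window approaches target_energy."""
    gain = 1.0
    out = []
    for start in range(0, len(x), window):
        frame = x[start:start + window]
        energy = sum(v * v for v in frame)            # energy in window E(t)
        if energy > 0.0:
            instantaneous = min(math.sqrt(target_energy / energy), max_gain)
        else:
            instantaneous = max_gain                  # assumption: clamp on silence
        # One-pole IIR smoothing: G <- alpha*G + (1-alpha)*sqrt(Y/E)
        gain = alpha * gain + (1.0 - alpha) * instantaneous
        out.extend(gain * v for v in frame)
    return out
```

A larger `alpha` makes the gain react more slowly; separate attack and release times would use different coefficients depending on whether the gain is rising or falling.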

Simple VAD
Sometimes it is useful to know whether someone is talking (or not): to save bandwidth, suppress echo, or segment utterances. We may be able to get away with "energy VOX". This normally needs a Noise Riding Threshold / Signal Riding Threshold. However, there are problems with energy VOX, since it doesn't differentiate between speech and noise. What we really want is a speech-specific activity detector: a Voice Activity Detector.

Simple VAD - cont.
VADs operate by recognizing that speech is different from noise: speech is low-pass while noise is white; speech is mostly voiced and so has pitch in a given range; average noise amplitude is relatively constant. A simple VAD may use: zero crossings; zero crossing "derivative"; spectral tilt filter; energy contours; combinations of the above.
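A minimal sketch of a VAD that combines two of the cues listed above, frame energy against a noise-riding threshold and the zero-crossing rate. The thresholds and the assumption that the first frame contains only noise are illustrative guesses, not values from the slides.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))
    return crossings / max(len(frame) - 1, 1)

def frame_energy(frame):
    """Mean squared sample value of the frame."""
    return sum(v * v for v in frame) / max(len(frame), 1)

def simple_vad(frames, energy_ratio=4.0, max_zcr=0.25, noise_alpha=0.95):
    """Return one True/False speech decision per frame."""
    noise = None
    decisions = []
    for frame in frames:
        e = frame_energy(frame)
        if noise is None:
            noise = e                      # assumption: first frame is noise only
        # Speech is louder than the noise floor and low-pass (few zero crossings).
        active = e > energy_ratio * noise and zero_crossing_rate(frame) < max_zcr
        if not active:
            # Noise-riding threshold: track the noise level during inactive frames.
            noise = noise_alpha * noise + (1.0 - noise_alpha) * e
        decisions.append(active)
    return decisions
```

Updating the noise estimate only during inactive frames is what lets the threshold ride slowly varying background noise.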

Other "simple" processes
Simple = not significantly dependent on details of the speech signal. Speed change of recorded signal; speed change with pitch compensation; pitch change with speed compensation; sample rate conversion; tone generation; tone detection; dual tone generation; dual tone detection (needs high reliability).

Voice DSP - Part 2b Complex voice DSP

Correlation
One major difference between simple and complex processing is the computation of correlations (related to the LPC model). Correlation is a measure of similarity. Shouldn't we use squared difference to measure similarity? $D^2 = \langle (x(t) - y(t))^2 \rangle$. No, since squared difference is sensitive to gain and to time shifts.

Correlation - cont.
$D^2 = \langle (x(t) - y(t))^2 \rangle = \langle x^2 \rangle + \langle y^2 \rangle - 2 \langle x(t)\,y(t) \rangle$. So when $D^2$ is minimal, $C(0) = \langle x(t)\,y(t) \rangle$ is maximal, and arbitrary gains don't change this. To take time shifts into account, compute $C(\tau) = \langle x(t)\,y(t+\tau) \rangle$ and look for the $\tau$ that maximizes it! We can even find out how much a signal resembles itself.
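A small sketch of the lag search described above: compute the correlation at each candidate shift and keep the shift where it peaks. The function names are hypothetical, and samples outside the signal are treated as zero.

```python
def cross_correlation(x, y, lag):
    """Sum of x[t] * y[t + lag] over the samples where both exist."""
    total = 0.0
    for t in range(len(x)):
        if 0 <= t + lag < len(y):
            total += x[t] * y[t + lag]
    return total

def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] that maximizes the cross-correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: cross_correlation(x, y, lag))
```

In the example, `y` is `x` delayed by three samples and amplified by three; the gain does not move the correlation peak, which is the point of preferring correlation to squared difference.

```python
x = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 3, 6, 3, 0, 0]
print(best_lag(x, y, 5))  # 3
```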

Autocorrelation
Crosscorrelation: $C_{xy}(\tau) = \langle x(t)\,y(t+\tau) \rangle$. Autocorrelation: $C_x(\tau) = \langle x(t)\,x(t+\tau) \rangle$. $C_x(0)$ is the energy! Autocorrelation finds hidden periodicities! It is much stronger than looking in the time representation. Wiener-Khintchine: the autocorrelation $C(\tau)$ and the power spectrum $S(f)$ are an FT pair. So the autocorrelation contains the same information as the power spectrum … and can itself be computed by FFT.
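The Wiener-Khintchine relation can be checked numerically on a short signal. The sketch below uses a plain O(N²) DFT from the standard library in place of an FFT, purely to stay self-contained, and it uses the circular (periodic) autocorrelation, which is the form for which the discrete identity holds exactly. All function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Direct discrete Fourier transform (O(N^2)); stands in for an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_autocorrelation(x):
    """C(lag) = sum over t of x[t] * x[(t + lag) mod N], computed directly."""
    n = len(x)
    return [sum(x[t] * x[(t + lag) % n] for t in range(n)) for lag in range(n)]

def autocorrelation_via_spectrum(x):
    """Same quantity as the inverse transform of the power spectrum |X(f)|^2."""
    power = [abs(c) ** 2 for c in dft(x)]
    return [c.real for c in dft(power, inverse=True)]
```

For a linear (non-circular) autocorrelation, the usual approach is to zero-pad the signal to at least twice its length before transforming.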

Pitch tracking
How can we measure (and track) the pitch? We can look for it in the spectrum, but it may be very weak; it may not even be there (filtered out); and we need high-resolution spectral estimation. Correlation-based methods: the pitch periodicity should be seen in the autocorrelation! Sometimes computationally simpler is the Absolute Magnitude Difference Function $\langle\, |x(t) - x(t+\tau)| \,\rangle$.
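A minimal AMDF sketch: the function dips toward zero at lags equal to the pitch period, so the estimate is the lag with the smallest value inside a plausible pitch range. The lag range is an assumption the caller has to supply, and it should stay below twice the true period to avoid ties at multiples.

```python
import math

def amdf(x, lag):
    """Mean of |x[t] - x[t + lag]| over the overlapping samples."""
    n = len(x) - lag
    return sum(abs(x[t] - x[t + lag]) for t in range(n)) / n

def amdf_pitch_period(x, min_lag, max_lag):
    """Lag in [min_lag, max_lag] where the AMDF is smallest."""
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(x, lag))
```

Unlike autocorrelation, the inner loop needs only subtraction and absolute value, with no multiplications, which is why it can be cheaper on simple hardware.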

Pitch tracking - cont.
Sondhi's algorithm for autocorrelation-based pitch tracking: obtain a window of speech; determine whether the segment is voiced (see U/V decision below); low-pass filter and center-clip to reduce formant-induced correlations; compute autocorrelation lags corresponding to valid pitch intervals; find the lag with maximum correlation OR find the lag with maximal accumulated correlation in all multiples. Post processing: pitch trackers rarely make small errors (usually double pitch), so correct outliers based on neighboring values.
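The core of the procedure above can be sketched as below. This is a simplified illustration rather than Sondhi's published algorithm: it omits the voicing decision and the low-pass filter, normalizes each lag by its overlap length (an added assumption, to avoid favoring short lags), and uses a three-point median as one possible way to correct outliers. The clipping fraction of 0.3 is a hypothetical default.

```python
import math

def center_clip(x, fraction=0.3):
    """Zero samples below a threshold and shift the rest toward zero."""
    threshold = fraction * max(abs(v) for v in x)
    out = []
    for v in x:
        if v > threshold:
            out.append(v - threshold)
        elif v < -threshold:
            out.append(v + threshold)
        else:
            out.append(0.0)
    return out

def autocorrelation_pitch(frame, min_lag, max_lag, clip_fraction=0.3):
    """Pitch period (in samples) as the lag with the largest autocorrelation."""
    clipped = center_clip(frame, clip_fraction)

    def score(lag):
        n = len(clipped) - lag
        return sum(clipped[t] * clipped[t + lag] for t in range(n)) / n

    return max(range(min_lag, max_lag + 1), key=score)

def median_smooth(track):
    """Replace each interior pitch value by the median of itself and its neighbors."""
    out = list(track)
    for i in range(1, len(track) - 1):
        out[i] = sorted(track[i - 1:i + 2])[1]
    return out
```

The median step removes an isolated pitch-doubling error such as `[50, 50, 100, 50, 50]` without disturbing a smooth track.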

Other Pitch Trackers
Miller's data-reduction and Gold and Rabiner's parallel processing methods: zero-crossings, energy, extrema of the waveform. Noll's cepstrum-based pitch tracker: works since the pitch and formant contributions are separated in the cepstral domain; most accurate for clean speech, but not robust in noise. Methods based on the LPC error signal: the LPC technique breaks down at pitch pulse onset; find the periodicity of the error by autocorrelation. Inverse filtering method: remove formant filtering by low-order LPC analysis; find the periodicity of the excitation by autocorrelation. Sondhi-like methods are the best for noisy speech.

U/V decision
Between VAD and pitch tracking. The simplest U/V decision is based on energy and zero crossings. More complex methods are combined with pitch tracking. Methods based on pattern recognition. Is voicing well defined? Degree of voicing (buzz); voicing per frequency band (interference); degree of voicing per frequency band.

LPC Coefficients
How do we find the vocal tract filter coefficients? This is a system identification problem (unknown filter, known input, known output) for an all-pole (AR) filter, with a connection to prediction: $s_n = G\,e_n + \sum_m a_m s_{n-m}$. We can find $G$ from the energy (so let's ignore it).

LPC Coefficients
For simplicity let's assume three $a$ coefficients: $s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$. We need three equations!
$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$
$s_{n+1} = e_{n+1} + a_1 s_n + a_2 s_{n-1} + a_3 s_{n-2}$
$s_{n+2} = e_{n+2} + a_1 s_{n+1} + a_2 s_n + a_3 s_{n-1}$
In matrix form:
$$\begin{pmatrix} s_n \\ s_{n+1} \\ s_{n+2} \end{pmatrix} = \begin{pmatrix} e_n \\ e_{n+1} \\ e_{n+2} \end{pmatrix} + \begin{pmatrix} s_{n-1} & s_{n-2} & s_{n-3} \\ s_n & s_{n-1} & s_{n-2} \\ s_{n+1} & s_n & s_{n-1} \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}$$
that is, $\mathbf{s} = \mathbf{e} + S\,\mathbf{a}$.

LPC Coefficients - cont.
$\mathbf{s} = \mathbf{e} + S\,\mathbf{a}$, so by simple algebra $\mathbf{a} = S^{-1}(\mathbf{s} - \mathbf{e})$, and we have reduced the problem to matrix inversion. $S$ is a Toeplitz matrix, so the inversion is easy (Levinson-Durbin algorithm). Unfortunately, noise makes this attempt break down! Move to the next time and the answer will be different. We need to somehow average the answers. The proper averaging is done before the equation solving: correlation vs autocovariance.

LPC Coefficients - cont.
We can't just average over time: all the equations would be the same! Let's take the input to be zero, $s_n = \sum_m a_m s_{n-m}$, multiply by $s_{n-q}$ and sum over $n$: $\sum_n s_n s_{n-q} = \sum_m a_m \sum_n s_{n-m} s_{n-q}$. We recognize the autocorrelations: $C_s(q) = \sum_m C_s(|m-q|)\,a_m$ (the Yule-Walker equations). Autocorrelation method: $s_n$ outside the window are zero (Toeplitz). Autocovariance method: use all needed $s_n$ (no window). Also: pre-emphasis!
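A minimal sketch of the autocorrelation method: estimate the autocorrelation lags over a window, then solve the Yule-Walker equations with the Levinson-Durbin recursion. It follows the sign convention used in the slides, in which the sample is predicted as the sum of `a[m]` times the m-th previous sample. The function names are hypothetical, and pre-emphasis and windowing are left out.

```python
def autocorrelation(s, max_lag):
    """C(q) = sum over n of s[n] * s[n + q], with samples outside the window as zero."""
    return [sum(s[n] * s[n + q] for n in range(len(s) - q)) for q in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve C(q) = sum_m a_m C(|m - q|) for q = 1..order.

    Returns (a, reflection, error): predictor coefficients a_1..a_order,
    the reflection coefficients, and the final prediction error energy.
    """
    a = [0.0] * (order + 1)            # a[0] is unused
    error = r[0]
    reflection = []
    for i in range(1, order + 1):
        acc = r[i] - sum(a[m] * r[i - m] for m in range(1, i))
        k = acc / error                # i-th reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for m in range(1, i):
            new_a[m] = a[m] - k * a[i - m]
        a = new_a
        error *= (1.0 - k * k)         # prediction error shrinks at each order
        reflection.append(k)
    return a[1:], reflection, error
```

As a check, the impulse response of a known all-pole filter should give back its own coefficients:

```python
s = [1.0, 1.2]
for n in range(2, 400):
    s.append(1.2 * s[n - 1] - 0.72 * s[n - 2])
a, reflection, error = levinson_durbin(autocorrelation(s, 2), 2)
# a is approximately [1.2, -0.72]
```

The recursion also yields the reflection coefficients as a by-product, one per order.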

Alternative features
The $a$ coefficients aren't the only set of features: reflection coefficients (cylinder model); log-area coefficients (cylinder model); pole locations; LPC cepstrum coefficients; Line Spectral Pair frequencies. All theoretically contain the same information (algebraic transformations). Euclidean distance in LPC cepstrum space ~ Itakura-Saito measure, so these are popular in speech recognition. LPC ($a$) coefficients don't quantize or interpolate well, so these aren't good for speech compression. LSP frequencies are best for compression.

LSP coefficients
The $a$ coefficients are not statistically equally weighted. Pole positions are better (geometric), but the radius is sensitive near the unit circle. Is there an all-angle representation? Theorem 1: Every real polynomial with all roots on the unit circle is palindromic (e.g. $1 + 2t + t^2$) or antipalindromic (e.g. $1 + t - t^2 - t^3$). Theorem 2: Every polynomial can be written as the sum of palindromic and antipalindromic polynomials. Consequence: Every polynomial can be represented by roots on the unit circle, that is, by angles.
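Theorem 2 is easy to see constructively: take half the sum and half the difference of the coefficient list and its reverse. The sketch below does only that decomposition; finding the unit-circle roots of the two parts, which gives the LSP frequencies themselves, is not shown. The name `split_palindromic` is hypothetical.

```python
def split_palindromic(coeffs):
    """Split polynomial coefficients into palindromic and antipalindromic parts.

    The two returned lists add up, term by term, to the input.
    """
    rev = coeffs[::-1]
    palindromic = [(c + r) / 2.0 for c, r in zip(coeffs, rev)]
    antipalindromic = [(c - r) / 2.0 for c, r in zip(coeffs, rev)]
    return palindromic, antipalindromic
```

In the usual LSP construction the order-p LPC polynomial is first extended by one degree (a trailing zero coefficient) before the split, so that the two parts have degree p + 1.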

Voice DSP - Part 2c Echo Cancelation

Acoustic Echo

Line echo
[Figure: Telephone 1, hybrid, hybrid, Telephone 2]

Echo suppressor
[Figure: 4w switch, comp, inv, 4w switch] In practice we need more: VOX, override, reset, etc.

Why not echo suppression?
[Figure: near end, far end] Echo suppression makes the conversation half duplex: a waste of full-duplex infrastructure; the conversation is unnatural; it is hard to break in; the line sounds dead. It would be better to cancel the echo, that is, subtract the echo signal while allowing the desired signal through, but that requires DSP.
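To make "subtract the echo signal" concrete, here is a minimal adaptive canceller sketch using the LMS algorithm named in the outline. The slides that cover LMS are not part of this text, so this is a generic textbook form rather than the author's version; the tap count, step size `mu`, and all names are illustrative assumptions.

```python
import random

def lms_echo_canceller(far_end, near_end, taps=4, mu=0.1):
    """Adaptively estimate the echo of far_end inside near_end and subtract it.

    Returns (residual, weights): the echo-cancelled signal and the final
    estimate of the echo path impulse response.
    """
    w = [0.0] * taps                  # adaptive filter weights
    history = [0.0] * taps            # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, near_end):
        history = [x] + history[:-1]
        echo_estimate = sum(wk * hk for wk, hk in zip(w, history))
        e = d - echo_estimate         # what is left after subtracting the estimate
        # LMS update: move the weights along the error-times-input direction.
        w = [wk + mu * e * hk for wk, hk in zip(w, history)]
        residual.append(e)
    return residual, w
```

With no near-end talker, the residual should decay toward zero as the weights converge to the echo path:

```python
rng = random.Random(0)
far = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
path = [0.5, -0.3]                    # hypothetical echo path
near = [sum(path[k] * far[n - k] for k in range(len(path)) if n - k >= 0)
        for n in range(len(far))]
residual, w = lms_echo_canceller(far, near)
```

Unlike a suppressor, this leaves the line full duplex: any near-end speech would pass through in the residual, although a practical canceller usually freezes adaptation while both sides talk.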

clean Echo can
