Speech Segregation

DeLiang Wang, Perception and Neurodynamics Lab, Ohio State University, http://www.cse.ohio-state.edu/pnl/. Speech Segregation. Outline of presentation. Introduction: speech segregation problem; auditory scene analysis (ASA); speech enhancement

Presentation Transcript

Slide 1

Speech Segregation. DeLiang Wang, Perception & Neurodynamics Lab, Ohio State University, http://www.cse.ohio-state.edu/pnl/

Slide 2

Outline of presentation. Introduction: speech segregation problem. Auditory scene analysis (ASA). Speech enhancement. Speech segregation by computational auditory scene analysis (CASA). Segregation as binary classification. Concluding remarks.

Slide 3

Real-world audition. What? Speech (message; speaker age, gender, linguistic origin, mood, …), music, car passing by. Where? Left, right, up, down. How close? Channel characteristics. Environment characteristics: room reverberation, ambient noise.

Slide 4

Sources of intrusion and distortion: additive noise from other sound sources; channel distortion; reverberation from surface reflections.

Slide 5

Speech segregation problem. Cocktail party problem: term coined by Cherry. "One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it 'the cocktail party problem'…" (Cherry, 1957). "For 'cocktail party'-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers" (Bronkhorst & Plomp, 1992). Ball-room problem by Helmholtz: "Complicated beyond conception" (Helmholtz, 1863).

Slide 6

Listener performance. Speech reception threshold (SRT): the speech-to-noise ratio needed for 50% intelligibility. Each 1 dB gain in SRT corresponds to a 5-10% increase in intelligibility (Miller et al., 1951), depending on the materials. Source: Steeneken (1992).

Slide 7

Effects of competing source. SRT difference (23 dB!). Source: Wang and Brown (2006).

Slide 8

Some applications of speech segregation: robust automatic speech and speaker recognition; processor for hearing prosthesis (hearing aids, cochlear implants); audio information retrieval.

Slide 9

Approaches to speech segregation. Monaural approaches: speech enhancement; CASA (focus of this tutorial). Microphone-array approaches: spatial filtering (beamforming) and independent component analysis. Spatial filtering extracts target sound from a specific spatial direction with a sensor array. Limitation: configuration stationarity. What if the target switches or changes location? Independent component analysis finds a demixing matrix from mixtures of sound sources. Limitation: strong assumptions. Chief among them is stationarity of the mixing matrix.
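As a rough illustration of spatial filtering, the sketch below implements the simplest beamformer, delay-and-sum. It is not taken from the slides: the function name `delay_and_sum`, the integer per-microphone delays, and the toy signals are all assumptions made for the example. It also shows why configuration stationarity matters, since the delays are fixed in advance for one target direction.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its (assumed known) integer delay
    in samples, then average the aligned channels."""
    n = len(channels[0])
    out = []
    for t in range(n):
        total = 0.0
        for samples, delay in zip(channels, delays):
            idx = t + delay  # undo the arrival delay at this microphone
            if 0 <= idx < n:
                total += samples[idx]
        out.append(total / len(channels))
    return out

# Hypothetical two-microphone recording: the target reaches mic 1 one sample late.
target = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
mic0 = list(target)
mic1 = [0.0] + target[:-1]
steered = delay_and_sum([mic0, mic1], delays=[0, 1])
print(steered)
```

If the target moves, the delays no longer match and the channels add incoherently, which is the limitation the slide points out.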

Slide 10

Part II: Auditory scene analysis. Human auditory system. How does the human auditory system organize sound? Auditory scene analysis account.

Slide 11

Auditory periphery: a complex mechanism for transducing pressure variations in the air to neural impulses in auditory nerve fibers.

Slide 12

Beyond the periphery. The auditory system is complex, with four relay stations between periphery and cortex rather than one in the visual system. Compared with the auditory periphery, central parts of the auditory system are less understood. The number of neurons in the primary auditory cortex is comparable to that in the primary visual cortex, despite the fact that the number of fibers in the auditory nerve is far smaller than that in the optic nerve (thousands vs. millions). Figures: the auditory system (Source: Arbib, 1989); the auditory nerve.

Slide 13

Auditory scene analysis. Listeners are capable of parsing an acoustic scene (a sound mixture) to form a mental representation of each sound source, called a stream, in the perceptual process of auditory scene analysis (Bregman, 1990). From acoustic events to perceptual streams. Two conceptual processes of ASA: Segmentation. Decompose the acoustic mixture into sensory elements (segments). Grouping. Combine segments into streams, so that segments in the same stream originate from the same source.

Slide 14

Simultaneous organization. Simultaneous organization groups sound components that overlap in time. ASA cues for simultaneous organization: proximity in frequency (spectral proximity); common periodicity (harmonicity, temporal fine structure); common spatial location; common onset (and to a lesser degree, common offset); common temporal modulation (amplitude modulation (AM), frequency modulation (FM)). Demo: [audio in original slides]

Slide 15

Sequential organization. Sequential organization groups sound components across time. ASA cues for sequential organization: proximity in time and frequency; temporal and spectral continuity; common spatial location (more generally, spatial continuity); smooth pitch contour; smooth formant transition?; rhythmic structure. Demo: streaming in African xylophone music. Notes in pentatonic scale.

Slide 16

Primitive versus schema-based organization. Primitive grouping: innate data-driven mechanisms, consistent with those described by Gestalt psychologists for visual perception (feature-based or bottom-up). It is domain-general, and exploits intrinsic structure of environmental sound. Grouping cues described earlier are primitive in nature. Schema-driven grouping: learned knowledge about speech, music and other environmental sounds (model-based or top-down). It is domain-specific, e.g. organization of speech sounds into syllables.

Slide 17

Organization in speech: spectrogram of "… pure pleasure …". Figure labels: continuity, onset synchrony, offset synchrony, harmonicity.

Slide 18

Interim summary of ASA. Auditory peripheral processing amounts to a decomposition of the acoustic signal. ASA cues essentially reflect structural coherence of natural sound sources. A subset of cues is believed to be strongly involved in ASA. Simultaneous organization: periodicity, temporal modulation, onset. Sequential organization: location, pitch contour and other source characteristics (e.g. vocal tract).

Slide 19

Part III. Speech enhancement. Speech enhancement aims to remove or reduce background noise. Improve signal-to-noise ratio (SNR). Assumes stationary noise, or at least that noise is more stationary than speech. A tradeoff between speech distortion and noise distortion (residual noise). Types of speech enhancement algorithms: spectral subtraction; Wiener filtering; minimum mean square error (MMSE) estimation; subspace algorithms. Material in this part is mainly based on Loizou (2007).
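Since the goal here is stated in terms of SNR, a minimal sketch of how SNR in dB might be computed is given below. It assumes the clean signal and the additive noise are available separately, which only holds in simulation; the function name `snr_db` and the toy signals are illustrative, not from the slides.

```python
import math

def snr_db(clean, noise):
    """SNR in dB: 10*log10 of clean-signal energy over noise energy."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum(d * d for d in noise)
    return 10.0 * math.log10(signal_power / noise_power)

# Toy example: noise amplitude is one tenth of the signal amplitude.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, 0.1, -0.1, -0.1]
print(round(snr_db(clean, noise), 1))
```

An amplitude ratio of 10 corresponds to a power ratio of 100, so this toy mixture sits at about 20 dB SNR.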

Slide 20

Spectral subtraction. It is based on a simple principle: assuming additive noise, one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum. The noise spectrum can be estimated (and updated) during periods when the speech signal is absent or when only noise is present. It requires voice activity detection or speech pause detection.

Slide 21

Basic principle. In the signal domain: y(n) = x(n) + d(n), where x is the speech signal, d the noise, and y the noisy speech. In the DFT domain: Y(ω) = X(ω) + D(ω). Hence we have the estimated signal magnitude spectrum: |X̂(ω)| = |Y(ω)| − |D̂(ω)|. The difference can turn out negative because of noise estimation errors, so half-wave rectification is applied to ensure nonnegative magnitudes.

Slide 22

Basic principle (cont.). Assuming that speech and noise are uncorrelated, we have the estimated signal power spectrum: |X̂(ω)|² = |Y(ω)|² − |D̂(ω)|². In general: |X̂(ω)|^p = |Y(ω)|^p − |D̂(ω)|^p. Again, half-wave rectification needs to be applied.
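The subtraction rule with half-wave rectification can be sketched per frequency bin as follows. This is an illustrative sketch, not the presenter's code: the function name `subtract_spectrum`, the exponent argument `p` (p = 1 for magnitude subtraction, p = 2 for power subtraction), and the example values are assumptions.

```python
def subtract_spectrum(noisy_mag, noise_mag, p=2.0):
    """Per-bin spectral subtraction in the |.|^p domain.
    Negative differences are set to zero (half-wave rectification)."""
    estimate = []
    for y, d in zip(noisy_mag, noise_mag):
        diff = y ** p - d ** p
        estimate.append(max(diff, 0.0) ** (1.0 / p))
    return estimate

noisy_mag = [5.0, 1.0]   # |Y| in two bins
noise_mag = [3.0, 2.0]   # estimated |D| in the same bins
print(subtract_spectrum(noisy_mag, noise_mag, p=2.0))
print(subtract_spectrum(noisy_mag, noise_mag, p=1.0))
```

In the second bin the noise estimate exceeds the noisy magnitude, so rectification clamps the result to zero for either exponent.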

Slide 23

Flow diagram. Noisy speech → FFT → subtraction of the noise estimate (noise estimation/update) → recombination with the phase information of the noisy speech → IFFT → enhanced speech.
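The flow above can be sketched for a single frame as shown below. This is a minimal sketch under several assumptions: a naive DFT stands in for the FFT so that the example needs only the standard library, the noise magnitude estimate `noise_mag` is simply given rather than tracked during speech pauses, and windowing and overlap-add across frames are omitted. The names `dft`, `idft` and `enhance_frame` are illustrative.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def enhance_frame(noisy_frame, noise_mag):
    """Magnitude spectral subtraction on one frame, reusing the noisy phase."""
    enhanced_spectrum = []
    for y_k, d_k in zip(dft(noisy_frame), noise_mag):
        magnitude = max(abs(y_k) - d_k, 0.0)   # half-wave rectification
        enhanced_spectrum.append(cmath.rect(magnitude, cmath.phase(y_k)))
    return idft(enhanced_spectrum)

frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
unchanged = enhance_frame(frame, [0.0] * 8)    # zero noise estimate
silenced = enhance_frame(frame, [100.0] * 8)   # noise estimate dominates
print([round(v, 6) for v in unchanged])
```

With a zero noise estimate the frame passes through unchanged, and with an estimate larger than every bin the output is silence; real use sits between these extremes.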

Slide 24

Effects of half-wave rectification

Slide 25

Musical noise. Isolated peaks cause musical noise.

Slide 26

Over-subtraction to reduce musical noise. By over-subtracting the noise spectrum, we can reduce the amplitude of isolated peaks and in some cases eliminate them altogether. This by itself, however, is not sufficient because the deep valleys surrounding the peaks still remain in the spectrum. For that reason, spectral flooring is used to "fill in" the spectral valleys. α is the over-subtraction factor (α > 1), and β is the spectral floor parameter (β < 1).
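The slide's equation is not preserved in this transcript. One common form of over-subtraction with a spectral floor operates on power spectra: subtract α times the noise power, and replace any result that falls below β times the noise power with that floor. The sketch below assumes this form; the function name `oversubtract` and the example values are illustrative.

```python
def oversubtract(noisy_power, noise_power, alpha=3.0, beta=0.1):
    """Over-subtraction with spectral flooring (assumed power-domain form).
    alpha > 1 is the over-subtraction factor, beta < 1 the spectral floor."""
    estimate = []
    for y, d in zip(noisy_power, noise_power):
        # Bins that would fall below the floor are filled in at beta * d.
        estimate.append(max(y - alpha * d, beta * d))
    return estimate

noisy_power = [10.0, 2.0]
noise_power = [1.0, 1.0]
print(oversubtract(noisy_power, noise_power, alpha=3.0, beta=0.1))  # [7.0, 0.1]
```

Setting α = 1 and β = 0 reduces this to plain power subtraction with half-wave rectification.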

Slide 27

Effects of parameters: sound demo. Conditions: half-wave rectification (α = 1, β = 0); α = 3, β = 0; α = 8, β = 0; α = 8, β = 0.1; α = 8, β = 1; α = 15, β = 0. Noisy sentence (+5 dB SNR). Original (clean) sentence.

Slide 28

Wiener filter. Aim: to find the optimal filter that minimizes the mean square error between the desired signal (clean signal) and the estimated output. Input to this filter: noisy speech. Output of this filter: enhanced speech.

Slide 29

Wiener filter in frequency domain. Wiener filter for noise reduction: H(ω) denotes the filter. Minimizing the mean square error between filtered noisy speech and clean speech leads to, for frequency ω_k: H(ω_k) = P_xx(ω_k) / (P_xx(ω_k) + P_dd(ω_k)). P_xx(ω_k): power spectrum of x(n). P_dd(ω_k): power spectrum of d(n).

Slide 30

Wiener filter in terms of a priori SNR. Define the a priori SNR at frequency ω_k: ξ_k = P_xx(ω_k) / P_dd(ω_k). The Wiener filter becomes H(ω_k) = ξ_k / (ξ_k + 1). More attenuation at lower SNR and less attenuation at higher SNR.
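To make the attenuation behavior concrete, the sketch below evaluates the Wiener gain ξ/(ξ + 1) at a few a priori SNR values. The function name `wiener_gain` and the chosen SNR points are illustrative, and the power spectra are assumed known, which is not the case in practice.

```python
import math

def wiener_gain(p_xx, p_dd):
    """Wiener gain for one frequency bin from speech and noise power."""
    xi = p_xx / p_dd            # a priori SNR (linear scale)
    return xi / (xi + 1.0)

for snr_in_db in (-20, 0, 20):
    xi = 10.0 ** (snr_in_db / 10.0)
    gain = wiener_gain(xi, 1.0)
    print(snr_in_db, "dB SNR -> gain", round(gain, 4),
          "(", round(20.0 * math.log10(gain), 1), "dB )")
```

At 0 dB the gain is exactly one half; it approaches 1 as the SNR grows and approaches 0 as the SNR falls.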

Slide 31

Iterative Wiener filtering. The optimal Wiener filter depends on the input signal power spectrum, which is not available. In practice, we can estimate the Wiener filter iteratively. Consider the following procedure at iteration i to estimate H(ω). Step 1: obtain an estimate of the Wiener filter, H_i(ω), based on the enhanced signal obtained at iteration i. Initialize with the noisy speech signal. Step 2: filter the noisy signal through the newly obtained Wiener filter according to X̂_{i+1}(ω) = H_i(ω) Y(ω) to get the new enhanced signal, x̂_{i+1}(n). Repeat the above procedure.
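The two steps can be sketched per frequency bin as below. This is a simplified illustration, not the procedure from the slides in full: it assumes the speech power spectrum at each iteration is taken directly as the squared magnitude of the current enhanced spectrum, and that the noise power is known. The function name `iterative_wiener` and the example values are hypothetical. Because the Wiener gain is real-valued, only magnitudes need to be tracked; the noisy phase is unchanged.

```python
def iterative_wiener(noisy_mag, noise_power, iterations=3):
    """Iteratively re-estimate the Wiener gain from the current enhanced
    magnitudes, always filtering the original noisy magnitudes."""
    enhanced_mag = list(noisy_mag)          # initialize with noisy speech
    for _ in range(iterations):
        gains = []
        for x, d in zip(enhanced_mag, noise_power):
            p_xx = x * x                    # Step 1: speech power estimate
            total = p_xx + d
            gains.append(p_xx / total if total > 0.0 else 0.0)
        # Step 2: filter the noisy signal with the new gain.
        enhanced_mag = [h * y for h, y in zip(gains, noisy_mag)]
    return enhanced_mag

print(iterative_wiener([2.0, 10.0], [4.0, 4.0], iterations=1))
print(iterative_wiener([2.0, 10.0], [4.0, 4.0], iterations=2))
```

In this simplified form, low-SNR bins are attenuated further with every pass while high-SNR bins settle quickly, so the number of iterations acts as a tuning parameter.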

Slide 32

MMSE estimator. The Wiener filter is the optimal (in the mean square error sense) complex spectrum estimator, not the optimal magnitude spectrum estimator. Ephraim and Malah (1984) proposed an MMSE estimator which is the optimal magnitude spectrum estimator. Unlike the Wiener estimator, the MMSE estimator does not require a linear model between the observed data and the estimator, but assumes the probability distributions of speech and noise