An Iterative Technique for Segmenting Speech and Text Alignment

An Iterative Technique for Segmenting Speech and Text Alignment
Arthur R. Toth
Speech Seminar - 4/18/2003

Basic Problem
- Have a large audio file and its associated text
- Want to align the text with the audio
  - Useful for synthesis
  - Useful for acoustic modeling
- Doing this manually is tedious
- What if it could be done automatically?
- Or what if even part of it could be done automatically?

Related Problem
- Splitting the audio file can help
- Phrases can be a good candidate unit
  - Can only be so long (the speaker needs to breathe)
  - Short enough that forced alignment is feasible
- There is existing work on predicting break locations
- But then you need to split the associated text

Constraints
- Different data is available
  - Acoustic data, i.e. the waveform
  - Supra-segmental information
- For our first attempts, we are trying to see how far we can get using only the waveform
- Differs from approaches that use word information, cf. Wang & Hirschberg, Wightman et al.

Data Set
- Boston University Radio Corpus
- Single-speaker monologue
  - No dialog turn information
- Female announcer
- Some peculiarities
  - Loud breathing
  - Wide f0 range, sometimes with large dips


Segmenting Strategy
- Want to focus on phrase break levels > 2
- Tool for first approximation: vad end-pointer
  - Available from MS State University
  - Public domain
  - Uses energy and zero-crossings
  - Records starts and ends of the segments it finds
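As a rough illustration of end-pointing from energy and zero-crossings, here is a minimal Python sketch. It is not the MS State vad tool. The function names, the non-overlapping framing, and the two thresholds are assumptions made for this example only.

```python
def frame_features(samples, frame_len):
    """Return (mean energy, zero-crossing count) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Count sign changes between neighbouring samples.
        zc = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, zc))
    return feats


def find_segments(samples, frame_len, energy_thresh, zc_thresh):
    """Return (start_sample, end_sample) for each run of active frames.

    Assumption: a frame is active when its energy exceeds energy_thresh,
    or when its zero-crossing count exceeds zc_thresh.
    """
    feats = frame_features(samples, frame_len)
    segments = []
    seg_start = None
    for i, (energy, zc) in enumerate(feats):
        active = energy > energy_thresh or zc > zc_thresh
        if active and seg_start is None:
            seg_start = i * frame_len          # segment begins
        elif not active and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # segment ends
            seg_start = None
    if seg_start is not None:                  # audio ended mid-segment
        segments.append((seg_start, len(feats) * frame_len))
    return segments
```

The segment ends returned here play the role of the candidate phrase-break positions. A real end-pointer would likely add smoothing and minimum-duration rules.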

Splitting Text - First Pass
- Use Festival to predict the durations of words
- Linearly scale the total predicted duration to the actual duration
- Look at the positions of segment endpoints from vad and use the scaled duration predictions to predict the last word of each segment
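The first pass can be sketched in a few lines of Python. The per-word durations are taken as given (in the talk they come from Festival). The function names and the nearest-end-time rule for picking the word are assumptions for illustration, not necessarily the rule used in the original system.

```python
from itertools import accumulate


def scale_word_end_times(predicted_durations, actual_total):
    """Linearly scale predicted word durations so they sum to actual_total,
    and return the scaled end time of each word."""
    scale = actual_total / sum(predicted_durations)
    return [t * scale for t in accumulate(predicted_durations)]


def predict_last_word(end_times, endpoint):
    """Return the index of the word whose scaled end time is closest to the
    segment endpoint (assumed selection rule)."""
    return min(range(len(end_times)), key=lambda i: abs(end_times[i] - endpoint))


# Three words predicted at 1 s, 2 s and 1 s; the recording is actually 8 s long.
end_times = scale_word_end_times([1.0, 2.0, 1.0], 8.0)
first_guess = predict_last_word(end_times, 5.5)
```

With these numbers the scale factor is 2, so the scaled word end times are 2.0, 6.0 and 8.0 seconds, and an endpoint at 5.5 seconds maps to the second word.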

Iterations
- Refine the estimates iteratively as follows:
- In each iteration, work left-to-right
  - Use sphinx-align to score forced alignments for the words up through the current last-word prediction
  - Also try last words up to 2 before and 2 after
  - Take the best-scoring list of words as the new estimate
- Note: forced alignment can fail
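One left-to-right pass of this refinement might look like the following Python sketch. The aligner is abstracted as a hypothetical `score_fn(segment, first_word, last_word)` that returns a score (higher is better) or `None` when forced alignment fails. It stands in for a sphinx-align call and is not its real interface. The fallback of keeping the old guess when every candidate fails is also an assumption.

```python
def refine_once(last_words, score_fn, n_words, window=2):
    """Run one left-to-right pass over the segments.

    last_words[k] is the current guess for the index of the last word
    in segment k. Returns the refined list of last-word indices.
    """
    refined = []
    first = 0  # index of the first word of the current segment
    for seg, guess in enumerate(last_words):
        best, best_score = guess, None
        # Try the current guess and up to `window` words before and after it.
        for cand in range(guess - window, guess + window + 1):
            if cand < first or cand >= n_words:
                continue  # candidate would give an empty or out-of-range word list
            score = score_fn(seg, first, cand)
            if score is None:
                continue  # forced alignment failed for this candidate
            if best_score is None or score > best_score:
                best, best_score = cand, score
        refined.append(best)  # stays at the old guess if nothing could be scored
        first = best + 1      # the next segment starts after the chosen word
    return refined


def refine(last_words, score_fn, n_words, iterations=5, window=2):
    """Repeat the pass a fixed number of times (5 iterations were run in the talk)."""
    for _ in range(iterations):
        last_words = refine_once(last_words, score_fn, n_words, window)
    return last_words
```

Because each pass can move a boundary by at most `window` words, a guess that starts far from the true position needs several iterations to reach it.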

Experiment and Results
- 5 iterations were run
- Estimated word positions were compared with the actual ones
  - Had to convert from times to words
  - Criterion: a break is associated with the last preceding word ending time
- The most significant improvement appeared to come in the first iteration
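The time-to-word conversion used for scoring can be sketched as below, assuming a sorted list of labeled word ending times. The function names and the example times are made up for illustration and are not results from the experiment.

```python
from bisect import bisect_right


def break_time_to_word(word_end_times, break_time):
    """Return the index of the last word that ends at or before break_time,
    or -1 if the break precedes the end of the first word."""
    return bisect_right(word_end_times, break_time) - 1


def word_offsets(estimated, actual):
    """Signed error in words for each break (estimated minus actual)."""
    return [e - a for e, a in zip(estimated, actual)]


# Hypothetical word ending times in seconds.
ends = [0.5, 1.2, 2.0, 3.1]
word_at_break = break_time_to_word(ends, 2.4)
```

Here a break at 2.4 seconds is associated with the third word (index 2), since that is the last word ending before the break. Comparing such indices with the estimated ones gives an error measured in words rather than in seconds.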

Discussion
- Points close to correct improved quickly
- Points further away didn't improve as much
  - Window size is probably too small
  - Need to expand window sizes, but keep other constraints in mind
  - A heuristic like the Itakura rule might be handy
- Many misses were only 1 off, and biased
  - May result from estimation or labeling

Further Work
- More sophisticated phrase break detection
  - Using a general-purpose tool
  - Want the option of using supra-segmental information, if available
  - Would a Switching State-Space Model help? (Ghahramani & Hinton)
- Is the left-to-right iteration approach best?
- Non-iterative model for splitting text?