An Iterative Technique for Segmenting Speech and Text Alignment. Arthur R. Toth. Speech Seminar - 4/18/2003
Slide 2: Basic Problem. Have a large audio file and its associated text. Want to align the text with the audio. Useful for synthesis. Useful for acoustic modeling. Doing this manually is tedious. What if it can be done automatically? Or even if only part of it can be done automatically?
Slide 3: Related Problem. Splitting the audio file can help. Phrases can be good candidates: they can only be so long (speakers need to breathe), and they are short enough that forced alignment is feasible. There is existing work on predicting break locations. But then you have to split the associated text as well.
Slide 4: Constraints. Different data is available: acoustic information (i.e. the waveform) and supra-segmental information. For our first attempts, we are trying to see how far we can get using only the waveform. This differs from approaches that use word information, cf. Wang & Hirschberg, Wightman et al.
Slide 5: Data Set. Boston University Radio Corpus. Single-speaker monologue. No speaker turn information. Female announcer. Some characteristics: loud breathing; wide f0 range, sometimes with large dips.
Slide 6: Segmenting Strategy. Want to concentrate on phrase break levels > 2. Tool for the first guess: the vad end-pointer, available from Mississippi State University (public domain). It uses energy and zero-crossings and records the beginnings and ends of the segments it finds. http://www.isip.msstate.edu/ventures/discourse/programming/legacy/signal_detector/index.html
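To make the idea of energy and zero-crossing end-pointing concrete, here is a minimal Python sketch. It is not the vad tool's actual algorithm: the function names, the non-overlapping framing, the fixed thresholds, and the simple "energy or zero-crossings above threshold" rule are all assumptions chosen for illustration.

```python
def frame_features(samples, frame_len):
    """Return (mean energy, zero-crossing count) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Count sign changes between consecutive samples.
        zc = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, zc))
    return feats


def find_segments(samples, frame_len, energy_thresh, zc_thresh):
    """Return (start_sample, end_sample) pairs for runs of active frames.

    Assumption: a frame counts as active when its energy or its
    zero-crossing count exceeds the given threshold.
    """
    feats = frame_features(samples, frame_len)
    segments = []
    seg_start = None
    for i, (energy, zc) in enumerate(feats):
        active = energy > energy_thresh or zc > zc_thresh
        if active and seg_start is None:
            seg_start = i * frame_len          # a segment begins
        elif not active and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # a segment ends
            seg_start = None
    if seg_start is not None:                  # audio ended mid-segment
        segments.append((seg_start, len(feats) * frame_len))
    return segments
```

The segment endpoints produced this way are sample indices; dividing by the sampling rate would give the times used in later steps.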
Slide 7: Splitting Text - First Pass. Use Festival to predict the durations of words. Linearly scale the total predicted duration to the actual duration. Look at the positions of segment endpoints from vad and use the scaled duration predictions to predict the word at each segment endpoint.
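The first pass can be sketched as follows. This is an illustrative Python sketch, not the original implementation: the function name is hypothetical, the predicted durations are assumed to be already available as a list of seconds per word, and choosing the word whose scaled end time is closest to each endpoint is an assumption about how the prediction is made.

```python
def predict_break_words(pred_durations, actual_total, endpoint_times):
    """Guess which word ends at each segment endpoint.

    pred_durations: predicted duration of each word, in seconds.
    actual_total:   actual duration of the audio, in seconds.
    endpoint_times: segment end times from the end-pointer, in seconds.
    Returns one word index per endpoint.
    """
    # Linearly scale predictions so they sum to the actual duration.
    scale = actual_total / sum(pred_durations)
    ends = []
    t = 0.0
    for d in pred_durations:
        t += d * scale
        ends.append(t)  # scaled end time of this word
    # Assumption: pick the word whose scaled end time is nearest the endpoint.
    return [
        min(range(len(ends)), key=lambda i: abs(ends[i] - ep))
        for ep in endpoint_times
    ]
```

For example, with predicted durations of 1, 1 and 2 seconds and 8 seconds of actual audio, the scale factor is 2 and the scaled word end times are 2, 4 and 8 seconds.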
Slide 9: Iterations. Refine the estimates iteratively as follows. In each iteration, work left-to-right. Use sphinx-adjust to score forced alignments for the words up through the initial last-word prediction. Also try last words up to 2 before and 2 after. Take the best-scoring list of words as the new estimate. Note: forced alignment can fail.
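The refinement loop might look like the Python sketch below. The aligner is abstracted as a caller-supplied `score_fn`, a hypothetical stand-in for whatever forced-alignment scorer is used; it returns a score (higher is better) or `None` when alignment fails. The function name, the default of 5 iterations, and the handling of failed alignments (skip the candidate) are assumptions.

```python
def refine_breaks(initial_last_words, score_fn, n_words, window=2, iterations=5):
    """Iteratively refine the last-word index of each audio segment.

    initial_last_words: first-pass last-word index for each segment.
    score_fn(seg, first_word, last_word): hypothetical alignment scorer,
        returns a number (higher is better) or None if alignment fails.
    n_words: total number of words in the text.
    """
    estimates = list(initial_last_words)
    for _ in range(iterations):
        first = 0  # first word of the current segment; work left to right
        for seg, guess in enumerate(estimates):
            best, best_score = guess, None
            # Try the current guess and up to `window` words before and after.
            for cand in range(guess - window, guess + window + 1):
                if cand < first or cand >= n_words:
                    continue
                score = score_fn(seg, first, cand)
                if score is None:  # forced alignment failed for this candidate
                    continue
                if best_score is None or score > best_score:
                    best, best_score = cand, score
            estimates[seg] = best
            first = max(first, best + 1)  # next segment starts after this one
    return estimates
```

Because each pass can move an estimate by at most `window` words, an estimate that starts far from the correct word needs several iterations to get there.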
Slide 10: Experiment and Results. 5 iterations were run. Estimated word locations were compared with the actual ones. Had to convert from times to words. Criterion: each break is associated with the last preceding word ending time. The most significant improvement appeared to be in the first iteration.
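The time-to-word conversion under this criterion can be sketched in Python as below. The function names are hypothetical, and the word ending times are assumed to be available as a sorted list of seconds (for example, from reference labels).

```python
from bisect import bisect_right


def break_time_to_word(break_time, word_end_times):
    """Index of the last word ending at or before break_time, or -1 if none.

    word_end_times must be sorted in increasing order.
    """
    return bisect_right(word_end_times, break_time) - 1


def word_offsets(estimated, actual):
    """Signed error, in words, of each estimated break against the actual one."""
    return [e - a for e, a in zip(estimated, actual)]
```

A signed offset keeps the direction of each miss, so a tendency to land consistently one word early or late remains visible.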
Slide 13: Discussion. Points close to correct improved quickly. Points further away didn't improve as much. The window size is probably too small. Need to expand window sizes, but keep other constraints in mind. A heuristic like the Itakura rule might be handy. Many misses are only 1 off, and biased. This may result from estimation or labeling.
Slide 14: Further Work. More sophisticated phrase break detection. Using a general-purpose tool. Want the option of using supra-segmental information, if available. Would a Switching State-Space Model help? (Ghahramani & Hinton) Is the left-to-right iteration approach best? A non-iterative model for splitting text?