An Empirical Study on Language Model Adaptation


Presentation Transcript

Slide 1

An Empirical Study on Language Model Adaptation
Jianfeng Gao, Hisami Suzuki (Microsoft Research); Wei Yuan (Shanghai Jiao Tong University)
Presented by Patty Liu

Slide 2

Outline
- Introduction
- The Language Model and the Task of IME
- Related Work
- LM Adaptation Methods
- Experimental Results
- Discussion
- Conclusion and Future Work

Slide 3

Introduction
Language model adaptation attempts to adjust the parameters of an LM so that it will perform well on a particular domain of data. In particular, we focus on the so-called cross-domain LM adaptation paradigm, that is, adapting an LM trained on one domain (the background domain) to a different domain (the adaptation domain), for which only a small amount of training data is available.
The LM adaptation methods investigated here can be grouped into two categories:
(1) Maximum a posteriori (MAP): linear interpolation
(2) Discriminative training: boosting, perceptron, minimum sample risk

Slide 4

The Language Model and the Task of IME
IME (Input Method Editor): users first input phonetic strings, which are then converted into appropriate word strings by software. Unlike speech recognition, there is no acoustic ambiguity in IME, since the phonetic string is given directly by users. Moreover, we can assume a unique mapping from W to A in IME, that is, $P(A \mid W) = 1$.
From the perspective of LM adaptation, IME faces the same problem that speech recognition does: the quality of the model depends heavily on the similarity between the training data and the test data.

Slide 5

Related Work (1/3)
I. Measuring Domain Similarity
- $L$: a language
- $p$: the true underlying probability distribution of $L$
- $m$: another distribution (e.g., an SLM) which attempts to model $p$
- $H(L, m)$: the cross entropy of $L$ with respect to $m$
- $w_1 \ldots w_n$: a word string in $L$

Slide 6

Related Work (2/3)
However, in reality, the underlying distribution p is never known and the corpus size is never infinite. We therefore make the assumption that the language L is an ergodic and stationary process, and approximate the cross entropy by calculating it for a sufficiently large n instead of calculating it for the limit.
The cross entropy takes into account both the similarity between two distributions (given by KL divergence) and the entropy of the corpus in question.
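The slide equations did not survive in this transcript. As a sketch only, and assuming the standard textbook definition of cross entropy (the symbols L, p, m, and n are assumed notation), the quantities described above can be written as:

```latex
% Cross entropy of language L with respect to model m (standard definition)
H(L, m) = -\lim_{n \to \infty} \frac{1}{n}
          \sum_{w_1 \ldots w_n} p(w_1 \ldots w_n) \log m(w_1 \ldots w_n)

% Approximation for an ergodic, stationary process and sufficiently large n
H(L, m) \approx -\frac{1}{n} \log m(w_1 \ldots w_n)

% Decomposition into corpus entropy and KL divergence
H(p, m) = H(p) + D_{\mathrm{KL}}(p \,\|\, m)
```

The last identity is why cross entropy reflects both the diversity of the corpus itself and how far the model is from the true distribution.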

Slide 7

Related Work (3/3)
II. LM Adaptation Methods
- MAP: adjust the parameters of the background model → maximize the likelihood of the adaptation data
- Discriminative training methods: use the adaptation data → directly minimize the errors that the background model makes on it
These techniques have been applied successfully to language modeling in non-adaptation as well as adaptation scenarios for speech recognition.

Slide 8

LM Adaptation Methods - LI
I. The Linear Interpolation Method
$P(w_i \mid h) = \lambda P_B(w_i \mid h) + (1 - \lambda) P_A(w_i \mid h)$
- $P_B$: the probability of the background model
- $P_A$: the probability of the adaptation model
- $h$: the history, which corresponds to the two preceding words
- $\lambda$: the interpolation weight; for simplicity, we chose a single $\lambda$ for all histories and tuned it on held-out data
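A minimal Python sketch of linear interpolation with a single weight tuned on held-out data. The function names, the grid of candidate weights, and the use of held-out log-likelihood as the tuning criterion are assumptions made for illustration; the slides only say that the weight was tuned on held-out data.

```python
import math


def interpolate(p_background, p_adaptation, lam):
    """Mix two conditional probabilities P(w | h) with a single weight lam."""
    return lam * p_background + (1.0 - lam) * p_adaptation


def held_out_log_likelihood(pairs, lam):
    """pairs: one (p_background, p_adaptation) tuple per held-out word."""
    total = 0.0
    for p_b, p_a in pairs:
        p = interpolate(p_b, p_a, lam)
        if p <= 0.0:
            return float("-inf")  # this weight assigns zero probability
        total += math.log(p)
    return total


def tune_lambda(pairs, steps=10):
    """Assumed criterion: pick the grid value with the best held-out likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda lam: held_out_log_likelihood(pairs, lam))
```

With lam = 1 the model reduces to the background model, and with lam = 0 to the adaptation model.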

Slide 9

LM Adaptation Methods - Problem Definition Of Discriminative Training Methods (1/3)
II. Discriminative Training Methods
◎ Problem Definition

Slide 10

LM Adaptation Methods - Problem Definition Of Discriminative Training Methods (2/3)
This formulation views IME as a ranking problem, where the model gives a ranking score, not probabilities. We therefore do not evaluate the LM obtained using discriminative training via perplexity.

Slide 11

LM Adaptation Methods - Problem Definition Of Discriminative Training Methods (3/3)
- $W^R$: reference transcript
- Er(.): an error function, which is an edit distance function in this case
- SR: sample risk, the sum of error counts over the training samples
Discriminative training methods strive to minimize the SR by optimizing the model parameters. However, SR cannot be optimized easily, since Er(.) is a piecewise constant (or step) function of λ and its gradient is undefined. Therefore, discriminative methods apply different approaches that optimize it approximately. The boosting and perceptron algorithms approximate SR by loss functions that are suitable for optimization, while MSR uses a simple heuristic training procedure to minimize SR directly.
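To make the sample risk concrete, here is a small Python sketch. It assumes a linear ranking score (a weighted sum of feature values) and represents each candidate as a hypothetical (tokens, features) pair; these data structures are illustrative, not taken from the slides.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (the error function)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def score(features, weights):
    """Assumed linear ranking score: sum of weight * feature value."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def sample_risk(samples, weights):
    """samples: list of (reference_tokens, candidates), where each candidate
    is a (tokens, features) pair. Returns the summed error of the
    top-ranked candidate of every sample."""
    total = 0
    for reference, candidates in samples:
        best_tokens, _ = max(candidates, key=lambda c: score(c[1], weights))
        total += edit_distance(reference, best_tokens)
    return total
```

Because the top-ranked candidate changes only when a weight crosses a threshold, `sample_risk` is a step function of each weight, which is the difficulty the slide describes.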

Slide 12

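LM Adaptation Methods - The Boosting Algorithm (1/2)
(i) The Boosting Algorithm
- margin: the score of the correct conversion minus the score of a candidate conversion
- a ranking error: an incorrect candidate conversion gets a higher score than the correct conversion
- RLoss: the number of ranking errors, counted with an indicator $I[\pi]$, where $I[\pi] = 1$ if $\pi \ge 0$, and 0 otherwise
- Optimizing the RLoss: NP-complete → the algorithm optimizes its upper bound, ExpLoss, which is convex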
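The loss formulas themselves are missing from this transcript. A hedged reconstruction, assuming the usual ranking-loss formulation for boosting-based reranking (the symbols M, GEN, W_i^R, and A_i are assumed notation), would be:

```latex
% Margin between the reference conversion and a candidate
M(W^R, W) = \mathrm{Score}(W^R) - \mathrm{Score}(W)

% Ranking loss: number of candidates scored at least as high as the reference
\mathrm{RLoss}(\lambda) = \sum_{i} \sum_{W \in \mathrm{GEN}(A_i)} I\left[-M(W_i^R, W)\right]

% Exponential loss: a convex upper bound, since I[-M] \le \exp(-M) for every M
\mathrm{ExpLoss}(\lambda) = \sum_{i} \sum_{W \in \mathrm{GEN}(A_i)} \exp\left(-M(W_i^R, W)\right)
```

The bound holds term by term: when the margin is non-positive the indicator is 1 and the exponential is at least 1, and when the margin is positive the indicator is 0.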

Slide 13

LM Adaptation Methods - The Boosting Algorithm (2/2)
- $W_d^+$: a value increasing exponentially with the sum of the margins of $(W^R, W)$ pairs over the set where $f_d$ is seen in $W^R$ but not in $W$
- $W_d^-$: the value related to the sum of margins over the set where $f_d$ is seen in $W$ but not in $W^R$
- $\epsilon$: a smoothing factor (whose value is optimized on held-out data)
- $Z$: a normalization constant

Slide 14

LM Adaptation Methods - The Perceptron Algorithm (1/2)
(ii) The Perceptron Algorithm
- delta rule:
- stochastic approximation:

Slide 15

LM Adaptation Methods - The Perceptron Algorithm (2/2)
Averaged perceptron algorithm
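The update equations on these two slides were lost. The sketch below shows the commonly used averaged perceptron for reranking, not necessarily the exact variant presented: each sample moves the weights toward the reference features and away from the currently top-ranked candidate, and the returned weights are the average over all steps. The feature dictionaries, `eta`, and `epochs` are hypothetical.

```python
def score(features, weights):
    """Assumed linear ranking score: sum of weight * feature value."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def averaged_perceptron(samples, feature_names, epochs=5, eta=1.0):
    """samples: list of (reference_features, candidate_feature_dicts)."""
    weights = {name: 0.0 for name in feature_names}
    totals = {name: 0.0 for name in feature_names}
    steps = 0
    for _ in range(epochs):
        for ref_feats, candidates in samples:
            # Current top-ranked candidate under the present weights.
            best = max(candidates, key=lambda feats: score(feats, weights))
            # Perceptron update; it is zero when the top candidate
            # has the same features as the reference.
            for name in feature_names:
                diff = ref_feats.get(name, 0.0) - best.get(name, 0.0)
                weights[name] += eta * diff
            # Accumulate the weights after every sample for averaging.
            for name in feature_names:
                totals[name] += weights[name]
            steps += 1
    return {name: totals[name] / max(steps, 1) for name in feature_names}
```

Averaging tends to make the final weights less sensitive to the order of the training samples than the last raw weight vector would be.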

Slide 16

LM Adaptation Methods - MSR (1/7)
(iii) The Minimum Sample Risk Method
Conceptually, MSR operates like any multidimensional function optimization approach:
- The first direction (i.e., feature) is selected and SR is minimized along that direction using a line search, that is, adjusting the parameter of the selected feature while keeping all other parameters fixed.
- Then, from there, SR is minimized along the second direction, and so on.
- The procedure cycles through the whole set of directions as many times as necessary, until SR stops decreasing.
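The cycle described above is a form of coordinate descent. A minimal Python sketch follows, assuming a caller-supplied `risk` function and a naive fixed grid of trial values in place of the more refined line search; all names here are hypothetical.

```python
def coordinate_descent(risk, weights, grid, max_cycles=10):
    """Minimize risk(weights) one coordinate at a time.

    risk: function mapping a list of weights to a number (e.g. sample risk).
    grid: trial values tried for each coordinate (a naive line search).
    """
    weights = list(weights)
    best = risk(weights)
    for _ in range(max_cycles):
        improved = False
        for d in range(len(weights)):
            for value in grid:
                # Change only coordinate d; all other weights stay fixed.
                trial = weights[:d] + [value] + weights[d + 1:]
                r = risk(trial)
                if r < best:
                    best, weights, improved = r, trial, True
        if not improved:
            break  # a full cycle without improvement: stop
    return weights, best
```

Only strict improvements are accepted, so the loop terminates once a full cycle over all directions fails to lower the risk.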

Slide 17

LM Adaptation Methods - MSR (2/7)
This simple method can work properly under two assumptions.
- First, there exists an implementation of line search that efficiently optimizes the function along one direction.
- Second, the number of candidate features is not too large, and they are not highly correlated.
However, neither of the assumptions holds in our case.
- First of all, Er(.) in SR is a step function of λ, and thus cannot be optimized directly by regular gradient-based procedures, so a grid search has to be used. However, there are problems with simple grid search: using a coarse grid could miss the optimal solution, while using a fine-grained grid would lead to a slow algorithm.
- Second, in the case of LM, there are millions of candidate features, some of which are highly correlated with each other.

Slide 18

LM Adaptation Methods - MSR (3/7)
◎ Active candidate of a group
$W$: candidate word string
Since in our case $f_d(W)$ takes integer values ($f_d(W)$ is the count of a particular n-gram in $W$), we can group the candidates using $f_d(W)$ so that candidates in each group have the same value of $f_d(W)$. In each group, we define the candidate with the highest value of $\sum_{d' \ne d} \lambda_{d'} f_{d'}(W)$ as the active candidate of the group, because no matter what value $\lambda_d$ takes, only this candidate could be selected as the highest-scoring candidate.

Slide 19

LM Adaptation Methods - MSR (4/7)
◎ Grid Line Search
By finding the active candidates, we can reduce the candidate list to a much smaller list of active candidates. We can find a set of intervals for λ, within each of which a particular active candidate will be selected as the top-ranked candidate. As a result, for each training sample, we obtain a sequence of intervals and their corresponding Er(.) values. The optimal value of λ can then be found by traversing the sequence and taking the midpoint of the interval with the lowest Er(.) value.
By merging the sequences of intervals of all training samples in the training set, we obtain a global sequence of intervals and their corresponding sample risk. We can then find the optimal value of λ as well as the minimal sample risk by traversing the global interval sequence.
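A simplified Python sketch of this interval-based line search for a single feature. Assumptions made here: each candidate is reduced to a hypothetical tuple (base_score, feature_count, error), where base_score is the score from all other features; breakpoints are taken from all pairwise intersections of active candidates, which is less efficient than tracking only the intervals that actually occur; and the two unbounded outer intervals are probed 1.0 beyond the outermost breakpoints.

```python
def active_candidates(candidates):
    """Keep, for each feature count, only the candidate with the best base score."""
    best = {}
    for base, count, error in candidates:
        if count not in best or base > best[count][0]:
            best[count] = (base, count, error)
    return list(best.values())


def breakpoints(active):
    """Values of lam where two active candidates receive the same score."""
    points = set()
    for i, (b1, f1, _) in enumerate(active):
        for b2, f2, _ in active[i + 1:]:
            # b1 + lam * f1 == b2 + lam * f2; counts differ by construction.
            points.add((b2 - b1) / (f1 - f2))
    return points


def error_at(active, lam):
    """Error of the candidate that ranks first when the weight equals lam."""
    return max(active, key=lambda c: c[0] + lam * c[1])[2]


def line_search(samples):
    """samples: one candidate list per training sample.
    Returns (best_lam, minimal_sample_risk)."""
    actives = [active_candidates(cands) for cands in samples]
    cuts = sorted(set().union(*(breakpoints(a) for a in actives)))
    if cuts:
        probes = [cuts[0] - 1.0]
        probes += [(lo + hi) / 2.0 for lo, hi in zip(cuts, cuts[1:])]
        probes.append(cuts[-1] + 1.0)
    else:
        probes = [0.0]  # ranking never changes, any value is equivalent
    # Sample risk is constant inside each interval, so one probe suffices.
    risk, lam = min((sum(error_at(a, p) for a in actives), p) for p in probes)
    return lam, risk
```

Since the top-ranked candidate of every sample is fixed between consecutive breakpoints, evaluating one point per interval explores every value the sample risk can take along this direction.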

Slide 20

LM Adaptation Methods - MSR (5/7)
◎ Feature Subset Selection
Reducing the number of features is essential for two reasons: to reduce computational complexity and to ensure the generalization property of the linear model.
- Effectiveness of a feature: E(.)
- The cross-correlation coefficient between two features: C(.)

Slide 21

LM Adaptation Methods - MSR (6/7)

Slide 22

LM Adaptation Methods - MSR (7/7)
- $D$: the number of all candidate features
- $K$: the number of features in the resulting model, $K \ll D$
According to the feature selection method:
- step 1: the effectiveness E(.) is estimated for each of the $D$ candidate features
- step 4: $K \times D$ estimates of the cross-correlation coefficient C(.) are required
Therefore, we only estimate the value of C(.) between each of the selected features and each of the top $N$ remaining features with the highest value of E(.). This reduces the number of estimates of C(.) to $K \times N$.

Slide 23

Experimental Results (1/3)
I. Data
The data used in our experiments comes from five distinct sources of text. Different sizes of each adaptation training set were also used to show how the size of the adaptation training data affected the performance of the various adaptation methods.

Slide 24

Experimental Results (2/3)
II. Computing Domain Characteristics
(i) The similarity between two domains: cross entropy
- not symmetric
- self entropy (the diversity of the corpus) increases in the following order: N→Y→E→T→S

Slide 25

Experimental Results (3/3)
III. Results of LM Adaptation
We trained our baseline trigram model on our background (Nikkei) corpus.

Slide 26

Discussion (1/6)
I. Domain Similarity and CER
The more similar the adaptation domain is to the background domain, the better the CER results.

Slide 27

Discussion (2/6)
II. Domain Similarity and the Robustness of Adaptation Methods
The discriminat