Alla Rozovskaya, Dan Roth; Grammar Error Correction in Morphologically Rich Languages: The Case of Russian. Transactions of the Association for Computational Linguistics 2019; 7 1–17. doi:
Abstract
Until now, most of the research in grammar error correction has focused on English, and the problem has hardly been explored for other languages. We address the task of correcting writing mistakes in morphologically rich languages, with a focus on Russian. We present a corrected and error-tagged corpus of Russian learner writing and develop models that make use of existing state-of-the-art methods that have been well studied for English. Although impressive results have recently been achieved for grammar error correction of non-native English writing, these results are limited to domains where plentiful training data are available. Because annotation is extremely costly, these approaches are not suitable for the majority of domains and languages. We thus focus on methods that use “minimal supervision”; that is, those that do not rely on large amounts of annotated training data, and show how existing minimal-supervision approaches extend to a highly inflectional language such as Russian. The results demonstrate that these methods are particularly useful for correcting mistakes in grammatical phenomena that involve rich morphology.
1 Introduction
This paper addresses the task of correcting errors in text. Most of the research in the area of grammar error correction (GEC) has focused on correcting mistakes made by English language learners. One standard approach to dealing with these errors, which proved highly successful in text correction competitions (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013, 2014; Rozovskaya et al., 2017), makes use of a machine-learning classifier paradigm and is based on the methodology for correcting context-sensitive spelling mistakes (Golding and Roth, 1996, 1999; Banko and Brill, 2001). In this approach, classifiers are trained for a particular mistake type: for example, preposition, article, or noun number (Tetreault et al., 2010; Gamon, 2010; Rozovskaya and Roth, 2010c, b; Dahlmeier and Ng, 2012). Originally, classifiers were trained on native English data. As several annotated learner datasets became available, models were also trained on annotated learner data.
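To make the classifier paradigm concrete, the following is a minimal sketch of a confusion-set classifier for English prepositions trained only on native (assumed correct) text. It is an illustration under stated assumptions, not the system used in this work: the confusion set, the two context features, the Naive Bayes model with add-one smoothing, the toy training sentences, and all names (`CONFUSION_SET`, `ConfusionSetClassifier`, `context_features`) are hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical confusion set: the candidate words the classifier chooses among.
CONFUSION_SET = ("in", "on", "at")


def context_features(tokens, i):
    """Return features built from the words adjacent to position i."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return [f"L={left}", f"R={right}"]


class ConfusionSetClassifier:
    """Naive Bayes over context features with add-one smoothing (toy model)."""

    def __init__(self, candidates):
        self.candidates = tuple(candidates)
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences):
        # Native text is assumed correct, so every occurrence of a candidate
        # is a training example labeled with the word the writer used.
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, tok in enumerate(tokens):
                if tok in self.candidates:
                    self.label_counts[tok] += 1
                    for feat in context_features(tokens, i):
                        self.feature_counts[tok][feat] += 1
                        self.vocab.add(feat)

    def score(self, label, feats):
        """Smoothed log P(label) + sum of log P(feature | label)."""
        total = sum(self.label_counts.values())
        n_feats = sum(self.feature_counts[label].values())
        logp = math.log(
            (self.label_counts[label] + 1) / (total + len(self.candidates))
        )
        for feat in feats:
            count = self.feature_counts[label][feat]
            # The extra +1 in the denominator reserves mass for unseen features.
            logp += math.log((count + 1) / (n_feats + len(self.vocab) + 1))
        return logp

    def correct(self, sentence):
        """Replace each candidate word with the highest-scoring alternative."""
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok in self.candidates:
                feats = context_features(tokens, i)
                tokens[i] = max(
                    self.candidates, key=lambda c: self.score(c, feats)
                )
        return " ".join(tokens)


# Toy "native" corpus, invented for this example.
NATIVE_SENTENCES = [
    "she lives in london",
    "he works in paris",
    "they arrived in berlin",
    "the book is on the table",
    "the keys are on the shelf",
    "we met at noon",
    "see you at noon",
]

model = ConfusionSetClassifier(CONFUSION_SET)
model.train(NATIVE_SENTENCES)
print(model.correct("she lives on london"))  # prints "she lives in london"
```

The sketch always overwrites the writer's choice with the top-scoring candidate. A practical system would use far richer features and more training data, and would typically change a word only when the model is sufficiently confident.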
More recently, statistical machine translation (MT) methods, including neural MT, have gained considerable popularity thanks to the availability of large annotated corpora of learner writing (e.g., Yuan and Briscoe, 2016; Chollampatt and Ng, 2018). Classification methods work very well on well-defined types of errors, whereas MT is good at correcting interacting and complex types of mistakes, which makes these approaches complementary in some respects (Rozovskaya and Roth, 2016).
Thanks to the availability of large (in-domain) datasets, substantial gains in performance have been made in English grammar correction. Unfortunately, research on other languages has been scarce. Previous work includes efforts to create annotated learner corpora for Arabic (Zaghouani et al., 2014), Japanese (Mizumoto et al., 2011), and Chinese (Yu et al., 2014), and shared tasks on Arabic (Mohit et al., 2014; Rozovskaya et al., 2015) and Chinese error detection (Lee et al., 2016; Rao et al., 2017). However, building robust models in other languages has been a challenge, because an approach that relies on heavy supervision is not viable across languages, genres, and learner backgrounds. Moreover, for languages that are morphologically complex, we may need more data to address the lexical sparsity.
This work focuses on Russian, a highly inflectional language from the Slavic group. Russian has over 260M speakers, for 47% of whom Russian is not their native language. We corrected and error-tagged over 200K words of non-native Russian texts. We use this dataset to build several grammar correction systems that draw on and extend the methods that showed state-of-the-art performance on English grammar correction. Because the size of our annotation is limited, compared with what is used for English, one of the goals of our work is to quantify the effect of having limited annotation on existing approaches. We evaluate both the MT paradigm, which requires large amounts of annotated learner data, and the classification approaches that can work with any amount of supervision.