Until now, most of the research in grammar error correction has focused on English, and the problem has hardly been explored for other languages. We address the task of correcting writing mistakes in morphologically rich languages, with a focus on Russian. We present a corrected and error-tagged corpus of Russian learner writing and develop models that make use of existing state-of-the-art methods that have been well studied for English.

Although impressive results have recently been achieved for grammar error correction of non-native English writing, these results are limited to domains where plentiful training data are available. Because annotation is extremely costly, these approaches are not suitable for the majority of domains and languages. We thus focus on methods that use “minimal supervision”, that is, those that do not rely on large amounts of annotated training data, and show how existing minimal-supervision approaches extend to a highly inflectional language such as Russian. The results demonstrate that these methods are particularly useful for correcting mistakes in grammatical phenomena that involve rich morphology.

This paper makes the following contributions: (1) We describe an error classification schema for Russian learner errors and present an error-tagged Russian learner corpus. The dataset is available for research³ and can serve as a benchmark dataset for Russian, which should facilitate progress on grammar correction research, especially for languages other than English. (2) We present an analysis of the annotated data in terms of error rates, error distributions by learner type (foreign and heritage), as well as a comparison to learner corpora in other languages. (3) We extend state-of-the-art grammar correction methods to a morphologically rich language and, in particular, identify classifiers needed to address mistakes that are specific to these languages. (4) We demonstrate that the classification framework with minimal supervision is particularly useful for morphologically rich languages: they can benefit from large amounts of native data, due to a large variability of word forms, and small amounts of annotation provide good estimates of typical learner errors. (5) We present an error analysis that provides further insight into the behavior of the models on a morphologically rich language.

There are currently two well-studied paradigms that achieve competitive results on the task in English: machine translation (MT) and machine-learning classification. In the classification approach, error-specific classifiers are built. Given a confusion set, for example for articles, each occurrence of a confusable word is represented as a vector of features derived from a context window around it. Classifiers can be trained either on learner or on native data, where each target word occurrence (e.g., the) is treated as a positive training example for the corresponding word. Given a text to correct, for each confusable word, the task is to select the most likely candidate from the relevant confusion set. Error-specific classifiers are typically trained for common learner errors, for example, article, preposition, or noun number errors in English (Izumi et al., 2003; Han et al., 2006; Gamon et al., 2008; De Felice and Pulman, 2008; Tetreault et al., 2010; Gamon, 2010; Rozovskaya and Roth, 2011; Dahlmeier and Ng, 2012).

In the MT approach, the error correction problem is cast as a translation task: namely, translating ungrammatical learner text into well-formed grammatical text, where original learner texts and the corresponding corrected texts act as parallel data. MT systems for grammar correction are trained on 20M–50M words of learner texts to achieve competitive performance. The MT approach has shown state-of-the-art results on the benchmark CoNLL-14 test set in English (Susanto et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2016; Chollampatt and Ng, 2017); it is particularly good at correcting complex error patterns, which is a challenge for the classification methods (Rozovskaya and Roth, 2016). However, phrase-based MT systems do not generalize well beyond the error patterns observed in the training data. Several neural encoder–decoder approaches relying on recurrent neural networks have been proposed (Chollampatt et al., 2016; Yuan and Briscoe, 2016; Ji et al., 2017).
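The confusion-set classification approach described above can be illustrated with a minimal sketch: a count-based Naive Bayes classifier over context-window features, trained on native text where each occurrence of a confusable word serves as a positive example for that word. The confusion set, feature template, and toy data below are illustrative assumptions, not the paper's actual feature set or training corpus.

```python
import math
from collections import Counter, defaultdict

# Toy article confusion set (illustrative; real systems also include the null article).
CONFUSION_SET = ["a", "the"]

def context_features(tokens, i, window=2):
    """Represent position i as words in a +/-window context around it."""
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats.append(f"w[{offset}]={word}")
    return feats

class ConfusionSetClassifier:
    """Naive Bayes over context-window features, trained on native text."""

    def __init__(self, confusion_set):
        self.confusion_set = confusion_set
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)

    def train(self, sentences):
        # Each occurrence of a confusable word is a positive example for that word.
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in self.confusion_set:
                    self.class_counts[tok] += 1
                    for f in context_features(tokens, i):
                        self.feat_counts[tok][f] += 1

    def predict(self, tokens, i):
        """Select the most likely member of the confusion set for position i."""
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for c in self.confusion_set:
            # log prior + add-one-smoothed log likelihoods of each feature
            score = math.log((self.class_counts[c] + 1) /
                             (total + len(self.confusion_set)))
            denom = sum(self.feat_counts[c].values()) + 1
            for f in context_features(tokens, i):
                score += math.log((self.feat_counts[c][f] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

# Toy "native" training data (illustrative).
native = [
    "the sun rises in the east".split(),
    "the sun sets in the west".split(),
    "i saw a dog in a park".split(),
    "a dog barked at a cat".split(),
]
clf = ConfusionSetClassifier(CONFUSION_SET)
clf.train(native)
```

At correction time, the classifier is queried at each confusable position and its prediction is compared with the word the learner wrote; a mismatch is proposed as a correction.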