Catalogue of Artificial Intelligence Techniques
Categories: Natural Language
Author(s): Karen Sp\"arck Jones
An essential requirement for serious natural language processing programs, though not strictly an AI technique. A simple strategy, based on a letter-by-letter tree-structured dictionary, assumes that errors fall into the four types: (i) extra letter (ii) substituted letter (iii) omitted letter, and (iv) reversed letter pair. Then at any point where a mismatch between input string and dictionary string occurs a new match can be tried by, respectively, (i) advancing the input string one letter (ii) advancing both strings together (iii) advancing the dictionary string, and (iv) advancing first one string and then the other. This strategy is looking for letter position correspondence between the two strings; weaker strategies look merely for ordinal correspondence, and yet weaker for `material' correspondence, i.e., just having the same letters. The simple strategy described does not take account of the number of errors per word. Generalising for this requires a string similarity measure. Such measures allow the use of non-literal word representations, e.g., hash coding, and of n-gram rather than single-letter based matching. Spelling correction may also use heuristics exploiting, e.g., distinctive properties of the language (`u' after `q' in English), those of the input device (optical character reader, human typist), and the choice of strategy may be influenced by the task, e.g., whether a large lexicon is involved, whether a user should be consulted for a proposed correction, etc. In the limit, error detection and correction requires full language understanding.
- Pollock, J.J., Spelling error detection and correction by computer: some notes and a bibliography
Journal of Documentation 38 (1982), 282--291.