Catalogue of Artificial Intelligence Techniques
Keywords: bi-gram, markov, natural language processing, tri-gram, word prediction
Author(s): Hannah Stewart
The n-gram is a model of word prediction used in natural language processing. The model works by looking at the previous N-1 words in order to predict the next word. Predictions are made by comparing the combination of words with those found in a corpus (plural: corpora). A corpus is an on-line collection of text and speech, used as a basis for the statistical processing of natural language.
The probability of a combination of words is calculated using relative frequencies:

probability = p / q

where p = the number of times the word appears in the corpus and q = the total number of words in the corpus.
This creates a probability distribution across the possible words. The final predicted word is chosen by comparing the conditional probabilities of the possible words.
The Markov assumption is used within n-grams. This is the assumption that the probability of a word depends only on the few words immediately preceding it. For example, a bi-gram is a first-order Markov model, which approximates the probability of the next word by looking one word into the past; a tri-gram is a second-order Markov model and looks two words into the past; in general, an n-gram is an (N-1)th-order Markov model and looks N-1 words into the past.
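A bi-gram model of this kind can be sketched in a few lines of Python. The toy corpus and the helper name predict_next below are illustrative assumptions, not part of any standard library; a real model would be trained on a large corpus. Conditional probabilities are estimated as relative frequencies: the count of a bigram divided by the count of its context word.

```python
from collections import Counter, defaultdict

# Toy training corpus (an assumption for illustration; real corpora are far larger).
corpus = "i am sam sam i am i do not like green eggs and ham".split()

# Count how often each word follows each context word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(prev_word):
    """Return the most probable next word under the bi-gram (first-order Markov) model."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    # Conditional probability P(next | prev) = count(prev, next) / count(prev)
    probs = {w: c / total for w, c in counts.items()}
    return max(probs, key=probs.get)

print(predict_next("i"))  # "am" follows "i" twice, "do" only once, so "am" is predicted
```

Extending this to a tri-gram model only changes the context key from one word to a pair of words; the counting and prediction logic stay the same.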
As n-grams must be trained from a corpus (or corpora), they are not perfect: corpora are finite, so some words and combinations of words will be missing. To mitigate this problem, a technique called smoothing is used.
- D. Jurafsky and J.H. Martin, Speech and Language Processing, Prentice Hall, 2000.