Question

Please explain/demonstrate how to use NLTK to test unigram, bigram, and trigram character models on guessing the language of new, unseen words.

Answer #1

Unigrams
• Each individual word (and each instance of punctuation, etc.) is a token
• There are 16 tokens in this sentence, including the period
– A fact about the unicorn is the same as an alternative fact about the unicorn.
• The counts of these words in the Brown Corpus using NLTK
– a 23195 fact 447 about 1815 the 69971 unicorn 0 is 10109 the 69971 same 686 as 7253
– an 3740 alternative 34 fact 447 about 1815 the 69971 unicorn 0 . 49346
• Probability of each token chosen randomly (and independently of other tokens)
– This is called the unigram probability.
– a 0.02 fact 0.000385 about 0.00156 the 0.0603 unicorn 0.0 is 0.00871 the 0.0603
– same 0.000591 as 0.00625 an 0.00322 alternative 2.93e-05 fact 0.000385 about 0.00156
– the 0.0603 unicorn 0.0 . 0.0425
• Converting counts to unigram probabilities
– count/total_words ≈ probability
– Assumes that (Brown) corpus is representative of future occurrences
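The count-to-probability conversion above can be sketched in a few lines. As a minimal stand-in, the example sentence itself serves as the "corpus"; in practice the counts would come from nltk.corpus.brown.words(), which supplies the Brown figures listed above.

```python
# Minimal sketch of unigram probabilities, using the example
# sentence itself as a stand-in corpus; in practice the counts
# would come from nltk.corpus.brown.words().
from collections import Counter

tokens = ("a fact about the unicorn is the same as "
          "an alternative fact about the unicorn .").split()
counts = Counter(tokens)
total = len(tokens)  # 16 tokens, as noted above

def unigram_prob(word):
    # count / total_words, per the formula above
    return counts[word] / total

print(total, unigram_prob("the"))
```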

A Unigram Model of a Sentence
• Unigram probability of sentence = product of probabilities of individual words.
• If one word has a probability of 0, then the probability of the sentence is 0, unless we model Out-of-Vocabulary (OOV) items.
• One OOV model: assume words occurring once are OOV and recalculate the counts, e.g., unicorn now has a non-zero probability
• New Unigram Probabilities:
– a 0.02 fact 0.000385 about 0.00156 the 0.0603
– unicorn 0.0135 is 0.00871 the 0.0603 same 0.000591
– as 0.00625 an 0.00322 alternative 2.93e-05
– fact 0.000385 about 0.00156 the 0.0603   
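This OOV scheme can be sketched as follows, assuming hapax legomena (words seen only once) are remapped to an *oov* token before probabilities are computed. The toy corpus and its figures are illustrative, not the Brown numbers:

```python
# Sketch: unigram sentence probability with a simple OOV model.
# Words seen only once in the training corpus are remapped to
# *oov*, so unseen words (like "unicorn") get non-zero probability.
from collections import Counter

corpus = "the cat sat on the mat the dog ran".split()
raw = Counter(corpus)
remapped = [w if raw[w] > 1 else "*oov*" for w in corpus]
counts = Counter(remapped)
total = len(remapped)

def unigram_prob(word):
    return counts[word if word in counts else "*oov*"] / total

def sentence_prob(sentence):
    p = 1.0
    for w in sentence.split():
        p *= unigram_prob(w)  # product of word probabilities
    return p

print(sentence_prob("the unicorn sat"))
```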

Bigrams
• Bigram = probability of wordN, given wordN-1
– bigram(the,same) = count(the,same)/count(the)
– count(the,same) = 628
– count(the) = 69,971
– bigram_probability = 628/69971 = 0.00898
• Additional steps
– Include the probability that a word occurs at the beginning of a sentence, e.g., bigram(START, the)
– Include the probability that a token occurs at the end of a sentence, e.g., bigram(., END)
– Include non-zero probability for case when an unknown word follows a known one.
Backoff Model
– If a bigram has a zero count, “back off” to the unigram of the word
• replace bigram(previous_word, current_word) with unigram(current_word)
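The backoff rule can be sketched directly from counts. The corpus below is a toy stand-in, so the figures are illustrative rather than the Brown values:

```python
# Sketch: bigram probability with backoff to the unigram when the
# bigram count is zero. Toy corpus; not the Brown figures.
from collections import Counter

tokens = "the same as the same day the other day".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def bigram_prob(prev, word):
    # count(prev, word) / count(prev)
    if bigrams[(prev, word)]:
        return bigrams[(prev, word)] / unigrams[prev]
    # zero bigram count: back off to unigram(word)
    return unigrams[word] / total

print(bigram_prob("the", "same"))  # seen bigram
print(bigram_prob("the", "day"))   # unseen bigram: backs off
```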

NLTK bigram probability of sample sentence
• *start_end* a 0.0182 a fact 0.000388 fact about 0.00447
• about the 0.182 the *oov* 0.0293 *oov* is 0.00485
• is the 0.0786 the same 0.00898 same as 0.035 as an 0.029
• an alternative 0.00241
• alternative fact 0.000385 (Backing off to unigram probability for fact)
• fact about 0.00447 about the 0.182 the *oov* 0.0293
• *oov* . 0.0865 . *start_end* 1.0
• Total = product of the above probabilities = 1.12e-30
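Multiplying the listed probabilities reproduces the total. A short sketch (the figures are copied from the list above, which mixes bigram and backed-off unigram estimates):

```python
# Sketch: sentence score as the product of the bigram (and one
# backed-off unigram) probabilities listed above.
import math

probs = [0.0182, 0.000388, 0.00447, 0.182, 0.0293, 0.00485,
         0.0786, 0.00898, 0.035, 0.029, 0.00241, 0.000385,
         0.00447, 0.182, 0.0293, 0.0865, 1.0]
sentence_prob = math.prod(probs)
print(sentence_prob)  # ≈ 1.12e-30

# Products this small underflow easily; in practice, sum log
# probabilities instead of multiplying raw probabilities:
log_prob = sum(math.log(p) for p in probs)
```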

Trigrams, 4-grams, N-grams
• Trigram Probability
– Prob(third token | preceding 2 tokens)
– count(w−2,w−1,w)/count (w−2,w−1)
– count(the, same, as)/count(the, same)
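The trigram estimate follows the same count ratio; a minimal sketch on a toy token list:

```python
# Sketch: trigram probability as count(w-2, w-1, w) / count(w-2, w-1).
from collections import Counter

tokens = "the same as the same as usual the same day".split()
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

def trigram_prob(w2, w1, w):
    return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]

print(trigram_prob("the", "same", "as"))
```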

In this way, we can use NLTK to test unigram, bigram, and trigram character models with these probability functions. The Markov assumption (each token depends only on a fixed number of preceding tokens) is what lets each model assign a probability from these local counts.
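To address the original question directly, the same machinery applies at the character level: train one character n-gram model per language, then guess the language of an unseen word as the one whose model assigns it the highest probability. A minimal sketch with character bigrams and add-alpha smoothing; the one-line training strings are tiny stand-ins for real corpora (e.g., samples from nltk.corpus.udhr):

```python
# Sketch: guessing the language of an unseen word with per-language
# character bigram models. Training strings are toy stand-ins; real
# use would train on corpora such as nltk.corpus.udhr samples.
import math
from collections import Counter

def char_bigram_model(text, alpha=1.0):
    text = f"#{text}#"  # '#' marks the boundaries
    bi = Counter(zip(text, text[1:]))
    uni = Counter(text)
    vocab = len(set(text)) + 1
    def prob(prev, ch):
        # add-alpha smoothing so unseen character pairs get
        # non-zero probability
        return (bi[(prev, ch)] + alpha) / (uni[prev] + alpha * vocab)
    return prob

def score(word, model):
    # log probability of the word's character bigrams
    word = f"#{word}#"
    return sum(math.log(model(a, b)) for a, b in zip(word, word[1:]))

models = {
    "english": char_bigram_model("the quick brown fox and the lazy dog"),
    "spanish": char_bigram_model("el rapido zorro marron y el perro perezoso"),
}

def guess(word):
    # pick the language whose model scores the word highest
    return max(models, key=lambda lang: score(word, models[lang]))

print(guess("perro"))
```

The same pattern extends to character unigram and trigram models by changing what is counted; comparing the three on held-out words shows how much the extra context helps.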

