Word Prediction using concepts of N – grams and CDF
Last Updated :
24 Oct, 2021
Have some basic understanding about – CDF and N – grams
Problem Statement – Given any input word and text file, predict the next n words that can occur after the input word in the text file.
Examples:
Input : is
Output : is it simply makes sure that there are never
Input : is
Output : is split, all the maximum amount of objects, it
Input : the
Output : the exact same position. There will be some.
Note – For illustrating the example, I have assigned the variable corpus to some text. If you want to test data on real world text data, you can find the data here.
Solution – We can approach this problem using the concepts of probability. Firstly we must calculate the frequency of all the words occurring just after the input in the text file(n-grams, here it is 1-gram, because we always find the next 1 word in the whole data file). Then using those frequencies, calculate the CDF of all these words and just choose a random word from it. To choose this random word, we take a random number and find the smallest CDF greater than or equal the random number. We do so because we want the most probable answer for each case. So that can be achieved by cdf as it gives the cumulative probability for each word in the list.
After finding the CDF, we can easily find the corresponding word and append that word to the output string. Now, if you wish, you can also append the word to the input string and send the whole string to repeat the process to find the next word, or you can just send the word that you found out using cdf. I have done that using the former approach.
Note – You will get a different output if you enter the same word multiple times. That depends on the size of your data file. Larger the file, more probability of a different output.
Code for above algorithm
Python3
import random
from collections import Counter
def next_word_freq(array, sentence):
sen_len, word_list = len (sentence.split()), []
for i in range ( len (array)):
if ' ' .join(array[i : i + sen_len]).lower() = = sentence.lower():
if i + sen_len < len (array) - 1 :
word_list.append(array[i + sen_len])
return dict (Counter(word_list))
def CDF(d):
prob_sum, sum_vals = 0 , sum (d.values())
for k, v in d.items():
pmf = v / sum_vals
prob_sum + = pmf
d[k] = prob_sum
return d
def main(sent, x, n):
corpus =
l = corpus.split()
temp_out = ''
out = sent + ' '
for i in range (n - x):
func_out = next_word_freq(l, sent)
cdf_dict = CDF(func_out)
rand = random.uniform( 0 , 1 )
try : key, val = zip ( * cdf_dict.items())
except : break
for j in range ( len (val)):
if rand < = val[j]:
pos = j
break
temp_out = key[pos]
out = out + temp_out + ' '
sent = temp_out
print (out, end = '\n\n' )
if __name__ = = '__main__' :
inp_sent = 'is'
main(inp_sent, len (inp_sent), 10 )
|
The concept shown above is used in fields like Natural Language Processing. This is a naive approach just to illustrate the concept. Actually, there are much more algorithms out there for word prediction. You can find one of them here
Please Login to comment...