
Word Prediction using concepts of N – grams and CDF

Last Updated : 24 Oct, 2021

Prerequisite: a basic understanding of CDF and N-grams.
Problem Statement – Given an input word and a text file, predict the next n words that can occur after the input word in the text file.

Examples: 

Input  :  is 
Output :  is it simply makes sure that there are never

Input  :  is
Output :  is split, all the maximum amount of objects, it

Input  :  the
Output : the exact same position. There will be some.

Note – To illustrate the example, I have assigned the variable corpus to some sample text. If you want to test on real-world text data, you can find the data here.

Solution – We can approach this problem using probability. First, we calculate the frequency of every word that occurs immediately after the input in the text file (n-grams; here a 1-gram, because we always look for the single next word in the data file). From those frequencies we compute the CDF of the candidate words and pick one at random: we draw a uniform random number and choose the word with the smallest CDF value greater than or equal to that number. Because the CDF accumulates each word's probability, words that occur more often after the input occupy a larger slice of the [0, 1] interval and are therefore more likely to be chosen.
After finding the word via the CDF, we append it to the output string. We could then either append the word to the input string and repeat the process on the whole string, or simply feed the newly found word back in on its own. The code below uses the latter approach: each iteration predicts the next word from the word just produced.
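The frequency-to-CDF-to-sample step can be sketched in isolation. This is a minimal illustration using a hypothetical frequency table (the words and counts are made up, not taken from the corpus below):

```python
import random
from collections import Counter

# Hypothetical frequencies of words observed after some input word.
freqs = Counter({'cat': 3, 'dog': 2, 'end': 5})

# Build the CDF: a running sum of each word's probability (PMF).
total = sum(freqs.values())
cdf = []
running = 0.0
for word, count in freqs.items():
    running += count / total
    cdf.append((word, running))
# cdf is now [('cat', 0.3), ('dog', 0.5), ('end', 1.0)]

# Sample: the predicted word is the first one whose CDF value
# is greater than or equal to a uniform random number.
r = random.uniform(0, 1)
next_word = next(word for word, c in cdf if r <= c)
```

Since 'end' owns the half-open interval (0.5, 1.0], it is picked about half the time, matching its relative frequency.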

Note – You may get a different output if you enter the same word multiple times. This depends on the size of your data file: the larger the file, the higher the probability of a different output.
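The variation comes entirely from the random draw. If you want repeatable output while experimenting, you can seed Python's random number generator before each run; a minimal sketch:

```python
import random

# Seeding the generator makes random.uniform reproducible,
# so the same CDF lookup selects the same word every run.
random.seed(42)
a = random.uniform(0, 1)

random.seed(42)
b = random.uniform(0, 1)

# a == b, so the same word would be chosen both times.
```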

Code for the above algorithm:

Python3




import random
from collections import Counter
 
# This function counts the frequency of each word that
# appears immediately after the given sentence/word
# in the corpus.
 
def next_word_freq(array, sentence):
     
    sen_len, word_list = len(sentence.split()), []
     
    for i in range(len(array)):
 
        # If the sentence matches the sentence in the range (i, i+x)
        # and the length is less than the length of the corpus, append
        # the word to word_list.
         
        if ' '.join(array[i : i + sen_len]).lower() == sentence.lower():
 
            if i + sen_len < len(array):
 
                word_list.append(array[i + sen_len])
 
    # Return the count of each word in word_list
     
    return dict(Counter(word_list))
 
# Calculate the CDF of each word in the
# Counter dictionary.
 
def CDF(d):
     
    prob_sum, sum_vals = 0, sum(d.values())
     
    for k, v in d.items():
 
        # Calculate the PMF of each word by dividing
        # the freq. by total of all frequencies then add
        # all the PMFs till ith word which is the CDF of
        # the ith word.
         
        pmf = v / sum_vals
        prob_sum += pmf
        d[k] = prob_sum
 
    # Return cdf dictionary
     
    return d
 
# The main function takes the input sentence/word and the
# corpus, then repeatedly predicts the next n - x words.
 
 
def main(sent, x, n):
 
    # This sample text is used here to illustrate the output.
    # To use a text file as the corpus instead, uncomment the
    # line below.
 
    # corpus = open('a.txt','r').read()
 
    corpus = '''text The chance is unlikely if not done programmatically.
    However, imagine the game spawning multiple players at a spawn point,
    this would be the exact same location. I'm not quite sure what you
    mean with spin,     what does the integer reflect? Why is it a
    mismatch between data and structure? The structure does not
    assume a set amount of objects, it can be anything, that's why new
    nodes are created. It simply makes sure that there are not more than
    X leafs inside 1 node. The random is no option of course.
    My splitting algorithm always created the maximum amount of nodes
    already, split over the current node. But I guess I have to change
    this behaviour? Actually, all the books have different authors. And
    most have a different location too. There will be some with the same
    location, but different authors, though. I think my library should be
    able to store books with the same position. There are never
    equally-attractive leaf nodes. If a node is split, all childs will
    reflect a different part of the parent node.'''
     
    l = corpus.split()
 
    # "temp_out" will be used to store each partial sentence
    # which will later be stored into "sent". "out" is used to store
    # the final output.
     
    temp_out = ''
    out = sent + ' '
     
    for i in range(n - x):
 
        # calling the next_word_freq method that returns
        # the frequency of each word next to sent in the
        # whole word corpus.
         
        func_out = next_word_freq(l, sent)
 
        # cdf_dict stores the cdf of each word in the above map
        # that is calculated using method CDF.
         
        cdf_dict = CDF(func_out)
         
        # Draw a uniform random number; the predicted word is the
        # first one whose CDF value is greater than or equal to it.
         
        rand = random.uniform(0, 1)
 
        # If cdf_dict is empty, the word/sentence you entered does
        # not exist in the corpus: zip() then yields nothing to
        # unpack and raises a ValueError. In that case, break out
        # of the loop and just print what has been generated so far.

        try:
            key, val = zip(*cdf_dict.items())
        except ValueError:
            break
 
        # Iterate through the CDF values and find the smallest value
        # greater than or equal to the random number. That value is
        # the CDF of the predicted word. Add its key to the output
        # string and continue the chain from that word.

        # Default to the last word, which guards against floating-point
        # round-off leaving the final CDF value just below rand.
        pos = len(val) - 1

        for j in range(len(val)):

            if rand <= val[j]:
                pos = j
                break

        temp_out = key[pos]
        out = out + temp_out + ' '
        sent = temp_out
         
    print(out, end = '\n\n')
 
if __name__ == '__main__':
 
    inp_sent = 'is'
    # The output will have at most 10 words, including the input
    # sentence/word (fewer if the chain reaches a word that is
    # absent from the corpus).
    main(inp_sent, len(inp_sent.split()), 10)
 
# Code contributed by Gagan Talreja.


The concept shown above is used in fields like Natural Language Processing. This is a naive approach, intended only to illustrate the idea; there are many more sophisticated algorithms for word prediction, and you can find one of them here.
 


