Open In App

Real time optimized KMP Algorithm for Pattern Searching

Last Updated : 20 Feb, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

In the article, we have already discussed the KMP algorithm for pattern searching. In this article, a real-time optimized KMP algorithm is discussed. From the previous article, it is known that KMP(a.k.a. Knuth-Morris-Pratt) algorithm preprocesses the pattern P and constructs a failure function F(also called as lps[]) to store the length of the longest suffix of the sub-pattern P[1..l], which is also a prefix of P, for l = 0 to m-1. Note that the sub-pattern starts at index 1 because a suffix can be the string itself. After a mismatched occurred at index P[j], we update j to F[j-1]. The original KMP Algorithm has the runtime complexity of O(M + N) and auxiliary space O(M), where N is the size of the input text and M is the size of the pattern. Preprocessing step costs O(M) time. It is hard to achieve runtime complexity better than that but we are still able to eliminate some inefficient shifts. Inefficiencies of the original KMP algorithm: Consider the following case by using the original KMP algorithm:

Input: T = “cabababcababaca”, P = “ababaca” Output: Found at index 8

The longest proper prefix or lps[] for the above test case is {0, 0, 1, 2, 3, 0, 1}. Lets assume that the red color represents a mismatch occurs, green color represents the checking we skipped. Therefore, the searching process according to the original KMP algorithm occurs as follows: One thing which can be noticed is that in the third, fourth, and fifth matching, the mismatch occurs at the same location, T[7]. If we can skip the fourth and fifth matching, then the original KMP algorithm can further be optimised to answer the real-time queries. Real-time Optimization: The term real-time in this case can be interpreted as checking each character in the text T at most once. Our goal in this case is to shift the pattern properly (just like KMP algorithm does), but no need to check the mismatched character again. That is, for the same above example, the optimized KMP algorithm should work in the following way:

Approach: One way to achieve the goal is to modify the preprocessing process.

  • Let K be the size of the letters of the pattern P. We will construct a failure table to contain K failure functions(i.e. lps[]).
  • Each failure function in the failure table is mapped to a character(key in the failure table) in the alphabet of the pattern P.
  • Recall that the original failure function F[l] (or lps[]) stores the length of the longest suffix of P[1..l], which is also a prefix of P, for l = 0 to m-1, where m is the size of the pattern.
  • If a mismatched occurs at T[i] and P[j], the new value of j would be updated to F[j-1] and the counter ‘i’ would be unchanged.
  • In our new failure table FT[][], if a failure function F’ is mapped with a character c, F'[l] should store the length of the longest suffix of P[1..l] + c (‘+’ represents appending), which is also a prefix of P, for l = 0 to m-1.
  • The intuition is to make proper shifts but also depending on the mismatched character. Here the character c, which is also a key in the failure table, is our “guess” about the mismatched character in the text T.
  • That is, if the mismatched character is c, how should we shift the pattern properly? Since we are constructing the failure table in the preprocessing step, we have to make enough guesses about the mismatched character.
  • Hence, the number of lps[]’s in the failure table equals to the size of the alphabet of the pattern, and each value, the failure function, should be different with respect to the key, a character in P.
  • Assume we have already constructed the desired failure table. Let FT[][] be the failure table, T be the text, P be the pattern.
  • Then, in the matching process, if a mismatch occurs at T[i] and P[j] (i.e. T[i] != P[j]):
    1. If T[i] is a character in P, j will be updated to FT[T[i]][j-1], ‘i‘ will be updated to ‘i + 1‘. We are doing this since we are guaranteed that T[i] is matched or skipped.
    2. If T[i] is not a character, ‘j’ will be updated to 0, ‘i’ will be updated to ‘i + 1’.
  • Note that if a mismatching does not occur, the behaviour is exactly the same as the original KMP algorithm.

Constructing Failure Table:

  • To construct the failure table FT[][], we will need the failure function F(or lps[]) from the original KMP algorithm.
  • Since F[l] tells us the length of the longest suffix of the sub-pattern P[1..l], which is also a prefix of P, the values stored in the failure table is one step beyond it.
  • That is, for any key t in the failure table FT[][], the values stored in FT[t] is a failure function that satisfies for the character ‘t’ and FT[t][l] stores the length of the longest suffix of a sub-pattern P[1..l] + t(‘+’ means append), which is also a prefix of P, for l from 0 to m-1.
  • F[l] has already guaranteed that P[0..F[l]-1] is the longest suffix of the sub-pattern P[1..l], so we will need to check if P[F[l]] is t.
  • If true, then we can assign FT[t][l] to be F[l] + 1, as we are guaranteed that P[0..F[l]] is the longest suffix of the sub-pattern P[1..l] + t.
  • If false, that indicates P[F[l]] is not t. That is, we fail the matching at character P[F[l]] with the character t, but P[0..F[l]-1] matches a suffix of P[1..l].
  • By borrowing the idea from KMP algorithm, just like how we compute the failure function in original KMP algorithm, if the mismatch occurs at P[F[l]] with mismatched character t, we would like to update the next matching starting at FT[t][F[l]-1].
  • That is, we use the idea of the KMP algorithm to compute the failure table. Notice that F[l] – 1 is always less than l, so when we are computing FT[t][l], FT[t][F[l] – 1] is ready for us already.
  • One special case is that if F[l] is 0 and P[F[l]] is not t, F[l] – 1 has a value of -1, in this case, we will update FT[t][l] to 0. (i.e. there is no suffix of P[1..l] + t exists such that it is a prefix of P.)
  • As a conclusion of failure table construction, when we are computing FT[t][l], for any key t and l from 0 to m-1, we will check:
If P[F[l]] is t,
  if yes:
    FT[t][l] <- F[l] + 1;
  if no: 
    check if F[l] is 0,
      if yes:
        FT[t][l] <- 0;
      if no:
        FT[t][l] <- FT[t][F[t] - 1];
  • The new preprocessing step has a running time complexity of O(|\Sigma_P| \cdot M) , where \Sigma_P is the alphabet set of pattern P, M is the size of P.
  • The whole modified KMP algorithm has a running time complexity of O(|\Sigma_P| \cdot M + N ). The auxiliary space usage of O(|\Sigma_P| \cdot M ).
  • The running time and space usage look like “worse” than the original KMP algorithm. However, if we are searching for the same pattern in multiple texts or the alphabet set of the pattern is small, as the preprocessing step only needs to be done once and each character in the text will be compared at most once (real-time). So, it is more efficient than the original KMP algorithm and good in practice.

The space complexity is also O(n) because the failure table takes up O(n) space, and the failure function takes up O(n) space as well.



Similar Reads

KMP Algorithm for Pattern Searching
Given a text txt[0 . . . N-1] and a pattern pat[0 . . . M-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that N &gt; M. Examples: Input: txt[] = "THIS IS A TEST TEXT", pat[] = "TEST"Output: Pattern found at index 10 Input: txt[] = "AABAACAADAABAABA" pat[] = "AABA"Output: Pattern fou
15+ min read
Optimized Algorithm for Pattern Searching
Question: We have discussed the Naive String matching algorithm here. Consider a situation where all characters of a pattern are different. Can we modify the original Naive String Matching algorithm so that it works better for these types of patterns? If we can, then what are the changes to the original algorithm? Solution: In the original Naive St
7 min read
Z algorithm (Linear time pattern searching Algorithm)
This algorithm finds all occurrences of a pattern in a text in linear time. Let length of text be n and of pattern be m, then total time taken is O(m + n) with linear space complexity. Now we can see that both time and space complexity is same as KMP algorithm but this algorithm is Simpler to understand.In this algorithm, we construct a Z array. Wh
13 min read
Count of occurrences of each prefix in a string using modified KMP algorithm
Given a string S of size N, the task is to count the occurrences of all the prefixes of the given string S. Examples: Input: S = "AAAA" Output: A occurs 4 times AA occurs 3 times. AAA occurs 2 times. AAAA occurs 1 times. Explanation: Below is the illustration of all the prefix: Input: S = "ABACABA" Output: A occurs 4 times AB occurs 2 times ABA occ
15 min read
Remove all occurrences of string t in string s using KMP Algorithm
Given two strings s and t, the task is to remove all occurrences of t in s and return the modified string s, and you have to use the KMP algorithm to solve this. Examples: Input: s = "abcdefgabcabcabdefghabc", t = "abc"Output: "defgdefgh" Input: s = "aaabbbccc", t = "bbb"Output: "aaaccc" Approach: To solve the problem follow the below idea: We will
8 min read
Prefix Function and KMP Algorithm for Competitive Programming
The prefix function is a string matching technique used in computer science and string algorithms. It efficiently computes an array that represents the length of the longest proper prefix which is also a suffix for each prefix of a given string. The Knuth-Morris-Pratt (KMP) algorithm utilizes the prefix function to perform pattern matching in linea
15+ min read
Rabin-Karp algorithm for Pattern Searching in Matrix
Given matrices txt[][] of dimensions m1 x m2 and pattern pat[][] of dimensions n1 x n2, the task is to check whether a pattern exists in the matrix or not, and if yes then print the top most indices of the pat[][] in txt[][]. It is assumed that m1, m2 ? n1, n2 Examples: Input: txt[][] = {{G, H, I, P} {J, K, L, Q} {R, G, H, I} {S, J, K, L} } pat[][]
15+ min read
Rabin-Karp Algorithm for Pattern Searching
Given a text T[0. . .n-1] and a pattern P[0. . .m-1], write a function search(char P[], char T[]) that prints all occurrences of P[] present in T[] using Rabin Karp algorithm. You may assume that n &gt; m. Examples: Input: T[] = "THIS IS A TEST TEXT", P[] = "TEST"Output: Pattern found at index 10 Input: T[] = "AABAACAADAABAABA", P[] = "AABA"Output:
15 min read
Finite Automata algorithm for Pattern Searching
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n &gt; m.Examples: Input: txt[] = "THIS IS A TEST TEXT" pat[] = "TEST" Output: Pattern found at index 10 Input: txt[] = "AABAACAADAABAABA" pat[] = "AABA" Output: Pattern found at inde
13 min read
Aho-Corasick Algorithm for Pattern Searching
Given an input text and an array of k words, arr[], find all occurrences of all words in the input text. Let n be the length of text and m be the total number of characters in all words, i.e. m = length(arr[0]) + length(arr[1]) + ... + length(arr[k-1]). Here k is total numbers of input words. Example: Input: text = "ahishers" arr[] = {"he", "she",
15+ min read