
Python – Substituting patterns in text using regex

Last Updated : 29 Dec, 2020

Regular expressions (regex) are used to extract required information from text by matching patterns. They are also widely used to manipulate pattern-based text during text preprocessing, which makes them very useful in fields such as Natural Language Processing (NLP).

This article demonstrates how to use regex to substitute patterns through multiple examples, each covering a distinct scenario. To follow the solutions, it is necessary to understand the re.sub() method of the re (regular expression) module.

The re.sub() method searches the whole string and, by default, replaces every non-overlapping match of the pattern with the replacement. It takes 5 arguments in total.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

Parameters:
pattern – the regular expression to be searched for and substituted
repl – the string with which each match is replaced (a function can also be passed)
string – the string in which the pattern is searched
count – the maximum number of pattern occurrences to be replaced; the default 0 replaces all occurrences
flags – modifies how the pattern is matched, e.g., re.IGNORECASE

count and flags are optional arguments.
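
As a minimal sketch of the two optional arguments, the snippet below passes count and flags as keyword arguments. The sample string and the variable names are illustrative assumptions and are not taken from the examples in this article.

```python
import re

text = "Cat, cat, CAT and cat."

# count=1 stops after the first match of the (case-sensitive) pattern
first_only = re.sub(r"cat", "dog", text, count=1)

# flags=re.IGNORECASE makes the pattern match regardless of letter case
any_case = re.sub(r"cat", "dog", text, flags=re.IGNORECASE)

print(first_only)  # Cat, dog, CAT and cat.
print(any_case)    # dog, dog, dog and dog.
```

Passing both arguments by keyword keeps the call readable and avoids confusing count with flags.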

Example 1: Substitution of a specific text pattern
In this example, a given text pattern will be searched and substituted in a string. The idea is to use the basic form of the re.sub() method with only the first 3 arguments.

Below is the implementation.




# Python implementation of substituting a 
# specific text pattern in a string using regex
  
# importing regex module
import re
  
# Function to perform
# operations on the strings
def substitutor():
      
    # a string variable
    sentence1 = "It is raining outside."
      
    # replacing text 'raining' in the string 
    # variable sentence1 with 'sunny' thus
    # passing first parameter as raining
    # second as sunny, third as the 
    # variable name in which string is stored
    # and printing the modified string
    print(re.sub(r"raining", "sunny", sentence1))
      
    # a string variable
    sentence2 = "Thank you very very much."
      
    # replacing text 'very' in the string 
    # variable sentence2 with 'so' thus 
    # passing parameters at their 
    # appropriate positions and printing 
    # the modified string
    print(re.sub(r"very", "so", sentence2))
  
# Driver Code: 
substitutor()


Output:

It is sunny outside.
Thank you so so much.

No matter how many times the pattern occurs in the string, the re.sub() function replaces all occurrences with the given replacement string. That’s why both occurrences of ‘very’ are replaced by ‘so’ in the above example.
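
If the number of replacements is also needed, the related re.subn() function from the same module can be used. The sketch below reuses the sentence from the example above; the variable names are assumptions chosen for illustration.

```python
import re

sentence = "Thank you very very much."

# re.subn() returns a tuple: (new_string, number_of_substitutions_made)
result, replaced = re.subn(r"very", "so", sentence)

print(result)    # Thank you so so much.
print(replaced)  # 2
```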

 

Example 2: Substituting a character set with a specific character
The task is to replace a character set with a given character. A character set matches any single character from a specified set or range of characters. In a regex pattern, a character set is written inside [ ] (square brackets).

In this example, every lowercase letter, i.e., every match of [a-z], will be replaced by the digit 0. Below is the implementation.




# Python implementation of substituting 
# a character set with a specific character
  
# importing regex module
import re
  
# Function to perform
# operations on the strings
def substitutor():
      
    # a string variable
    sentence = "22 April is celebrated as Earth Day."
  
    # replacing every lowercase character
    # in the variable sentence with 0 and 
    # printing the modified string
    print(re.sub(r"[a-z]", "0", sentence))
      
# Driver Code: 
substitutor()


Output:

22 A0000 00 0000000000 00 E0000 D00.

If both lowercase and uppercase letters need to be substituted, the uppercase range can be added to the character set, i.e., [a-zA-Z]. A more convenient way is to use flags.
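
A minimal sketch of the [a-zA-Z] variant, assuming the same sentence as in Example 2 (the variable name masked is illustrative):

```python
import re

sentence = "22 April is celebrated as Earth Day."

# [a-zA-Z] lists both ranges explicitly, so no flag is needed
masked = re.sub(r"[a-zA-Z]", "0", sentence)

print(masked)  # 22 00000 00 0000000000 00 00000 000.
```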

 

Example 3: Case-insensitive substitution of a character set with a specific character
In this example, both lowercase and uppercase characters will be replaced by the given character. With the use of flags, this task can be carried out very easily.

The re.I flag is short for re.IGNORECASE. With this flag passed to the re.sub() method, specifying only one of the character sets, lowercase or uppercase, is enough to match both.

Below is the implementation.




# Python implementation of case-insensitive substitution
# of a character set with a specific character
  
# importing regex module
import re
  
# Function to perform
# operations on the strings
def substitutor():
      
    # a string variable
    sentence = "22 April is celebrated as Earth Day."
  
    # replacing both lowercase and
    # uppercase characters with 0 in  
    # the variable sentence by using 
    # flag and printing the modified string 
    print(re.sub(r"[a-z]", "0", sentence, flags = re.I))
      
# Driver Code: 
substitutor()


Output:

22 00000 00 0000000000 00 00000 000.
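
The same behaviour can also be requested inside the pattern itself with the inline flag (?i) placed at the very start of the expression. The following sketch compares both spellings on the sentence used above; the variable names are assumptions for illustration.

```python
import re

sentence = "22 April is celebrated as Earth Day."

# flag passed as a keyword argument
with_flag = re.sub(r"[a-z]", "0", sentence, flags=re.IGNORECASE)

# equivalent inline flag at the start of the pattern
with_inline = re.sub(r"(?i)[a-z]", "0", sentence)

print(with_flag == with_inline)  # True
```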

 

Example 4: Perform substitution up to a certain number of occurrences
In this example, substitution is performed only on the first few matches and not on the whole string. For this type of substitution, the re.sub() method has the count argument.

By providing a numeric value to this argument, the maximum number of matches to be replaced can be controlled. Since the pattern [a-z] matches exactly one character at a time, a count of 8 replaces the first eight letters. Below is the implementation.




# Python implementation to perform substitution
# up to a certain number of matches
  
# importing regex module
import re
  
# Function to perform
# operations on the strings
def substitutor():
      
    # a string variable
    sentence = "Follow your Passion."
  
    # case-insensitive substitution of the first
    # eight matches (here, eight letters) in the
    # variable sentence and printing the modified
    # string; count is passed as a keyword argument
    print(re.sub(r"[a-z]", "0", sentence, count = 8, flags = re.I))
      
# Driver Code: 
substitutor()


Output:

000000 00ur Passion.
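
Note that count limits the number of matches, not the number of characters. A minimal sketch, assuming the same sentence but a pattern that matches whole words (the variable name two_words is illustrative):

```python
import re

sentence = "Follow your Passion."

# [a-z]+ matches a whole run of letters, so count=2
# replaces the first two words, not the first two characters
two_words = re.sub(r"[a-z]+", "0", sentence, count=2, flags=re.IGNORECASE)

print(two_words)  # 0 0 Passion.
```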

 

Example 5: Substitution using shorthand character classes and preprocessing of text
The regex module provides shorthand character classes for character sets that are very common in text preprocessing. Using them makes patterns shorter and reduces the need to remember the range of every character set.

Following are some of the commonly used shorthand character classes:

\w: matches alphanumeric characters and the underscore
\W: matches non-word characters like @, #, ‘, +, %, –
\d: matches digit characters
\s: matches white space characters
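
The shorthand classes listed above can be tried out on a small string. This is only a sketch; the sample string and the variable names are assumptions made for illustration.

```python
import re

sample = "ID 42: ok!"

digits_masked = re.sub(r"\d", "#", sample)    # every digit becomes '#'
spaces_marked = re.sub(r"\s", "_", sample)    # every white space becomes '_'
symbols_removed = re.sub(r"\W", "", sample)   # every non-word character is dropped
words_masked = re.sub(r"\w", "x", sample)     # every word character becomes 'x'

print(digits_masked)    # ID ##: ok!
print(spaces_marked)    # ID_42:_ok!
print(symbols_removed)  # ID42ok
print(words_masked)     # xx xx: xx!
```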

Meaning of some syntax:
adding a plus (+) symbol after a character class or set: the preceding character class or set is matched one or more times.

adding an asterisk (*) symbol after a character class or set: the preceding character class or set is matched zero or more times.

adding a caret (^) symbol at the start of a pattern: the match is anchored at the beginning of the string. (Inside square brackets, a leading ^ negates the character set instead.)

adding a dollar ($) symbol at the end of a pattern: the match is anchored at the end of the string.
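
A short sketch of the quantifiers and anchors described above. The sample strings and variable names are assumptions chosen for illustration.

```python
import re

padded = "  a1b22c333  "

# + : each run of one or more digits is replaced by a single '#'
digit_runs = re.sub(r"\d+", "#", padded)

# ^ : only white space at the beginning of the string is removed
no_leading = re.sub(r"^\s+", "", padded)

# $ : only white space at the end of the string is removed
no_trailing = re.sub(r"\s+$", "", padded)

# * : 'a' followed by zero or more 'b' characters
optional_b = re.sub(r"ab*", "X", "a ab abbb")

print(repr(digit_runs))   # '  a#b#c#  '
print(repr(no_leading))   # 'a1b22c333  '
print(repr(no_trailing))  # '  a1b22c333'
print(optional_b)         # X X X
```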

This example demonstrates the use of mentioned shorthand character classes for the substitution and preprocessing of text to get clean and error-free strings. Below is the implementation.




# Python implementation of Substitution using 
# shorthand character class and preprocessing of text
  
# importing regex module
import re
  
# Function to perform
# operations on the strings
def substitutor():
      
    # list of strings
    S = ["2020 Olympic games have @# been cancelled",
     "Dr Vikram Sarabhai was +%--the ISRO’s first chairman",
     "Dr Abdul            Kalam, the father      of India's missile programme"]
      
    # loop to iterate every element of list
    for i in range(len(S)):
          
        # replacing every non-word character with a white space
        S[i] = re.sub(r"\W", " ", S[i])
          
        # replacing every digit character with a white space
        S[i] = re.sub(r"\d", " ", S[i])
          
        # replacing one or more white space with a single white space
        S[i] = re.sub(r"\s+", " ", S[i])
          
        # replacing a single alphabetic character surrounded by
        # white space (e.g., a stray 's') with a white space
        S[i] = re.sub(r"\s+[a-z]\s+", " ", S[i], flags = re.I)
          
        # substituting one or more white space which is at 
        # beginning of the string with an empty string
        S[i] = re.sub(r"^\s+", "", S[i])
          
        # substituting one or more white space which is at
        # end of the string with an empty string
        S[i] = re.sub(r"\s+$", "", S[i])
      
    # loop to iterate every element of list
    for i in range(len(S)):
          
        # printing each modified string
        print(S[i])
      
# Driver Code: 
substitutor()    


Output:

Olympic games have been cancelled
Dr Vikram Sarabhai was the ISRO first chairman
Dr Abdul Kalam the father of India missile programme
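
The same cleaning steps can be wrapped in a reusable function. The sketch below is one possible arrangement, not the only one: the helper name clean and the module-level pattern names are hypothetical, the patterns are precompiled with re.compile(), and str.strip() stands in for the two anchored substitutions at the end.

```python
import re

# Hypothetical names; the patterns are the ones used in Example 5
NON_WORD = re.compile(r"\W")
DIGIT = re.compile(r"\d")
SPACES = re.compile(r"\s+")
LONE_LETTER = re.compile(r"\s+[a-z]\s+", flags=re.IGNORECASE)

def clean(text):
    """Apply the Example 5 substitutions in the same order."""
    text = NON_WORD.sub(" ", text)     # non-word characters -> space
    text = DIGIT.sub(" ", text)        # digits -> space
    text = SPACES.sub(" ", text)       # collapse runs of white space
    text = LONE_LETTER.sub(" ", text)  # drop isolated single letters
    return text.strip()                # trim leading/trailing white space

print(clean("2020 Olympic games have @# been cancelled"))
# Olympic games have been cancelled
```

Compiling the patterns once is mainly a readability choice here; it also avoids repeating the pattern strings when many strings are cleaned.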

