How to Remove repetitive characters from words of the given Pandas DataFrame using Regex?

Last Updated : 06 Feb, 2023

Prerequisite: Regular Expression in Python

In this article, we will see how to remove consecutively repeated characters from the words in a given column of a Pandas DataFrame using Regex.

To find runs of a repeated character, we use the regular expression (\w)\1+. Here (\w) captures a single word character (a letter, digit or underscore) as group 1, and \1+ matches one or more further copies of that same captured character. The pattern therefore matches any run where a character occurs two or more times in a row.
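As a quick illustration of what the pattern matches, the minimal sketch below runs it on one of the comment strings used later in this article. The variable names (pattern, runs, collapsed) are only for illustration.

```python
import re

# (\w) captures one word character, \1+ matches one or more repeats of it
pattern = re.compile(r'(\w)\1+')

text = 'twiiter is the best way to comment'

# every run of a repeated character found in the text
runs = [m.group(0) for m in pattern.finditer(text)]
print(runs)        # ['ii', 'mm']

# replace each run with a single copy of the captured character
collapsed = pattern.sub(r'\1', text)
print(collapsed)   # twiter is the best way to coment
```

Since \w does not match whitespace, repeated spaces are left untouched by this pattern.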

We pass this pattern to the re.sub() function of the re library.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

The ‘sub’ in the function name stands for substitute. The regular expression pattern is searched for in the given string (3rd parameter), and every non-overlapping match is replaced by repl (2nd parameter). repl can be either a replacement string or a function that receives the match object and returns the replacement text. count is the maximum number of matches to replace; the default of 0 replaces all of them.

Now, let’s create a DataFrame:
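The short sketch below shows how the repl and count arguments behave. The sample string 'heyyy buddyyy' is made up for this illustration and is not part of the article's data.

```python
import re

text = 'heyyy buddyyy'   # hypothetical sample string

# repl as a string: \1 refers to the character captured by (\w)
all_replaced = re.sub(r'(\w)\1+', r'\1', text)
print(all_replaced)      # hey budy

# count=1 replaces only the first match
first_only = re.sub(r'(\w)\1+', r'\1', text, count=1)
print(first_only)        # hey buddyyy

# repl as a function: it receives the match object
# and returns the replacement text
upper_case = re.sub(r'(\w)\1+', lambda m: m.group(1).upper(), text)
print(upper_case)        # heY buDY
```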

Python3




# importing required libraries
import pandas as pd
import re
 
# creating Dataframe with column
# as name and common_comments
df = pd.DataFrame(
  {
    'name' : ['Akash', 'Ayush', 'Diksha',
              'Priyanka', 'Radhika'],
     
    'common_comments' : ['hey buddy meet me today ',
                         'sorry bro i cant meet',
                         'hey akash i love geeksforgeeks',
                         'twiiter is the best way to comment',
                         'geeksforgeeks is good for learners']
    },
   
    columns = ['name', 'common_comments']
)
# printing Dataframe
df


 
 

Output:

 

 

Now, remove the consecutively repeated characters from the words in the common_comments column of the DataFrame.

 

Python3




# callback for re.sub(): receives the match object
# for a run of repeated characters and returns
# a single copy of the repeated character
def conti_rep_char(match):
    return match.group(1)

# apply the pattern to a sentence, replacing
# every run of a repeated character with
# whatever the callback rep returns
def check_unique_char(rep, sent_text):

    # (\w)\1+ matches a word character followed
    # by one or more copies of itself
    convert = re.sub(r'(\w)\1+',
                     rep,
                     sent_text)

    # returning the converted sentence
    return convert

df['modified_common_comments'] = df['common_comments'].apply(
    lambda x: check_unique_char(conti_rep_char, x))

# show Dataframe
df


 
 

Output:

 

Time Complexity : roughly O(n), where n is the total number of characters in the common_comments column, since each string is scanned once.

Space Complexity : O(n) extra, since the new column stores a modified copy of every string in the original column.
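As an alternative to apply() with a helper function, the same substitution can be written with the vectorized Series.str.replace() method of pandas. This is a minimal sketch of that alternative, not the approach used above; it assumes a pandas version in which str.replace() accepts the regex argument, and it uses a two-row subset of the article's data.

```python
import pandas as pd

# a small DataFrame with two of the comments used above
df = pd.DataFrame({
    'common_comments': ['twiiter is the best way to comment',
                        'hey akash i love geeksforgeeks']
})

# regex=True makes pandas treat the pattern as a regular
# expression; \1 in the replacement is the captured character
df['modified_common_comments'] = df['common_comments'].str.replace(
    r'(\w)\1+', r'\1', regex=True)

print(df['modified_common_comments'].tolist())
# ['twiter is the best way to coment', 'hey akash i love geksforgeks']
```

Note that the pattern also collapses legitimate double letters, such as the "ee" in geeksforgeeks, so it suits cleaning elongated words more than correcting spelling.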


