Python program to extract Strings between HTML Tags

Last Updated : 17 May, 2023

Given a string and an HTML tag name, extract every substring enclosed between that tag's opening and closing forms.

Input : '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.', tag = "b"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "b" tags are extracted.

Input : '<h1>Gfg</h1> is <h1>Best</h1> I love <h1>Reading CS</h1>', tag = "h1"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "h1" tags are extracted.

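Method 1: Using the re module

This task can be performed with the re module. The findall() function returns every match of a regex built from the tag name and the angle-bracket symbols; the non-greedy group (.*?) captures the text between each opening tag and the nearest closing tag.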

Python3




# importing re module
import re
 
# initializing string
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
 
# printing original string
print("The original string is : " + str(test_str))
 
# initializing tag
tag = "b"
 
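# regex to extract required strings; (.*?) is non-greedy, so each match
# stops at the nearest closing tag, and re.escape() protects against
# regex metacharacters in the tag name
reg_str = "<" + re.escape(tag) + ">(.*?)</" + re.escape(tag) + ">"
res = re.findall(reg_str, test_str)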
 
# printing result
print("The Strings extracted : " + str(res))


Output:

The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']

Time Complexity: O(N), where N is the length of the input string.

Auxiliary Space: O(N)
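
The pattern in Method 1 only matches a bare opening tag such as <b>. As a hedged sketch, the same findall() idea can be wrapped in a reusable function. The helper name extract_between_tags and the choice to accept attributes, newlines and mixed-case tags are assumptions for illustration, not part of the original method.

```python
import re


def extract_between_tags(text, tag):
    """Return the text between <tag ...> and </tag> pairs (hypothetical helper)."""
    # re.escape guards against tag names containing regex metacharacters.
    name = re.escape(tag)
    # (?:\s[^>]*)? optionally allows attributes, e.g. <b class="x">;
    # (.*?) is non-greedy, so each match stops at the nearest closing tag.
    pattern = rf"<{name}(?:\s[^>]*)?>(.*?)</{name}>"
    # DOTALL lets '.' span newlines; IGNORECASE treats <B> and <b> alike.
    return re.findall(pattern, text, flags=re.DOTALL | re.IGNORECASE)


print(extract_between_tags('<b>Gfg</b> is <b class="x">Best</b>.', "b"))
```

Because the opening tag must be followed by whitespace or ">", a tag name of "b" does not accidentally match <br>. A regex of this kind still cannot pair up nested tags of the same name, so an HTML parser is usually the safer choice for real-world markup.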

Method 2: Using string manipulation

  1. Initialize a string named “test_str” with some HTML content.
  2. Initialize a string named “tag” with the name of the tag whose content needs to be extracted.
  3. Find the index of the first occurrence of the opening tag in the “test_str” using the “find()” method and store it in a variable named “start_idx”.
  4. Initialize an empty list named “res” to store the extracted strings.
  5. Use a while loop to extract the strings between the tags. The loop will run until there are no more occurrences of the opening tag.
  6. Inside the loop, find the index of the closing tag using the “find()” method and store it in a variable named “end_idx”. If the closing tag is not found, exit the loop.
  7. Extract the string between the tags using string slicing, and append it to the “res” list.
  8. Find the index of the next occurrence of the opening tag using the “find()” method and update the “start_idx” variable.
  9. Repeat steps 6-8 until there are no more occurrences of the opening tag.
  10. Print the extracted strings using the “print()” function. The list is converted to a string using the “str()” function so that it can be concatenated with the label.

Python3




# initializing string
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
 
# initializing tag
tag = "b"
 
# finding the index of the first occurrence of the opening tag
start_idx = test_str.find("<" + tag + ">")
 
# initializing an empty list to store the extracted strings
res = []
 
# extracting the strings between the tags
while start_idx != -1:
    end_idx = test_str.find("</" + tag + ">", start_idx)
    if end_idx == -1:
        break
    res.append(test_str[start_idx+len(tag)+2:end_idx])
    start_idx = test_str.find("<" + tag + ">", end_idx)
 
# printing the extracted strings
print("The Strings extracted : " + str(res))


Output

The Strings extracted : ['Gfg', 'Best', 'Reading CS']

Time complexity: O(n), where n is the length of the input string.
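Auxiliary space: O(n) in the worst case, since the extracted substrings together can be nearly as long as the input. The result list itself holds m entries, where m is the number of occurrences of the tag in the input string.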
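
The same find()-based loop can be packaged as a function so it works for any string and tag. This is a minimal sketch under stated assumptions: the name extract_with_find is hypothetical, and the tags are assumed to carry no attributes.

```python
def extract_with_find(text, tag):
    """Collect substrings between <tag> and </tag> using str.find (hypothetical helper)."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    res = []
    start = text.find(open_tag)
    while start != -1:
        # the content begins right after the opening tag
        content_start = start + len(open_tag)
        end = text.find(close_tag, content_start)
        if end == -1:
            break  # unmatched opening tag: stop rather than guess
        res.append(text[content_start:end])
        # resume the search just past the closing tag
        start = text.find(open_tag, end + len(close_tag))
    return res


print(extract_with_find('<h1>Gfg</h1> is <h1>Best</h1>', "h1"))
```

Building open_tag and close_tag once avoids repeating the len(tag)+2 arithmetic and makes the slice boundaries easier to check.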

Method 3: Using recursion

Algorithm:

  1. Find the index of the first occurrence of the opening tag.
  2. If no opening tag is found, return an empty list.
  3. Extract the string between the opening and closing tags using the start index of the opening tag and the end index of the closing tag.
  4. Recursively call the function with the remaining string after the current tag.
  5. Return the list of extracted strings.

Python3




def extract_strings_recursive(test_str, tag):
    # finding the index of the first occurrence of the opening tag
    start_idx = test_str.find("<" + tag + ">")
 
    # base case
    if start_idx == -1:
        return []
 
    # extracting the string between the opening and closing tags
    end_idx = test_str.find("</" + tag + ">", start_idx)
    # stop if the opening tag has no matching closing tag
    if end_idx == -1:
        return []
    res = [test_str[start_idx+len(tag)+2:end_idx]]
 
    # recursive call to extract strings after the current tag
    res += extract_strings_recursive(test_str[end_idx+len(tag)+3:], tag)
 
    return res
 
# example usage
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
tag = "b"
# printing original string
print("The original string is : " + str(test_str))
  
res = extract_strings_recursive(test_str, tag)
print("The Strings extracted : " + str(res))
#This code is contributed by Jyothi Pinjala.


Output

The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']

Time Complexity:
Each call runs find() over the remaining string and then slices it to build the argument for the next call; both steps are linear in the length of what remains. With m tag pairs in a string of length n, the total work is therefore O(n·m) in the worst case, which approaches O(n²) when the string is densely packed with tags. For inputs with only a few tag pairs the running time is close to O(n).

Auxiliary Space:
The recursion depth equals the number of tag pairs m, and each active call holds its own sliced copy of the remaining string, so the worst-case extra space is also O(n·m), in addition to the output list. For inputs with many tag pairs, the depth can exceed Python's default recursion limit (typically 1000), which raises RecursionError.
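
Method 3 slices the remaining string on every call, which copies it each time. A minimal sketch of one way to avoid those copies is to pass a start offset and a shared result list instead. The function name extract_recursive_at and the pos and out parameters are assumptions for illustration, not part of the original code.

```python
def extract_recursive_at(text, tag, pos=0, out=None):
    """Recursive extraction that tracks an offset instead of slicing (hypothetical helper)."""
    if out is None:
        out = []
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"

    # base case: no further opening tag at or after pos
    start = text.find(open_tag, pos)
    if start == -1:
        return out

    content_start = start + len(open_tag)
    end = text.find(close_tag, content_start)
    # base case: opening tag without a matching closing tag
    if end == -1:
        return out

    out.append(text[content_start:end])
    # recurse on the same string, starting just past the closing tag
    return extract_recursive_at(text, tag, end + len(close_tag), out)


print(extract_recursive_at('<b>Gfg</b> is <b>Best</b>.', "b"))
```

Each character is scanned by find() at most a constant number of times, so the scanning work is linear in the input length. Python does not optimize tail calls, so the recursion depth still grows with the number of tag pairs; the loop in Method 2 remains the safer choice for large inputs.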


