Python program to extract Strings between HTML Tags
Last Updated : 17 May, 2023
Given a string and an HTML tag name, extract all the strings enclosed between the opening and closing forms of that tag.
Input : '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.', tag = "b"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "b" tags are extracted.
Input : '<h1>Gfg</h1> is <h1>Best</h1> I love <h1>Reading CS</h1>', tag = "h1"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "h1" tags are extracted.
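Method 1: Using the re module
This task can be performed with the re module. We build a regular expression from the tag name and pass it to findall(), which returns every substring captured between the opening and closing tags. The non-greedy group (.*?) stops at the first closing tag it meets, so each tag pair is matched separately.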
Python3
import re

test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
print("The original string is : " + str(test_str))

tag = "b"

# Regex: opening tag, non-greedy capture group, closing tag
reg_str = "<" + tag + ">(.*?)</" + tag + ">"
res = re.findall(reg_str, test_str)

print("The Strings extracted : " + str(res))
Output:
The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']
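Time Complexity: O(N) for typical input, where N is the length of the input string. If many opening tags have no matching closing tag, the non-greedy scan restarts from each of them, and the cost can approach O(N^2).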
Auxiliary Space: O(N)
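The regex approach can also be wrapped in a reusable function. The sketch below is one possible way to do that, not part of the original program: the helper name extract_between_tags is hypothetical, re.escape is added so that a tag name containing regex metacharacters is matched literally, and re.DOTALL is added so that content spanning several lines is still captured. Like the pattern above, it only handles bare, non-nested tags without attributes.

```python
import re


def extract_between_tags(text, tag):
    """Return every substring enclosed by <tag>...</tag> in text.

    Hypothetical helper for illustration; handles only attribute-free,
    non-nested tags.
    """
    escaped = re.escape(tag)
    pattern = "<" + escaped + ">(.*?)</" + escaped + ">"
    # re.DOTALL lets "." match newlines, so multi-line content is captured
    return re.findall(pattern, text, flags=re.DOTALL)


print(extract_between_tags(
    "<h1>Gfg</h1> is <h1>Best</h1> I love <h1>Reading CS</h1>", "h1"))
# prints ['Gfg', 'Best', 'Reading CS']
```

Keeping the pattern construction in one place makes it easier to reuse the same logic for different tags.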
Method 2: Using string manipulation
- Initialize a string named “test_str” with some HTML content.
- Initialize a string named “tag” with the name of the tag whose content needs to be extracted.
- Find the index of the first occurrence of the opening tag in the “test_str” using the “find()” method and store it in a variable named “start_idx”.
- Initialize an empty list named “res” to store the extracted strings.
- Use a while loop to extract the strings between the tags. The loop will run until there are no more occurrences of the opening tag.
- Inside the loop, find the index of the closing tag using the “find()” method and store it in a variable named “end_idx”. If the closing tag is not found, exit the loop.
- Extract the string between the tags using string slicing, and append it to the “res” list.
- Find the index of the next occurrence of the opening tag using the “find()” method and update the “start_idx” variable.
- Repeat the previous three steps until there are no more occurrences of the opening tag.
- Print the extracted strings using the "print()" function. The list is converted to a string using the "str()" function before being printed.
Python3
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
tag = "b"

# Index of the first opening tag, or -1 if there is none
start_idx = test_str.find("<" + tag + ">")
res = []
while start_idx != -1:
    end_idx = test_str.find("</" + tag + ">", start_idx)
    if end_idx == -1:
        break
    # Skip past "<tag>" (len(tag) + 2 characters) to reach the content
    res.append(test_str[start_idx + len(tag) + 2:end_idx])
    start_idx = test_str.find("<" + tag + ">", end_idx)

print("The Strings extracted : " + str(res))
Output
The Strings extracted : ['Gfg', 'Best', 'Reading CS']
Time complexity: O(n), where n is the length of the input string.
Auxiliary space: O(m), where m is the total length of the extracted strings stored in the result list (at most n).
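The same find()-based loop can be packaged as a function and tried on the "h1" example from the introduction. This is a minimal sketch under stated assumptions: the function name extract_with_find and the variables open_tag, close_tag and content_start are introduced here for illustration and do not appear in the original program.

```python
def extract_with_find(text, tag):
    """Hypothetical helper: collect the text between <tag> and </tag> using str.find()."""
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"
    res = []
    start_idx = text.find(open_tag)
    while start_idx != -1:
        content_start = start_idx + len(open_tag)
        end_idx = text.find(close_tag, content_start)
        if end_idx == -1:
            break  # unmatched opening tag: stop searching
        res.append(text[content_start:end_idx])
        # Resume the search just after the closing tag
        start_idx = text.find(open_tag, end_idx + len(close_tag))
    return res


print(extract_with_find(
    "<h1>Gfg</h1> is <h1>Best</h1> I love <h1>Reading CS</h1>", "h1"))
# prints ['Gfg', 'Best', 'Reading CS']
```

Building open_tag and close_tag once avoids repeating the string concatenation and the len(tag) + 2 offset arithmetic inside the loop.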
Method 3: Using recursion
Algorithm:
- Find the index of the first occurrence of the opening tag.
- If no opening tag is found, return an empty list.
- Extract the string between the opening and closing tags using the start index of the opening tag and the end index of the closing tag.
- Recursively call the function with the remaining string after the current tag.
- Return the list of extracted strings.
Python3
def extract_strings_recursive(test_str, tag):
    # Base case: no opening tag left
    start_idx = test_str.find("<" + tag + ">")
    if start_idx == -1:
        return []
    # Stop if the opening tag has no matching closing tag
    end_idx = test_str.find("</" + tag + ">", start_idx)
    if end_idx == -1:
        return []
    # Content starts after "<tag>" (len(tag) + 2 characters)
    res = [test_str[start_idx + len(tag) + 2:end_idx]]
    # Recurse on the text after "</tag>" (len(tag) + 3 characters)
    res += extract_strings_recursive(test_str[end_idx + len(tag) + 3:], tag)
    return res


test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
tag = "b"
print("The original string is : " + str(test_str))
res = extract_strings_recursive(test_str, tag)
print("The Strings extracted : " + str(res))
Output
The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']
Time Complexity:
The find() calls together scan the string once, which is O(n) for an input of length n. However, each recursive call also slices off the remaining string, and slicing copies it. With k matched tag pairs the total work is therefore O(n·k), which approaches O(n^2) when the string is densely packed with tags.
Auxiliary Space:
The recursion depth equals the number of matched tag pairs k, which is at most about n / (2 * len(tag) + 5), since each pair occupies at least that many characters. Every active call holds its own sliced copy of the remaining string, so the auxiliary space can reach O(n·k) in the worst case. In practice k is usually much smaller than n. Note that CPython's default recursion limit is 1000, so an input with around a thousand or more tag pairs raises RecursionError.
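The recursive function above slices the string on every call, which copies the remainder each time. One way to avoid those copies is to pass a start offset instead of a slice. The following is a minimal sketch of that idea, not the original method: the function name extract_recursive_offset and the extra parameters pos and acc are assumptions introduced for illustration.

```python
def extract_recursive_offset(text, tag, pos=0, acc=None):
    """Hypothetical variant: recurse on an offset instead of a sliced copy."""
    if acc is None:
        acc = []  # fresh accumulator for each top-level call
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"
    start_idx = text.find(open_tag, pos)
    if start_idx == -1:
        return acc  # no more opening tags
    content_start = start_idx + len(open_tag)
    end_idx = text.find(close_tag, content_start)
    if end_idx == -1:
        return acc  # unmatched opening tag: stop
    acc.append(text[content_start:end_idx])
    # Continue searching after the closing tag, on the same string object
    return extract_recursive_offset(text, tag, end_idx + len(close_tag), acc)


print(extract_recursive_offset(
    "<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.", "b"))
# prints ['Gfg', 'Best', 'Reading CS']
```

This variant still uses one stack frame per tag pair, so it remains subject to Python's recursion limit; for inputs with very many tags, the loop-based approach of Method 2 is the safer choice.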