How to Scrape Multiple Pages of a Website Using Python?

Last Updated : 03 Oct, 2023

Web Scraping is a method of extracting useful data from a website using computer programs, without having to do it manually. The extracted data can then be exported and organized for various purposes. Some common places where web scraping finds its use are market research and analysis websites, price comparison tools, search engines, data collection for AI/ML projects, etc.

Let’s dive deep and scrape a website. In this article, we are going to take the GeeksforGeeks website and extract the titles of all the articles available on the Homepage using a Python script. 

If you notice, there are thousands of articles on the website and to extract all of them, we will have to scrape through all pages so that we don’t miss out on any! 

GeeksforGeeks Homepage

Scraping multiple Pages of a website Using Python

Now, there may arise various instances where you want to get data from multiple pages of the same website, or from multiple different URLs, and manually writing code for each webpage is a time-consuming and tedious task. Plus, it defeats the whole purpose of automation. Duh!

To solve this exact problem, we will see two main techniques that will help us extract data from multiple webpages:

  • Looping through the page numbers of the same website
  • Looping through a list of different website URLs

Approach:

The approach of the program will be fairly simple, and it will be easier to understand in a point format:

  • Import all the necessary libraries.
  • Set up our URL strings for making a connection using the requests library.
  • Parse the available data from the target page using the BeautifulSoup library’s parser.
  • From the target page, identify and extract the classes and tags that contain the information valuable to us.
  • Prototype it for one page using a loop and then apply it to all the pages.

Example 1: Looping through the page numbers 

page numbers at the bottom of the GeeksforGeeks website

Most websites have pages labeled from 1 to N. This makes it really simple for us to loop through these pages and extract data from them as these pages have similar structures. For example:

notice the last section of the URL – page/4/

Here, we can see the page details at the end of the URL. Using this information, we can easily create a for loop that iterates over as many pages as we want (by putting page/(i)/ in the URL string and iterating i till N) and scrape all the useful data from them.
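To make the page/(i)/ idea concrete, here is a minimal sketch that only builds the page URLs we would request; the base URL https://www.geeksforgeeks.org/ is assumed here, and any site that follows the same page/N/ pattern works the same way.

Python

base_url = 'https://www.geeksforgeeks.org/'  # assumed base URL; replace with your target site

# build the URL of each page by appending page/<i>/ to the base URL
for i in range(1, 5):
    page_url = f"{base_url}page/{i}/"
    print(page_url)

With the URL pattern clear, let’s first extract the article titles from a single page and then sandwich that code inside a loop.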

Python




import requests
from bs4 import BeautifulSoup as bs

# URL of the page to scrape (here, the GeeksforGeeks homepage)
URL = 'https://www.geeksforgeeks.org/'

# fetch the page and parse its HTML
req = requests.get(URL)
soup = bs(req.text, 'html.parser')

# the article titles are kept in <div class="head"> elements
titles = soup.find_all('div', attrs={'class': 'head'})

# print one of the extracted titles
print(titles[4].text)


Output:

Output for the above code

Now, using the above code, we can get the titles of all the articles by just sandwiching those lines with a loop.

Python




import requests
from bs4 import BeautifulSoup as bs

# base URL for the paginated listing (assumed pattern: .../page/<n>/)
URL = 'https://www.geeksforgeeks.org/page/'

# the total number of pages on the website is more than 5000,
# so we only take the first 10 as this is just an example
for page in range(1, 11):

    # build the URL of the current page and fetch it
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    # the article titles are kept in <div class="head"> elements
    titles = soup.find_all('div', attrs={'class': 'head'})

    # titles[4] to titles[18] hold the 15 article titles on each page;
    # number them continuously across pages
    for i in range(4, 19):
        print(f"{(i - 3) + (page - 1) * 15} " + titles[i].text)


Output:

Output for the above code

Note: The above code will fetch the first 10 pages from the website and scrape the 150 article titles that fall under those pages.
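As mentioned in the introduction, scraped data is usually exported and organized rather than just printed. Below is a minimal sketch, assuming the same URL pattern as above, of how the loop could collect the titles into a list and write them to a CSV file; the file name articles.csv and the 2-page range are only illustrative choices, not part of the original example.

Python

import csv

import requests
from bs4 import BeautifulSoup as bs

# assumed base URL, same pattern as in the example above
URL = 'https://www.geeksforgeeks.org/page/'

all_titles = []

# only 2 pages here to keep the sketch short
for page in range(1, 3):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class': 'head'})

    # collect the 15 article titles from this page
    for i in range(4, 19):
        all_titles.append(titles[i].text)

# export the collected titles to a CSV file (file name is illustrative)
with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    writer.writerows([t] for t in all_titles)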

Example 2: Looping through a list of different URLs.

The above technique is absolutely wonderful, but what if you need to scrape different pages and you don’t know their page numbers? You would have to scrape those different URLs one by one and manually code a script for every such webpage.

Instead, you could just make a list of these URLs and loop through them. By simply iterating over the items in the list, i.e. the URLs, we will be able to extract the titles of those pages without having to write code for each page. Here’s an example of how you can do it.

Python




import requests
from bs4 import BeautifulSoup as bs

# list of the pages to scrape; put your own URLs here
URLS = ['https://www.geeksforgeeks.org/',
        'https://www.geeksforgeeks.org/page/2/']

for index, url in enumerate(URLS):
    # fetch and parse the current URL
    req = requests.get(url)
    soup = bs(req.text, 'html.parser')

    # the article titles are kept in <div class="head"> elements
    titles = soup.find_all('div', attrs={'class': 'head'})

    # number the 15 titles from each page continuously
    for i in range(4, 19):
        print(f"{(i - 3) + index * 15} " + titles[i].text)


Output:

Output for the above code

How to avoid getting your IP address banned?

Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with many requests within a very short span of time will most likely get your IP address blacklisted. To avoid this, we can carry out our crawling in short random bursts of time. In other words, we add pauses, or little breaks, between crawling periods. This makes our requests look more like those of an actual human, since websites can easily identify a crawler by the speed at which it hits pages compared to a human visitor. It also avoids unnecessary traffic and overloading of the website’s servers. Win-Win!

Now, how do we control the crawl rate? It’s simple: we use two functions, randint() and sleep(), from the Python modules random and time respectively.

Python3




from random import randint
from time import sleep

# pick a random integer between 1 and 10 (both inclusive)
print(randint(1, 10))


Output

1

The randint() function chooses a random integer between the given lower and upper limits, in this case 1 and 10 respectively, for every iteration of the loop. Using randint() in combination with sleep() helps add short, random breaks to the crawling rate of the program. The sleep() function simply pauses the execution of the program for the given number of seconds; here, that number of seconds is fed into sleep() by randint(). Use the code given below for reference.

Python3




from random import randint
from time import sleep

for i in range(3):
    # select a random integer in the given range
    x = randint(2, 5)
    print(x)

    # pause execution for x seconds
    sleep(x)
    print(f'I waited {x} seconds')


Output

5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds

To get a clear idea of these functions in action, refer to the code given below.

Python3




import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

# base URL for the paginated listing (assumed pattern: .../page/<n>/)
URL = 'https://www.geeksforgeeks.org/page/'

# the total number of pages on the website is more than 5000,
# so we only take the first 10 as this is just an example
for page in range(1, 11):

    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class': 'head'})

    # number the 15 titles from each page continuously
    for i in range(4, 19):
        print(f"{(i - 3) + (page - 1) * 15} " + titles[i].text)

    # wait for a random number of seconds before requesting the next page
    sleep(randint(2, 10))


Output:

The program has paused its execution and is waiting to resume

The output of the above code


