How to Build a Web Scraping Bot in Python

Last Updated : 02 Feb, 2022

In this article, we are going to see how to build a web scraping bot in Python.

Web scraping is the process of extracting data from websites. A bot is a piece of code that automates a task. A web scraping bot, therefore, is a program that automatically scrapes a website for data, based on our requirements.

Modules needed

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install it, type the below command in the terminal.

pip install bs4

  • requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python. To install it, type the below command in the terminal.

pip install requests

  • Selenium: Selenium is one of the most popular automation testing tools. It can be used to automate browsers like Chrome, Firefox, Safari, etc. To install it, type the below command in the terminal.

pip install selenium
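
Once the three installs finish, an optional sanity check confirms that each module imports correctly and shows the versions in use (all three packages expose a __version__ attribute):

Python3

# optional: confirm the installs by printing each module's version
import bs4
import requests
import selenium

print(bs4.__version__, requests.__version__, selenium.__version__)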

Method 1: Using Selenium

To automate a browser with Selenium, we also need to download ChromeDriver, the driver that lets Selenium control Chrome. Our task is to create a bot that continuously scrapes the Google News website and displays all the headlines every 10 minutes.

Stepwise implementation:

Step 1: First we will import some required modules.

Python3

# These are the imports to be made
import time
from datetime import datetime

from selenium import webdriver
# Service and By support the current Selenium 4 API used below
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


Step 2: The next step is to open the required website.

Python3

# path of the chromedriver we have just downloaded
PATH = r"D:\chromedriver"
driver = webdriver.Chrome(service=Service(PATH))  # to open the browser

# url of the Google News website
url = 'https://news.google.com/'

# to open the url in the browser
driver.get(url)


Output:

Step 3: Extract the news title from the webpage. To extract a specific part of the page, we need its XPath, which can be accessed by right-clicking on the required element and selecting Inspect from the dropdown menu.

After clicking Inspect, a window appears. From there, we have to copy the element's full XPath to access it:

Note: You might not always get the exact element you want by inspecting (it depends on the structure of the website), so you may have to browse the HTML code for a while to find it. Now, just copy that path and paste it into your code. After running all these lines of code, you will get the title of the first heading printed on your terminal.

Python3

# XPath you just copied
news_path = ('/html/body/c-wiz/div/div[2]/div[2]'
             '/div/main/c-wiz/div[1]/div[3]/div/div/article/h3/a')

# to get that element
link = driver.find_element(By.XPATH, news_path)

# to read the text from that element
print(link.text)


Output:

‘Attack on Afghan territory’: Taliban on US airstrike that killed 2 ISIS-K men

Step 4: Now, the target is to get the XPaths of all the headlines present.

One way is to copy the XPaths of all the headlines (about six headlines are on Google News at a time) and fetch them one by one, but that method is not suitable when there are a large number of items to be scraped. The elegant way is to find the pattern in the XPaths of the titles, which makes the task much easier and more efficient. Below are the XPaths of all the headlines on the website; let's figure out the pattern.

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[3]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[4]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[5]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[6]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[7]/div/div/article/h3/a

/html/body/c-wiz/div/div[2]/div[2]/div/main/c-wiz/div[1]/div[8]/div/div/article/h3/a

Looking at these XPaths, we can see that only one div index changes from headline to headline: div[3] through div[8], in the fifth div of the path. Based on this, we can generate the XPaths of all the headlines and fetch every title on the page by accessing it through its XPath. The code to extract them all is:

Python3

# f-strings are used to format the XPath for each heading
c = 1
for x in range(3, 9):
    print(f"Heading {c}: ")
    c += 1
    curr_path = (f'/html/body/c-wiz/div/div[2]/div[2]/div/main'
                 f'/c-wiz/div[1]/div[{x}]/div/div/article/h3/a')
    title = driver.find_element(By.XPATH, curr_path)
    print(title.text)


Output:
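
As an aside, absolute XPaths copied from DevTools are brittle: if Google changes the page layout even slightly, the lookups break. A less fragile alternative is a relative selector. The sketch below assumes Google News still renders each headline as an article > h3 > a chain (taken from the tail of the copied XPath), which is worth re-verifying in DevTools:

Python3

# a relative CSS selector instead of a full absolute XPath;
# 'article h3 a' comes from the tail of the copied XPath and
# may need adjusting if the markup changes
headlines = driver.find_elements(By.CSS_SELECTOR, 'article h3 a')
for headline in headlines[:6]:
    print(headline.text)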

Now the code is almost complete; the last thing to do is make it fetch the headlines every 10 minutes. So we will run a while loop and sleep for 10 minutes after getting all the headlines.

Below is the full implementation:

Python3

import time
from datetime import datetime

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

PATH = r"D:\chromedriver"
driver = webdriver.Chrome(service=Service(PATH))

# url of the Google News website
url = 'https://news.google.com/'
driver.get(url)

while True:
    now = datetime.now()

    # this is just to get the time at the time of web
    # scraping (assumes the machine's local time is IST)
    current_time = now.strftime("%H:%M:%S")
    print(f'At time : {current_time} IST')
    c = 1

    for x in range(3, 9):
        # Exception handling to handle unexpected changes
        # in the structure of the website
        try:
            curr_path = (f'/html/body/c-wiz/div/div[2]/div[2]'
                         f'/div/main/c-wiz/div[1]/div[{x}]/div/div/article/h3/a')
            title = driver.find_element(By.XPATH, curr_path)
        except NoSuchElementException:
            continue
        print(f"Heading {c}: ")
        c += 1
        print(title.text)

    # to pause the code for 10 minutes
    time.sleep(600)


Output:

Method 2: Using Requests and BeautifulSoup

The requests module fetches the raw HTML data from a website, and Beautiful Soup parses that data so we can extract exactly what we require. Unlike Selenium, no browser installation is involved, and this approach is even lighter because it accesses the web directly, without the help of a browser.

Stepwise implementation:

Step 1: Import modules.

Python3

import requests
from bs4 import BeautifulSoup
import time


Step 2: The next thing to do is to fetch the URL (here, the Yahoo Finance cryptocurrencies page) and then parse the HTML code.

Python3

# url of the page to be scraped
url = 'https://finance.yahoo.com/cryptocurrencies/'

response = requests.get(url)
text = response.text
data = BeautifulSoup(text, 'html.parser')
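
The request can fail quietly: if Yahoo returns an error page, BeautifulSoup will happily parse it and the table lookups below will misbehave. A small optional hardening step is sketched here; the browser-like User-Agent header is an assumption (some sites reject the default requests header), not something the original code requires:

Python3

import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/cryptocurrencies/'

# send a browser-like User-Agent (optional) and
# fail fast instead of parsing an error page
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
response.raise_for_status()
data = BeautifulSoup(response.text, 'html.parser')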


Step 3: First, we shall get all the headings from the table.

Python3

# since the headings are the first row of the table
headings = data.find_all('tr')[0]
headings_list = []  # list to store all headings

for heading in headings:
    headings_list.append(heading.text)

# since we require only the first ten columns
headings_list = headings_list[:10]

print('Headings are: ')
for column in headings_list:
    print(column)


Output:

Step 4: In the same way, all the values in each row can be obtained:

Python3

# since we need only the first five coins
for x in range(1, 6):
    row = data.find_all('tr')[x]
    cells = row.find_all('td')

    for cell in cells:
        print(cell.text, end=' ')
    print('')


Output:

Below is the full implementation:

Python3

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time

while True:
    now = datetime.now()

    # this is just to get the time at the time of web
    # scraping (assumes the machine's local time is IST)
    current_time = now.strftime("%H:%M:%S")
    print(f'At time : {current_time} IST')

    response = requests.get('https://finance.yahoo.com/cryptocurrencies/')
    text = response.text
    html_data = BeautifulSoup(text, 'html.parser')

    headings = html_data.find_all('tr')[0]
    headings_list = []
    for heading in headings:
        headings_list.append(heading.text)
    headings_list = headings_list[:10]

    data = []

    for x in range(1, 6):
        row = html_data.find_all('tr')[x]
        column_value = row.find_all('td')

        # row_data instead of dict, to avoid shadowing the built-in
        row_data = {}
        for i in range(10):
            row_data[headings_list[i]] = column_value[i].text
        data.append(row_data)

    for coin in data:
        print(coin)
        print('')
    time.sleep(600)


Output:

Hosting the Bot

This is one specific method used to run the bot continuously online without the need for any human intervention. replit.com is an online IDE where we will run the code. We will create a mini web server with the help of the Flask module in Python, which helps keep the code running continuously. Please create an account on that website and create a new repl.

After creating the repl, create two files: one to run the bot code and the other to create the web server using Flask.
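
The resulting repl contains just these two files, side by side (names as used below):

  • cryptotracker.py: the scraping bot itself
  • keep_alive.py: the Flask web server that keeps the repl awake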

Code for cryptotracker.py:

Python3

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time

# keep_alive function, that maintains continuous
# running of the code
from keep_alive import keep_alive
import pytz

# to start the web server thread
keep_alive()

while True:
    tz_NY = pytz.timezone('Asia/Kolkata')
    datetime_NY = datetime.now(tz_NY)

    # this is just to get the time at the time of web scraping
    current_time = datetime_NY.strftime("%H:%M:%S - (%d/%m)")
    print(f'At time : {current_time} IST')

    response = requests.get('https://finance.yahoo.com/cryptocurrencies/')
    text = response.text
    html_data = BeautifulSoup(text, 'html.parser')

    headings = html_data.find_all('tr')[0]
    headings_list = []
    for heading in headings:
        headings_list.append(heading.text)
    headings_list = headings_list[:10]

    data = []

    for x in range(1, 6):
        row = html_data.find_all('tr')[x]
        column_value = row.find_all('td')

        # row_data instead of dict, to avoid shadowing the built-in
        row_data = {}
        for i in range(10):
            row_data[headings_list[i]] = column_value[i].text
        data.append(row_data)

    for coin in data:
        print(coin)

    time.sleep(60)


Code for keep_alive.py (the web server):

Python3

from flask import Flask
from threading import Thread

app = Flask('')

@app.route('/')
def home():
    return "Hello. the bot is alive!"

def run():
    app.run(host='0.0.0.0', port=8080)

def keep_alive():
    # run the Flask server in a background thread so the
    # scraper can keep executing in the main thread
    t = Thread(target=run)
    t.start()


Keep-alive is a method in networking used to prevent a link from breaking. Here, the purpose of the keep-alive code is to create a web server using Flask that keeps the thread of the crypto-tracker code active, so that it can give updates continuously.
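
To check that the web server part is working, the root route can be hit directly; a minimal check, assuming the server is reachable locally on port 8080 as configured above:

Python3

# quick check that the keep-alive server responds
import requests

print(requests.get('http://localhost:8080/').text)
# expected output: Hello. the bot is alive!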

Now we have a web server created, and we need something to ping it continuously so that the server does not go down and the code keeps running. The website uptimerobot.com does this job: create an account there and set up a monitor that pings your repl's URL at a regular interval.

Finally, run the crypto-tracker code in Replit. Thus, we have successfully created a web scraping bot that scrapes the particular website every 10 minutes and prints the data to the terminal.


