
Scraping Indeed Job Data Using Python

Last Updated : 23 May, 2021

In this article, we will see how to scrape Indeed job data using Python. We will use Beautiful Soup and the requests module to scrape the data.

Modules needed

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It is not part of the Python standard library. To install it, run the command below in the terminal.

pip install bs4

  • requests: Requests lets you send HTTP/1.1 requests very easily. It is also not part of the Python standard library. To install it, run the command below in the terminal.

pip install requests

Approach:

  • Import all the required modules.
  • Pass the URL to getdata(), a user-defined function that sends a request to the URL and returns the response. We use the get method to retrieve information from the server at the given URL.

Syntax: 

requests.get(url, params=None, **kwargs)
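
A minimal sketch of such a call is shown below. The helper name fetch_html, the 10-second timeout, and the User-Agent value are illustrative assumptions, not part of the article's program. raise_for_status() turns HTTP error codes into exceptions, so an error page is not mistaken for search results.

```python
import requests

# Hypothetical header value chosen for illustration; some sites respond
# differently to the default python-requests User-Agent.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; job-scraper-example)"}


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Passing a timeout keeps the script from hanging indefinitely if the server does not answer.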

  • Read the HTML text from the response.

When you search Indeed for a job and a location, the results URL takes the form "https://in.indeed.com/jobs?q=" + job + "&l=" + Location, so we build our URL string in this format.
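
As a sketch of how this URL could be assembled from plain-text search terms, the standard library's urlencode can do the escaping. The helper name build_search_url and the constant BASE_URL are hypothetical names introduced for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://in.indeed.com/jobs"


def build_search_url(job, location):
    """Return the search URL for a plain-text job query and location."""
    # urlencode escapes each value: spaces become "+", commas become "%2C".
    query = urlencode({"q": job, "l": location})
    return f"{BASE_URL}?{query}"


print(build_search_url("data science internship", "Noida, Uttar Pradesh"))
# https://in.indeed.com/jobs?q=data+science+internship&l=Noida%2C+Uttar+Pradesh
```

This produces the same pre-encoded values that the program hard-codes ("data+science+internship" and "Noida%2C+Uttar+Pradesh"), without escaping them by hand.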

  • Now parse the HTML content using bs4.

Syntax: soup = BeautifulSoup(r.content, 'html.parser')

Parameters:

  • r.content : It is the raw HTML content.
  • html.parser : Specifying the HTML parser we want to use.

  • Now filter the required data using the soup.find_all() function.
    • Find the a tags whose class is "jobtitle turnstileLink". You can open the webpage in a browser, right-click the relevant element, and inspect it to see its tag and class.

  • Find the company name and address in the same way.
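
The parsing and filtering steps above can be tried offline on a small HTML string. The markup below is made up for illustration and only imitates the class names used in this article; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the class names used in this article.
SAMPLE_HTML = """
<div>
  <a class="jobtitle turnstileLink" href="#">
    Data Science Intern
  </a>
  <div class="sjcl"><span>Acme Analytics</span> <span>Noida, Uttar Pradesh</span></div>
  <a class="jobtitle turnstileLink" href="#">Machine Learning Intern</a>
  <div class="sjcl"><span>Example Labs</span> <span>Noida, Uttar Pradesh</span></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ with a multi-word string matches that exact class attribute value.
titles = [a.get_text(strip=True)
          for a in soup.find_all("a", class_="jobtitle turnstileLink")]

# A CSS selector matches both classes regardless of their order.
titles_css = [a.get_text(strip=True)
              for a in soup.select("a.jobtitle.turnstileLink")]

print(titles)
# ['Data Science Intern', 'Machine Learning Intern']
```

Using get_text(strip=True) removes the surrounding newlines and spaces, so the titles do not need to be split on "\n" afterwards.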

Functions used:

The code is divided into user-defined functions to improve readability and ease of use.

  • getdata(): requests the URL and returns the response text
  • html_code(): gets the parsed HTML of the given URL
  • job_data(): filters out job data
  • company_data(): filters out company data

Program:

Python3




# import module
import requests
from bs4 import BeautifulSoup
  
  
# user-defined function:
# request the URL and return
# the response body as a string
def getdata(url):
    r = requests.get(url)
    return r.text
  
# Parse the HTML with Beautiful Soup
def html_code(url):
  
    # pass the url
    # into getdata function
    htmldata = getdata(url)
    soup = BeautifulSoup(htmldata, 'html.parser')
  
    # return html code
    return(soup)
  
# filter job data using
# find_all function
def job_data(soup):
    
    # find the matching HTML tags
    # with find_all() and join
    # their text into one string
    data_str = ""
    for item in soup.find_all("a", class_="jobtitle turnstileLink"):
        data_str = data_str + item.get_text()
    result_1 = data_str.split("\n")
    return(result_1)
  
# filter company_data using
# find_all function
  
  
def company_data(soup):
  
    # find the matching HTML tags
    # with find_all() and join
    # their text into one string
    data_str = ""
    for item in soup.find_all("div", class_="sjcl"):
        data_str = data_str + item.get_text()
    result_1 = data_str.split("\n")
  
    res = []
    for i in range(1, len(result_1)):
        if len(result_1[i]) > 1:
            res.append(result_1[i])
    return(res)
  
  
# driver code/main function
if __name__ == "__main__":
  
    # Data for URL
    job = "data+science+internship"
    Location = "Noida%2C+Uttar+Pradesh"
    url = "https://in.indeed.com/jobs?q="+job+"&l="+Location
  
    # Pass this URL into the soup
    # which will return
    # html string
    soup = html_code(url)
  
    # call job_data and company_data
    # and store the results
    job_res = job_data(soup)
    com_res = company_data(soup)
  
    # Traverse both lists, printing two
    # company lines (name and address) per job
    temp = 0
    for i in range(1, len(job_res)):
        # stop early if the company lines run out
        for j in range(temp, min(temp + 2, len(com_res))):
            print("Company Name and Address : " + com_res[j])
  
        # advance past the two lines just printed
        temp += 2
        print("Job : " + job_res[i])
        print("-----------------------------")
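
The driver code matches jobs to company lines by index arithmetic, which is easy to get out of step. One possible alternative, sketched below, pairs each job title tag with the company block that follows it using zip. The function name paired_results and the sample markup are assumptions made for this example, and the sketch assumes the nth title and the nth "sjcl" block belong to the same listing.

```python
from bs4 import BeautifulSoup


def paired_results(soup):
    """Return one dict per job, pairing each title with its company block."""
    titles = soup.find_all("a", class_="jobtitle turnstileLink")
    companies = soup.find_all("div", class_="sjcl")
    records = []
    # zip stops at the shorter list, so a missing block cannot raise IndexError.
    for title, company in zip(titles, companies):
        records.append({
            "job": title.get_text(strip=True),
            # Join the nested text pieces with single spaces.
            "company": company.get_text(" ", strip=True),
        })
    return records


# Made-up sample markup for demonstration only.
sample = """
<a class="jobtitle turnstileLink">Data Science Intern</a>
<div class="sjcl"><span>Acme Analytics</span>
<span>Noida, Uttar Pradesh</span></div>
"""
for record in paired_results(BeautifulSoup(sample, "html.parser")):
    print("Job : " + record["job"])
    print("Company Name and Address : " + record["company"])
```

Keeping each listing in its own dict also makes it straightforward to write the results to a file later.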




