Scraping Reddit with Python and BeautifulSoup

Last Updated : 21 Nov, 2022

In this article, we are going to see how to scrape Reddit with Python and BeautifulSoup. We will use the Beautiful Soup and requests modules to scrape the data.

Modules needed

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built in with Python. To install it, type the command below in the terminal.
pip install bs4
  • requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built in with Python. To install it, type the command below in the terminal.
pip install requests

Approach:

  • Import all the required modules.
  • Pass the URL to the user-defined getdata() function, which sends a request to that URL and returns the HTML of the response. We use the GET method to retrieve information from the server at the given URL.

Syntax: requests.get(url, args)

  • Now parse the HTML content using bs4.

Syntax: soup = BeautifulSoup(r.content, 'html.parser')

Parameters:

  • r.content : It is the raw HTML content.
  • 'html.parser' : Specifies the HTML parser we want to use.

  • Now filter the required data using the soup.find_all() function.
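
The parse and filter steps above can be tried offline before touching a live page. The following is a minimal sketch, assuming a small hand-written HTML string (sample_html, with invented tag and class names) in place of a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page;
# the class names here are invented for illustration.
sample_html = """
<html><body>
  <div class="post-title">First post</div>
  <div class="post-title">Second post</div>
  <div class="sidebar">Not a post</div>
</body></html>
"""

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the <div> tags that carry the class we care about.
titles = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="post-title")
]

print(titles)
```

Working on a fixed string like this makes it easier to check the filter logic, since the result does not depend on what the live site returns.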

Let’s see the stepwise execution of the script.

Step 1: Import all dependencies

Python3




# import module
import requests
from bs4 import BeautifulSoup


Step 2: Create a URL get function

Python3




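# HEADERS was not defined in the original snippet; this
# User-Agent value is an assumed placeholder.
HEADERS = {"User-Agent": "Mozilla/5.0"}

# user-defined function
# to scrape the data
def getdata(url):
    r = requests.get(url, headers=HEADERS)
    return r.text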
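
A slightly more defensive variant of getdata() is sketched below. The name fetch_html, the User-Agent string, and the 10-second timeout are assumptions chosen for illustration and are not part of the original script; requests.get() with its headers and timeout arguments, and raise_for_status(), are standard Requests calls.

```python
import requests

# Assumed header value; the original script does not show its HEADERS dictionary.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def fetch_html(url, timeout=10):
    """Return the body of the page at `url` as text.

    Raises requests.HTTPError for 4xx/5xx responses instead of
    silently returning the text of an error page.
    """
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text
```

With the timeout, a stalled connection fails after a bounded wait, and raise_for_status() surfaces a blocked or missing page as an exception rather than as HTML that the later find_all() calls would silently fail to match.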


Step 3: Now pass the URL into the getdata() function and parse the returned data as HTML.

Python3




 
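# pass the url
# into getdata function
# (url is a placeholder: set it to the Reddit post you want to scrape)
url = "<URL of a Reddit post>"
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# display html code
print(soup)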


Output:

Note: This is only the raw HTML of the page; the required data has not been extracted from it yet.

Getting Author Name

Now find the author in the div tag where class_="NAURX0ARMmhJ5eqxQrlQW". To locate this class, open the webpage in the browser, right-click the relevant element, and inspect it, as shown in the figure.

Example:

Python3




# find the HTML tags
# with find_all()
# and join their text into a string
data_str = ""
for item in soup.find_all("div", class_="NAURX0ARMmhJ5eqxQrlQW"):
    data_str = data_str + item.get_text()

print(data_str)


Output:

kashaziz
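
Concatenating every match into one string runs the names together when a page contains more than one author element. A minimal sketch that keeps them separate is shown here, assuming a hand-written HTML fragment (sample_html) that reuses the class name from the example above; the second username is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the class name and the first
# username come from the article.
sample_html = """
<div class="NAURX0ARMmhJ5eqxQrlQW">kashaziz</div>
<div class="NAURX0ARMmhJ5eqxQrlQW">another_user</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# One list entry per matching <div>, with surrounding whitespace removed.
authors = [
    item.get_text(strip=True)
    for item in soup.find_all("div", class_="NAURX0ARMmhJ5eqxQrlQW")
]

print(authors)
```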

Getting article content

Now find the article text. Here we follow the same method as in the example above.

Example:

Python3




# find the HTML tags
# with find_all()
# and join their text into a string
data_str = ""
for item in soup.find_all("div", class_="_3xX726aBn29LDbsDtzr_6E _1Ap4F5maDtT1E1YuCiaO0r D3IL3FD0RFy_mkKLPwL4"):
    data_str = data_str + item.get_text()
print(data_str)


Output:
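
When class_ is given several class names in one string, Beautiful Soup compares it against the exact value of the class attribute, so the match fails if the page lists the same classes in a different order. A CSS selector passed to select() matches regardless of order. The sketch below illustrates the difference, assuming a hand-written fragment (sample_html) that carries the three classes from the example above in a different order.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the three class names come from the article,
# but they are deliberately listed in a different order.
sample_html = """
<div class="D3IL3FD0RFy_mkKLPwL4 _3xX726aBn29LDbsDtzr_6E _1Ap4F5maDtT1E1YuCiaO0r">
  <p>First paragraph.</p><p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match: depends on the order of the class names.
exact = soup.find_all(
    "div",
    class_="_3xX726aBn29LDbsDtzr_6E _1Ap4F5maDtT1E1YuCiaO0r D3IL3FD0RFy_mkKLPwL4",
)

# CSS selector: matches any <div> that has all three classes, in any order.
by_selector = soup.select(
    "div._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4"
)

print(len(exact), len(by_selector))

# Join the text of the matched element, separating nested strings with a space.
article_text = " ".join(div.get_text(" ", strip=True) for div in by_selector)
print(article_text)
```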

Getting the comments

Now scrape the comments. Here we follow the same method as in the example above.

Python3




# find the HTML tags
# with find_all()
# and join their text into a string
data_str = ""

for item in soup.find_all("p", class_="_1qeIAgB0cPwnLhDF9XSiJM"):
    data_str = data_str + item.get_text()
print(data_str)


Output:
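
Comments often contain nested tags such as bold text or links, and a page usually holds many of them. The following is a minimal sketch that keeps each comment as its own list entry, assuming a hand-written fragment (sample_html) that reuses the class name from the example above; the comment text is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the class name comes from the article.
sample_html = """
<p class="_1qeIAgB0cPwnLhDF9XSiJM">Great write-up, thanks!</p>
<p class="_1qeIAgB0cPwnLhDF9XSiJM">Does this work on <b>old</b> Reddit?</p>
<p class="unrelated">Footer text</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text(" ", strip=True) strips each nested string and joins
# the pieces with a single space.
comments = [
    p.get_text(" ", strip=True)
    for p in soup.find_all("p", class_="_1qeIAgB0cPwnLhDF9XSiJM")
]

# Print the comments as a numbered list.
for number, comment in enumerate(comments, start=1):
    print(f"{number}. {comment}")
```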
