
How to Scrape all PDF files in a Website?

Last Updated : 21 Dec, 2021

Prerequisites: Implementing Web Scraping in Python with BeautifulSoup

Web scraping is a method of extracting data from websites and using that data for other purposes. There are several libraries and modules for doing web scraping in Python. In this article, we'll learn how to scrape the PDF files from a website with the help of beautifulsoup, which is one of the best web scraping modules in Python, and the requests module for making GET requests. Also, to get more information about each PDF file, we use the PyPDF2 module.
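Note: requests, beautifulsoup, and PyPDF2 are third-party modules. If they are not already installed, they can usually be fetched from PyPI with pip (the package names below are the common ones; verify them against your environment):

pip install requests beautifulsoup4 PyPDF2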

Step-by-Step Code

Step 1: Import all the important modules and packages.

Python3




# for sending GET requests to the website and the pdf files
import requests
 
# for parsing and traversing the html of the webpage
from bs4 import BeautifulSoup
 
# for input and output operations
import io
 
# For getting information about the pdfs
from PyPDF2 import PdfFileReader


Step 2: Pass the URL and make an HTML parser with the help of BeautifulSoup.

Python3




# website to scrape
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
 
# fetch the webpage with a GET request
read = requests.get(url)
 
# full html content
html_content = read.content
 
# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")


In the above code:

  • Scraping is done on the https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/ link.
  • The requests module is used for making the GET request.
  • read.content holds all the HTML code of the page; printing it outputs the source code of the web page.
  • soup holds the parsed HTML content and is used to traverse the HTML (a minimal sketch of this step follows the list).
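Here is a minimal sketch of this step with a basic status check added before parsing. The url value is the article link listed above; the raise_for_status call is an extra safeguard and is not part of the original code.

Python3

import requests
from bs4 import BeautifulSoup
 
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
 
# fetch the page and stop early on a 4xx/5xx response
read = requests.get(url)
read.raise_for_status()
 
# parse the html and confirm the parse worked
soup = BeautifulSoup(read.content, "html.parser")
print(soup.title.string)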

Step 3: Traverse through all the PDF links on the web page.

Python3




# create an empty set to hold the pdf links
list_of_pdf = set()
 
# access the first p tag in the html
l = soup.find('p')
 
# access all the anchor tags inside the given p tag
p = l.find_all('a')
 
# iterate through the anchors to get all the href links
for link in p:
     
    # original html links
    print("links: ", link.get('href'))
    print("\n")
     
    # converting the extension from .html to .pdf
    pdf_link = (link.get('href')[:-5]) + ".pdf"
     
    # converted to .pdf
    print("converted pdf links: ", pdf_link)
    print("\n")
     
    # add the pdf link to the set
    list_of_pdf.add(pdf_link)


 
 

Output:

In the above code:

  • list_of_pdf is an empty set created for collecting all the PDF links from the web page. A set is used because it never keeps two elements with the same name, so duplicates are removed automatically.
  • The iteration goes through all the links and converts each .html extension to .pdf. This works because the PDF name and the HTML name differ only in the extension; the rest is identical.
  • A list could be used instead of the set (with append in place of add), but then duplicate names would have to be removed by hand. Because the extension swap is only an assumption about how the files are named, the converted links can optionally be verified, as shown in the sketch after this list.
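The sketch below uses a hypothetical looks_like_pdf helper (not part of the original article) that sends a HEAD request and inspects the Content-Type header to filter out converted links that do not actually serve a PDF.

Python3

import requests
 
# hypothetical helper, added here for illustration only
def looks_like_pdf(link):
    """Return True if the link actually serves a PDF document."""
    try:
        head = requests.head(link, allow_redirects=True, timeout=10)
        return head.ok and "pdf" in head.headers.get("Content-Type", "").lower()
    except requests.RequestException:
        return False
 
# keep only the links that really point to PDFs
valid_pdfs = {link for link in list_of_pdf if looks_like_pdf(link)}
print(len(valid_pdfs), "of", len(list_of_pdf), "links serve real PDFs")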

Step 4: Create an info function with the PyPDF2 module for getting all the required information about the PDF.

Python3




def info(pdf_path):
 
    # fetch the pdf file with a GET request
    response = requests.get(pdf_path)
 
    # response.content holds the raw bytes of the pdf;
    # wrap them in an in-memory binary stream
    with io.BytesIO(response.content) as f:
 
        # initialize the pdf reader
        pdf = PdfFileReader(f)
 
        # document metadata and page count
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
     
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
     
    return information


 
In the above code:

  • The info function is responsible for printing all the required information scraped from inside the PDF.
  • io.BytesIO(response.content) is used because response.content is raw bytes rather than a file on disk; wrapping those bytes in io.BytesIO gives PdfFileReader a file-like object it can read from.
  • There are several other PyPDF2 functions for accessing different data in a PDF; one of them is shown in the sketch after this list.
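For example, the text of the first page can be pulled from the same PdfFileReader object. This is only a sketch using the older PyPDF2 API that the article relies on (getPage and extractText); the quality of the extracted text depends on how the PDF was produced.

Python3

import io
import requests
from PyPDF2 import PdfFileReader
 
def first_page_text(pdf_path):
    # fetch the pdf and wrap its raw bytes in a file-like object
    response = requests.get(pdf_path)
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        # extract the text of page 0 with the older PyPDF2 API
        return pdf.getPage(0).extractText()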

Note: Refer to Working with PDF files in Python for detailed information. The info function can now be called for every collected PDF link:

Python3




# print the information of every pdf in the console
for i in list_of_pdf:
    info(i)
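Since the goal is to scrape the PDF files themselves, each collected link can also be saved to disk. The loop below is a sketch, not part of the original article; it writes every PDF into the current working directory under the last segment of its URL.

Python3

import os
import requests
 
# download every collected pdf link to the current directory
for pdf_url in list_of_pdf:
    response = requests.get(pdf_url)
    if not response.ok:
        continue
    # derive a local file name from the last part of the url
    file_name = os.path.basename(pdf_url)
    with open(file_name, "wb") as f:
        f.write(response.content)
    print("saved:", file_name)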


Complete Code:

Python3




import requests
from bs4 import BeautifulSoup
import io
from PyPDF2 import PdfFileReader
 
 
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")
 
list_of_pdf = set()
l = soup.find('p')
p = l.find_all('a')
 
for link in p:
    pdf_link = (link.get('href')[:-5]) + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)
 
def info(pdf_path):
    response = requests.get(pdf_path)
     
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information
 
 
for i in list_of_pdf:
    info(i)


Output:


