
How to Extract Wikipedia Data in Python?

Last Updated : 11 Feb, 2022

In this article, we will learn how to extract Wikipedia data using Python. We will cover two methods: the wikipedia module, and web scraping with requests and BeautifulSoup.

Method 1: Using the wikipedia module

In this method, we will use the wikipedia module to extract data. Wikipedia is a multilingual online encyclopedia created and maintained as an open collaboration project by a community of volunteer editors using a wiki-based editing system.

To install it, run this command in your terminal.

pip install wikipedia

The following Wikipedia data will be extracted:

  • Summary and title
  • Page content
  • List of image sources and the page URL
  • Categories of the article

Let's extract this data one by one:

1. Extracting the summary and title

Syntax: wikipedia.summary("Enter Query")

wikipedia.page("Enter Query").title

Python3




import wikipedia

# Get the summary of an article
wikipedia.summary("Python (programming language)")

# Get the title of a page
wikipedia.page("Python (programming language)").title


Output:
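Two details worth knowing about summary(): it accepts an optional sentences parameter to truncate the result, and an ambiguous query raises a DisambiguationError whose options attribute lists the candidate titles. A short sketch (the "Mercury" query is just an example of an ambiguous title):

Python3

# Limit the summary to the first two sentences
wikipedia.summary("Python (programming language)", sentences=2)

# An ambiguous title raises DisambiguationError
try:
    wikipedia.summary("Mercury")
except wikipedia.exceptions.DisambiguationError as e:
    # options lists the candidate article titles
    print(e.options[:5])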

2. Page Content: 

To extract the content of an article, we will use the page() method and its content property to get the actual data.

Syntax: wikipedia.page("Enter Query").content

Python3




# Full plain-text content of the article
wikipedia.page("Python (programming language)").content


Output:

3. Extracting images from Wikipedia

Syntax: wikipedia.page("Enter Query").images

Python3




# List of URLs of the images on the page
wikipedia.page("Python (programming language)").images


Output: 

4. Extracting the current page URL

Use the page() method and the url property.

Syntax: wikipedia.page(“Enter Query”).url

Python3




# Canonical URL of the page
wikipedia.page('"Hello, World!" program').url


Output: 

'https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'
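All of the examples above assume you already know the exact article title. If you don't, the module also provides a search() function that returns a list of matching titles, any of which can then be passed to page(). A short sketch:

Python3

# Search Wikipedia for article titles matching a query;
# returns a list of title strings
wikipedia.search("Hello World program")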

5. Getting the list of categories of an article

Use the page() method and the categories property.

Syntax: wikipedia.page("Enter Query").categories

Python3




# List of the page's categories
wikipedia.page('"Hello, World!" program').categories


Output: 

['Articles with example code',
 'Articles with short description',
 'Commons category link is on Wikidata',
 'Computer programming folklore',
 'Short description is different from Wikidata',
 'Test items in computer languages',
 'Webarchive template wayback links']

6. Getting the list of all links in an article

Syntax: wikipedia.page("Enter Query").links

Python3




# Titles of the Wikipedia pages linked from the article
wikipedia.page('"Hello, World!" program').links


Output: 
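Closely related to links is the references property, which returns the external URLs cited by the article rather than the titles of linked Wikipedia pages:

Python3

# External URLs cited by the article
wikipedia.page('"Hello, World!" program').references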

7. Getting data in different languages

Now we will see how to switch languages: to get results in another language, we use the set_lang() method.

Syntax: wikipedia.set_lang("Enter Language Prefix")

Python3




# Switch results to Hindi ("hi" is the Hindi language prefix)
wikipedia.set_lang("hi")
wikipedia.summary('"Hello, World!" program')


Output:
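Note that set_lang() affects every call that follows it until the language is changed again. To see which prefixes are valid, the module exposes a languages() function; a short sketch:

Python3

# Dictionary mapping language prefixes to language names
wikipedia.languages()

# Switch back to English for later queries
wikipedia.set_lang("en")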
 

Method 2: Using requests and BeautifulSoup

In this method, we will use web scraping.

For scraping in Python, we will use two modules:

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built in with Python. To install it, type the below command in the terminal.
pip install bs4
  • requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built in with Python. To install it, type the below command in the terminal.
pip install requests
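
Before writing the full script, here is a minimal sketch to verify that the two modules work together: it fetches a page and prints the main article heading. The URL is just an example; any Wikipedia article works.

Python3

from bs4 import BeautifulSoup
import requests

# Fetch an example Wikipedia article
r = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")

# Parse the HTML and print the main article heading
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find('h1').text)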

The following data will be extracted:

  • Paragraphs
  • Images
  • List of links
  • Headings
  • Remaining content

Approach: 

  • Get the HTML code of the page
  • From the HTML code, get the content inside the body tag
  • Iterate through the body content and fetch the above data (full implementation below)


Below is the full implementation: 

Python3




# Import modules
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

# Given URL (any Wikipedia article works; this one is an example)
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Fetch URL content
r = requests.get(url)

# Get body content
soup = BeautifulSoup(r.text, 'html.parser').select('body')[0]

# Initialize variables
paragraphs = []
images = []
link = []
heading = []
remaining_content = []

# Iterate through all tags
for tag in soup.find_all():

    # For paragraphs, use the p tag
    if tag.name == "p":

        # .text fetches the text inside the p tag
        paragraphs.append(tag.text)

    # For images, use the img tag
    elif tag.name == "img":

        # Image sources are often relative or protocol-relative,
        # so resolve them against the page URL
        images.append(urljoin(url, tag['src']))

    # For links, use the a tag
    elif tag.name == "a":

        # Only keep anchors that actually carry an href attribute
        if tag.has_attr('href'):

            # urljoin resolves relative hrefs against the page URL
            # and leaves absolute URLs unchanged
            link.append(urljoin(url, tag['href']))

    # For headings, check all six heading tags (h1-h6);
    # a plain "h" in tag.name test would also match th, thead, etc.
    elif tag.name in ("h1", "h2", "h3", "h4", "h5", "h6"):
        heading.append(tag.text)

    # Remaining content is stored here
    else:
        remaining_content.append(tag.text)

print(paragraphs, images, link, heading, remaining_content)
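
Two notes on the script above: urljoin() is used so that relative and protocol-relative src and href values resolve to complete URLs, and the final print emits five very long lists. To check quickly that the scrape worked, you can print just their sizes:

Python3

# Quick sanity check: how many items of each kind were collected
print(len(paragraphs), len(images), len(link), len(heading), len(remaining_content))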


