How to Extract Wikipedia Data in Python?
Last Updated: 11 Feb, 2022
In this article, we will learn how to extract Wikipedia data using Python. We will cover two methods of extracting the data.
Method 1: Using the Wikipedia module
In this method, we will use the wikipedia module for extracting data. Wikipedia is a multilingual online encyclopedia created and maintained as an open collaboration project by a community of volunteer editors, using a wiki-based editing system.
To install it, run this command in your terminal.
pip install wikipedia
We will extract the following Wikipedia data:
- Summary and title
- Page content
- List of image sources and the page URL
- Article categories
Let us extract the data one by one:
1. Extracting the summary and title
Syntax: wikipedia.summary(“Enter Query”)
wikipedia.page(“Enter Query”).title
Python3
import wikipedia
wikipedia.summary("Python (programming language)")
Output: the opening summary of the article as a plain-text string.
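The syntax above also mentions the title property, and summary() additionally accepts a sentences argument to limit the length of the result. A minimal sketch of both:
Python3
import wikipedia

# Title of the page the query resolves to
print(wikipedia.page("Python (programming language)").title)

# Limit the summary to the first two sentences
print(wikipedia.summary("Python (programming language)", sentences=2))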
2. Extracting page content
To extract the content of an article, we will use the page() method and the content property to get the actual data.
Syntax: wikipedia.page(“Enter Query”).content
Python3
wikipedia.page( "Python (programming language)" ).content
|
Output:
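Since content is one long plain-text string, it is often convenient to preview just the beginning; a small sketch:
Python3
import wikipedia

content = wikipedia.page("Python (programming language)").content
print(content[:300])  # print only the first 300 characters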
3. Extracting images from an article
Syntax: wikipedia.page(“Enter Query”).images
Python3
wikipedia.page( "Python (programming language)" ).images
|
Output:
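The images property is an ordinary Python list of URLs, so it can be sliced like any list; for example, to look at the first three:
Python3
import wikipedia

images = wikipedia.page("Python (programming language)").images
for src in images[:3]:
    print(src)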
4. Extracting the current page URL
Use the page() method and the url property.
Syntax: wikipedia.page(“Enter Query”).url
Python3
wikipedia.page( '"Hello, World!" program' ).url
|
Output:
'https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'
5. Getting the list of categories of an article
Use the page() method and the categories property.
Syntax: wikipedia.page(“Enter Query”).categories
Python3
wikipedia.page( '"Hello, World!" program' ).categories
|
Output:
['Articles with example code',
'Articles with short description',
'Commons category link is on Wikidata',
'Computer programming folklore',
'Short description is different from Wikidata',
'Test items in computer languages',
'Webarchive template wayback links']
6. Getting the list of links in an article
Syntax: wikipedia.page(“Enter Query”).links
Python3
wikipedia.page( '"Hello, World!" program' ).links
|
Output:
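links is likewise a plain list of page titles, so it can be counted and sliced; a quick sketch:
Python3
import wikipedia

links = wikipedia.page('"Hello, World!" program').links
print(len(links), "links found")
print(links[:5])  # first five linked page titles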
7. Getting data in a different language
Now we will see how to get data in another language. To switch languages, use the set_lang() method with a language code ("hi" is Hindi).
Syntax: wikipedia.set_lang(“Enter Language Code”)
Python3
wikipedia.set_lang( "hi" )
wikipedia.summary( '"Hello, World!" program' )
|
Output:
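Two practical notes, shown in the sketch below: set_lang() stays in effect for all later calls, so switch back when you are done, and queries that are ambiguous or match no page raise DisambiguationError and PageError respectively:
Python3
import wikipedia

wikipedia.set_lang("en")  # switch back to English

try:
    # "Mercury" is ambiguous (planet, element, deity, ...)
    print(wikipedia.summary("Mercury"))
except wikipedia.exceptions.DisambiguationError as e:
    print("Ambiguous query; options include:", e.options[:5])
except wikipedia.exceptions.PageError:
    print("No page matched the query")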
Method 2: Using Web Scraping
In this method, we will use web scraping. For scraping in Python we will use two modules:
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install it, type the below command in the terminal.
pip install bs4
- requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python. To install it, type the below command in the terminal.
pip install requests
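A quick sanity check that both modules are installed and working, assuming the English Wikipedia main page as a test URL:
Python3
import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Main_Page")
print(r.status_code)  # 200 on success
print(BeautifulSoup(r.text, "html.parser").title.text)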
The following data will be extracted:
- Paragraphs
- Images
- List of Images
- Headings
- Remaining content (everything not matched above)
Approach:
- Get the HTML code
- From the HTML code, get the content inside the body tag
- Iterate through the body content and fetch the above data
Python3
from bs4 import *
import requests
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser' ).select( 'body' )[ 0 ]
paragraphs = []
images = []
link = []
heading = []
remaining_content = []
for tag in soup.find_all():
if tag.name = = "p" :
paragraphs.append(tag.text)
elif tag.name = = "img" :
images.append(url + tag[ 'src' ])
elif tag.name = = "a" :
if "href" in str (tag):
link.append(url + tag[ 'href' ])
else :
link.append(tag[ 'href' ])
elif "h" in tag.name:
if "h1" = = tag.name:
heading.append(tag.text)
elif "h2" = = tag.name:
heading.append(tag.text)
elif "h3" = = tag.name:
heading.append(tag.text)
elif "h4" = = tag.name:
heading.append(tag.text)
elif "h5" = = tag.name:
heading.append(tag.text)
else :
heading.append(tag.text)
else :
remaining_content.append(tag.text)
print (paragraphs, images, link, heading, remaining_content)
|
- body content and fetch the above data
Below is the full implementation:
Python3
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

# URL of the Wikipedia article to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Get the HTML code and keep only the content of the body tag
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser').select('body')[0]

paragraphs = []
images = []
link = []
heading = []
remaining_content = []

# Iterate through the body content and sort each tag into a bucket
for tag in soup.find_all():
    if tag.name == "p":
        # Paragraph text
        paragraphs.append(tag.text)
    elif tag.name == "img":
        # Resolve relative and protocol-relative image sources
        images.append(urljoin(url, tag['src']))
    elif tag.name == "a":
        # Only anchors that actually carry an href attribute
        if tag.has_attr('href'):
            link.append(urljoin(url, tag['href']))
    elif tag.name in ("h1", "h2", "h3", "h4", "h5", "h6"):
        # Headings of any level
        heading.append(tag.text)
    else:
        # Everything else
        remaining_content.append(tag.text)

print(paragraphs, images, link, heading, remaining_content)
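As a quick usage check, appending the lines below to the script above prints how many items landed in each bucket:
Python3
print(len(paragraphs), "paragraphs")
print(len(images), "images")
print(len(link), "links")
print(len(heading), "headings")
print(len(remaining_content), "other tags")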