
5 Powerful Text Summarization Techniques in Python


Text summarization is a natural language processing (NLP) task that condenses large amounts of text for quick consumption while preserving the most important information.

We’ve all come across articles and other long-form texts padded with unnecessary content that pulls us away from the subject matter.

This can get frustrating, especially during research or when collecting information on a topic. The solution? Text summarization.

With this in mind, let’s first look at the two distinctive methods of text summarization, followed by five techniques that Python developers can use.

Methods of text summarization

‘Extractive’ and ‘Abstractive’ are the two methods of performing text summarization. Let’s discuss them in detail.

Extractive text summarization

As the name suggests, extractive text summarization ‘extracts’ notable sentences from the large body of text provided and groups them into a clear and concise summary.
The method is straightforward: given the text to be summarized, it scores each sentence by its value to the overall subject and keeps the most important ones (the Top K).
This, however, also means that the method is limited by its predetermined parameters, which can make the extracted text biased under certain conditions.
Owing to its simplicity in most use cases, extractive text summarization is the most common method used by automatic text summarizers.
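
To make the idea concrete, here is a minimal sketch of the extractive approach (an illustrative toy, not any particular library’s algorithm): score each sentence by the frequency of the words it contains, then keep the Top K in their original order.

from collections import Counter
from heapq import nlargest
import re

def extract_top_k(text, k=2):
    # Naive sentence and word splitting, for illustration only
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Score each sentence by the summed frequency of its words
    scores = {s: sum(freq[w] for w in re.findall(r'\w+', s.lower())) for s in sentences}
    # Keep the Top K sentences, preserving their original order
    top = set(nlargest(k, scores, key=scores.get))
    return ' '.join(s for s in sentences if s in top)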

Abstractive text summarization

Abstractive text summarization generates new, legible sentences from the entirety of the text provided. It rewrites large amounts of text by building an internal semantic representation of it, which is then turned into a summary through natural language generation.
What makes this method unique is its AI-like ability to use a machine’s semantic understanding of the text to paraphrase it and iron out the kinks.
Although it is not as simple to use as the extractive method, abstractive summarization is far more useful in many situations. In a lot of ways, it is a precursor to full-fledged AI writing tools. However, this does not mean that there is no need for extractive summarization.

5 techniques for text summarization in Python

Here are five approaches to text summarization using both abstractive and extractive methods.

1. Gensim

Gensim is an open-source topic and vector space modeling toolkit within the Python programming language.

First, import the summarize function from Gensim’s summarization module, which is based on a variation of the TextRank algorithm. (Note that this module was removed in Gensim 4.0, so the examples below require Gensim 3.x.)

Since TextRank is a graph-based ranking algorithm, it determines the importance of each vertex in a graph based on global information drawn from the graph as a whole.
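
As a rough sketch of that idea (not Gensim’s internal implementation), you can treat sentences as vertices, weight edges by word overlap, and run PageRank over the resulting graph; this assumes the third-party networkx package is installed:

import re
import networkx as nx

def textrank_sketch(text, k=2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    tokens = [set(re.findall(r'\w+', s.lower())) for s in sentences]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    # Edge weight = number of words two sentences share (a crude similarity)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            overlap = len(tokens[i] & tokens[j])
            if overlap:
                graph.add_edge(i, j, weight=overlap)
    # PageRank assigns each vertex a global importance score
    ranks = nx.pagerank(graph)
    top = sorted(sorted(ranks, key=ranks.get, reverse=True)[:k])
    return ' '.join(sentences[i] for i in top)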

Here’s example code to summarize an article from Wikipedia. First, the imports:

from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
import wikipedia

To fetch the Wikipedia content (the page title below is an illustrative placeholder; substitute any article):

wikisearch = wikipedia.page("Natural language processing")  # example page title
wikicontent = wikisearch.content

To summarize based on a percentage of the original text:

summ_per = summarize(wikicontent, ratio=0.1)  # example ratio: keep ~10% of the sentences
print("Percent summary")
print(summ_per)

To summarize based on word count:

summ_words = summarize(wikicontent, word_count=100)  # example target length in words
print("Word count summary")
print(summ_words)

There are two ways of extracting text using TextRank: keyword and sentence extraction.

Keyword extraction can be done with a simple word-frequency test, but this almost always proves inaccurate. TextRank automates the process and ranks candidate keywords against the whole corpus, providing far more accurate, semantically grounded results.

Sentence extraction, on the other hand, scores whole sentences in the corpus, selects those most relevant to the subject matter, and arranges them into a coherent summary.
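
The keywords helper imported in the example above exposes TextRank’s keyword extraction directly; the word count of 10 here is just an illustrative value:

print(keywords(wikicontent, words=10))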

2. Sumy

Sumy is another library in Python that uses various algorithms to perform text summarization.

Let’s take a look at a few of them.

LexRank

LexRank is a graph-based summarizer. Before using any of Sumy’s summarizers, the raw text must be parsed into a document. The code is as follows:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Parse the raw string to be summarized into a sumy document
text = """ """  # paste the text to summarize here
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer_lex = LexRankSummarizer()

# Summarize using sumy LexRank, keeping the 2 best sentences
summary = summarizer_lex(parser.document, 2)
lex_summary = ""
for sentence in summary:
    lex_summary += str(sentence)
print(lex_summary)

Luhn

Developed by IBM researcher Hans Peter Luhn, after whom it is named, Luhn is one of the oldest summarization algorithms; it ranks sentences based on a frequency criterion for words.

Here’s the code for the algorithm:

from sumy.summarizers.luhn import LuhnSummarizer

summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 2)

for sentence in summary_1:
    print(sentence)

LSA

Latent semantic analysis (LSA) is an automated method of summarization that combines term frequency with singular value decomposition. It has become one of the most widely used summarizers in recent years.

The code is as follows:

from sumy.summarizers.lsa import LsaSummarizer

summarizer_lsa = LsaSummarizer()

# Summarize using sumy LSA
summary = summarizer_lsa(parser.document, 2)
lsa_summary = ""
for sentence in summary:
    lsa_summary += str(sentence)
print(lsa_summary)
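
To see roughly what LSA does under the hood, here is a minimal sketch (independent of Sumy, and simplified from real implementations) that builds a term-sentence count matrix, applies singular value decomposition, and keeps the sentences that load most heavily on the first latent topic:

import re
import numpy as np

def lsa_sketch(sentences, k=2):
    # Build a term-sentence matrix of raw counts
    vocab = sorted({w for s in sentences for w in re.findall(r'\w+', s.lower())})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in re.findall(r'\w+', s.lower()):
            matrix[index[w], j] += 1
    # Rows of vt describe how strongly each sentence expresses each latent topic
    u, sigma, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k sentences with the largest weight on the first topic
    top = sorted(np.argsort(-np.abs(vt[0]))[:k])
    return [sentences[i] for i in top]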

TextRank

And last but not least, there is TextRank, which uses the same underlying algorithm as the Gensim summarizer described above.
Here’s the code for this algorithm:

from sumy.summarizers.text_rank import TextRankSummarizer

# Summarize using sumy TextRank, reusing the parser built earlier
summarizer_text = TextRankSummarizer()
summary = summarizer_text(parser.document, 2)
text_summary = ""
for sentence in summary:
    text_summary += str(sentence)
print(text_summary)

When using each of these summarizers, you will notice that they summarize text differently. It’s better to try them all to figure out which one works best in different situations.
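
A quick way to compare them side by side is to run all four summarizers over the same parsed document; this snippet reuses the objects created in the examples above:

for name, summarizer in [("LexRank", summarizer_lex),
                         ("Luhn", summarizer_1),
                         ("LSA", summarizer_lsa),
                         ("TextRank", summarizer_text)]:
    print(name)
    for sentence in summarizer(parser.document, 2):
        print(" ", sentence)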

3. NLTK

The Natural Language Toolkit (NLTK) is a Python library for NLP that can be used to build a simple frequency-based text summarizer.

Here’s how to get it up and running.
Import the required libraries using the code below:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')      # sentence and word tokenizer models
nltk.download('stopwords')  # stop word lists

Input your text for summarizing below:

text = """ """  # paste the text to summarize between the quotes

Next, you need to tokenize the text:

stopWords = set(stopwords.words("english"))              
words = word_tokenize(text)                   

Now, you will need to create a frequency table to keep a score of each word:

freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

Next, create a dictionary to keep the score of each sentence:

sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

Now, we compute the average sentence score for the original text:

average = int(sumValues / len(sentenceValue))                

And lastly, we need to store the sentences into our summary:

summary = ''

for sentence in sentences:
    # Keep only sentences scoring well above the average (1.2x threshold)
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence

print(summary)

4. T5

To make use of Google’s T5 summarizer, there are a few prerequisites.
First, you will need to install PyTorch and Hugging Face’s Transformers. You can install both using pip:

pip install torch transformers

Next, import PyTorch along with the AutoTokenizer and AutoModelWithLMHead classes. (Note that newer versions of Transformers deprecate AutoModelWithLMHead in favor of AutoModelForSeq2SeqLM.)

import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

Next, you need to initialize the tokenizer and the model:

tokenizer = AutoTokenizer.from_pretrained('t5-base')                        
model = AutoModelWithLMHead.from_pretrained('t5-base', return_dict=True)                   

From here, you can use any data you like to summarize. Once you have gathered your data, input the code below to tokenize it:

inputs = tokenizer.encode("summarize: " + text,
                          return_tensors='pt',
                          max_length=512,
                          truncation=True)

Now, you can generate the summary by using the model.generate function on T5:

summary_ids = model.generate(inputs, max_length=150, min_length=80, length_penalty=5., num_beams=2)                   

Feel free to replace the values mentioned above with your desired values. Once it’s ready, you can move on to decode the tokenized summary using the tokenizer.decode function:

summary = tokenizer.decode(summary_ids[0])                    

And there you have it: a text summarizer with Google’s T5. You can replace the texts and values at any time to summarize various arrays of data.
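
Putting the steps together, a minimal sketch of a reusable helper might look like this (the length limits are the example values used above, and skip_special_tokens strips markers such as <pad> from the output):

def t5_summarize(text, max_length=150, min_length=80):
    inputs = tokenizer.encode("summarize: " + text,
                              return_tensors='pt',
                              max_length=512,
                              truncation=True)
    summary_ids = model.generate(inputs,
                                 max_length=max_length,
                                 min_length=min_length,
                                 length_penalty=5.,
                                 num_beams=2)
    # Drop special tokens from the decoded summary
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(t5_summarize(text))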

5. GPT-3

GPT-3 is the successor to GPT-2, and its API is much more capable and functional. Let’s take a look at how to get it running in Python with an example that summarizes downloaded PDF research papers.

First, install the dependencies. You will need openai to interact with GPT-3, so make sure you have an API key from your OpenAI account. You will also need wget to download PDFs from the internet and pdfplumber to convert them back to text. Install all three with pip:

pip install openai
pip install wget
pip install pdfplumber

Then import them, along with pathlib for handling file paths:

import openai
import wget
import pathlib
import pdfplumber

To download the PDF and return its local path, enter the following:

def getPaper(paper_url, filename="random_paper.pdf"):
    """
    Downloads a paper from the given url and returns the local path to that file.
    """
    downloadedPaper = wget.download(paper_url, filename)
    downloadedPaperFilePath = pathlib.Path(downloadedPaper)
    return downloadedPaperFilePath
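
For example, to fetch a paper (the URL below is a hypothetical placeholder; substitute any direct PDF link). By default, getPaper saves the file as random_paper.pdf, which is the path the next step opens:

paper_url = "https://example.com/paper.pdf"  # hypothetical placeholder URL
paperFilePath = getPaper(paper_url)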

Now, you need to convert the PDF into text so GPT-3 can read it:

paperFilePath = "random_paper.pdf"
paperContent = pdfplumber.open(paperFilePath).pages

def displayPaperContent(paperContent, page_start=0, page_end=5):
    for page in paperContent[page_start:page_end]:
        print(page.extract_text())

displayPaperContent(paperContent)

Now that you have the text, it’s time to start summarizing it:

def showPaperSummary(paperContent):
    tldr_tag = "\n tl;dr:"
    openai.organization = 'organization key'
    openai.api_key = "your api key"
    engine_list = openai.Engine.list()  # lists the engines available from the openai API

Here, the tl;dr tag lets the GPT-3 model know that we require a summary, and the credentials set up the environment to use the openai API. The function continues by summarizing each page (note that this loop is still inside showPaperSummary):

    for page in paperContent:
        text = page.extract_text() + tldr_tag
        response = openai.Completion.create(engine="davinci",
                                            prompt=text,
                                            temperature=0.3,
                                            max_tokens=140,
                                            top_p=1,
                                            frequency_penalty=0,
                                            presence_penalty=0,
                                            stop=["\n"])
        print(response["choices"][0]["text"])

This code extracts the text from each page, sends it to the GPT-3 completion endpoint with the tl;dr prompt (capped at 140 output tokens per page), and prints the resulting summary to the terminal.
Now that everything is set up, we can run the summarizer:

paperContent = pdfplumber.open(paperFilePath).pages               
showPaperSummary(paperContent)           

Text summarization is very useful for people who deal with large amounts of written data on a daily basis, such as editors of online magazines, researchers, and even teachers in schools.

While there are simple methods of text summarization in Python, such as Gensim and Sumy, there are also far more powerful but slightly more complicated summarizers, such as T5 and GPT-3.

Which technique to choose really comes down to preference and the use case for each of these summarizers. In theory, though, AI-based summarizers should prove better in the long run as the underlying models continue to improve and provide superior results.
