5 Powerful Text Summarization Techniques in Python

Text summarization is a natural language processing (NLP) task that condenses large amounts of text into a shorter version for quick consumption while keeping the important information.

We’ve all come across articles and other long-form texts with a lot of unnecessary content that completely draws us away from the subject matter.

This can get frustrating, especially when you are doing research and need to collect reliable information quickly. The solution? Text summarization.

With this in mind, let’s first look at the two distinct methods of text summarization, followed by five techniques that can be used in Python.

Methods of text summarization

‘Extractive’ and ‘Abstractive’ are the two methods of performing text summarization. Let’s discuss them in detail.

Extractive text summarization

As the name suggests, extractive text summarization ‘extracts’ notable information from the large dumps of text provided and groups them into clear and concise summaries.

The method is straightforward: it scores each sentence in the text by how much it contributes to the overall subject and keeps the most important ones (the top K).

However, because the output depends on predetermined parameters such as the scoring rule and the value of K, the extracted text can be biased under certain conditions.

Owing to its simplicity in most use cases, extractive text summarization is the most common method used by automatic text summarizers.
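
The top K selection step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation of any particular library: the function name top_k_sentences and the example scores are invented for this sketch, and each sentence is assumed to have been scored already.

```python
def top_k_sentences(sentences, scores, k):
    """Return the k highest-scoring sentences, kept in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k indices, then restore document order for readability.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


sentences = [
    "Summarization shortens text.",
    "The weather was nice.",
    "Extractive methods copy sentences verbatim.",
]
scores = [0.9, 0.1, 0.7]  # hypothetical importance scores
print(top_k_sentences(sentences, scores, k=2))
```

Every sentence in the result is copied verbatim from the input, which is what makes the method extractive.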

Abstractive text summarization

Abstractive text summarization generates new, readable sentences that capture the meaning of the entire text. Rather than copying sentences, it builds a representation of the text and uses natural language processing to rewrite it in a shorter form.

What makes this method unique is that it relies on a model’s grasp of semantics, rather than on sentence selection alone, to produce the summary.

Although it is not as simple to use as the extractive method, abstractive summarization is far more useful in many situations. In a lot of ways, it is a precursor to full-fledged AI writing tools. However, this does not mean that there is no need for extractive summarization.

5 techniques for text summarization in Python

Here are five approaches to text summarization using both abstractive and extractive methods.

1. Gensim

Gensim is an open-source topic and vector space modeling toolkit for Python.

First, import summarize from Gensim’s summarization.summarizer module, which is based on a variation of the TextRank algorithm. Note that the summarization module was removed in Gensim 4.0, so the code below requires an earlier release such as 3.8.

TextRank is a graph-based ranking algorithm: it decides the importance of each vertex in a graph using global information drawn from the whole graph rather than local information about that vertex alone. For summarization, the vertices are sentences.
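
To make the graph idea concrete, here is a minimal pure-Python sketch of TextRank-style sentence ranking. It is not Gensim’s implementation: the helper names are invented for this example, the edge weight is a simple word-overlap (Jaccard) measure chosen for brevity, and the damping factor of 0.85 is the value commonly used with PageRank.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Share of words two sentences have in common (Jaccard overlap)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style power iteration."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]
    # Each sentence is a vertex; the edge weight is the word overlap.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # A vertex gains score from every neighbour connected to it.
            incoming = sum(scores[j] * weights[j][i] / out_sum[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


example = ["Cats chase mice.", "Mice fear cats.", "Stocks fell sharply today."]
for sentence, score in zip(example, rank_sentences(example)):
    print(round(score, 3), sentence)
```

The two sentences that share vocabulary reinforce each other and end up with higher scores than the unrelated one, which is the effect of using information from the whole graph.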

Here’s some example code to summarize text from Wikipedia:

from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
import wikipedia
import en_core_web_sm

To import the Wikipedia content:

# Put the title of the page to summarize between the quotes
wikisearch = wikipedia.page("")
wikicontent = wikisearch.content
nlp = en_core_web_sm.load()
doc = nlp(wikicontent)

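To summarize based on percentage:

# ratio is the fraction of sentences to keep, between 0 and 1 (0.1 is an example value)
summ_per = summarize(wikicontent, ratio=0.1)
print("Percent summary")
print(summ_per)

To summarize based on word count:

# word_count caps the length of the summary in words (200 is an example value)
summ_words = summarize(wikicontent, word_count=200)
print("Word count summary")
print(summ_words)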

There are two ways of extracting text using TextRank: keyword and sentence extraction.

Keyword extraction can be done by simply counting word frequencies, but raw counts are often a poor guide to what a text is about. TextRank instead ranks words by how they relate to the other words in the corpus, which tends to give more accurate results.

Sentence extraction, on the other hand, ranks the sentences in the corpus, selects those most relevant to the subject matter, and arranges them in the order in which they appear in the text.
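
For comparison, the simple frequency test mentioned above looks roughly like this. It is a baseline sketch only: the tiny stop-word list and the function name frequency_keywords are assumptions made for this example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real projects use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "that"}


def frequency_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop-words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


sample = ("The model reads the report. The report is long, "
          "and the model shortens the report.")
print(frequency_keywords(sample, top_n=2))
```

A raw count like this has no notion of context, so a word that is merely repeated often can outrank a word that is central to the topic. That is the gap a graph-based ranking tries to close.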

2. Sumy

Sumy is another library in Python that uses various algorithms to perform text summarization.

Let’s take a look at a few of them.

LexRank

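LexRank is a graph-based summarizer. The code is as follows:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Parse the input string (`text` holds the document to summarize)
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer_lex = LexRankSummarizer()

# Summarize using sumy LexRank, keeping the 2 top-ranked sentences
summary = summarizer_lex(parser.document, 2)
lex_summary = ""
for sentence in summary:
    lex_summary += str(sentence)
print(lex_summary)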

Luhn

Developed by an IBM researcher of the same name, Luhn is one of the oldest summarization algorithms and ranks sentences based on a frequency criterion for words.

Here’s the code for the algorithm:

from sumy.summarizers.luhn import LuhnSummarizer

# `parser` is the PlaintextParser built from the input text
summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 2)

for sentence in summary_1:
    print(sentence)
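
The frequency criterion behind Luhn’s method can be sketched without any library. This is a simplified illustration, not sumy’s implementation: it assumes a word is significant when it appears at least min_count times, and it treats the stretch between the first and last significant word of a sentence as a single cluster. The stop-word list and function name are invented for the example.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "are"}


def luhn_scores(sentences, min_count=2):
    """Score each sentence by the density of frequent ("significant") words."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words
                     if w not in STOP_WORDS)
    significant = {w for w, c in counts.items() if c >= min_count}

    scores = []
    for words in tokenized:
        positions = [i for i, w in enumerate(words) if w in significant]
        if not positions:
            scores.append(0.0)
            continue
        # Simplification: one cluster spanning the first to the last
        # significant word in the sentence.
        span = positions[-1] - positions[0] + 1
        # Score = (significant words in the cluster) squared / cluster length.
        scores.append(len(positions) ** 2 / span)
    return scores


docs = [
    "Solar panels convert sunlight into power.",
    "Solar power is cheap.",
    "My cat sleeps all day.",
]
print(luhn_scores(docs))
```

Sentences in which the frequent words sit close together score highest, while sentences with no frequent words score zero.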

LSA

Latent semantic analysis is an automated method of summarization that utilizes term frequency with singular value decomposition. It has become one of the most used summarizers in recent years.

The code is as follows:

from sumy.summarizers.lsa import LsaSummarizer

# `parser` is the PlaintextParser built from the input text
summarizer_lsa = LsaSummarizer()

# Summarize using sumy LSA
summary = summarizer_lsa(parser.document, 2)
lsa_summary = ""
for sentence in summary:
    lsa_summary += str(sentence)
print(lsa_summary)

TextRank

And last but not least, there is TextRank, the same algorithm that Gensim’s summarizer is based on.

Here’s the code for this algorithm:

# Load Packages
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# For Strings (`text` holds the document to summarize)
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Summarize using sumy TextRank
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 2)
text_summary = ""

for sentence in summary:
    text_summary += str(sentence)

print(text_summary)

When using each of these summarizers, you will notice that they summarize text differently. It’s better to try them all to figure out which one works best in different situations.

3. NLTK

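The ‘Natural Language Toolkit’ (NLTK) is a general-purpose NLP library for Python. It has no built-in summarizer, but its tokenizers and stop-word lists are enough to build a simple frequency-based one.

Here’s how to get it up and running.

Import the required libraries using the code below:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download of the tokenizer model and the stop-word list
# (newer NLTK releases may ask for "punkt_tab" instead of "punkt")
nltk.download("punkt")
nltk.download("stopwords")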

Input your text for summarizing below:

text = """ """

Next, you need to tokenize the text:

stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

Now, you will need to create a frequency table to keep a score of each word:

freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

Next, create a dictionary to keep the score of each sentence:

sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
    # Compare against whole words so that "art" does not match "party"
    sentenceWords = set(word_tokenize(sentence.lower()))
    for word, freq in freqTable.items():
        if word in sentenceWords:
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

Now, we define the average value from the original text as such:

average = int(sumValues / len(sentenceValue))

And lastly, we need to store the sentences into our summary:

summary = ''

for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)
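
To see what the 1.2 × average threshold does, here is a small worked example with made-up scores standing in for sentenceValue. The function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def select_above_threshold(sentence_value, factor=1.2):
    """Keep sentences whose score exceeds factor * the (truncated) average."""
    average = int(sum(sentence_value.values()) / len(sentence_value))
    threshold = factor * average
    return [s for s, score in sentence_value.items() if score > threshold]


# Scores 30, 12 and 9 average to 17, so the threshold is about 20.4.
uneven = {"Sentence A": 30, "Sentence B": 12, "Sentence C": 9}
print(select_above_threshold(uneven))

# When every sentence scores the same, nothing clears the threshold.
flat = {"Sentence A": 10, "Sentence B": 10, "Sentence C": 10}
print(select_above_threshold(flat))
```

The second case shows a practical caveat: if the scores are all similar, the summary can come out empty, in which case lowering the factor is one option.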

4. T5

To make use of Google’s T5 summarizer, there are a few prerequisites.

First, you will need to install PyTorch and Hugging Face’s Transformers. You can install both using the command below:

pip install torch transformers

Next, import PyTorch along with the AutoTokenizer and AutoModelForSeq2SeqLM classes. (AutoModelWithLMHead, which older tutorials use, is deprecated; AutoModelForSeq2SeqLM is the class meant for encoder-decoder models such as T5.)

import torch

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

Next, you need to initialize the tokenizer and the model:

tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base', return_dict=True)

From here, you can use any data you like to summarize. Once you have gathered your data, input the code below to tokenize it:

inputs = tokenizer.encode("summarize: " + text,
                          return_tensors='pt',
                          max_length=512,
                          truncation=True)
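
Because truncation=True drops everything past max_length tokens, a long document is only summarized from its opening portion. One possible workaround, sketched here under stated assumptions, is to split the text into pieces and summarize each piece separately. The helper chunk_words is hypothetical, and its word count is only a rough stand-in for T5’s subword tokens, so the 300-word default is a guess rather than a tuned value.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive pieces of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


chunks = chunk_words("one two three four five", max_words=2)
print(chunks)
```

Each chunk can then be passed through the same encode, generate, and decode steps in place of text.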

Now, you can generate the summary by using the model.generate function on T5:

summary_ids = model.generate(inputs, max_length=150, min_length=80, length_penalty=5., num_beams=2)

Feel free to replace the values mentioned above with your desired values. Once it’s ready, you can move on to decode the tokenized summary using the tokenizer.decode function:

# skip_special_tokens drops markers such as <pad> and </s> from the output
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

And there you have it: a text summarizer with Google’s T5. You can change the input text and the parameter values at any time to summarize other data.

5. GPT-3

GPT-3 is the successor to the GPT-2 language model and is much more capable. Let’s take a look at how to use it from Python with an example that downloads and summarizes PDF research papers.

You will need the openai package to interact with GPT-3, along with an API key from OpenAI. You will also need wget to download PDFs from the internet and pdfplumber to convert them to text. Install all three with pip:

pip install openai
pip install wget
pip install pdfplumber

Then import the dependencies as listed below:

import openai
import wget
import pathlib
import pdfplumber
import numpy as np

To download the PDF and return its local path, enter the following:

def getPaper(paper_url, filename="random_paper.pdf"):
    """
    Downloads a paper from the given url and returns
    the local path to that file.
    """

    downloadedPaper = wget.download(paper_url, filename)
    downloadedPaperFilePath = pathlib.Path(downloadedPaper)

    return downloadedPaperFilePath

Now, you need to convert the PDF into text so GPT-3 can read it:

paperFilePath = "random_paper.pdf"
paperContent = pdfplumber.open(paperFilePath).pages

def displayPaperContent(paperContent, page_start=0, page_end=5):
    for page in paperContent[page_start:page_end]:
        print(page.extract_text())

displayPaperContent(paperContent)

Now that you have the text, it’s time to start summarizing it:

def showPaperSummary(paperContent):
    tldr_tag = "\n tl;dr:"
    # Note: openai.Engine and openai.Completion belong to the legacy
    # (pre-1.0) interface of the openai package.
    openai.organization = 'organization key'
    openai.api_key = "your api key"
    # List of engines available from the openai API
    engine_list = openai.Engine.list()

    for page in paperContent:
        text = page.extract_text() + tldr_tag
        response = openai.Completion.create(
            engine="davinci",
            prompt=text,
            temperature=0.3,
            max_tokens=140,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n"],
        )
        print(response["choices"][0]["text"])

Here, the tl;dr tag appended to each page lets the GPT-3 model know that we require a summary, and the organization and API key set up the environment to use the openai API.

The loop extracts the text from each page, sends it to GPT-3 with a limit of 140 tokens per response, and prints the result to the terminal.
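
The prompt-building logic can be tried offline by separating it from the API call. In the sketch below, build_prompt, summarize_pages, and fake_complete are hypothetical names introduced for illustration; complete stands for any callable that takes a prompt and returns the model’s text, and the stub used here simply echoes part of its input.

```python
TLDR_TAG = "\n tl;dr:"


def build_prompt(page_text):
    """Append the tl;dr cue so the model continues with a summary."""
    return page_text.strip() + TLDR_TAG


def summarize_pages(page_texts, complete):
    """Summarize each page with `complete`, a callable mapping prompt -> text."""
    summaries = []
    for page_text in page_texts:
        # Skip pages that yielded no text (for example, scanned images).
        if not page_text or not page_text.strip():
            continue
        summaries.append(complete(build_prompt(page_text)).strip())
    return summaries


def fake_complete(prompt):
    """Stand-in for the API call: echoes the first five words of the prompt."""
    return " ".join(prompt.split()[:5])


pages = ["Deep learning has changed NLP research a lot.", "", None]
print(summarize_pages(pages, fake_complete))
```

Skipping empty pages also avoids an error when a page yields no extractable text. To use the real model, pass a function that wraps the completion call in place of fake_complete.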

Now that everything is set up, we can run the summarizer:

paperContent = pdfplumber.open(paperFilePath).pages
showPaperSummary(paperContent)

Text summarization is very useful for people who deal with large amounts of written data on a daily basis, such as staff at online magazines and research sites, and even teachers in schools.

While there are simple methods of text summarization in Python such as Gensim and Sumy, there are far more powerful but slightly more complicated summarizers such as T5 and GPT-3.

Which technique to choose really comes down to preference and the use case for each of these summarizers. In general, though, model-based summarizers such as T5 and GPT-3 can be expected to pull ahead in the long run as the underlying models keep improving.
