So… how fast can you get the main ideas from an article? Well, it depends on the article. Maybe two minutes, maybe three.
With the Weighted Text Summarizer model (WTS), that no longer matters. You can take a large body of text and, within seconds, get a concise and coherent summary with the key points of your article.
Two different approaches are used for text summarization: Extractive Summarization and Abstractive Summarization.
Refer to Figure 1.
(Patel et al., Abstractive text summarization on google search results: Semantic scholar 1970)
In Extractive Summarization, crucial phrases and sentences from the original body of text are identified, and only these phrases or sentences are used in the summary.
Refer to Figure 2.
In Abstractive Summarization, by contrast, new sentences are generated from the original body of text. Unlike Extractive Summarization, the generated sentences may not even appear in the original body of text.
Refer to Figure 3.
Weighted Text Summarizer model (WTS)
WTS uses the Extractive method. It works by picking out vital sentences from the body of text and using them as the summary. With this approach, no new sentences are generated; the summary consists only of sentences already present in the text.
Step 1: The first step is to import the required libraries. Two NLTK modules are necessary for building an efficient text summary.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
Corpus: A collection of text is known as a corpus. This could be a data set such as the body of work by an author, the poems of a particular poet, etc.
Tokenizers: A tokenizer divides text into a series of tokens. WTS uses both the word tokenizer and the sentence tokenizer.
Step 2: Remove the stop words, store the remaining words in a separate array, and create a frequency table of those words.
Stop words such as is, an, a, the, and for do not add value to the meaning of a sentence.
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
freqTable = dict()  # Creating a frequency table to keep the score of each word
A dictionary can keep a record of how many times each word appears in the text after the stop words are removed. We can then apply this dictionary to each sentence to find which sentences carry the most relevant content in the overall text.
Step 3: Assign a score to each sentence depending on the words it contains and the frequency table.
Here, we will use the sent_tokenize() method that can be used to create the array of sentences. We will also need a dictionary to keep track of the score of each sentence, and we can later go through the dictionary to create a summary.
sentences = sent_tokenize(text)
sentenceValue = dict()
Step 4: Compute a threshold score to compare the sentences within the text.
We find the average score of a sentence; this average can serve as a good threshold.
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))
Apply the threshold value and store the qualifying sentences, in order, into the summary.
There you have it! With WTS you can cut reading time by up to 50% and still retain 80% of the understanding you would have gained by reading the full article.
To take WTS for a spin, please refer to the following link!