First try at using ML to summarize text

I want to get my computer to be able to do a decent job of summarizing news articles.

For example, consider this article:


FIFA president Sepp Blatter has revealed the logo for the 2018 World Cup in Russia — with the help of a crew of cosmonauts.

The logo depicts the World Cup trophy in red and blue, colors from the Russian flag, with gold trim. Unveiling the logo on a Russian state TV talk show, Blatter said the logo would show Russia’s “heart and spirit.”

It was then presented over video link by a crew of three Russian cosmonauts on the International Space Station. “Seeing the football World Cup in our country was a dream for all of us,” cosmonaut Elena Serova said.

Simultaneously with the logo’s appearance on TV, it was beamed onto Moscow’s Bolshoi Theatre as part of a light show.

Taken from this article about the 2018 soccer World Cup.

One reasonable summary of this article is:

Three astronauts unveil logo from international space station. Sepp Blatter said the logo would show Russia’s “heart and spirit”.

I want my computer to be able to take an article and then create that kind of summary from it.

Being able to do this would be a really cool and useful thing. And so not unexpectedly, it’s an active area of machine learning research. After googling around I found a very helpful paper “Get To The Point: Summarization with Pointer-Generator Networks”, by two people at Stanford (Abigail See, Christopher D. Manning) and a person at Google Brain (Peter J. Liu). The authors link to their GitHub repo containing the code necessary to replicate their experiments.

So now I’m thinking “This is awesome, I’ll just take their code and use it and see what it does”. 🙂

🙁 But sadly the code uses TensorFlow 1 and is written in Python 2. That sucks, because my Ubuntu 20.04 LTS workstation has TensorFlow 2.2.0 and Python 3.7.

So I ported the code, and here’s my resulting GitHub repo.

The model

The model in their paper (and so in their related code) is a sequence-to-sequence recurrent neural network with LSTM cells, attention, and coverage, which has both abstractive and extractive behavior.
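To make the "both abstractive and extractive" part concrete, here is a minimal sketch of how the paper mixes the two behaviors: the final word distribution is p_gen times the vocabulary distribution, plus (1 - p_gen) times the attention mass sitting on each source word. The function name, argument names, and toy numbers are my own illustration and are not taken from the authors' code.

```python
from collections import defaultdict


def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix the generator's vocabulary distribution with the copy distribution.

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_dist    -- dict mapping vocabulary word -> probability
    attention     -- attention weights over source positions (sums to 1)
    source_tokens -- the source words, one per attention weight
    """
    mixed = defaultdict(float)
    # Abstractive part: words generated from the fixed vocabulary.
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    # Extractive part: words copied from the source via attention.
    for word, weight in zip(source_tokens, attention):
        mixed[word] += (1.0 - p_gen) * weight
    return dict(mixed)


# Toy example (made-up numbers): "blatter" is not in the vocabulary,
# but it can still be produced because it appears in the source text.
vocab_dist = {"the": 0.5, "logo": 0.3, "cup": 0.2}
source_tokens = ["blatter", "revealed", "the", "logo"]
attention = [0.6, 0.1, 0.1, 0.2]
dist = final_distribution(0.7, vocab_dist, attention, source_tokens)
```

This is also why a smaller vocabulary is less painful than it sounds: out-of-vocabulary words in the article can still be copied into the summary.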

Anyway, I got it working…mostly. The experiment done in the paper uses a vocabulary of 50,000 words. But this many words causes my 8GB NVIDIA GeForce RTX 2070 GPU to run out of memory. So I had to scale back to 10,000 words.

This is a little bit of a bummer, because that means I won’t be able to exactly replicate the paper’s results. But I think it will be good enough to get a general feel for things.

Below are two pictures of the model’s neural network graph. These images were created by TensorBoard, which is used to visualize TensorFlow models and their performance.

And here’s a very cool video which has a 3D rendering of the 128-dimensional word embeddings that the model creates.

Training the model

2020-07-22: Start training.

As in the paper, I am training on the CNN/DailyMail dataset, which is a corpus of articles from those news sources along with human-written reference summaries.

The authors trained their model on this dataset for around 230,000 iterations.

The picture below shows where my training stands after 12,000 iterations. I get a feeling that because of the reduced vocab list my model won’t need anywhere near 230,000 iterations for the loss to converge.

Update 2020-07-23: Still training.

Here’s how the training loss looks now after 115K iterations:

Below is an example of a summary that it knows how to create at this point in training. The model’s summary doesn’t completely suck, but it’s not great either.

======================== MODEL-GENERATED SUMMARY ========================

real madrid and barcelona remain on course to face each other in the final of the spanish cup. karim benzema scores the only goal of a tempestuous match to give real madrid a narrow 1-0 victory at holders sevilla in the first leg of their semifinal. benzema, whose place in the madrid side is under threat following the loan signing of manchester city striker emmanuel adebayor. sevilla players crowded the referee and his assistant claiming the ball had cross the line.

======================== HUMAN-WRITTEN REFERENCE SUMMARY ========================

real madrid win 1-0 at holders sevilla in their spanish cup semifinal first leg. karim benzema scores the only goal to give madrid a crucial advantage. barcelona thrash almeria 5-0 in the other semifinal , scoring four goals in first half. city rivals ac milan and holders inter milan both through to the italian cup semifinals.

======================== ORIGINAL ARTICLE ========================

real madrid and barcelona remain on course to face each other in the final of the spanish cup after both claimed semifinal first leg victories on wednesday. karim benzema scored the only goal of a tempestuous match to give real madrid a narrow 1-0 victory at holders sevilla in the first leg of their semifinal. meanwhile, a devastating first-half display set barcelona on their way to a 5-0 home win over almeria, leaving the second leg looking like a formality. benzema, whose place in the madrid side is under threat following the loan signing of manchester city striker emmanuel adebayor , settled the contest at the ramon sanchez pizjuan stadium in the 15th minute , dribbling past one defender before cutting inside and curling home a left-foot shot. however , the home side were adamant they had levelled on the stroke of half-time when luis fabiano rounded goalkeeper iker casillas , only for raul albiol to clear the ball off the line. sevilla players crowded the referee and his assistant claiming the ball had cross the line , but the officials waved away their protests and tv replays proved inconclusive. both sides had chances to score after the break , notably real when mesut ozil and cristiano ronaldo combined to somehow miss from close range. the match finished amid ugly scenes with casillas holding his head after being struck from a missile thrown from the crowd. the result proved a timely present for real madrid coach jose mourinho on his 48th birthday and also edged the club closer to the final of a competition they last won in 1993. should real avoid defeat in the second leg at the santiago bernabeu , they will almost certainly face old rivals barcelona for the first time in a final since 1990. two goalkeeping errors handed lionel messi and david villa early goals in the first half to give barca a perfect start and some more messi magic made it 3-0 by just the 16th minute. 
pedro headed home a xavi free-kick in the 31st-minute for their fourth goal and a flowing move resulted in seydou keita adding a fifth with just two minutes remaining. elsewhere , serie a leaders ac milan are through to the semifinals of the italian cup after a 2-1 victory at sampdoria. two goals from brazilian alexandre pato earned victory for the visitors , and ensured milan striker antonio cassano made a winning return to genoa , less than a month after being effectively sacked by sampdoria after falling out with the club ‘s president. pato struck twice in the space of five first-half minutes , both times being set-up by new dutch signing urby emanuelson. the home side pulled a goal back early in the second half when massimo maccarone netted on his debut following his move from palermo , but milan held on to reach the last four of the competition. milan were joined in the last four by holders and city rivals inter milan , who beat napoli 5-4 on penalties , after 90 minutes and extra time went by without either side scoring a goal. palermo secured their place in the last four on tuesday , also going through on penalties against parma , while juventus and roma will face each other on thursday for the final remaining place. the semifinal line-up for the german cup has also now been decided following wednesday ‘s remaining last eight matches. thomas mueller scored twice as holders bayern munich cruised to a 4-0 victory at second division side alemannia aachen. however , there were shocks in the two other matches on wednesday with second division energie cottbus and duisburg putting out bundesliga sides hoffenheim and kaiserslautern. schalke sealed their place in the semifinals with a 3-2 extra time victory over nuremberg on tuesday.

Update 2020-07-24: Still training.

I was wrong about the loss converging well under 230K iterations. It’s at 218K iterations now. I’m going to let it train for a little longer without the coverage feature, and then afterwards add coverage and train a little more (see the paper for an explanation of coverage).

Here’s the current picture of training loss:

Adding coverage is supposed to help avoid repetitive output in the summaries. Look at what the model currently produces in its summary for this article about a dog show:

thousands of owners accompanied their animals today at the second day of this year ‘s crufts dog show at birmingham ‘s national exhibition centre.
thousands of owners accompanied their animals today at the second day of this year ‘s crufts dog show at birmingham ‘s national exhibition

a record 2,131 dogs have been registered to take part in the annual show. kennel club estimates that 145,000 people will travel to birmingham ‘s nec. dog owners from 41 countries will travel to event featuring 13 new breeds. winner of the coveted best in show title will be crowned on sunday night.

Hopefully adding coverage will mitigate this kind of unwanted repetition.
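For intuition, here is a small sketch of the coverage idea as the paper describes it: keep a running sum of past attention (the coverage vector), and at each decoder step penalize the overlap between the current attention and that running sum, i.e. the sum over source positions of min(attention, coverage). The function name and the toy attention values are made up for illustration, and the real model adds this term to its training loss with a weight rather than using it on its own.

```python
def coverage_loss(attention_steps):
    """Sum over decoder steps of sum_i min(a_i, c_i), where c is the
    running sum of the attention from all previous steps."""
    if not attention_steps:
        return 0.0
    coverage = [0.0] * len(attention_steps[0])
    total = 0.0
    for attn in attention_steps:
        # Penalize attending again to positions that are already covered.
        total += sum(min(a, c) for a, c in zip(attn, coverage))
        # Update the coverage vector with this step's attention.
        coverage = [c + a for c, a in zip(coverage, attn)]
    return total


# Two decoder steps over a three-word source (toy numbers).
repetitive = [[0.9, 0.1, 0.0], [0.9, 0.1, 0.0]]  # looks at the same place twice
varied = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]      # moves on to new words

repetitive_penalty = coverage_loss(repetitive)
varied_penalty = coverage_loss(varied)
```

The repetitive attention pattern gets a much larger penalty than the varied one, which is the pressure that should discourage the model from emitting the same sentence twice.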

Update 2020-07-25: Done training.

I stopped the non-coverage training at 231.9K iterations, and then trained with coverage from 231.9K until 241.9K iterations. Here are the pictures of the loss. This time TensorBoard provides pictures of the new loss scalars coverage_loss and total_loss.

I think that the summaries have improved a little bit, but I’m still not overwhelmed by awesomeness. Here are the generated and reference summaries for this article about Madonna:


a second person has died during construction for madonna’s upcoming concerts in marseilles, france. the second fatality was a 32-year-old british citizen, the british foreign office said. a third person was in critical condition, a spokesman for marseille hospital. at least one madonna show had been canceled, rosenberg told cnn.


two people killed when stage being built for madonna concert collapses. accident happened thursday afternoon in southern french city of marseille. madonna was due to play first of five concerts in city sunday.

Anyway, at this moment, the fully-trained model is running in decode mode and is creating summaries and calculating ROUGE scores for the test dataset.

At the current rate of decoding, it looks like it will take 2 or 3 days, so it won’t be finished until sometime between Monday July 27 and Tuesday July 28.

I am really curious to see how my model with 10K words in the vocab does compared to the paper’s 50K one.

Update 2020-07-27: Fail. Duh. Testing again.

I cannot believe that I ran the test decoding on the train dataset. Running it again on the test dataset. For crying out loud.

Update 2020-07-28: Debugging

The program is not reaching the code which calculates the ROUGE score. I think it might be hanging in a state where there are no more test examples to read but Batcher._finished_reading isn’t set to True. I am debugging; maybe I goofed something up when porting to Python 3.

Yes. After digging around, it looks like I need to make a change in accordance with

The upshot is that the PEP 479 change makes it so that a StopIteration raised inside a generator is replaced by a RuntimeError. This breaks code which depends on the previous behavior. The following code won’t work in Python 3.7 if my_generator raises StopIteration instead of just returning:

except StopIteration: something...

I changed the method accordingly. It’s working now.
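Here is a tiny self-contained demonstration of the PEP 479 behavior in Python 3.7 and later. The generator and helper names are hypothetical and are not from the summarization code; they just show the failure mode and the fix of returning instead of raising.

```python
def leaky_generator():
    yield 1
    # Old style: raising StopIteration by hand to end the generator.
    # Under PEP 479 this is converted into a RuntimeError.
    raise StopIteration


def fixed_generator():
    yield 1
    # New style: simply return to end the generator.
    return


def drain(gen):
    """Consume a generator and report how it ended."""
    items = []
    try:
        for item in gen:
            items.append(item)
    except RuntimeError as err:
        return items, "RuntimeError: {}".format(err)
    return items, "finished cleanly"


leaky_result = drain(leaky_generator())
fixed_result = drain(fixed_generator())
```

Any caller that wraps the generator in "except StopIteration" never sees that exception from the leaky version, so cleanup code in the handler (like setting a finished flag) never runs.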

But the ROUGE scoring of the summaries still wasn’t working. There was a problem with the way I installed the Python pyrouge package. I installed it by running pip install pyrouge, but it turns out that the pip package is outdated. So you have to install things manually. Here’s what I came up with. Seems to work now.

Currently running the scoring on the test set.

Update 2020-07-29: ROUGE scores

It looks like the ROUGE scoring worked like it was supposed to.

Paper’s: 50000 word vocab
Ours: 10000 word vocab

This is interesting. The ROUGE score reported here is an F1 measure: it combines the precision and recall of the overlap between the generated summaries and the reference summaries. All other things being equal, higher scores are better. So from that perspective, it looks like reducing the vocab from 50K words to 10K doesn’t hurt things at all.
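To show what that overlap-based F1 means, here is a simplified ROUGE-1 (unigram overlap) calculation in plain Python. This is only a sketch: it is not the official ROUGE-1.5.5 script that pyrouge wraps, it does no stemming or other preprocessing, and the function name and example sentences are my own.

```python
from collections import Counter


def rouge_1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it
    # appears in both summaries.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # share of candidate words matched
    recall = overlap / sum(ref.values())      # share of reference words matched
    return 2 * precision * recall / (precision + recall)


score = rouge_1_f1(
    "real madrid win at sevilla",
    "real madrid win 1-0 at holders sevilla",
)
```

In this example all 5 candidate words appear in the 7-word reference, so precision is 5/5, recall is 5/7, and the F1 works out to 10/12.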

But… the ROUGE scores do not directly measure how readable a summary is, or whether or not it does a good job of capturing the same meaning that a human-written reference summary captures.

Next steps…

I think I’d like to package up the model so that I can use it in other places.

It would be cool to have a tool which:

  1. Receives a URL as an input.
  2. Uses something like Splash to scrape out the text.
  3. Runs the summary model on that text.
  4. Returns the summary.
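The four steps above could be sketched roughly like this. Everything here is an assumption for illustration: I am assuming a Splash instance running locally on its default port and using its render.html endpoint, the text extraction is a crude standard-library stand-in, and summarize is a placeholder for whatever function ends up wrapping the trained model.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of a locally running Splash instance.
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def fetch_rendered_html(url):
    """Step 2a: ask Splash to render the page and return its HTML."""
    query = urlencode({"url": url})
    with urlopen("{}?{}".format(SPLASH_ENDPOINT, query)) as response:
        return response.read().decode("utf-8", errors="replace")


def html_to_text(html):
    """Step 2b: strip the markup and keep the visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def summarize_url(url, summarize, fetch=fetch_rendered_html):
    """Steps 1-4: URL in, summary out. `summarize` is a placeholder
    callable that maps article text to a summary string."""
    return summarize(html_to_text(fetch(url)))
```

Passing the fetcher and the summarizer in as arguments keeps the pipeline testable without a network connection or a loaded model, for example summarize_url(url, summarize=my_model_fn, fetch=my_fake_fetch).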
