[Teachable NLP] Fake news made with the New York Times

Teachable NLP: link
Tabtab: link
Ainize: view API


You can fine-tune the GPT-2 model without writing code or needing your own GPU! Teachable NLP has been released! :clap::clap::clap:

Today I prepared data from The New York Times to train GPT-2. When GPT-2 was released, people feared it would be used to generate fake news, so I chose this data to find out. Will this model generate terrifying fake news? Let’s check it out!


1. Data



URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html

WASHINGTON — Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.
“We were going to ride our pitching,” Manager Terry Collins said before Wednesday’s game. “But we’re not riding it right now. We’ve got as many problems with our pitching as we do anything.”
Wednesday’s 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz’s place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets’ lineup to overcome against Max Scherzer, the Nationals’ starter.
“We’re not even giving ourselves chances,” Collins said, adding later, “We just can’t give our pitchers any room to work.”
The Mets did not score until the ninth inning, when a last-gasp two-run homer by James Loney off Nationals reliever Shawn Kelley snapped a streak of 23 scoreless innings for the team.
...

This is the New York Times article data I prepared for today. I downloaded it from Kaggle, where it is freely available under the CC0: Public Domain license. At 43 MB, it is a lot of text! It is clean enough to use for training right away, except that each article is preceded by its URL, so I only need to remove those.
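Before removing anything, it can help to check how many articles the file holds. The following is a minimal sketch, assuming (as in the sample above) that every article starts with a line beginning with "URL:". The function name summarize_raw and the example.com URLs are made up for illustration.

```python
def summarize_raw(lines):
    """Count article markers and non-blank body lines in the raw dump."""
    articles = 0
    body_lines = 0
    for line in lines:
        if line.startswith('URL:'):
            # Assumption: one "URL:" line per article.
            articles += 1
        elif line.strip():
            body_lines += 1
    return articles, body_lines


# Tiny made-up sample in the same layout as the Kaggle file.
sample = [
    "URL: http://example.com/first-article.html\n",
    "\n",
    "First paragraph of the first article.\n",
    "Second paragraph.\n",
    "URL: http://example.com/second-article.html\n",
    "\n",
    "Only paragraph of the second article.\n",
]

print(summarize_raw(sample))  # (2, 3)
```

On the real file, the same function can be called with the open file object instead of the sample list.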


2. Preprocessing

import os

# Placeholder paths and file name; change them to match your own setup.
source_dir = 'data'
result_dir = 'result'
text_file = 'nytimes.txt'

os.makedirs(result_dir, exist_ok=True)
name = os.path.splitext(text_file)[0]

with open(os.path.join(source_dir, text_file), 'r', encoding='utf-8') as f:
    with open(os.path.join(result_dir, name + '.txt'), 'w', encoding='utf-8-sig') as r:
        for line in f:
            # Skip blank and whitespace-only lines.
            if not line.strip():
                continue

            # Replace each line starting with "URL:" with the <news> marker.
            if line.startswith('URL:'):
                r.write('<news>\n')
                continue

            # Normalize whitespace inside the line.
            text = " ".join(line.split())
            r.write(text + '\n')

This is the code that preprocesses the New York Times data. It is very simple, since there is little to do beyond removing the URLs. Just deleting them felt like a waste, though, so I put <news> where each URL was: it separates the articles from one another and marks where each one starts.

<news>
WASHINGTON — Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.
“We were going to ride our pitching,” Manager Terry Collins said before Wednesday’s game. “But we’re not riding it right now. We’ve got as many problems with our pitching as we do anything.”
Wednesday’s 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz’s place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets’ lineup to overcome against Max Scherzer, the Nationals’ starter.
“We’re not even giving ourselves chances,” Collins said, adding later, “We just can’t give our pitchers any room to work.”
The Mets did not score until the ninth inning, when a last-gasp two-run homer by James Loney off Nationals reliever Shawn Kelley snapped a streak of 23 scoreless innings for the team.

This is the preprocessed data. There is no complicated logic, so the preprocessing finishes quickly! For some reason the file size has grown to 43.9 MB, but I will use it for training right away!
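As a quick sanity check on the result, the <news> markers can be used to split the text back into individual articles. This is only a sketch under the assumption that each marker sits alone on its own line, as in the sample above. The helper name split_articles is hypothetical.

```python
def split_articles(text, marker='<news>'):
    """Split preprocessed text into articles, each a list of paragraphs."""
    articles = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == marker:
            # A marker line starts a new article.
            current = []
            articles.append(current)
        elif line and current is not None:
            current.append(line)
    return articles


# Made-up miniature of the preprocessed file.
sample_text = "<news>\nA one.\nA two.\n<news>\nB one.\n"
articles = split_articles(sample_text)
print(len(articles))   # 2
print(articles[0])     # ['A one.', 'A two.']
```

Because the preprocessing code writes the file with encoding='utf-8-sig', it is safest to read it back with the same encoding. Otherwise the byte order mark stays attached to the first <news> line, and that first marker is not recognized.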


3. Training

This time I used the small model and 4 epochs. Because the dataset is large, I wanted to use the medium model, but since that takes 7 hours to train, I went with the small one.

With the small model, training took 2 hours and 30 minutes. That is much less than the 7 hours for medium, but it is still quite a long time.


I had forgotten all about the training when… Boom! I received an email notification that my model was complete. Then it’s time to test the model!

4. Generate

Let’s test the model on Tabtab using “Test your model”. Will it create fake news as terrifying as people feared? I’m looking forward to it.

<news>
PHILADELPHIA — The Mets may not know for years whether David Wright will be their everyday third baseman, but sooner or later, he will.
Wright, who was suspended for 162 games without pay by Major League Baseball for violating its domestic violence policy last month, announced on Friday that he would sit out the first three weeks of the season while the team tried to convince him to return. A decision on his return is expected by the end of June. Wright was arrested Oct. 31 after his wife told the police that his former wife had told him that she had grabbed his throat and shoved him into a sliding-glass door at a Hawaiian resort, leaving him with severe rope burns. He returned a short time later on Aug. 9 and was charged with battery. But on Sunday, during a news conference at Major league Baseball’s spring training complex in Arizona, Wright said he no longer believed that was where he had been arrested. It was unclear why he left the country or what role, if any, his lawyers hoped he might have played in inspiring others to act on their own. The police were not immediately available to comment. Messages left with the United States attorney“were not returned.”

This is the New York Times-style fake news generated from the input <news>. It is a cruel story that, in an instant, turns the American baseball star and legend David Wright into a bad man. Until the model was finished, I doubted it could produce fake news that actually makes sense, but it produced a terrifying story that could ruin a person’s reputation.
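Since <news> marks the start of every article in the training data, a generated sample may run on into a second article after a new <news>. A minimal sketch for keeping only the first article of a sample follows. The helper name first_article is hypothetical, and the exact behavior of the trained model is an assumption.

```python
def first_article(generated, marker='<news>'):
    """Return the text of the first article in a generated sample."""
    # Drop everything up to the first marker (normally the prompt itself).
    _, found, rest = generated.partition(marker)
    if not found:
        # No marker at all: treat the whole sample as one article.
        rest = generated
    # Cut at the next marker, where a new article would begin.
    article, _, _ = rest.partition(marker)
    return article.strip()


# Made-up sample output that runs on into a second article.
sample_output = "<news>\nFirst story.\nMore of it.\n<news>\nSecond story."
print(first_article(sample_output))
```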

What do you think of this fake news generator? Scary? Interesting? It produced terrifying news this time, but will it do so again next time?

[KOR]

Fake news generator made with the New York Times :scream: