[Teachable NLP] GPT-2 LoveCraft

Teachable NLP: link
Tabtab: link
Ainize: view API


 Hello! Today I fine-tuned GPT-2 with Teachable NLP using data that I have used before. The data is the work of Lovecraft, the famous horror writer, downloaded from Kaggle. It’s free to use because it is labeled CC0: Public Domain!


1. Data

A Reminiscence of Dr. Samuel Johnson
The Privilege of Reminiscence, however rambling or tiresome, is one generally allow’d
to the very aged; indeed, ’tis frequently by means of such Recollections that the obscure
occurrences of History, and the lesser Anecdotes of the Great, are transmitted to Posterity.
Tho’ many of my readers have at times observ’d and remark’d
a Sort of antique Flow in my Stile of Writing, it hath pleased me to pass amongst the Members
of this Generation as a young Man, giving out the Fiction that I was born in 1890, in America.
I am now, however, resolv’d to unburthen myself of a Secret which I have hitherto kept
thro’ Dread of Incredulity; and to impart to the Publick a true knowledge of my long years,

 There are 102 files totaling 7.58MB. Excluding “concat.txt”, which puts all the data in one place, that leaves 3.58MB. Still, that’s a lot of data!
 It’s good data, but it’s not in the form I want, so I preprocessed it before training.
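 If you want to check those numbers yourself, a short sketch like this adds them up (the lovecraft directory name is an assumption; use wherever you unzipped the Kaggle download):

import os

data_dir = 'lovecraft'  # hypothetical path to the unzipped Kaggle files

files = os.listdir(data_dir)
total = sum(os.path.getsize(os.path.join(data_dir, f)) for f in files)
without_concat = total - os.path.getsize(os.path.join(data_dir, 'concat.txt'))

print(f'{len(files)} files, {total / 2**20:.2f}MB total, '
      f'{without_concat / 2**20:.2f}MB without concat.txt')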


2. Preprocessing

 This time, I want to create a model that generates Lovecraft-style text, like the Lord of the Rings model I made before. To do that, there is a lot of work to be done: each file has a title and chapter headings, some of the spaces are NBSP (non-breaking space) characters, and newlines are frequently inserted just to make the text look nice. I could have used concat.txt, but I didn’t, because it contains things I don’t need, such as chapter notation, headings, and quotes.

 First, I manually erased the chapter notation before combining the data. I could have removed it with regular expressions, but the notation differs slightly from file to file, so I deleted it by hand. While erasing, though, I noticed that it follows a rough format, so coding it would not be hard; see the sketch below.
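 For anyone who prefers to automate that step, a rough sketch like the following could drop heading lines such as “Chapter I” or a bare Roman numeral. The patterns are my assumptions, not something verified against every file, so check them on your own data first.

import re

# Matches lines that contain only a chapter heading, e.g. "Chapter I",
# "CHAPTER 2.", or a bare Roman numeral like "IV." (assumed patterns).
chapter_re = re.compile(r'^\s*(?:chapter\s+)?[IVXLC]+\.?\s*$'
                        r'|^\s*chapter\s+\d+\.?\s*$', re.IGNORECASE)

def strip_chapters(lines):
    # Keep every line that is not a chapter heading.
    return [line for line in lines if not chapter_re.match(line)]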

import os

source_dir = 'lovecraft'   # hypothetical path to the downloaded files
result_dir = 'result'      # hypothetical output directory
result_name = 'merged'
file_list = os.listdir(source_dir)

with open(result_dir + '/' + result_name + '.txt', 'w', encoding='utf-8-sig') as r:

    for file in file_list:
        name = file.split('.')

        # Skip anything that is not a .txt file.
        if 'txt' not in name:
            continue

        with open(source_dir + '/' + file, 'r', encoding='utf-8') as f:
            lines = f.readlines()

            # Variable for counting the title line
            count = 0

            for line in lines:
                # Line 0 is the title, so it is skipped.
                if count == 0:
                    count = 1
                    continue

                # The stray whitespace this adds is normalized in the next step.
                r.write(line + " ")

 This is simple code to merge the files. You could use a shell command instead (e.g., on Linux: cat *.txt > text.txt), but I wrote it in Python so that the title line of each file could be deleted.

 Now that the files are merged, one more pass over the text as a whole and the data for training GPT-2 will be complete!

import re

import unidecode

source_dir = 'result'    # hypothetical path: where the merged file was written
result_dir = 'final'     # hypothetical output directory
text_file = 'merged.txt'

with open(source_dir + '/' + text_file, 'r', encoding='utf-8') as f:

    name = text_file.split('.')[0]
    # These words are not the end of the sentence.
    whos = [r'Dr\.$', r'Mr\.$', r'Ms\.$', r'Mrs\.$', r'\.\.+$']

    with open(result_dir + '/' + name + '.txt', 'w', encoding='utf-8') as r:
        lines = f.readlines()

        for line in lines:
            # Lines with no text are skipped.
            if not line.strip():
                continue

            # Replace NBSP (and other non-ASCII characters) with plain ASCII.
            line = unidecode.unidecode(line)
            # Normalize whitespace.
            text = " ".join(line.split())
            # Make '. . . .' into '....'
            text = re.sub(r"\s\.", ".", text)

            # Skip lines that are empty after cleaning.
            if len(text) == 0:
                continue
            # If the line ends with '.', the sentence is considered to be over.
            elif text[-1] == ".":
                # Even if it ends with '.', a match in 'whos' is not the end of the sentence.
                for who in whos:
                    if re.search(who, text):
                        # Not the end of a sentence, so append a space instead of a newline.
                        text += " "
                        break
                # for-else: no pattern matched, so the sentence is over.
                else:
                    text += "\n"
            # The line does not end a sentence, so append a space
            # (appending, not prepending, keeps words on adjacent lines from fusing).
            else:
                text += " "

            r.write(text)

 This is the code that preprocesses the merged data. It removes unnecessary newlines and groups the text back into paragraphs. Hard-wrapped lines like these show up in almost every text dataset, and it’s a good idea to remove them unless the line breaks really matter (e.g., for a model that writes poetry or songs).

“Heave to, there’s something floating to the leeward” the speaker was a short stockily built man whose name was William Jones. he was the captain of a small cat boat in whichhe & a party of men were sailing at the time the story opens.
“Aye aye sir” answered John Towers & the boat was brought to a stand still Captain Jones reached out his hand for the object which he now discerned to be a glass bottle “Nothing but a rum flask that the men on a passing boat threw over” he said but from an impulse of curiosity he reached out for it. it was a rum flask & he was about to throw it away when he noticed a piece of paper in it. He pulled it out & on it read the following Jan 1 1864 I am John Jones who writes this letter my ship is fast sinking with a treasure on board I am where it is marked * on the enclosed chart Captain Jones turned the sheet over & the other side was a chart on the edge were written these words dotted lines represent course we took “Towers” Said Capt. Jones exitedly “read this” Towers did as he was directed “I think it would pay to go” said Capt. Jones “do you”? “Just as you say” replied Towers.

 The finished data comes to 3.9MB and 6,292 lines.
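 A quick check confirms the numbers (final/merged.txt is the hypothetical output path from the script above):

import os

path = 'final/merged.txt'  # hypothetical output path from the script above

with open(path, 'r', encoding='utf-8') as f:
    line_count = len(f.readlines())

print(f'{os.path.getsize(path) / 2**20:.1f} MB, {line_count} lines')

 Now let’s make a horror Lovecraft model with this!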


3. Training

 This time, I set the Model Type to small and trained for 3 epochs. Previously I always used 5 epochs, but I tried 3 because I wanted to see whether it would still give good results.
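 Teachable NLP runs the actual training for you in the browser, so there is no training code to write. Purely for reference, a roughly equivalent local fine-tune with the Hugging Face transformers library might look like the sketch below; this is my own approximation, not the service’s actual pipeline.

from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')  # "small" = 124M GPT-2
tokenizer.pad_token = tokenizer.eos_token              # GPT-2 has no pad token
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Load the preprocessed file (hypothetical path) as a line-by-line text dataset.
dataset = load_dataset('text', data_files={'train': 'final/merged.txt'})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

train_set = dataset['train'].map(tokenize, batched=True, remove_columns=['text'])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='gpt2-lovecraft', num_train_epochs=3),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()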


4. Generate



 Training is done! You can see my model deployed on Ainize through View API, and you can try it out in Tabtab through Test my model.
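 The View API page shows exactly how to call the deployed model over HTTP. I won’t guess the real request format here, so the endpoint and parameters in this sketch are placeholders; swap in the actual ones from the Ainize page.

import requests

# Hypothetical endpoint and parameters; take the real ones from the View API page.
url = 'https://example.ainize.ai/gpt2-lovecraft/generate'
resp = requests.post(url, data={'text': 'From the dark, I saw', 'length': 100})

print(resp.json())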


From the dark, I saw the thin lines of its carvings… those damnable circles, those perfect circles…" My thirst had nearly caked me, yet I kept not running. The idea of beholding a vast subterrene world surging through the air of subterrene loathsomeness horrified and unnerved me, but my curiosity about the whole business was magnified a thousandfold. Surely, the Old Ones must be close to the visible universe, yet I could not vouch for their absolute perfection. For in addition to their uncanny light and shadow, there was something else vastly more–something infinitely more–unrelatively similar in mentality and construction.
It was I who thought of the blasphemous and unholy carvings on those frightful carvings; the monstrously resemblances which the carversation of those carvings seemed to imply. That their presence was preternatural, I could from a logical and scientific point of view admit; but it was more absolutely distinct from the vague pseudo-memories of other beings–others than those which Carter had dreamt about. The vague pseudo-memories were, I felt, all connected with the one thing I knew–the door of that room with the carven carven carven face and carven face. I recalled once talking with a human being of that race–being guessed from olden letters–who lived in that room, and who said they had talked with strange and significant beings of other universes. And of that being I could speak with a written tongue, though it was alien to me.

 I tested the model via Tabtab. The text above was generated from the input “From the dark, I saw”, and, as expected, a fairly eerie passage came out. Even with only 3 epochs, it’s a pretty good result. If you want to see results quickly, it’s fine to use fewer epochs!

 I used a small model and 3 epochs, but you can build models with many other combinations, and they should produce all sorts of interesting results. Enjoy Teachable NLP in a variety of ways!


[KOR]

Making a horror novel Lovecraft model with Teachable NLP :scream:
