Teachable-NLP is a GPT-2 Finetuning program with a text(.txt) file without writing NLP codes. As soon as I recognized it, I planned to make a model creating Kantian sentences.
I dreamed of someone who writes a Philosophy essay in place of me when I majored in Philosophy. So I came up with the idea that I trained GPT-2 with ‘The Critique of Pure Reason’ of Kant and Kant becomes Ghostwriter and does write Kantian essay in place of students.
- Acquiring the data
I acquired the text(.txt) file of ‘The Critique of Pure Reason’ from Project Gutenberg. The original text is in Deutsch, but Teachable NLP works only in English. So I used the English version.
Moreover, I had to check the copyright status of text file. It published hundreds years ago, so the copyright was Public domain. It makes me free to copyright issue even if I edit the text and use the text to machine learning.
2. Preprocessing the data
There were unwanted data in the text file (e.g. headings like SECOND’, ‘I.’).
It might cause
garbage out, so I removed them.
SECOND DIVISION—TRANSCENDENTAL LOGIC TRANSCENDENTAL DIALECTIC. INTRODUCTION. I. Of Transcendental Illusory Appearance We termed dialectic in general a logic of appearance. This does not signify a doctrine of probability; for probability is truth, only cognized upon insufficient grounds, and though the information it gives us is imperfect, it is not therefore deceitful. Hence it must not be separated
Also the diagrams which explains the Kantian theories will disrupt the context with the irregular space to Teachable-NLP. After skmming the text, I found all the diagrams and annotations started with spaces and other regular sentences didn’t. I removed the unnecessary data by using the pattern.
NOTHING AS 1 Empty Conception without object, _ens rationis_ 2 3 Empty object of Empty intuition a conception, without object, _nihil privativum ens imaginarium_ 4 Empty object without conception, _nihil negativum_
It was simple data preprocessing with
for loop, I checked every line (
f.readlines())and add only valid data to the new data file. For that, I examined the sentence carefully and tried to find out patterns.
The text in front of PREFACE consists of publication and irrelevant with Kant. So I ignored the data before a sentence starting with PREFACE. And then, I filtered out the headings and gap in the intervals.
As a result, I found out the pattern of headings (e.g. Section, Chaper, Introduction) and the terminologies of Logic, Roman numerals. And I tried to make use of the contents in the parentheses by substring the text and kept in context.
file_name = "The Critique of Pure Reason.txt" f = open(file_name, "rt", encoding='utf-8') file = f.readlines() f.close() sentences =  start = 0 for line in file: if line.startswith('PREFACE'): # In front of PREFACE, no values # var start is used to flag start = 1 continue elif start != 1: continue elif line.startswith('End'): # And vice versa, behind END, no values # So break the loop break elif line.startswith(' '): # Remove the diagrams and annotations continue # Remove headings elif line.startswith('Section'): continue elif line.startswith('Chapter'): continue elif line.startswith('Introduction'): continue elif line.startswith('FIRST') or line.startswith('SECOND') or line.startswith( 'THIRD'): continue elif line.startswith('PROOF') or line.startswith('ANTITHESIS'): continue elif line.startswith('OBSERVATION'): continue elif line.startswith('ON THE THESIS'): continue elif line.startswith('I.') or line.startswith('II.') or line.startswith('III.') or line.startswith('IV.') or line.startswith('V.') or line.startswith('VI.'): # Remove only the Symbols and keep the valid long titles line = line[line.find('.') + 2:] elif line.startswith('-'): continue elif line.startswith('§'): continue # Remove CRLF elif line == "\n": continue elif line.startswith('('): line = line[4:] sentences.append(line)
Finally I got a 1.2MB data with simple sentences. Some Deutsch proper noun and colons are still there. If I remove them, however, it doesn’t make sense. So I kept them given the rarely small amounts.
This is the introduction of data after preprocess.
Human reason, in one sphere of its cognition, is called upon to consider questions, which it cannot decline, as they are presented by its own nature, but which it cannot answer, as they transcend every faculty of the mind. It falls into this difficulty without any fault of its own. It begins with principles, which cannot be dispensed with in the field of experience, and the truth and sufficiency of which are, at the same time, insured by experience. With these principles it rises, in obedience to the laws of its own nature, ...
3. Training with Teachable-NLP
I uploaded the data to Teachable-NLP and set the option to finetune. I’ve already finetuned the model by myself and noticed it takes longer time to finetune larger model. I focused on reducing the loading, so I chose
small model. But I chose
epoch 5 to ensure least at least accuracy.
4. Generate Text from Ghostwriter
The user should give him a prefix of essay. If prefix is ‘The demon’, The sentences prints out as below. I considered the user can control the length of essay, however, I fixed the length to 200 for GPU utilzation rate.
Even though I’ve already preprocessed in No.2, some symbols and headings appear because I missed them (e.g.1. , Remark, -) at that time.
**Remark II.** Now with this view all empirical use of our faculty of cognition in the determination of time is in perfect accordance.
**—**Ovid, Metamorphoses. **xiii**
The one relates to the objects of the pure understanding, and is intended to demonstrate and to render comprehensible the objective validity of its **_à priori_ conceptions**;
**1.** Mathematical judgements are always synthetical. Hitherto this fact, though incontestably true and very important in its consequences,
So I preprocess data again and clearly remove them. For that, I added the code as below.
# missed headings elif line.startswith('Introductory'): continue elif line.startswith('The remark:'): continue elif line.startswith('Remark'): continue # remove the bracket if line.find('[') > -1: if line.find(']') > -1: left = line.find('[') right = line.find(']') new = line[:left-1] + line[right+1:] line = new . . . # remove index and symbols in the middle of sentence line = line.replace('§','') line = line.replace('_', '') line = line.replace(';', '. ') line = line.replace('(a)', '') line = line.replace('(b)', '') line = line.replace('(c)', '') line = line.replace('(1)', '') line = line.replace('(2)', '') line = line.replace('(3)', '') line = line.replace('(4)', '') line = line.replace('1.', '') line = line.replace('2.', '') line = line.replace('3.', '') line = line.replace('4.', '') line = line.replace('5.', '')
- Parentheses : It is ought to be removed for routine dialogue. But for essay, the parentheses doesn’t matter.
- Some texts are stiil alive : There are some texts that seems same, but they are actually different in code. In case of that, you can copy & paste the ambigous text to the preprocessing code, not just type.
5. More Efficient
Over the double preprocessing, the GPT-2 model became fine-tuned to more accurate. Finally with the prefix, the Kantian essay is created automatically based on ‘The Critique of Pure Reason’ and it relieves the students who have to write the long essay.
The program 'Ghostwriter Kant’ fixed the length of essay and the text is printed only at once. If you want to elaborate the essay, write in TabTab that is linked to the Fine-tuned model! In TabTab, you can print out short sentence, and select a sentence in 5 candidate sentences.
7. Wrap up
Now you don’t have to rack your brain to write Kantian essay for Philosophical assignments. Ghostwriter Kant would write it in place of you. Perhaps you might to worry he forgot his theory while he lay in the tomb. But you don’t have to worry about that! Because he reviewed ‘The Critique of Pure Reason’ thoroughly!
You can also easily create a model that writes a essay on difficult subjects with
Teachable NLP and generate articles with Model APIs.