[Teachable NLP] NIPS paper generation model

Teachable NLP : Ainize | Launchpad for open-source AI projects
Model API Test : TabTab!


Now, write your research paper with AI! Today, I will introduce the ML Paper AI model, trained on the NIPS (Neural Information Processing Systems) Papers dataset from Kaggle. NIPS covers topics ranging from deep learning and computer vision to cognitive science and reinforcement learning. Let me explain how I used the NIPS Papers dataset to train a GPT-2 model.

1. Collect data

I downloaded the NIPS Papers dataset from Kaggle. (NIPS 2015 Papers | Kaggle)

Neural Information Processing Systems (NIPS) is one of the top machine learning conferences in the world. It covers topics ranging from deep learning and computer vision to cognitive science and reinforcement learning.

I used the paper.csv file in the dataset. I extracted the Title, Abstract, and PaperText columns from paper.csv and combined them into a single text file.
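The extraction step can be sketched with Python's standard csv module. This is a minimal example using an in-memory stand-in for paper.csv; the column names Title, Abstract, and PaperText match the dataset description above, but the sample rows here are made up for illustration.

```python
import csv
import io

# A tiny stand-in for paper.csv (columns: Title, Abstract, PaperText).
sample_csv = io.StringIO(
    "Title,Abstract,PaperText\n"
    '"An Example Paper","A short abstract.","Full body of the paper."\n'
)

# Concatenate Title + Abstract + PaperText for each row into one corpus,
# separating the fields (and the papers) with blank lines.
parts = []
for row in csv.DictReader(sample_csv):
    parts.append("\n\n".join([row["Title"], row["Abstract"], row["PaperText"]]))
corpus = "\n\n".join(parts)

print(corpus)
```

With the real paper.csv you would open the file with `csv.DictReader` in the same way and write `corpus` out to a single .txt file for training.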

PaperText consists of the title, abstract, author information, and main text.

PaperText

Double or Nothing: Multiplicative
Incentive Mechanisms for Crowdsourcing
exampleAuthorName1
University of California, Berkeley
exampleAuthorEmail@email.com

exampleAuthorName2
Microsoft Research
exampleAuthorEmail2@email.com

Abstract
Crowdsourcing has gained immense popularity in machine learning applications
for obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but
suffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize
workers to answer only the questions that they are sure of and skip the rest. We
show that surprisingly, under a mild and natural “no-free-lunch” requirement, this
mechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible mechanisms
(that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. Interestingly, this unique mechanism takes a
“multiplicative” form. The simplicity of the mechanism is an added benefit. In
preliminary experiments involving over several hundred workers, we observe a
significant reduction in the error rates under our unique mechanism for the same
or lower monetary expenditure.
1 Introduction

Complex machine learning tools such as deep learning are gaining increasing popularity and are
being applied to a wide variety of problems. These tools, however, require large amounts of labeled
data [HDY+ 12, RYZ+ 10, DDS+ 09, CBW+ 10]. These large labeling tasks are being performed by
coordinating crowds of semi-skilled workers through the Internet. This is known as crowdsourcing.
Crowdsourcing as a means of collecting labeled training data has now become indispensable to the
engineering of intelligent systems.

Since the author's information is not needed when the model generates sentences, I deleted it. I used regular expressions in Python to strip the author information and the duplicated text, such as the title and abstract, from PaperText.

import re

# text == PaperText
# Flatten newlines so the regex can match across the whole header block,
# cut everything up to the first 'Introduction' heading, then restore newlines.
text = text.replace('\n', 'NOT USED WORD')
text = re.sub('.*?Introduction', '', text, count=1)
text = text.replace('NOT USED WORD', '\n')

If you don’t know Python or regular expressions, you can also use Ctrl+F (Cmd+F on Mac) in a text editor to search for the email address or the word “Abstract”, find the author information, and delete it yourself.
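To sanity-check the cleaning step, here is the same newline-flattening trick applied to a toy PaperText. The paper title, author name, and email below are hypothetical, not from the dataset.

```python
import re

# A toy PaperText: title, author block, abstract, then the main text.
paper_text = (
    "Double or Nothing\n"
    "Jane Doe\njane@example.com\n"
    "Abstract\nCrowdsourcing is cheap and fast.\n"
    "1\nIntroduction\nComplex tools require labeled data.\n"
)

# Flatten newlines so '.' can span the whole header block,
# remove everything up to the first 'Introduction', then restore newlines.
flat = paper_text.replace("\n", "NOT USED WORD")
flat = re.sub(".*?Introduction", "", flat, count=1)
cleaned = flat.replace("NOT USED WORD", "\n")

print(cleaned)
```

Everything before the first “Introduction” — the title, the author block, and the abstract — is gone, and only the main text survives. Note the non-greedy `.*?` with `count=1`: a greedy `.*` would cut up to the last occurrence of “Introduction” in the paper and delete far too much.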

Title + Abstract + PaperText

Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing

Crowdsourcing has gained immense popularity in machine learning applications for obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but suffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize workers to answer only the questions that they are sure of and skip the rest. We show that surprisingly, under a mild and natural no-free-lunch requirement, this mechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible  mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers.  Interestingly, this unique mechanism takes a multiplicative form. The simplicity of the mechanism is an added benefit.  In preliminary experiments involving over several hundred workers, we observe a significant reduction in the error rates under our unique mechanism for the same or lower monetary expenditure.

Complex machine learning tools such as deep learning are gaining increasing popularity and are
being applied to a wide variety of problems. These tools, however, require large amounts of labeled
data [HDY+ 12, RYZ+ 10, DDS+ 09, CBW+ 10]. These large labeling tasks are being performed by
coordinating crowds of semi-skilled workers through the Internet. This is known as crowdsourcing.
Crowdsourcing as a means of collecting labeled training data has now become indispensable to the
engineering of intelligent systems.
Most workers in crowdsourcing are not experts. As a consequence, labels obtained from crowdsourcing typically have a significant amount of error [KKKMF11, VdVE11, WLC+ 10]. Recent
efforts have focused on developing statistical techniques to post-process the noisy labels in order
to improve its quality (e.g., [RYZ+ 10, ZLP+ 15, KOS11, IPSW14]). However, when the inputs to
these algorithms are erroneous, it is difficult to guarantee that the processed labels will be reliable
enough for subsequent use by machine learning or other applications. In order to avoid “garbage in,
garbage out”, we take a complementary approach to this problem: cleaning the data at the time of
collection.
We consider crowdsourcing settings where the workers are paid for their services, such as in the
popular crowdsourcing platforms of Amazon Mechanical Turk and others. These commercial platforms have gained substantial popularity due to their support for a diverse range of tasks for machine
learning labeling, varying from image annotation and text recognition to speech captioning and machine translation. We consider problems that are objective in nature, that is, have a definite answer.
Figure 1a depicts an example of such a question where the worker is shown a set of images, and for
each image, the worker is required to identify if the image depicts the Golden Gate Bridge.

This is what the processed text looks like; the combined result is a 12 MB text file.

2. Training model

I uploaded the data to Teachable NLP and trained with the following options:
modelType: small
epochs: 5

3. Test it yourself!

NIPS Paper model : TabTab

SNS data analysis using clustering For large-scale datasets, it is well understood
that the performance of algorithms with low rank structure such as Lasso, group Lasso,
and sparse group Lasso is only guaranteed via

This text was generated by the ML Paper AI model.
My input was “SNS data analysis using clustering.”
With an appropriate keyword as input, the model produces relevant sentences.

This model can generate sentences about machine learning and deep learning since it was trained on NIPS paper data. Why not build your own paper-generation model by training on datasets from other fields?
