Text Generation

Complexity: 3/5
Date published: 2019-11-01
Author: Vincent Terrasi

Using AI to Generate High-Quality Content

Context

Good SEO requires good content, but content creation is an expensive part of creating and maintaining a website.

Automated text generation can be useful to SEOs in a number of cases.

When the generated text is of high quality, it can be used for:

  • Creation of anchors for internal linking
  • Mass-creation of variants of title tags
  • Mass-creation of variants of meta descriptions

Lower-quality text generation can also be used as a guide to help you write content for a query. (Note that low-quality generated text should not be used directly as website content; it can be detected and penalized by Google.)

While text generation models for English have become increasingly easy to find, it has been hard to find a method that produces high-quality results in other languages.

The reasons for this are tied to the currently available technologies in Natural Language Generation. Some of the main difficulties have been:

  • Algorithmic and technical difficulties in getting a model to understand training text. A huge step forward was taken recently via the use of Transformers, which are used in the model we propose here.
  • Difficulties related to the limited amount of training data available, which prevents a model from fully learning how words are used. This is the case for almost all non-English languages. The lack of training data is overcome in this model because we use a relatively "small" training dataset of 100,000 short texts of 500 words each.

The method we propose here, based on the GPT-2 model, successfully produces high-quality content in non-English languages; it was tested extensively in French.

More information is available on GPT-2, on training a model, and on the training method used in this project.

Objectives
  • Use a language-independent model for text generation that can be used for SEO purposes
  • Practice training a new language model from scratch using Transformers and Tokenizers
  • Fine-tune the model and use it to generate custom texts

Method

Preparing the data for training this model requires more than 100,000 pieces of content with a minimum of 500 words each, which you will need to provide in your own language. This set of texts is then processed with the deep learning framework PyTorch to prepare it in a form that GPT-2 can later use to generate new text.
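
As a rough illustration, the Python sketch below gathers a directory of plain-text articles, keeps only those above the 500-word threshold, and concatenates them into a single training file. The corpus/ directory, the output file name, and the <|endoftext|> separator are assumptions for the example, not the exact pipeline used in this project.

```python
from pathlib import Path

MIN_WORDS = 500
corpus_dir = Path("corpus")  # hypothetical layout: one UTF-8 .txt file per article

texts = []
for path in corpus_dir.glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    if len(text.split()) >= MIN_WORDS:  # keep only sufficiently long articles
        texts.append(text)

# Concatenate the articles into one training file, separated by the
# end-of-text marker that GPT-2-style tokenizers conventionally use.
Path("train.txt").write_text("\n<|endoftext|>\n".join(texts), encoding="utf-8")
print(f"Kept {len(texts)} articles")
```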

Generating the compressed training dataset involves encoding this large volume of text into a format that GPT-2 can read easily.
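
One way to do this, sketched below, is to train a byte-level BPE tokenizer on the prepared corpus with the Hugging Face tokenizers library and then store the encoded corpus as a single PyTorch tensor of token ids. The vocabulary size and the output paths are illustrative assumptions.

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
import torch

# Train a byte-level BPE tokenizer on the training file built in the previous step.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["train.txt"],
    vocab_size=50_000,          # assumed value, adjust to your corpus
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
Path("gpt2-tokenizer").mkdir(exist_ok=True)
tokenizer.save_model("gpt2-tokenizer")  # writes vocab.json and merges.txt

# Encode the whole corpus once and keep it as a compact tensor of token ids:
# this is the "compressed" training dataset the next step will read.
ids = tokenizer.encode(Path("train.txt").read_text(encoding="utf-8")).ids
torch.save(torch.tensor(ids, dtype=torch.long), "train_ids.pt")
```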

Training the model is accomplished by setting the vocabulary size, the embedding size, the number of attention heads, and the number of layers, then running the model on the training dataset you've prepared.
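
A minimal training loop along these lines might look like the sketch below, which reuses the token ids saved in the previous step. The hyperparameters (block size, embedding size, attention heads, number of layers, learning rate, epochs) are placeholders, not the values used in this project.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2Config, GPT2LMHeadModel

BLOCK_SIZE = 512  # context length fed to the model (assumed value)

class TokenBlocks(Dataset):
    """Slices the flat tensor of token ids into fixed-length training blocks."""
    def __init__(self, path):
        self.ids = torch.load(path)
    def __len__(self):
        return self.ids.size(0) // BLOCK_SIZE
    def __getitem__(self, i):
        return self.ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]

# Vocabulary size, embedding size, attention heads and number of layers are
# set in the config; the values here are illustrative.
config = GPT2Config(
    vocab_size=50_000,
    n_positions=BLOCK_SIZE,
    n_embd=768,
    n_head=12,
    n_layer=12,
    bos_token_id=0,  # id of <|endoftext|> in the tokenizer trained above
    eos_token_id=0,
)
model = GPT2LMHeadModel(config)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

loader = DataLoader(TokenBlocks("train_ids.pt"), batch_size=4, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in loader:
        batch = batch.to(device)
        # With labels equal to the inputs, the model computes the causal LM loss itself.
        loss = model(input_ids=batch, labels=batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {loss.item():.3f}")

model.save_pretrained("gpt2-finetuned")  # reused in the generation step below
```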

Article text is generated based on the initial prompt you define, how much you want the text to deviate from the model's most likely output (the temperature), and how much text you want to generate.
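
As a sketch, generation from the trained model could look like the following, assuming the model and tokenizer directories saved in the earlier steps; the prompt is only an example. The temperature controls how far the text strays from the model's most likely continuation, and max_length controls how much text is produced.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the tokenizer files and model weights saved in the previous steps
# (directory names are assumptions carried over from the earlier sketches).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-tokenizer")
model = GPT2LMHeadModel.from_pretrained("gpt2-finetuned").eval()

prompt = "Comment améliorer le maillage interne d'un site e-commerce"  # example prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(
    input_ids,
    max_length=200,      # how much text to generate
    do_sample=True,
    temperature=0.9,     # how far to deviate from the most likely continuation
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```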