The data came from Common Crawl, a non-profit that scans the open web every month, downloads content from billions of HTML pages, and makes it available in a special format for large-scale data mining. In 2017 the average monthly “crawl” yielded over three billion web pages. Common Crawl has been doing this since 2011 and has amassed petabytes of data in over 40 different languages. The OpenAI team applied filtering techniques to improve the overall quality of the data and supplemented it with curated datasets such as Wikipedia.
GPT stands for Generative Pre-trained Transformer. The “transformer” part refers to a neural network architecture introduced by Google in 2017. Rather than reading words in sequential order and making decisions based on a word’s position within a sentence, text or speech generators with this design model the relationships between all the words in a sentence at once. Each word is assigned “attention scores” that measure how strongly it relates to every other word, and those scores act as weights when the word’s representation is fed into the larger network. Essentially, this is a complex way of saying the model is weighing how likely it is that a given word will be preceded or followed by another word, and how much that likelihood changes based on the other words in the sentence.
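To make the attention idea concrete, here is a rough Python sketch of the scaled dot-product attention described in the 2017 transformer paper. It is a simplified illustration rather than OpenAI’s actual code, and the function name, toy sentence size, and vector dimensions are all invented for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Simplified attention in the spirit of the 2017 transformer paper.

    Q, K, V: arrays of shape (sequence_length, d), one row per word.
    """
    d = Q.shape[-1]
    # Compare every word with every other word in a single step; a higher
    # score means the two words are more relevant to each other.
    scores = Q @ K.T / np.sqrt(d)
    # Softmax turns each row of scores into weights that sum to 1; these
    # are the "attention scores" that weight the words.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each word's updated representation is a weighted mix of all the words.
    return weights @ V

# Toy example: a "sentence" of 4 words, each represented by an 8-dim vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # (4, 8): one updated vector per word
```

Because every word attends to every other word in a single pass, the model can capture relationships between words no matter how far apart they sit in the sentence.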
By finding the relationships and patterns between words in a giant dataset, the algorithm ultimately learns from its own inferences, in what’s called unsupervised machine learning. And it doesn’t end with words: GPT-3 can also figure out how concepts relate to each other and discern context.
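To illustrate what “learning from its own inferences” means, here is a toy Python sketch of the same self-supervised idea: the training signal is simply the next word in the text, so no human labeling is needed. The bigram counting below is a deliberately crude stand-in for GPT-3’s neural network, and the example sentence is made up.

```python
from collections import Counter, defaultdict

text = "the cat sat on the mat because the cat was tired".split()

# The "labels" come from the data itself: for every word, record which
# word actually followed it in the text.
following = defaultdict(Counter)
for current_word, next_word in zip(text, text[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Guess the next word from the counts gathered above."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # prints 'cat', the most frequent follower of 'the'
```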