Researchers at Google recently developed a new dataset, the Colossal Clean Crawled Corpus (C4), and a unified framework and model dubbed the Text-to-Text Transfer Transformer (T5), which converts language problems into a text-to-text format. According to the researchers, in experiments with one of the largest models ever submitted to the General Language Understanding Evaluation (GLUE) benchmark, they achieved state-of-the-art results on benchmarks covering question answering, text classification, and more.
Generally, training a model to perform NLP tasks requires ensuring that the model develops knowledge enabling it to “understand” text, knowledge that might range from low-level to high-level. The team examined an approach that takes text as input and produces new text as output, applying the same objective, training procedure, and decoding process to every task considered.
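To make the text-to-text idea concrete, here is a minimal sketch of how different tasks could be cast into the same input-string, output-string form. The helper names and the exact prefix wording are assumptions chosen for illustration, not the researchers' code.

```python
# Illustrative sketch: every task becomes a (source text, target text) pair.
# The function names and prefix strings are assumptions for this example.

def translation_example(english, german):
    """Translation: the target is simply the translated sentence."""
    return ("translate English to German: " + english, german)


def classification_example(sentence, label_id, label_names):
    """Classification: the class label is emitted as a word, not an index."""
    return ("cola sentence: " + sentence, label_names[label_id])


def summarization_example(article, summary):
    """Summarization: the target is the summary text."""
    return ("summarize: " + article, summary)


examples = [
    translation_example("That is good.", "Das ist gut."),
    classification_example("The course is jumping well.", 0,
                           ["unacceptable", "acceptable"]),
    summarization_example("Storms hit the coast overnight, closing roads.",
                          "Storms closed coastal roads."),
]

for source, target in examples:
    print(repr(source), "->", repr(target))
```

Because every pair is just two strings, one model with one training objective and one decoding procedure can, in principle, be trained on all of them at once.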
The researchers compiled the training corpus from snippets sourced from the Common Crawl project, which scrapes roughly 20 terabytes of English text from the web each month. To filter out nonsensical text, menus, and error messages, they retained only lines of text that ended in a terminal punctuation mark, and they deleted pages with obvious filler text as well as duplicates. The resulting collection, at around 750GB, is claimed to be an order of magnitude larger than most data sets used for pre-training.
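The cleaning heuristics described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the researchers' actual pipeline: the set of terminal punctuation marks, the "lorem ipsum" filler marker, and page-level deduplication by hash are all choices made for this example.

```python
import hashlib

# Assumed for this sketch: which characters count as terminal punctuation
# and which phrase marks a page as filler text.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
FILLER_MARKERS = ("lorem ipsum",)


def clean_page(text):
    """Return the cleaned page text, or None if the page should be dropped."""
    lowered = text.lower()
    if any(marker in lowered for marker in FILLER_MARKERS):
        return None  # page contains obvious filler text
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Menus and error messages rarely end like a sentence.
        if line.endswith(TERMINAL_PUNCTUATION):
            kept.append(line)
    return "\n".join(kept) if kept else None


def build_corpus(pages):
    """Clean every page and drop exact duplicates of already-seen pages."""
    seen = set()
    corpus = []
    for page in pages:
        cleaned = clean_page(page)
        if cleaned is None:
            continue
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        corpus.append(cleaned)
    return corpus


pages = [
    "Home | About | Contact\nThe model reads text.\n404 error",
    "Lorem ipsum dolor sit amet.",
    "Home | About | Contact\nThe model reads text.\n404 error",
    "Menu",
]
print(build_corpus(pages))
```

In this toy input, only one page survives: the navigation and error lines are stripped, the filler page is dropped, and the repeated page is deduplicated.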
The research team at Google trained several Transformer-based models on the corpus to evaluate the effectiveness of their text-to-text approach. Transformers are a type of neural architecture introduced in a 2017 paper co-authored by scientists at Google Brain, Google’s AI research division. As with all deep neural networks, they contain neurons (mathematical functions) arranged in interconnected layers that transmit signals from input data and slowly adjust the synaptic strength (weights) of each connection. That is how all AI models extract features and learn to make predictions. Transformers, however, uniquely have attention: every output element is connected to every input element, and the weightings between them are calculated dynamically from the data.
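The attention idea can be illustrated with a small sketch of scaled dot-product attention, the form described in the 2017 paper. It is written in plain Python for readability and leaves out learned projections, multiple heads, and masking; the variable names and toy vectors are assumptions for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns the output vectors and the attention weights, one row of
    weights per query (output element) and one column per input element.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every input position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted mix of all value vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = attention(queries, keys, values)
print(weights[0])  # the first input gets the larger weight
print(outputs[0])
```

Every weight is strictly positive, so each output draws on every input, and the weights change whenever the inputs change, which is what "calculated dynamically" means here.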
The largest model contains around 11 billion parameters, the configuration variables internal to the model that are required when making predictions. According to the researchers, it achieved state-of-the-art results on GLUE (an average score of 89.7) and on the reading comprehension benchmarks SQuAD and CNN/Daily Mail. When tested on SuperGLUE, which comprises tasks that are beyond the scope of current NLP systems but solvable by college-educated speakers, it nearly matched the human performance score of 89.8.
The Google team also acknowledges that the model fell short on linguistic tasks like translation, a shortcoming they attribute to a relative dearth of task-specific data and insufficient training scale. Consequently, the team advocates for research on methods that achieve stronger performance with smaller models, so that transfer learning can be applied where it will have the most impact.
The co-authors of the paper write: “An unsurprising but important result from our study is that larger models tend to perform better. The fact that the hardware used for running these models is continually getting cheaper and more powerful suggests that scaling up may continue to be a promising way to achieve better performance [Sutton, 2019]. However, it will always be the case that there are applications and scenarios where using a smaller or less expensive model is helpful, for example when performing client-side inference or federated learning.”