Human language

AI needs foundation models – so what can we learn from GPT-3, BERT, and DALL-E 2?

The Stanford Institute for Human-Centered Artificial Intelligence first introduced the term foundation model. Most AI applications today are bespoke: designed to solve one particular problem, with little concern for reuse or broader applicability.

Foundation models, by contrast, are trained on large sets of unlabeled data and can be adapted to a wide range of tasks with little or no extra effort.

Early foundation models such as GPT-3, BERT, and DALL-E 2 showed promise with language and images. Enter a short prompt, and the system returns an entire essay, even though it was never trained to understand that specific request – convincingly enough to fool you into thinking it understands what you asked for and what it wrote. It doesn't.

Because foundation models use unsupervised learning and transfer learning, they can apply connections and relationships learned in one context to another. The result is an uncanny performance that looks like logical thought. An analogy is learning to drive one car and then being able to drive most others with little or no extra training.
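The transfer idea can be sketched in miniature. This is a toy illustration, not a real foundation model: a fixed, "pretrained" feature extractor (here just a hand-picked nonlinear projection) is reused unchanged for a new task, and only a small linear head is trained on top.

```python
# Toy sketch of transfer learning: reuse a frozen "pretrained" feature
# extractor for a new task, training only a tiny head on top.
# (Illustrative only; real foundation models learn the extractor from data.)
import numpy as np

rng = np.random.default_rng(0)

def pretrained_features(x):
    # Stand-in for a frozen pretrained network: a fixed nonlinear projection.
    W = np.array([[1.0, -1.0],
                  [0.5,  2.0]])
    return np.tanh(x @ W)

# A new downstream task the "pretrained" extractor was never built for.
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Train only a small linear head (with a bias column) via least squares.
F = np.hstack([pretrained_features(X), np.ones((len(X), 1))])
head, *_ = np.linalg.lstsq(F, y, rcond=None)

preds = (F @ head > 0.5).astype(float)
print("train accuracy:", (preds == y).mean())
```

The point is the division of labor: the expensive representation is computed once and shared, while adapting to the new task means fitting only a handful of parameters.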

The initial thrill of large language models writing novels has passed. The concept of foundation models is now well understood, and the industry is busy finding ways to apply the technology across different fields.

For example, IBM released CodeFlare, an open-source tool that streamlines the development and production of machine learning workloads, with future foundation models in mind. An insurer, for instance, could customize a foundation model to the language of fraud investigation.

Foundation models have the potential, barring other external factors, to accelerate AI in the enterprise. Reducing the biggest headache, labeling requirements, will "democratize" the creation of highly accurate and efficient AI-based automation. Whether they will handle bias more effectively remains to be seen; earlier LLMs have drawn plenty of skepticism.

Even before the recent craze for conversational chatbots, large language models (LLMs) had created a lot of excitement and quite a bit of concern. LLMs – deep learning models trained on large amounts of text – display characteristics that look like an understanding of human language.

LLMs like BERT, GPT-3, and LaMDA manage to stay consistent across notable stretches of text and seem knowledgeable on a range of topics. They can remain cohesive in long conversations, so convincingly that they can be mistaken for human intelligence. It is striking how well they maintain context over long conversations and how smoothly they switch between contexts.

A burning question is whether LLMs can perform human-like logical reasoning. In a research paper on transformers, the deep learning networks behind LLMs, scientists from the University of California, Los Angeles argue that these networks do not learn reasoning functions. Instead, they pick up statistical features embedded in the reasoning problems themselves.

What did the researchers mean by "statistical features"? As the transformer sifted through billions of pages of text, it built a record of the relationships between phrases. If you asked, "Was the conclusion of Flaubert's "Madame Bovary" contrived?" you might get an answer like, "I'm not familiar with that." But if you asked, "How are you today?" you may get the response, "I'm fine, thank you, how are you" – not because the model examined its internal state and reasoned its way to a logical answer. It simply knew that, in the context of the conversation, that was the most likely response.
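That behavior can be caricatured in a few lines. This is purely illustrative, with a made-up mini-corpus – real transformers model token-by-token probabilities over enormous corpora – but the principle is the same: the most frequent reply wins, and no reasoning is involved.

```python
# Toy caricature of statistical response selection: pick the reply most often
# seen after a given prompt in a (hypothetical) training corpus.
from collections import Counter

corpus = [
    ("how are you today", "i'm fine, thank you, how are you"),
    ("how are you today", "i'm fine, thank you, how are you"),
    ("how are you today", "doing well, thanks"),
]

def reply(prompt):
    counts = Counter(r for p, r in corpus if p == prompt)
    if not counts:
        return "i'm not familiar with that"   # never seen this prompt
    return counts.most_common(1)[0][0]        # most frequent observed reply

print(reply("how are you today"))
```

No internal state is consulted and nothing is inferred; the answer is a lookup over observed frequencies.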

The researchers used BERT, one of the most popular transformers in use. Their findings show that BERT responds quite well to reasoning problems drawn from its training distribution, but it responds poorly to examples from other distributions and cannot generalize from them. This exposes real gaps in deep neural networks, and it makes developing benchmarks for them a challenge.

Measuring logical reasoning in AI is challenging. GLUE, SuperGLUE, SNLI, and SQuAD are benchmark tests for AI, specifically for NLP models. Transformers have been backed by the largest AI companies, and as a result their defects and deficiencies are addressed promptly, showing steady incremental improvement. So far, progress has been driven by ever more massive scale as the route to higher accuracy. That scale does not come without a cost in resources. GPT-3 has 175 billion machine learning parameters. It was trained on NVIDIA V100 GPUs, but researchers have calculated that training it on A100s would have taken 1,024 GPUs, 34 days, and $4.6 million. While its energy usage has not been disclosed, it's estimated that training GPT-3 consumed 936 MWh.

Are LLMs improving because they have acquired logical reasoning capabilities? Or simply because they have been trained on ever larger volumes of text?

The UCLA researchers developed SimpleLogic, a set of logical reasoning problems based on propositional logic. A problem includes rules, facts (predicates known to be true), and a query (the question the ML model must answer). The answer to the query, "true" or "false," is the label.
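A minimal sketch of such a problem, in a hypothetical encoding rather than the paper's exact format: rules, facts, and a query, with a simple forward-chaining solver that computes the ground-truth label the model is asked to predict.

```python
# SimpleLogic-style problem (hypothetical encoding): each rule is
# (body, head), meaning "if every predicate in body holds, head holds".

def solve(rules, facts, query):
    """Forward-chain over propositional rules until no new fact is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(p in known for p in body):
                known.add(head)
                changed = True
    return query in known  # the label: True or False

rules = [({"wet", "cold"}, "sick"),  # if wet and cold then sick
         ({"rain"}, "wet")]          # if rain then wet
facts = ["rain", "cold"]
print(solve(rules, facts, "sick"))  # → True
```

The solver itself is trivial; the paper's question is whether a transformer trained on many such (rules, facts, query, label) examples learns this reasoning function or merely statistical shortcuts around it.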

The researchers concluded (PDF link):

Upon further investigation, we provide an explanation for this paradox: the model attaining high accuracy only on in-distribution test examples has not learned to reason. The model has learned to use statistical features in logical reasoning problems to make predictions rather than to emulate the correct reasoning function.

Neural networks are very good at finding and fitting statistical features. In some applications this is very useful: in sentiment analysis, for example, there is a strong correlation between certain words and sentiment classes. But the finding highlights a critical challenge in using deep learning for language tasks:
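That word-sentiment correlation can be made concrete with a toy classifier (made-up data and labels): it scores sentences using nothing but word-class co-occurrence counts, which works for sentiment precisely because the statistical features line up with the task.

```python
# Toy sentiment classifier built purely on word/class co-occurrence counts.
from collections import Counter

train = [("great movie", "pos"), ("great acting", "pos"),
         ("terrible plot", "neg"), ("terrible pacing", "neg")]

# Count how often each word appears under each sentiment label.
word_class = Counter((w, label) for text, label in train for w in text.split())

def classify(text):
    score = sum(word_class[(w, "pos")] - word_class[(w, "neg")]
                for w in text.split())
    return "pos" if score >= 0 else "neg"

print(classify("a great film"))  # → "pos"
```

For sentiment this shortcut is often good enough; for logical reasoning, the same habit of leaning on surface correlations is exactly what fails.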

Caution should be taken when we seek to train neural models end-to-end to solve NLP tasks that involve both logical reasoning and prior knowledge [emphasis mine] and are presented with linguistic variance.

Reasoning in deep learning

Unfortunately, the logical reasoning problem does not go away as language models get larger. It just gets hidden inside their massive architectures and training data. LLMs can produce well-constructed facts and sentences. However, they still use statistical features to make inferences in logical reasoning, which is not a solid foundation.

And there is no indication that the logical reasoning gap will be bridged by adding layers, parameters, and attention heads to transformers.

As the UCLA researchers conclude:

On the one hand, when a model is trained to learn a task from data, it always tends to learn statistical patterns, which inherently exist in reasoning examples; on the other hand, the rules of logic never rely on statistical models to drive reasoning. Since it is difficult to construct a logical reasoning dataset without statistical features, learning to reason from the data is difficult.