How to evaluate intelligence? Google's approach

A few words from me first:

In November 2019, Google released a programmatic paper, "On the Measure of Intelligence", by François Chollet (the creator of Keras).
Its 64 pages cover how the modern understanding of AI emerged, why machine learning is still so far from it, and why we still cannot adequately measure "intelligence".


For the selection to be fair, everyone gets the same task: climb that tree

Our team works on NLP and on the general methodology of AI testing, following the latest trends around universal transformers such as BERT, which are evaluated with tests of logic and common sense. NLP keeps taking on new tasks that involve reproducing ever more complex behaviors and, in effect, reflect the mechanisms of thinking. It turns out that other areas of ML have grabbed their piece of this pie as well: in computer vision, for example, there is the "Animal AI Challenge".

It is clear that, where possible, it is now "better" to make ML models more interpretable, to train one model instead of ten small classifiers, and so on. But how far is all this from real "intelligence"?

Spoiler:

The paper offers a detailed and rather devastating analysis of research on the technical evaluation of modern AI.

At the end of the paper, the author proposes his own test, together with a dataset for it: the Abstraction and Reasoning Corpus (ARC), built around abstract thinking.

But first things first.

Synopsis of "On the Measure of Intelligence"


In order to consciously create more intelligent and more human-like artificial systems, we need a clear definition of intelligence and the ability to evaluate it. This is necessary in order to correctly compare two systems, or a system with a person. Over the past century, many attempts have been made to determine and measure intelligence both in the field of psychology and in the field of AI.

The modern ML community still loves to compare the skills that AI and humans demonstrate in board and computer games and in problem solving. But to assess intelligence, it is not enough to measure the ability to solve a task. Why? Because that ability is largely shaped not by intellect but by prior knowledge and experience, and those can be "bought". By feeding a system unlimited training data or prior information, experimenters can not only bring a machine to an arbitrary level of skill, but also hide how capable the system itself is of intelligent generalization.

The paper proposes 1) a new formal definition of intelligence based on the efficiency of skill acquisition; 2) a new test of the ability to form abstractions and reason: the Abstraction and Reasoning Corpus (ARC). ARC can be used to measure a human-like form of general fluid intelligence, which makes it possible to numerically compare the general intelligence of AI systems and humans.

A practically useful definition of intelligence, and a metric for it, is needed.


The goal of AI development is to create machines with intelligence comparable to that of humans. (This is how the goal has been formulated since the birth of artificial intelligence in the early 1950s, and the formulation has not changed since.)

So far, however, we can only build systems that do well at specific tasks. These systems are imperfect: they are brittle, demand ever more data, cannot handle examples that deviate even slightly from their training set, and cannot be repurposed for new problems without human help.

The reason is that we still cannot give an unambiguous answer to the question of what intelligence is. Existing tests, such as the Turing test [11] and the Loebner Prize [10], cannot serve as drivers of progress, because they rule out any objective definition and measurement of intelligence and rely on subjective judgment instead.

Our goal is to point out the implicit biases in the industry and to offer a practical definition of, and evaluation criteria for, strong, human-like intelligence.

Definition of intelligence: two conflicting approaches


The generally accepted baseline definition is: "Intelligence measures an agent's ability to achieve goals in a wide range of environments." Doesn't explain much, does it?

The whole conflict in modern science comes down to what is considered the starting point of natural intelligence:

  • The mind is a static set of special-purpose mechanisms shaped by evolution for specific, predetermined tasks. This is the view of Darwinism, evolutionary psychology, and the neuroscientists who support the idea of the biological modularity of the mind.
    The understanding of the mind as a broad set of vertical, relatively static programs that together form "intelligence" was also developed by Marvin Minsky, which ultimately led to treating AI as the emulation of human performance on a given list of test tasks.
  • Tabula rasa: the mind is a general-purpose "blank slate" capable of turning arbitrary experience into knowledge and skills for solving any problem. This is the view of Alan Turing and the connectionists. In this understanding, intelligence is pictured through the metaphor of a supercomputer whose low-level mechanics make it possible to acquire an unlimited set of skills "from scratch", "from data".

Both concepts are currently considered untenable. ¯\_(ツ)_/¯

AI Assessment: From Assessment of Skills to Assessment of Broad Abilities


Tests on fixed datasets have become the main driver of progress in AI because they are reproducible (the test set is fixed), fair (the test set is the same for everyone), and scalable (re-running the test is cheap). Many popular challenges, such as the DARPA Grand Challenge [3] and the Netflix Prize, spurred the development of new ML algorithms.

With every positive result, even one obtained by the shortest route (with overfitting and crutches), the expected level of quality keeps rising. McCorduck called this the "AI effect": "Every time someone came up with a new way to make a computer do something (play checkers, say), critics inevitably appeared who said, 'That's not thinking'" [7]. Once we know exactly how a machine does something "smart", we stop considering it smart.

The "AI effect" arises because the process that produces intelligence (for example, training a neural network to play chess) is confused with the artifact created by that process (the resulting model). The reason for the confusion is simple: in humans the two are inseparable.

To move away from evaluating only artifacts and toward evaluating the very ability to learn and acquire new skills, the author introduces a "spectrum of generalization" along which a system can occupy gradually increasing positions:

  • Absence of generalization. AI systems that face no uncertainty or novelty demonstrate no ability to generalize, for example a tic-tac-toe program that wins by exhaustive search of the options.
  • Local generalization, or "robustness", is a system's ability to handle new points from a known distribution for a single task. For example, an image classifier that, after training on many similar pictures of cats and dogs, can distinguish previously unseen cat pictures from similarly formatted dog pictures.
  • Broad generalization, or "flexibility", is the ability to handle a broad category of tasks and environments without additional human intervention: for example, a fully self-driving car, or a household robot able to pass Wozniak's "coffee cup test" (walk into an unfamiliar kitchen and make a cup of coffee) [16].
  • Extreme generalization: the ability to cope with entirely new tasks that share only abstract similarities with previously encountered situations. Its human-centric form (generalization over the scope of tasks relevant to humans) is what we call "general intelligence".

The history of AI is a history of slow progress, from systems that do not generalize at all (symbolic AI) to robust systems (machine learning) capable of local generalization.

We are currently at a new stage, striving to create flexible systems, and there is growing interest in using broad suites of test tasks to evaluate such flexibility:

  1. the GLUE [13] and SuperGLUE [12] benchmarks for natural language processing;
  2. the Arcade Learning Environment for reinforcement learning agents [1];
  3. the Malmo Project platform for AI experiments and research;
  4. the Behaviour Suite collection of experiments [8].

In addition to such multitasking tests, two sets of tests have recently been proposed to assess the ability to generalize, rather than the ability to solve specific problems:

  1. the Animal-AI Olympics [2] (animalaiolympics.com)
  2. and the GVG-AI competition [9] (gvgai.net).

Both are based on the assumption that AI agents should be evaluated on learning or planning (rather than on special skills) by having them solve a set of tasks or games they have never seen before.



New concept


How can we compare artificial intelligence with human intelligence, given that the levels of different cognitive abilities vary from person to person?

It is a well-known fact of cognitive psychology that people with different abilities can score the same on intelligence tests. This shows that cognition is a multidimensional object, hierarchically structured like a pyramid of broad and narrow skills with a general intelligence factor at the top. But is "strong intelligence" really the top of the cognitive pyramid?

The "no free lunch" theorem [14, 15] tells us that any two optimization algorithms (including human intelligence) are equivalent when their performance is averaged over every possible task. That is, to perform better than chance, an algorithm must be specialized for its target task. In this context, however, "every possible task" means a uniform distribution over the task space, and the distribution of tasks actually relevant to our Universe does not fit that definition. So we can ask: is the human intelligence factor universal?
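For reference, here is a simplified sketch of the formal statement, in roughly the notation of Wolpert and Macready [15] (see the original for the exact conditions): averaged over all objective functions f, any two search algorithms a_1 and a_2 produce the same distribution of observed cost values d_m^y after m distinct evaluations.

    \sum_{f} P\left(d_m^y \mid f, m, a_1\right) \;=\; \sum_{f} P\left(d_m^y \mid f, m, a_2\right)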

In fact, people have so far collected too little information about the cognitive abilities of the agents around them: other people (different cultures evaluate "intelligence" differently) and animals such as octopuses or whales.

Apparently, human intelligence is far from universal: it is unsuitable for a large number of tasks for which our innate a priori knowledge is not adapted.

For example, people can very efficiently solve certain small problems of polynomial complexity if those problems mentally map onto evolutionarily familiar tasks such as navigation. The traveling salesman problem with a small number of points can be solved by a person almost optimally in nearly linear time [6], using a perceptual strategy. However, if instead of "find the shortest path" you ask a person to find the longest path [5], they do much worse than one of the simplest heuristic algorithms: farthest-neighbor construction.
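For illustration, here is a minimal sketch (not from the paper) of the kind of greedy construction heuristic the comparison refers to: the same loop builds a "nearest neighbor" tour when it always jumps to the closest unvisited point, and a "farthest neighbor" tour when it jumps to the most distant one. The point coordinates and function names are made up for the example.

    import math
    from typing import List, Tuple

    Point = Tuple[float, float]

    def greedy_tour(points: List[Point], pick=min) -> List[int]:
        """Build a tour starting from points[0] by repeatedly jumping to the
        nearest (pick=min) or farthest (pick=max) unvisited point."""
        unvisited = set(range(1, len(points)))
        tour = [0]
        while unvisited:
            last = points[tour[-1]]
            nxt = pick(unvisited, key=lambda i: math.dist(last, points[i]))
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour

    def tour_length(points: List[Point], tour: List[int]) -> float:
        """Total length of the closed tour."""
        return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
                   for i in range(len(tour)))

    pts = [(0, 0), (1, 5), (4, 1), (6, 6), (2, 3)]   # illustrative points
    short = greedy_tour(pts, pick=min)   # nearest-neighbor: aims for a short tour
    long_ = greedy_tour(pts, pick=max)   # farthest-neighbor: aims for a long tour
    print(short, round(tour_length(pts, short), 2))
    print(long_, round(tour_length(pts, long_), 2))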



The authors argue that human cognition developed in the same way as human physical abilities: both evolved to solve specific problems in specific environments (tasks known as the "four Fs", the four basic drives: fighting, fleeing, feeding and fornicating).

The main message of the work is that "strong intelligence" is not a binary property that a system either has or does not have. It is a range that depends on:

  1. the scope of application, which may be broader or narrower;
  2. the efficiency with which the system converts prior knowledge and experience into new skills within that scope;
  3. the generalization difficulty represented by different points within that scope.

The "value" of one sphere of application of intelligence in comparison with another is absolutely subjective - we would not be interested in a system whose sphere of application would not overlap with ours. And they would not even consider such a system intellectual.

What should a test of intelligence look like?


  • It should describe its scope of application and its own predictive power (validity).
  • It should be reliable, that is, reproducible (reliability).
  • It should measure broad abilities and developer-aware generalization:
    ◦ it should not measure only skill, or only potential,
    ◦ it should not include in its evaluation set any tasks known in advance, either to the system being tested or to its developers.
  • It should control the amount of experience the system consumes during training. "Buying" benchmark performance by feeding in unlimited training data should be impossible.
  • It should provide a clear and exhaustive description of the set of priors it assumes.
  • It should work impartially for both humans and machines, relying only on the same priors that humans possess.

A first attempt at such a test is described below.

Suggested Test: ARC Dataset


ARC can be viewed as a benchmark of general artificial intelligence, as a program-synthesis benchmark, or as a psychometric intelligence test. It targets both humans and AI systems that aim to emulate a human-like form of general fluid intelligence. The format is somewhat reminiscent of Raven's Progressive Matrices [4], a classic IQ test dating back to the 1930s.

ARC includes two datasets: a training set and an evaluation set. The training set contains 400 tasks, the evaluation set 600.

The evaluation set is itself split in two: a public part (400 tasks) and a private part (200 tasks). All tasks are unique, and the evaluation tasks do not overlap with the training ones.

Task data can be found in the repository.

Each task consists of a small number of demonstration examples and test examples: on average 3.3 demonstrations per task and from one to three test examples, most often one. Each example, in turn, consists of an input grid and an output grid.

Such a "grid" is a matrix of symbols (each of which, as a rule, is rendered in its own color):



There are 10 unique symbols (or colors) in total. A "grid" can have any height and width from 1x1 to 30x30 inclusive (average height 9, average width 10).

When solving an evaluation task, the test taker gets access to the demonstration examples (both the input and the output grids), as well as to the starting conditions of the test part: the input grids of the corresponding test examples. The test taker must then construct their own output grid for the input grid of each test example.

The output grid is built entirely from scratch: the test taker must decide what its height and width should be and which symbols to place where. A task is considered solved only if the test taker gives the exact correct answer for every one of its test examples (an all-or-nothing success criterion).
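To make the format concrete, here is a minimal sketch of how a task might be loaded and scored. It assumes the JSON layout used in the public ARC repository ("train" and "test" lists of input/output grids, with cells encoded as integers 0-9); the file name and the trivial "solver" are hypothetical placeholders.

    import json
    from pathlib import Path
    from typing import List

    Grid = List[List[int]]  # one integer per cell; values 0-9 map to the 10 colors

    def load_task(path: str) -> dict:
        """Load one ARC task: a dict with 'train' and 'test' lists of
        {'input': Grid, 'output': Grid} pairs."""
        return json.loads(Path(path).read_text())

    def solve(task: dict) -> List[Grid]:
        """Placeholder solver: returns one predicted output grid per test input.
        A real solver must infer the transformation from task['train']."""
        return [pair["input"] for pair in task["test"]]  # identity baseline

    def is_solved(task: dict, predictions: List[Grid]) -> bool:
        """All-or-nothing scoring: every predicted grid must match the expected
        output exactly, including its height and width."""
        return all(pred == pair["output"]
                   for pred, pair in zip(predictions, task["test"]))

    task = load_task("data/training/0a1b2c3d.json")  # hypothetical file name
    print(is_solved(task, solve(task)))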

The private evaluation set makes it possible to strictly control the purity of evaluation in an open competition. Examples of ARC tasks:



A task whose implicit goal is to complete a symmetric pattern. The nature of the task is conveyed by three input/output examples; the test taker must construct the output grid that corresponds to the input grid (bottom right).



The task of eliminating the "noise".



The red object "moves" towards the blue until it comes into contact with it.



A task whose implicit goal is to continue (extrapolate) a diagonal line that “bounces” when it comes into contact with a red obstacle.



A task requiring several actions at once: "continue the line", "go around obstacles" and "reach the final goal efficiently" (the real task provides more demonstration pairs).

ARC is not claimed to be a perfect or complete test, but it has several important properties:

  • Each test task is novel and relies on an explicit set of priors shared by all test takers.
  • It can be fully solved by humans, but it cannot be solved by any existing machine learning technique (including deep learning).
  • The test can be a very interesting "playground" for AI researchers interested in developing algorithms capable of human-like broad generalization. In addition, ARC lets us compare human and machine intelligence, since both are given the same priors.

The author plans to keep improving ARC, both as a research platform and as a joint benchmark for machine and human intelligence.

What do you think: might the core idea fare better if we manage to shift the attention of the strong-AI community away from trying to surpass humans at specific tasks?

Literature


  • [1] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Int. Res., 47(1):253–279, 2013.
  • [2] Benjamin Beyret, José Hernández-Orallo, Lucy Cheke, Marta Halina, Murray Shanahan, and Matthew Crosby. The Animal-AI environment: Training and testing animal-like artificial cognition, 2019.
  • [3] Martin Buehler, Karl Iagnemma, and Sanjiv Singh. The 2005 DARPA Grand Challenge: The Great Robot Race. Springer Publishing Company, Incorporated, 1st edition, 2007.
  • [4] John Raven. Raven Progressive Matrices. Springer, Boston, MA, 2003.
  • [5] James MacGregor and Yun Chu. Human performance on the traveling salesman and related problems: A review. The Journal of Problem Solving, 3, 2011.
  • [6] James MacGregor and Thomas Ormerod. Human performance on the traveling salesman problem. Perception & Psychophysics, 58:527–539, 1996.
  • [7] Pamela McCorduck. Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. AK Peters Ltd, 2004.
  • [8] Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvári, Satinder Singh, et al. Behaviour suite for reinforcement learning. arXiv:1908.03568, 2019.
  • [9] Diego Perez-Liebana, Jialin Liu, Ahmed Khalifa, Raluca D. Gaina, Julian Togelius, and Simon M. Lucas. General video game AI: a multi-track framework for evaluating agents, games and content generation algorithms. arXiv:1802.10363, 2018.
  • [10] David M. W. Powers. The total Turing test and the Loebner prize. 1998.
  • [11] A. M. Turing. Computing machinery and intelligence. 1950.
  • [12] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. 2019.
  • [13] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. 2018.
  • [14] David H. Wolpert. What the no free lunch theorems really mean; how to improve search algorithms.
  • [15] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, pp. 67–82, 1997.
  • [16] Stephen G. Wozniak. Three minutes with Steve Wozniak. PC World, 2007.
