Linguistics - A test to measure AI intelligence
Posted on 2024-12-27
State-of-the-art AI models have been coming out every week for the past month (OpenAI's o1, Gemini 2.0 Flash Thinking, o3), each more intelligent than the last. o3 in particular has been impressive, with its ARC-AGI score being touted as evidence that it is an early form of AGI. I personally think ARC is hard not because of reasoning, but because a 2D image grid is hard for LLMs to visualize (see similar opinions here [1]).
Given all the hype, I decided to come up with my own intelligence test for a model. Writing an 'intelligence test' question is tricky because of memorization - LLMs are incredibly good at remembering things they've seen before (a single backward pass is often enough to memorize a large chunk of text [2]). This means I can't use any challenges that already exist on the internet (and I probably can't reuse any questions I share here either). It's also important to separate knowledge from intelligence in the question - intelligence is the meta-ability to learn and apply new things, while knowledge is a measure of a model's existing understanding.
Therefore, any 'AI Intelligence test' question should have the following characteristics:
- Low occurrence in the training distribution: The new skill being tested should not be prevalent on the internet. For example, Google measures Gemini's ability to translate to and from Kalamang [3], a language with fewer than 200 speakers, to counter this.
- Large, un-memorizable answer space: One of the reasons AlphaGo is so significant is that Go has an extremely large search space, and it is impossible to win by memorizing the optimal move for every board position.
- Should be completely solvable from the question: As François Chollet puts it, intelligence is the skill through which you acquire new skills. Therefore, the test should introduce a new skill and examine a model's ability to use it.
Based on the above criteria, I think linguistics Olympiad questions [4] are a great litmus test for model intelligence.
- The questions are usually about esoteric languages (such as Inuktitut), which models are unlikely to have seen in large quantities during training.
- They are logical deduction problems, where the problem is entirely solvable using the examples provided in the question, with no prior understanding of languages needed.
- Languages generally have a large number of symbols that can be combined in many ways, and so the search space for a possible answer is likely to be large.
Linguistics is also a beautiful evaluation method because it's based on the very thing LLMs are supposed to be experts in - language. So let's see how they perform!
Intelligence Test 1: Japanese Braille
This is a problem of 'easy' difficulty. Here's the question:
[Image: the original problem - this one's pretty fun to solve]
The problem's actually fairly fun to solve, and I encourage you to try it yourself (here's the solution, in case you'd rather pass). This one should also be easy for LLMs, because they're likely to have had some prior exposure to Japanese Braille in their training data (for example, GPT-4o already knows its basic rules [6]). I modified the problem above to reduce the impact of memorization - removing any mention of Japanese Braille, and converting the braille images into a series of binary digits (with 1s representing large dots and 0s representing small ones).
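To make the conversion concrete, here's a minimal sketch of it in Python (the dot ordering is just an illustrative assumption; any fixed order works as long as it's applied consistently):

```python
# Sketch of the braille-to-binary conversion: each six-dot cell becomes a
# six-character string, with 1 for a large dot and 0 for a small one.
# The dot ordering (positions 1-6) is an assumption for illustration.
def cell_to_bits(large_dots: set[int]) -> str:
    """Encode one six-dot cell; large_dots holds the positions (1-6) with large dots."""
    return "".join("1" if pos in large_dots else "0" for pos in range(1, 7))

def word_to_bits(cells: list[set[int]]) -> str:
    """Encode a word as space-separated six-bit blocks, one per cell."""
    return " ".join(cell_to_bits(cell) for cell in cells)

print(cell_to_bits({1, 6}))            # 100001
print(word_to_bits([{1, 6}, {1, 4}]))  # 100001 100100
```

Here's the modified problem: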
Advaithese is a language system consisting of 0s and 1s. Here's an example of the word karaoke in Advaithese:
Karaoke: 100001 100100 011000 111001
Part 1: The following words represent atari, haiku, katana, kimono, koi and sake. Which is which? You don't need to know either Japanese or Advaithese to figure it out; you'll find that the system is highly logical.
A: 100011 101000 110001
B: 100101 111001
C: 100001 100110 100010
D: 101001 011111 011010
E: 011001 101000
F: 100000 100110 101100
Part 2: What are the following words?
A: 100001 100100 111110
B: 100000 101010 111111
Part 3: How do you write “ro” in Advaithese?
Part 4: Write the following words in Advaithese characters:
A: Samurai
B: Miso
Here's how the top models did on it:
| Model Name | Performance |
| --- | --- |
| GPT-4o | 4/6 in the first part, 0 in the rest |
| o1 | Gets all 4 parts correct! Interestingly, it guesses the 2nd part (possibly memorization), but figures out the other parts with correct reasoning. |
| Gemini 2.0 Flash Thinking (Experimental) | 3/6 in the first part, 0 in the rest |
| Claude 3.5 Sonnet | 1/6 in the first part, 1/2 in the second part by guessing (memorization?), 0 in the rest |
o1 is seriously impressive! It nailed the problem in under 2 minutes, while the other models struggled to even start it (I'm a little ashamed to say it took me a lot more than 2 minutes to get through this one). But let's see if o1 can handle a tougher question.
Test 2: Inuktitut Numbers
This one's marked as 'medium' difficulty, but I honestly thought it was pretty easy.
[Image: the original problem - this one becomes easy once you make a reasonable guess]
As usual, I removed any mention of Inuktitut to prevent memorization, and I converted the symbols into an easy form for the LLMs to digest - using slashes ('/','\') and apostrophes to write the above equations.
Advaithese is a new language with its own way of writing numbers, more appropriate for the way numbers are expressed in the Advaithese language. Imagine that you find some Advaithese students who know nothing about English, Latin script or Indo-Arabic numerals. Then, in order to start communication, one of the students offers you a list of mathematical operations, shown below. This version of the table uses the Indo-Arabic symbols for the operations.
\ + \ = \/
\/ +\/\ = ‘
\/\/‘ + \/‘ = \’`’
\/\’`’ - \/\ = ‘`’
\/\ * ‘ = ‘`’
\/\/‘ * \/‘ = \/\ \/\
\/\/ ~ + \ ’`’ = \/\/ \’`’
Seeing that you understood the table, the student challenges you to write down the answers to the following operations. Give the answers in Advaithese numerals.
\/\ + \/\ =
~ * ‘` ’` ’` =
\ ~ - \/\ =
‘ * ‘ =
\/\’` - \/\ =
\ \/‘` + \ \/\/‘ =
Here's how the LLMs performed:
| Model Name | Performance |
| --- | --- |
| GPT-4o | 0 |
| o1 | Gets 2 parts out of 6. It got some parts right (what the slashes represent, the value for 0), but others wrong - the base the system operates in (base 20) and the value for the apostrophes. |
| Gemini 2.0 Flash Thinking (Experimental) | 0 |
| Claude 3.5 Sonnet | 0 |
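Since the base is what tripped o1 up, here's a quick illustration of what "base 20" means, using plain integer digits rather than the puzzle's slash-and-apostrophe glyphs (this is background arithmetic only, not the puzzle's notation):

```python
# Base-20 place value: each successive digit position is worth a power of 20.
# Plain integer digits are used here, not the puzzle's glyphs.
def to_base20(n: int) -> list[int]:
    """Return the base-20 digits of a non-negative integer, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        digits.append(n % 20)
        n //= 20
    return digits[::-1]

def from_base20(digits: list[int]) -> int:
    """Convert base-20 digits (most significant first) back to an integer."""
    value = 0
    for d in digits:
        value = value * 20 + d
    return value

print(to_base20(47))        # [2, 7]  i.e. 2*20 + 7
print(from_base20([2, 7]))  # 47
```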
Well, it looks like we have our new AI intelligence litmus test question! o1 gives it a fair shot but falls short of making any real progress on the problem. The other models don't even grasp how to start it.
Conclusion
Linguistics Olympiad questions are a great intelligence measure for LLMs because they introduce new rules and test an LLM's ability to understand and apply them. Even though they're based in language, LLMs can't lean on prior knowledge because the languages have a negligible presence in the training data. There's a large bank of such questions, and new ones are created every year for the Olympiad, so we should have a fairly large set to test LLMs on. And the questions we tried today are still pretty easy (they're nothing close to what gets asked in the Olympiad), so it'll be interesting to see how o3, o1 pro, and future models perform on these.
Update: A NeurIPS 2024 paper introduces a linguistics evaluation dataset, LingOly [5]. Model performance on it appears poor (though they have not evaluated o1 on it); it'll be interesting to see how quickly this benchmark saturates.