This year I used the Gemini 1206 LLM to participate in the Advent of Code contest. The LLM was able to solve 60% of puzzles just from the description, and an additional 15% of the problems with simple assistance.
Result | Percent of puzzle parts (cumulative) |
---|---|
Solved puzzle without human interaction | 60% |
Solved puzzle with simple debugging | 75% |
Solved puzzle when given strong hint | 90% |
Failed to solve puzzle | 10% |
Category | Count | Zero-shot | With assistance | With hint | Failed |
---|---|---|---|---|---|
Part 1 | 25 | 18 | 4 | 2 | 1 |
Part 2 | 24 | 11 | 4 | 5 | 4 |
Total | 49 | 29 | 8 | 7 | 5 |
The day-by-day solutions can be found in this GitHub project: github.com/jackpal/aoc2024
Disclosures and background
I work as an SDE at Google, but I don’t work in ML or on the Gemini project. I’ve competed in Advent of Code since 2020, and have gone back and solved all of the problems, often using hints from r/adventofcode and other web discussions. I’ve been interested in LLMs and other machine learning / AI since my student days at MIT, where I caught the tail end of the original AI wave.
Experiment setup
I decided to start using an LLM on Day 11 of the contest. To avoid polluting the leaderboard, I waited 10 minutes after the start of each day’s contest before beginning work on that day’s problem. This kept my LLM-assisted scores off the leaderboards. I also went back and solved the earlier days’ problems. Since those solutions were submitted more than 24 hours after each problem was announced, they didn’t affect the leaderboard scoring.
I used Google’s AI Studio, with a custom system prompt. I used the default temperature setting 1.0.
For each day I:
1. Manually visited the Advent of Code puzzle page using a Chrome browser.
2. Copied the “part one” text of the puzzle, including examples.
3. Pasted that text into the “Type something” box of AI Studio.
4. Pressed “Run”.
5. Copied the generated Python code from AI Studio.
6. Pasted it into a Jupyter notebook running in Visual Studio Code with Python 3.13.1. I had the aocd and networkx libraries installed in the Python environment.
7. Ran the Python code.
8. If there were issues, I debugged them, potentially prompting the LLM to try again.
9. Once Part 1 was complete, I repeated the steps with “Part Two” of the puzzle.
Automating the process
I used the “Advent of Code Data” (aocd) Python library to automate retrieving the puzzle input and submitting the puzzle answers. I experimented briefly with automating the LLM workflow itself: I was able to automate steps 1-5, but didn’t automate running and scoring the generated Python code. Full automation would be helpful for comparing the effects of different prompts, as well as comparing the quality of different models.
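For reference, here is a minimal sketch of what that harness looks like. The day number and the body of solve_a are placeholders; aocd fetches the input when input_data is read and submits the answer when answer_a is assigned, which is the same convention the system prompt below relies on.

```python
from aocd.models import Puzzle

puzzle = Puzzle(year=2024, day=11)               # placeholder day number

def solve_a(input_data):                         # normally generated by the LLM
    return sum(int(line) for line in input_data.splitlines())

# Reading input_data downloads the puzzle input; assigning answer_a submits it.
puzzle.answer_a = solve_a(puzzle.input_data)
```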
Mistakes
I accidentally submitted my solutions early on Day 23, getting 8th place on part 1 and 135th place on part 2. Sorry, humans!
Comparing different versions of Gemini
During the experiment I tried several different versions of Gemini that were available on AI Studio:
- Gemini 2.0 Flash Experimental
- Gemini Experimental 1206
- Gemini 2.0 Flash Thinking Experimental
As far as I can tell, Gemini Experimental 1206 gave the best results for Advent of Code. I only found one case, Day 25, where Gemini 2.0 Flash Thinking Experimental scored better than Gemini Experimental 1206.
That being said, the other models were capable of solving the simpler problems. It would be interesting to study the relative strengths of the different models.
I very briefly experimented with non-Gemini models, including Llama 3.3 70B and whatever OpenAI model Microsoft Bing was serving on December 10th. In my limited testing they were not obviously superior to Gemini 1206, so I decided to concentrate on Gemini 1206. It would be interesting to run the experiment on a variety of other models.
Prompt Engineering
After experimenting with small tweaks during the contest, my System Instruction prompt was a variation of:
You are an expert Python coder. You are participating in the "Advent of Code" programming contest. The following is a puzzle description. Write expert Python code to solve the puzzle.
The puzzle has two parts. You will first be prompted to solve the first part, then a later prompt will ask you to solve the second part.
First write a short explanation of the algorithm. Then write the code.
Read the puzzle input in the form of a text string from puzzle.input_data. Assign the puzzle answer to the property puzzle.answer_a for the first part. Assign the puzzle answer to puzzle.answer_b for the second part.
Only assign to puzzle.answer_a one time. Don't use puzzle.answer_a as a temporary variable or an accumulator.
For example if the puzzle is, "The input is a series of numbers, one per line. Calculate the sum of the numbers", then the code you generate could look like this:
def solve_a(input_data):
    return sum([int(line) for line in input_data.splitlines()])
puzzle.answer_a = solve_a(puzzle.input_data)
And if the "Part b" of the puzzle is "Calculate the product of the numbers instead", then the code you generate could look like this:
def solve_b(input_data):
    return prod([int(line) for line in input_data.splitlines()])
puzzle.answer_b = solve_b(puzzle.input_data)
Assume that the input is valid. Do not validate the input.
Think carefully. It is important to get the correct answer and for the program to run quickly.
Use split('\n\n') to split chunks of the input that are separated by blank lines.
Pay attention to whitespace in the input. You may need to use "strip()" to remove whitespace.
Use @cache for recursive functions that are called frequently with the same parameters.
If the problem is a graph or network problem, use the networkx library to solve it.
Use subroutines, lambdas, list comprehensions and logical boolean operators where it will make the code shorter.
Use short function names and short variable names.
Assume this code has already been executed to get the puzzle object set up:
from aocd.models import Puzzle
from pathlib import Path
puzzle = Puzzle(year=2024, day=int(Path(__vsc_ipynb_file__).stem))
Gemini did not always adhere to this prompt. Sometimes it generated verbose code.
Prefer short code
It takes Gemini significantly more time to generate code with long variable names and comments than it does to generate short code without comments. For Advent of Code puzzles it’s often clearer to use short variable names, so that more of the algorithm can fit on screen. And in any timed competition it’s better to get the answer generated sooner rather than later.
Asking Gemini to generate a description of the algorithm it is using was helpful in two ways:
- It may have led to better code, by giving the LLM more time to “think” about the problem. (I’m not certain of this, but it is true of other code LLMs.)
- It helped me understand the code, which made debugging easier.
Avoid trying to cover every possible case in the system prompt
I found it was better to keep the system prompt relatively short, especially with respect to advice that was only applicable in some situations. For example, I experimented with adding the direction “be sure to handle very large inputs” to the prompt. While this was helpful for some problems, it caused more problems than it solved: it pushed the LLM towards less common data structures, which led to more mistakes.
Instead, it was better to keep the prompt short and add “be sure to handle very large inputs” or “that’s too slow” as follow-up prompts when needed.
Results
I started using Gemini on Day 10. For the purposes of this report, I went back and used the LLM to solve the earlier problems as well. Here are my day-by-day notes on how successful I was. I scored the LLM’s performance on the following scale:
Rating | Meaning |
---|---|
0 | LLM did not produce a working solution, even when prompted with a hint that explained how to solve the problem. |
1 | LLM produced a working solution after being prompted with a hint that explained how to solve the problem. |
2-3 | LLM required help to debug mistakes in its code (such as off-by-one errors or coordinate system errors). Score a 2 if the mistake is difficult to debug, 3 if the mistake is easy to debug. |
4 | LLM produced a working solution after being prompted with simple prompts like “make it run faster”. |
5 | LLM produced a working solution, no additional prompts required. |
Gemini 1206 performance
Day | Part 1 | Part 2 | Notes |
---|---|---|---|
1 | 5 | 5 | |
2 | 5 | 5 | |
3 | 5 | 5 | |
4 | 5 | 5 | Some versions of the system prompt had problems generating code for part 2. |
5 | 5 | 5 | |
6 | 5 | 2 | Faulty “guard against infinite loop” code in part 2. |
7 | 5 | 5 | |
8 | 5 | 5 | |
9 | 5 | 5 | |
10 | 5 | 5 | |
11 | 5 | 2 | Prompt for part 2: That’s still too slow. Can you think of a different way of counting the stones? |
12 | 5 | 1 | Had trouble counting edges correctly. Had to be prompted with corner-counting algorithm. |
13 | 5 | 1 | Use Cramer’s Rule |
14 | 5 | 1 | See Day 14 Easter egg prompt below |
15 | 1 | 0 | Part 1 prompt: You have to recursively check-and-move the boxes. No known working prompt chain for part 2. |
16 | 5 | 1 | See Day 16 part 2 prompt |
17 | 2 | 0 | See Day 17 part 2 for details. |
18 | 5 | 5 | |
19 | 3 | 3 | Mistakes parsing input. |
20 | 1 | 1 | Succeeds with Reddit prompt |
21 | 0 | 0 | Unable to code Reddit hints correctly |
22 | 5 | 2 | Part 2 required prompts to optimize runtime and manual debugging of generated code. |
23 | 5 | 5 | Day 23: LAN Party Part 1 and 2 both used networkx algorithms |
24 | 3 | 0 | Day 24: Crossed Wires |
25 | 2 | n/a | Day 25: Code Chronicle |
Interesting Days
Day 14 Easter egg prompt
Day 14 was an unusual Advent of Code puzzle: the part 2 directions did not provide a way of detecting a Christmas tree, and the LLM was completely lost.
Once this hint was provided, the LLM produced code that worked well:
That's not correct. Run the simulation normally, but each step, calculate the number of horizontally or vertically adjacent robots. Two robots are horizontally adjacent if their x coordinates differ by 1 and their y coordinates are equal. Two robots are vertically adjacent if their y coordinates differ by 1 and their x coordinates are equal. Keep track of the maximum number of adjacent robots seen. Simulate 10000 steps and then print the step number that had the maximum number of adjacent robots.
I’m using the scoring rubric above, but it could be argued that this response deserves a higher score, because the puzzle directions were intentionally obscure.
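As a concrete illustration, here is a sketch of the adjacency-counting approach described in the hint. This is my own reconstruction, not the model’s output; it assumes robots are parsed as (x, y, vx, vy) tuples, and the default grid size is the one I recall the puzzle using.

```python
def find_tree_step(robots, width=101, height=103, steps=10000):
    """Return the step (1..steps) with the most horizontally or vertically adjacent robots."""
    best_step, best_adjacent = 0, -1
    for step in range(1, steps + 1):
        positions = {((x + vx * step) % width, (y + vy * step) % height)
                     for x, y, vx, vy in robots}
        # Count pairs of robots that are side by side or directly stacked.
        adjacent = sum(((x + 1, y) in positions) + ((x, y + 1) in positions)
                       for x, y in positions)
        if adjacent > best_adjacent:
            best_step, best_adjacent = step, adjacent
    return best_step
```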
Day 15 part 2
Nothing worked: the LLM generated lots of incorrect code, and it had trouble maintaining the data structures for the wide-box simulation, even when prompted with the specific answers.
Day 16 part 2 prompt
I used this prompt, which is a lightly edited version of a Reddit comment:
That's not correct. Here is a hint: Run a Dijkstra to get the distance matrix (from_start).
Obtain from_end by running another Dijkstra, but with 4 starting vertices [ (target_row, target_col, dir) for dir in "EWNS"]
Iter through every coord and direction and check if from_start[(row,col,dir)] + from_end[(row,col,flip(dir))] is the same as the answer from part 1.
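For readers who want to see the shape of that approach, here is my own sketch of it, not the model’s output. It assumes the maze is a list of strings bordered by walls, start and end are (row, col) tuples, and best is the part 1 answer.

```python
import heapq

DIRS = {'E': (0, 1), 'W': (0, -1), 'N': (-1, 0), 'S': (1, 0)}
FLIP = {'E': 'W', 'W': 'E', 'N': 'S', 'S': 'N'}
TURNS = {'E': 'NS', 'W': 'NS', 'N': 'EW', 'S': 'EW'}

def dijkstra(grid, starts):
    """Lowest cost from any start state to every reachable (row, col, dir) state."""
    dist = {s: 0 for s in starts}
    pq = [(0, s) for s in starts]
    heapq.heapify(pq)
    while pq:
        d, (r, c, f) = heapq.heappop(pq)
        if d > dist[(r, c, f)]:
            continue                                     # stale queue entry
        dr, dc = DIRS[f]
        moves = [((r + dr, c + dc, f), 1)]               # step forward (border is all walls)
        moves += [((r, c, t), 1000) for t in TURNS[f]]   # turn 90 degrees in place
        for state, cost in moves:
            if grid[state[0]][state[1]] != '#' and d + cost < dist.get(state, float('inf')):
                dist[state] = d + cost
                heapq.heappush(pq, (d + cost, state))
    return dist

def count_best_path_tiles(grid, start, end, best):
    """A tile is on a best path if cost-from-start plus cost-to-end equals the part 1 answer."""
    from_start = dijkstra(grid, [(*start, 'E')])            # reindeer starts facing East
    from_end = dijkstra(grid, [(*end, f) for f in 'EWNS'])  # the 4 starting vertices at the end
    return len({(r, c) for (r, c, f), d in from_start.items()
                if d + from_end.get((r, c, FLIP[f]), float('inf')) == best})
```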
Day 17 part 2
To solve day 17 part 2 you had to analyze your specific input program, reverse engineer it, and then write a search function that searched a narrow, deep tree.
The LLM made mistakes on both part 1 and part 2, and it was unable to produce a useful program for part 2 on its own. But it was able to generate correct subroutines when directed to by special-purpose prompts. For example, I asked it to disassemble my puzzle input into Python code, and it did a fairly good job, making just one mistake. Similarly, when I asked it to write a search function, it wrote one quickly.
Day 19 whitespace parsing
In both parts, the LLM had problems parsing the puzzle input. Items were separated by ‘, ’, but the LLM split on ‘,’ instead of the correct ‘, ’.
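To make the failure mode concrete (using a generic string rather than the actual puzzle input):

```python
items = "a, b, c"
items.split(',')                        # ['a', ' b', ' c'] -- stray leading spaces
items.split(', ')                       # ['a', 'b', 'c']
[s.strip() for s in items.split(',')]   # ['a', 'b', 'c'] -- the strip() workaround
```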
Day 20: Race Condition
No luck with the LLM or on my own. The LLM solved it easily with this Reddit-provided hint:
Both parts 1 and 2 can be solved using the same algorithm. Write a single function solve(allowed_cheat_length) that is used for both solutions.
- Find the distance from the end position to every other position.
- Find the path from start to end using the knowledge of the distance from any point to the end.
- For each pair (a, b) where a and b are coordinates on the fastest path:
  - Calculate the Manhattan distance between a and b.
  - If this distance is larger than the allowed cheat length, a -> b is not an allowed cheat.
  - If the distance from b to the end plus the distance from a to b is not lower than the distance from a to the end, there is no benefit to such a “cheat”.
  - If the benefit of the move from a to b is less than the defined minimal benefit, ignore it.
  - Otherwise, increase the counter of valid cheats by 1.
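Here is a straightforward (quadratic) sketch of how I would translate that hint; it is not the code the LLM produced. It assumes grid is a list of strings, and min_saving=100 is the minimum saving I recall the real puzzle asking about.

```python
from collections import deque

def solve(grid, allowed_cheat_length, min_saving=100):
    """Count cheats of length <= allowed_cheat_length that save at least min_saving steps."""
    track = {(r, c) for r, row in enumerate(grid)
             for c, ch in enumerate(row) if ch != '#'}
    end = next((r, c) for r, row in enumerate(grid)
               for c, ch in enumerate(row) if ch == 'E')

    # BFS from the end: distance from every track cell to the finish.
    dist = {end: 0}
    queue = deque([end])
    while queue:
        r, c = queue.popleft()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in track and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)

    # The track is a single corridor, so every open cell lies on the fastest path.
    cheats = 0
    for a in dist:
        for b in dist:
            d = abs(a[0] - b[0]) + abs(a[1] - b[1])      # Manhattan length of the cheat
            if d <= allowed_cheat_length and dist[a] - (dist[b] + d) >= min_saving:
                cheats += 1
    return cheats
```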
Day 21: Keypad Conundrum
This was a rough day for the LLM. It was unable to implement the optimal algorithm, even with numerous Reddit hints.
The LLM handled parsing the input and creating an overall program structure, but was unable to think of the necessary optimizations.
Day 23: LAN Party
The LLM correctly identified that the problem could be solved by finding cliques, but the code it generated for finding cliques was buggy. Amending the system prompt to recommend the networkx library for network and graph puzzles solved this problem, because the LLM could then call networkx’s bug-free clique algorithms.
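As an illustration, the part 2 solution reduces to a few lines once networkx is in play. This is my own sketch, assuming the usual one-“ab-cd”-edge-per-line input and that the answer is the largest clique’s names sorted and comma-joined:

```python
import networkx as nx

def solve_b(input_data):
    # Each input line is an edge like "kh-tc"; nx.Graph accepts an iterable of edges.
    G = nx.Graph(line.split('-') for line in input_data.splitlines())
    largest = max(nx.find_cliques(G), key=len)   # find_cliques yields maximal cliques
    return ','.join(sorted(largest))
```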
Day 24: Crossed Wires
The LLM was helpful for solving part 1. For part 2, it couldn’t produce a complete solution, but it was helpful for utility functions, visualizers, and stats computations, and also as a reference: for example, it correctly answered “what is the circuit for a full adder?”
In the end I had the LLM write a sum(x, y) function, and used it to help identify bits that were bad. An undocumented simplifying feature of the actual input data was that the swapped circuits were always within a single full adder. So it was easy enough to identify the four full adders in my input that had bad bits, and then inspect the 5 gates per full adder to see which two gates had their outputs swapped.
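For context, a full adder computes sum = x XOR y XOR carry_in and carry_out = (x AND y) OR (carry_in AND (x XOR y)). Below is a rough sketch of the kind of sum(x, y) helper I mean, not the exact code the LLM produced; it assumes the gates have been parsed into (a, op, b, out) tuples and that the inputs are 45 bits wide, as mine were.

```python
from operator import and_, or_, xor

OPS = {'AND': and_, 'OR': or_, 'XOR': xor}

def simulate(gates, x, y, bits=45):
    """Evaluate the gate network with x and y on the input wires; return the z value."""
    wires = {f'x{i:02d}': (x >> i) & 1 for i in range(bits)}
    wires |= {f'y{i:02d}': (y >> i) & 1 for i in range(bits)}
    pending = list(gates)                          # gates: (a, op, b, out) tuples
    while pending:
        remaining = []
        for a, op, b, out in pending:
            if a in wires and b in wires:
                wires[out] = OPS[op](wires[a], wires[b])
            else:
                remaining.append((a, op, b, out))
        if len(remaining) == len(pending):         # a bad swap created a cycle; give up
            raise ValueError('gate network did not settle')
        pending = remaining
    zs = sorted(w for w in wires if w.startswith('z'))   # z00 is the least significant bit
    return sum(wires[z] << i for i, z in enumerate(zs))
```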
Day 25: Code Chronicle
Gemini 1206 misunderstood how to parse the input, and how to tell locks from keys. It also had several off-by-one errors in the tumbler/key height calculations. It required four rounds of prompt correction to produce the correct answer.
Gemini 2.0 Flash Thinking Experimental also misunderstood how to parse the input, but otherwise made fewer mistakes than Gemini 1206. It only required one prompt correction (explaining how to parse the input) in order to produce the correct answer.
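For reference, here is my sketch of the parsing and matching the models struggled with. It assumes the schematics are 7-row, 5-column blocks separated by blank lines, locks have a filled top row, keys a filled bottom row, and a pair fits when no column’s combined height exceeds 5.

```python
def solve_a(input_data):
    locks, keys = [], []
    for block in input_data.split('\n\n'):
        rows = block.splitlines()
        # Column heights, excluding the all-'#' top (lock) or bottom (key) row.
        heights = tuple(col.count('#') - 1 for col in zip(*rows))
        (locks if rows[0] == '#####' else keys).append(heights)
    return sum(all(l + k <= 5 for l, k in zip(lock, key))
               for lock in locks for key in keys)
```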
Uses beyond solving puzzles
Gemini 1206 is good for more than just trying to solve the puzzle:
- Parsing. Gemini does a pretty good job of parsing puzzle input. I did run into two flaws:
  - Gemini needs help parsing input that uses blank lines to separate different chunks of input. I had to explicitly instruct it, “Use split(‘\n\n’) to parse chunks separated by blank lines.”
  - Gemini is blind to whitespace following commas, so it doesn’t parse “a, b” correctly. It uses split(‘,’) instead of split(‘, ’). I wasn’t able to discover a prompt to fix this. The best I figured out was to either manually correct the issue, or to ask it to strip() the resulting parts.
- Debugging. I was stuck helping my friend debug his code for Day 16 Part 1. His code solved the examples, but not his input. We worked on the code for 30 minutes, rewriting almost all of it, but nothing helped. Finally, we dumped his original code into Gemini and asked it to find the bug. It immediately pointed out the bug (a direction table with a copy-and-paste error).
- Visualization. On several days I asked Gemini to add a visualization to my code (for example, showing a path on a grid). It was simple code, but Gemini wrote it more quickly than I could have.
- Other languages besides Python. I didn’t experiment much with this, but I did see that Gemini was able to produce working Go code. I expect it would work for other popular languages as well.
Conclusions
In many ways Advent of Code is a “best case scenario” for LLM-based code generation: the problem descriptions are extremely clear, the two parts of each problem help the LLM refine its understanding of the problem space, the problems are typically solved by applying well-known computer science algorithms, and they are designed to be solved by normal programmers, using normal Python, after a modest amount of thinking time and computer time. There are 441 published example problems, with hundreds of solutions for each available on the web.
I found it enjoyable to use the LLM for help with the harder puzzles. Even when it couldn’t solve the puzzle, the LLM helped me with parsing, visualization, and utility functions. It also sometimes helped with debugging and with answering general knowledge questions.
This was my first experience using LLMs to generate code. I was impressed by the results and I look forward to trying LLMs in other problem domains.
I think I will continue to use LLMs in future Advent of Code competitions. I’ll keep my “wait at least 10 minutes” personal rule in place, to give unassisted humans a sporting chance. I wish there were a separate LLM-assisted leaderboard; that would be fun to compete on.