
How Computers Solve the Rubik's Cube
The previous post established the mathematical structure of the Rubik’s Cube: a non-abelian group of order 43,252,003,274,489,856,000, where each element is a permutation of pieces and the solving problem is finding a path in the Cayley graph from an arbitrary node to the identity. The question now is how a computer navigates that graph.
The answer is not brute force. Understanding why brute force fails precisely – not just intuitively – is the necessary starting point for understanding why the algorithms that do work are designed the way they are.
Why Brute Force Fails
The Cayley graph Cay(G, S) has |G| = 43.25 * 10^18 nodes and is 18-regular in the half-turn metric (18 generators: 6 faces x 3 move types each). A computer evaluating 10^9 states per second would require approximately 4.3 * 10^10 seconds – over 1,300 years – just to traverse the node set once, setting aside the infeasibility of storing 43 quintillion states in memory.
Depth-limited search over move sequences is worse. The number of distinct move sequences of length k is at most 18^k, but many sequences reach the same state. At k = 20, 18^{20} is approximately 1.3 * 10^25 – roughly 290,000 times larger than |G|. The ratio 18^{20} / |G| quantifies the redundancy: on average, about 290,000 distinct length-20 move sequences reach any given state. Enumerating all paths without deduplication is vastly worse than enumerating states.
The root failure of brute force is the absence of any measure of progress. The search expands states without preferring those closer to the solution, so it spends equal effort on states that are far from a solution and states that are one move away. An efficient algorithm needs a function that measures closeness to the goal and uses it to prune the search.
Graph Search: BFS, DFS, and IDDFS
Framing cube solving as a graph search problem gives access to a well-developed toolkit.
Breadth-first search (BFS) processes nodes in order of their distance from the source. It maintains a queue of frontier nodes and guarantees the shortest path. The frontier at distance d contains all states reachable in exactly d moves. For the cube:
- Depth 0: 1 state (solved)
- Depth 1: 18 states
- Depth 2: 18 * 15 = 270 move sequences (after a move, the 3 moves on the same face are pruned, leaving 15 continuations); distinct sequences can coincide on the same state, so this is an upper bound
- Depth d: roughly 18 * 15^{d-1} sequences for small d
At depth 10, the frontier contains on the order of 18 * 15^9 ≈ 7 * 10^{11} states. Storing this frontier requires memory in the terabyte range, which is impractical. BFS guarantees optimality but is memory-infeasible beyond depth 10 or so.
Depth-first search (DFS) explores one path to its maximum depth before backtracking. Memory usage is O(d * b) where d is the depth limit and b is the branching factor. For the cube with b = 15 and d = 20, this is manageable. The problem is that DFS finds an arbitrary solution (often very long) and gives no guarantee about solution length. A DFS with depth limit 20 may find a 20-move solution when a 5-move solution exists along a different branch.
Iterative deepening depth-first search (IDDFS) runs DFS with depth limit 1, then 2, then 3, and so on. At each depth limit, the entire search restarts from scratch. This seems wasteful: the iteration with depth limit d repeats all work from iterations 1 through d-1. But because the number of states grows geometrically with depth (by factor roughly b = 15 per level), the final iteration dominates total work. The cost ratio between the final iteration and all previous iterations combined is:
$$ \frac{b^d}{b^d + b^{d-1} + \ldots + b^0} = \frac{b^d}{(b^{d+1} - 1)/(b - 1)} \approx \frac{b - 1}{b} = \frac{14}{15} $$

So IDDFS does at most b/(b-1) = 15/14 times the work of the final iteration alone. The overhead from restarting is less than 8%. IDDFS achieves BFS-level optimality guarantees with DFS-level memory usage.
IDDFS is the right base algorithm for the cube, but without a heuristic it still searches the full tree to depth d at each iteration. Making it tractable requires pruning: a function that identifies subtrees that cannot contain an optimal solution and skips them.
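The restart loop is short enough to show directly. Below is a minimal IDDFS sketch in Python, run on an illustrative toy puzzle (a 4-element ring that rotates left or right) rather than the cube; the function names are hypothetical, not cube code:

```python
def iddfs(start, is_goal, successors, max_depth=20):
    """Iterative deepening DFS: BFS-style optimality, DFS-style memory."""
    def dfs(state, depth, path):
        if is_goal(state):
            return path
        if depth == 0:
            return None
        for move, nxt in successors(state):
            result = dfs(nxt, depth - 1, path + [move])
            if result is not None:
                return result
        return None

    # Restart from scratch with a deeper limit each time; the geometric
    # growth of the tree means the final iteration dominates total cost.
    for limit in range(max_depth + 1):
        result = dfs(start, limit, [])
        if result is not None:
            return result
    return None

# Toy puzzle: a 4-element ring that can rotate left or right.
def succ(s):
    return [("L", s[1:] + s[:1]), ("R", s[-1:] + s[:-1])]

print(iddfs((2, 3, 0, 1), lambda s: s == (0, 1, 2, 3), succ))  # ['L', 'L']
```

The same skeleton applies to the cube once `successors` enumerates the 18 face turns; what is missing is exactly the pruning function discussed next.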
Admissible Heuristics
A heuristic h(s) estimates the cost (number of moves) to reach the solved state from state s. For a search algorithm to guarantee optimality, h must be admissible: it must never overestimate the true optimal cost. Formally, if d*(s) is the true shortest-path distance from s to the identity, admissibility requires h(s) <= d*(s) for all s.
Admissibility is the key property. If h overestimates for some state s’, then d*(s’) could be less than h(s’), which might cause the algorithm to discard paths through s’ as too expensive even though they lead to the optimal solution. An overestimating heuristic can cause the algorithm to miss the optimal path and return a suboptimal one.
A heuristic is informative (or tight) if it is close to d*(s) in practice. An admissible heuristic h = 0 (always return 0) is maximally admissible but provides no information: the search degenerates to uninformed IDDFS. A heuristic equal to d*(s) for all s is perfect – the search would proceed directly to the solution – but computing d*(s) exactly is the original hard problem. The design challenge is to find admissible heuristics that are as tight as possible while remaining cheap to evaluate.
A* and IDA*
A* search processes nodes in order of f(s) = g(s) + h(s), where g(s) is the known distance from the source to s and h(s) is the heuristic estimate of the distance from s to the goal. A priority queue (min-heap keyed on f) determines expansion order.
If h is admissible, A* is guaranteed to find an optimal path. The intuition: A* never discards a node unless it has found another path to the same node with lower f value, and f = g + h underestimates the true cost of any path through s. Formally, A* with a consistent heuristic is optimally efficient: among algorithms that guarantee optimality using the same heuristic information, none expands fewer nodes.
For the cube, A* is memory-intensive. The open list (nodes to be expanded) can contain exponentially many nodes. For states at depth 20, this is again impractical.
IDA* (Iterative Deepening A*) applies iterative deepening to A*. Instead of a depth limit, IDA* uses an f-cost threshold: at each iteration, prune any node s where f(s) = g(s) + h(s) > threshold. The first iteration uses threshold = h(start). When the search terminates without finding the goal (because all nodes exceeded the threshold), the next threshold is the minimum f value of any pruned node. This guarantees that the threshold increases by the smallest necessary amount.
IDA* maintains the memory efficiency of IDDFS (linear in the current path length) and the pruning effectiveness of A*. For the cube, a good heuristic prunes most of the search tree. The fraction of nodes expanded compared to uninformed IDDFS can be reduced by several orders of magnitude.
IDA* with a tight admissible heuristic is the foundation of both Korf’s algorithm and each phase of Kociemba’s algorithm.
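The threshold mechanics are easier to see in code. Here is a minimal IDA* sketch, demonstrated on an illustrative toy puzzle (a 4-element ring with left/right rotations) rather than the cube; all names are hypothetical:

```python
def ida_star(start, is_goal, successors, h):
    """IDA*: DFS bounded by f = g + h; the threshold rises to the smallest
    pruned f-value between iterations. With admissible h, the first
    solution found is optimal."""
    INF = float("inf")

    def search(state, g, threshold, path):
        f = g + h(state)
        if f > threshold:
            return f, None              # pruned; f seeds the next threshold
        if is_goal(state):
            return f, path
        minimum = INF
        for move, nxt in successors(state):
            t, found = search(nxt, g + 1, threshold, path + [move])
            if found is not None:
                return t, found
            minimum = min(minimum, t)
        return minimum, None

    threshold = h(start)
    while True:
        t, found = search(start, 0, threshold, [])
        if found is not None:
            return found
        if t == INF:
            return None                 # goal unreachable
        threshold = t                   # smallest necessary increase

# Toy puzzle: a 4-element ring with left/right rotations, plus an
# admissible heuristic (any unsorted ring needs at least one move).
def succ(s):
    return [("L", s[1:] + s[:1]), ("R", s[-1:] + s[:-1])]

h = lambda s: 0 if s == (0, 1, 2, 3) else 1
print(ida_star((2, 3, 0, 1), lambda s: s == (0, 1, 2, 3), succ, h))
```

The `minimum` bookkeeping implements the rule from the text: the next threshold is the smallest f-value that exceeded the current one.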
Pattern Databases: Construction and Use
The most effective heuristics for the cube are pattern databases (PDBs). A PDB precomputes and stores the exact minimum number of moves to solve a specific subset of pieces from every configuration of that subset, ignoring all other pieces.
Construction. Run a backward BFS from the solved state, tracking only the specified subset of pieces and ignoring all others. At each BFS level d, record d in the table entry for every subset configuration first reached at level d. When the BFS terminates, every entry contains the exact minimum number of moves to solve that subset from the corresponding configuration. Construction is done offline, once.
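The construction loop can be sketched in a few lines. Below is a toy pattern database for an illustrative 4-element ring puzzle, tracking only one piece; the puzzle and all names are hypothetical, and a real cube PDB differs only in the state encoding and table size:

```python
from collections import deque

def build_pdb(goal, successors, project):
    """BFS outward from the solved state, recording the depth at which each
    projected pattern is first reached. That depth is the minimum distance
    over all states sharing the pattern, so a lookup is an admissible
    lower bound for any full state."""
    pdb = {project(goal): 0}
    seen = {goal}
    frontier = deque([(goal, 0)])
    while frontier:
        state, d = frontier.popleft()
        for _move, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                pdb.setdefault(project(nxt), d + 1)  # first reach is minimal
                frontier.append((nxt, d + 1))
    return pdb

# Toy abstraction: in a 4-element left/right-rotation ring puzzle,
# track only the position of piece 0 and ignore everything else.
def succ(s):
    return [("L", s[1:] + s[:1]), ("R", s[-1:] + s[:-1])]

pdb = build_pdb((0, 1, 2, 3), succ, project=lambda s: s.index(0))
print(pdb)  # maps position of piece 0 -> minimum moves to bring it home
```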
The corner PDB covers all 8 corners. The number of entries is:
$$ 8! \times 3^7 = 40{,}320 \times 2{,}187 = 88{,}179{,}840 $$

Each entry is a value in [0, 11] (the maximum is 11, since corners alone can always be solved in at most 11 moves). At 1 byte per entry, this is about 84 MB. The table is admissible: since edges are ignored in the BFS, the stored distances are lower bounds on the full solution length.
Edge PDBs. The 12 edges give 12! x 2^{11} ≈ 9.8 * 10^{11} configurations, too large to store completely. A practical approach splits the edges into two disjoint sets of 6. The number of configurations for a 6-edge PDB is:
$$ \frac{12!}{(12-6)!} \times 2^6 = 665{,}280 \times 64 = 42{,}577{,}920 $$

At 4 bits per entry (values fit in [0, 10]), one 6-edge PDB is about 21 MB. Two disjoint 6-edge PDBs cover all edges without overlap.
Combined heuristic. For any cube state s, compute h using:
$$ h(s) = \max\{h_{\text{corners}}(s),\ h_{\text{edges}_1}(s),\ h_{\text{edges}_2}(s)\} $$

Taking the maximum preserves admissibility (each individual PDB is admissible, and the max of admissible heuristics is admissible) while producing a tighter bound than any single PDB. If the corner PDB says at least 9 moves are needed, and the edge PDB says at least 7, then h(s) = 9. A stronger lower bound prunes more of the search tree.
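A sketch of the combination step, using two hand-built toy tables for an illustrative 4-element left/right-rotation ring puzzle (the tables and names are hypothetical, not cube data):

```python
# Two hand-computed toy PDBs for a 4-element left/right-rotation ring:
# one tracks where piece 0 sits, the other where piece 2 sits.
# Each maps the tracked piece's current index to its exact distance home.
pdb_piece0 = {0: 0, 1: 1, 2: 2, 3: 1}
pdb_piece2 = {2: 0, 3: 1, 0: 2, 1: 1}

def combined_h(state):
    """The max of admissible heuristics is admissible, and at least as
    tight as either table alone."""
    return max(pdb_piece0[state.index(0)], pdb_piece2[state.index(2)])

print(combined_h((2, 3, 0, 1)))  # each piece is 2 rotations from home -> 2
```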
The improvement from PDB heuristics over a naive heuristic (like counting misplaced pieces) is dramatic. With the combined corner + two 6-edge PDB heuristic, IDA* could solve random states optimally in practical time – hours to days on 1990s hardware. Without a PDB-quality heuristic, optimal search on deep states is hopeless.
Thistlethwaite’s Algorithm: Subgroup Reduction
Morwen Thistlethwaite’s 1981 algorithm introduced the key idea: rather than searching from an arbitrary state directly to the solved state, decompose the problem by targeting a chain of subgroups.
Define the chain:
$$ G = G_0 \supset G_1 \supset G_2 \supset G_3 \supset G_4 = \{I\} $$

At each stage i, find a move sequence (using moves from a restricted generator set S_i) that takes the cube from a state in G_i to a state in G_{i+1}. The subgroups and their generators are chosen so that each stage’s search space is small enough to solve with a precomputed lookup table.
Stage 0 -> G_1. G_1 is the kernel of the edge orientation homomorphism phi: G -> (Z_2)^{11}, the map sending each state to its 12-bit edge orientation vector (with the parity bit redundant). G_1 = {g in G : all edges correctly oriented}. The index [G : G_1] = 2^{11} = 2048. The full generator set S = {U, D, F, B, R, L, U’, D’, F’, B’, R’, L’, U2, D2, F2, B2, R2, L2} is allowed. A BFS over the 2048 edge-orientation classes (the cosets of G_1) labels each class with its minimum distance from the all-zeros class, and the resulting table guides the search into G_1.
Stage 1 -> G_2. G_2 adds the conditions: all corners correctly oriented, and the four middle-slice edges (FR, FL, BL, BR) contained within the middle slice. The allowed generators are S_1 = {U, D, R, L, F2, B2} (quarter-turns of F and B are removed, since they disrupt edge orientation and are no longer needed). The state space for this phase, within G_1, has size:
$$ 3^7 \times \binom{12}{4} = 2{,}187 \times 495 = 1{,}082{,}565 $$

Stage 2 -> G_3. G_3 requires all pieces to be in their correct tetrad (geometric orbit): each of the two corner tetrads contains its correct 4 corners, and each edge slice contains its correct 4 edges. The allowed generators are S_2 = {U, D, R2, L2, F2, B2} (quarter-turns of R and L are removed). The coset space for this phase has size:

$$ \binom{8}{4} \times \binom{8}{4} \times 6 = 70 \times 70 \times 6 = 29{,}400 $$

(70 ways to distribute the 8 corners over the two tetrads, 70 ways to distribute the remaining 8 edges over the two remaining slices, and a factor 6 from the tetrad twist constraint.)
Stage 3 -> identity. Within G_3, using only half-turns (S_3 = {U2, D2, R2, L2, F2, B2}), solve completely. The state space is:

$$ |G_3| = 663{,}552 $$

(permutations of corners within their tetrads and of edges within their slices, reduced by parity constraints). Each stage’s lookup table is precomputed by BFS within the restricted generator set and stored. The total table size is small by modern standards. The algorithm is guaranteed to find solutions in at most 52 moves in HTM, a remarkable guarantee given that it was proven in 1981 with 52 KB of tables.
Korf’s Algorithm: Optimal IDA*
Richard Korf’s 1997 algorithm finds provably optimal solutions using IDA* with large PDBs. There is no subgroup decomposition: a single IDA* search runs over the full cube group, guided by the combined corner + two 6-edge PDB heuristic.
The algorithm is straightforward to describe:
- Compute h(start) = max(h_corners(start), h_edges1(start), h_edges2(start)).
- Run IDA* with initial threshold = h(start).
- At each node s with current path cost g: if g + h(s) > threshold, prune. If s = identity, return the path.
- If the iteration terminates without finding the goal, set threshold = min(g + h(s)) over all pruned nodes and restart.
Because h is admissible, the first solution found has optimal length. The IDA* search expands only nodes where g + h <= threshold, which is the set of nodes on any path whose total f-cost might be at most optimal. With the tight PDB heuristic, this set is much smaller than the full search tree.
Korf’s original 1997 paper reported solving random cube states on a Sun workstation in times on the order of hours to days per state. The heuristic prunes enough of the search tree that the IDA* expansion is in the billions to trillions of nodes rather than 10^19. On modern hardware with larger PDBs, typical states solve far faster, but states near the 20-move maximum can still take a long time.
The limitation of Korf’s algorithm for practical use is that search time is highly variable: most states are fast, but the worst cases are slow. Kociemba’s algorithm trades guaranteed optimality for consistent speed.
Reinforcement Learning: DeepCubeA
DeepCubeA (Agostinelli et al., 2019) is the most developed learning-based approach. Rather than using precomputed PDBs or algebraic subgroup structure, it trains a neural network f_theta to estimate the cost-to-go h(s).
Training procedure (Autodidactic Iteration):
- Start from the solved state.
- Apply k random moves (k drawn uniformly from [1, K]) to generate scrambled states at varying depths.
- Train f_theta to predict the value V(s) = 1 + min(V(s’)) over successors s’ of s, with V(identity) = 0.
- Repeat with the updated network as the value estimate.
This bootstrapping procedure, analogous to value iteration in dynamic programming, trains the network to estimate distances at all depths simultaneously. The network architecture is a fully connected network with residual connections.
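In tabular form the bootstrap is a few lines. The sketch below substitutes a dictionary for the neural network and runs on an illustrative 4-element ring puzzle; everything here (names, puzzle, hyperparameters) is hypothetical, not the DeepCubeA code:

```python
import random

GOAL = (0, 1, 2, 3)

def succ(s):
    return [s[1:] + s[:1], s[-1:] + s[:-1]]   # rotate left / right

V = {}   # tabular stand-in for the value network f_theta

def value(s):
    return 0 if s == GOAL else V.get(s, 0)

random.seed(0)
for _ in range(2000):
    # Scramble: k random moves from the solved state, k drawn uniformly.
    s = GOAL
    for _ in range(random.randint(1, 4)):
        s = random.choice(succ(s))
    if s != GOAL:
        # Bootstrap target: one move plus the best successor's estimate.
        V[s] = 1 + min(value(n) for n in succ(s))

print(value((2, 3, 0, 1)))   # converges to the true distance, 2
```

DeepCubeA replaces the table with a deep residual network trained by regression on these same bootstrap targets.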
At test time, the trained network serves as the heuristic in a weighted A* search: expand nodes in order of f(s) = lambda * g(s) + f_theta(s), with the weight lambda < 1 discounting the path cost g. Since f_theta is not admissible in general, solutions are not guaranteed optimal.
Performance: DeepCubeA solves 100% of test states in their experiments, finding a shortest path in roughly 60% of cases and averaging only slightly more moves than an optimal solver. The search explores an average of roughly 10,000 nodes per state.
The interesting property of the approach is generalizability. The same training framework applies to any puzzle with a defined successor function and identity, without any puzzle-specific structural knowledge. The same code trained on 15-puzzle and Sokoban instances. Kociemba’s algorithm, by contrast, uses the specific group structure of the 3x3 cube.
The tradeoff is solution quality and theoretical guarantees. DeepCubeA solutions are longer, the heuristic is not admissible (overestimates are possible in principle), and the search does not guarantee optimality. For research into generalizable planning algorithms, these tradeoffs are acceptable. For a practical cube solver where solution length matters (as it does in a robot solver where each move has a physical execution time), Kociemba’s algorithm is the better choice.
Comparison of Algorithms
| Algorithm | Optimality guarantee | Avg moves | Typical search time | Approach |
|---|---|---|---|---|
| Brute-force BFS | Optimal | 18-20 | Infeasible | Exhaustive graph search |
| Korf’s IDA* | Optimal | 18-20 | Seconds to hours | IDA* + PDB heuristic |
| Thistlethwaite’s | Not optimal | 40-52 | Fast | 4-phase subgroup reduction |
| Kociemba’s | Not optimal | 18-22 | Milliseconds | 2-phase subgroup reduction + IDA* |
| DeepCubeA | Not optimal | ~21 | Fast (after training) | Learned heuristic + weighted A* search |
| CFOP (human) | None | 50-60 | 10-60 seconds | Memorized algorithm set |
The progression from Thistlethwaite (1981) to Korf (1997) to Kociemba (1992, refined through the 2010s) reflects two distinct strategies for making the search tractable. Korf’s approach is to improve the heuristic until the single-phase search is fast enough. Kociemba’s approach is to decompose the problem into phases so that each phase’s search space is small. Kociemba’s practical performance is better because the decomposition reduces the search space more aggressively than any single admissible heuristic can, at the cost of producing near-optimal rather than optimal solutions.
For the project here, Kociemba’s algorithm is the target. The next post covers it in detail: the specific subgroup H, how the two-phase IDA* search is structured, how the coordinate representation makes the search fast, and what the implementation actually looks like.
What Computers Bring That Humans Cannot
The contrast between human and computer solvers is worth being precise about, because the differences are structural rather than merely quantitative.
Memory scale. A competitive human memorizes a few hundred algorithms covering specific recognition patterns. A computer using PDBs has precomputed and stored the exact optimal cost for every configuration of a subset of pieces – 88 million corner configurations, for example – and can evaluate this in O(1) time per lookup. This is not faster pattern matching; it is a qualitatively different kind of knowledge.
Proof of optimality. A human choosing between two continuation moves is making an informed guess. IDA* with an admissible heuristic is provably not missing any path shorter than the current bound. The guarantee is mathematical, not empirical.
Consistency. Human solvers make recognition errors, slip moves, and misremember algorithms. A computer’s error model is different (implementation bugs, hardware faults) but does not include fatigue, attention lapses, or motor errors.
Phase parallelism. The subgroup decomposition in Thistlethwaite’s and Kociemba’s algorithms breaks the problem into stages, each with a small search space. A solver can explore many candidate phase-1 solutions and their phase-2 continuations concurrently on multiple cores. Human cognition cannot parallelize across subproblems in this way.
These structural differences explain why a computer running Kociemba’s algorithm finds a 20-move solution in milliseconds while the fastest human solvers require around 4 to 5 seconds for a full solve (not counting recognition time). The interesting constraint for human solving is not move optimality but ergonomic efficiency: how to organize moves so that the hands can execute them as a continuous, high-speed physical motion.
