HiQ Peg Solitaire Deep Learning Blog Post (with full links)
<!-- wp:image {"id":69,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large"><img src="https://wordpress-sm7xo.wasmer.app/wp-content/uploads/2025/12/image-768x1024.png" alt="" class="wp-image-69"/></figure>
<!-- /wp:image -->
<!-- wp:paragraph -->
<p>On a snowy day at my parents' house, I was passing time with family and reading a blog post about <a href="https://www.hillelwayne.com/post/picat/">planner programming</a> when my grandmother invited me to play Othello at her apartment. While getting the game out of her cabinet, I noticed a box for <a href="https://en.wikipedia.org/wiki/Peg_solitaire">Hi-Q</a>. I owned this game as a child and had played it to the point where I could reliably win with one peg left every time. To my disappointment, I could only get down to 4 pegs when I tried again. While my Hi-Q skills had atrophied, my programming skills had improved considerably—I wondered if I could find an optimal solution using <a href="http://picat-lang.org/">Picat</a>. I couldn't. But I did find a way to solve it with deep learning.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2>Attempt 1: Constraint-Based Approaches</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>I installed Picat and vibe-coded some scripts using the planning module to find a sequence of moves leading to one peg in the center. The scripts ran indefinitely without producing output—the search space was simply too large.</p>
<!-- /wp:paragraph -->
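<!-- wp:paragraph -->
<p>For context, the state and move model the planner has to search over is tiny to write down; the difficulty is entirely in the number of move orderings. Here is a rough Python version of that model (my own encoding for illustration, not the Picat scripts):</p>
<!-- /wp:paragraph -->

<!-- wp:code -->
<pre class="wp-block-code"><code># English (33-hole) Hi-Q board: a state is the set of occupied holes,
# and a move is a (src, over, dst) jump that removes exactly one peg.
HOLES = {
    (r, c)
    for r in range(7)
    for c in range(7)
    if r in (2, 3, 4) or c in (2, 3, 4)   # cross shape: 33 holes
}
CENTER = (3, 3)

def initial_board():
    """All holes filled except the center: 32 pegs."""
    return frozenset(HOLES - {CENTER})

def legal_moves(pegs):
    """Yield (src, over, dst): src jumps over an adjacent peg into an empty hole."""
    for (r, c) in pegs:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            over = (r + dr, c + dc)
            dst = (r + 2 * dr, c + 2 * dc)
            if over in pegs and dst in HOLES and dst not in pegs:
                yield (r, c), over, dst

def apply_move(pegs, move):
    src, over, dst = move
    return frozenset(pegs - {src, over} | {dst})</code></pre>
<!-- /wp:code -->

<!-- wp:paragraph -->
<p>Solving the game means finding a 31-move path from <code>initial_board()</code> to a single peg on the center hole, and a naive search over <code>legal_moves</code> has no good way to prune the enormous number of orderings that lead to equivalent positions.</p>
<!-- /wp:paragraph -->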
<!-- wp:paragraph -->
<p>I'm the author of <a href="https://github.com/r33drichards/minizinc-mcp">a MiniZinc MCP server</a> connected to my Claude account. Since a planner is essentially a DSL on top of a constraint solver, it seemed like a constraint-based formulation should work. But Claude's solutions also timed out.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Looking for hints, I found a paper on <a href="https://www.kurims.kyoto-u.ac.jp/~kyodo/kokyuroku/contents/pdf/1185-11.pdf">Integer Programming Based Algorithms for Peg Solitaire Problems</a>. The key insight: the search space is large enough that you need sophisticated pruning methods. I could have fed this paper to Claude and been done with it, but I wanted a solution I actually understood. The integer programming approach would have been another black box. This motivated my pivot to deep learning—the loss function was intuitive, and I suspected a neural network would naturally learn to prune the search space.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2>Attempt 2: PPO</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>My previous experience with reinforcement learning was using <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">Proximal Policy Optimization</a> in <a href="https://pytorch.org/">PyTorch</a> to solve simple <a href="https://gymnasium.farama.org/">Gym</a> environments like <a href="https://gymnasium.farama.org/environments/classic_control/cart_pole/">CartPole</a> and <a href="https://gymnasium.farama.org/environments/box2d/lunar_lander/">LunarLander</a>. I vibe-coded a Gym environment for Hi-Q and a basic PPO network, then trained for 100k timesteps. The resulting policy reliably got down to 4 pegs. Not bad!</p>
<!-- /wp:paragraph -->
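<!-- wp:paragraph -->
<p>For the curious, a minimal Gymnasium environment in that spirit looks roughly like the sketch below. This is an illustration rather than my exact code: the one-bit-per-hole observation, the fixed list of jump ids, the illegal-move penalty, and the terminal bonus are all assumptions.</p>
<!-- /wp:paragraph -->

<!-- wp:code -->
<pre class="wp-block-code"><code># Minimal Gymnasium environment for Hi-Q (illustrative, not the original code).
# Requires: pip install gymnasium numpy
import numpy as np
import gymnasium as gym
from gymnasium import spaces

HOLES = sorted({(r, c) for r in range(7) for c in range(7)
                if r in (2, 3, 4) or c in (2, 3, 4)})            # the 33 holes
HOLE_SET = set(HOLES)
DIRS = ((1, 0), (-1, 0), (0, 1), (0, -1))
# Every geometrically possible jump, fixed up front so each action has a stable id.
ALL_MOVES = [((r, c), (r + dr, c + dc), (r + 2 * dr, c + 2 * dc))
             for (r, c) in HOLES for dr, dc in DIRS
             if (r + dr, c + dc) in HOLE_SET and (r + 2 * dr, c + 2 * dc) in HOLE_SET]

class HiQEnv(gym.Env):
    def __init__(self):
        self.observation_space = spaces.MultiBinary(len(HOLES))  # one bit per hole
        self.action_space = spaces.Discrete(len(ALL_MOVES))      # one id per possible jump

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pegs = set(HOLES) - {(3, 3)}                        # full board, center empty
        return self._obs(), {}

    def _obs(self):
        return np.array([int(pos in self.pegs) for pos in HOLES], dtype=np.int8)

    def _legal_ids(self):
        return [i for i, (s, o, d) in enumerate(ALL_MOVES)
                if s in self.pegs and o in self.pegs and d not in self.pegs]

    def step(self, action):
        s, o, d = ALL_MOVES[action]
        if s in self.pegs and o in self.pegs and d not in self.pegs:
            self.pegs -= {s, o}
            self.pegs.add(d)
            reward = 1.0                     # the +1-per-move shaping discussed below
        else:
            reward = -1.0                    # assumed penalty for picking an illegal jump
        terminated = len(self._legal_ids()) == 0
        if terminated and len(self.pegs) == 1:
            reward += 100.0                  # assumed bonus for reaching a single peg
        return self._obs(), reward, terminated, False, {}</code></pre>
<!-- /wp:code -->

<!-- wp:paragraph -->
<p>Any stock PPO implementation can be pointed at an environment like this; my 100k-timestep run used a basic PyTorch policy network.</p>
<!-- /wp:paragraph -->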
<!-- wp:paragraph -->
<p>But the policy would converge on a solution and refuse to deviate—it was stuck in a local maximum. When I analyzed the model's behavior, it was playing the <em>exact same 30 moves every single game</em>, ending with two pegs that couldn't jump each other. The +1 reward per move was the trap: making 30 valid moves and then getting stuck (+28 total reward) was good enough that the policy never explored further.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2>Attempt 3: Breaking Out of Local Maxima</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Claude proposed four approaches to escape the local optimum. I tested them in parallel:</p>
<!-- /wp:paragraph -->
<!-- wp:list {"ordered":true} -->
<ol>
<li><strong>Remove per-move reward</strong> — Only reward at game end based on remaining pegs. <em>Result: Didn't help.</em></li>
<li><strong>Progressive <a href="https://gibberblot.github.io/rl-notes/single-agent/reward-shaping.html">reward shaping</a></strong> — Exponential scaling where fewer remaining pegs = much higher reward. <em>Result: Marginal improvement.</em></li>
<li><strong>Higher exploration</strong> — Increased <a href="https://costa.sh/blog-the-role-of-entropy-in-policy-gradient.html">entropy coefficient</a> from 0.01 to 0.05. <em>Result: Didn't help.</em></li>
<li><strong><a href="https://lilianweng.github.io/posts/2020-01-29-curriculum-rl/">Curriculum learning</a></strong> — Start the agent on easier board states (fewer pegs) and gradually increase difficulty; a sketch follows this list. <em>Result: Significant improvement!</em></li>
</ol>
<!-- /wp:list -->
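<!-- wp:paragraph -->
<p>The curriculum itself can be as simple as starting each episode from a position that is already several random jumps into the game, then shrinking that head start as training progresses. A sketch, reusing board helpers like the <code>legal_moves</code>/<code>apply_move</code> functions from the first code block (the linear schedule and the 24-jump maximum are illustrative):</p>
<!-- /wp:paragraph -->

<!-- wp:code -->
<pre class="wp-block-code"><code>import random

def curriculum_start(full_pegs, legal_moves, apply_move, head_start):
    """Start from a position that is already `head_start` random jumps into the game."""
    pegs = full_pegs
    for _ in range(head_start):
        moves = list(legal_moves(pegs))
        if not moves:
            break
        pegs = apply_move(pegs, random.choice(moves))
    return pegs

def head_start_for(step, total_steps, max_head_start=24):
    """Anneal the head start from an easy endgame down to the full 32-peg game."""
    frac = min(1.0, step / total_steps)
    return round(max_head_start * (1.0 - frac))</code></pre>
<!-- /wp:code -->

<!-- wp:paragraph -->
<p>Early in training the agent only has to finish short endgames; by the end of the schedule it is playing the full 32-peg board.</p>
<!-- /wp:paragraph -->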
<!-- wp:paragraph -->
<p>Only curriculum learning significantly decreased loss. After 1 million timesteps with curriculum learning, I got down to 2 pegs—but still couldn't solve it completely.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2>Attempt 4: AlphaZero</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>The neural network needed to see examples of winning solutions to converge on one. I introduced the <strong><a href="https://en.wikipedia.org/wiki/AlphaZero">AlphaZero</a> architecture</strong>, which combines neural networks with <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo Tree Search</a>. MCTS runs simulations of possible move sequences, giving the agent the ability to plan ahead and occasionally stumble onto winning games during training.</p>
<!-- /wp:paragraph -->
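<!-- wp:paragraph -->
<p>Stripped to its core, each MCTS simulation walks the tree with the PUCT rule, expands a leaf using the network's policy prior, and backs the value estimate up the path. The sketch below is a compressed illustration rather than the project's actual search code; the <code>c_puct</code> constant and the function names are mine. Since Hi-Q is a single-player game, values are backed up without the sign flip a two-player implementation would need.</p>
<!-- /wp:paragraph -->

<!-- wp:code -->
<pre class="wp-block-code"><code>import math

class Node:
    def __init__(self, prior):
        self.prior = prior          # P(s, a) from the policy head
        self.visits = 0             # N(s, a)
        self.value_sum = 0.0        # W(s, a)
        self.children = {}          # maps action id to child Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent, child, c_puct=1.5):
    """AlphaZero-style selection score: exploitation (Q) plus prior-weighted exploration."""
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + u

def run_simulation(root, root_state, step_fn, legal_fn, net_fn):
    """One simulation: select down the tree, expand with the network, back up the value."""
    node, state, path = root, root_state, [root]
    while node.children:                                   # selection
        action, node = max(node.children.items(),
                           key=lambda kv: puct_score(path[-1], kv[1]))
        state = step_fn(state, action)
        path.append(node)
    priors, value = net_fn(state)                          # expansion + evaluation
    for action in legal_fn(state):
        node.children[action] = Node(priors[action])
    for visited in path:                                   # backup (single-player: no sign flip)
        visited.visits += 1
        visited.value_sum += value</code></pre>
<!-- /wp:code -->

<!-- wp:paragraph -->
<p>Running many of these simulations per move and then picking the most-visited child is what lets the agent plan ahead; the simulation budget is the knob discussed next.</p>
<!-- /wp:paragraph -->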
<!-- wp:paragraph -->
<p>Two changes made the difference:</p>
<!-- /wp:paragraph -->
<!-- wp:list {"ordered":true} -->
<ol>
<li><strong>MCTS depth</strong> became the most important hyperparameter—more simulations meant better planning and faster convergence</li>
<li><strong>Full-board mixing</strong>—even during curriculum learning, I mixed in 20% full-board games so the model always trained on the actual problem (sketched below)</li>
</ol>
<!-- /wp:list -->
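<!-- wp:paragraph -->
<p>In the training loop, both changes reduce to a pair of knobs, roughly like this (the names and the simulation count are illustrative; only the 20% mix is the value I actually used):</p>
<!-- /wp:paragraph -->

<!-- wp:code -->
<pre class="wp-block-code"><code>import random

NUM_SIMULATIONS = 400    # the "MCTS depth" knob: more simulations, better planning, slower self-play
FULL_BOARD_MIX = 0.20    # fraction of games that always start from the real 32-peg problem

def sample_start_state(full_board_fn, curriculum_fn):
    """Pick the episode's starting position: 20% full board, 80% curriculum state."""
    use_full = random.choices([True, False],
                              weights=[FULL_BOARD_MIX, 1.0 - FULL_BOARD_MIX])[0]
    return full_board_fn() if use_full else curriculum_fn()</code></pre>
<!-- /wp:code -->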
<!-- wp:paragraph -->
<p>I fired off a 6-hour training job on an <a href="https://www.nvidia.com/en-us/data-center/h100/">H100</a> and went to my family Christmas party. When I woke up, Santa had delivered an optimal solution to Hi-Q.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2>Reflections</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>The integer programming paper solved this in 15 minutes on a 2001 laptop CPU. Mine took 6+ hours on a 2025 H100. Hardly an improvement in efficiency! But I learned about curriculum learning, and I ended up with an artifact I actually understand—something I wouldn't have gotten from the integer programming route, even with vibe coding.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>This project taught me that vibe coding lets me punch above my weight class with neural network architectures. For one-off projects, it's great: I can vibe-code slop, and as long as my network converges on something useful, that's all that matters. I'm skeptical this strategy scales to more sophisticated projects—deep learning has a lot of plumbing and plenty of failure modes. But for off-the-shelf algorithms in PyTorch, <a href="https://docs.anthropic.com/en/docs/claude-code">Claude Code</a> acted as faster arms and legs to execute on my ideas and test hypotheses.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>The code is <a href="https://github.com/r33drichards/hiq-ppo">on GitHub</a>, and you can <a href="https://r33drichards.github.io/hiq-ppo/">view the solution on GitHub Pages</a>.</p>
<!-- /wp:paragraph -->