Initial commit: Lush vs Bash AI benchmarking framework

Benchmark harness that uses LLM agents to solve shell scripting tasks in both Bash and Lush, then compares correctness and code quality.

- CLI with run, run-all, list-tasks, report, and export commands
- Agent loop with retry support via Anthropic Claude provider
- Test harness executing solutions in sandboxed subprocesses
- LLM-driven questionnaire for subjective code quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write-from-scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
2026-03-29 17:56:30 +01:00
commit be8d657b24
33 changed files with 3302 additions and 0 deletions
--- a/tasks/category_a/fizzbuzz.toml
+++ b/tasks/category_a/fizzbuzz.toml
@@ -0,0 +1,38 @@
+name = "fizzbuzz"
+category = "a"
+description = """
+Read a single integer N from stdin. Print numbers from 1 to N, one per line.
+For multiples of 3, print "Fizz" instead of the number.
+For multiples of 5, print "Buzz" instead of the number.
+For multiples of both 3 and 5, print "FizzBuzz" instead of the number.
+"""
+
+[[test_cases]]
+stdin = "15"
+expected_stdout = """1
+2
+Fizz
+4
+Buzz
+Fizz
+7
+8
+Fizz
+Buzz
+11
+Fizz
+13
+14
+FizzBuzz"""
+
+[[test_cases]]
+stdin = "5"
+expected_stdout = """1
+2
+Fizz
+4
+Buzz"""
+
+[[test_cases]]
+stdin = "1"
+expected_stdout = "1"
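
For reference, the behaviour this task file asks for can be sketched in Python as below. This is only an illustration of the specification, not the harness's own code or an agent-produced solution. The names `fizzbuzz_lines`, `solve`, and `CASES` are hypothetical, and the sketch assumes expected output is compared without a trailing newline, which the task file suggests but the diff does not confirm.

```python
def fizzbuzz_lines(n: int) -> list[str]:
    """Return the FizzBuzz output lines for 1..n (hypothetical helper)."""
    lines = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            # Multiple of both 3 and 5; checked first so it wins over the single cases.
            lines.append("FizzBuzz")
        elif i % 3 == 0:
            lines.append("Fizz")
        elif i % 5 == 0:
            lines.append("Buzz")
        else:
            lines.append(str(i))
    return lines


def solve(stdin_text: str) -> str:
    """Parse N from the stdin text and return the output joined by newlines."""
    n = int(stdin_text.strip())
    # Assumption: no trailing newline, matching expected_stdout in the task file.
    return "\n".join(fizzbuzz_lines(n))


# (stdin, expected_stdout) pairs copied from two of the test cases above.
CASES = [
    ("5", "1\n2\nFizz\n4\nBuzz"),
    ("1", "1"),
]

if __name__ == "__main__":
    for stdin_text, expected in CASES:
        assert solve(stdin_text) == expected
    print("all cases passed")
```

Checking divisibility by 15 first keeps the three rules mutually exclusive, so the "both 3 and 5" case cannot be shadowed by the single-multiple branches.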