Initial commit: Lush vs Bash AI benchmarking framework

Benchmark harness that uses LLM agents to solve shell scripting tasks
in both Bash and Lush, then compares correctness and code quality.

- CLI with run, run-all, list-tasks, report, and export commands
- Agent loop with retry support via Anthropic Claude provider
- Test harness executing solutions in sandboxed subprocesses
- LLM-driven questionnaire for subjective code quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write-from-scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
Cormac Shannon
2026-03-29 17:56:30 +01:00
commit be8d657b24
33 changed files with 3302 additions and 0 deletions


@@ -0,0 +1,24 @@
name = "two_sum"
category = "a"
description = """
Read input from stdin. The first line contains a target integer.
The second line contains space-separated integers (the array).
Find two indices (0-based) such that the numbers at those indices add up to the target.
Print the two indices on a single line, space-separated, smaller index first.
There is exactly one solution.
"""
[[test_cases]]
stdin = """9
2 7 11 15"""
expected_stdout = "0 1"
[[test_cases]]
stdin = """6
3 2 4"""
expected_stdout = "1 2"
[[test_cases]]
stdin = """6
3 3"""
expected_stdout = "0 1"
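For reference, the task above can be checked against a small single-pass solver. The following is a minimal sketch in Python, used here only for illustration since the benchmark itself targets Bash and Lush. The function name `solve` and the `CASES` list are hypothetical and not part of the committed harness; the inputs and expected outputs are copied from the test cases above.

```python
def solve(stdin_text: str) -> str:
    """Return the two 0-based indices whose values sum to the target.

    Input format (from the task description): line 1 is the target,
    line 2 is the space-separated array.
    """
    lines = stdin_text.strip().splitlines()
    target = int(lines[0])
    numbers = [int(token) for token in lines[1].split()]

    seen = {}  # value -> index of its first occurrence
    for index, value in enumerate(numbers):
        complement = target - value
        if complement in seen:
            # seen[complement] is always smaller than index,
            # so the smaller index is printed first.
            return f"{seen[complement]} {index}"
        # Keep the first occurrence so duplicates (e.g. "3 3") still pair up.
        seen.setdefault(value, index)
    raise ValueError("no pair sums to the target")


# Hypothetical mirror of the [[test_cases]] entries in the task file.
CASES = [
    ("9\n2 7 11 15", "0 1"),
    ("6\n3 2 4", "1 2"),
    ("6\n3 3", "0 1"),
]

if __name__ == "__main__":
    for stdin_text, expected in CASES:
        actual = solve(stdin_text)
        status = "ok" if actual == expected else "FAIL"
        print(f"{status}: expected {expected!r}, got {actual!r}")
```

The third case is the one most likely to trip a naive solution: both values are equal, so the two indices must be distinct positions rather than the same element used twice.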