Benchmark harness that uses LLM agents to solve shell scripting tasks in both Bash and Lush, then compares correctness and code quality.

- CLI with `run`, `run-all`, `list-tasks`, `report`, and `export` commands
- Agent loop with retry support via the Anthropic Claude provider (see the sketch below)
- Test harness executing solutions in sandboxed subprocesses (see the sketch after the pyproject.toml)
- LLM-driven questionnaire for subjective code quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write from scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
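The agent loop talks to Claude through the Anthropic Python SDK (pinned in the dependencies below). Below is a minimal sketch of what a retry-wrapped agent call might look like; the model name, retry count, and function name are illustrative assumptions, not the harness's actual implementation:

```python
import time

import anthropic

MODEL = "claude-sonnet-4-20250514"  # assumed model ID; the harness may configure this per run
MAX_RETRIES = 3                     # assumed retry budget


def ask_agent(prompt: str) -> str:
    """Send one prompt to Claude, retrying on transient API errors."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            message = client.messages.create(
                model=MODEL,
                max_tokens=4096,
                messages=[{"role": "user", "content": prompt}],
            )
            # Concatenate the text blocks from the response.
            return "".join(
                block.text for block in message.content if block.type == "text"
            )
        except (anthropic.APIConnectionError, anthropic.RateLimitError):
            if attempt == MAX_RETRIES:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff between attempts
    return ""  # unreachable; satisfies type checkers
```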
`pyproject.toml` (TOML, 8 lines, 203 B):
```toml
[project]
name = "lush-benchmarking"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = ["anthropic>=0.40.0", "matplotlib>=3.8.0"]
```
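The test harness mentioned in the feature list executes each generated solution in a sandboxed subprocess. A minimal standard-library sketch of that idea, assuming each task supplies an expected stdout and solutions run under a timeout in a throwaway directory (function, parameter, and interpreter names are illustrative):

```python
import subprocess
import tempfile
from pathlib import Path


def run_solution(script: str, interpreter: str, expected_stdout: str,
                 timeout: float = 30.0) -> bool:
    """Execute a candidate script in a temp dir and compare its stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        script_path = Path(workdir) / "solution"
        script_path.write_text(script)
        try:
            result = subprocess.run(
                [interpreter, str(script_path)],
                cwd=workdir,          # isolate filesystem side effects
                capture_output=True,
                text=True,
                timeout=timeout,      # kill runaway solutions
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


# Example: checking a trivial Bash solution; a Lush solution would pass the
# Lush interpreter's command name instead (assumed, not taken from the project).
if __name__ == "__main__":
    ok = run_solution('echo "hello"', "bash", "hello")
    print("pass" if ok else "fail")
```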