- Replace 6 compound Likert questions with 12 atomic ones grouped by
dimension (syntax, expressiveness, data/IO, errors, overall); drop
the free-form question. Responses are now stored as ints, not strings.
- Back-compat layer maps legacy keys to new dimensions so existing
results still render.
- Parallelize run-all with ThreadPoolExecutor (configurable workers)
and add a thread-safe min-request-interval rate limiter to the
Anthropic provider.
- Add new tasks: path_normalizer, todo_manager, currency_converter,
locale_weather_url, network_info_parser, url_normalizer.
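The back-compat layer described above can be sketched as a key-translation table applied when loading stored results. This is a minimal illustration; the legacy key names and the `migrate_scores` helper are hypothetical, not taken from the harness:

```python
# Hypothetical mapping from legacy compound-question keys to the new
# atomic dimension keys, so previously saved results still render.
LEGACY_KEY_TO_DIMENSION = {
    "readability_and_syntax": "syntax",
    "expressiveness_and_power": "expressiveness",
    "io_handling": "data_io",
    "error_messages": "errors",
    "overall_impression": "overall",
}

def migrate_scores(legacy: dict) -> dict:
    """Translate legacy keys and coerce string responses to ints."""
    return {
        LEGACY_KEY_TO_DIMENSION.get(key, key): int(value)
        for key, value in legacy.items()
    }
```

Unknown keys pass through unchanged, so new-format results can be fed through the same path.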
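A thread-safe min-request-interval limiter of the kind described above can be built from a lock and a monotonic clock. The sketch below is illustrative only (the class name and interval value are assumptions); each caller reserves the next send slot under the lock, then sleeps outside it so waiting threads do not serialize on the lock itself:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class MinIntervalRateLimiter:
    """Enforce a minimum delay between requests across threads (sketch)."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic timestamp of the next free slot

    def acquire(self) -> None:
        # Reserve a slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval

        if wait > 0:
            time.sleep(wait)

# Usage sketch: worker count and interval are configurable assumptions.
limiter = MinIntervalRateLimiter(min_interval=0.5)

def rate_limited_call(task):
    limiter.acquire()
    # ... issue the provider request for `task` here ...
```

Reserving the slot before sleeping guarantees spacing even when many workers arrive at once, which is the property a shared provider client needs under `ThreadPoolExecutor`.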

Benchmark harness that uses LLM agents to solve shell scripting tasks
in both Bash and Lush, then compares correctness and code quality.
- CLI with run, run-all, list-tasks, report, and export commands
- Agent loop with retry support via the Anthropic Claude provider
- Test harness executing solutions in sandboxed subprocesses
- LLM-driven questionnaire for subjective code quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write-from-scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
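The sandboxed-subprocess execution mentioned above can be sketched as follows. This is a minimal illustration under assumptions, not the harness's actual code: the `run_solution` name, the scratch-directory layout, and the stripped-down environment are all hypothetical:

```python
import os
import subprocess
import tempfile
from pathlib import Path

def run_solution(script: str, interpreter: str = "bash",
                 timeout: float = 10.0):
    """Run a candidate solution in a throwaway directory with a timeout.

    Returns (returncode, stdout, stderr); returncode is None on timeout.
    Hypothetical sketch of the harness's sandboxing approach.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "solution.sh"
        path.write_text(script)
        try:
            proc = subprocess.run(
                [interpreter, str(path)],
                cwd=workdir,               # confine file writes to the scratch dir
                capture_output=True,
                text=True,
                timeout=timeout,           # kill runaway solutions
                env={"PATH": os.environ.get("PATH", "")},  # minimal environment
            )
            return proc.returncode, proc.stdout, proc.stderr
        except subprocess.TimeoutExpired:
            return None, "", "timeout"
```

A temporary working directory plus a timeout covers the common failure modes (stray files, infinite loops); it is process-level isolation only, not a security boundary.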