- Replace 6 compound Likert questions with 12 atomic ones grouped by
dimension (syntax, expressiveness, data/IO, errors, overall); drop the
free-form question. Responses are now stored as ints, not strings.
- Add a back-compat layer that maps legacy keys to the new dimensions so
existing results still render.
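
A minimal sketch of the new questionnaire shape and the legacy-key mapping. The question IDs and legacy keys below are illustrative assumptions, not the harness's actual identifiers:

```python
# Illustrative sketch only: question IDs and legacy keys are assumptions,
# not the harness's actual identifiers.
from typing import Dict, List

# 12 atomic Likert questions, grouped by dimension.
QUESTIONS: Dict[str, List[str]] = {
    "syntax": ["syntax_readability", "syntax_consistency"],
    "expressiveness": ["expr_conciseness", "expr_abstractions", "expr_composability"],
    "data_io": ["io_structured_data", "io_text_processing"],
    "errors": ["err_detection", "err_messages", "err_recovery"],
    "overall": ["overall_confidence", "overall_preference"],
}

# Legacy compound-question keys mapped onto the new dimensions so that
# previously saved results still render in reports.
LEGACY_KEY_TO_DIMENSION: Dict[str, str] = {
    "readability": "syntax",
    "expressiveness": "expressiveness",
    "data_handling": "data_io",
    "io_support": "data_io",
    "error_handling": "errors",
    "overall_impression": "overall",
}

def normalize_response(value) -> int:
    """Coerce a Likert response to an int; legacy result files stored strings."""
    return int(value)
```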
- Parallelize run-all with ThreadPoolExecutor (configurable workers)
and add a thread-safe min-request-interval rate limiter to the
Anthropic provider.
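
A minimal sketch of the throttling pattern, not the actual provider code; the class and function names here (`MinIntervalRateLimiter`, `run_task`) are hypothetical:

```python
# Thread-safe limiter enforcing a minimum interval between requests,
# shared by the workers of a ThreadPoolExecutor. Names are illustrative.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class MinIntervalRateLimiter:
    """Block callers so successive acquire() calls are at least
    min_interval seconds apart, even across threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def acquire(self) -> None:
        # Holding the lock while sleeping serializes callers, which is the
        # point: no two requests start closer than min_interval apart.
        with self._lock:
            now = time.monotonic()
            if now < self._next_allowed:
                time.sleep(self._next_allowed - now)
                now = time.monotonic()
            self._next_allowed = now + self.min_interval

limiter = MinIntervalRateLimiter(min_interval=1.0)

def run_task(task_name: str) -> str:
    limiter.acquire()              # throttle before each API call
    # ... issue the LLM request for this task here ...
    return task_name

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:  # worker count configurable
        results = list(pool.map(run_task, ["task_a", "task_b", "task_c"]))
```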
- Add new tasks: path_normalizer, todo_manager, currency_converter,
locale_weather_url, network_info_parser, url_normalizer.
- Replace the category_a/category_b directories with algorithm, pipeline,
environment, filesystem, and process.
- Add a separate mode field (solve/convert) to decouple orchestration from
capability grouping (see the sketch below).
- Add per-category summary and questionnaire breakdowns to both the terminal
report and the HTML export.
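
A hypothetical task-definition shape showing how mode sits alongside category. The dataclass and the example category/mode assignments are illustrative; only the category and mode values themselves come from the changelog:

```python
# Hypothetical task-definition shape; field layout and example assignments
# are assumptions, not the harness's actual schema.
from dataclasses import dataclass
from typing import Literal

Category = Literal["algorithm", "pipeline", "environment", "filesystem", "process"]
Mode = Literal["solve", "convert"]  # solve: write from scratch; convert: verify Bash, port to Lush

@dataclass(frozen=True)
class TaskSpec:
    name: str
    category: Category  # capability grouping, drives per-category summaries
    mode: Mode          # orchestration path, independent of the grouping

EXAMPLE_TASKS = [
    TaskSpec("path_normalizer", category="filesystem", mode="solve"),
    TaskSpec("currency_converter", category="pipeline", mode="convert"),
]
```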
Benchmark harness that uses LLM agents to solve shell scripting tasks
in both Bash and Lush, then compares correctness and code quality.
- CLI with run, run-all, list-tasks, report, and export commands
- Agent loop with retry support via Anthropic Claude provider
- Test harness executing solutions in sandboxed subprocesses (sketched after this list)
- LLM-driven questionnaire for subjective code quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write-from-scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
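
A rough sketch of the sandboxed-subprocess execution mentioned in the feature list. The function name, timeout value, and temp-directory isolation are assumptions rather than the harness's actual interface:

```python
# Run a candidate solution in an isolated subprocess; illustrative only.
import subprocess
import tempfile
from pathlib import Path

def run_solution(script: str, interpreter: str, stdin: str = "", timeout: int = 30):
    """Run a solution in its own temporary working directory and return
    (exit_code, stdout, stderr). Raises subprocess.TimeoutExpired on hang."""
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "solution"
        path.write_text(script)
        proc = subprocess.run(
            [interpreter, str(path)],
            input=stdin,
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=workdir,  # confine filesystem side effects to the temp dir
        )
    return proc.returncode, proc.stdout, proc.stderr
```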