- Replace 6 compound Likert questions with 12 atomic ones grouped by
dimension (syntax, expressiveness, data/IO, errors, overall); drop the
free-form question. Responses are now stored as ints, not strings.
- Add a back-compat layer that maps legacy keys to the new dimensions so
existing results still render.
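
A minimal sketch of the new questionnaire shape and the legacy-key mapping. The question IDs and legacy keys below are illustrative assumptions, not the harness's actual identifiers:

```python
# Illustrative sketch only: question IDs and legacy keys are assumptions,
# not the harness's actual identifiers.
from typing import Dict, List

# 12 atomic Likert questions, grouped by dimension.
QUESTIONS: Dict[str, List[str]] = {
    "syntax": ["syntax_readability", "syntax_consistency"],
    "expressiveness": ["expr_conciseness", "expr_abstractions", "expr_composability"],
    "data_io": ["io_structured_data", "io_text_processing"],
    "errors": ["err_detection", "err_messages", "err_recovery"],
    "overall": ["overall_confidence", "overall_preference"],
}

# Legacy compound-question keys mapped onto the new dimensions so that
# previously saved results still render in reports.
LEGACY_KEY_TO_DIMENSION: Dict[str, str] = {
    "readability": "syntax",
    "expressiveness": "expressiveness",
    "data_handling": "data_io",
    "io_support": "data_io",
    "error_handling": "errors",
    "overall_impression": "overall",
}

def normalize_response(value) -> int:
    """Coerce a Likert response to an int; legacy result files stored strings."""
    return int(value)
```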
- Parallelize run-all with ThreadPoolExecutor (configurable workers)
and add a thread-safe min-request-interval rate limiter to the
Anthropic provider.
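
A minimal sketch of the throttling pattern, not the actual provider code; the class and function names here (`MinIntervalRateLimiter`, `run_task`) are hypothetical:

```python
# Thread-safe limiter enforcing a minimum interval between requests,
# shared by the workers of a ThreadPoolExecutor. Names are illustrative.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class MinIntervalRateLimiter:
    """Block callers so successive acquire() calls are at least
    min_interval seconds apart, even across threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def acquire(self) -> None:
        # Holding the lock while sleeping serializes callers, which is the
        # point: no two requests start closer than min_interval apart.
        with self._lock:
            now = time.monotonic()
            if now < self._next_allowed:
                time.sleep(self._next_allowed - now)
                now = time.monotonic()
            self._next_allowed = now + self.min_interval

limiter = MinIntervalRateLimiter(min_interval=1.0)

def run_task(task_name: str) -> str:
    limiter.acquire()              # throttle before each API call
    # ... issue the LLM request for this task here ...
    return task_name

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:  # worker count configurable
        results = list(pool.map(run_task, ["task_a", "task_b", "task_c"]))
```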
- Add new tasks: path_normalizer, todo_manager, currency_converter,
locale_weather_url, network_info_parser, url_normalizer.
- Replace the category_a/category_b directories with algorithm, pipeline,
environment, filesystem, and process.
- Add a separate mode field (solve/convert) to decouple orchestration from
capability grouping (see the sketch below).
- Add per-category summary and questionnaire breakdowns to both the terminal
report and the HTML export.
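
A hypothetical task-definition shape showing how mode sits alongside category. The dataclass and the example category/mode assignments are illustrative; only the category and mode values themselves come from the changelog:

```python
# Hypothetical task-definition shape; field layout and example assignments
# are assumptions, not the harness's actual schema.
from dataclasses import dataclass
from typing import Literal

Category = Literal["algorithm", "pipeline", "environment", "filesystem", "process"]
Mode = Literal["solve", "convert"]  # solve: write from scratch; convert: verify Bash, port to Lush

@dataclass(frozen=True)
class TaskSpec:
    name: str
    category: Category  # capability grouping, drives per-category summaries
    mode: Mode          # orchestration path, independent of the grouping

EXAMPLE_TASKS = [
    TaskSpec("path_normalizer", category="filesystem", mode="solve"),
    TaskSpec("currency_converter", category="pipeline", mode="convert"),
]
```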
Benchmark harness that uses LLM agents to solve shell scripting tasks
in both Bash and Lush, then compares correctness and code quality.
- CLI with run, run-all, list-tasks, report, and export commands
- Agent loop with retry support via Anthropic Claude provider
- Test harness executing solutions in sandboxed subprocesses (sketched after this list)
- LLM-driven questionnaire for subjective code quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write-from-scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
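
A rough sketch of the sandboxed-subprocess execution mentioned in the feature list. The function name, timeout value, and temp-directory isolation are assumptions rather than the harness's actual interface:

```python
# Run a candidate solution in an isolated subprocess; illustrative only.
import subprocess
import tempfile
from pathlib import Path

def run_solution(script: str, interpreter: str, stdin: str = "", timeout: int = 30):
    """Run a solution in its own temporary working directory and return
    (exit_code, stdout, stderr). Raises subprocess.TimeoutExpired on hang."""
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "solution"
        path.write_text(script)
        proc = subprocess.run(
            [interpreter, str(path)],
            input=stdin,
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=workdir,  # confine filesystem side effects to the temp dir
        )
    return proc.returncode, proc.stdout, proc.stderr
```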