Benchmark harness that uses LLM agents to solve shell scripting tasks in both Bash and Lush, then compares correctness and code quality.

- CLI with `run`, `run-all`, `list-tasks`, `report`, and `export` commands
- Agent loop with retry support via Anthropic Claude provider
- Test harness executing solutions in sandboxed subprocesses
- LLM-driven questionnaire for subjective code quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write from scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
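The agent loop with retry support could be sketched roughly as follows. This is a minimal illustration, not the harness's actual API: the names `solve_with_retries`, `Attempt`, and the `generate`/`test` callables are hypothetical stand-ins for the provider call and the sandboxed test run.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    code: str
    passed: bool
    feedback: str

def solve_with_retries(generate, test, max_attempts=3):
    """Ask the model for a solution, re-prompting with test feedback
    until the tests pass or the attempt budget runs out.

    generate(feedback) -> code   # wraps the LLM provider call
    test(code) -> (passed, feedback)  # wraps the sandboxed test run
    """
    feedback = ""
    attempts = []
    for _ in range(max_attempts):
        code = generate(feedback)       # first call gets empty feedback
        passed, feedback = test(code)   # failure feedback drives the retry
        attempts.append(Attempt(code, passed, feedback))
        if passed:
            break
    return attempts
```

Keeping every attempt (rather than only the final one) makes it easy to report how many retries each task needed per language.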
Example task definition (TOML, 22 lines, 354 B):
```toml
name = "reverse_string"
category = "a"
description = """
Read a single line from stdin and print it reversed to stdout.
"""

[[test_cases]]
stdin = "hello"
expected_stdout = "olleh"

[[test_cases]]
stdin = "abcdef"
expected_stdout = "fedcba"

[[test_cases]]
stdin = "racecar"
expected_stdout = "racecar"

[[test_cases]]
stdin = "a"
expected_stdout = "a"
```