Initial commit: Lush vs Bash AI benchmarking framework
Benchmark harness that uses LLM agents to solve shell scripting tasks in both Bash and Lush, then compares correctness and code quality.

- CLI with run, run-all, list-tasks, report, and export commands
- Agent loop with retry support via Anthropic Claude provider
- Test harness executing solutions in sandboxed subprocesses
- LLM-driven questionnaire for subjective code quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write-from-scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
24
tasks/category_a/two_sum.toml
Normal file
@@ -0,0 +1,24 @@
name = "two_sum"
category = "a"
description = """
Read input from stdin. The first line contains a target integer.
The second line contains space-separated integers (the array).
Find two indices (0-based) such that the numbers at those indices add up to the target.
Print the two indices on a single line, space-separated, smaller index first.
There is exactly one solution.
"""

[[test_cases]]
stdin = """9
2 7 11 15"""
expected_stdout = "0 1"

[[test_cases]]
stdin = """6
3 2 4"""
expected_stdout = "1 2"

[[test_cases]]
stdin = """6
3 3"""
expected_stdout = "0 1"
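
For orientation, the task described in `two_sum.toml` can be sketched in Python as a reference for what a passing Bash or Lush solution has to compute. This is only an illustrative sketch, not part of the commit: the names `two_sum_indices` and `solve` are hypothetical, and the single-pass hash-map approach is one common way to solve the problem, not necessarily what the benchmark's agents produce.

```python
def two_sum_indices(target, nums):
    """Return (i, j) with i < j such that nums[i] + nums[j] == target."""
    seen = {}  # value -> earliest index at which it appeared
    for i, value in enumerate(nums):
        j = seen.get(target - value)
        if j is not None:
            # j was recorded earlier, so it is always the smaller index.
            return j, i
        seen.setdefault(value, i)
    raise ValueError("no pair sums to target")


def solve(stdin_text):
    """Parse the task's stdin format and return the expected stdout line."""
    lines = stdin_text.splitlines()
    target = int(lines[0])                      # first line: target integer
    nums = [int(tok) for tok in lines[1].split()]  # second line: the array
    i, j = two_sum_indices(target, nums)
    return f"{i} {j}"


# First test case from the task file.
print(solve("9\n2 7 11 15"))  # prints: 0 1
```

The third test case (`3 3`) is the reason the lookup happens before the current value is stored: a value must be able to pair with an earlier copy of itself, but never with its own index.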