Benchmark harness that uses LLM agents to solve shell scripting tasks in both Bash and Lush, then compares correctness and code quality.

- CLI with `run`, `run-all`, `list-tasks`, `report`, and `export` commands
- Agent loop with retry support via an Anthropic Claude provider (see the loop sketch below)
- Test harness executing solutions in sandboxed subprocesses (sketched after the loop)
- LLM-driven questionnaire for subjective code-quality evaluation
- HTML report export with charts (matplotlib)
- 8 Category A tasks (write from scratch in both languages)
- 4 Category B tasks (verify provided Bash, convert to Lush)
- Lush language reference for agent context
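A rough sketch of what the retrying agent loop could look like, written against the `Message` and `LLMProvider` types defined in the provider module below. The `run_tests` hook, prompt wording, and attempt count are illustrative assumptions, not the harness's actual API.

```python
# Sketch of the agent loop with retries (hypothetical names; the real
# harness's API may differ). Failing test output is fed back to the model
# as a new user turn so it can repair its own solution.
def solve_task(provider: LLMProvider, task_prompt: str, max_attempts: int = 3) -> str | None:
    messages = [Message(role="user", content=task_prompt)]
    for _ in range(max_attempts):
        solution = provider.send(messages, system="You write shell scripts.")
        ok, feedback = run_tests(solution)  # hypothetical test-harness hook
        if ok:
            return solution
        # Keep the failed attempt in context and ask for a fix.
        messages.append(Message(role="assistant", content=solution))
        messages.append(Message(role="user", content=f"Tests failed:\n{feedback}\nPlease fix the script."))
    return None  # gave up after max_attempts
```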
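The sandboxed execution could be built on `subprocess` with a timeout and a throwaway working directory; here is a minimal sketch under those assumptions (the function name and return shape are invented for illustration):

```python
# Sketch: run a candidate script in a sandboxed subprocess (illustrative).
import subprocess
import tempfile
from pathlib import Path


def run_sandboxed(script: str, interpreter: str = "bash", timeout: int = 10) -> tuple[bool, str]:
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "solution.sh"
        path.write_text(script)
        try:
            proc = subprocess.run(
                [interpreter, str(path)],
                cwd=workdir,          # confine filesystem side effects to the temp dir
                capture_output=True,
                text=True,
                timeout=timeout,      # kill runaway scripts
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return proc.returncode == 0, proc.stdout + proc.stderr
```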
The provider abstraction both sketches build on:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    """One turn of the agent conversation."""

    role: str  # "user" or "assistant"
    content: str


class LLMProvider(Protocol):
    """Structural interface a concrete LLM backend must satisfy."""

    def send(self, messages: list[Message], system: str = "") -> str: ...

    @property
    def model_name(self) -> str: ...
```
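For concreteness, one way a Claude backend might satisfy this protocol using the official `anthropic` SDK; the model id, `max_tokens` value, and class name are illustrative assumptions rather than the project's actual provider:

```python
# Sketch: a concrete LLMProvider backed by the anthropic SDK (illustrative).
import anthropic


class ClaudeProvider:
    def __init__(self, model: str = "claude-3-5-sonnet-latest") -> None:  # model id is an assumed default
        self._client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        self._model = model

    def send(self, messages: list[Message], system: str = "") -> str:
        response = self._client.messages.create(
            model=self._model,
            max_tokens=4096,
            system=system or anthropic.NOT_GIVEN,  # omit the system prompt when empty
            messages=[{"role": m.role, "content": m.content} for m in messages],
        )
        # Join any text blocks in the response into a single string.
        return "".join(block.text for block in response.content if block.type == "text")

    @property
    def model_name(self) -> str:
        return self._model
```

Because `LLMProvider` is a `Protocol`, `ClaudeProvider` never needs to inherit from it; matching the method signatures is enough for static type checkers.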