Agent evaluations are more complex than traditional LLM tests because they involve multiple turns, tool usage, and state changes; the key is distinguishing between transcript (recorded interactions) and outcome (actual final state) to create meaningful assessments.