
Evaluating Deep Agents with LangSmith on AWS

The Point: AWS and LangChain have published a guide showing how developers can systematically evaluate and monitor AI agents. It combines LangSmith on AWS, Amazon Nova 2 Lite, and structured evaluation patterns to make complex multi-step agents more reliable, from development to production.

Validating the behavior of AI agents before production deployment is one of the greatest challenges in applied AI. LangSmith on AWS provides an evaluation framework to catch agent failures early, track them in production, and continuously improve agent reliability. A joint project by LangChain and AWS demonstrates how developers can systematically test and optimize their deep agents.

Agents are non-deterministic and multi-step: errors in early stages can impact downstream results. A single faulty tool call can cause an entire workflow to fail. This practical guide combines insights from LangChain’s work on agent evaluations with Anthropic’s guidance on demystifying evaluations.

The guide covers five evaluation patterns for deep agents, shows how to build offline evaluations with pytest and LangSmith, and explains how to configure online monitoring for production. A text-to-SQL deep agent built on Amazon Bedrock serves as the running example.
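To make the idea of an offline, pytest-style evaluation for a text-to-SQL agent concrete, here is a minimal sketch. It is not the guide's code: the `orders` table, the `fake_text_to_sql_agent` stub, and the `rows_match` helper are all hypothetical, and the real agent would call a model on Amazon Bedrock instead of returning a fixed query. LangSmith logging is deliberately left out. The check compares query results rather than SQL strings, since two differently written queries can be equally correct.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
    )
    conn.commit()
    return conn


def fake_text_to_sql_agent(question: str) -> str:
    """Hypothetical stand-in for the real agent; always returns one fixed query."""
    return "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"


def rows_match(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    """Compare result sets, ignoring row order, instead of comparing SQL text."""
    got = sorted(conn.execute(generated_sql).fetchall())
    want = sorted(conn.execute(reference_sql).fetchall())
    return got == want


def test_total_per_customer():
    """A plain pytest test: defined input, reference query, outcome-based check."""
    conn = make_db()
    sql = fake_text_to_sql_agent("What is the total order amount per customer?")
    reference = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    assert rows_match(conn, sql, reference)
```

Saved in a file whose name starts with `test_`, the function is collected by running `pytest` as usual. Swapping the stub for the real agent call is the only change needed to turn the sketch into an actual check.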

The new Amazon Nova 2 Lite model is a fast, cost-effective reasoning model that supports Extended Thinking with configurable budget tiers (low, medium, high). It accepts text, image, video, and document inputs with a 1-million-token context window and is particularly suited for agent-based tasks.

Three properties make agents particularly hard to evaluate: non-determinism, since agent behavior varies between runs; multi-step logic, since every additional step and component adds complexity; and outcome orientation, since what matters is not just what the agent says in its answer but the actual result achieved in the environment. An evaluation consists of tests with defined inputs, multiple attempts per task, scoring logic for different dimensions, and complete transcripts for analysis.
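The four ingredients named above (defined inputs, multiple attempts per task, per-dimension scoring, and stored transcripts) can be sketched as a small harness. This is an illustrative assumption, not the structure used by LangSmith or the guide: `Task`, `Trial`, `flaky_agent`, `score`, `run_eval`, and `pass_rate` are hypothetical names, and the stub agent simulates non-determinism with a seeded random generator.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """A test case with a defined input and an expected outcome."""
    name: str
    prompt: str
    expected: str


@dataclass
class Trial:
    """One attempt at a task: output, full transcript, and per-dimension scores."""
    output: str
    transcript: list[str]
    scores: dict[str, float]


def flaky_agent(prompt: str, rng: random.Random) -> tuple[str, list[str]]:
    """Hypothetical stand-in agent that answers correctly only some of the time."""
    transcript = [f"user: {prompt}"]
    answer = "4" if rng.random() < 0.7 else "5"
    transcript.append(f"agent: {answer}")
    return answer, transcript


def score(task: Task, output: str, transcript: list[str]) -> dict[str, float]:
    """Score one trial along two illustrative dimensions."""
    return {
        "correct": float(output.strip() == task.expected),
        "within_step_budget": float(len(transcript) <= 4),
    }


def run_eval(
    tasks: list[Task],
    agent: Callable[[str, random.Random], tuple[str, list[str]]],
    attempts: int = 5,
    seed: int = 0,
) -> dict[str, list[Trial]]:
    """Run every task several times and keep each transcript for later analysis."""
    rng = random.Random(seed)
    results: dict[str, list[Trial]] = {}
    for task in tasks:
        trials = []
        for _ in range(attempts):
            output, transcript = agent(task.prompt, rng)
            trials.append(Trial(output, transcript, score(task, output, transcript)))
        results[task.name] = trials
    return results


def pass_rate(trials: list[Trial], dimension: str = "correct") -> float:
    """Fraction of attempts that passed on the given scoring dimension."""
    return sum(t.scores[dimension] for t in trials) / len(trials)
```

Calling `run_eval([Task("add", "What is 2 + 2?", "4")], flaky_agent)` returns five trials for the task. Reporting `pass_rate` over those trials, rather than a single pass or fail, is one simple way to account for run-to-run variation.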


Source: aws.amazon.com
