Overview
Evaluating search APIs requires careful methodology to ensure fair, reproducible comparisons. This guide provides a framework for assessing Exa’s search capabilities across multiple dimensions:
- Retrieval Quality: Accuracy and relevance of returned results
- Latency: Response time from query to results
- Freshness: Ability to retrieve up-to-date information
- Cost Efficiency: Value delivered per API call
- Agentic Suitability: Performance in multi-step reasoning workflows
Exa is designed to excel across different use cases:
- Deep Research: Multi-hop queries requiring comprehensive context and query expansion
- Agentic Workflows: Complex tasks involving multiple search iterations and reasoning steps
- Low-Latency QA: Fast factual question-answering for real-time applications
- Semantic Discovery: Finding conceptually related content beyond keyword matching
Best Practice: Start with Defaults
The most important recommendation for fair evaluation: use Exa’s default settings.
Adding restrictive parameters (date filters, domain restrictions, text inclusion/exclusion) often causes agents to over-optimize in non-meaningful ways, unnecessarily limiting results and reducing quality without providing valuable insights. Unless your evaluation specifically tests a filtered use case, avoid adding constraints that don’t reflect real-world usage.
Recommended minimal configuration:
# Option 1: Use text with character limit (recommended for consistent comparisons)
exa.search_and_contents(
    query,
    type="fast",  # or "auto" (for deep search, see Option 2)
    num_results=10,
    text={"max_characters": 5000}
)

# Option 2: Use context string for RAG (single string with total max characters)
# Note: deep search may require context=True to return results
# exa.search_and_contents(
#     query,
#     type="deep",
#     additional_queries=["variation 1", "variation 2"],  # Optional query variations
#     num_results=10,
#     context={"max_characters": 20000}
# )

# Option 3: Use full text (may result in very long content)
# exa.search_and_contents(
#     query,
#     type="fast",
#     num_results=10,
#     text=True
# )
Setting a consistent max_characters ensures fair comparisons by standardizing content length across all queries. The context parameter returns a single RAG-ready string, while text returns individual content for each result. Note: Deep search may require context=True to return detailed summaries. Only add additional parameters (date filters, domain restrictions, etc.) when they’re essential to your specific evaluation objective.
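To make the difference concrete, here is a minimal sketch of how the two content modes are consumed downstream. It assumes the Python SDK exposes per-result content as result.text and, when context is requested, the combined string as an attribute on the response; verify the attribute name against your SDK version:
from exa_py import Exa

exa = Exa(api_key="YOUR_API_KEY")

# text: one capped string per result, useful for per-document processing
text_response = exa.search_and_contents(
    "impact of quantum computing on cryptography",
    type="auto",
    num_results=10,
    text={"max_characters": 5000}
)
per_result_docs = [r.text for r in text_response.results]

# context: a single RAG-ready string capped by a total character budget
context_response = exa.search_and_contents(
    "impact of quantum computing on cryptography",
    type="auto",
    num_results=10,
    context={"max_characters": 20000}
)
rag_context = context_response.context  # assumed attribute name for the combined string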
Compare Within Latency Classes
Critical: Always find the closest competitor in terms of P50 latency for meaningful comparisons.
Don’t compare systems with vastly different latency profiles — a 500ms API serves different use cases than a 5000ms API. Instead, benchmark within similar latency ranges:
- For Exa Fast (<500ms): Compare to other sub-1s APIs with similar latency
- For Exa Auto (~1s): Compare to mid-latency systems (800ms-1500ms)
- For Exa Deep (>2s): Compare to other multi-second agentic/research systems
Comparing across latency classes (e.g., Fast vs Deep) is not meaningful — they’re optimized for different requirements and use cases.
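As a rough way to confirm which peer group a configuration belongs in, measure its P50 latency over a small sample of representative queries before running the full benchmark. The sample queries and class thresholds below are illustrative assumptions, not part of the Exa API:
import statistics
import time

from exa_py import Exa

exa = Exa(api_key="YOUR_API_KEY")
sample_queries = [
    "who invented the telephone",
    "latest AI breakthroughs in 2025",
    "companies building climate tech solutions",
]

latencies_ms = []
for q in sample_queries:
    start = time.perf_counter()
    exa.search_and_contents(q, type="fast", num_results=10,
                            text={"max_characters": 5000})
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50 = statistics.median(latencies_ms)
if p50 < 1000:
    peer_group = "sub-1s APIs"
elif p50 < 2000:
    peer_group = "mid-latency systems (800ms-1500ms)"
else:
    peer_group = "multi-second agentic/research systems"
print(f"P50 latency: {p50:.0f}ms -> compare against {peer_group}")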
Search Types: Understanding the Quality-Latency Spectrum
Exa offers four search types, each optimized for different evaluation scenarios:
Fast Search
Optimized for: Speed-critical applications
Characteristics:
- Median latency: ~500ms (excluding network and optional features)
- Streamlined neural and reranking models
- Best for single-step factual queries
When to benchmark with Fast:
- Low-latency QA datasets (SimpleQA, WebWalkerQA)
- Real-time applications (voice agents, autocomplete)
- High-volume agentic workflows where latency accumulates
Example configuration:
result = exa.search_and_contents(
"latest AI breakthroughs in 2025",
type="fast",
num_results=10,
text={"max_characters": 5000}
)
Auto Search (Default)
Optimized for: Balanced performance without manual tuning
Characteristics:
- Median latency: ~1000ms
- Intelligently combines multiple search methods
- Reranker model adapts to query type
When to benchmark with Auto:
- General-purpose search evaluations
- When query types vary significantly
- Production workloads requiring versatility
Example configuration:
result = exa.search_and_contents(
"companies building climate tech solutions",
type="auto", # or omit - auto is default
num_results=10,
text={"max_characters": 5000}
)
Deep Search
Optimized for: Comprehensive research and multi-hop queries
Characteristics:
- Median latency: ~5000ms
- Automatic query expansion or custom query variations via additional_queries (Python) / additionalQueries (JavaScript)
- Rich contextual summaries for each result (requires context=True)
- Parallel search across multiple query formulations
Using query variations: Provide 2-3 query variations using additional_queries (Python) or additionalQueries (JavaScript) for best results. If not provided, Deep search will automatically generate variations.
When to benchmark with Deep:
- Agentic workflows (FRAMES, MultiLoKo, BrowseComp)
- Complex research tasks requiring multiple perspectives
- Scenarios where comprehensive coverage matters more than speed
Example configuration:
result = exa.search_and_contents(
"impact of quantum computing on cryptography",
type="deep",
additional_queries=[
"quantum threats to encryption",
"post-quantum cryptography research"
],
num_results=10,
text=True,
context=True # Required for `Deep` search summaries
)
Neural Search
Optimized for: Semantic similarity and exploratory queries
Characteristics:
- Embeddings-based next-link prediction
- Excels at thematic and conceptual relationships
- Incorporated into Fast and Auto search types
When to benchmark with Neural:
- Exploratory search tasks
- Finding semantically related content
- Long-form query matching
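Example configuration (a sketch; it assumes type="neural" can be requested directly when you want to isolate the embeddings-based behavior rather than reach it through Fast or Auto):
result = exa.search_and_contents(
    "startups working on brain-computer interfaces",
    type="neural",
    num_results=10,
    text={"max_characters": 5000}
)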
Evaluating Exa with Tool Calling
For evaluating Exa in agentic workflows where LLMs autonomously call search tools, proper tool calling setup is critical. Tool calling allows agents to dynamically invoke Exa search based on user queries and reasoning steps; a minimal tool definition is sketched after the list below.
When benchmarking agentic systems:
- Agents decide when to search: The LLM determines if/when to call Exa based on the task
- Dynamic parameter selection: Agents may choose search parameters (though we recommend minimal defaults)
- Multi-step workflows: Agents can make multiple Exa calls in sequence or parallel
- Keep tool definitions minimal: Don’t expose too many parameters to the agent — this encourages over-filtering
- Use consistent tool schemas: Standardize tool definitions across all evaluated systems
- Monitor tool call patterns: Track how often and when agents invoke Exa vs competitors
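Below is a minimal sketch of exposing Exa as a tool in an OpenAI-style tool-calling loop. The tool name, schema wrapper, and handler are illustrative assumptions; only the query is exposed to the agent, and search parameters stay fixed at the recommended defaults:
from exa_py import Exa

exa = Exa(api_key="YOUR_API_KEY")

# Minimal tool schema: the agent only supplies a query, with nothing to over-filter on
exa_search_tool = {
    "type": "function",
    "function": {
        "name": "exa_search",
        "description": "Search the web and return relevant page content.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."}
            },
            "required": ["query"],
        },
    },
}

def handle_exa_search(query: str) -> str:
    """Run the tool call with the recommended default configuration."""
    response = exa.search_and_contents(
        query,
        type="auto",
        num_results=10,
        text={"max_characters": 5000},
    )
    return "\n\n".join(r.text for r in response.results)
Reuse the same schema verbatim for every system under evaluation so that differences in tool definitions do not confound the comparison.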
Implementation Guides
See our detailed guides for implementing Exa with popular LLM providers; they show how to define Exa search as a tool and handle the agent's tool call responses properly.
Evaluation Methodology
Core Principles for Fair Benchmarking
To ensure reproducible, meaningful comparisons:
- Use default settings: Start with minimal parameters (type, num_results, text). Avoid adding restrictive filters (date ranges, domains, text inclusion/exclusion) unless they're core to your evaluation; these often cause over-optimization that artificially limits results without meaningful benefit.
- Standardize queries: Use identical query sets across all systems
- Control downstream processing: Use the same LLM for answer synthesis and grading
- Disable prompt engineering: Evaluate base API performance without query optimization
- Measure consistently: Track P50 latency, accuracy, and coverage using identical metrics
- Document configurations: Record all parameter settings for reproducibility (see the sketch below)
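As a sketch of the last principle, persist the full run configuration alongside the results so any comparison can be reproduced; the file name and field names below are illustrative:
import json

run_record = {
    "search_api": "exa",
    "search_config": {
        "type": "fast",
        "num_results": 10,
        "text": {"max_characters": 5000},
    },
    "dataset": "simpleqa.json",
    "synthesis_model": "your-llm-model-id",    # same model for every system under test
    "grading_model": "your-grading-model-id",  # same grader and rubric for every system
}

with open("run_config.json", "w") as f:
    json.dump(run_record, f, indent=2)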
Four-Phase Evaluation Workflow
Phase 1: Scope Definition
Define evaluation objectives:
- What capabilities are you testing? (factual QA, research depth, freshness, etc.)
- What latency requirements matter for your use case?
- Are you evaluating single-step retrieval or multi-step agentic workflows?
Phase 2: Dataset Selection
Choose benchmarks aligned with your scope (see Datasets section below):
- Low-latency factual QA: SimpleQA, WebWalkerQA
- Single-step retrieval: FRAMES (single-step slice), Seal0
- Agentic workflows: FRAMES (agentic slice), MultiLoKo, BrowseComp
- Complex reasoning: HLE (hard, long, emerging questions)
- Freshness: FreshQA, time-sensitive queries
Phase 3: Run Configurations
Execute standardized retrieval-synthesis-grading loop:
# 1. Retrieval step
results = exa.search_and_contents(
query,
type="fast", # or "auto", "deep"
num_results=10,
text={"max_characters": 5000}
)
# 2. Answer synthesis (downstream LLM restricted to retrieved context)
context = "\n\n".join([r.text for r in results.results])
answer = llm.generate(
f"Answer the question using only the provided context.\n\n"
f"Context: {context}\n\n"
f"Question: {query}\n\n"
f"Answer:"
)
# 3. Grading (LLM-based correctness evaluation)
grade = grading_llm.evaluate(
question=query,
expected_answer=ground_truth,
generated_answer=answer
)
# Returns: "correct", "partial", or "incorrect"
Phase 4: Results Analysis
Aggregate metrics:
- Accuracy: Percentage of correct answers
- Partial-credit accuracy: Weighted score (e.g., correct=1.0, partial=0.5, incorrect=0.0)
- Retrieval coverage: Percentage of queries where relevant information was retrieved
- P50 latency: Median response time across all queries
- Cost per query: Total API cost divided by number of queries
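A small sketch of computing these aggregates from per-query records; the record field names (grade, latency_ms, retrieved_relevant, cost_usd) are illustrative assumptions about your logging format:
import statistics

GRADE_WEIGHTS = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}

def aggregate(records):
    n = len(records)
    return {
        "accuracy": sum(r["grade"] == "correct" for r in records) / n,
        "partial_credit_accuracy": sum(GRADE_WEIGHTS[r["grade"]] for r in records) / n,
        "retrieval_coverage": sum(r["retrieved_relevant"] for r in records) / n,
        "p50_latency_ms": statistics.median(r["latency_ms"] for r in records),
        "cost_per_query": sum(r["cost_usd"] for r in records) / n,
    }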
Optimal Exa Settings for Evaluation
Configuration Parameters
| Parameter | Purpose | Evaluation Recommendations |
|---|---|---|
| type | Search method | Match to benchmark type (fast/auto/deep) |
| num_results | Number of results | Fix at 10 for consistency across comparisons |
| text | Retrieve full content | Set to true for RAG-style evaluation |
| context | Get AI-generated summaries | Set to true for Deep search |
| livecrawl | Real-time web fetching | Default "fallback" is recommended; use "preferred" for freshness tests |
| additional_queries / additionalQueries | Query variations (Deep only) | Provide 2-3 variations for best Deep search results. Use additional_queries in Python, additionalQueries in JavaScript |
Recommended Configuration Templates
Fast-Baseline Configuration
For latency-sensitive evaluations:
result = exa.search_and_contents(
query,
type="fast",
num_results=10,
text={"max_characters": 5000}
)
Auto-Quality Configuration
For balanced evaluations:
result = exa.search_and_contents(
query,
type="auto",
num_results=10,
text={"max_characters": 5000}
)
Deep-Comprehensive Configuration
For agentic and research evaluations:
result = exa.search_and_contents(
query,
type="deep",
additional_queries=["variation 1", "variation 2"],
num_results=10,
text=True,
context=True,
livecrawl="fallback"
)
Choosing Datasets for Evaluation
Benchmark-to-Search-Type Mapping
| Benchmark | Description | Recommended Search Type | Focus Area |
|---|---|---|---|
| SimpleQA | Single-step factual questions | Fast, Auto | Low-latency QA accuracy |
| FRAMES (single-step) | Straightforward retrieval tasks | Fast, Auto | Single-hop retrieval quality |
| FRAMES (agentic) | Multi-step reasoning requiring iteration | Deep | Agentic workflow performance |
| MultiLoKo | Multi-hop knowledge queries | Deep | Complex reasoning chains |
| BrowseComp | Web browsing comprehension | Deep | Context understanding |
| Seal0 | General search quality | Fast, Auto, Deep | Overall performance |
| WebWalkerQA | Navigation-style queries | Fast, Auto | Real-world search scenarios |
| HLE | Hard, long, emerging questions | Deep | Difficult edge cases |
| FreshQA | Time-sensitive queries | All types with livecrawl="preferred" | Freshness/timeliness |
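One way to operationalize this mapping is a simple lookup from benchmark to recommended configuration, so evaluation runs pick the right search type automatically. The dataset keys and condensed configurations below are illustrative, not an official mapping:
BENCHMARK_CONFIGS = {
    "simpleqa":       {"type": "fast", "num_results": 10, "text": {"max_characters": 5000}},
    "webwalkerqa":    {"type": "fast", "num_results": 10, "text": {"max_characters": 5000}},
    "frames_single":  {"type": "auto", "num_results": 10, "text": {"max_characters": 5000}},
    "frames_agentic": {"type": "deep", "num_results": 10, "text": True, "context": True},
    "multiloko":      {"type": "deep", "num_results": 10, "text": True, "context": True},
    "browsecomp":     {"type": "deep", "num_results": 10, "text": True, "context": True},
    "hle":            {"type": "deep", "num_results": 10, "text": True, "context": True},
    "freshqa":        {"type": "auto", "num_results": 10, "text": {"max_characters": 5000},
                       "livecrawl": "preferred"},
}

def config_for(benchmark: str) -> dict:
    """Return the recommended Exa configuration for a benchmark."""
    return BENCHMARK_CONFIGS[benchmark]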
Dataset Characteristics
SimpleQA
- Purpose: Tests fast, factual question-answering
- Query style: “What is the capital of France?”, “Who invented the telephone?”
- Evaluation focus: Accuracy and latency for straightforward queries
- Exa configuration: Fast search with cached content
FRAMES
- Purpose: Evaluates both single-step and multi-step retrieval
- Two slices:
- Single-step: Direct queries answerable from one search
- Agentic: Complex queries requiring multiple search iterations
- Evaluation focus: Versatility across task complexity
- Exa configuration: Fast/Auto for single-step, Deep for agentic
MultiLoKo & BrowseComp
- Purpose: Multi-hop reasoning and deep comprehension
- Query style: Questions requiring synthesis across multiple sources
- Evaluation focus: Quality of context and reasoning support
- Exa configuration: Deep search with query expansion
Seal0
- Purpose: General search quality benchmark
- Query style: Diverse real-world queries
- Evaluation focus: Overall retrieval accuracy
- Exa configuration: All search types (compare performance)
HLE (Hard, Long, Emerging)
- Purpose: Stress-test with difficult queries
- Query style: Complex, lengthy queries about recent topics
- Evaluation focus: Handling edge cases and emerging information
- Exa configuration: Deep search with livecrawling
Benchmark Results
Low-Latency Search Engines
Performance on speed-critical tasks (latency <1s).
Key findings:
- Exa Fast achieves 94% accuracy on SimpleQA with median latency <500ms
- Strong performance across multiple benchmarks while maintaining speed advantage
- Ideal for real-time applications and high-volume agent workflows
Agentic Search APIs
Performance on complex, multi-step tasks (latency >2s).
Key findings:
- Exa Deep leads on FRAMES (96%) and MultiLoKo (89%) benchmarks
- Query expansion and rich context enable superior agentic performance
- Higher latency justified by comprehensive, high-quality results
Quality-Latency Tradeoffs
Understanding the Spectrum
Different use cases require different points on the quality-latency spectrum:
| Use Case | Priority | Recommended Type | Expected Latency | Quality Characteristics |
|---|---|---|---|---|
| Voice agents | Speed | Fast | <500ms | Good factual accuracy |
| Chatbot grounding | Balanced | Auto | ~1000ms | Versatile, high quality |
| Research assistant | Depth | Deep | ~5000ms | Comprehensive, multi-faceted |
| Batch enrichment | Quality | Deep | ~5000ms | Maximum coverage |
| Real-time autocomplete | Speed | Fast | <500ms | Relevant suggestions |
Interpreting Tradeoffs
When analyzing evaluation results:
- Don't compare across latency classes: Fast search at 500ms and Deep search at 5000ms serve different purposes. Always find the closest competitor in terms of latency for meaningful comparisons; compare systems with similar P50 latency ranges.
- Benchmark within peer groups:
  - Compare Exa Fast (<500ms) to other sub-1s APIs
  - Compare Exa Auto (~1s) to similar mid-latency systems
  - Compare Exa Deep (>2s) to other agentic/research-oriented systems
- Consider total workflow time: For multi-step agents, Fast search may complete the entire workflow faster than Deep search on a single query (e.g., four sequential Fast calls at ~500ms each total ~2s, versus ~5s for one Deep call).
- Account for quality requirements: If accuracy >90% is required, accept higher latency; if <1s is required, accept some quality tradeoff.
Factors That Impact Latency
Beyond search type selection, several parameters affect response time:
| Parameter | Latency Impact | Recommendation |
|---|---|---|
| livecrawl="preferred" | +500-2000ms | Use only when freshness is critical |
| livecrawl="fallback" | Variable | Balanced freshness/speed (default) |
| AI-generated summaries | +300-800ms | Evaluate necessity vs speed |
| num_results > 10 | +50-200ms | Keep at 10 for fair comparisons |
| Complex date filters | +100-300ms | Simplify when possible |
| Text filtering (include_text, exclude_text) | +100-500ms | Use sparingly |
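To quantify the impact of any one of these parameters on your own workload, vary it in isolation while holding everything else fixed, as in this sketch (the query list is illustrative):
import statistics
import time

from exa_py import Exa

exa = Exa(api_key="YOUR_API_KEY")
queries = [
    "latest federal reserve interest rate decision",
    "current SpaceX launch schedule",
]

def p50_latency(livecrawl_setting):
    """Median latency for the sample queries under one livecrawl setting."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        exa.search_and_contents(q, type="fast", num_results=10,
                                text={"max_characters": 5000},
                                livecrawl=livecrawl_setting)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

print(f"fallback : {p50_latency('fallback'):.0f}ms")
print(f"preferred: {p50_latency('preferred'):.0f}ms")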
Running Production-Grade Evaluations
Example: SimpleQA Evaluation Script
from exa_py import Exa
import json
from datetime import datetime
exa = Exa(api_key="YOUR_API_KEY")
def evaluate_simpleqa(dataset_path, config):
    """
    Run SimpleQA evaluation with the specified configuration.

    Args:
        dataset_path: Path to SimpleQA JSON file
        config: Dict with key 'type' (required); optional keys:
            num_results, text, livecrawl
    """
    with open(dataset_path) as f:
        questions = json.load(f)

    results = []
    latencies = []

    for item in questions:
        query = item['question']
        ground_truth = item['answer']

        # Retrieval
        start = datetime.now()
        search_result = exa.search_and_contents(
            query,
            type=config['type'],
            num_results=config.get('num_results', 10),
            text=config.get('text', {'max_characters': 5000}),
            livecrawl=config.get('livecrawl', 'fallback')
        )
        latency = (datetime.now() - start).total_seconds() * 1000
        latencies.append(latency)

        # Synthesis (using your LLM)
        context = "\n\n".join([r.text for r in search_result.results])
        answer = your_llm.generate(
            f"Answer concisely using only the context.\n\n"
            f"Context: {context}\n\n"
            f"Question: {query}\n\n"
            f"Answer:"
        )

        # Grading (using your grading LLM)
        grade = grading_llm.evaluate(
            question=query,
            expected=ground_truth,
            generated=answer
        )

        results.append({
            'query': query,
            'grade': grade,
            'latency_ms': latency
        })

    # Calculate metrics
    accuracy = sum(1 for r in results if r['grade'] == 'correct') / len(results)
    p50_latency = sorted(latencies)[len(latencies) // 2]

    return {
        'accuracy': accuracy,
        'p50_latency_ms': p50_latency,
        'total_queries': len(results),
        'config': config
    }
# Run evaluation
config = {
'type': 'fast',
'num_results': 10,
'text': {'max_characters': 5000}
}
results = evaluate_simpleqa('simpleqa.json', config)
print(f"Accuracy: {results['accuracy']:.2%}")
print(f"P50 Latency: {results['p50_latency_ms']:.0f}ms")
Multi-Configuration Comparison
Best practice: Run multiple configurations to understand tradeoffs:
configs = [
    {'name': 'Fast', 'type': 'fast'},
    {'name': 'Auto', 'type': 'auto'},
    {'name': 'Deep', 'type': 'deep'},
]

# num_results, text, and livecrawl fall back to the defaults inside evaluate_simpleqa
for config in configs:
    results = evaluate_simpleqa('simpleqa.json', config)
    print(f"{config['name']}: {results['accuracy']:.1%} @ {results['p50_latency_ms']:.0f}ms")
Example output:
Fast: 94.2% @ 450ms
Auto: 95.8% @ 1050ms
Deep: 97.2% @ 4950ms
Recommendations
For Low-Latency QA Benchmarks
Datasets: SimpleQA, WebWalkerQA, Seal0 (single-step)
Configuration:
- Use type="fast" or type="auto"
- Fix num_results=10
- Use text={"max_characters": 5000} for consistent context length
Expected performance:
- Accuracy: 90-95% on factual queries
- Latency: 400-600ms (Fast), 900-1200ms (Auto)
For Agentic Workflow Benchmarks
Datasets: FRAMES (agentic), MultiLoKo, BrowseComp, HLE
Configuration:
- Use type="deep"
- Provide 2-3 query variations via additional_queries (Python) / additionalQueries (JavaScript) for best results
- Enable context=True for rich summaries
- Set livecrawl="fallback" for freshness
For tool calling evaluations: See the Evaluating Exa with Tool Calling section above for guidance on setting up agents to autonomously invoke Exa search.
Expected performance:
- Accuracy: 85-96% on complex multi-hop queries
- Latency: 4000-6000ms
- Coverage: more comprehensive than single-query search
For Freshness Benchmarks
Datasets: FreshQA, time-sensitive custom queries
Configuration:
- Use any search type based on latency requirements
- Set livecrawl="preferred" or livecrawl="fallback"
- Include recent date filters only if needed (see the sketch below)
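A hedged example of a freshness-oriented run: Fast search with livecrawl="preferred", plus an optional published-date filter that should stay commented out unless date restriction is genuinely part of the test (the start_published_date parameter name follows the Python SDK; verify against your SDK version):
result = exa.search_and_contents(
    "current inflation rate in the eurozone",
    type="fast",
    num_results=10,
    text={"max_characters": 5000},
    livecrawl="preferred",
    # start_published_date="2025-01-01",  # optional date filter; add only if essential
)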
Expected performance:
- Freshness: Up-to-date information from recent sources
- Latency: +500-2000ms vs cached content
For Production Deployment
- Run comparative benchmarks across Fast, Auto, and Deep to understand your quality-latency frontier
- Match search type to use case:
  - Real-time user-facing: Fast
  - General chatbot/assistant: Auto
  - Deep research/agent workflows: Deep
- Monitor in production: Track accuracy, latency, and cost metrics continuously
- Optimize parameters: Adjust livecrawl, num_results, and content options based on actual usage patterns
- Document your evaluation: Record configurations, datasets, and results for reproducibility
For Meaningful Cross-System Comparisons
- Standardize everything:
- Identical query sets
- Same downstream LLM for synthesis
- Same grading model/rubric
- Fixed num_results across systems
- Compare within latency classes — find the closest competitor in terms of P50 latency:
- For Exa Fast (<500ms): Compare to other sub-1s APIs with similar latency
- For Exa Auto (~1s): Compare to mid-latency systems (800ms-1500ms)
- For Exa Deep (>2s): Compare to other multi-second agentic/research systems
- Account for feature differences:
- Some systems don’t offer content retrieval
- Some don’t support livecrawling
- Some have different context limits
- Measure what matters for your use case:
- If latency <500ms is required, only benchmark Fast-class systems
- If accuracy >95% is required, accept higher latency configurations
Additional Resources
For questions about evaluation methodology or custom benchmark needs, join our Discord community or reach out to our team.