Using Exa with Instructor for Structured Outputs
This guide demonstrates how to use Exa alongside Instructor to generate structured outputs from web content.
What this doc covers
- Setting up Exa to use Instructor for structured output generation
- Practical examples of using Exa and Instructor together
Guide
1. Prerequisites and installation
Install the required libraries:
```bash
pip install exa_py instructor openai
```
Make sure your API keys are set as environment variables: `EXA_API_KEY` and `OPENAI_API_KEY`.
2. Why use Instructor?
Instructor is a Python library that allows you to generate structured outputs from a language model.
We could simply instruct the LLM to return structured output, but the output would still be a string that we'd have to parse into a dictionary ourselves. What if the dictionary isn't structured the way we want? What if the LLM forgets the final "}" of the JSON? We would have to handle all of these errors manually.
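To make that failure mode concrete, here is a small illustration (the raw reply string is made up for this sketch):

```python
import json

# A made-up raw LLM reply that is missing its closing brace
raw_reply = '{"technology": "topological qubits", "description": "more stable qubits"'

try:
    data = json.loads(raw_reply)
except json.JSONDecodeError:
    data = None  # we would have to detect and repair this ourselves

print(data)  # None: the truncated JSON could not be parsed
```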
We could use { "type": "json_object" }, which makes the LLM return a JSON object. But then we would need to provide a JSON schema, which can get large and complex.
Instead, we can use Instructor. Instructor is powered by Pydantic, which means it integrates with your IDE. We use Pydantic's `BaseModel` to define the output model.
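Validating a dict against a `BaseModel` either succeeds or raises a `ValidationError` that pinpoints exactly what is wrong. A minimal sketch with no API calls (the model and field names here are illustrative):

```python
from pydantic import BaseModel, ValidationError

class Advancement(BaseModel):
    technology: str
    description: str

# A complete payload validates cleanly
ok = Advancement.model_validate(
    {"technology": "error correction", "description": "fewer logical errors"}
)

# An incomplete payload raises, and the error names the missing field --
# this is the signal Instructor feeds back to the LLM on a retry
try:
    Advancement.model_validate({"technology": "error correction"})
    missing = []
except ValidationError as err:
    missing = [e["loc"][0] for e in err.errors()]

print(missing)  # ['description']
```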
3. Setup and Basic Usage
Let's set up Exa and Instructor:
```python
import os

import instructor
from exa_py import Exa
from openai import OpenAI
from pydantic import BaseModel

exa = Exa(os.environ["EXA_API_KEY"])
client = instructor.from_openai(OpenAI())

search_results = exa.search_and_contents(
    "Latest advancements in quantum computing",
    use_autoprompt=True,
    type="neural",
    text=True,
)

# Combine the result texts and cap the context at 20,000 characters
search_context = "\n\n".join(result.text for result in search_results.results)[:20000]


class QuantumComputingAdvancement(BaseModel):
    technology: str
    description: str
    potential_impact: str

    def __str__(self):
        return (
            f"Technology: {self.technology}\n"
            f"Description: {self.description}\n"
            f"Potential Impact: {self.potential_impact}"
        )


structured_output = client.chat.completions.create(
    model="gpt-4o",
    response_model=QuantumComputingAdvancement,
    messages=[
        {
            "role": "user",
            "content": f"Based on the provided context, describe a recent advancement in quantum computing.\n\n{search_context}",
        }
    ],
)

print(structured_output)
```
Here we define a `QuantumComputingAdvancement` class that inherits from Pydantic's `BaseModel`. Instructor uses this class both as the response model for the LLM and to validate the LLM's output. We also implement the `__str__()` method for easy printing of the output. We then initialize `OpenAI()` and wrap Instructor around it with `instructor.from_openai` to create a client that returns structured outputs. If the output does not match our class, Instructor makes the LLM retry until `max_retries` is reached. You can read more about how Instructor handles retries here.
This example demonstrates how to use Exa to search for content about quantum computing advancements and structure the output using Instructor.
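The retry loop can be pictured like this. This is a simplified, self-contained sketch, not Instructor's actual implementation; `Item` and the canned "attempts" are made up for illustration:

```python
from pydantic import BaseModel, ValidationError

class Item(BaseModel):
    name: str
    price: float

def parse_with_retries(raw_attempts, max_retries=3):
    """Validate successive raw outputs until one matches the model."""
    for attempt, raw in enumerate(raw_attempts[:max_retries], start=1):
        try:
            return Item.model_validate(raw), attempt
        except ValidationError:
            continue  # Instructor would send the error back to the LLM here
    raise RuntimeError("max_retries reached")

# The first "LLM attempt" is missing a field; the second validates
item, attempts = parse_with_retries(
    [{"name": "qubit"}, {"name": "qubit", "price": 3.5}]
)
print(attempts)  # 2
```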
4. Advanced Example: Analyzing Multiple Research Papers
Let's create a more complex example in which we analyze multiple research papers on a specific topic and use Pydantic's own validators to constrain the structured data, showing how we can be even more fine-grained:
```python
import os
from typing import List

import instructor
from exa_py import Exa
from openai import OpenAI
from pydantic import BaseModel, field_validator

exa = Exa(os.environ["EXA_API_KEY"])
client = instructor.from_openai(OpenAI())


class ResearchPaper(BaseModel):
    title: str
    authors: List[str]
    key_findings: List[str]
    methodology: str

    @field_validator("title")
    @classmethod
    def validate_title(cls, v):
        if v.upper() != v:
            raise ValueError("Title must be in uppercase.")
        return v

    def __str__(self):
        return (
            f"Title: {self.title}\n"
            f"Authors: {', '.join(self.authors)}\n"
            f"Key Findings: {', '.join(self.key_findings)}\n"
            f"Methodology: {self.methodology}"
        )


class ResearchAnalysis(BaseModel):
    papers: List[ResearchPaper]
    common_themes: List[str]
    future_directions: str

    def __str__(self):
        return (
            f"Common Themes:\n- {', '.join(self.common_themes)}\n"
            f"Future Directions: {self.future_directions}\n"
            f"Analyzed Papers:\n" + "\n".join(str(paper) for paper in self.papers)
        )


# Search for recent AI ethics research papers
search_results = exa.search_and_contents(
    "Recent AI ethics research papers",
    use_autoprompt=True,
    type="neural",
    text=True,
    num_results=5,  # Limit to 5 papers for this example
)

# Combine all search results into one string
combined_results = "\n\n".join([result.text for result in search_results.results])

structured_output = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=ResearchAnalysis,
    max_retries=5,
    messages=[
        {
            "role": "user",
            "content": f"Analyze the following AI ethics research papers and provide a structured summary:\n\n{combined_results}",
        }
    ],
)

print(structured_output)
```
By using Pydantic's `field_validator`, we can create our own rules for each field so that we work with predictable data even though we are using an LLM. Implementing the `__str__()` method also makes the output more readable and convenient to print. Read more about the different Pydantic validators here. Because the prompt does not say that the title should be in uppercase, this example will result in at least two API calls. You should avoid relying on `field_validator`s as the only means of getting data into the right format; instead, include the instructions in the prompt as well, such as specifying that the title should be in uppercase/all-caps.
This advanced example demonstrates how to use Exa and Instructor to analyze multiple research papers, extract structured information, and provide a comprehensive summary of the findings.
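Pulling the validator out on its own makes the round trip visible. This sketch makes no API calls, and the titles are made up:

```python
from pydantic import BaseModel, ValidationError, field_validator

class ResearchPaper(BaseModel):
    title: str

    @field_validator("title")
    @classmethod
    def validate_title(cls, v):
        if v.upper() != v:
            raise ValueError("Title must be in uppercase.")
        return v

# A lowercase title is rejected -- in the full example, this error message
# is what Instructor sends back to the LLM to trigger another attempt
try:
    ResearchPaper(title="on ai ethics")
    rejected = False
except ValidationError:
    rejected = True

# An uppercase title passes on the first try
accepted = ResearchPaper(title="ON AI ETHICS")
print(rejected, accepted.title)
```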
5. Streaming Structured Outputs
Instructor also supports streaming structured outputs, which is useful for getting partial results as they're generated. (Streaming does not support validators due to the nature of streaming responses; you can read more about this here.)
To make the output easier to see, we will use the rich Python package. It should already be installed, but if it isn't, just run `pip install rich`.
```python
import os
from typing import List

import instructor
from exa_py import Exa
from openai import OpenAI
from pydantic import BaseModel
from rich.console import Console

exa = Exa(os.environ["EXA_API_KEY"])
client = instructor.from_openai(OpenAI())


class AIEthicsInsight(BaseModel):
    topic: str
    description: str
    ethical_implications: List[str]

    def __str__(self):
        return (
            f"Topic: {self.topic}\n"
            f"Description: {self.description}\n"
            f"Ethical Implications:\n- {', '.join(self.ethical_implications or [])}"
        )


# Search for recent AI ethics research papers
search_results = exa.search_and_contents(
    "Recent AI ethics research papers",
    use_autoprompt=True,
    type="neural",
    text=True,
    num_results=5,  # Limit to 5 papers for this example
)

# Combine all search results into one string
combined_results = "\n\n".join([result.text for result in search_results.results])

structured_output = client.chat.completions.create_partial(
    model="gpt-3.5-turbo",
    response_model=AIEthicsInsight,
    messages=[
        {
            "role": "user",
            "content": f"Provide insights on AI ethics based on the following research:\n\n{combined_results}",
        }
    ],
    stream=True,
)

console = Console()

for output in structured_output:
    console.clear()
    print(output)
    # Stop once every field has arrived and we have at least four implications
    if (
        output.topic
        and output.description
        and output.ethical_implications is not None
        and len(output.ethical_implications) >= 4
    ):
        break
```
The last partial object printed inside the loop looks like this:

```
topic='AI Ethics in Mimetic Models' description='Exploring the ethical implications of AI that simulates the decisions and behavior of specific individuals, known as mimetic models, and the social impact of their availability in various domains such as game-playing, text generation, and artistic expression.' ethical_implications=['Deception Concerns: Mimetic models can potentially be used for deception, leading to misinformation and challenges in distinguishing between a real individual and a simulated model.', 'Normative Issues: Mimetic models raise normative concerns related to the interactions between the target individual, the model operator, and other entities that interact with the model, impacting transparency, authenticity, and ethical considerations in various scenarios.', 'Preparation and End-Use: Mimetic models can be used as preparation for real-life interactions or as an end in themselves, affecting interactions, personal relationships, labor dynamics, and audience engagement, leading to questions about consent, labor devaluation, and reputation consequences.', '']
```

Final Output:

```
Topic: AI Ethics in Mimetic Models
Description: Exploring the ethical implications of AI that simulates the decisions and behavior of specific individuals, known as mimetic models, and the social impact of their availability in various domains such as game-playing, text generation, and artistic expression.
Ethical Implications:
- Deception Concerns: Mimetic models can potentially be used for deception, leading to misinformation and challenges in distinguishing between a real individual and a simulated model.
- Normative Issues: Mimetic models raise normative concerns related to the interactions between the target individual, the model operator, and other entities that interact with the model, impacting transparency, authenticity, and ethical considerations in various scenarios.
- Preparation and End-Use: Mimetic models can be used as preparation for real-life interactions or as an end in themselves, affecting interactions, personal relationships, labor dynamics, and audience engagement, leading to questions about consent, labor devaluation, and reputation consequences.
```
This example shows how to stream partial results and break the loop when certain conditions are met.
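The partial-object semantics can be reproduced without any network calls. In the sketch below, a hand-built list stands in for the stream, and every field is Optional because fields that have not arrived yet are None:

```python
from typing import List, Optional
from pydantic import BaseModel

class AIEthicsInsight(BaseModel):
    topic: Optional[str] = None
    description: Optional[str] = None
    ethical_implications: Optional[List[str]] = None

# A made-up stand-in for the stream of partially-filled objects
fake_stream = [
    AIEthicsInsight(topic="Mimetic Models"),
    AIEthicsInsight(topic="Mimetic Models", description="simulating individuals"),
    AIEthicsInsight(
        topic="Mimetic Models",
        description="simulating individuals",
        ethical_implications=["deception", "consent"],
    ),
]

final = None
for partial in fake_stream:
    # Break as soon as every field we care about has arrived
    if partial.topic and partial.description and partial.ethical_implications:
        final = partial
        break

print(final.ethical_implications)  # ['deception', 'consent']
```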
By combining Exa’s powerful search capabilities with Instructor’s predictable output generation, you can extract and analyze information from web content efficiently and accurately.