Structured Outputs with Instructor
Using Exa with instructor to generate structured outputs from web content.
What this doc covers
- Setting up Exa to use Instructor for structured output generation
- Practical examples of using Exa and Instructor together
Guide
1. Pre-requisites and installation
Install the required libraries:
Ensure API keys are initialized properly. The environment variable names are EXA_API_KEY
and OPENAI_API_KEY
.
Get your Exa API key
2. Why use Instructor?
Instructor is a Python library that allows you to generate structured outputs from a language model.
We could instruct the LLM to return a structured output, but the output will still be a string, which we need to convert to a dictionary. What if the dictionary is not structured as we want? What if the LLM forgot to add the last ”}” in the JSON? We would have to handle all of these errors manually.
We could use { "type": "json_object" }
which will make the LLM return a JSON object. But for this, we would need to provide a JSON schema, which can get large and complex.
Instead of doing this, we can use Instructor. Instructor is powered by pydantic, which means that it integrates with your IDE. We use pydantic’s BaseModel
to define the output model:
3. Setup and Basic Usage
Let’s set up Exa and Instructor:
Here we define a QuantumComputingAdvancement
class that inherits from BaseModel
from Pydantic. This class will be used by Instructor to validate the output from the LLM and for the LLM as a response model. We also implement the __str__()
method for easy printing of the output. We then initialize OpenAI()
and wrap instructor on top of it with instructor.from_openai
to create a client that will return structured outputs. If the output is not structured as our class, Instructor makes the LLM retry until max_retries is reached. You can read more about how Instructor retries here.
This example demonstrates how to use Exa to search for content about quantum computing advancements and structure the output using Instructor.
4. Advanced Example: Analyzing Multiple Research Papers
Let’s create a more complex example where we analyze multiple research papers on a specific topic and use pydantic’s own validation model to correct the structured data to show you how we can be even more fine-grained:
By using pydantic’s field_validator
, we can create our own rules to validate each field to be exactly what we want, so that we can work with predictable data even though we are using an LLM. Additionally, implementing the __str__()
method allows for more readable and convenient output formatting. Read more about different pydantic validators here. Because we don’t specify that the Title
should be in uppercase in the prompt, this will result in at least two API calls. You should avoid using field_validator
s as the only means to get the data in the right format; instead, you should include instructions in the prompt, such as specifying that the Title
should be in uppercase/all-caps.
This advanced example demonstrates how to use Exa and Instructor to analyze multiple research papers, extract structured information, and provide a comprehensive summary of the findings.
5. Streaming Structured Outputs
Instructor also supports streaming structured outputs, which is useful for getting partial results as they’re generated (this does not support validators due to the nature of streaming responses, you can read more about it here):
To make the output easier to see, we will use the rich Python package. It should already be installed, but if it isn’t, just run pip install rich
.
This example shows how to stream partial results and break the loop when certain conditions are met.
6. Writing Results to CSV
After generating structured outputs, you might want to save the results for further analysis or record-keeping. Here’s how you can write the results to a CSV file:
After running the code, you’ll have a CSV file named “ai_ethics_insights.csv”. Here’s an example of what the contents might look like:
Instructor has enabled the creation of structured data that can as such be stored in tabular format, e.g.in a CRM or similar.
By combining Exa’s powerful search capabilities with Instructor’s predictable output generation, you can extract and analyze information from web content efficiently and accurately.