What this doc covers
- Setting up Exa to use Instructor for structured output generation
- Practical examples of using Exa and Instructor together
Guide
1. Pre-requisites and installation
Install the required libraries:Python
EXA_API_KEY
and OPENAI_API_KEY
.
Get your Exa API key
2. Why use Instructor?
Instructor is a Python library that allows you to generate structured outputs from a language model. We could instruct the LLM to return a structured output, but the output will still be a string, which we need to convert to a dictionary. What if the dictionary is not structured as we want? What if the LLM forgot to add the last ”}” in the JSON? We would have to handle all of these errors manually. We could use{ "type": "json_object" }
which will make the LLM return a JSON object. But for this, we would need to provide a JSON schema, which can get large and complex.
Instead of doing this, we can use Instructor. Instructor is powered by pydantic, which means that it integrates with your IDE. We use pydantic’s BaseModel
to define the output model:
3. Setup and Basic Usage
Let’s set up Exa and Instructor:Python
QuantumComputingAdvancement
class that inherits from BaseModel
from Pydantic. This class will be used by Instructor to validate the output from the LLM and for the LLM as a response model. We also implement the __str__()
method for easy printing of the output. We then initialize OpenAI()
and wrap instructor on top of it with instructor.from_openai
to create a client that will return structured outputs. If the output is not structured as our class, Instructor makes the LLM retry until max_retries is reached. You can read more about how Instructor retries here.
This example demonstrates how to use Exa to search for content about quantum computing advancements and structure the output using Instructor.
4. Advanced Example: Analyzing Multiple Research Papers
Let’s create a more complex example where we analyze multiple research papers on a specific topic and use pydantic’s own validation model to correct the structured data to show you how we can be even more fine-grained:Python
field_validator
, we can create our own rules to validate each field to be exactly what we want, so that we can work with predictable data even though we are using an LLM. Additionally, implementing the __str__()
method allows for more readable and convenient output formatting. Read more about different pydantic validators here. Because we don’t specify that the Title
should be in uppercase in the prompt, this will result in at least two API calls. You should avoid using field_validator
s as the only means to get the data in the right format; instead, you should include instructions in the prompt, such as specifying that the Title
should be in uppercase/all-caps.
This advanced example demonstrates how to use Exa and Instructor to analyze multiple research papers, extract structured information, and provide a comprehensive summary of the findings.
5. Streaming Structured Outputs
Instructor also supports streaming structured outputs, which is useful for getting partial results as they’re generated (this does not support validators due to the nature of streaming responses, you can read more about it here): To make the output easier to see, we will use the rich Python package. It should already be installed, but if it isn’t, just runpip install rich
.
Python
stream output
6. Writing Results to CSV
After generating structured outputs, you might want to save the results for further analysis or record-keeping. Here’s how you can write the results to a CSV file:Python