websets-news-monitor

Click here to try it out.

Overview

We created a Websets News Monitor that uses the Websets API to monitor the web semantically for queries like “startup funding round announcements” or “new product launches.” Each tab uses a different Webset that updates daily using a monitor.

It demonstrates best practices for news monitoring including:

  • Deduplicating articles about the same story
  • Filtering out low-quality data sources
  • Receiving real-time updates via webhooks

View the full source code on GitHub.

How it Works

1

Set Up a Webhook

Webhooks allow you to subscribe to real-time updates as your Websets run. We want to know when a Webset is created and when its items finish enriching, so we’ll subscribe to the webset.created and webset.item.enriched events.

Javascript
import Exa, { EventType } from 'exa-js';

const exa = new Exa(process.env.EXA_API_KEY);
const webhookUrl = 'https://smee.io/123abc456def'; // Replace with your webhook handler endpoint

const webhook = await exa.websets.webhooks.create({
    url: webhookUrl,
    events: [
        EventType.webset_created,
        EventType.webset_item_enriched,
    ],
});

console.log(`✅ Webhook created with ID: ${webhook.id}`);
console.log(`WEBHOOK_SECRET=${webhook.secret}`);
Save webhook.secret; we’ll use it later to validate incoming webhook requests.
2

Create a Webset

Now we’ll create a Webset that searches for the types of articles we are looking for. Use query to direct the search and criteria to narrow down the results.

In this example we’re looking for articles about recent startup fundraises.

Javascript
const webset = await exa.websets.create({
    search: {
        query: "Startups that raised a funding round in the last 24 hours",
        criteria: [
            {
                description: "Article is about a startup raising a funding round of at least $1M",
            },
            {
                description: "Article published in a top 20 tech publication (TechCrunch, The Verge, Wired, etc.)",
            },
            {
                description: "Article was published in the last 24 hours",
            }
        ],
        entity: { type: "article" },
        behavior: "append",
        count: 25
    },
    enrichments: [
        {
            description: "One sentence summary of the article using content not in the title",
            format: "text",
        }
    ]
});

console.log(`✅ Webset created with ID: ${webset.id}`);
3

Monitor the Webset

We want our Webset to update with new articles daily, so we’ll create a monitor with the webset.id. We set cadence to a daily cron schedule and behavior to search, so each run looks for new results.

By default, monitors reuse the last search the Webset ran. Because the query we used when creating the Webset says “in the last 24 hours”, the time window is always relative to when the monitor runs.

Javascript
const monitor = await exa.websets.monitors.create({
    websetId: webset.id,
    behavior: { type: "search", config: { count: 10 } },
    cadence: {
        cron: "0 0 * * *", // Every day at 00:00 in the timezone below
        timezone: "UTC"
    }
});

console.log(`✅ Monitor created with ID: ${monitor.id}`);
4

Handle the Webhook

Lastly, we need to create an endpoint to handle the webhook requests. We’ll set up a Next.js route to handle POST requests and parse the event data.

For security purposes, you should verify the request’s signature using the webhook secret from the first step. See the signature verification guide for more info.

Javascript
// app/api/webhook/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { prisma } from '@/lib/prisma';
import { verifyWebhookSignature } from '@/lib/webhook';
import { exa } from '@/lib/exa';
import { embedText } from '@/lib/openai';
import { isDuplicate } from '@/lib/dedupe';

export async function POST(request: NextRequest) {
    // Get the raw body for signature verification
    const rawBody = await request.text();
    const signatureHeader = request.headers.get('exa-signature') || '';
    const webhookSecret = process.env.WEBHOOK_SECRET;

    // Verify webhook signature 
    if (!verifyWebhookSignature(rawBody, signatureHeader, webhookSecret)) {
        console.error('Invalid webhook signature');
        return NextResponse.json({ error: 'Invalid signature' }, { status: 400 });
    }

    const body = JSON.parse(rawBody);

    switch (body.type) {
        case 'webset.created':
            // Handle new Webset
            break;
        case 'webset.item.enriched':
            // Handle new enriched item
            break;
        default:
            break;
    }

    return NextResponse.json({ 
        received: true,
        type: body.type,
        timestamp: new Date().toISOString()
    });
}
View the full route implementation here.
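
The route above imports verifyWebhookSignature from a local helper that is not shown. A minimal sketch of such a helper might look like the following. It assumes a header of the form `t=<unix timestamp>,v1=<hex HMAC-SHA256>` computed over `<timestamp>.<raw body>`, which is a common webhook signing scheme; confirm the exact format against the signature verification guide before relying on it. The 5-minute replay window and the `nowMs` parameter are additions for illustration, not part of the demo. The sketch uses `require` so it runs as a plain Node script; in the Next.js project you would use `import` instead.

```javascript
// Hypothetical implementation of the verifyWebhookSignature helper.
// ASSUMPTION: header format "t=<unix seconds>,v1=<hex HMAC-SHA256>" and
// signed payload `${timestamp}.${rawBody}`. Check the official guide.
const crypto = require('node:crypto');

function verifyWebhookSignature(
    rawBody,
    signatureHeader,
    secret,
    toleranceSeconds = 300, // assumed replay window, not from the docs
    nowMs = Date.now()      // injectable clock, useful for testing
) {
    if (!secret || !signatureHeader) return false;

    // Parse "t=...,v1=..." into { t: '...', v1: '...' }
    const parts = Object.fromEntries(
        signatureHeader.split(',').map((pair) => {
            const idx = pair.indexOf('=');
            return [pair.slice(0, idx).trim(), pair.slice(idx + 1).trim()];
        })
    );
    const timestamp = parts.t;
    const signature = parts.v1;
    if (!timestamp || !signature) return false;

    // Reject requests whose timestamp is too far from the current time
    const ageSeconds = Math.abs(nowMs / 1000 - Number(timestamp));
    if (!Number.isFinite(ageSeconds) || ageSeconds > toleranceSeconds) return false;

    // Recompute the signature over the raw (unparsed) request body
    const expected = crypto
        .createHmac('sha256', secret)
        .update(`${timestamp}.${rawBody}`)
        .digest('hex');

    // Constant-time comparison; timingSafeEqual throws on length mismatch,
    // so compare lengths first
    const a = Buffer.from(expected, 'utf8');
    const b = Buffer.from(signature, 'utf8');
    return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```

The signature is computed over the raw request body, which is why the route reads request.text() and only calls JSON.parse after verification succeeds.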

Semantic Whitelisting

We want our feeds to contain high-quality links and avoid SEO spam. This would normally require manually maintaining lists of domains to include or exclude from your results, but with Websets you can express the same filter as a criterion.

You can create criteria that function as a semantic whitelist, telling the LLM what kinds of articles to allow. Here’s an example:

Article published in a top 20 tech publication (TechCrunch, The Verge, Wired, etc.)

You can see all of the criteria used in the demo here.

Storyline Deduplication

A common issue when monitoring news is handling multiple articles about the same storyline. Often you want to group articles by storyline or remove duplicates so users don’t see repeated content.

In our demo, we solve this using embeddings, vector search, and an LLM to classify duplicates.

1

Embed the Article Title

First, we’ll embed the article’s title using OpenAI’s embedding API. We’ll use the text-embedding-3-small model, which produces vectors suited to similarity comparisons.

Javascript
import OpenAI from 'openai';

const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
});

const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: title,
    dimensions: 1536,
});

const embedding = response.data[0].embedding;
2

Search for Similar Articles

Next, we use PostgreSQL’s pgvector extension to find the 10 most similar articles from the last week.

Javascript
import { prisma } from '@/lib/prisma';

const query = `
    SELECT id, title, "publishedAt", embedding <+> $1::vector AS distance
    FROM "Articles"
    WHERE "publishedAt" >= NOW() - INTERVAL '7 days'
    ORDER BY embedding <+> $1::vector
    LIMIT 10;
`;

const similarArticles = await prisma.$queryRawUnsafe(query, embedding);
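
To make the ranking in that query concrete, here is a small in-memory sketch of what the database does: compute a distance between the query vector and each stored vector, sort ascending, and keep the closest rows. In pgvector, `<+>` is L1 (taxicab) distance and `<=>` is cosine distance; both are shown for comparison. The function names and the three-dimensional toy vectors are made up for illustration, and real embeddings have 1536 dimensions.

```javascript
// L1 (taxicab) distance: what pgvector's <+> operator computes.
function l1Distance(a, b) {
    let sum = 0;
    for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
    return sum;
}

// Cosine distance (1 - cosine similarity): what pgvector's <=> operator computes.
function cosineDistance(a, b) {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k rows closest to queryVector, nearest first
// (the in-memory equivalent of ORDER BY distance LIMIT k).
function nearest(rows, queryVector, k, distanceFn) {
    return rows
        .map((row) => ({ ...row, distance: distanceFn(row.embedding, queryVector) }))
        .sort((x, y) => x.distance - y.distance)
        .slice(0, k);
}

// Toy "embeddings" with made-up values, for illustration only.
const rows = [
    { id: 1, title: 'Acme raises $10M Series A', embedding: [1, 0, 0] },
    { id: 2, title: 'Acme closes $10M round', embedding: [0.9, 0.1, 0] },
    { id: 3, title: 'New phone launched', embedding: [0, 0, 1] },
];

const top = nearest(rows, [1, 0, 0], 2, l1Distance);
console.log(top.map((r) => r.id)); // ids 1 and 2, nearest first
```

A smaller distance means a more similar title, so the two funding stories rank ahead of the unrelated one under either metric.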
3

Classify Duplicates with an LLM

Finally, we’ll use an LLM with structured outputs to classify whether the article is a duplicate. The LLM will look at the titles of similar articles and determine if they are about the same event.

Javascript
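import { z } from 'zod';
import { zodTextFormat } from 'openai/helpers/zod';

// Schema for the structured output: a single boolean verdict
const DuplicateCheck = z.object({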
    is_duplicate: z.boolean(),
});

const response = await openai.responses.parse({
    model: 'gpt-4o-mini',
    input: [
        {
            role: 'system',
            content: 'You are a news deduplication assistant. Determine if stories are about the same event.'
        },
        {
            role: 'user',
            content: `Is this story a duplicate of any in the list? \nQuery story: "${title}" \nSimilar stories: ${similarArticles.map(item => item.title).join('\n')}`
        }
    ],
    text: {
        format: zodTextFormat(DuplicateCheck, "duplicate_check"),
    },
});

const isDuplicate = response.output_parsed.is_duplicate;

You can view the complete deduplication implementation here.
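
The three steps can be chained into a single check. The sketch below is hypothetical glue code, not the demo's actual implementation: the function name checkDuplicate and the injected embed, findSimilar, and classify functions are assumptions that stand in for the embedding call, the pgvector query, and the LLM classification shown above. Skipping the LLM call when no similar articles are found is a design choice of this sketch.

```javascript
// Hypothetical pipeline: the three step functions are passed in so the flow
// can be read (and tested) without a database or an API key.
async function checkDuplicate(title, { embed, findSimilar, classify }) {
    // Step 1: embed the article title
    const embedding = await embed(title);

    // Step 2: fetch the most similar recent articles
    const similar = await findSimilar(embedding);

    // Nothing similar was found, so there is nothing to compare against
    if (similar.length === 0) {
        return { isDuplicate: false, embedding };
    }

    // Step 3: ask the classifier whether any similar title is the same story
    const isDuplicate = await classify(title, similar.map((a) => a.title));
    return { isDuplicate, embedding };
}
```

Returning the embedding alongside the verdict lets the caller store it with the new article, so later articles can be compared against it.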