Add zero-LLM page exploration tools (search_page, find_elements, structured extract) #432

@lmorchard

Current state

Pilo's only on-page information-extraction tool is extract (packages/core/src/tools/webActionTools.ts:310-358), which is LLM-powered:

extract: tool({
  description: "Extract specific data from the current page for later reference",
  inputSchema: z.object({
    description: z.string(),
  }),
  execute: async ({ description }) => {
    const markdown = await context.browser.getMarkdown();
    const prompt = buildExtractionPrompt(description, markdown);
    const extractResponse = await generateTextWithRetry({...}, { maxAttempts: 3 });
    // ... returns extractedData as markdown string ...
  },
})

Every extract call:

  • Converts the whole page to markdown (via Turndown, playwrightBrowser.ts:668-696)
  • Sends ~5000 tokens to an LLM
  • Makes up to 3 attempts (maxAttempts: 3) before giving up
  • Returns markdown text the agent then has to parse/interpret

The agent has no cheaper alternative for simpler questions ("is the word 'logout' on this page?" / "how many product cards are there?" / "what's the URL of the link with text 'Privacy Policy'?"). Every such question costs an extract LLM round trip.

The gap

Three related capability gaps:

  1. No zero-LLM page text search — for "does the page contain X?" the agent must call extract with a descriptive query and pay LLM cost + latency.
  2. No zero-LLM element query — for "how many <article> elements are there?" or "what are the hrefs of links in <nav>?" — same story.
  3. extract returns markdown only — when the agent wants structured data (a list of 10 items each with { name, price, url }), it has to parse the markdown back out, which is fragile. The Vercel AI SDK supports generateObject for structured output; Pilo's extract doesn't use it.

Proposed scope

A. Add search_page tool

search_page: tool({
  description:
    "Search the current page content for text matching a pattern. " +
    "Returns matches with surrounding context. Free and fast — prefer this over " +
    "extract() when you know what text to look for.",
  inputSchema: z.object({
    pattern: z.string(),
    regex: z.boolean().default(false),
    caseSensitive: z.boolean().default(false),
    contextChars: z.number().min(0).max(500).default(80),
    maxResults: z.number().min(1).max(50).default(10),
  }),
  execute: async ({ pattern, regex, caseSensitive, contextChars, maxResults }) => {
    return performActionWithValidation(
      PageAction.SearchPage,
      context,
      undefined,
      JSON.stringify({ pattern, regex, caseSensitive, contextChars, maxResults }),
    );
  },
}),

Implementation in playwrightBrowser.ts via page.evaluate:

const matches = await this.page!.evaluate(({ pattern, regex, caseSensitive, contextChars, maxResults }) => {
  // Treat the pattern as a literal string unless regex is set: escape regex metacharacters
  const source = regex ? pattern : pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp(source, caseSensitive ? "g" : "gi");
  // Walk text nodes via TreeWalker, accumulating offsets
  // Return array of { match, contextBefore, contextAfter, element selector hint }
}, { ... });
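
The matching step can be sketched as a pure function over page text that has already been extracted (for example, the concatenated text nodes). This is a minimal sketch, not the proposed implementation: searchText, SearchOptions, and SearchMatch are hypothetical names, and it leaves out the TreeWalker traversal and the element selector hint. It guards against zero-length regex matches, which would otherwise never advance the search position.

```typescript
// Hypothetical helper: names and result shape are illustrative, not Pilo's API.
interface SearchOptions {
  regex?: boolean;
  caseSensitive?: boolean;
  contextChars?: number;
  maxResults?: number;
}

interface SearchMatch {
  match: string;
  index: number;
  contextBefore: string;
  contextAfter: string;
}

function searchText(text: string, pattern: string, options: SearchOptions = {}): SearchMatch[] {
  const { regex = false, caseSensitive = false, contextChars = 80, maxResults = 10 } = options;
  // Escape regex metacharacters when the pattern is meant as a literal string.
  const source = regex ? pattern : pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp(source, caseSensitive ? "g" : "gi");

  const results: SearchMatch[] = [];
  let m: RegExpExecArray | null;
  while (results.length < maxResults && (m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      // A zero-length match does not advance lastIndex; step manually to avoid an endless loop.
      re.lastIndex++;
      continue;
    }
    const start = m.index;
    const end = start + m[0].length;
    results.push({
      match: m[0],
      index: start,
      contextBefore: text.slice(Math.max(0, start - contextChars), start),
      contextAfter: text.slice(end, end + contextChars),
    });
  }
  return results;
}
```

An invalid pattern with regex: true still throws SyntaxError from the RegExp constructor here, so the caller is expected to catch it and report a recoverable error.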

B. Add find_elements tool

find_elements: tool({
  description:
    "Find elements on the page by CSS selector. Returns matching elements with their " +
    "text and attributes. Free and fast — useful for inventory queries like " +
    "'how many product cards are there?' before deciding to extract().",
  inputSchema: z.object({
    selector: z.string(),
    attributes: z.array(z.string()).optional()
      .describe("Specific attributes to include (e.g., ['href', 'data-id'])"),
    maxResults: z.number().min(1).max(100).default(20),
    includeText: z.boolean().default(true),
  }),
  execute: async ({ selector, attributes, maxResults, includeText }) => {
    return performActionWithValidation(
      PageAction.FindElements,
      context,
      undefined,
      JSON.stringify({ selector, attributes, maxResults, includeText }),
    );
  },
}),

The implementation runs document.querySelectorAll in-page and returns { tag, text, attributes } for each match, with src and href values resolved to absolute URLs.
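
One way to shape the per-match summary is sketched below. It is written against a minimal structural type rather than the DOM so that it runs outside a browser; in-page, the elements would come from document.querySelectorAll(selector) and the base URL from document.baseURI. ElementLike, ElementSummary, summarizeElements, and toAbsoluteUrl are hypothetical names, and the whitespace normalization of text is an assumption rather than part of the proposal.

```typescript
// Minimal structural type so the sketch runs outside a browser; a DOM Element satisfies it.
interface ElementLike {
  tagName: string;
  textContent: string | null;
  getAttributeNames(): string[];
  getAttribute(name: string): string | null;
}

interface ElementSummary {
  tag: string;
  text?: string;
  attributes: Record<string, string>;
}

// Attributes whose values are resolved against the page's base URL.
const URL_ATTRIBUTES = new Set(["href", "src"]);

function toAbsoluteUrl(value: string, baseUrl: string): string {
  try {
    return new URL(value, baseUrl).href;
  } catch {
    return value; // leave unparseable values untouched
  }
}

function summarizeElements(
  elements: Iterable<ElementLike>,
  baseUrl: string,
  options: { attributes?: string[]; maxResults?: number; includeText?: boolean } = {},
): ElementSummary[] {
  const { attributes, maxResults = 20, includeText = true } = options;
  const summaries: ElementSummary[] = [];
  for (const el of elements) {
    if (summaries.length >= maxResults) break;
    // Report only the requested attributes, or all of them when none are specified.
    const names = attributes ?? el.getAttributeNames();
    const attrs: Record<string, string> = {};
    for (const name of names) {
      const value = el.getAttribute(name);
      if (value === null) continue;
      attrs[name] = URL_ATTRIBUTES.has(name) ? toAbsoluteUrl(value, baseUrl) : value;
    }
    const summary: ElementSummary = { tag: el.tagName.toLowerCase(), attributes: attrs };
    if (includeText) summary.text = (el.textContent ?? "").replace(/\s+/g, " ").trim();
    summaries.push(summary);
  }
  return summaries;
}
```

An invalid selector fails earlier, in querySelectorAll itself, so that error handling belongs around the query rather than in this helper.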

C. Add optional outputSchema to extract

Extend the existing tool:

extract: tool({
  description:
    "Extract data from the current page. If outputSchema is provided, returns structured " +
    "data matching the schema. Else returns markdown text.",
  inputSchema: z.object({
    description: z.string(),
    outputSchema: z.record(z.string(), z.unknown()).optional()
      .describe("JSON Schema describing the desired output structure"),
  }),
  execute: async ({ description, outputSchema }) => {
    const markdown = await context.browser.getMarkdown();
    if (outputSchema) {
      const zodSchema = jsonSchemaToZod(outputSchema);
      const { object } = await generateObjectWithRetry({
        ...providerConfig,
        prompt: buildExtractionPrompt(description, markdown),
        schema: zodSchema,
      }, { maxAttempts: 3 });
      return { success: true, action: "extract", description, data: object };
    } else {
      // existing markdown path
    }
  },
}),

generateObjectWithRetry is a thin wrapper around generateObject from the AI SDK following the same retry pattern as generateTextWithRetry.
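
A minimal sketch of that retry pattern follows. The internals of generateTextWithRetry are not shown in this issue, so this only illustrates the general shape: withRetry and makeGenerateObjectWithRetry are hypothetical names, and the generate function is injected to keep the sketch free of any SDK dependency. A real wrapper might also add backoff or distinguish retryable from fatal errors.

```typescript
interface RetryOptions {
  maxAttempts: number;
}

// Hypothetical generic helper: runs the operation until it succeeds or attempts run out.
async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  { maxAttempts }: RetryOptions,
): Promise<T> {
  if (maxAttempts < 1) throw new RangeError("maxAttempts must be at least 1");
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  // Every attempt failed: surface the most recent error.
  throw lastError;
}

// Assumed wrapper shape: in Pilo, `generate` would be the AI SDK's generateObject.
function makeGenerateObjectWithRetry<Params, Result>(
  generate: (params: Params) => Promise<Result>,
) {
  return (params: Params, options: RetryOptions): Promise<Result> =>
    withRetry(() => generate(params), options);
}
```

With maxAttempts: 3 the operation runs at most three times in total, that is, one initial call plus up to two retries.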

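The sketch above assumes a small JSON Schema → Zod converter (jsonSchemaToZod). A simpler alternative skips the conversion: the agent supplies outputSchema as JSON in a tool call, so it cannot pass a Zod schema directly, but the tool can keep outputSchema typed as z.record(z.string(), z.unknown()), include that schema in the extraction prompt, call generateObject({ output: 'no-schema' }), and validate the returned JSON against the schema afterwards. It may also be worth checking whether the AI SDK's jsonSchema() helper can wrap outputSchema for generateObject without any conversion.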
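
To illustrate the "validate after" step, here is a sketch of a validator for a small subset of JSON Schema (type, properties, required, items). JsonSchemaSubset and validateAgainstSchema are hypothetical names, and the subset is an assumption made for brevity. A real implementation would more likely use an existing JSON Schema validator such as Ajv.

```typescript
// Assumed subset of JSON Schema, just enough for lists of flat records.
type JsonSchemaSubset = {
  type?: "object" | "array" | "string" | "number" | "boolean";
  properties?: Record<string, JsonSchemaSubset>;
  required?: string[];
  items?: JsonSchemaSubset;
};

// Returns a list of human-readable problems; an empty list means the value conforms.
function validateAgainstSchema(value: unknown, schema: JsonSchemaSubset, path = "$"): string[] {
  switch (schema.type) {
    case "object": {
      if (typeof value !== "object" || value === null || Array.isArray(value)) {
        return [`${path}: expected object`];
      }
      const record = value as Record<string, unknown>;
      const errors: string[] = [];
      for (const key of schema.required ?? []) {
        if (!(key in record)) errors.push(`${path}.${key}: missing required property`);
      }
      for (const [key, child] of Object.entries(schema.properties ?? {})) {
        if (key in record) errors.push(...validateAgainstSchema(record[key], child, `${path}.${key}`));
      }
      return errors;
    }
    case "array": {
      if (!Array.isArray(value)) return [`${path}: expected array`];
      const items = schema.items;
      if (!items) return [];
      return value.flatMap((item, i) => validateAgainstSchema(item, items, `${path}[${i}]`));
    }
    case "string":
    case "number":
    case "boolean":
      return typeof value === schema.type ? [] : [`${path}: expected ${schema.type}`];
    default:
      return []; // no type constraint in this subset
  }
}
```

Returning the list of problems rather than throwing lets the tool pass them back to the model as a recoverable error, or feed them into a retry.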

D. Update prompts

In prompts.ts:163-210 (buildToolExamples):

- search_page({"pattern": "logout"}) - Search page text. Free, fast.
- find_elements({"selector": "a.nav-link"}) - Query elements by CSS selector. Free, fast.
- extract({"description": "...", "outputSchema": {...}}) - Extract data. Use outputSchema
  for structured output (lists of items, key-value pairs, etc.).

Add to best practices:

For inventory questions ("how many X are there?", "is Y on the page?"), prefer
find_elements or search_page — they are free and instant. Reserve extract for
cases where you need synthesized or structured data the page doesn't expose directly.

Implementation notes

  • These tools run via performActionWithValidation for consistency in error handling and event emission, even though they aren't "actions" in the traditional sense (no DOM mutation). The naming is a bit off but consistent with the existing pattern.
  • search_page regex compilation can throw SyntaxError on bad patterns — return { success: false, error: "...", isRecoverable: true } rather than crashing.
  • find_elements selector can throw DOMException on bad selectors — same treatment.
  • Both tools should be safe and idempotent — no pageChanged: true.
  • The result shapes are not the standard ActionResult; consider extending the type or adding a discriminated union. Worth a small refactor.
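
The error-handling and result-shape notes above could be combined roughly as follows. This is a sketch under assumptions: InspectionResult and runSearch are hypothetical names, and only success, error, and isRecoverable come from this issue. The union has no pageChanged field because both tools are read-only.

```typescript
// Hypothetical result union; only success, error, and isRecoverable appear in the issue.
type InspectionAction = "search_page" | "find_elements";

type InspectionResult<T> =
  | { success: true; action: InspectionAction; results: T[] }
  | { success: false; action: InspectionAction; error: string; isRecoverable: boolean };

function runSearch(pattern: string, text: string): InspectionResult<string> {
  let re: RegExp;
  try {
    // A malformed pattern such as "(" makes the RegExp constructor throw SyntaxError.
    re = new RegExp(pattern, "gi");
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      success: false,
      action: "search_page",
      error: `Invalid pattern: ${message}`,
      isRecoverable: true, // the agent can retry with a corrected pattern
    };
  }
  // With the g flag, match() returns every matched substring, or null when there is none.
  return { success: true, action: "search_page", results: text.match(re) ?? [] };
}

// Callers narrow on the success discriminant before touching results or error.
const outcome = runSearch("log(in|out)", "Login, then logout");
if (outcome.success) {
  console.log(outcome.results.length);
} else {
  console.log(outcome.error);
}
```

The same pattern would apply to find_elements, with the try/catch wrapped around querySelectorAll instead of the RegExp constructor.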

Acceptance criteria

  • search_page and find_elements are available in webActionTools, with the right tool descriptions and prompt examples.
  • extract accepts an optional outputSchema and returns structured data when provided.
  • Tests in packages/core/test/ cover: text search (literal and regex), CSS query for various selectors, bad-pattern error handling, structured extract with a schema.
  • A manual eval on a small task set (e.g., "find the number of pricing tiers on this page" / "extract the top 5 product names and prices") shows the new tools reduce LLM calls per task.

Effort estimate

2-4 days in total. The two zero-LLM tools are quick (about 1 day each); the remainder is the outputSchema work, which depends on how clean the JSON Schema → Zod path is.

Related issues

Pairs with the action-vocabulary-additions issue (both expand tool capabilities). Related to the modal/viewport-context issue (those tools also benefit from a clearer page model).

Files likely affected

  • packages/core/src/tools/webActionTools.ts (or new tools/inspectionTools.ts)
  • packages/core/src/browser/ariaBrowser.ts (PageAction enum)
  • packages/core/src/browser/playwrightBrowser.ts (handlers)
  • packages/core/src/prompts.ts (tool examples + best practices)
  • packages/core/src/utils/retry.ts (add generateObjectWithRetry)
  • packages/core/test/

Labels: enhancement (New feature or request)
