Quick Start

This guide will walk you through creating your first Agentary JS application with both on-device and cloud providers.

Choose Your Provider

Agentary JS supports two types of inference providers:

  • Device Provider: Run models locally in the browser using WebGPU or WebAssembly
  • Cloud Provider: Use cloud LLM providers (OpenAI, Anthropic) via a secure proxy

On-Device Inference

Run models directly in the browser with no server dependencies:

import { createSession } from 'agentary-js';
 
// Create a session with on-device model
const session = await createSession({
  models: [{
    type: 'device',
    model: 'onnx-community/Qwen3-0.6B-ONNX',
    quantization: 'q4',
    engine: 'webgpu' // or 'wasm'
  }]
});
 
// Generate text with streaming
const response = await session.createResponse('onnx-community/Qwen3-0.6B-ONNX', {
  messages: [{ role: 'user', content: 'Hello, how are you today?' }]
});
 
if (response.type === 'streaming') {
  let output = '';
  for await (const chunk of response.stream) {
    if (chunk.isFirst && chunk.ttfbMs) {
      console.log(`Time to first byte: ${chunk.ttfbMs}ms`);
    }
    if (!chunk.isLast) {
      // Accumulate tokens; process.stdout is not available in the browser
      output += chunk.token;
    }
  }
  console.log(output);
}
 
// Clean up resources
await session.dispose();

Cloud Provider Inference

Use cloud LLM providers via a secure proxy (API keys stay on your backend):

import { createSession } from 'agentary-js';
 
// Create a session with cloud provider
const session = await createSession({
  models: [{
    type: 'cloud',
    model: 'claude-3-5-sonnet-20241022',
    proxyUrl: 'https://your-backend.com/api/anthropic',
    modelProvider: 'anthropic',
    timeout: 30000,
    maxRetries: 3
  }]
});
 
// Generate text (streaming or non-streaming, depending on what your proxy returns)
const response = await session.createResponse('claude-3-5-sonnet-20241022', {
  messages: [{ role: 'user', content: 'Hello, how are you today?' }]
});
 
if (response.type === 'streaming') {
  let output = '';
  for await (const chunk of response.stream) {
    output += chunk.token; // accumulate; process.stdout is not available in the browser
  }
  console.log(output);
} else {
  console.log(response.content);
}
 
await session.dispose();

Note: Cloud providers require a proxy server. See the Cloud Provider Guide for setup instructions.
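
The proxy's job is to attach credentials server-side and forward the request. Below is a minimal illustrative sketch using Express and Node 18+ fetch; the exact request/response contract your proxy must implement is defined in the Cloud Provider Guide, so treat this as a shape rather than a spec:

import express from 'express';

const app = express();
app.use(express.json());

// Hypothetical endpoint matching the proxyUrl in the example above
app.post('/api/anthropic', async (req, res) => {
  // The API key stays on the server; the browser never sees it
  const upstream = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-api-key': process.env.ANTHROPIC_API_KEY,
      'anthropic-version': '2023-06-01'
    },
    body: JSON.stringify(req.body)
  });
  res.status(upstream.status).type('application/json').send(await upstream.text());
});

app.listen(3000);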

Understanding Provider Configuration

Device Provider Options

{
  type: 'device',
  model: 'onnx-community/Qwen3-0.6B-ONNX',
  quantization: 'q4',  // q4, q8, fp16, fp32
  engine: 'webgpu',    // webgpu, wasm, auto
  hfToken?: string     // Optional: for private models
}

Key options:

  • type: Must be 'device' for on-device inference
  • model: ONNX model identifier from Hugging Face
  • quantization: Model quantization level (affects size/speed/quality)
  • engine: Inference engine to use
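
For private models, supply a Hugging Face access token via hfToken. Anything shipped to the browser is visible to users, so prefer a read-only, narrowly scoped token. A hypothetical wiring through a build-time environment variable:

const session = await createSession({
  models: [{
    type: 'device',
    model: 'your-org/private-model-ONNX', // hypothetical private model id
    quantization: 'q4',
    engine: 'webgpu',
    hfToken: import.meta.env.VITE_HF_TOKEN // hypothetical Vite env wiring
  }]
});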

Cloud Provider Options

{
  type: 'cloud',
  model: 'claude-3-5-sonnet-20241022',
  proxyUrl: 'https://your-backend.com/api/anthropic',
  modelProvider: 'anthropic',  // 'anthropic' or 'openai'
  timeout: 30000,              // Optional: request timeout in ms
  maxRetries: 3,               // Optional: max retry attempts
  headers: {}                  // Optional: custom headers
}

Key options:

  • type: Must be 'cloud' for cloud-based inference
  • model: Model identifier for the provider
  • proxyUrl: Your backend proxy endpoint
  • modelProvider: The cloud provider type
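
Since the proxy is your own backend, the optional headers field is a natural place to attach your app's authentication. An illustrative sketch (the header name and token source are placeholders):

const session = await createSession({
  models: [{
    type: 'cloud',
    model: 'claude-3-5-sonnet-20241022',
    proxyUrl: 'https://your-backend.com/api/anthropic',
    modelProvider: 'anthropic',
    headers: { 'Authorization': `Bearer ${userSessionToken}` } // placeholder token variable
  }]
});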

Generate Responses

const response = await session.createResponse(modelId, {
  messages: [{ role: 'user', content: 'Hello!' }]
});
 
if (response.type === 'streaming') {
  let output = '';
  for await (const chunk of response.stream) {
    output += chunk.token; // accumulate; process.stdout is not available in the browser
  }
  console.log(output);
} else {
  console.log(response.content);
}

The createResponse method returns either a streaming or non-streaming response.

Streaming Response (TokenStreamChunk):

  • token: The generated text token
  • tokenId: Numeric token ID
  • isFirst: True for the first token
  • isLast: True for the final token
  • ttfbMs: Time to first byte (only on first chunk)
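
These fields are enough for simple client-side metrics. For example, a rough tokens-per-second estimate (an illustrative sketch, measured from the first chunk):

let tokens = 0;
let firstTokenAt = 0;
for await (const chunk of response.stream) {
  if (chunk.isFirst) firstTokenAt = performance.now();
  if (!chunk.isLast) tokens++;
}
const seconds = (performance.now() - firstTokenAt) / 1000;
console.log(`~${(tokens / seconds).toFixed(1)} tokens/sec`);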

Non-Streaming Response:

  • content: Complete generated text
  • usage: Token usage statistics
  • toolCalls: Tool/function calls (if any)
  • finishReason: Why generation stopped
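
Since createResponse can return either shape, a small helper (not part of the library, just a convenience you can define) normalizes both into a string:

async function readText(response) {
  if (response.type === 'streaming') {
    let text = '';
    for await (const chunk of response.stream) {
      if (!chunk.isLast) text += chunk.token; // final chunk carries no token in the examples above
    }
    return text;
  }
  return response.content;
}

const text = await readText(await session.createResponse(modelId, {
  messages: [{ role: 'user', content: 'Hello!' }]
}));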

Clean Up

await session.dispose();

Always dispose of sessions to free up memory and terminate workers/connections.
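
A try/finally block guarantees cleanup runs even if generation throws:

const session = await createSession({ /* ... */ });
try {
  const response = await session.createResponse(modelId, {
    messages: [{ role: 'user', content: 'Hello!' }]
  });
  // ... consume the response ...
} finally {
  await session.dispose();
}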

Multi-Provider Setup

You can register multiple providers and use them as needed:

const session = await createSession({
  models: [
    // On-device model for quick responses
    {
      type: 'device',
      model: 'onnx-community/Qwen3-0.6B-ONNX',
      quantization: 'q4',
      engine: 'webgpu'
    },
    // Cloud model for complex tasks
    {
      type: 'cloud',
      model: 'claude-3-5-sonnet-20241022',
      proxyUrl: 'https://your-backend.com/api/anthropic',
      modelProvider: 'anthropic'
    }
  ]
});
 
// Use device model for quick response
const quickResponse = await session.createResponse('onnx-community/Qwen3-0.6B-ONNX', {
  messages: [{ role: 'user', content: 'Quick question' }]
});
 
// Use cloud model for complex reasoning
const complexResponse = await session.createResponse('claude-3-5-sonnet-20241022', {
  messages: [{ role: 'user', content: 'Complex analysis task' }]
});
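
A common pattern is routing by task. Here is a hypothetical helper that keeps short prompts on-device and sends everything else to the cloud:

// Illustrative heuristic only; tune the routing rule to your workload
function pickModel(prompt) {
  return prompt.length < 200
    ? 'onnx-community/Qwen3-0.6B-ONNX'
    : 'claude-3-5-sonnet-20241022';
}

const prompt = 'Summarize this paragraph...';
const response = await session.createResponse(pickModel(prompt), {
  messages: [{ role: 'user', content: prompt }]
});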

Device Provider Configuration

Engine Selection

engine: 'auto'    // Automatically selects best available
engine: 'webgpu'  // Force WebGPU (fastest, but limited browser support)
engine: 'wasm'    // Force WebAssembly (universal compatibility)
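
If you want to decide yourself rather than rely on 'auto', a standard WebGPU feature check works (a sketch; 'auto' presumably does something similar internally):

// Use WebGPU when the browser exposes navigator.gpu, otherwise fall back to WASM
const engine = typeof navigator !== 'undefined' && 'gpu' in navigator ? 'webgpu' : 'wasm';

const session = await createSession({
  models: [{
    type: 'device',
    model: 'onnx-community/Qwen3-0.6B-ONNX',
    quantization: 'q4',
    engine
  }]
});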

Model Quantization

Quantization shrinks the model download and speeds up inference, at some cost to output quality:

Level   Size       Speed      Quality
q4      Smallest   Fastest    Good
q8      Small      Fast       Better
fp16    Medium     Moderate   Great
fp32    Large      Slow       Best

Supported Models

Check the Model Support documentation for a list of validated on-device models.

Generation Parameters

Control the output with generation parameters:

const response = await session.createResponse(modelId, {
  messages: [{ role: 'user', content: 'Write a poem' }],
  temperature: 0.7,        // Randomness (0.0-2.0)
  max_new_tokens: 200,     // Maximum tokens to generate
  top_p: 0.9,              // Nucleus sampling
  top_k: 50,               // Top-k sampling
  repetition_penalty: 1.1  // Penalty for repetition
});

Next Steps