Quick Start
This guide will walk you through creating your first Agentary JS application with both on-device and cloud providers.
Choose Your Provider
Agentary JS supports two types of inference providers:
- Device Provider: Run models locally in the browser using WebGPU or WebAssembly
- Cloud Provider: Use cloud LLM providers (OpenAI, Anthropic) via a secure proxy
On-Device Inference
Run models directly in the browser with no server dependencies:
```js
import { createSession } from 'agentary-js';

// Create a session with an on-device model
const session = await createSession({
  models: [{
    type: 'device',
    model: 'onnx-community/Qwen3-0.6B-ONNX',
    quantization: 'q4',
    engine: 'webgpu' // or 'wasm'
  }]
});

// Generate text with streaming
const response = await session.createResponse('onnx-community/Qwen3-0.6B-ONNX', {
  messages: [{ role: 'user', content: 'Hello, how are you today?' }]
});

if (response.type === 'streaming') {
  for await (const chunk of response.stream) {
    if (chunk.isFirst && chunk.ttfbMs) {
      console.log(`Time to first byte: ${chunk.ttfbMs}ms`);
    }
    if (!chunk.isLast) {
      process.stdout.write(chunk.token);
    }
  }
}

// Clean up resources
await session.dispose();
```

Cloud Provider Inference
Use cloud LLM providers via a secure proxy (API keys stay on your backend):
```js
import { createSession } from 'agentary-js';

// Create a session with a cloud provider
const session = await createSession({
  models: [{
    type: 'cloud',
    model: 'claude-3-5-sonnet-20241022',
    proxyUrl: 'https://your-backend.com/api/anthropic',
    modelProvider: 'anthropic',
    timeout: 30000,
    maxRetries: 3
  }]
});

// Generate text (streaming or non-streaming based on proxy)
const response = await session.createResponse('claude-3-5-sonnet-20241022', {
  messages: [{ role: 'user', content: 'Hello, how are you today?' }]
});

if (response.type === 'streaming') {
  for await (const chunk of response.stream) {
    process.stdout.write(chunk.token);
  }
} else {
  console.log(response.content);
}

await session.dispose();
```

Note: Cloud providers require a proxy server. See the Cloud Provider Guide for setup instructions.
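The exact request/response contract your proxy needs is covered in the Cloud Provider Guide. As a rough illustration only, here is a minimal sketch of an Express route that forwards the request body to Anthropic's Messages API so the API key never reaches the browser (the route path, port, and environment variable name are our own choices, not Agentary JS requirements):

```js
// Illustrative proxy sketch, not the official setup: relay the browser's
// request to Anthropic's Messages API, keeping the API key server-side.
import express from 'express';

const app = express();
app.use(express.json());

app.post('/api/anthropic', async (req, res) => {
  const upstream = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-api-key': process.env.ANTHROPIC_API_KEY, // never shipped to the client
      'anthropic-version': '2023-06-01'
    },
    body: JSON.stringify(req.body)
  });

  // Relay status, content type, and body back to the browser.
  res.status(upstream.status);
  res.set('content-type', upstream.headers.get('content-type') ?? 'application/json');
  res.send(Buffer.from(await upstream.arrayBuffer()));
});

app.listen(3000);
```

A production proxy would likely stream the upstream response rather than buffering it as this sketch does for brevity.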
Understanding Provider Configuration
Device Provider Options
```js
{
  type: 'device',
  model: 'onnx-community/Qwen3-0.6B-ONNX',
  quantization: 'q4', // q4, q8, fp16, fp32
  engine: 'webgpu', // webgpu, wasm, auto
  hfToken?: string // Optional: for private models
}
```

Key options:
- `type`: Must be `'device'` for on-device inference
- `model`: ONNX model identifier from Hugging Face
- `quantization`: Model quantization level (affects size/speed/quality)
- `engine`: Inference engine to use (see the detection sketch below)
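Note that `engine: 'auto'` already picks the best available engine for you. If you want explicit control, you can feature-detect WebGPU yourself. A minimal sketch (`pickEngine` is our own helper, not part of Agentary JS):

```js
// Hypothetical helper: prefer WebGPU when the browser exposes it,
// otherwise fall back to WebAssembly. navigator.gpu is only defined
// in browsers with WebGPU support.
function pickEngine() {
  return typeof navigator !== 'undefined' && 'gpu' in navigator ? 'webgpu' : 'wasm';
}

const session = await createSession({
  models: [{
    type: 'device',
    model: 'onnx-community/Qwen3-0.6B-ONNX',
    quantization: 'q4',
    engine: pickEngine()
  }]
});
```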
Cloud Provider Options
```js
{
  type: 'cloud',
  model: 'claude-3-5-sonnet-20241022',
  proxyUrl: 'https://your-backend.com/api/anthropic',
  modelProvider: 'anthropic', // 'anthropic' or 'openai'
  timeout: 30000, // Optional: request timeout in ms
  maxRetries: 3, // Optional: max retry attempts
  headers: {} // Optional: custom headers
}
```

Key options:
- `type`: Must be `'cloud'` for cloud-based inference
- `model`: Model identifier for the provider
- `proxyUrl`: Your backend proxy endpoint
- `modelProvider`: The cloud provider type
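The optional `headers` field is useful for authenticating with your own proxy. A hedged example, assuming your backend checks a bearer token (`userSessionToken` is a hypothetical variable you would supply):

```js
// Illustrative: the token authenticates the browser with *your* proxy.
// The LLM provider's API key still lives only on the backend.
const session = await createSession({
  models: [{
    type: 'cloud',
    model: 'claude-3-5-sonnet-20241022',
    proxyUrl: 'https://your-backend.com/api/anthropic',
    modelProvider: 'anthropic',
    headers: { Authorization: `Bearer ${userSessionToken}` } // hypothetical token
  }]
});
```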
Generate Responses
```js
const response = await session.createResponse(modelId, {
  messages: [{ role: 'user', content: 'Hello!' }]
});

if (response.type === 'streaming') {
  for await (const chunk of response.stream) {
    process.stdout.write(chunk.token);
  }
} else {
  console.log(response.content);
}
```

The createResponse method returns either a streaming or non-streaming response.
Streaming Response (TokenStreamChunk):
- `token`: The generated text token
- `tokenId`: Numeric token ID
- `isFirst`: True for the first token
- `isLast`: True for the final token
- `ttfbMs`: Time to first byte (only on first chunk)
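These fields make it easy to collect simple client-side metrics. A sketch, assuming the response is streaming and that each non-final chunk carries exactly one token (the throughput math is ours, not part of the API):

```js
// Measure tokens per second from the chunk flags.
let tokenCount = 0;
let startMs = 0;

for await (const chunk of response.stream) {
  if (chunk.isFirst) startMs = performance.now();
  if (chunk.isLast) {
    const seconds = (performance.now() - startMs) / 1000;
    console.log(`${tokenCount} tokens in ${seconds.toFixed(2)}s (${(tokenCount / seconds).toFixed(1)} tok/s)`);
  } else {
    tokenCount += 1;
  }
}
```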
Non-Streaming Response:
- `content`: Complete generated text
- `usage`: Token usage statistics
- `toolCalls`: Tool/function calls (if any)
- `finishReason`: Why generation stopped
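Because createResponse can return either shape, it is often convenient to normalize both into a single string. A small sketch (`readResponse` is our own helper, not part of the API):

```js
// Hypothetical helper: drain a streaming response, or return the
// non-streaming content directly.
async function readResponse(response) {
  if (response.type === 'streaming') {
    let text = '';
    for await (const chunk of response.stream) {
      if (!chunk.isLast) text += chunk.token;
    }
    return text;
  }
  return response.content;
}

const text = await readResponse(
  await session.createResponse(modelId, {
    messages: [{ role: 'user', content: 'Hello!' }]
  })
);
```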
Clean Up
```js
await session.dispose();
```

Always dispose of sessions to free up memory and terminate workers/connections.
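To guarantee cleanup even when generation throws, a try/finally block is a simple pattern:

```js
const session = await createSession({ models: [/* ... */] });
try {
  const response = await session.createResponse(/* ... */);
  // ... consume the response ...
} finally {
  await session.dispose(); // runs even if an error was thrown above
}
```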
Multi-Provider Setup
You can register multiple providers and use them as needed:
```js
const session = await createSession({
  models: [
    // On-device model for quick responses
    {
      type: 'device',
      model: 'onnx-community/Qwen3-0.6B-ONNX',
      quantization: 'q4',
      engine: 'webgpu'
    },
    // Cloud model for complex tasks
    {
      type: 'cloud',
      model: 'claude-3-5-sonnet-20241022',
      proxyUrl: 'https://your-backend.com/api/anthropic',
      modelProvider: 'anthropic'
    }
  ]
});

// Use device model for quick response
const quickResponse = await session.createResponse('onnx-community/Qwen3-0.6B-ONNX', {
  messages: [{ role: 'user', content: 'Quick question' }]
});

// Use cloud model for complex reasoning
const complexResponse = await session.createResponse('claude-3-5-sonnet-20241022', {
  messages: [{ role: 'user', content: 'Complex analysis task' }]
});
```

Device Provider Configuration
Engine Selection
```js
engine: 'auto'   // Automatically selects best available
engine: 'webgpu' // Force WebGPU (fastest, but limited browser support)
engine: 'wasm'   // Force WebAssembly (universal compatibility)
```

Model Quantization
Quantization reduces model size and speeds up inference, at some cost to output quality:
| Level | Size | Speed | Quality |
|---|---|---|---|
| q4 | Smallest | Fastest | Good |
| q8 | Small | Fast | Better |
| fp16 | Medium | Moderate | Great |
| fp32 | Large | Slow | Best |
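As a rough rule of thumb, weight memory is parameters × bytes per weight: a 0.6B-parameter model such as Qwen3-0.6B needs on the order of 0.3 GB at q4 (4 bits, i.e. 0.5 bytes per weight) versus roughly 2.4 GB at fp32 (4 bytes per weight), ignoring activations and runtime overhead. These are back-of-the-envelope estimates, not measured figures.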
Supported Models
Check the Model Support documentation for a list of validated on-device models.
Generation Parameters
Control the output with generation parameters:
```js
const response = await session.createResponse(modelId, {
  messages: [{ role: 'user', content: 'Write a poem' }],
  temperature: 0.7,        // Randomness (0.0-2.0)
  max_new_tokens: 200,     // Maximum tokens to generate
  top_p: 0.9,              // Nucleus sampling
  top_k: 50,               // Top-k sampling
  repetition_penalty: 1.1  // Penalty for repetition
});
```

Next Steps
- Learn about Core Concepts
- Set up a Cloud Provider
- Explore Tool Calling
- Build Agentic Workflows