LLM Calls API

The LLM Calls API provides endpoints for making both non-streaming and streaming LLM calls within job contexts. All calls are tracked, aggregated, and billed at the job level.

Overview

LLM calls are always made within the context of a job. The API supports:

  • Non-streaming calls - Standard request/response pattern
  • Streaming calls - Real-time Server-Sent Events (SSE) streaming
  • Model group resolution - Automatic model selection based on team permissions
  • OpenAI-compatible format - Standard messages format
  • Cost tracking - Automatic tracking of tokens and costs

Base URL: /api/jobs/{job_id}

Authentication: All endpoints require a Bearer token (virtual API key) in the Authorization header.

Endpoints

Non-Streaming LLM Call

Make a standard LLM call within a job context.

Endpoint: POST /api/jobs/{job_id}/llm-call

Authentication: Required (virtual key)

Path Parameters:

Parameter Type Description
job_id string (UUID) The job identifier

Request Body:

{
  "model_group": "ResumeAgent",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Parse this resume..."
    }
  ],
  "purpose": "resume_parsing",
  "temperature": 0.7,
  "max_tokens": 1000
}

Request Fields:

Field Type Required Description
model_group string Yes Name of model group (e.g., "ResumeAgent", "ChatAgent")
messages array Yes OpenAI-compatible messages array
messages[].role string Yes Message role: "system", "user", or "assistant"
messages[].content string Yes Message content
purpose string No Optional label for tracking (e.g., "parsing", "analysis")
temperature number No Sampling temperature (0.0-2.0, default: 0.7)
max_tokens integer No Maximum tokens to generate (optional)

Response (200 OK):

{
  "call_id": "call-uuid-123",
  "response": {
    "content": "Here is the parsed resume information...",
    "finish_reason": "stop"
  },
  "metadata": {
    "tokens_used": 450,
    "latency_ms": 1250,
    "model_group": "ResumeAgent"
  }
}

Response Fields:

Field Type Description
call_id string (UUID) Unique identifier for this LLM call
response.content string The generated response content
response.finish_reason string Why generation stopped: "stop", "length", or "content_filter"
metadata.tokens_used integer Total tokens used (prompt + completion)
metadata.latency_ms integer Call latency in milliseconds
metadata.model_group string Model group that was used

Example Request:

cURL:

curl -X POST http://localhost:8003/api/jobs/{job_id}/llm-call \
  -H "Authorization: Bearer sk-your-virtual-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model_group": "ResumeAgent",
    "messages": [
      {
        "role": "system",
        "content": "You are a resume parsing assistant."
      },
      {
        "role": "user",
        "content": "Extract key skills from this resume: ..."
      }
    ],
    "purpose": "skill_extraction",
    "temperature": 0.3
  }'

Python:

import requests

API_URL = "http://localhost:8003/api"
VIRTUAL_KEY = "sk-your-virtual-key"
job_id = "your-job-id"  # UUID of an existing job

headers = {
    "Authorization": f"Bearer {VIRTUAL_KEY}",
    "Content-Type": "application/json"
}

response = requests.post(
    f"{API_URL}/jobs/{job_id}/llm-call",
    headers=headers,
    json={
        "model_group": "ResumeAgent",
        "messages": [
            {
                "role": "system",
                "content": "You are a resume parsing assistant."
            },
            {
                "role": "user",
                "content": "Extract key skills from this resume: ..."
            }
        ],
        "purpose": "skill_extraction",
        "temperature": 0.3
    }
)

result = response.json()
print(result['response']['content'])
print(f"Tokens used: {result['metadata']['tokens_used']}")
print(f"Latency: {result['metadata']['latency_ms']}ms")

JavaScript:

const API_URL = "http://localhost:8003/api";
const VIRTUAL_KEY = "sk-your-virtual-key";
const jobId = "your-job-id";  // UUID of an existing job

const response = await fetch(`${API_URL}/jobs/${jobId}/llm-call`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${VIRTUAL_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model_group: 'ResumeAgent',
    messages: [
      {
        role: 'system',
        content: 'You are a resume parsing assistant.'
      },
      {
        role: 'user',
        content: 'Extract key skills from this resume: ...'
      }
    ],
    purpose: 'skill_extraction',
    temperature: 0.3
  })
});

const result = await response.json();
console.log(result.response.content);
console.log(`Tokens used: ${result.metadata.tokens_used}`);
console.log(`Latency: ${result.metadata.latency_ms}ms`);

Error Responses:

Status Code Error Description
401 Unauthorized Invalid or missing virtual key
403 Forbidden Job does not belong to your team, or model group not allowed
404 Not Found Job not found
422 Validation Error Invalid request data
500 Internal Server Error LLM call failed or server error

Example Error Response:

{
  "detail": "Team 'acme-corp' does not have access to model group 'GPT4Agent'"
}
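
A minimal error-handling sketch for the non-streaming endpoint, reusing API_URL and headers from the Python example above; the "detail" field shape is assumed to match the example error response:

def call_llm(job_id, payload):
    """POST to the non-streaming endpoint and raise with the API's detail message on failure."""
    resp = requests.post(
        f"{API_URL}/jobs/{job_id}/llm-call",
        headers=headers,
        json=payload,
        timeout=120,
    )
    if resp.status_code == 200:
        return resp.json()
    try:
        detail = resp.json().get("detail", resp.text)  # error shape assumed from the example above
    except ValueError:
        detail = resp.text
    raise RuntimeError(f"LLM call failed ({resp.status_code}): {detail}")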

Streaming LLM Call

Make a streaming LLM call with real-time Server-Sent Events (SSE).

Endpoint: POST /api/jobs/{job_id}/llm-call-stream

Authentication: Required (virtual key)

Path Parameters:

Parameter Type Description
job_id string (UUID) The job identifier

Request Body:

{
  "model_group": "ChatAgent",
  "messages": [
    {
      "role": "user",
      "content": "Tell me a story"
    }
  ],
  "purpose": "chat",
  "temperature": 0.8,
  "max_tokens": 500
}

Request Fields: (Same as non-streaming call)

Field Type Required Description
model_group string Yes Name of model group
messages array Yes OpenAI-compatible messages array
purpose string No Optional label for tracking
temperature number No Sampling temperature (0.0-2.0, default: 0.7)
max_tokens integer No Maximum tokens to generate

Response Headers:

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
X-Accel-Buffering: no
Connection: keep-alive

Response Format (Server-Sent Events):

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1697896000,"model":"gpt-4","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1697896000,"model":"gpt-4","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1697896000,"model":"gpt-4","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1697896000,"model":"gpt-4","choices":[{"index":0,"delta":{"content":" time"},"finish_reason":null}]}

data: [DONE]

SSE Chunk Format:

Each chunk is a JSON object prefixed with "data: ":

{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1697896000,
  "model": "gpt-4-turbo",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Hello"
      },
      "finish_reason": null
    }
  ]
}

Chunk Fields:

Field Type Description
id string Unique completion ID
object string Always "chat.completion.chunk" for streaming
created integer Unix timestamp
model string Actual model used (resolved from model group)
choices[].index integer Choice index (always 0)
choices[].delta.role string Role (only present in first chunk: "assistant")
choices[].delta.content string Incremental text content
choices[].finish_reason string null during streaming, "stop"/"length" at end

Final Chunk:

The last chunk includes usage metadata:

{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1697896000,
  "model": "gpt-4-turbo",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 125,
    "completion_tokens": 450,
    "total_tokens": 575
  }
}
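
If you track token usage client-side, a minimal sketch for pulling it from the stream, assuming the usage object appears only on the final chunk as shown above:

import json

def extract_usage(data_str):
    """Return the usage dict if this SSE payload is the final chunk, else None."""
    if data_str == "[DONE]":
        return None
    chunk = json.loads(data_str)
    return chunk.get("usage")  # only present on the final chunk (assumed)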

Stream Termination:

The stream ends with:

data: [DONE]

Example Request:

Python (requests):

import requests
import json

API_URL = "http://localhost:8003/api"
VIRTUAL_KEY = "sk-your-virtual-key"
job_id = "your-job-id"  # UUID of an existing job

headers = {
    "Authorization": f"Bearer {VIRTUAL_KEY}",
    "Content-Type": "application/json"
}

# Make streaming request
response = requests.post(
    f"{API_URL}/jobs/{job_id}/llm-call-stream",
    headers=headers,
    json={
        "model_group": "ChatAgent",
        "messages": [
            {"role": "user", "content": "Tell me a story"}
        ],
        "temperature": 0.8
    },
    stream=True  # Important: enable streaming
)

# Process Server-Sent Events
accumulated = ""
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data_str = line[6:]  # Remove 'data: ' prefix

            if data_str == '[DONE]':
                print("\n\nStream complete!")
                break

            try:
                chunk = json.loads(data_str)
                if chunk.get("choices"):
                    delta = chunk["choices"][0].get("delta", {})
                    content = delta.get("content", "")
                    if content:
                        accumulated += content
                        print(content, end="", flush=True)
            except json.JSONDecodeError:
                continue

print(f"\n\nFull response: {accumulated}")

Python (typed client):

from examples.typed_client import SaaSLLMClient

async def streaming_example():
    async with SaaSLLMClient(
        base_url="http://localhost:8003",
        team_id="acme-corp",
        virtual_key="sk-your-virtual-key"
    ) as client:
        # Create job
        job_id = await client.create_job("chat")

        # Stream response
        accumulated = ""
        async for chunk in client.chat_stream(
            job_id=job_id,
            messages=[
                {"role": "user", "content": "Tell me a story"}
            ],
            temperature=0.8
        ):
            if chunk.choices:
                delta = chunk.choices[0].delta
                content = delta.get("content", "")
                if content:
                    accumulated += content
                    print(content, end="", flush=True)

        print(f"\n\nFull response: {accumulated}")

        # Complete job
        result = await client.complete_job(job_id, "completed")
        print(f"Credits remaining: {result.credits_remaining}")

import asyncio
asyncio.run(streaming_example())

JavaScript:

// Assumes API_URL and VIRTUAL_KEY are defined as in the non-streaming example
async function streamChat(jobId, messages) {
  const response = await fetch(`${API_URL}/jobs/${jobId}/llm-call-stream`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${VIRTUAL_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model_group: 'ChatAgent',
      messages: messages,
      temperature: 0.8
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  let accumulated = '';
  let buffer = '';  // holds any partial SSE line split across network chunks

  while (true) {
    const {done, value} = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, {stream: true});
    const lines = buffer.split('\n');
    buffer = lines.pop();  // keep the trailing partial line for the next read

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.substring(6);

        if (data === '[DONE]') {
          console.log('\nStream complete');
          return accumulated;
        }

        try {
          const chunk = JSON.parse(data);
          const content = chunk.choices?.[0]?.delta?.content || '';
          if (content) {
            accumulated += content;
            process.stdout.write(content);  // Node.js
            // Or: document.getElementById('output').textContent += content;  // Browser
          }
        } catch (e) {
          // Ignore parse errors
        }
      }
    }
  }

  return accumulated;
}

Error Handling:

Errors during streaming are sent as error chunks:

{
  "error": {
    "message": "Model timeout exceeded",
    "type": "timeout_error",
    "code": "model_timeout"
  }
}
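
A small helper for spotting these error chunks inside the SSE loop from the Python example above; the error shape is assumed to match the object shown:

import json

def check_stream_error(data_str):
    """Raise if the SSE payload is an error chunk rather than a completion chunk."""
    if data_str == "[DONE]":
        return
    chunk = json.loads(data_str)
    if "error" in chunk:
        err = chunk["error"]
        raise RuntimeError(f"stream error ({err.get('code')}): {err.get('message')}")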

Error Responses:

Status Code Error Description
401 Unauthorized Invalid or missing virtual key
403 Forbidden Job does not belong to your team, or model group not allowed
404 Not Found Job not found
422 Validation Error Invalid request data
500 Internal Server Error Stream failed or server error

Model Parameters

Temperature

Controls randomness in the output.

  • Range: 0.0 to 2.0
  • Default: 0.7
  • Lower values (0.0-0.5): More deterministic, focused responses
  • Higher values (0.8-1.5): More creative, varied responses

Examples:

# Factual, precise response
response = requests.post(
    f"{API_URL}/jobs/{job_id}/llm-call",
    headers=headers,
    json={
        "model_group": "AnalysisAgent",
        "messages": [{"role": "user", "content": "What is photosynthesis?"}],
        "temperature": 0.2  # Low temperature for facts
    }
)

# Creative, varied response
response = requests.post(
    f"{API_URL}/jobs/{job_id}/llm-call",
    headers=headers,
    json={
        "model_group": "ChatAgent",
        "messages": [{"role": "user", "content": "Write a creative story"}],
        "temperature": 1.2  # High temperature for creativity
    }
)

Max Tokens

Limits the maximum number of tokens to generate.

  • Type: Integer
  • Default: Varies by model (typically 1000-4000)
  • Use cases: Limit response length, control costs

Examples:

# Short summary
response = requests.post(
    f"{API_URL}/jobs/{job_id}/llm-call",
    headers=headers,
    json={
        "model_group": "SummaryAgent",
        "messages": [{"role": "user", "content": "Summarize this article..."}],
        "max_tokens": 150  # Limit to ~150 tokens
    }
)

# Long-form content
response = requests.post(
    f"{API_URL}/jobs/{job_id}/llm-call",
    headers=headers,
    json={
        "model_group": "WritingAgent",
        "messages": [{"role": "user", "content": "Write a detailed article..."}],
        "max_tokens": 2000  # Allow up to 2000 tokens
    }
)

Messages Format

The messages array follows the OpenAI format:

System Message:

Sets the assistant's behavior and context.

{
  "role": "system",
  "content": "You are a helpful assistant that specializes in resume analysis."
}

User Message:

User input or query.

{
  "role": "user",
  "content": "Parse this resume and extract key skills."
}

Assistant Message:

Previous assistant responses (for multi-turn conversations).

{
  "role": "assistant",
  "content": "I've identified the following skills: Python, SQL, Machine Learning..."
}

Complete Multi-Turn Example:

messages = [
    {
        "role": "system",
        "content": "You are a Python tutor."
    },
    {
        "role": "user",
        "content": "What is a list comprehension?"
    },
    {
        "role": "assistant",
        "content": "A list comprehension is a concise way to create lists in Python..."
    },
    {
        "role": "user",
        "content": "Can you show me an example?"
    }
]
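
To continue the conversation, append the assistant's reply from each response before the next call. A minimal sketch, reusing API_URL, headers, and job_id from the examples above:

reply = requests.post(
    f"{API_URL}/jobs/{job_id}/llm-call",
    headers=headers,
    json={"model_group": "ChatAgent", "messages": messages, "purpose": "tutoring"},
).json()

# Carry the assistant's answer forward so the next turn has full context
messages.append({"role": "assistant", "content": reply["response"]["content"]})
messages.append({"role": "user", "content": "Can you show a nested example?"})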

Model Group Resolution

Model groups abstract the actual model selection, allowing you to:

  • Change models without code changes - Update model group configuration
  • Control costs - Use different models for different teams
  • Manage permissions - Restrict which teams can use which models
  • Implement fallbacks - Automatically fall back to alternative models

Example:

If "ResumeAgent" model group is configured with: - Primary: gpt-4-turbo - Fallbacks: gpt-3.5-turbo, claude-3-sonnet

Your call to "ResumeAgent" will: 1. Attempt gpt-4-turbo first 2. Fallback to gpt-3.5-turbo if primary fails 3. Fallback to claude-3-sonnet if both fail

See Model Groups API for configuration details.


Complete Example: Multi-Step Job

import requests

API_URL = "http://localhost:8003/api"
VIRTUAL_KEY = "sk-your-virtual-key"

headers = {
    "Authorization": f"Bearer {VIRTUAL_KEY}",
    "Content-Type": "application/json"
}

# 1. Create job
job = requests.post(
    f"{API_URL}/jobs/create",
    headers=headers,
    json={
        "team_id": "acme-corp",
        "job_type": "document_analysis"
    }
).json()

job_id = job["job_id"]
print(f"Created job: {job_id}")

# 2. Make multiple LLM calls
steps = [
    ("Parse document", "Extract key information from this document..."),
    ("Classify content", "Classify the content type..."),
    ("Generate summary", "Generate a concise summary...")
]

for purpose, prompt in steps:
    response = requests.post(
        f"{API_URL}/jobs/{job_id}/llm-call",
        headers=headers,
        json={
            "model_group": "AnalysisAgent",
            "messages": [
                {"role": "user", "content": prompt}
            ],
            "purpose": purpose,
            "temperature": 0.3
        }
    ).json()

    print(f"{purpose}: {response['response']['content'][:50]}...")
    print(f"  Tokens: {response['metadata']['tokens_used']}")

# 3. Complete job
result = requests.post(
    f"{API_URL}/jobs/{job_id}/complete",
    headers=headers,
    json={"status": "completed"}
).json()

print(f"\nJob completed!")
print(f"Total calls: {result['costs']['total_calls']}")
print(f"Total tokens: {result['costs']['total_tokens']}")
print(f"Total cost: ${result['costs']['total_cost_usd']:.4f}")
print(f"Credit deducted: {result['costs']['credit_applied']}")
print(f"Credits remaining: {result['costs']['credits_remaining']}")

Streaming vs Non-Streaming Comparison

Feature Non-Streaming Streaming
Latency (perceived) High (~2000ms TTFT) Low (~300-500ms TTFT)
User Experience Wait for complete response Progressive display
Implementation Simpler More complex
Use Case Batch processing Interactive apps
Buffering Full response buffered Zero buffering
Credits 1 per completed job 1 per completed job
Cost Same Same

When to use non-streaming:

  • Batch processing jobs
  • Background tasks
  • Simple integrations
  • When the full response is needed before processing

When to use streaming:

  • Chat interfaces
  • Real-time user interactions
  • Long-form content generation
  • Lower perceived-latency requirements

Rate Limiting

LLM calls are subject to rate limiting:

  • Requests per minute (RPM): Configurable per team
  • Tokens per minute (TPM): Configurable per team
  • Default: 60 RPM, 60,000 TPM

When rate limited, you'll receive a 429 Too Many Requests response.
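
A retry sketch with exponential backoff for 429 responses, reusing API_URL and headers from the examples above; whether the API sets a Retry-After header is not documented here, so treat that lookup as an assumption:

import time

import requests

def call_with_retry(job_id, payload, max_retries=5):
    """POST to the non-streaming endpoint, backing off when rate limited (429)."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(
            f"{API_URL}/jobs/{job_id}/llm-call",
            headers=headers,
            json=payload,
            timeout=120,
        )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After if the server sends it (assumption); otherwise back off exponentially
        time.sleep(float(resp.headers.get("Retry-After", delay)))
        delay *= 2
    raise RuntimeError("rate limited: retries exhausted")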

Best Practices

  1. Use appropriate model groups - Select the right model group for your use case
  2. Set reasonable temperatures - Lower for facts, higher for creativity
  3. Limit max_tokens - Control response length and costs
  4. Add purpose labels - Track different types of calls for analytics
  5. Handle errors gracefully - Implement retry logic with exponential backoff
  6. Use streaming for UX - Provide better user experience with real-time feedback
  7. Always complete jobs - Ensure jobs are completed to trigger proper billing (see the sketch after this list)
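
A sketch combining practices 5 and 7: run a job's LLM calls inside try/finally so the job is always completed, even when a call fails. It reuses API_URL and headers from the examples above, and the "failed" status value is an assumption:

import requests

def run_job(job_id, payloads):
    """Run a sequence of LLM calls and always complete the job afterwards."""
    status = "completed"
    try:
        for payload in payloads:
            resp = requests.post(
                f"{API_URL}/jobs/{job_id}/llm-call",
                headers=headers,
                json=payload,
                timeout=120,
            )
            resp.raise_for_status()
    except requests.RequestException:
        status = "failed"  # assumed status value; adjust to your deployment's conventions
        raise
    finally:
        # Completing the job triggers aggregation and billing
        requests.post(
            f"{API_URL}/jobs/{job_id}/complete",
            headers=headers,
            json={"status": status},
            timeout=30,
        )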

See Also