Work in progress
# LLM Service Requests Example
This example shows how to make requests to the LLM service using the OpenAI-compatible API.
## Installation
Install the required packages:

```bash
pip install openai
```
## Configuration
Substitute the following values with your own:

- `MODEL`: the ID of the model you want to use. Example: `Qwen/Qwen3-4B`
- `API_KEY`: your API key (if authentication is required)
- `API_URL`: the URL of the API. Example: `http://kalavai-api.public.kalavai.net/v1`
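Rather than hard-coding these values, they can be read from environment variables. The following is a minimal sketch; the variable names (`LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL`) are illustrative, not required by the service:

```python
import os

# Illustrative variable names; use whatever convention fits your deployment.
# Defaults fall back to the example values from the Configuration section.
API_URL = os.environ.get("LLM_API_URL", "http://kalavai-api.public.kalavai.net/v1")
API_KEY = os.environ.get("LLM_API_KEY", "")
MODEL = os.environ.get("LLM_MODEL", "Qwen/Qwen3-4B")
```

This keeps credentials out of source control and lets the same script target different deployments.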
## Examples

### 1. Streaming Inference
A single request with a streaming response, so output tokens are printed as soon as they are generated.
```python
from openai import OpenAI

API_URL = "http://kalavai-api.public.kalavai.net/v1"
API_KEY = "<your-api-key>"
MODEL = "Qwen/Qwen3-4B"

client = OpenAI(
    base_url=API_URL,
    api_key=API_KEY
)

def stream_chat():
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Tell me a long story"}],
        stream=True
    )
    print("Assistant:", end=" ", flush=True)
    # Print each token as soon as it arrives
    for chunk in response:
        delta = chunk.choices[0].delta
        if delta and delta.content:
            print(delta.content, end="", flush=True)
    print("\n--- Done ---")

if __name__ == "__main__":
    stream_chat()
```
### 2. Batched Inference
Multiple requests submitted concurrently; the results are displayed in bulk once all of them complete.
```python
from openai import AsyncOpenAI
import asyncio

API_URL = "http://kalavai-api.public.kalavai.net/v1"
API_KEY = "<your-api-key>"
MODEL = "Qwen/Qwen3-4B"

# The async client is required for truly concurrent requests; the
# synchronous OpenAI client would block and run them one at a time.
client = AsyncOpenAI(
    base_url=API_URL,
    api_key=API_KEY
)

async def single_request(client, model, prompt, request_id):
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def batched_inference_openai():
    prompts = [
        "What is the capital of France?",
        "Explain quantum computing",
        "Write a short poem",
        "What are the benefits of exercise?",
        "Describe the solar system"
    ]
    # Submit all requests at once, then wait for every response
    tasks = [
        asyncio.create_task(single_request(client, MODEL, prompt, i))
        for i, prompt in enumerate(prompts)
    ]
    results = await asyncio.gather(*tasks)
    for i, result in enumerate(results):
        print(f"Response {i+1}: {result[:100]}...")

if __name__ == "__main__":
    asyncio.run(batched_inference_openai())
```
## High-Performance Notes

- **Streaming**: use `stream=True` for real-time response generation
- **Batching**: use async/await patterns for concurrent requests
- **Error handling**: always include proper error handling for network requests
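As a sketch of the error-handling point, the helper below retries a request on transient failures with exponential backoff. `with_retries` is a hypothetical helper, not part of the OpenAI SDK; in practice you would pass the SDK's exception types (e.g. `openai.APIConnectionError`, `openai.RateLimitError`) as `retry_on`:

```python
import time

def with_retries(request_fn, retry_on=(Exception,), max_attempts=3, base_delay=1.0):
    """Hypothetical helper: call request_fn, retrying on the given
    exception types with exponential backoff (1x, 2x, 4x base_delay)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Example usage with the streaming client from above:

```python
reply = with_retries(
    lambda: client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Hello"}],
    ),
    retry_on=(openai.APIConnectionError, openai.RateLimitError),
)
```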