Work in progress
LLM Service Requests Example
This example shows how to make requests to the LLM service using the OpenAI compatible API.
Installation
Install the required packages:
pip install openai
Configuration
Substitute the following values with your own:
- MODEL: The model ID of the model you want to use. Example: Qwen/Qwen3-4B
- API_KEY: Your API key (if authentication is required)
- API_URL: The URL of the API. Example: http://kalavai-api.public.kalavai.net/v1
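One way to avoid hard-coding these three values is to read them from environment variables. The sketch below is only an illustration: the variable names LLM_MODEL, LLM_API_KEY and LLM_API_URL and the "not-needed" placeholder are assumptions made for this example, not names defined by the service.

```python
import os

# Hypothetical environment variable names; rename them to match your setup.
DEFAULTS = {
    "LLM_MODEL": "Qwen/Qwen3-4B",
    "LLM_API_URL": "http://kalavai-api.public.kalavai.net/v1",
}


def load_config(env=None):
    """Return (model, api_key, api_url) read from environment variables."""
    env = os.environ if env is None else env
    model = env.get("LLM_MODEL", DEFAULTS["LLM_MODEL"])
    api_url = env.get("LLM_API_URL", DEFAULTS["LLM_API_URL"])
    # The OpenAI client needs some api_key value, so a placeholder is used
    # when none is set (assumption: the service ignores the key if
    # authentication is disabled).
    api_key = env.get("LLM_API_KEY") or "not-needed"
    return model, api_key, api_url


if __name__ == "__main__":
    MODEL, API_KEY, API_URL = load_config()
    print(MODEL, API_URL)
```

The returned values can then be passed to the client in place of the constants, for example OpenAI(base_url=API_URL, api_key=API_KEY).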
Examples
1. Streaming Inference
A single request with streaming response to get the output tokens as soon as they are generated.
from openai import OpenAI

API_URL = "http://kalavai-api.public.kalavai.net/v1"
API_KEY = "<your-api-key>"
MODEL = "Qwen/Qwen3-4B"

client = OpenAI(
    base_url=API_URL,
    api_key=API_KEY
)


def stream_chat():
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Tell me a long story"}],
        stream=True
    )
    print("Assistant:", end=" ", flush=True)
    for chunk in response:
        delta = chunk.choices[0].delta
        if delta and delta.content:
            print(delta.content, end="", flush=True)
    print("\n--- Done ---")


if __name__ == "__main__":
    stream_chat()
2. Batched Inference
Multiple requests are submitted concurrently, and the results are displayed together once all of them have completed.
import asyncio

from openai import AsyncOpenAI

API_URL = "http://kalavai-api.public.kalavai.net/v1"
API_KEY = "<your-api-key>"
MODEL = "Qwen/Qwen3-4B"

# Use the async client: the synchronous client would block the event loop,
# so the requests would run one after another instead of concurrently.
client = AsyncOpenAI(
    base_url=API_URL,
    api_key=API_KEY
)


async def single_request(client, model, prompt, request_id):
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content


async def batched_inference_openai():
    prompts = [
        "What is the capital of France?",
        "Explain quantum computing",
        "Write a short poem",
        "What are the benefits of exercise?",
        "Describe the solar system"
    ]
    tasks = []
    for i, prompt in enumerate(prompts):
        task = asyncio.create_task(single_request(client, MODEL, prompt, i))
        tasks.append(task)
    # gather returns the results in the same order as the prompts
    results = await asyncio.gather(*tasks)
    for i, result in enumerate(results):
        print(f"Response {i+1}: {result[:100]}...")


if __name__ == "__main__":
    asyncio.run(batched_inference_openai())
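When the list of prompts is long, it may be preferable to cap how many requests are in flight at once. The following is a minimal sketch of that idea using asyncio.Semaphore; run_bounded and fake_request are hypothetical helpers introduced for this example, and fake_request stands in for a real API call so the sketch runs offline.

```python
import asyncio


async def run_bounded(coro_fn, items, limit=4):
    """Call coro_fn(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await coro_fn(item)

    # gather preserves input order, so results[i] belongs to items[i]
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_request(prompt):
    # Stand-in for a real API call so the sketch runs without a server.
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


if __name__ == "__main__":
    print(asyncio.run(run_bounded(fake_request, ["a", "b", "c"], limit=2)))
```

To use it with the service, fake_request would be replaced by a coroutine that awaits the chat completion call for a single prompt.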
3. Custom parameters
Some models support custom parameters that can be passed in the extra_body field. For example, Qwen3 models accept an enable_thinking parameter, passed under chat_template_kwargs; setting it to False disables reasoning mode.
from openai import OpenAI

API_URL = "http://kalavai-api.public.kalavai.net/v1"
API_KEY = "<your-api-key>"
MODEL = "Qwen/Qwen3-4B"

client = OpenAI(
    base_url=API_URL,
    api_key=API_KEY
)


def stream_chat():
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True,
        # extra_body carries fields that are not part of the standard request
        extra_body={
            "chat_template_kwargs": {
                "enable_thinking": False
            }
        }
    )
    for chunk in response:
        delta = chunk.choices[0].delta
        if delta and delta.content:
            print(delta.content, end="", flush=True)
        # Print any reasoning tokens the server still returns
        if delta and hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            print(delta.reasoning_content, end="", flush=True)
        if delta and hasattr(delta, 'reasoning') and delta.reasoning:
            print(delta.reasoning, end="", flush=True)
    print("\n--- Done ---")


if __name__ == "__main__":
    stream_chat()
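The example above prints reasoning tokens and answer tokens to the same output. If they need to be kept apart, the stream can be accumulated into two strings instead. This is a rough sketch: collect_stream and fake_chunk are hypothetical helpers, the field names reasoning_content and reasoning are taken from the example above, and which of them (if any) is populated depends on the server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning, answer) strings."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta
        if delta is None:
            continue
        # Either field may hold reasoning text, depending on the server.
        thought = (getattr(delta, "reasoning_content", None)
                   or getattr(delta, "reasoning", None))
        if thought:
            reasoning_parts.append(thought)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(**fields):
    # Hypothetical stand-in for a streamed chunk, used to run the sketch offline.
    delta = SimpleNamespace(**fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk(reasoning_content="Plan. "), fake_chunk(content="Once upon a time")]
    print(collect_stream(demo))
```

With a real stream, the call would be reasoning, answer = collect_stream(response), where response is the object returned by client.chat.completions.create(..., stream=True).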
OpenAI compatible API
Because the service exposes an OpenAI compatible API, you can interact with it using the same code you would use with the OpenAI API, including the methods and parameters that customise your inference calls.
For a detailed view of the OpenAI compatible API, supported methods and parameters, see the LiteLLM API Documentation.
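To make the compatibility concrete, the sketch below builds the kind of HTTP request the OpenAI client sends for a chat completion: a JSON POST to the chat/completions path under the base URL, with the key in a Bearer header. It assumes the service serves that same path; build_chat_request is a hypothetical helper, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_chat_request(api_url, api_key, model, prompt):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=api_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request(
        "http://kalavai-api.public.kalavai.net/v1",
        "<your-api-key>",
        "Qwen/Qwen3-4B",
        "Hello",
    )
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; in practice the openai package does this work for you.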
High-Performance notes
- Streaming: Use stream=True for real-time response generation
- Batching: Use async/await patterns for concurrent requests
- Error Handling: Always include proper error handling for network requests
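As an illustration of the error handling note, here is a minimal retry helper with exponential backoff. It is a generic sketch rather than part of the service or the openai package: call_with_retries and its parameters are hypothetical, and the default of retrying on any Exception is deliberately broad and should be narrowed in real code.

```python
import time


def call_with_retries(fn, retries=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception, wait and retry with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to simulate a transient network error.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(call_with_retries(flaky, base_delay=0.1))
```

With the openai package, fn could be a lambda wrapping client.chat.completions.create(...), and retry_on could be restricted to exceptions such as openai.APIConnectionError and openai.RateLimitError.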