Public Petals Swarm: BitTorrent-style LLMs
Contribute to the public Petals swarm and help deploy and fine-tune Large Language Models across consumer-grade devices. See more about the Petals project here. You'll get:
- Eternal kudos from the community!
- Access to all the models in the server
- Easy access for inference (via Petals SDK and installation-free Kalavai endpoint).
Requirements
- A free Kalavai account. Create one here.
- A computer with the minimum requirements (see below)
Hardware requirements
- 1+ NVIDIA GPU
- 2+ CPUs
- 4GB+ RAM
- Free disk space equal to 4x your available VRAM (for a GPU with 8GB of VRAM, you'll need ~32GB of free disk space)
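The disk requirement above is simple arithmetic, so it can be checked before joining. The following is a minimal sketch, not part of the kalavai client: the function names are hypothetical, and it assumes 1GB = 10^9 bytes and that the working directory sits on the disk you plan to use.

```python
import shutil

# Ratio taken from the hardware requirements above: disk space = 4x VRAM
DISK_TO_VRAM_RATIO = 4


def required_disk_gb(vram_gb: float) -> float:
    """Return the free disk space (in GB) needed for a GPU with vram_gb of VRAM."""
    return DISK_TO_VRAM_RATIO * vram_gb


def has_enough_disk(vram_gb: float, path: str = ".") -> bool:
    """Check whether the disk holding `path` has enough free space.

    Assumption: 1 GB is counted as 10**9 bytes.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_disk_gb(vram_gb)
```

For example, `required_disk_gb(8)` gives 32, matching the 8GB VRAM case above, and `has_enough_disk(8)` reports whether the current disk meets it.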
How to join
- Create a free account with Kalavai.
- Install the kalavai client following the instructions here. Currently we support Linux distros and Windows.
- Get the joining token. Visit our platform and go to Community pools. Then click Join on the Petals Pool to reveal the joining details. Copy the command (including the token).
- Authenticate the computer you want to use as worker:
$ kalavai login
[10:33:16] Kalavai account details. If you don't have an account, create one at https://platform.kalavai.net
User email: <your email>
Password: <your password>
[10:33:25] <email> logged in successfully
- Join the pool with the following command:
$ kalavai pool join <token>
[16:28:14] Token format is correct
Joining private network
[16:28:24] Scanning for valid IPs...
Using 100.10.0.8 address for worker
Connecting to PETALS @ 100.10.0.9 (this may take a few minutes)...
[16:29:41] Worskpace created
You are connected to PETALS
Check Petals health
Kalavai's pool connects directly to the public swarm on Petals, which means we can use their public health check UI to see how much we are contributing and what models are ready to use.
Models with at least one copy of each shard (a green dot in each column) are ready to be used. If not, wait for more workers to join in.
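The readiness rule above (every shard must be served by at least one worker) can be sketched as a coverage check. This is an illustration only: the function names and the input format (a list of half-open `(start, end)` shard ranges, one per worker) are assumptions, not the data format used by the Petals health check.

```python
def uncovered_shards(num_shards: int, served_ranges: list[tuple[int, int]]) -> list[int]:
    """Return the indices of shards that no worker is serving.

    Each entry in served_ranges is a hypothetical half-open range
    (start, end) of shard indices held by one worker.
    """
    covered = [False] * num_shards
    for start, end in served_ranges:
        # Clamp each range to the valid shard indices before marking it
        for i in range(max(start, 0), min(end, num_shards)):
            covered[i] = True
    return [i for i, ok in enumerate(covered) if not ok]


def is_model_ready(num_shards: int, served_ranges: list[tuple[int, int]]) -> bool:
    """A model is ready when every shard has at least one copy."""
    return not uncovered_shards(num_shards, served_ranges)
```

With 8 shards and two workers serving `(0, 3)` and `(5, 8)`, shards 3 and 4 are missing, so the model would not be ready until another worker picks them up.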
Using the kalavai client you can monitor the state of the pool and all of the connected nodes:
$ kalavai pool status
# Displays the status of the pool
$ kalavai node list
# Displays the list of connected nodes, and their current status
The command kalavai node list is useful to see if our node has any issues and whether it's currently online.
How to use the models
For all public swarms you can use the Petals SDK in the usual way. Here is an example:
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM
# Choose any model available at https://health.petals.dev
model_name = "mistralai/Mixtral-8x22B-Instruct-v0.1"
# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
# Run the model as if it were on your computer
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0])) # A cat sat on a mat...
This path is great if you are a dev with Python installed and don't mind installing the Petals SDK. If you want an install-free path, Kalavai deploys a single endpoint for models, which allows you to do inference via gRPC and HTTP requests. Substitute KALAVAI_ENDPOINT with the endpoint displayed under the Community Pools page. Here is a request example (it uses the WebSocket API):
"""
More info: https://github.com/petals-infra/chat.petals.dev
Required: pip install websockets
"""
import asyncio
import json
import time

import websockets

KALAVAI_ENDPOINT = "192.168.68.67:31220"  # <-- change for the kalavai endpoint
MODEL_NAME = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # <-- change for the models available in Kalavai PETALS pool.


async def ws_generate(text, max_length=100, temperature=0.1):
    """Open an inference session, send one generate request, and return the parsed reply."""
    async with websockets.connect(f"ws://{KALAVAI_ENDPOINT}/api/v2/generate") as websocket:
        try:
            # Step 1: open an inference session for the chosen model
            await websocket.send(
                json.dumps({"model": MODEL_NAME, "type": "open_inference_session", "max_length": max_length})
            )
            result = json.loads(await websocket.recv())
            if not result["ok"]:
                # Return the parsed error reply so callers always get a dict
                return result
            # Step 2: request generation within the open session
            await websocket.send(
                json.dumps({
                    "type": "generate",
                    "model": MODEL_NAME,
                    "inputs": text,
                    "max_length": max_length,
                    "temperature": temperature
                })
            )
            return json.loads(await websocket.recv())
        except Exception as e:
            return {"error": str(e)}


if __name__ == "__main__":
    t = time.time()
    output = asyncio.run(ws_generate(text="Tell me a story: "))
    final_time = time.time() - t
    print(f"[{final_time:.2f} secs]", output)
    # token_count is only present on a successful reply
    if "token_count" in output:
        print(f"{output['token_count'] / final_time:.2f}", "tokens/s")
NOTE: the endpoints are only available within worker nodes, not from any other computer.
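Because the endpoint is only reachable from worker nodes, a quick TCP check can tell you whether the machine you are on can see it before you debug the request itself. This is a minimal sketch using only the Python standard library; `split_endpoint` and `is_reachable` are hypothetical helper names, and a successful connection only shows the port is open, not that a model is ready.

```python
import socket


def split_endpoint(endpoint: str) -> tuple[str, int]:
    """Split a 'host:port' string (the KALAVAI_ENDPOINT format) into its parts."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"Expected 'host:port', got {endpoint!r}")
    return host, int(port)


def is_reachable(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = split_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks
        return False
```

Calling `is_reachable("192.168.68.67:31220")` with your own endpoint value from a machine outside the pool is expected to return False.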
Stop sharing
You can either pause sharing, or stop and leave the pool altogether (don't worry, you can rejoin using the same steps above anytime).
To pause sharing (but remain on the pool), run the following command:
kalavai pool pause
When you are ready to resume sharing, run:
kalavai pool resume
To stop and leave the pool, run the following:
kalavai pool stop
FAQs
Something isn't right
Growing pains! Please report any issues in our GitHub repository.
Can I join (and leave) whenever I want?
Yes, you can, and we won't hold a grudge if you need to use your computer. You can pause or quit altogether as indicated here.
What is in it for me?
If you decide to share your compute with the community, not only will you get access to all the models we deploy in it, but you will also earn credits in Kalavai, which will be redeemable for computing in any other public pool (this feature is coming really soon).
Is my data secured / private?
The public pool in Kalavai has the same level of privacy and security as the general Petals public swarm. See their privacy details here. In the future we will improve support for private swarms; at the moment private swarms are a beta feature for all kalavai pools that can be used via the petals template.
Is my GPU constantly being used?
Yes and no. The model weights for the shard you are responsible for are loaded in GPU memory for as long as your machine is sharing. However, this does not mean the GPU is actively computing all the time; computation (and hence the vast majority of energy consumption) only happens when your shard is called on to process inference requests.
If at any point you need your GPU memory back, pause or stop sharing and come back when you are free.