Work in progress

Ray clusters for distributed computing

From Ray's documentation:

Ray is an open-source unified framework for scaling AI and Python applications like machine learning.

Kalavai and Ray complement each other. Ray is a framework for distributing computation across an existing hardware pool; Kalavai is the unifying layer that brings that hardware together so Ray can schedule work on it.

To get started, check out our example to get a Ray cluster going.

Create a cluster

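  • Specs: run kalavai pool resources to see what the pool has available (cpu, memory and nvidia.com/gpu), then set the requests and limits of the head and worker groups in the cluster spec accordingly.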
$ kalavai pool resources

┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓ 
┃           ┃ n_nodes ┃ cpu                ┃ memory      ┃ nvidia.com/gpu ┃ 
┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩ 
│ Available │ 2       │ 10.684999999999999 │ 16537780224 │ 1              │ 
├───────────┼─────────┼────────────────────┼─────────────┼────────────────┤ 
│ Total     │ 4       │ 42                 │ 70895030272 │ 3              │ 
└───────────┴─────────┴────────────────────┴─────────────┴────────────────┘ 
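
The memory column above is a raw number, while the cluster spec expresses memory with the Gi suffix. A minimal Python sketch of the conversion, assuming the reported figures are bytes (the output itself does not state the unit); the helper name bytes_to_gib is made up for illustration:

```python
# Assumption: the memory column of `kalavai pool resources` is in bytes.
GIB = 1024 ** 3  # the Kubernetes "Gi" suffix means 2**30 bytes


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a raw byte count to GiB (the unit behind the `Gi` suffix)."""
    return n_bytes / GIB


# Values copied from the "Available" and "Total" rows of the table.
available_gib = bytes_to_gib(16537780224)
total_gib = bytes_to_gib(70895030272)

print(f"available: {available_gib:.1f} Gi, total: {total_gib:.1f} Gi")
# prints "available: 15.4 Gi, total: 66.0 Gi"
```

Under that assumption, roughly 15.4 Gi of the pool's 66 Gi is currently free, which is the figure to compare against the memory values in the spec that follows.
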
spec:
  ...
  headGroupSpec:
    ...
    template:
      spec:
        ...
        containers:
        ...
          resources:
            limits:
              cpu: 2
              memory: 4Gi
            requests:
              cpu: 2
              memory: 4Gi
  workerGroupSpecs:
  ...
    template:
      spec:
        containers:
        ...
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: 2
              memory: 4Gi
            requests:
              nvidia.com/gpu: 1
              cpu: 2
              memory: 4Gi
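
Before submitting a spec like the one above, it can help to check that the head and worker requests fit in what the pool reports as available. The following Python sketch does that arithmetic with the numbers from this page. The Resources class and cluster_request helper are hypothetical names, not part of Kalavai or Ray, and the memory figures are assumed to be bytes. It only compares pool-wide totals; the scheduler places pods per node, so passing this check does not guarantee that every pod can be scheduled.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per Gi


@dataclass(frozen=True)
class Resources:
    """Hypothetical container for the three resource types shown on this page."""
    cpu: float
    memory: int  # bytes (assumed unit of the pool output)
    gpu: int = 0

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(self.cpu + other.cpu,
                         self.memory + other.memory,
                         self.gpu + other.gpu)

    def scaled(self, n: int) -> "Resources":
        """Resources needed by n identical replicas."""
        return Resources(self.cpu * n, self.memory * n, self.gpu * n)

    def fits_in(self, pool: "Resources") -> bool:
        """True if every resource type is within the pool-wide amount."""
        return (self.cpu <= pool.cpu
                and self.memory <= pool.memory
                and self.gpu <= pool.gpu)


# "Available" row from `kalavai pool resources`.
available = Resources(cpu=10.684999999999999, memory=16537780224, gpu=1)

# Requests from the spec: head has no GPU, each worker asks for one.
head = Resources(cpu=2, memory=4 * GIB)
worker = Resources(cpu=2, memory=4 * GIB, gpu=1)


def cluster_request(n_workers: int) -> Resources:
    """Total request for one head plus n_workers workers."""
    return head + worker.scaled(n_workers)


for n in (1, 2):
    print(f"{n} worker(s) fit: {cluster_request(n).fits_in(available)}")
# prints "1 worker(s) fit: True" then "2 worker(s) fit: False"
```

With these numbers, one worker fits, while a second worker would exceed the single available GPU even though cpu and memory would still be sufficient.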

Interact with Ray

  • Interactive mode
  • Endpoint
  • RayJobs

Advanced topics

Autoscaling

Node hardware requirements (limits vs requests)