Work in progress

Ray clusters for distributed computing

From Ray's documentation:

Ray is an open-source unified framework for scaling AI and Python applications like machine learning.

Kalavai and Ray work perfectly together. Ray is a great framework to deal with distributed computation on top of an existing hardware pool. Kalavai acts as a unifying layer that brings that required hardware together for Ray to do its magic.

To get started, check out our example to get a Ray cluster going.

Create a cluster

  • Specs how to define specs: kalavai pool resources (cpu, memory and nvidia.com/gpu)
$ kalavai pool resources

┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓             n_nodes  cpu                 memory       nvidia.com/gpu  
┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩  Available  2        10.684999999999999  16537780224  1               
├───────────┼─────────┼────────────────────┼─────────────┼────────────────┤  Total      4        42                  70895030272  3               
└───────────┴─────────┴────────────────────┴─────────────┴────────────────┘ 
spec:
  ...
  headGroupSpec:
    ...
    template:
      spec:
        ...
        containers:
        ...
          resources:
            limits:
              cpu: 2
              memory: 4Gi
            requests:
              cpu: 2
              memory: 4Gi
  workerGroupSpecs:
  ...
    template:
      spec:
        containers:
        ...
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: 2
              memory: 4Gi
            requests:
              nvidia.com/gpu: 1
              cpu: 2
              memory: 4Gi

Interact with Ray - Interactive mode - Endpoint - RayJobs

Advanced topics

Autoscaling

Node hardware requirements (limits vs requests)