
ADK Agents on GKE with Self-Hosted LLMs

·3 mins

ADK Agents on GKE with a Self-Hosted LLM and vLLM

This tutorial demonstrates how to deploy the Llama-3.1-8B-Instruct model on Google Kubernetes Engine (GKE) using vLLM, and how to integrate an ADK agent that interacts with the model, supporting both basic chat completions and tool usage. The setup leverages a GKE Autopilot cluster with GPU-enabled nodes to handle the computational requirements.

By the end of this tutorial, you will:

  1. Set up a GKE Autopilot cluster.
  2. Deploy the Llama-3.1-8B-Instruct model using vLLM.
  3. Deploy an ADK agent that communicates with the model endpoint.
  4. Test the setup with basic chat completion and tool usage scenarios.

Prerequisites #

  • A terminal with kubectl, helm, and gcloud installed.
  • A Hugging Face account with a token that has Read permission to access the Llama-3.1-8B-Instruct model.
  • Sufficient GPU quota in your Google Cloud project. You need at least 2 NVIDIA L4 GPUs in the region (e.g., us-central1) to deploy the tutorial’s setup.
  • Coffee or Tea ;)

Set Up the GKE Cluster #

Create a folder to hold the files we will be using:

mkdir adk-vllm
cd adk-vllm

Export the following environment variables, replacing the placeholder values with your own:

export PROJECT_ID=xxxx
export REGION=xxxxx
export HF_TOKEN=xxxxxxxx

From a terminal (or the GCP Cloud Shell) run the following command to create the cluster:

gcloud container --project $PROJECT_ID clusters create-auto adk-vllm \
    --region $REGION \
    --release-channel regular \
    --enable-dns-access

Once finished, get the credentials for the cluster:

gcloud container clusters get-credentials adk-vllm \
--region=$REGION \
--project $PROJECT_ID
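
Before continuing, you can confirm that kubectl is pointing at the new cluster (an optional sanity check; on an Autopilot cluster the node list may be empty until workloads are scheduled):

kubectl cluster-info
kubectl get nodes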

Create a Kubernetes secret with the Hugging Face token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=$HF_TOKEN \
  --dry-run=client -o yaml | kubectl apply -f -
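
Optionally, confirm that the secret was created:

kubectl get secret hf-secret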

Deploy the model with vLLM #

Create the following YAML file (vllm-deployment.yaml), which contains the vLLM Deployment, its Service, and the chat-template ConfigMap:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama3-server
  template:
    metadata:
      labels:
        app: llama3-server
        ai.gke.io/model: llama3-1-8b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: vllm/vllm-openai:v0.8.5
        resources:
          requests:
            cpu: "2"
            memory: "10Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "2"
            memory: "10Gi"
            nvidia.com/gpu: "2"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8000
        - --enable-auto-tool-choice
        - --tool-call-parser=llama3_json
        - --chat-template=/templates/tool_chat_template_llama3.1_json.jinja
        - --trust-remote-code
        - --enable-chunked-prefill
        - --max-model-len=32768
        env:
        - name: MODEL_ID
          value: meta-llama/Llama-3.1-8B-Instruct
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /templates
          name: chat-templates
          readOnly: true
      volumes:
      - name: chat-templates
        configMap:
          name: llama-chat-templates
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-spot: "true"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-service
spec:
  selector:
    app: llama3-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-chat-templates
data:
  tool_chat_template_llama3.1_json.jinja: |
    # (Full Jinja template content omitted here for brevity, but included in original file)

Deploy the model:

kubectl apply -f vllm-deployment.yaml

Check the status of the pods:

kubectl get pods
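
The model download and vLLM start-up can take several minutes. As an optional convenience (not part of the original steps), you can wait for the pod to become ready and follow the server logs:

kubectl wait --for=condition=Ready pod -l app=llama3-server --timeout=20m
kubectl logs -f deployment/vllm-llama3-deployment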

Once the pod is ready, set up a port-forward to test the model:

kubectl port-forward services/vllm-llama3-service 8000:8000
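
With the port-forward running, you can send a request to vLLM's OpenAI-compatible API from another terminal. A minimal chat-completion sketch (the model name matches the MODEL_ID set in the deployment):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'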

Build and deploy the ADK Agent #

Clone the repository and change into the agent directory:

git clone https://github.com/boredabdel/adk-vllm-gke
cd adk-vllm-gke/adk_agent

Create a Docker repository in Artifact Registry:

gcloud artifacts repositories create adk-vllm \
--location=$REGION \
--repository-format=docker

Build the agent with Cloud Build:

gcloud builds submit . \
--project=$PROJECT_ID \
--region=$REGION
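
To verify that the image was pushed, you can list the repository contents (the exact image name and tag depend on the repository's Cloud Build configuration):

gcloud artifacts docker images list $REGION-docker.pkg.dev/$PROJECT_ID/adk-vllm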

Deploy the agent to GKE:

kubectl apply -f example_agent/agent-deployment.yaml

Once the agent is live, you can interact with it via the service's external IP and a session ID created through the API, as sketched below.
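
As a rough sketch of that interaction, assuming the agent container runs the standard ADK API server with an app named example_agent (the service name below is a placeholder; check example_agent/agent-deployment.yaml for the actual names and the exposed port):

# Look up the agent service's external IP (replace the service name with the one from the manifest)
export AGENT_IP=$(kubectl get service adk-agent-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Create a session (user and session IDs are arbitrary examples)
curl -X POST http://$AGENT_IP/apps/example_agent/users/user-1/sessions/session-1 \
  -H "Content-Type: application/json" -d '{}'

# Send a message to the agent using the created session
curl -X POST http://$AGENT_IP/run \
  -H "Content-Type: application/json" \
  -d '{
    "app_name": "example_agent",
    "user_id": "user-1",
    "session_id": "session-1",
    "new_message": {"role": "user", "parts": [{"text": "Hello! What can you do?"}]}
  }'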


Clean-up (Optional) #

gcloud container --project $PROJECT_ID clusters delete adk-vllm \
    --region $REGION