felixbartler
Advisor
With the rapid release of new language models, it is challenging to keep track. The latest model that is particularly interesting in the context of SAP AI Core is Llama2 with 7 billion parameters. Its performance gain over other models in the 16 GB VRAM and below category is, in my opinion, astonishing. This installment of the series "Running Language Models" demonstrates how to deploy Llama2 and models of similar size on SAP BTP.

Scenario:


This blog post functions as a practical guide on deploying Language Models on AI Core for inference. As developers of custom AI functionality, we aim to utilize the capabilities of these models in our business processes in a tailored manner. We will provide a brief overview of the limitations surrounding AI Core and delve into the step-by-step explanation of the inference process.


Scenario Architecture



About AI Core and Language Models:


AI Core is a container-based engine that handles AI workloads, offering two primary functions: executing training workflows and deploying models for inference in a productive environment. For the latter purpose, the container exposes an HTTP-based endpoint, accompanied by various supporting features like configurability, scalability, and observability.

When using AI Core, we have the flexibility to choose from different resource plans, which determine the types of machines provisioned from the underlying hyperscaler. In an upcoming blog post, we will provide a comprehensive explanation of these capabilities. For now, it's essential to note that the largest available GPUs currently support up to 16GB of VRAM, thereby limiting the deployment of models to approximately 7 billion parameters in size.

Note: This simplified explanation focuses on the basic capabilities of AI Core and its limitations concerning model deployment based on available GPU VRAM. However, it's important to highlight that there are several alternative methods to run language models with lower resource requirements. These approaches may involve trade-offs in terms of performance or speed to achieve efficient deployment on less powerful hardware configurations.
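
One such approach is weight quantization: with the bitsandbytes integration in transformers, a model can be loaded in 8-bit precision, roughly halving its memory footprint at some cost in output quality and speed. A minimal sketch, purely illustrative and not part of the deployment below (it assumes the accelerate and bitsandbytes packages are installed, and the gated Llama2 checkpoint additionally requires a Hugging Face login, as shown later in this post):

import transformers

# Illustrative only: load the weights in 8-bit via bitsandbytes to reduce VRAM usage
model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_8bit=True,   # quantize weights to 8 bit at load time
    device_map="auto",   # let accelerate place layers on the available GPU
)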

Deployment Step-by-Step:


In this blog, we will walk through the essential artifacts required for deployment. The one-time configuration of AI Core to use your Git account and Docker repository is not covered in detail here; for that, I highly recommend the comprehensive official tutorial series, which offers valuable insights.
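
For orientation, that onboarding essentially boils down to two calls against the AI Core Admin API: registering the Git repository that holds the serving template and registering a Docker pull secret for the image. Below is a rough sketch using plain HTTP; all names, URLs, and credentials are placeholders, and the official tutorial also covers registering the application that points AI Core at the template folder.

import json
import requests

# Placeholders: take the AI API URL and token from your AI Core service key
AI_API_URL = "<your AI API URL>"
TOKEN = "<bearer token obtained from the service key credentials>"
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# 1. Register the Git repository that contains the serving template
requests.post(
    f"{AI_API_URL}/v2/admin/repositories",
    headers=HEADERS,
    json={"name": "llama2-templates", "url": "<your git repo url>",
          "username": "<git user>", "password": "<git access token>"},
)

# 2. Register the Docker pull secret referenced as imagePullSecrets in the template
docker_config = {"auths": {"docker.io": {"username": "<docker user>", "password": "<docker token>"}}}
requests.post(
    f"{AI_API_URL}/v2/admin/dockerRegistrySecrets",
    headers=HEADERS,
    json={"name": "felixdockersecrect", "data": {".dockerconfigjson": json.dumps(docker_config)}},
)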

Serving Template:


The YAML-based template describes the serving configuration. To simplify the demonstration, I have kept it in its most basic form. For the scenario name and the executable, I have chosen "transformers"; as the resource plan, I use "infer.l", the largest plan available for inference.

In this setup, the serving container is pulled directly from my DockerHub account and exposes port 8080 to facilitate the HTTP endpoint. To ensure secure access, I have incorporated an image pull secret for the account. This way, we can seamlessly deploy the chosen model with minimal configuration for this demonstration.
apiVersion: ai.sap.com/v1alpha1
kind: ServingTemplate
metadata:
  name: transformers
  annotations:
    scenarios.ai.sap.com/description: "transformers"
    scenarios.ai.sap.com/name: "transformers"
    executables.ai.sap.com/description: "transformers"
    executables.ai.sap.com/name: "transformers"
  labels:
    scenarios.ai.sap.com/id: "transformers"
    ai.sap.com/version: "1.0"
spec:
  template:
    apiVersion: "serving.kserve.io/v1beta1"
    metadata:
      annotations: |
        autoscaling.knative.dev/metric: concurrency
        autoscaling.knative.dev/target: 1
        autoscaling.knative.dev/targetBurstCapacity: 0
      labels: |
        ai.sap.com/resourcePlan: infer.l
    spec: |
      predictor:
        imagePullSecrets:
          - name: felixdockersecrect
        minReplicas: 1
        maxReplicas: 5
        containers:
          - name: kserve-container
            image: docker.io/bfwork/huggingcore-customgpu
            ports:
              - containerPort: 8080
                protocol: TCP

Dockerfiles:


For the serving code, I have opted to use Dockerfiles with multiple stages. The foundation is a custom-built version of the huggingface/transformers-pytorch-gpu image. The code for this custom image is available on GitHub, and it allows us to use the latest versions of all required libraries. The reason for building the image ourselves is that the versions available on Docker Hub are somewhat outdated.

In the custom image, I have made slight modifications. Specifically, I installed transformers directly from PyPI instead of relying on the GitHub main branch. Additionally, I added supplementary packages necessary for specific optional parts of transformers.

Note: When dealing with numerous third-party dependencies, it is always advisable to leverage existing Dockerfiles instead of reinventing the wheel. This is a critical step to ensure the code works effectively in the end, and utilizing established Dockerfiles can save time and reduce potential issues.
FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04
LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive

RUN apt update
RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg
RUN python3 -m pip install --no-cache-dir --upgrade pip

RUN python3 -m pip install --no-cache-dir transformers==4.31.0

# If set to nothing, will install the latest version
ARG PYTORCH='2.0.1'
ARG TORCH_VISION=''
ARG TORCH_AUDIO=''
# Example: `cu102`, `cu113`, etc.
ARG CUDA='cu117'

RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='$TORCH_VISION'.*' || VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='$TORCH_AUDIO'.*' || VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA

RUN python3 -m pip uninstall -y tensorflow flax
RUN python3 -m pip install -U "itsdangerous<2.1.0"
RUN python3 -m pip install -U accelerate einops bitsandbytes-cuda117

The following is the actual content of our Dockerfile, which includes our serving configuration:
FROM bfwork/huggingcore-transformers

WORKDIR /serving
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

# ENV (rather than `RUN export`) so that the paths persist in the final image
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda-10.0/targets/x86_64-linux/lib:/usr/local/cuda-10.2/targets/x86_64-linux/lib:/usr/local/cuda-11/targets/x86_64-linux/lib:/usr/local/cuda-11.6/targets/x86_64-linux/lib/stubs:/usr/local/cuda-11.6/compat:/usr/local/cuda-11.6/targets/x86_64-linux/lib
ENV PATH=$PATH:/usr/local/cuda-11/bin

# Required for huggingface
RUN mkdir -p /nonexistent/
RUN mkdir -p /transformerscache/

RUN chown -R 1000:1000 /nonexistent
RUN chown -R 1000:1000 /transformerscache

RUN chmod -R 777 /nonexistent
RUN chmod -R 777 /transformerscache

ENV TRANSFORMERS_CACHE=/transformerscache

COPY /serving /serving

ENV MODEL_CLASS="PIPELINE"

CMD ["uvicorn", "app:api", "--host", "0.0.0.0", "--port", "8080"]

In the Dockerfile, we install specific requirements for serving, such as FastAPI. Additionally, we set the LD_LIBRARY_PATH variable, which is necessary for AI Core. Furthermore, we create various folders and configure permissions to enable writing to them. These steps are crucial because Transformers downloads model files to disk, using the directory specified by the TRANSFORMERS_CACHE environment variable. For a detailed explanation of the Dockerfile, check out my last blog post.

Finally, we define the entrypoint for hosting our endpoints on port 8080 using Uvicorn and FastAPI. This setup ensures the smooth functioning of the serving process, allowing us to deploy and serve language models effectively.

Serving script:


The serving process is divided into two components: the server part and the sections dedicated to handling the actual language model. While the example utilizes FastAPI, it's essential to note that there are various other approaches available for building a model inference server.
from fastapi import FastAPI, Request
from model_pipeline import Model

api = FastAPI()

@api.on_event("startup")
async def on_app_start():
    """this function is called on startup and facilitates the loading and setup of the model for inference"""
    Model.setup()

@api.post("/v2/predict")
async def predict(request: Request):
    """this function exposes the inference endpoint, expecting a json object with the prompt and a dictionary of arguments for the model"""
    request_content = await request.json()
    return Model.predict(request_content["prompt"], args=request_content["args"])
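
Once the model part below is in place, both routes can be exercised locally before the image is built, for example with a plain HTTP call against uvicorn running on port 8080 (assuming the local machine has enough GPU memory, or a smaller model is swapped in for testing):

import requests

# Local smoke test: same payload shape as used later against the deployed endpoint
response = requests.post(
    "http://localhost:8080/v2/predict",
    json={"prompt": "Hello, who are you?", "args": {"max_length": 200}},
)
print(response.json())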

And the model part:
import os
import sys
import torch
import transformers
import huggingface_hub

transformers.utils.logging.set_verbosity_debug()
transformers.utils.logging.disable_progress_bar()

HUB_TOKEN = "hf_qsb<your hf token>"

huggingface_hub.login(token=HUB_TOKEN)

class Model:
    generator = None

    def setup():
        """model setup"""
        print("START LOADING SETUP", file=sys.stderr)  # somehow AI Core's logs only show the error stream 🙂

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        model_name = "meta-llama/Llama-2-7b-chat-hf"

        tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

        pipeline = transformers.pipeline(
            "text-generation",
            model=model_name,
            tokenizer=tokenizer,
            torch_dtype=torch.bfloat16,
            device=device,
            trust_remote_code=True,
            use_auth_token=True
        )

        print("MODEL DEVICE", str(device), file=sys.stderr)

        Model.generator = lambda prompt, args: pipeline(
            prompt,
            **{
                "max_length": 2000,
                "do_sample": True,
                "top_k": 10,
                "num_return_sequences": 1,
                "eos_token_id": tokenizer.eos_token_id,
                **args
            }
        )

        print("SETUP DONE", file=sys.stderr)

    def predict(prompt, args):
        """model inference"""
        return Model.generator(prompt, args)

The setup is simple and straightforward. The model is loaded from the Hugging Face model hub, and a generator lambda function is created, encompassing the tokenizer and the transformers pipeline. To access the Llama2 model, a Hugging Face account is required and the license terms must be accepted on the model page. For this demonstration, we use the chat-finetuned Llama2 model.

The generator function enables the specification of prompts (or a list of prompts for batching) and arguments. These arguments can also be set in the API call. For more information on the available parameters, please refer to the official documentation.
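
Before we can call the model, the Docker image needs to be built and pushed, and AI Core needs to sync the serving template. The deployment itself is then created from a configuration. A rough sketch with the AI API Python client follows, assuming ai_api_v2_client is an authenticated client scoped to your resource group as set up in the linked notebook; the configuration name is arbitrary.

# Sketch: create a configuration for the "transformers" scenario/executable
# from the serving template and start a deployment from it
configuration_resp = ai_api_v2_client.configuration.create(
    name="transformers-llama2",
    scenario_id="transformers",
    executable_id="transformers",
)

deployment_resp = ai_api_v2_client.deployment.create(configuration_id=configuration_resp.id)
# Poll ai_api_v2_client.deployment.get(deployment_resp.id) until the status is RUNNING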

Inference:


Once the container is successfully deployed, we can utilize the endpoint to interact with the model. To prompt the model effectively, we follow the same template that was used during training. This involves specifying a system prompt and a task prompt.
def build_llama2_prompt(role_prompt, task_prompt):
    B_S, E_S = "<s>", " </s>"
    B_INST, E_INST = "[INST]", " [/INST]\n"
    B_SYS, E_SYS = " <<SYS>>\n", "\n<</SYS>>\n\n"
    SYSTEM_PROMPT = B_SYS + role_prompt + E_SYS
    return B_S + B_INST + SYSTEM_PROMPT + task_prompt + E_INST

We create the request to AI Core:
import requests

def get_response(full_prompt, args={}):
    res = requests.post(
        f"https://api.ai.internalprod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/{deployment_resp.id}/v2/predict",
        json={"prompt": full_prompt, "args": args},
        headers={
            "Authorization": ai_api_v2_client.rest_client.get_token(),
            "ai-resource-group": RESOURCE_GROUP,
            "Content-Type": "application/json"
        })
    if res.status_code != 200:
        raise Exception("ERROR WITH DEPLOYMENT " + str(res.status_code) + " " + str(res.content))
    return res.json()[0]["generated_text"]
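
The base URL is hard-coded here for brevity. As an alternative, it can be read from the deployment object once the deployment is running, for example with the same client as above:

# The running deployment exposes its endpoint URL; the /v2/predict path
# from the serving script is appended to it
deployment = ai_api_v2_client.deployment.get(deployment_resp.id)
inference_url = deployment.deployment_url + "/v2/predict"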

Note: Please be aware that the deployment URL for your AI Launchpad instance will vary depending on the landscape. Kindly copy the specific deployment URL for your instance to proceed with the appropriate setup and usage.
r = get_response(build_llama2_prompt(role_prompt="You are a poet!", task_prompt="Write a 5 line Poem, about lamas!"))
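
The call above relies on the defaults set in the serving script; generation parameters can be overridden per request through the args dictionary, for instance (values purely illustrative):

r = get_response(
    build_llama2_prompt(role_prompt="You are a poet!", task_prompt="Write a 5 line Poem, about lamas!"),
    args={"max_length": 500, "temperature": 0.7, "top_p": 0.9},
)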

The completion of the prompt results in the generation of the actual poem. Notice the format of the prompt, particularly the inclusion of system and instruction tokens, as they play a significant role in guiding the model's generation process.
<s>[INST] <<SYS>>
You are a poet!
<</SYS>>

Write a 5 line Poem, about lamas! [/INST]
Oh, lamas, oh so serene,
With coats of gold, so divine.
Their eyes so bright, their steps so light,
They roam the mountains with such grace.
In peaceful silence, they take flight.

Indeed, considering its size, the language model proves remarkably capable. For more detailed insights into the model's capabilities, be sure to explore the upcoming blog posts, where we will dive deeper into using language models in a business context.

If you're interested in the complete coding and want to go through the deployment and inference process, you can find it on GitHub. Additionally, there is a deployment Jupyter notebook available to guide you through the entire process.

Enjoy experimenting with language models on BTP, and feel free to leave any comments or feedback. Happy coding!