Background
As LLMs and LMMs (large multimodal models) continue to advance, deploying these models effectively has become essential for a wide range of applications. If you’re using OpenAI or similar LLM APIs for your applications or AI agents, you might have concerns about exposing sensitive organizational data to these models. Have you also considered the API costs you incur? If so, you’re in the right place.
While numerous LLM/LMM APIs (OpenAI, Gemini, etc.) are readily available, hosting your own models offers several significant advantages. It gives you greater control over customization to meet specific requirements and enhances data privacy by keeping sensitive information within your own infrastructure. Additionally, self-hosted models can reduce API costs and offer faster response times, particularly in high-traffic scenarios. Ollama simplifies the deployment process, enabling developers to run these models in Docker containers with ease. In this article, we will walk through the steps to deploy an LLM using Ollama.
Ollama is an open-source platform designed to simplify the deployment and management of Large Language Models (LLMs), whether run directly on a machine or inside Docker containers. It allows developers to easily run, manage, and customize various LLMs without the complexities often associated with traditional deployment methods. With Ollama, you can host models locally or on your own infrastructure, giving you greater control over data privacy and customization.
Local Installation
To get started with Ollama, you need to install it on your local machine. Depending on your operating system, you can use the following commands:
macOS:
brew install ollama
Linux: Use the following script to install:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from the official Ollama website and follow the installation instructions.
After installation, start the Ollama service by running
ollama serve
You can verify that Ollama is running with the following command in the terminal.
ollama --version
You can also verify the installation by visiting http://127.0.0.1:11434/ in the browser. By default, Ollama runs on port 11434.
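Alternatively, you can check from the terminal; the root endpoint returns a short status message (typically “Ollama is running”), so a plain curl request like the one below is enough.
curl http://127.0.0.1:11434/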
Downloading Models
By default, no models are available on your local machine. You can list the models present on your system using
ollama list
You can download any model available in the Ollama Library and run inference on it. Visit the Ollama Library to find the model that best suits your requirements. For this example, I am using Microsoft’s phi3.
Use the command below to download the model to your local system.
ollama pull phi3
Once the model is downloaded, run it using
ollama run phi3
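If you only need a one-off answer instead of an interactive session, you can also pass the prompt directly on the command line; the prompt below is just an example.
ollama run phi3 "What is the capital of France?"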
Each model has its own configuration parameters, such as temperature and max tokens. You can adjust these while running the model to match your requirements, as sketched below.
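One way to set these parameters is with a Modelfile that creates a customized variant of the model; the values below (temperature and num_predict, i.e. max tokens) are only examples, so adjust them to your needs.
# Modelfile
FROM phi3
PARAMETER temperature 0.7
PARAMETER num_predict 256
Create and run the customized model with
ollama create phi3-custom -f Modelfile
ollama run phi3-custom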
Inference
Ollama models can be invoked programmatically through a REST API. For instance, if we’re building a chatbot with Ollama, the application can communicate with the model through the REST API as demonstrated below. You can structure the input messages according to the model’s requirements.
curl --location 'http://localhost:11434/api/chat' \
--header 'content-type: application/json' \
--data '{
    "model": "phi3",
    "messages": [
        {
            "role": "system",
            "content": "You are a Q&A chatbot that specializes in answering questions about cities."
        },
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
    "stream": false
}'
The model will return output like the following. You can parse the response and use it as needed; a minimal parsing example follows the JSON.
{
    "model": "phi3",
    "created_at": "2024-10-21T09:07:54.40015Z",
    "message": {
        "role": "assistant",
        "content": "The capital of France is Paris. It's not only politically significant, but also culturally and historically important city with renowned landmarks such as the Eiffel Tower and Louvre Museum."
    },
    "done_reason": "stop",
    "done": true,
    "total_duration": 1329401041,
    "load_duration": 16785666,
    "prompt_eval_count": 39,
    "prompt_eval_duration": 263199000,
    "eval_count": 44,
    "eval_duration": 1042799000
}
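If you only need the assistant’s reply, you can extract it directly from the response; the snippet below is a minimal sketch that assumes jq is installed.
curl -s --location 'http://localhost:11434/api/chat' \
--header 'content-type: application/json' \
--data '{"model": "phi3", "messages": [{"role": "user", "content": "What is the capital of France?"}], "stream": false}' \
| jq -r '.message.content'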
Docker Setup
Similar to running Ollama on a local machine, you can also deploy the Ollama service as a container.
Create a shell script called run-ollama.sh that starts the server and downloads the model.
#!/bin/sh
ollama serve &     # start the Ollama server in the background
sleep 5            # give the server a moment to come up before pulling
ollama list
ollama pull phi3
Create a Dockerfile that starts from the base Ollama image and runs the script at build time, so the model is baked into the image. The base image’s entrypoint starts the Ollama server when the container runs.
# Start from the official Ollama base image
FROM ollama/ollama
# Copy the setup script into the image
COPY ./run-ollama.sh /tmp/run-ollama.sh
WORKDIR /tmp
# Run the script at build time so the phi3 model is baked into the image
RUN chmod +x run-ollama.sh \
    && ./run-ollama.sh
# Ollama listens on port 11434
EXPOSE 11434
Build the image and run the container.
docker build -t my-ollama .
docker run -p 11434:11434 my-ollama
This will spin up a Docker container with the phi3 model. You can invoke the model as described in the Inference section; a quick check against the running container is shown below.
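To confirm that the containerized server has the phi3 model available, you can list its models via the /api/tags endpoint (shown here against the default port mapping).
curl http://localhost:11434/api/tags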