Created
October 23, 2024 23:55
3-1-evaluation-heuristics.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "view-in-github",
    "colab_type": "text"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/gist/caleb-kaiser/b10444b331bf64be64b6b72ee161e0b9/3-1-evaluation-heuristics.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "<img src=\"https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg\" width=\"250\"/>"
   ],
   "metadata": {
    "id": "Wd6G0Z-k1VeU"
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jtg-Gl_IW_OM"
   },
   "source": [
    "# Heuristic Evaluation with Opik\n",
    "\n",
    "In this exercise, you'll implement a basic evaluation pipeline with Opik using heuristic evaluation metrics. You can use OpenAI models or open-source models via LiteLLM.\n",
    "\n",
    "[Heuristic metrics are rule-based evaluation methods](https://www.comet.com/docs/opik/evaluation/metrics/heuristic_metrics) that allow you to check specific aspects of language model outputs. These metrics use predefined criteria or patterns to assess the quality, consistency, or characteristics of generated text."
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Imports & Configuration"
   ],
   "metadata": {
    "id": "DWAY0s_FDQtl"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "%pip install opik openai comet_ml litellm --quiet"
   ],
   "metadata": {
    "id": "5kiuoilUz63A"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "IQffvGSCW_OO"
   },
   "outputs": [],
   "source": [
    "import opik\n",
    "from opik import Opik, track\n",
    "from opik.evaluation import evaluate\n",
    "from opik.evaluation.metrics import Equals, LevenshteinRatio\n",
    "from opik.integrations.openai import track_openai\n",
    "import openai\n",
    "import getpass\n",
    "import os\n",
    "import csv\n",
    "from datetime import datetime\n",
    "import comet_ml\n",
    "import litellm\n",
    "\n",
    "# Define the project name to enable tracing\n",
    "os.environ[\"OPIK_PROJECT_NAME\"] = \"food_chatbot_eval\""
   ]
  },
  {
   "cell_type": "code",
   "source": [
    "# Opik configuration\n",
    "if \"OPIK_API_KEY\" not in os.environ:\n",
    "    os.environ[\"OPIK_API_KEY\"] = getpass.getpass(\"Enter your Opik API key: \")\n",
    "\n",
    "opik.configure()"
   ],
   "metadata": {
    "id": "IuJOARZIzmVq"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "# OpenAI configuration\n",
    "if \"OPENAI_API_KEY\" not in os.environ:\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key: \")\n",
    "\n",
    "MODEL = \"gpt-4o-mini\""
   ],
   "metadata": {
    "id": "8Qj8Cyn4zmP7"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KYZAKedXW_OQ"
   },
   "source": [
    "# Templates & Prompts\n"
   ]
  },
  {
   "cell_type": "code",
   "source": [
    "# Define menu items for the food chatbot\n",
    "menu_items = \"\"\"\n",
    "Menu: Kids Menu\n",
    "Food Item: Mini Cheeseburger\n",
    "Price: $6.99\n",
    "Vegan: N\n",
    "Popularity: 4/5\n",
    "Included: Mini beef patty, cheese, lettuce, tomato, and fries.\n",
    "\n",
    "Menu: Appetizers\n",
    "Food Item: Loaded Potato Skins\n",
    "Price: $8.99\n",
    "Vegan: N\n",
    "Popularity: 3/5\n",
    "Included: Crispy potato skins filled with cheese, bacon bits, and served with sour cream.\n",
    "\n",
    "Menu: Appetizers\n",
    "Food Item: Bruschetta\n",
    "Price: $7.99\n",
    "Vegan: Y\n",
    "Popularity: 4/5\n",
    "Included: Toasted baguette slices topped with fresh tomatoes, basil, garlic, and balsamic glaze.\n",
    "\n",
    "Menu: Main Menu\n",
    "Food Item: Grilled Chicken Caesar Salad\n",
    "Price: $12.99\n",
    "Vegan: N\n",
    "Popularity: 4/5\n",
    "Included: Grilled chicken breast, romaine lettuce, Parmesan cheese, croutons, and Caesar dressing.\n",
    "\n",
    "Menu: Main Menu\n",
    "Food Item: Classic Cheese Pizza\n",
    "Price: $10.99\n",
    "Vegan: N\n",
    "Popularity: 5/5\n",
    "Included: Thin-crust pizza topped with tomato sauce, mozzarella cheese, and fresh basil.\n",
    "\n",
    "Menu: Main Menu\n",
    "Food Item: Spaghetti Bolognese\n",
    "Price: $14.99\n",
    "Vegan: N\n",
    "Popularity: 4/5\n",
    "Included: Pasta tossed in a savory meat sauce made with ground beef, tomatoes, onions, and herbs.\n",
    "\n",
    "Menu: Vegan Options\n",
    "Food Item: Veggie Wrap\n",
    "Price: $9.99\n",
    "Vegan: Y\n",
    "Popularity: 3/5\n",
    "Included: Grilled vegetables, hummus, and mixed greens in a wrap, served with a side of sweet potato fries.\n",
    "\n",
    "Menu: Vegan Options\n",
    "Food Item: Vegan Beyond Burger\n",
    "Price: $11.99\n",
    "Vegan: Y\n",
    "Popularity: 4/5\n",
    "Included: Plant-based patty, vegan cheese, lettuce, tomato, onion, and a choice of regular or sweet potato fries.\n",
    "\n",
    "Menu: Desserts\n",
    "Food Item: Chocolate Lava Cake\n",
    "Price: $6.99\n",
    "Vegan: N\n",
    "Popularity: 5/5\n",
    "Included: Warm chocolate cake with a gooey molten center, served with vanilla ice cream.\n",
    "\n",
    "Menu: Desserts\n",
    "Food Item: Fresh Berry Parfait\n",
    "Price: $5.99\n",
    "Vegan: Y\n",
    "Popularity: 4/5\n",
    "Included: Layers of mixed berries, granola, and vegan coconut yogurt.\n",
    "\"\"\""
   ],
   "metadata": {
    "id": "QqqG70yF2qoi"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "# Simple little client class for using different LLM APIs (OpenAI or LiteLLM)\n",
    "class LLMClient:\n",
    "    def __init__(self, client_type: str = \"openai\", model: str = \"gpt-4\"):\n",
    "        self.client_type = client_type\n",
    "        self.model = model\n",
    "\n",
    "        if self.client_type == \"openai\":\n",
    "            self.client = track_openai(openai.OpenAI())\n",
    "        else:\n",
    "            self.client = None\n",
    "\n",
    "    # LiteLLM query function - use **kwargs to pass arguments like temperature\n",
    "    def _get_litellm_response(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
    "        messages = [\n",
    "            {\"role\": \"system\", \"content\": system},\n",
    "            {\"role\": \"user\", \"content\": query}\n",
    "        ]\n",
    "\n",
    "        response = litellm.completion(\n",
    "            model=self.model,\n",
    "            messages=messages,\n",
    "            **kwargs\n",
    "        )\n",
    "\n",
    "        return response.choices[0].message.content\n",
    "\n",
    "    # OpenAI query function - use **kwargs to pass arguments like temperature\n",
    "    def _get_openai_response(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
    "        messages = [\n",
    "            {\"role\": \"system\", \"content\": system},\n",
    "            {\"role\": \"user\", \"content\": query}\n",
    "        ]\n",
    "\n",
    "        response = self.client.chat.completions.create(\n",
    "            model=self.model,\n",
    "            messages=messages,\n",
    "            **kwargs\n",
    "        )\n",
    "\n",
    "        return response.choices[0].message.content\n",
    "\n",
    "    def query(self, query: str, system: str = \"You are a helpful assistant.\", **kwargs):\n",
    "        if self.client_type == \"openai\":\n",
    "            return self._get_openai_response(query, system, **kwargs)\n",
    "        else:\n",
    "            return self._get_litellm_response(query, system, **kwargs)\n"
   ],
   "metadata": {
    "id": "8w9Doerp4d1G"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "# Initialize llm_client\n",
    "llm_client = LLMClient(model=MODEL)"
   ],
   "metadata": {
    "id": "TVY5JVjIyBx3"
   },
   "execution_count": null,
   "outputs": []
  },
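  {
   "cell_type": "markdown",
   "source": [
    "As a sketch (not part of the original exercise): to use an open-source model instead of OpenAI, you could construct the client with `client_type=\"litellm\"` and a LiteLLM model string. The model name below (`ollama/llama3`) is only an illustrative assumption; any provider/model combination supported by LiteLLM would work."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "source": [
    "# Hypothetical example: an open-source model via LiteLLM\n",
    "# (requires a running Ollama server; model name is an assumption)\n",
    "# litellm_client = LLMClient(client_type=\"litellm\", model=\"ollama/llama3\")\n",
    "# litellm_client.query(\"What is on the kids menu?\")"
   ],
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },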
  {
   "cell_type": "markdown",
   "source": [
    "# LLM Call Steps"
   ],
   "metadata": {
    "id": "d9QwTEd46MrS"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "@track\n",
    "def reasoning_step(user_query, menu_items):\n",
    "    prompt_template = \"\"\"\n",
    "    Your task is to answer questions factually about a food menu, provided below and delimited by +++++. The user request is provided here: {request}\n",
    "\n",
    "    Step 1: The first step is to check if the user is asking a question related to any type of food (even if that food item is not on the menu). If the question is about any type of food, we move on to Step 2 and ignore the rest of Step 1. If the question is not about food, then we send a response: \"Sorry! I cannot help with that. Please let me know if you have a question about our food menu.\"\n",
    "\n",
    "    Step 2: In this step, we check that the user question is relevant to any of the items on the food menu. You should check that the food item exists in our menu first. If it doesn't exist, then send a kind response to the user that the item doesn't exist in our menu, and include a list of available but similar food items without any other details (e.g., price). The food items available are provided below and delimited by +++++:\n",
    "\n",
    "    +++++\n",
    "    {menu_items}\n",
    "    +++++\n",
    "\n",
    "    Step 3: If the item exists in our food menu and the user is requesting specific information, provide that relevant information to the user using the food menu. Make sure to use a friendly tone and keep the response concise.\n",
    "\n",
    "    Perform the following reasoning steps to send a response to the user:\n",
    "    Step 1: <Step 1 reasoning>\n",
    "    Step 2: <Step 2 reasoning>\n",
    "    Response to the user (only output the final response): <response to user>\n",
    "    \"\"\"\n",
    "\n",
    "    prompt = prompt_template.format(request=user_query, menu_items=menu_items)\n",
    "    response = llm_client.query(prompt)\n",
    "    return response"
   ],
   "metadata": {
    "id": "kq8ELqs64M7W"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "@track\n",
    "def extraction_step(reasoning):\n",
    "    prompt_template = \"\"\"\n",
    "    Extract the final response from the text delimited by ###.\n",
    "\n",
    "    ###\n",
    "    {reasoning}\n",
    "    ###\n",
    "\n",
    "    Only output what comes after \"Response to the user:\".\n",
    "    \"\"\"\n",
    "\n",
    "    prompt = prompt_template.format(reasoning=reasoning)\n",
    "    response = llm_client.query(prompt)\n",
    "    return response"
   ],
   "metadata": {
    "id": "BJbVTqko4PPH"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "@track\n",
    "def refinement_step(final_response):\n",
    "    prompt_template = \"\"\"\n",
    "    Perform the following refinement steps on the final output delimited by ###.\n",
    "\n",
    "    1. Shorten the text to one sentence.\n",
    "    2. Use a friendly tone.\n",
    "\n",
    "    ###\n",
    "    {final_response}\n",
    "    ###\n",
    "    \"\"\"\n",
    "\n",
    "    prompt = prompt_template.format(final_response=final_response)\n",
    "    response = llm_client.query(prompt)\n",
    "    return response"
   ],
   "metadata": {
    "id": "HfLqs0Yt4Rc3"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "@track\n",
    "def verification_step(user_question, refined_response, menu_items):\n",
    "    prompt_template = \"\"\"\n",
    "    Your task is to check that the refined response (delimited by ###) is providing a factual response based on the user question (delimited by @@@) and the menu below (delimited by +++++). If yes, just output the refined response in its original form (without the delimiters). If no, then make the correction to the response and return the new response only.\n",
    "\n",
    "    User question: @@@ {user_question} @@@\n",
    "\n",
    "    Refined response: ### {refined_response} ###\n",
    "\n",
    "    +++++\n",
    "    {menu_items}\n",
    "    +++++\n",
    "    \"\"\"\n",
    "\n",
    "    prompt = prompt_template.format(user_question=user_question, refined_response=refined_response, menu_items=menu_items)\n",
    "    response = llm_client.query(prompt)\n",
    "    return response"
   ],
   "metadata": {
    "id": "TWIEicCS4lq8"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "# LLM Application\n",
    "\n",
    "Adding tracking to your LLM application allows you to have full visibility into each evaluation run."
   ],
   "metadata": {
    "id": "DEaODqXY47oz"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "@track\n",
    "def generate_chatbot_response(user_query, menu_items):\n",
    "    reasoning = reasoning_step(user_query, menu_items)\n",
    "    extraction = extraction_step(reasoning)\n",
    "    refinement = refinement_step(extraction)\n",
    "    verification = verification_step(user_query, refinement, menu_items)\n",
    "    return verification\n",
    "\n",
    "@track\n",
    "def chatbot_application(input: str) -> str:\n",
    "    response = generate_chatbot_response(input, menu_items)\n",
    "    return response"
   ],
   "metadata": {
    "id": "N_4yXZq347Gt"
   },
   "execution_count": null,
   "outputs": []
  },
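  {
   "cell_type": "markdown",
   "source": [
    "Before running a full evaluation, it can help to smoke-test the pipeline on a single question so that all four tracked steps appear as one trace in Opik. The question below is just an illustrative example; any menu-related query works."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "source": [
    "# Optional sanity check: one end-to-end call through all four tracked steps\n",
    "# (example question only; this makes live LLM calls)\n",
    "# print(chatbot_application(\"How much does the Classic Cheese Pizza cost?\"))"
   ],
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },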
  {
   "cell_type": "markdown",
   "source": [
    "# Define the Evaluation Task\n",
    "The evaluation task takes a dataset item as input and must return a dictionary whose keys match the parameters expected by the metrics you are using."
   ],
   "metadata": {
    "id": "FHTlEgJE5QFU"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# Define the evaluation task\n",
    "def evaluation_task(x):\n",
    "    return {\n",
    "        \"input\": x['question'],\n",
    "        \"output\": chatbot_application(x['question']),\n",
    "        \"context\": menu_items,\n",
    "        \"reference\": x['response']\n",
    "    }"
   ],
   "metadata": {
    "id": "F5UPxC6Q5ODx"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Dataset\n",
    "To retrieve an existing dataset, use **`client.get_dataset()`**; to create a new dataset, use [**`client.get_or_create_dataset()`**](https://www.comet.com/docs/opik/evaluation/manage_datasets/)."
   ],
   "metadata": {
    "id": "hpDJDCqY5E37"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# Create or retrieve the dataset\n",
    "client = Opik()\n",
    "dataset = client.get_or_create_dataset(name=\"foodchatbot_eval\")"
   ],
   "metadata": {
    "id": "dd7s_4XXEPW8"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Optional: Download Dataset From Comet\n",
    "\n",
    "If you have not previously created the `foodchatbot_eval` dataset in your Opik workspace, run the following code to download the dataset as a Comet Artifact and populate your Opik dataset.\n",
    "\n",
    "If you have already created the `foodchatbot_eval` dataset, you can skip to the next section."
   ],
   "metadata": {
    "id": "71FSSzTWEXV8"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "comet_ml.login(api_key=os.environ[\"OPIK_API_KEY\"])\n",
    "experiment = comet_ml.start(project_name=\"foodchatbot_eval\")\n",
    "\n",
    "logged_artifact = experiment.get_artifact(artifact_name=\"foodchatbot_eval\",\n",
    "                                          workspace=\"examples\")\n",
    "local_artifact = logged_artifact.download(\"./\")\n",
    "experiment.end()"
   ],
   "metadata": {
    "id": "6fYufEyRAY3L"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "# Create or retrieve the dataset\n",
    "client = Opik()\n",
    "dataset = client.get_or_create_dataset(name=\"foodchatbot_eval\")\n",
    "\n",
    "# Read the CSV file and insert items into the dataset\n",
    "with open('./foodchatbot_clean_eval_dataset.csv', newline='') as csvfile:\n",
    "    reader = csv.reader(csvfile)\n",
    "    next(reader, None)  # skip the header\n",
    "    for row in reader:\n",
    "        index, question, response = row\n",
    "        dataset.insert([\n",
    "            {\"question\": question, \"response\": response}\n",
    "        ])"
   ],
   "metadata": {
    "id": "OwzZZ9CVizB_"
   },
   "execution_count": null,
   "outputs": []
  },
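  {
   "cell_type": "markdown",
   "source": [
    "Note: `dataset.insert()` accepts a list of items, so a batched alternative to the row-by-row loop above could reduce the number of network calls. The sketch below assumes the same CSV layout (index, question, response):"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "source": [
    "# Batched variant (a sketch; assumes the same CSV layout as above)\n",
    "# with open('./foodchatbot_clean_eval_dataset.csv', newline='') as csvfile:\n",
    "#     reader = csv.reader(csvfile)\n",
    "#     next(reader, None)  # skip the header\n",
    "#     items = [{\"question\": q, \"response\": r} for _, q, r in reader]\n",
    "# dataset.insert(items)"
   ],
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },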
  {
   "cell_type": "markdown",
   "source": [
    "# Choose Evaluation Metrics\n",
    "Opik provides a set of built-in evaluation metrics, including heuristic metrics and LLM-as-a-judge metrics.\n",
    "\n",
    "Note that each metric expects its data in a certain format; you will need to ensure that the evaluation task you defined above returns the data in the correct format."
   ],
   "metadata": {
    "id": "l3fHqouG5s6g"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# Define the metrics\n",
    "metrics = [Equals(), LevenshteinRatio()]\n",
    "\n",
    "# Build a unique experiment name\n",
    "experiment_name = MODEL + \"_\" + dataset.name + \"_\" + datetime.now().strftime(\"%Y-%m-%d_%H-%M-%S\")"
   ],
   "metadata": {
    "id": "8dd-HVRm5o04"
   },
   "execution_count": null,
   "outputs": []
  },
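  {
   "cell_type": "markdown",
   "source": [
    "To build intuition for what these heuristic metrics measure, you can score a single pair by hand: `Equals` is all-or-nothing, while `LevenshteinRatio` returns 1.0 for identical strings and falls toward 0.0 as edits accumulate. The strings below are illustrative examples, not items from the dataset."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "source": [
    "# Quick intuition check for the two metrics (example strings only)\n",
    "print(Equals().score(output=\"The pizza is $10.99.\", reference=\"The pizza is $10.99.\").value)\n",
    "print(LevenshteinRatio().score(output=\"The pizza costs $10.99.\", reference=\"The pizza is $10.99.\").value)"
   ],
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },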
  {
   "cell_type": "markdown",
   "source": [
    "# Run the evaluation"
   ],
   "metadata": {
    "id": "RBiNM5nP6Tj-"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "B33fWWm9W_OQ"
   },
   "outputs": [],
   "source": [
    "# Run the evaluation\n",
    "evaluation = evaluate(\n",
    "    experiment_name=experiment_name,\n",
    "    dataset=dataset,\n",
    "    task=evaluation_task,\n",
    "    scoring_metrics=metrics,\n",
    "    experiment_config={\n",
    "        \"model\": MODEL\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "source": [],
   "metadata": {
    "id": "FzKlTqUlyu-0"
   },
   "execution_count": null,
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "comet-eval",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  },
  "colab": {
   "provenance": [],
   "collapsed_sections": [
    "DWAY0s_FDQtl",
    "KYZAKedXW_OQ",
    "d9QwTEd46MrS",
    "DEaODqXY47oz",
    "FHTlEgJE5QFU",
    "hpDJDCqY5E37",
    "l3fHqouG5s6g",
    "RBiNM5nP6Tj-"
   ],
   "include_colab_link": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}