# WHICH PORT SHOULD THE WEBSERVICE LISTEN ON?
PHOENIX_PORT=6006
# API KEY TO OPENAI / CHATGPT API
OPENAI_API_KEY=sk-hj8dfs8f9dsfsdgibberish
# CHUNK SIZE OF DOMAIN FILES
DOCUMENTS_CHUNK_SIZE=512
# HOW MANY QUESTIONS PER CHUNK SHALL BE GENERATED?
QUESTIONS_PER_CHUNK=2
# BASEPATH OF DOMAIN FILES
DOMAIN_BASEPATH=https://rodderberg.zbh.uni-hamburg.de/wiki/data/pages/
# FILES TO DOWNLOAD WITHOUT BASEPATH
DOMAIN_DOCS=amd/tutorials/programming/ide_setup.txt,amd/teaching/lessons_learned_lehre.txt
# DOWNLOAD FOLDER
DOMAIN_DOWNLOAD_FOLDER=./downloads
.env
.venv
.ipynb_checkpoints
downloads
# Prerequisites
- python3 installed
- [OpenAI API key](https://auth.openai.com/)
# Install
```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp ./.env.template ./.env
```
# Configure
populate the `.env` file with your own values (OpenAI API key, Phoenix port, domain document URLs, etc.); see `.env.template` for the available keys
# Start
```
python3 evaluateRAG.py
```
access the webservice via browser, e.g.
`firefox http://<HOSTNAME>:<PORT>`
# Sources
- [YT: RAG Time! Evaluate RAG with LLM Evals and Benchmarking](https://www.youtube.com/watch?v=LrMguHcbpO8)
- [Phoenix Docs](https://docs.arize.com/phoenix)
%% Cell type:markdown id: tags:
<center>
<p style="text-align:center">
<img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
<br>
<a href="https://docs.arize.com/phoenix/">Docs</a>
|
<a href="https://github.com/Arize-ai/phoenix">GitHub</a>
|
<a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
</p>
</center>
<h1 align="center">Evaluate RAG with LLM Evals</h1>
In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.
It has the following sections:
1. Understanding Retrieval Augmented Generation (RAG).
2. Building RAG (with the help of a framework such as LlamaIndex).
3. Evaluating RAG with Phoenix Evals.
%% Cell type:markdown id: tags:
## Retrieval Augmented Generation (RAG)
LLMs are trained on vast datasets, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.
In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.
RAG is a critical component for building applications such as chatbots or agents, and you will want to know RAG techniques to get your data into your application.
<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/RAG_Pipeline.png">
%% Cell type:markdown id: tags:
## Stages within RAG
There are five key stages within RAG, which will in turn be a part of any larger RAG application (a minimal code sketch of the first four follows this list).
- **Loading**: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.
- **Indexing**: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
- **Storing**: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
- **Querying**: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
- **Evaluation**: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.
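As a rough illustration, the first four stages map onto LlamaIndex roughly as follows. This is a minimal sketch assuming a local `./data` folder of text files; evaluation is what the rest of this tutorial covers with Phoenix Evals.

``` python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Loading: read raw files into Document objects
documents = SimpleDirectoryReader("./data").load_data()
# Indexing: chunk the documents and embed them into a vector index
index = VectorStoreIndex.from_documents(documents)
# Storing: persist the index so it does not have to be rebuilt next time
index.storage_context.persist(persist_dir="./storage")
# Querying: retrieve relevant chunks and synthesize an answer with the LLM
query_engine = index.as_query_engine()
print(query_engine.query("What does the corpus say about <topic>?"))
```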
%% Cell type:markdown id: tags:
## Build a RAG system
Now that we have understood the stages of RAG, let's build a pipeline. We will use [LlamaIndex](https://www.llamaindex.ai/) for RAG and [Phoenix Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) for evaluation.
%% Cell type:code id: tags:
``` python
!pip install -qq "arize-phoenix[evals]" "llama-index>=0.10.3" "openinference-instrumentation-llama-index>=1.0.0" "llama-index-callbacks-arize-phoenix>=0.1.2" "llama-index-llms-openai" "openai>=1" gcsfs nest_asyncio
```
%% Cell type:code id: tags:
``` python
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio
nest_asyncio.apply()
import os
from getpass import getpass
import pandas as pd
import phoenix as px
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, set_global_handler
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
```
%% Cell type:markdown id: tags:
During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex.
%% Cell type:code id: tags:
``` python
session = px.launch_app()
```
%% Cell type:code id: tags:
``` python
set_global_handler("arize_phoenix")
```
%% Cell type:markdown id: tags:
For this tutorial we will be using OpenAI for creating synthetic data as well as for evaluation.
%% Cell type:code id: tags:
``` python
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
```
%% Cell type:markdown id: tags:
### Load Data and Build an Index
%% Cell type:markdown id: tags:
Let's use an [essay by Paul Graham](https://www.paulgraham.com/worked.html) to build our RAG pipeline.
%% Cell type:code id: tags:
``` python
import tempfile
from urllib.request import urlretrieve
with tempfile.NamedTemporaryFile() as tf:
    urlretrieve(
        "https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt",
        tf.name,
    )
    documents = SimpleDirectoryReader(input_files=[tf.name]).load_data()
```
%% Cell type:code id: tags:
``` python
# Define an LLM
llm = OpenAI(model="gpt-4")
# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)
```
%% Cell type:markdown id: tags:
Build a QueryEngine and start querying.
%% Cell type:code id: tags:
``` python
query_engine = vector_index.as_query_engine()
```
%% Cell type:code id: tags:
``` python
response_vector = query_engine.query("What did the author do growing up?")
```
%% Cell type:markdown id: tags:
Check the response that you get from the query.
%% Cell type:code id: tags:
``` python
response_vector.response
```
%% Cell type:markdown id: tags:
By default LlamaIndex retrieves the two most similar nodes/chunks. You can modify that via `vector_index.as_query_engine(similarity_top_k=k)`, as illustrated below.
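For example, a query engine that retrieves the five most similar chunks could be created like this (illustrative only; the rest of this tutorial keeps the default of two):

``` python
# Hypothetical variant: retrieve the top 5 chunks instead of the default 2
query_engine_top5 = vector_index.as_query_engine(similarity_top_k=5)
response_top5 = query_engine_top5.query("What did the author do growing up?")
print(len(response_top5.source_nodes))  # expect 5 source nodes
```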
Let's check the text in each of the two nodes retrieved for our query.
%% Cell type:code id: tags:
``` python
# First retrieved node
response_vector.source_nodes[0].get_text()
```
%% Cell type:code id: tags:
``` python
# Second retrieved node
response_vector.source_nodes[1].get_text()
```
%% Cell type:markdown id: tags:
Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application.
%% Cell type:code id: tags:
``` python
print("phoenix URL", px.active_session().url)
```
%% Cell type:markdown id: tags:
We can access the traces by directly pulling the spans from the phoenix session.
%% Cell type:code id: tags:
``` python
spans_df = px.Client().get_spans_dataframe()
```
%% Cell type:code id: tags:
``` python
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()
```
%% Cell type:markdown id: tags:
Note that the traces have captured the documents that were retrieved by the query engine. This is nice because it means we can introspect the documents without having to keep track of them ourselves.
%% Cell type:code id: tags:
``` python
spans_with_docs_df = spans_df[spans_df["attributes.retrieval.documents"].notnull()]
```
%% Cell type:code id: tags:
``` python
spans_with_docs_df[["attributes.input.value", "attributes.retrieval.documents"]].head()
```
%% Cell type:markdown id: tags:
We have built a RAG pipeline and instrumented it using Phoenix Tracing. We now need to evaluate its performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.
%% Cell type:markdown id: tags:
## Evaluation
Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries.
While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.
In a RAG system, evaluation focuses on two critical aspects:
- **Retrieval Evaluation**: To assess the accuracy and relevance of the documents that were retrieved
- **Response Evaluation**: To measure the appropriateness of the response generated by the system given the retrieved context
%% Cell type:markdown id: tags:
### Generate Question Context Pairs
For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.
For this tutorial, let's use Phoenix's `llm_generate` to help us create the question-context pairs.
%% Cell type:markdown id: tags:
First, let's create a dataframe of all the document chunks that we have indexed.
%% Cell type:code id: tags:
``` python
# Let's construct a dataframe of just the documents that are in our index
document_chunks_df = pd.DataFrame({"text": [node.get_text() for node in nodes]})
document_chunks_df.head()
```
%% Cell type:markdown id: tags:
Now that we have the document chunks, let's prompt an LLM to generate 3 questions per chunk. Note that you could manually solicit questions from your team or customers, but this is a quick and easy way to generate a large number of questions.
%% Cell type:code id: tags:
``` python
generate_questions_template = """\
Context information is below.
---------------------
{text}
---------------------
Given the context information and not prior knowledge,
generate only questions based on the below query.
You are a Teacher/Professor. Your task is to set up \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.
Output the questions in JSON format with the keys question_1, question_2, question_3.
"""
```
%% Cell type:code id: tags:
``` python
import json
from phoenix.evals import OpenAIModel, llm_generate
def output_parser(response: str, index: int):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}


questions_df = llm_generate(
    dataframe=document_chunks_df,
    template=generate_questions_template,
    model=OpenAIModel(
        model_name="gpt-3.5-turbo",
    ),
    output_parser=output_parser,
    concurrency=20,
)
```
%% Cell type:code id: tags:
``` python
questions_df.head()
```
%% Cell type:code id: tags:
``` python
# Construct a dataframe of the questions and the document chunks
questions_with_document_chunk_df = pd.concat([questions_df, document_chunks_df], axis=1)
questions_with_document_chunk_df = questions_with_document_chunk_df.melt(
    id_vars=["text"], value_name="question"
).drop("variable", axis=1)
# If the above step was interrupted, there might be questions missing. Let's run this to clean up the dataframe.
questions_with_document_chunk_df = questions_with_document_chunk_df[
    questions_with_document_chunk_df["question"].notnull()
]
```
%% Cell type:markdown id: tags:
The LLM has generated three questions per chunk. Let's take a quick look.
%% Cell type:code id: tags:
``` python
questions_with_document_chunk_df.head(10)
```
%% Cell type:markdown id: tags:
### Retrieval Evaluation
We are now prepared to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether the correct context is retrieved.
%% Cell type:code id: tags:
``` python
# First things first, let's reset phoenix
px.close_app()
px.launch_app()
```
%% Cell type:code id: tags:
``` python
# loop over the questions and generate the answers
for _, row in questions_with_document_chunk_df.iterrows():
    question = row["question"]
    response_vector = query_engine.query(question)
    print(f"Question: {question}\nAnswer: {response_vector.response}\n")
```
%% Cell type:markdown id: tags:
Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the [docs](https://docs.arize.com/phoenix/how-to/extract-data-from-spans)).
%% Cell type:code id: tags:
``` python
from phoenix.session.evaluation import get_retrieved_documents
retrieved_documents_df = get_retrieved_documents(px.Client())
retrieved_documents_df
```
%% Cell type:markdown id: tags:
Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regard to the query. Note that we've turned on `explanations`, which prompts the LLM to explain its reasoning. This can be useful for debugging and for figuring out potential corrective actions.
%% Cell type:code id: tags:
``` python
from phoenix.evals import (
    RelevanceEvaluator,
    run_evals,
)

relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))

retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]
```
%% Cell type:code id: tags:
``` python
retrieved_documents_relevance_df.head()
```
%% Cell type:markdown id: tags:
We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing.
%% Cell type:code id: tags:
``` python
documents_with_relevance_df = pd.concat(
    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)
documents_with_relevance_df
```
%% Cell type:markdown id: tags:
Let's compute Normalized Discounted Cumulative Gain ([NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)) at 2 for all our retrieval steps. In information retrieval, this metric is often used to measure the effectiveness of search engine algorithms and related applications.
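For reference, one common formulation of the metric (with the binary relevance labels produced by our evaluator, the exponential-gain variant reduces to the same thing) is

$$\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}},$$

where $rel_i$ is the relevance label of the document at rank $i$ and $\mathrm{IDCG@k}$ is the best achievable $\mathrm{DCG@k}$ for that query, so a score of 1.0 means the relevant documents were ranked first.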
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.metrics import ndcg_score
def _compute_ndcg(df: pd.DataFrame, k: int):
    """Compute NDCG@k in the presence of missing values"""
    n = max(2, len(df))
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    eval_scores[: len(df)] = df.eval_score
    doc_scores[: len(df)] = df.document_score
    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        return np.nan


ndcg_at_2 = pd.DataFrame(
    {"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)
```
%% Cell type:code id: tags:
``` python
ndcg_at_2
```
%% Cell type:markdown id: tags:
Let's also compute precision at 2 for all our retrieval steps, i.e. the fraction of the top two retrieved documents judged relevant (if one of the two is relevant, precision@2 = 0.5).
%% Cell type:code id: tags:
``` python
precision_at_2 = pd.DataFrame(
    {
        "score": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) / 2
        )
    }
)
```
%% Cell type:code id: tags:
``` python
precision_at_2
```
%% Cell type:markdown id: tags:
Lastly, let's compute whether or not a relevant document was retrieved at all for each query (i.e. a hit).
%% Cell type:code id: tags:
``` python
hit = pd.DataFrame(
    {
        "hit": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) > 0
        )
    }
)
```
%% Cell type:markdown id: tags:
Let's now view the results in a combined dataframe.
%% Cell type:code id: tags:
``` python
retrievals_df = px.Client().get_spans_dataframe("span_kind == 'RETRIEVER'")
rag_evaluation_dataframe = pd.concat(
    [
        retrievals_df["attributes.input.value"],
        ndcg_at_2.add_prefix("ndcg@2_"),
        precision_at_2.add_prefix("precision@2_"),
        hit,
    ],
    axis=1,
)
rag_evaluation_dataframe
```
%% Cell type:markdown id: tags:
### Observations
Let's now take our results and aggregate them to get a sense of how well our RAG system is performing.
%% Cell type:code id: tags:
``` python
# Aggregate the scores across the retrievals
results = rag_evaluation_dataframe.mean(numeric_only=True)
results
```
%% Cell type:markdown id: tags:
As we can see from the above numbers, our RAG system is not perfect; there are times when it fails to retrieve the correct context within the first two documents. At other times the correct context is included in the top 2 results, but non-relevant information is also included in the context. This is an indication that we need to improve our retrieval strategy. One possible solution could be to increase the number of documents retrieved and then use a more sophisticated ranking strategy (such as a reranker) to select the correct context, as sketched below.
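As a rough sketch of that idea using LlamaIndex's `LLMRerank` postprocessor (illustrative only and not part of the evaluated pipeline above; the parameter values are arbitrary):

``` python
from llama_index.core.postprocessor import LLMRerank

# Retrieve a larger candidate set, then let an LLM rerank it and keep the best 2
reranking_query_engine = vector_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[LLMRerank(top_n=2, llm=llm)],
)
reranked_response = reranking_query_engine.query("What did the author do growing up?")
```

Re-running the retrieval evaluation against such an engine would show whether the reranker actually improves NDCG@2 and precision@2.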
%% Cell type:markdown id: tags:
We have now evaluated our RAG system's retrieval performance. Let's send these evaluations to Phoenix for visualization. By sending the evaluations to Phoenix, you will be able to view the evaluations alongside the traces that were captured earlier.
%% Cell type:code id: tags:
``` python
from phoenix.trace import DocumentEvaluations, SpanEvaluations
px.Client().log_evaluations(
    SpanEvaluations(dataframe=ndcg_at_2, eval_name="ndcg@2"),
    SpanEvaluations(dataframe=precision_at_2, eval_name="precision@2"),
    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)
```
%% Cell type:markdown id: tags:
### Response Evaluation
The retrieval evaluations demonstrate that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM.
%% Cell type:code id: tags:
``` python
from phoenix.session.evaluation import get_qa_with_reference
qa_with_reference_df = get_qa_with_reference(px.Client())
qa_with_reference_df
```
%% Cell type:markdown id: tags:
Now that we have a dataset of the question, context, and response (input, reference, and output), we can measure how well the LLM is responding to the queries. For details on the QA correctness evaluation, see the [LLM Evals documentation](https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/q-and-a-on-retrieved-data).
%% Cell type:code id: tags:
``` python
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))

qa_correctness_eval_df, hallucination_eval_df = run_evals(
    evaluators=[qa_evaluator, hallucination_evaluator],
    dataframe=qa_with_reference_df,
    provide_explanation=True,
    concurrency=20,
)
```
%% Cell type:code id: tags:
``` python
qa_correctness_eval_df.head()
```
%% Cell type:code id: tags:
``` python
hallucination_eval_df.head()
```
%% Cell type:markdown id: tags:
#### Observations
Let's now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.
%% Cell type:code id: tags:
``` python
qa_correctness_eval_df.mean(numeric_only=True)
```
%% Cell type:code id: tags:
``` python
hallucination_eval_df.mean(numeric_only=True)
```
%% Cell type:markdown id: tags:
Our QA Correctness score of `0.91` and Hallucination score of `0.05` signify that the generated answers are correct ~91% of the time and that the responses contain hallucinations 5% of the time, so there is room for improvement. This could be due to the retrieval strategy or the LLM itself. We will need to investigate further to determine the root cause.
%% Cell type:markdown id: tags:
Since we have evaluated our RAG system's QA performance and Hallucinations performance, let's send these evaluations to Phoenix for visualization.
%% Cell type:code id: tags:
``` python
from phoenix.trace import SpanEvaluations
px.Client().log_evaluations(
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)
```
%% Cell type:markdown id: tags:
We have now sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! Since we've sent all the evals to Phoenix, we can analyze them together to determine whether poor retrieval or irrelevant context affects the LLM's ability to generate a correct response.
%% Cell type:code id: tags:
``` python
print("phoenix URL", px.active_session().url)
```
%% Cell type:code id: tags:
``` python
px.close_app()
```
%% Cell type:markdown id: tags:
## Conclusion
We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, with a specific focus on evaluating the retrieval system and the generated responses within the pipeline.
Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM Application. For more details, see the [LLM Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) documentation.
import os
import subprocess
import json
from pathlib import Path
# enables counting list items
from operator import length_hint
# .env parser
from dotenv import load_dotenv
# getpass enables secure password input
from getpass import getpass
# creating temporary files
import tempfile
# download files containing the domain specific information
from urllib.parse import urlparse
from urllib.request import urlretrieve
# pandas handles table data
import pandas as pd
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
import nest_asyncio
nest_asyncio.apply()
# colored print
from colorist import Color, BrightColor, bright_yellow, magenta, red, green
# phoenix is the framework & webservice from arize (https://docs.arize.com/phoenix)
import phoenix as px
from phoenix.evals import OpenAIModel, llm_generate, HallucinationEvaluator, QAEvaluator, RelevanceEvaluator, run_evals
from phoenix.session.evaluation import get_retrieved_documents, get_qa_with_reference
from phoenix.trace import DocumentEvaluations, SpanEvaluations
# llama_index handles the boilerplate for chunking, vectorizing, storing, querying etc. of private data
from llama_index.core import set_global_handler, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI
import numpy as np
from sklearn.metrics import ndcg_score
##################################################
### PREPARING & STARTING
##################################################
##########
# SET VARS
##########
# Load environment variables from .env file
if not os.path.isfile('./.env'):
    raise RuntimeError("Aborting: No .env file found.")
load_dotenv()
# tell llama_index to send all traces to the phoenix instance
set_global_handler("arize_phoenix")
##########
# LOAD OPENAI API KEY
##########
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
# Define the LLM
llm = OpenAI(model="gpt-4")
##########
# START PHOENIX
##########
# check for running process
process = subprocess.run(["lsof", f"-iTCP:{os.environ['PHOENIX_PORT']}"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output = process.stdout.decode("utf8")
if len(output.strip()) > 1:
    print(f"Aborting: Error while attempting to bind on address ('0.0.0.0', {os.environ['PHOENIX_PORT']}): address already in use")
    os._exit(1)
# launch phoenix
green(f"Launching phoenix application")
session = px.launch_app()
print("phoenix URL", px.active_session().url)
##################################################
### PREPARING FILES FOR RAG
##################################################
##########
# LOAD DOMAIN SPECIFIC INFORMATION FILES
##########
# create array of files to download
urls = os.environ['DOMAIN_DOCS'].split(',')
# create download folder
Path(os.environ['DOMAIN_DOWNLOAD_FOLDER']).mkdir(parents=True, exist_ok=True)
# for each file to download
for url in urls:
    # parse url
    source = urlparse(os.environ['DOMAIN_BASEPATH'] + url)
    # turn the path into a filename (because the same filename may occur in different paths)
    targetFilename = source.path.replace('/', ':')
    # remember full target path
    targetPath = os.environ['DOMAIN_DOWNLOAD_FOLDER'] + '/' + targetFilename
    # output
    print(f'downloading {source.path} \t>\t {targetPath}')
    # download
    urlretrieve(os.environ['DOMAIN_BASEPATH'] + url, targetPath)
# load docs into llamaIndex
documents = SimpleDirectoryReader(input_dir=os.environ['DOMAIN_DOWNLOAD_FOLDER']).load_data()
##########
# CHUNK & VECTORIZE FILES
##########
green(f"Building index with a chunk_size of {os.environ['DOCUMENTS_CHUNK_SIZE']} for {length_hint(documents)} docs")
# define parsing options incl. regex, paragraph_separator etc.
node_parser = SimpleNodeParser.from_defaults(chunk_size=int(os.environ['DOCUMENTS_CHUNK_SIZE']))
# generate nodes based on the parser
nodes = node_parser.get_nodes_from_documents(documents)
# Vectorize nodes
# this is the first time the OPENAI_API_KEY is actually needed in this script
vector_index = VectorStoreIndex(nodes)
print(f'created {length_hint(nodes)} chunks')
countQuestions = length_hint(nodes) * int(os.environ['QUESTIONS_PER_CHUNK'])
# Build a QueryEngine and start querying.
query_engine = vector_index.as_query_engine()
##########
# CREATE THE QUESTION-CONTEXT PAIRS
##########
green(f"Generating {countQuestions} questions, {os.environ['QUESTIONS_PER_CHUNK']} per chunk")
# create a dataframe of all the document chunks that were indexed
document_chunks_df = pd.DataFrame({"text": [node.get_text() for node in nodes]})
# template to generate {QUESTIONS_PER_CHUNK} questions per chunk
generate_questions_template = f"""\
Context information is below.
---------------------
{{text}}
---------------------
Given the context information and not prior knowledge,
generate only questions based on the below query.
You are a Teacher/Professor. Your task is to set up \
{os.environ['QUESTIONS_PER_CHUNK']} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.
Output the questions in JSON format with the keys question_1 through question_{os.environ['QUESTIONS_PER_CHUNK']}.
"""
# define parser to fetch questions from the response
def output_parser(response: str, index: int):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}
# send the prompt template to the LLM and store the result in questions_df
questions_df = llm_generate(
    dataframe=document_chunks_df,
    template=generate_questions_template,
    model=OpenAIModel(
        model="gpt-3.5-turbo",
    ),
    output_parser=output_parser,
    concurrency=20,
)
# Construct a dataframe of the questions and the document chunks
questions_with_document_chunk_df = pd.concat([questions_df, document_chunks_df], axis=1)
questions_with_document_chunk_df = questions_with_document_chunk_df.melt(
    id_vars=["text"], value_name="question"
).drop("variable", axis=1)
# If the above step was interrupted, there might be questions missing. Clean up the dataframe.
questions_with_document_chunk_df = questions_with_document_chunk_df[
    questions_with_document_chunk_df["question"].notnull()
]
# debug print
# magenta(questions_with_document_chunk_df.head(10))
##################################################
### RETRIEVAL EVALUATION
##################################################
green('Starting retrieval evaluation')
##########
# GENERATE THE ANSWERS
########
print(f'Generating the answers for each question')
# loop over the questions and generate the answers
for _, row in questions_with_document_chunk_df.iterrows():
    question = row["question"]
    response_vector = query_engine.query(question)
    # debug print question-answer pair
    # print(f"Question: {Color.MAGENTA}{question}{Color.OFF}\nAnswer: {BrightColor.MAGENTA}{response_vector.response}{Color.OFF}\n")
# extract all the retrieved documents from the traces logged to phoenix
print('extracting the retrieved documents from phoenix traces')
retrieved_documents_df = get_retrieved_documents(px.Client())
##########
# CALCULATE RELEVANCE
########
# use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regards to the query.
# Note that explanations are turned on, which prompts the LLM to explain its reasoning.
# This can be useful for debugging and for figuring out potential corrective actions.
print('calculating relevance of the documents to the query')
relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]
# combine the documents with the relevance evaluations to compute retrieval metrics.
# These metrics will help to understand how well the RAG system is performing.
documents_with_relevance_df = pd.concat(
    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)
##########
# NDCG@2
##########
print('Computing NDCG@2')
# function
def _compute_ndcg(df: pd.DataFrame, k: int):
    """Compute NDCG@k in the presence of missing values"""
    n = max(2, len(df))
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    eval_scores[: len(df)] = df.eval_score
    doc_scores[: len(df)] = df.document_score
    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        return np.nan
# run
ndcg_at_2 = pd.DataFrame(
    {"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)
##########
# PRECISION@2
##########
print(f'Computing Precision@2')
precision_at_2 = pd.DataFrame(
    {
        "score": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) / 2
        )
    }
)
##########
# HIT
##########
print(f'Computing HIT')
hit = pd.DataFrame(
    {
        "hit": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) > 0
        )
    }
)
##########
# COMBINE METRICS INTO ONE DATAFRAME
##########
print(f'Creating dataframe of all metrics')
retrievals_df = px.Client().get_spans_dataframe("span_kind == 'RETRIEVER'")
rag_evaluation_dataframe = pd.concat(
    [
        retrievals_df["attributes.input.value"],
        ndcg_at_2.add_prefix("ndcg@2_"),
        precision_at_2.add_prefix("precision@2_"),
        hit,
    ],
    axis=1,
)
# debug
# magenta(rag_evaluation_dataframe)
##########
# AGGREGATE THE SCORES ACROSS THE RETRIEVALS
##########
print(f'Aggregated metrics are:')
results = rag_evaluation_dataframe.mean(numeric_only=True)
magenta(results)
##########
# SEND PERFORMANCE INFO TO PHOENIX
##########
px.Client().log_evaluations(
    SpanEvaluations(dataframe=ndcg_at_2, eval_name="ndcg@2"),
    SpanEvaluations(dataframe=precision_at_2, eval_name="precision@2"),
    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)
##################################################
### RESPONSE EVALUATION
##################################################
green(f'Starting response evaluation')
# fetching question, context, and response (input, reference, and output) into one dataframe
qa_with_reference_df = get_qa_with_reference(px.Client())
# debug print
# magenta(qa_with_reference_df)
##########
# CALCULATE CORRECTNESS & HALLUCINATIONS
##########
# measure how well the LLM is responding to the queries
# details: https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/q-and-a-on-retrieved-data
qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
print('calculate performance of QA correctness & hallucinations')
qa_correctness_eval_df, hallucination_eval_df = run_evals(
    evaluators=[qa_evaluator, hallucination_evaluator],
    dataframe=qa_with_reference_df,
    provide_explanation=True,
    concurrency=20,
)
# debug
# magenta(qa_correctness_eval_df.head())
# magenta(hallucination_eval_df.head())
# aggregate results to get a sense of how well the LLM is answering the questions given the context
magenta(qa_correctness_eval_df.mean(numeric_only=True))
magenta(hallucination_eval_df.mean(numeric_only=True))
# send QA performance and Hallucinations performance to Phoenix for visualization
print('send aggregated evaluations to phoenix')
px.Client().log_evaluations(
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)
##################################################
### CLEAN END
##################################################
green('finished evaluation process')
print("See result here: ", px.active_session().url)
bright_yellow("Press Enter to exit...")
input()
px.close_app()
aiohttp==3.9.4
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.3.0
arize-phoenix==3.21.0
arize-phoenix-evals==0.7.0
attrs==23.2.0
beautifulsoup4==4.12.3
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
colorist==1.7.2
Cython==0.29.37
dataclasses-json==0.6.4
decorator==5.1.1
Deprecated==1.2.14
dirtyjson==1.0.8
distro==1.9.0
frozenlist==1.4.1
fsspec==2024.3.1
gcsfs==2024.3.1
google-api-core==2.18.0
google-auth==2.29.0
google-auth-oauthlib==1.2.0
google-cloud-core==2.4.1
google-cloud-storage==2.16.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
graphql-core==3.2.3
greenlet==3.0.3
grpcio==1.62.1
h11==0.14.0
hdbscan==0.8.33
httpcore==1.0.5
httpx==0.27.0
idna==3.7
importlib-metadata==7.0.0
Jinja2==3.1.3
joblib==1.4.0
llama-index==0.10.28
llama-index-agent-openai==0.2.2
llama-index-callbacks-arize-phoenix==0.1.3
llama-index-cli==0.1.11
llama-index-core==0.10.28
llama-index-embeddings-openai==0.1.7
llama-index-indices-managed-llama-cloud==0.1.5
llama-index-legacy==0.9.48
llama-index-llms-openai==0.1.15
llama-index-multi-modal-llms-openai==0.1.5
llama-index-program-openai==0.1.5
llama-index-question-gen-openai==0.1.3
llama-index-readers-file==0.1.17
llama-index-readers-llama-parse==0.1.4
llama-parse==0.4.0
llamaindex-py-client==0.1.18
llvmlite==0.42.0
MarkupSafe==2.1.5
marshmallow==3.21.1
multidict==6.0.5
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.3
nltk==3.8.1
numba==0.59.1
numpy==1.26.4
oauthlib==3.2.2
openai==1.17.1
openinference-instrumentation==0.1.1
openinference-instrumentation-langchain==0.1.14
openinference-instrumentation-llama-index==1.2.1
openinference-instrumentation-openai==0.1.4
openinference-semantic-conventions==0.1.5
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-instrumentation==0.45b0
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-semantic-conventions==0.45b0
packaging==24.0
pandas==2.2.2
pillow==10.3.0
proto-plus==1.23.0
protobuf==4.25.3
psutil==5.9.8
pyarrow==15.0.2
pyasn1==0.6.0
pyasn1_modules==0.4.0
pydantic==2.7.0
pydantic_core==2.18.1
pynndescent==0.5.12
pypdf==4.2.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.1
regex==2023.12.25
requests==2.31.0
requests-oauthlib==2.0.0
rsa==4.9
scikit-learn==1.4.2
scipy==1.13.0
setuptools==69.2.0
six==1.16.0
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
SQLAlchemy==2.0.29
starlette==0.37.2
strawberry-graphql==0.208.2
striprtf==0.0.26
tenacity==8.2.3
threadpoolctl==3.4.0
tiktoken==0.6.0
tqdm==4.66.2
typing-inspect==0.9.0
typing_extensions==4.11.0
tzdata==2024.1
umap-learn==0.5.6
urllib3==2.2.1
uvicorn==0.29.0
wrapt==1.16.0
yarl==1.9.4
zipp==3.18.1