{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Semantic Search\n",
        "\n",
        "The Semantic search API helps you build AI systems that use data from documents. With the API, you can search for texts with similar semantic meaning, unlike the search where you search for keywords, and find relevant texts written in other languages.\n",
        "\n",
        "The Semantic Search API calculates Vector Embeddings for all PDF documents uploaded to Cognite Data Fusion (CDF). You can read more about [embeddings](https://platform.openai.com/docs/guides/embeddings) and [AIs multitool vector embeddings](https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings).\n",
        "\n",
        "Vector embeddings are created as part of the built-in Retrieval Augmented Generation (RAG) pipeline. All PDF documents you upload to CDF are parsed and OCRed, and the extracted text is divided into suitably sized passages which are indexed into a vector store.\n",
        "\n",
        "This notebook shows you how to upload a PDF to CDF and use the Semantic search API to find relevant passages for a user question."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "trusted": true
      },
      "outputs": [],
      "source": [
        "from cognite.client import CogniteClient\n",
        "\n",
        "# Instantiate Cognite SDK client:\n",
        "client = CogniteClient()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 1. Upload PDF\n",
        "\n",
        "You can upload a PDF file to CDF one of the following ways:\n",
        "\n",
        "* Go to **_CDF_** > **_Industrial tools_** > **_Canvas_** and drag your PDF file to the canvas or upload existing files by selecting **_+ Add data_**.  \n",
        "  If you don't have a good file to upload, try this [test file](./well_report.pdf).\n",
        "\n",
        "* Go to **_CDF_** > **_Industrial tools_** > **_Data explorer_** > **_Files_** and select **_Upload_**.\n",
        "\n",
        "* Use the Python code."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "trusted": true
      },
      "outputs": [],
      "source": [
        "response1 = client.files.upload(path=\"./well_report.pdf\")\n",
        "document_id = response1.id\n",
        "print(document_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 2. Processing\n",
        "\n",
        "Once you've uploaded the file, wait for it to pass through the RAG pipeline. You can use the Document status API to poll the status."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "trusted": true
      },
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "status_path = f\"/api/v1/projects/{client.config.project}/documents/status\"\n",
        "\n",
        "body = {\n",
        "    \"items\": [\n",
        "        {\n",
        "            \"id\": document_id\n",
        "        }\n",
        "    ]\n",
        "}\n",
        "\n",
        "while True:\n",
        "    response2 = client.post(status_path, json=body, headers={\"cdf-version\": \"alpha\"}).json()\n",
        "\n",
        "    status = response2[\"items\"][0][\"semanticsearch\"][\"status\"]\n",
        "    print(f\"status: {status}\")\n",
        "\n",
        "    if status in {\"waiting\", \"progress\"}:\n",
        "        time.sleep(5)\n",
        "        continue\n",
        "\n",
        "    break"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Step 3. Search\n",
        "\n",
        "Once the document is fully indexed, search for the the relevant pasages with the Python code."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "trusted": true
      },
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "search_path = f\"/api/v1/projects/{client.config.project}/documents/semantic/search\"\n",
        "\n",
        "body = {\n",
        "    \"limit\": 3,\n",
        "    \"filter\": {\n",
        "        \"and\": [\n",
        "            {\n",
        "                \"equals\": {\n",
        "                    \"property\": [\"id\"],\n",
        "                    \"value\": document_id\n",
        "                }\n",
        "            },\n",
        "            {\n",
        "                \"semanticSearch\": {\n",
        "                    \"property\": [\"content\"],\n",
        "                    \"value\": \"Where is the Volve field located?\"\n",
        "                }\n",
        "            }\n",
        "        ]\n",
        "    }\n",
        "}\n",
        "\n",
        "response3 = client.post(search_path, json=body, headers={\"cdf-version\": \"beta\"}).json()\n",
        "\n",
        "print(json.dumps(response3, indent=2))"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python (Pyodide)",
      "language": "python",
      "name": "python"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "python",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}