{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fine-tune ModernBERT with Synthetic Data for RAG\n",
    "\n",
    "This notebook demonstrates how to fine-tune `modernbert-embed-base` on synthetic data tailored for Retrieval-Augmented Generation (RAG).\n",
    "\n",
    "It walks through the complete fine-tuning process, starting from synthetic data generated with the Synthetic Data Generator. For a comprehensive explanation of the methodology and additional details, refer to the blog post: [Fine-tune ModernBERT for RAG with Synthetic Data](https://huggingface.co/blog/fine-tune-modernbert-for-rag-with-synthetic-data)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting Started"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install the Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install torch\n",
    "!pip install datasets\n",
    "!pip install sentence-transformers\n",
    "!pip install haystack-ai\n",
    "!pip install git+https://github.com/huggingface/transformers.git  # for the latest version of transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import the Required Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch.utils.data import DataLoader\n",
    "\n",
    "from datasets import load_dataset, concatenate_datasets, Dataset, DatasetDict\n",
    "\n",
    "\n",
    "from sentence_transformers import (\n",
    "    SentenceTransformer,\n",
    "    SentenceTransformerModelCardData,\n",
    "    CrossEncoder,\n",
    "    InputExample,\n",
    "    SentenceTransformerTrainer,\n",
    ")\n",
    "from sentence_transformers.losses import TripletLoss\n",
    "from sentence_transformers.training_args import (\n",
    "    SentenceTransformerTrainingArguments,\n",
    "    BatchSamplers,\n",
    ")\n",
    "from sentence_transformers.evaluation import TripletEvaluator\n",
    "from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator\n",
    "\n",
    "\n",
    "from haystack import Document, Pipeline\n",
    "from haystack.document_stores.in_memory import InMemoryDocumentStore\n",
    "from haystack.components.embedders import (\n",
    "    SentenceTransformersDocumentEmbedder,\n",
    "    SentenceTransformersTextEmbedder,\n",
    ")\n",
    "from haystack.components.rankers import SentenceTransformersDiversityRanker\n",
    "from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever\n",
    "from haystack.components.builders import ChatPromptBuilder\n",
    "from haystack.components.generators.chat import HuggingFaceAPIChatGenerator\n",
    "from haystack.dataclasses import ChatMessage\n",
    "from haystack.utils import Secret\n",
    "from haystack.utils.hf import HFGenerationAPIType"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure the Environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL = \"nomic-ai/modernbert-embed-base\"\n",
    "REPO_NAME = \"sdiazlor\" # your HF username here\n",
    "MODEL_NAME_BIENCODER = \"modernbert-embed-base-biencoder-human-rights\"\n",
    "MODEL_NAME_CROSSENCODER = \"modernbert-embed-base-crossencoder-human-rights\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using device: mps\n"
     ]
    }
   ],
   "source": [
    "if torch.backends.mps.is_available():\n",
    "    device = \"mps\"\n",
    "elif torch.cuda.is_available():\n",
    "    device = \"cuda\"\n",
    "else:\n",
    "    device = \"cpu\"\n",
    "\n",
    "print(f\"Using device: {device}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pre-process the Synthetic Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['context', 'question', 'response', 'positive_retrieval', 'negative_retrieval', 'positive_reranking', 'negative_reranking'],\n",
       "    num_rows: 1000\n",
       "})"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Combine the generated datasets from files and prompts\n",
    "\n",
    "dataset_rag_from_file = load_dataset(f\"{REPO_NAME}/rag-human-rights-from-files\", split=\"train\")\n",
    "dataset_rag_from_prompt = load_dataset(f\"{REPO_NAME}/rag-human-rights-from-prompt\", split=\"train\")\n",
    "\n",
    "combined_rag_dataset = concatenate_datasets(\n",
    "    [dataset_rag_from_file, dataset_rag_from_prompt]\n",
    ")\n",
    "\n",
    "combined_rag_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['context', 'question', 'response', 'positive_retrieval', 'negative_retrieval', 'positive_reranking', 'negative_reranking'],\n",
       "    num_rows: 828\n",
       "})"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Filter out examples with empty or NaN values\n",
    "\n",
    "def filter_empty_or_nan(example):\n",
    "    return all(\n",
    "        value is not None and str(value).strip() != \"\" for value in example.values()\n",
    "    )\n",
    "\n",
    "filtered_rag_dataset = combined_rag_dataset.filter(filter_empty_or_nan).shuffle(seed=42)\n",
    "filtered_rag_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset({\n",
      "    features: ['anchor', 'positive', 'negative'],\n",
      "    num_rows: 828\n",
      "})\n",
      "Dataset({\n",
      "    features: ['anchor', 'positive'],\n",
      "    num_rows: 828\n",
      "})\n"
     ]
    }
   ],
   "source": [
    "# Rename, select and reorder columns according to the expected format for the SentenceTransformer and CrossEncoder models\n",
    "\n",
    "def rename_and_reorder_columns(dataset, rename_map, selected_columns):\n",
    "    for old_name, new_name in rename_map.items():\n",
    "        if old_name in dataset.column_names:\n",
    "            dataset = dataset.rename_column(old_name, new_name)\n",
    "    dataset = dataset.select_columns(selected_columns)\n",
    "    return dataset\n",
    "\n",
    "clean_rag_dataset_biencoder = rename_and_reorder_columns(\n",
    "    filtered_rag_dataset,\n",
    "    rename_map={\"context\": \"anchor\", \"positive_retrieval\": \"positive\", \"negative_retrieval\": \"negative\"},\n",
    "    selected_columns=[\"anchor\", \"positive\", \"negative\"],\n",
    ")\n",
    "\n",
    "clean_rag_dataset_crossencoder = rename_and_reorder_columns(\n",
    "    filtered_rag_dataset,\n",
    "    rename_map={\"context\": \"anchor\", \"positive_retrieval\": \"positive\"}, #TODO\n",
    "    selected_columns=[\"anchor\", \"positive\"],\n",
    ")\n",
    "\n",
    "print(clean_rag_dataset_biencoder)\n",
    "print(clean_rag_dataset_crossencoder)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Snowflake/snowflake-arctic-embed-m-v1.5 and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "406c4d22f43f41d592d3b94da2955444",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Map:   0%|          | 0/828 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['anchor', 'positive', 'score'],\n",
       "    num_rows: 828\n",
       "})"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
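   ],
   "source": [
    "# Add scores to train the CrossEncoder model, which requires sentence pairs with a score indicating how related they are.\n",
    "# Note: this checkpoint is an embedding model, so CrossEncoder attaches a newly initialized classification head (see the warning in the output); the scores are therefore uncalibrated.\n",
    "# Check the available models: https://huggingface.co/spaces/mteb/leaderboard\n",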
    "\n",
    "model_reranking = CrossEncoder(\n",
    "    model_name=\"Snowflake/snowflake-arctic-embed-m-v1.5\", device=device\n",
    ")\n",
    "\n",
    "def add_reranking_scores(batch):\n",
    "    pairs = list(zip(batch[\"anchor\"], batch[\"positive\"]))\n",
    "    batch[\"score\"] = model_reranking.predict(pairs)\n",
    "    return batch\n",
    "\n",
    "clean_rag_dataset_crossencoder = clean_rag_dataset_crossencoder.map(\n",
    "    add_reranking_scores, batched=True, batch_size=250\n",
    ")\n",
    "clean_rag_dataset_crossencoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DatasetDict({\n",
      "    train: Dataset({\n",
      "        features: ['anchor', 'positive', 'negative'],\n",
      "        num_rows: 662\n",
      "    })\n",
      "    eval: Dataset({\n",
      "        features: ['anchor', 'positive', 'negative'],\n",
      "        num_rows: 166\n",
      "    })\n",
      "})\n",
      "DatasetDict({\n",
      "    train: Dataset({\n",
      "        features: ['anchor', 'positive', 'score'],\n",
      "        num_rows: 662\n",
      "    })\n",
      "    eval: Dataset({\n",
      "        features: ['anchor', 'positive', 'score'],\n",
      "        num_rows: 166\n",
      "    })\n",
      "})\n"
     ]
    }
   ],
   "source": [
    "# Split the datasets into training and evaluation sets\n",
    "def split_dataset(dataset, train_size=0.8, seed=42):\n",
    "    train_eval_split = dataset.train_test_split(test_size=1 - train_size, seed=seed)\n",
    "\n",
    "    dataset_dict = DatasetDict(\n",
    "        {\"train\": train_eval_split[\"train\"], \"eval\": train_eval_split[\"test\"]}\n",
    "    )\n",
    "\n",
    "    return dataset_dict\n",
    "\n",
    "dataset_rag_biencoder = split_dataset(clean_rag_dataset_biencoder)\n",
    "dataset_rag_crossencoder = split_dataset(clean_rag_dataset_crossencoder)\n",
    "\n",
    "print(dataset_rag_biencoder)\n",
    "print(dataset_rag_crossencoder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the Bi-Encoder model for Retrieval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the base model and create the SentenceTransformer model\n",
    "model_biencoder = SentenceTransformer(\n",
    "    MODEL,\n",
    "    model_card_data=SentenceTransformerModelCardData(\n",
    "        language=\"en\",\n",
    "        license=\"apache-2.0\",\n",
    "        model_name=MODEL_NAME_BIENCODER,\n",
    "    ),\n",
    ")\n",
    "model_biencoder.gradient_checkpointing_enable()  # Enable gradient checkpointing to save memory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select the TripleLoss loss function which requires sentence triplets (anchor, positive, negative)\n",
    "# Check the available losses: https://sbert.net/docs/sentence_transformer/loss_overview.html\n",
    "\n",
    "loss_biencoder = TripletLoss"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/sdiazlor/.pyenv/versions/3.11.4/envs/distilabel-tutorials/lib/python3.11/site-packages/transformers/training_args.py:2243: UserWarning: `use_mps_device` is deprecated and will be removed in version 5.0 of 🤗 Transformers. `mps` device will be used by default if available similar to the way `cuda` device is used.Therefore, no action from user is required. \n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "# Define the training arguments for the SentenceTransformer model\n",
    "# Customize them as needed for your requirements\n",
    "\n",
    "training_args = SentenceTransformerTrainingArguments(\n",
    "    output_dir=f\"models/{MODEL_NAME_BIENCODER}\",\n",
    "    num_train_epochs=3,\n",
    "    per_device_train_batch_size=4,\n",
    "    gradient_accumulation_steps=4,\n",
    "    per_device_eval_batch_size=4,\n",
    "    warmup_ratio=0.1,\n",
    "    learning_rate=2e-5,\n",
    "    lr_scheduler_type=\"cosine\",\n",
    "    fp16=False,  # or True if stable on your MPS device\n",
    "    bf16=False,\n",
    "    batch_sampler=BatchSamplers.NO_DUPLICATES,\n",
    "    eval_strategy=\"epoch\",\n",
    "    save_strategy=\"epoch\",\n",
    "    save_total_limit=2,\n",
    "    logging_steps=100,\n",
    "    load_best_model_at_end=True,\n",
    "    use_mps_device=(device == \"mps\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the evaluator to assess the performance of the model\n",
    "triplet_evaluator = TripletEvaluator(\n",
    "    anchors=dataset_rag_biencoder[\"eval\"][\"anchor\"],\n",
    "    positives=dataset_rag_biencoder[\"eval\"][\"positive\"],\n",
    "    negatives=dataset_rag_biencoder[\"eval\"][\"negative\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/sdiazlor/.pyenv/versions/3.11.4/envs/distilabel-tutorials/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.\n",
      "  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='123' max='123' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [123/123 25:34, Epoch 2/3]\n",
       "    </div>\n",
       "    <table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       " <tr style=\"text-align: left;\">\n",
       "      <th>Epoch</th>\n",
       "      <th>Training Loss</th>\n",
       "      <th>Validation Loss</th>\n",
       "      <th>Cosine Accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>No log</td>\n",
       "      <td>3.655929</td>\n",
       "      <td>0.969880</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>14.374000</td>\n",
       "      <td>3.498395</td>\n",
       "      <td>0.981928</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "faad6e9752f34babadff7a966ae55d87",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/sdiazlor/.pyenv/versions/3.11.4/envs/distilabel-tutorials/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.\n",
      "  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]\n",
      "/Users/sdiazlor/.pyenv/versions/3.11.4/envs/distilabel-tutorials/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.\n",
      "  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]\n"
     ]
    }
   ],
   "source": [
    "# Train the model. This will take some time depending on the size of the dataset and the model\n",
    "# Remember to adjust the training arguments according to your requirements\n",
    "\n",
    "trainer = SentenceTransformerTrainer(\n",
    "    model=model_biencoder,\n",
    "    args=training_args,\n",
    "    train_dataset=dataset_rag_biencoder[\"train\"],\n",
    "    eval_dataset=dataset_rag_biencoder[\"eval\"],\n",
    "    loss=loss_biencoder,\n",
    "    evaluator=triplet_evaluator,\n",
    ")\n",
    "trainer.train()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the model to the local directory and push it to the Hub\n",
    "model_biencoder.save_pretrained(f\"models/{MODEL_NAME_BIENCODER}\")\n",
    "model_biencoder.push_to_hub(f\"{REPO_NAME}/{MODEL_NAME_BIENCODER}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the Cross-Encoder model for Ranking"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare the training and evaluation samples for the CrossEncoder model\n",
    "\n",
    "train_samples = []\n",
    "for row in dataset_rag_crossencoder[\"train\"]:\n",
    "    # Suppose 'score' is a float or an integer that you want to predict\n",
    "    train_samples.append(\n",
    "        InputExample(texts=[row[\"anchor\"], row[\"positive\"]], label=float(row[\"score\"]))\n",
    "    )\n",
    "\n",
    "eval_samples = []\n",
    "for row in dataset_rag_crossencoder[\"eval\"]:\n",
    "    eval_samples.append(\n",
    "        InputExample(texts=[row[\"anchor\"], row[\"positive\"]], label=float(row[\"score\"]))\n",
    "    )\n",
    "\n",
    "# Initialize the DataLoader for the training samples\n",
    "train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the CrossEncoder model. Set the number of labels to 1 for regression tasks\n",
    "model_crossencoder = CrossEncoder(model_name=MODEL, num_labels=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the evaluator\n",
    "evaluator = CECorrelationEvaluator.from_input_examples(eval_samples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9517a852f3d34cff86808c4b10cf8973",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch:   0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6e942043c5a24e77bd6172cb5492d2a7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Iteration:   0%|          | 0/166 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d039d5acf3ed424e9ff6d0b30b51aceb",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Iteration:   0%|          | 0/166 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5fd5d0442b76448e8cab18b652e29ad8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Iteration:   0%|          | 0/166 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
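   ],
   "source": [
    "# Train the CrossEncoder model\n",
    "\n",
    "model_crossencoder.fit(\n",
    "    train_dataloader=train_dataloader,\n",
    "    evaluator=evaluator,\n",
    "    epochs=3,\n",
    "    warmup_steps=500,  # note: exceeds the 498 total steps (166 batches x 3 epochs), so the learning rate never finishes warming up; consider lowering it\n",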
    "    output_path=f\"models/{MODEL_NAME_CROSSENCODER}\",\n",
    "    save_best_model=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the model to the local directory and push it to the Hub\n",
    "model_crossencoder.save_pretrained(f\"models/{MODEL_NAME_CROSSENCODER}\")\n",
    "model_crossencoder.push_to_hub(f\"{REPO_NAME}/{MODEL_NAME_CROSSENCODER}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build the RAG Pipeline\n",
    "\n",
    "This section is based on the Haystack tutorial [Creating Your First QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline); see it for further details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add the documents to the DocumentStore\n",
    "# Use the already chunked documents from the original datasets\n",
    "\n",
    "df = combined_rag_dataset.to_pandas()\n",
    "df = df.drop_duplicates(subset=[\"context\"]) # drop duplicates based on \"context\" column\n",
    "df = df.sample(n=10, random_state=42) # optional: sample a subset of the dataset\n",
    "dataset = Dataset.from_pandas(df)\n",
    "\n",
    "docs = [Document(content=doc[\"context\"]) for doc in dataset]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the document store and store the documents with the embeddings using our bi-encoder model\n",
    "\n",
    "document_store = InMemoryDocumentStore()\n",
    "doc_embedder = SentenceTransformersDocumentEmbedder(\n",
    "    model=f\"{REPO_NAME}/{MODEL_NAME_BIENCODER}\",\n",
    ")\n",
    "doc_embedder.warm_up()\n",
    "\n",
    "docs_with_embeddings = doc_embedder.run(docs)\n",
    "document_store.write_documents(docs_with_embeddings[\"documents\"])\n",
    "\n",
    "text_embedder = SentenceTransformersTextEmbedder(\n",
    "    model=f\"{REPO_NAME}/{MODEL_NAME_BIENCODER}\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the retriever (our bi-encoder model) and the ranker (our cross-encoder model)\n",
    "\n",
    "retriever = InMemoryEmbeddingRetriever(document_store)\n",
    "ranker = SentenceTransformersDiversityRanker(\n",
    "    model=f\"{REPO_NAME}/{MODEL_NAME_CROSSENCODER}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the prompt builder and the chat generator to interact with the models using the HF Serverless Inference API\n",
    "\n",
    "template = [\n",
    "    ChatMessage.from_user(\n",
    "        \"\"\"\n",
    "Given the following information, answer the question.\n",
    "\n",
    "Context:\n",
    "{% for document in documents %}\n",
    "    {{ document.content }}\n",
    "{% endfor %}\n",
    "\n",
    "Question: {{question}}\n",
    "Answer:\n",
    "\"\"\"\n",
    "    )\n",
    "]\n",
    "\n",
    "prompt_builder = ChatPromptBuilder(template=template)\n",
    "\n",
    "chat_generator = HuggingFaceAPIChatGenerator(\n",
    "    api_type=HFGenerationAPIType.SERVERLESS_INFERENCE_API,\n",
    "    api_params={\"model\": \"meta-llama/Llama-3.1-8B-Instruct\"},\n",
    "    token=Secret.from_env_var(\"HF_TOKEN\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the pipeline with the components\n",
    "\n",
    "rag_pipeline = Pipeline()\n",
    "rag_pipeline.add_component(\"text_embedder\", text_embedder)\n",
    "rag_pipeline.add_component(\"retriever\", retriever)\n",
    "rag_pipeline.add_component(\"ranker\", ranker)\n",
    "rag_pipeline.add_component(\"prompt_builder\", prompt_builder)\n",
    "rag_pipeline.add_component(\"llm\", chat_generator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<haystack.core.pipeline.pipeline.Pipeline object at 0x32e75b4d0>\n",
       "🚅 Components\n",
       "  - text_embedder: SentenceTransformersTextEmbedder\n",
       "  - retriever: InMemoryEmbeddingRetriever\n",
       "  - ranker: SentenceTransformersDiversityRanker\n",
       "  - prompt_builder: ChatPromptBuilder\n",
       "  - llm: HuggingFaceAPIChatGenerator\n",
       "🛤️ Connections\n",
       "  - text_embedder.embedding -> retriever.query_embedding (List[float])\n",
       "  - retriever.documents -> ranker.documents (List[Document])\n",
       "  - ranker.documents -> prompt_builder.documents (List[Document])\n",
       "  - prompt_builder.prompt -> llm.messages (List[ChatMessage])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Connect the components to each other\n",
    "\n",
    "rag_pipeline.connect(\"text_embedder.embedding\", \"retriever.query_embedding\")\n",
    "rag_pipeline.connect(\"retriever.documents\", \"ranker.documents\")\n",
    "rag_pipeline.connect(\"ranker\", \"prompt_builder\")\n",
    "rag_pipeline.connect(\"prompt_builder.prompt\", \"llm.messages\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "80c813c847524f1493067f6dbe65c725",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Batches:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "It seems that there is not enough information given in the human rights protocols provided to accurately answer the question. However, we can inform you that there are several types of human rights documents that this could be referring too. Event the most widely respected declared world document on human rights for Example - Exernal and some Individual (Part 1 Art.) and some other attempted Separation apart include: The convention lists several key rights such as \n",
      "\n",
      "1. Right to Life \n",
      "2. Right to Liberty and Security \n",
      "3. Freedom from Torture \n",
      "4. Freedom from Slavery \n",
      "5. Right to a Fair Trial \n",
      "6. No Punishment without Law \n",
      "7. Respect for Family Life \n",
      "... (and throughout given information 44 protocals  - are actually chapter and not... How is the answer \n",
      " \n",
      "\n",
      "Not possible to answer your question due to lack of information, however we can tell you Event the most widely respected declared world document on human rights.\n"
     ]
    }
   ],
   "source": [
    "# Query the pipeline with a question that the indexed documents do not cover\n",
    "question = \"How many human rights there are?\"\n",
    "\n",
    "response = rag_pipeline.run(\n",
    "    {\n",
    "        \"text_embedder\": {\"text\": question},\n",
    "        \"prompt_builder\": {\"question\": question},\n",
    "        \"ranker\": {\"query\": question},\n",
    "    }\n",
    ")\n",
    "\n",
    "print(response[\"llm\"][\"replies\"][0].text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2995f14154d148589129a3f449adc5d5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Batches:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The information you provided does not directly list the \"Right of Fair Trial\" but looking under articles of the Convention for the Protection of Human Rights and Fundamental Freedoms, Article 6, also known as the Right to a Fair Trial, gives a clear idea.\n",
      "\n",
      " Article 6. Right to a fair Trial\n",
      " \n",
      "\n",
      "1. Everyone is entitled to a fair and public hearing within a reasonable time by an independent and impartial tribunal established by law.\n",
      " \n",
      "2, everybody shall be presumed innocent until proven guilty by a final decision of a competent court.\n",
      " \n",
      "3. Everyone charged with a criminal offence has the following minimum rights:\n",
      "\n",
      "      a to be informed promptly, in a language which he understands and in detail, of the charges, if any, against him.\n",
      "      b to have adequate time and facilities for the preparation of his defence.\n",
      "      c to defend himself in person or through legal assistance of his own choosing or, if he has not sufficient means to pay for legal assistance, to be given it free when the interests of justice so require.\n",
      "      d to be tried in his presence, and to defend himself in person or through legal assistance of his own choosing; to be informed, if he does not have legal assistance chosen or appointed under Article 5 Part 3 of this Convention, to communicate with the defence he has chosen\n",
      "      e to have the free assistance of an interpreter if he cannot understand or speak the language used in court.\n",
      " \n",
      " \n",
      "4. Everyone sentenced has the right to, review by a higher tribunal according to law\n",
      "\n",
      "5. Everyone sentenced has the right to, take up or pursue his occupation.\n",
      "\n",
      "6. Sentences may, also include restoration of rights or removal of disabilities\n"
     ]
    }
   ],
   "source": [
    "# Query the pipeline with a question that the indexed documents cover\n",
    "question = \"What's the Right of Fair Trial?\"\n",
    "\n",
    "response = rag_pipeline.run(\n",
    "    {\n",
    "        \"text_embedder\": {\"text\": question},\n",
    "        \"prompt_builder\": {\"question\": question},\n",
    "        \"ranker\": {\"query\": question},\n",
    "    }\n",
    ")\n",
    "\n",
    "print(response[\"llm\"][\"replies\"][0].text)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "distilabel-tutorials",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}