"# **Getting to know Llama 3: Everything you need to start building**\n",
"Our goal in this session is to provide a guided tour of Llama 3 with comparison with Llama 2, including understanding different Llama 3 models, how and where to access them, Generative AI and Chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), Fine-tuning and more. All this is implemented with a starter code for you to take it and use it in your Llama 3 projects."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ioVMNcTesSEk"
},
"source": [
"### **0 - Prerequisites**\n",
"* Basic understanding of Large Language Models\n",
"* RAG (Retrieval Augmented Generation): Chat about Your Own Data\n",
"* Fine-tuning\n",
"* Agents"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sd54g0OHuqBY"
},
"source": [
"## **2 - Using and Comparing Llama 3 and Llama 2**\n",
"\n",
"In this notebook, we will use the Llama 2 70b chat and Llama 3 8b and 70b instruct models hosted on [Groq](https://console.groq.com/). You'll need to first [sign in](https://console.groq.com/) with your github or gmail account, then get an [API token](https://console.groq.com/keys) to try Groq out for free. (Groq runs Llama models very fast and they only support one Llama 2 model: the Llama 2 70b chat).\n",
"\n",
"**Note: You can also use other Llama hosting providers such as [Replicate](https://replicate.com/blog/run-llama-3-with-an-api?input=python), [Togther](https://docs.together.ai/docs/quickstart). Simply click the links here to see how to run `pip install` and use their freel trial API key with example code to modify the following three cells in 2.1 and 2.2.**\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h3YGMDJidHtH"
},
"source": [
"### **2.1 - Install dependencies**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VhN6hXwx7FCp"
},
"outputs": [],
"source": [
"!pip install groq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **2.2 - Create helpers for Llama 2 and Llama 3**\n",
"First, set your Groq API token as environment variables.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8hkWpqWD28ho"
},
"outputs": [],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"GROQ_API_TOKEN = getpass()\n",
"\n",
"os.environ[\"GROQ_API_KEY\"] = GROQ_API_TOKEN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create Llama 2 and Llama 3 helper functions - for chatbot type of apps, we'll use Llama 3 8b/70b instruct models, not the base models."
" * In-context learning - specific method of prompt engineering where demonstration of task are provided as part of prompt.\n",
" 1. Zero-shot learning - model is performing tasks without any\n",
"input examples.\n",
" 2. Few or “N-Shot” Learning - model is performing and behaving based on input examples in user's prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6W71MFNZyRkQ"
},
"outputs": [],
"source": [
"# Zero-shot example. To get positive/negative/neutral sentiment, we need to give examples in the prompt\n",
"prompt = '''\n",
"Classify: I saw a Gecko.\n",
"Sentiment: ?\n",
"\n",
"Give one word response.\n",
"'''\n",
"output = llama2(prompt)\n",
"md(output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MCQRjf1Y1RYJ"
},
"outputs": [],
"source": [
"output = llama3_8b(prompt)\n",
"md(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: Llama 3 has different opinions than Llama 2.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8UmdlTmpDZxA"
},
"outputs": [],
"source": [
"# By giving examples to Llama, it understands the expected output format.\n",
"\n",
"prompt = '''\n",
"Classify: I love Llamas!\n",
"Sentiment: Positive\n",
"Classify: I dont like Snakes.\n",
"Sentiment: Negative\n",
"Classify: I saw a Gecko.\n",
"Sentiment:\n",
"\n",
"Give one word response.\n",
"'''\n",
"\n",
"output = llama2(prompt)\n",
"md(output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "M_EcsUo1zqFD"
},
"outputs": [],
"source": [
"output = llama3_8b(prompt)\n",
"md(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: Llama 2, with few shots, has the same output \"Neutral\" as Llama 3.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mbr124Y197xl"
},
"source": [
"#### **4.2.2 - Chain of Thought**\n",
"\"Chain of thought\" enables complex reasoning through logical step by step thinking and generates meaningful and contextually relevant responses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Xn8zmLBQzpgj"
},
"outputs": [],
"source": [
"# Standard prompting\n",
"prompt = '''\n",
"Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls.\n",
"How many tennis balls does Llama have?\n",
"\n",
"Answer in one word.\n",
"'''\n",
"\n",
"output = llama3_8b(prompt)\n",
"md(output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lKNOj79o1Kwu"
},
"outputs": [],
"source": [
"output = llama3_70b(prompt)\n",
"md(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: Llama 3-8b did not get the right answer because it was asked to answer in one word.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# By default, Llama 3 models follow \"Chain-Of-Thought\" prompting\n",
"prompt = '''\n",
"Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls.\n",
"How many tennis balls does Llama have?\n",
"'''\n",
"\n",
"output = llama3_8b(prompt)\n",
"md(output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output = llama3_70b(prompt)\n",
"md(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: By default, Llama 3 models identify word problems and solves it step by step!**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prompt = \"\"\"\n",
"15 of us want to go to a restaurant.\n",
"Two of them have cars\n",
"Each car can seat 5 people.\n",
"Two of us have motorcycles.\n",
"Each motorcycle can fit 2 people.\n",
"Can we all get to the restaurant by car or motorcycle?\n",
"Think step by step.\n",
"Provide the answer as a single yes/no answer first.\n",
"Then explain each intermediate step.\n",
"\"\"\"\n",
"output = llama3_8b(prompt)\n",
"print(output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output = llama3_70b(prompt)\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: Llama 3 70b model works correctly in this example.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Summary: Llama 2 often needs encourgement for step by step thinking to correctly reasoning. Llama 3 understands, reasons and explains better, making chain of thought unnecessary in the cases above.**"
"* Prompt Eng Limitations - Knowledge cutoff & lack of specialized data\n",
"\n",
"* Retrieval Augmented Generation(RAG) allows us to retrieve snippets of information from external data sources and augment it to the user's prompt to get tailored responses from Llama 2.\n",
"\n",
"For our demo, we are going to download an external PDF file from a URL and query against the content in the pdf file to get contextually relevant information back with the help of Llama!\n",
"<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"\n",
"# Prompt Engineering with Llama 3\n",
"\n",
"Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n",
"\n",
"This interactive guide covers prompt engineering & best practices with Llama 3."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Why now?\n",
"\n",
"[Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) introduced the world to transformer neural networks (originally for machine translation). Transformers ushered an era of generative AI with diffusion models for image creation and large language models (`LLMs`) as **programmable deep learning networks**.\n",
"\n",
"Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Llama Models\n",
"\n",
"In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n",
"\n",
"Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n",
"\n",
"#### Llama 3\n",
"1. `llama-3-8b` - base pretrained 8 billion parameter model\n",
"1. `llama-3-70b` - base pretrained 70 billion parameter model\n",
"Large language models are deployed and accessed in a variety of ways, including:\n",
"\n",
"1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n",
" * Best for privacy/security or if you already have a GPU.\n",
"1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.\n",
" * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n",
"1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n",
" * Easiest option overall."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hosted APIs\n",
"\n",
"Hosted APIs are the easiest way to get started. We'll use them here. There are usually two main endpoints:\n",
"\n",
"1. **`completion`**: generate a response to a given prompt (a string).\n",
"1. **`chat_completion`**: generate the next message in a list of messages, enabling more explicit instruction and context for use cases like chatbots."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokens\n",
"\n",
"LLMs process inputs and outputs in chunks called *tokens*. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...\n",
"\n",
"> Our destiny is written in the stars.\n",
"\n",
"...is tokenized into `[\"Our\", \" destiny\", \" is\", \" written\", \" in\", \" the\", \" stars\", \".\"]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.\n",
"\n",
"Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n",
"\n",
"Each model has a maximum context length that your prompt cannot exceed. That's 8K tokens for Llama 3, 4K for Llama 2, and 100K for Code Llama. \n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook Setup\n",
"\n",
"The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3 chat using [Grok](https://console.groq.com/playground?model=llama3-70b-8192).\n",
"\n",
"To install prerequisites run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"!{sys.executable} -m pip install groq"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from typing import Dict, List\n",
"from groq import Groq\n",
"\n",
"# Get a free API key from https://console.groq.com/keys\n",
"complete_and_print(\"The typical color of the sky is: \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(\"which model version are you?\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chat Completion APIs\n",
"Chat completion models provide additional structure to interacting with an LLM. An array of structured message objects is sent to the LLM instead of a single piece of text. This message list provides the LLM with some \"context\" or \"history\" from which to continue.\n",
"\n",
"Typically, each message contains `role` and `content`:\n",
"* Messages with the `system` role are used to provide core instruction to the LLM by developers.\n",
"* Messages with the `user` role are typically human-provided messages.\n",
"* Messages with the `assistant` role are typically generated by the LLM."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = chat_completion(messages=[\n",
" user(\"My favorite color is blue.\"),\n",
" assistant(\"That's great to hear!\"),\n",
" user(\"What is my favorite color?\"),\n",
"])\n",
"print(response)\n",
"# \"Sure, I can help you with that! Your favorite color is blue.\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### LLM Hyperparameters\n",
"\n",
"#### `temperature` & `top_p`\n",
"\n",
"These APIs also take parameters which influence the creativity and determinism of your output.\n",
"\n",
"At each step, LLMs generate a list of most likely tokens and their respective probabilities. The least likely tokens are \"cut\" from the list (based on `top_p`), and then a token is randomly selected from the remaining candidates (`temperature`).\n",
"\n",
"In other words: `top_p` controls the breadth of vocabulary in a generation and `temperature` controls the randomness within that vocabulary. A temperature of ~0 produces *almost* deterministic results.\n",
"\n",
"[Read more about temperature setting here](https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683).\n",
"# These two generations are highly likely to be the same\n",
"\n",
"print_tuned_completion(1.0, 1.0)\n",
"print_tuned_completion(1.0, 1.0)\n",
"# These two generations are highly likely to be different"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prompting Techniques"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explicit Instructions\n",
"\n",
"Detailed, explicit instructions produce better results than open-ended prompts:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(prompt=\"Describe quantum physics in one short sentence of no more than 12 words\")\n",
"# Returns a succinct explanation of quantum physics that mentions particles and states existing simultaneously."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can think about giving explicit instructions as using rules and restrictions to how Llama 3 responds to your prompt.\n",
"\n",
"- Stylization\n",
" - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n",
" - `I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words:`\n",
" - `Give your answer like an old timey private investigator hunting down a case step by step.`\n",
"- Formatting\n",
" - `Use bullet points.`\n",
" - `Return as a JSON object.`\n",
" - `Use less technical terms and help me apply it in my work in communications.`\n",
"- Restrictions\n",
" - `Only use academic papers.`\n",
" - `Never give sources older than 2020.`\n",
" - `If you don't know the answer, say that you don't know.`\n",
"\n",
"Here's an example of giving explicit instructions to give more specific results by limiting the responses to recently created sources."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(\"Explain the latest advances in large language models to me.\")\n",
"# More likely to cite sources from 2017\n",
"\n",
"complete_and_print(\"Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.\")\n",
"# Gives more specific advances and only cites sources from 2020"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example Prompting using Zero- and Few-Shot Learning\n",
"\n",
"A shot is an example or demonstration of what type of prompt and response you expect from a large language model. This term originates from training computer vision models on photographs, where one shot was one example or instance that the model used to classify an image ([Fei-Fei et al. (2006)](http://vision.stanford.edu/documents/Fei-FeiFergusPerona2006.pdf)).\n",
"\n",
"#### Zero-Shot Prompting\n",
"\n",
"Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n",
"\n",
"Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(\"Text: This was the best movie I've ever seen! \\n The sentiment of the text is: \")\n",
"# Returns positive sentiment\n",
"\n",
"complete_and_print(\"Text: The director was trying too hard. \\n The sentiment of the text is: \")\n",
"# Returns negative sentiment"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Few-Shot Prompting\n",
"\n",
"Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called \"few-shot prompting\".\n",
"\n",
"In this example, the generated response follows our desired format that offers a more nuanced sentiment classifer that gives a positive, neutral, and negative response confidence percentage.\n",
"\n",
"See also: [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), [Liu et al. (2021)](https://arxiv.org/abs/2101.06804), [Su et al. (2022)](https://arxiv.org/abs/2209.01975), [Rubin et al. (2022)](https://arxiv.org/abs/2112.08633).\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def sentiment(text):\n",
" response = chat_completion(messages=[\n",
" user(\"You are a sentiment classifier. For each message, give the percentage of positive/netural/negative.\"),\n",
"# More likely to return a balanced mix of positive, neutral, and negative\n",
"print_sentiment(\"I loved it!\")\n",
"# More likely to return 100% positive\n",
"print_sentiment(\"Terrible service 0/10\")\n",
"# More likely to return 100% negative"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Role Prompting\n",
"\n",
"Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n",
"\n",
"Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(\"Explain the pros and cons of using PyTorch.\")\n",
"# More likely to explain the pros and cons of PyTorch covers general areas like documentation, the PyTorch community, and mentions a steep learning curve\n",
"\n",
"complete_and_print(\"Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.\")\n",
"# Often results in more technical benefits and drawbacks that provide more technical details on how model layers"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chain-of-Thought\n",
"\n",
"Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n",
"\n",
"Llama 3 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prompt = \"Who lived longer, Mozart or Elvis?\"\n",
"\n",
"complete_and_print(prompt)\n",
"# Llama 2 would often give the incorrect answer of \"Mozart\"\n",
"\n",
"complete_and_print(f\"{prompt} Let's think through this carefully, step by step.\")\n",
"# Gives the correct answer \"Elvis\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Self-Consistency\n",
"\n",
"LLMs are probablistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) introduces enhanced accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from statistics import mode\n",
"\n",
"def gen_answer():\n",
" response = completion(\n",
" \"John found that the average of 15 numbers is 40.\"\n",
" \"If 10 is added to each number then the mean of the numbers is?\"\n",
" \"Report the answer surrounded by backticks (example: `123`)\",\n",
" )\n",
" match = re.search(r'`(\\d+)`', response)\n",
" if match is None:\n",
" return None\n",
" return match.group(1)\n",
"\n",
"answers = [gen_answer() for i in range(5)]\n",
"\n",
"print(\n",
" f\"Answers: {answers}\\n\",\n",
" f\"Final answer: {mode(answers)}\",\n",
" )\n",
"\n",
"# Sample runs of Llama-3-70B (all correct):\n",
"# ['60', '50', '50', '50', '50'] -> 50\n",
"# ['50', '50', '50', '60', '50'] -> 50\n",
"# ['50', '50', '60', '50', '50'] -> 50"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieval-Augmented Generation\n",
"\n",
"You'll probably want to use factual knowledge in your application. You can extract common facts from today's large models out-of-the-box (i.e. using just the model weights):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(\"What is the capital of the California?\")\n",
"# Gives the correct answer \"Sacramento\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"However, more specific facts, or private information, cannot be reliably retrieved. The model will either declare it does not know or hallucinate an incorrect answer:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(\"What was the temperature in Menlo Park on December 12th, 2023?\")\n",
"# \"I'm just an AI, I don't have access to real-time weather data or historical weather records.\"\n",
"\n",
"complete_and_print(\"What time is my dinner reservation on Saturday and what should I wear?\")\n",
"# \"I'm not able to access your personal information [..] I can provide some general guidance\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt you've retrived from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning which may be costly and negatively impact the foundational model's capabilities.\n",
"\n",
"This could be as simple as a lookup table or as sophisticated as a [vector database]([FAISS](https://github.com/facebookresearch/faiss)) containing all of your company's knowledge:"
" f\"Given the following information: '{retrived_info}', respond to: '{question}'\"\n",
" )\n",
"\n",
"\n",
"def ask_for_temperature(day):\n",
" temp_on_day = MENLO_PARK_TEMPS.get(day) or \"unknown temperature\"\n",
" prompt_with_rag(\n",
" f\"The temperature in Menlo Park was {temp_on_day} on {day}'\", # Retrieved fact\n",
" f\"What is the temperature in Menlo Park on {day}?\", # User question\n",
" )\n",
"\n",
"\n",
"ask_for_temperature(\"2023-12-12\")\n",
"# \"Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit.\"\n",
"\n",
"ask_for_temperature(\"2023-07-18\")\n",
"# \"I'm not able to provide the temperature in Menlo Park on 2023-07-18 as the information provided states that the temperature was unknown.\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Program-Aided Language Models\n",
"\n",
"LLMs, by nature, aren't great at performing calculations. Let's try:\n",
"\n",
"$$\n",
"((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
"$$\n",
"\n",
"(The correct answer is 91383.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(\"\"\"\n",
"Calculate the answer to the following math problem:\n",
"\n",
"((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
"\"\"\")\n",
"# Gives incorrect answers like 92448, 92648, 95463"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"[Gao et al. (2022)](https://arxiv.org/abs/2211.10435) introduced the concept of \"Program-aided Language Models\" (PAL). While LLMs are bad at arithmetic, they're great for code generation. PAL leverages this fact by instructing the LLM to write code to solve calculation tasks."
"A common struggle with Llama 2 is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\"), even if explicit instructions are given to Llama 2 to be concise and no preamble. Llama 3 can better follow instructions.\n",
"\n",
"Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"complete_and_print(\n",
" \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n",
")\n",
"# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n",
"\n",
"complete_and_print(\n",
" \"\"\"\n",
" You are a robot that only outputs JSON.\n",
" You reply in JSON format with the field 'zip_code'.\n",
" Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n",
" Now here is my question: What is the zip code of Menlo Park?\n",
"Edited by [Dalton Flanagan](https://www.linkedin.com/in/daltonflanagan/) (dalton@meta.com) with contributions from Mohsen Agsen, Bryce Bortree, Ricardo Juan Palma Duran, Kaolin Fire, Thomas Scialom."
"## Running Meta Llama 3 on Google Colab using Hugging Face transformers library\n",
"This notebook goes over how you can set up and run Llama 3 using Hugging Face transformers library\n",
"<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Steps at a glance:\n",
"This demo showcases how to run the example with already converted Llama 3 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n",
"\n",
"To use already converted weights, start here:\n",
"1. Request download of model weights from the Llama website\n",
"2. Login to Hugging Face from your terminal using the same email address as (1). Follow the instructions [here](https://huggingface.co/docs/huggingface_hub/en/quick-start). \n",
"3. Run the example\n",
"\n",
"\n",
"Else, if you'd like to download the models locally and convert them to the HF format, follow the steps below to convert the weights:\n",
"1. Request download of model weights from the Llama website\n",
"2. Clone the llama repo and get the weights\n",
"3. Convert the model weights\n",
"4. Prepare the script\n",
"5. Run the example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using already converted weights"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Request download of model weights from the Llama website\n",
"Request download of model weights from the Llama website\n",
"Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
"\n",
"Fill the required information, select the models “Meta Llama 3” and accept the terms & conditions. You will receive a URL in your email in a short time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Prepare the script\n",
"\n",
"We will install the Transformers library and Accelerate library for our demo.\n",
"\n",
"The `Transformers` library provides many models to perform tasks on texts such as classification, question answering, text generation, etc.\n",
"The `accelerate` library enables the same PyTorch code to be run across any distributed configuration of GPUs and CPUs.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install transformers\n",
"!pip install accelerate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will import AutoTokenizer, which is a class from the transformers library that automatically chooses the correct tokenizer for a given pre-trained model, import transformers library and torch for PyTorch.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"import transformers\n",
"import torch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 8b chat model `meta-llama/Meta-Llama-3-8B-Instruct`. Using Meta models from Hugging Face requires you to\n",
"\n",
"1. Accept Terms of Service for Meta Llama 3 on Meta [website](https://llama.meta.com/llama-downloads).\n",
"2. Use the same email address from Step (1) to login into Hugging Face.\n",
"\n",
"Follow the instructions on this Hugging Face page to login from your [terminal](https://huggingface.co/docs/huggingface_hub/en/quick-start). "
"Now, we will use the `from_pretrained` method of `AutoTokenizer` to create a tokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pipeline = transformers.pipeline(\n",
"\"text-generation\",\n",
" model=model,\n",
" torch_dtype=torch.float16,\n",
" device_map=\"auto\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Run the example\n",
"\n",
"Now, let’s create the pipeline for text generation. We’ll also set the device_map argument to `auto`, which means the pipeline will automatically use a GPU if one is available.\n",
"\n",
"Let’s also generate a text sequence based on the input that we provide. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sequences = pipeline(\n",
" 'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
" do_sample=True,\n",
" top_k=10,\n",
" num_return_sequences=1,\n",
" eos_token_id=tokenizer.eos_token_id,\n",
" truncation = True,\n",
" max_length=400,\n",
")\n",
"\n",
"for seq in sequences:\n",
" print(f\"Result: {seq['generated_text']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"\n",
"### Downloading and converting weights to Hugging Face format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Request download of model weights from the Llama website\n",
"Request download of model weights from the Llama website\n",
"Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
"\n",
"Fill the required information, select the models \"Meta Llama 3\" and accept the terms & conditions. You will receive a URL in your email in a short time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Clone the llama repo and get the weights\n",
"Git clone the [Meta Llama 3 repo](https://github.com/meta-llama/llama3). Run the `download.sh` script and follow the instructions. This will download the model checkpoints and tokenizer.\n",
"\n",
"This example demonstrates a Meta Llama 3 model with 8B-instruct parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Convert the model weights using Hugging Face transformer from source\n",
"We need a way to use our model for inference. Pipeline allows us to specify which type of task the pipeline needs to run (`text-generation`), specify the model that the pipeline should use to make predictions (`model`), define the precision to use this model (`torch.float16`), device on which the pipeline should run (`device_map`) among various other options. \n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pipeline = transformers.pipeline(\n",
" \"text-generation\",\n",
" model=model,\n",
" tokenizer=tokenizer,\n",
" torch_dtype=torch.float16,\n",
" device_map=\"auto\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have our pipeline defined, and we need to provide some text prompts as inputs to our pipeline to use when it runs to generate responses (`sequences`). The pipeline shown in the example below sets `do_sample` to True, which allows us to specify the decoding strategy we’d like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. \n",
"\n",
"By changing `max_length`, you can specify how long you’d like the generated response to be. \n",
"Setting the `num_return_sequences` parameter to greater than one will let you generate more than one output.\n",
"\n",
"In your script, add the following to provide input, and information on how to run the pipeline:\n",
"\n",
"\n",
"#### 5. Run the example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sequences = pipeline(\n",
" 'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
"This notebook goes over how you can set up and run Llama 3 locally on a Mac, Windows or Linux using [Ollama](https://ollama.com/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Steps at a glance:\n",
"1. Download and install Ollama.\n",
"2. Download and test run Llama 3.\n",
"3. Use local Llama 3 via Python.\n",
"4. Use local Llama 3 via LangChain.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Download and install Ollama\n",
"\n",
"On Mac or Windows, go to the Ollama download page [here](https://ollama.com/download) and select your platform to download it, then double click the downloaded file to install Ollama.\n",
"\n",
"On Linux, you can simply run on a terminal `curl -fsSL https://ollama.com/install.sh | sh` to download and install Ollama."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Download and test run Llama 3\n",
"\n",
"On a terminal or console, run `ollama pull llama3` to download the Llama 3 8b chat model, in the 4-bit quantized format with size about 4.7 GB.\n",
"\n",
"Run `ollama pull llama3:70b` to download the Llama 3 70b chat model, also in the 4-bit quantized format with size 39GB.\n",
"\n",
"Then you can run `ollama run llama3` and ask Llama 3 questions such as \"who wrote the book godfather?\" or \"who wrote the book godfather? answer in one sentence.\" You can also try `ollama run llama3:70b`, but the inference speed will most likely be too slow - for example, on an Apple M1 Pro with 32GB RAM, it takes over 10 seconds to generate one token using Llama 3 70b chat (vs over 10 tokens per second with Llama 3 8b chat).\n",
"\n",
"You can also run the following command to test Llama 3 8b chat:\n",
"```\n",
" curl http://localhost:11434/api/chat -d '{\n",
" \"model\": \"llama3\",\n",
" \"messages\": [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"who wrote the book godfather?\"\n",
" }\n",
" ],\n",
" \"stream\": false\n",
"}'\n",
"```\n",
"\n",
"The complete Ollama API doc is [here](https://github.com/ollama/ollama/blob/main/docs/api.md)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Use local Llama 3 via Python\n",
"\n",
"The Python code below is the port of the curl command above."
"response = llama3(\"who wrote the book godfather\")\n",
"print(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4. Use local Llama 3 via LangChain\n",
"\n",
"Code below use LangChain with Ollama to query Llama 3 running locally. For a more advanced example of using local Llama 3 with LangChain and agent-powered RAG, see [this](https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_rag_agent_llama3_local.ipynb)."
"This notebook shows examples of how to use CodeShield. For further information, see the main repository README [here](https://github.com/meta-llama/PurpleLlama/tree/main/CodeShield).\n",
"\n",
"# Getting Started \n",
"\n",
"Either install via PyPi using pip, or install it locally from source. \n",
"\n",
"#### Install Option 1. Install CodeShield package. Run the following in your terminal\n",
" system_prompt= \"You are an expert code developer. You output only code and nothing else\", \n",
" prompt= \"Output a single python function which calculates the md5 hash of a string provided as an argument to the function. Output only the code and nothing else.\"\n",
"Drawing inspiration from the cybersecurity concept of \"purple teaming,\" Purple Llama embraces both offensive (red team) and defensive (blue team) strategies. Our goal is to empower developers in deploying generative AI models responsibly, aligning with best practices outlined in our Responsible Use Guide."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FGaLD_dLs5st"
},
"source": [
"http://bit.ly/purplellama_using_anyscale\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PGPSI3M5PGTi"
},
"source": [
"#### **1 - What is Purple Llama?**\n",
"\n",
"Purple Llama is a an umbrella project that over time will bring together tools and evals to help the community build responsibly with open generative AI models. The initial release will include tools and evals for Cyber Security and Input/Output safeguards but we plan to contribute more in the near future.\n",
"In this notebook, We will use the Llama Guard model managed by the [Anyscale Endpoints](https://app.endpoints.anyscale.com/) for inferencing. You'll need to first register an account with Anyscale [here](https://app.endpoints.anyscale.com) then obtain an Anyscale API key [here](https://app.endpoints.anyscale.com/console/credentials). Anyscale offers the first million tokens for free so you can try it out with Llama.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C4P-b0eQcyd2"
},
"source": [
"First, set the Anyscale API base and key as environment variables. You can sign up [here](https://app.endpoints.anyscale.com), with the first 1M tokens free, to get an Anyscale API key.\n",
"PROMPT_TEMPLATE = Template(\"\"\"Task: Check if there is unsafe content in '$agent_type' messages in conversations according our safety policy with the below categories.\n",
"<BEGIN UNSAFE CONTENT CATEGORIES>\n",
"O1: Violence and Hate.\n",
"Should not\n",
"- Help people plan or engage in violence.\n",
"- Encourage people to engage in violence.\n",
"- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.\n",
"- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.\n",
"Can\n",
"- Provide information on violence and discrimination.\n",
"- Discuss the topics of hate and violence, and can discuss historical events involving violence.\n",
"O2: Sexual Content.\n",
"Should not\n",
"- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.\n",
"Can\n",
"- Discuss topics related to sexuality and sexual education.\n",
"- Provide general information regarding sexual activities.\n",
"- Provide guidance on topics related to sexual health.\n",
"O3: Criminal Planning.\n",
"Should not\n",
"- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.\n",
"- Encourage people to engage in criminal activities or condone criminal behavior.\n",
"Can\n",
"- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.\n",
"- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.\n",
"O4: Guns and Illegal Weapons.\n",
"Should not\n",
"- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.\n",
"- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.\n",
"Can\n",
"- Discuss firearms and the arguments for and against firearm ownership.\n",
"O5: Regulated or Controlled Substances.\n",
"Should not\n",
"- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.\n",
"- Assist or encourage people to create such substances when it is illegal to do so.\n",
"Can\n",
"- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).\n",
"- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.\n",
"O6: Self-Harm.\n",
"Should not\n",
"- Encourage people to harm themselves.\n",
"- Romanticize or condone self-harm or suicide.\n",
"- Provide information regarding the methods of suicide or self-harm.\n",
"- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.\n",
"Can\n",
"- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).\n",
"Should\n",
"- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.\n",
"<END UNSAFE CONTENT CATEGORIES>\n",
"\n",
"<BEGIN CONVERSATION>\n",
"\n",
"$prompt\n",
"\n",
"<END CONVERSATION>\n",
"\n",
"Provide your safety assessment for $agent_type in the above conversation:\n",
"- First line must read 'safe' or 'unsafe'.\n",
"- If unsafe, a second line must include a comma-separated list of violated categories.\"\"\")\n",
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"llmx 0.0.15a0 requires cohere, which is not installed.\n",
"llmx 0.0.15a0 requires tiktoken, which is not installed.\u001b[0m\u001b[31m\n",
"Drawing inspiration from the cybersecurity concept of \"purple teaming,\" Purple Llama embraces both offensive (red team) and defensive (blue team) strategies. Our goal is to empower developers in deploying generative AI models responsibly, aligning with best practices outlined in our Responsible Use Guide."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PGPSI3M5PGTi"
},
"source": [
"#### **1 - What is Purple Llama?**\n",
"\n",
"Purple Llama is a an umbrella project that over time will bring together tools and evals to help the community build responsibly with open generative AI models. The initial release will include tools and evals for Cyber Security and Input/Output safeguards but we plan to contribute more in the near future.\n",
"In this notebook, We will use the Llama Guard model managed by the [OctoAI](https://octoai.cloud/tools/text) for inferencing. You'll need to first register an account with OctoAI [here](https://octoai.cloud/) then obtain an OctoAI API token [here](https://docs.octoai.cloud/docs/how-to-create-an-octoai-access-token), that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first).\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C4P-b0eQcyd2"
},
"source": [
"First, set the OctoAI API token as environment variables.\n",
"PROMPT_TEMPLATE = Template(\"\"\"Task: Check if there is unsafe content in '$agent_type' messages in conversations according our safety policy with the below categories.\n",
"<BEGIN UNSAFE CONTENT CATEGORIES>\n",
"O1: Violence and Hate.\n",
"Should not\n",
"- Help people plan or engage in violence.\n",
"- Encourage people to engage in violence.\n",
"- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.\n",
"- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.\n",
"Can\n",
"- Provide information on violence and discrimination.\n",
"- Discuss the topics of hate and violence, and can discuss historical events involving violence.\n",
"O2: Sexual Content.\n",
"Should not\n",
"- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.\n",
"Can\n",
"- Discuss topics related to sexuality and sexual education.\n",
"- Provide general information regarding sexual activities.\n",
"- Provide guidance on topics related to sexual health.\n",
"O3: Criminal Planning.\n",
"Should not\n",
"- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.\n",
"- Encourage people to engage in criminal activities or condone criminal behavior.\n",
"Can\n",
"- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.\n",
"- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.\n",
"O4: Guns and Illegal Weapons.\n",
"Should not\n",
"- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.\n",
"- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.\n",
"Can\n",
"- Discuss firearms and the arguments for and against firearm ownership.\n",
"O5: Regulated or Controlled Substances.\n",
"Should not\n",
"- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.\n",
"- Assist or encourage people to create such substances when it is illegal to do so.\n",
"Can\n",
"- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).\n",
"- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.\n",
"O6: Self-Harm.\n",
"Should not\n",
"- Encourage people to harm themselves.\n",
"- Romanticize or condone self-harm or suicide.\n",
"- Provide information regarding the methods of suicide or self-harm.\n",
"- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.\n",
"Can\n",
"- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).\n",
"Should\n",
"- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.\n",
"<END UNSAFE CONTENT CATEGORIES>\n",
"\n",
"<BEGIN CONVERSATION>\n",
"\n",
"$prompt\n",
"\n",
"<END CONVERSATION>\n",
"\n",
"Provide your safety assessment for $agent_type in the above conversation:\n",
"- First line must read 'safe' or 'unsafe'.\n",
"- If unsafe, a second line must include a comma-separated list of violated categories.\"\"\")\n",
Meta Llama Guard and Meta Llama Guard 2 are new models that provide input and output guardrails for LLM inference. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard2).
**Note** Please find the right model on HF side [here](https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B).
### Running locally
The [llama_guard](llama_guard) folder contains the inference script to run Meta Llama Guard locally. Add test prompts directly to the [inference script](llama_guard/inference.py) before running it.
### Running on the cloud
The notebooks [Purple_Llama_Anyscale](Purple_Llama_Anyscale.ipynb) & [Purple_Llama_OctoAI](Purple_Llama_OctoAI.ipynb) contain examples for running Meta Llama Guard on cloud hosted endpoints.
"In this notebook we will deep dive into how to do input and output safety checks using Meta's Llama Guard and Llama models. Doing both input and output safety checks requires us to carefully pass the prompts and response between the models.\n",
"\n",
"For this notebook we will use OctoAI models as both Llama Guard and Llama models are available on their platform."
" f\"\"\"Task: Check if there is unsafe content in '$agent_type' messages in conversations according our safety policy with the below categories.\n",
"<BEGIN UNSAFE CONTENT CATEGORIES>\n",
"O1: Violence and Hate.\n",
"Should not\n",
"- Help people plan or engage in violence.\n",
"- Encourage people to engage in violence.\n",
"- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.\n",
"- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.\n",
"Can\n",
"- Provide information on violence and discrimination.\n",
"- Discuss the topics of hate and violence, and can discuss historical events involving violence.\n",
"O2: Sexual Content.\n",
"Should not\n",
"- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.\n",
"Can\n",
"- Discuss topics related to sexuality and sexual education.\n",
"- Provide general information regarding sexual activities.\n",
"- Provide guidance on topics related to sexual health.\n",
"O3: Criminal Planning.\n",
"Should not\n",
"- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.\n",
"- Encourage people to engage in criminal activities or condone criminal behavior.\n",
"Can\n",
"- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.\n",
"- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.\n",
"O4: Guns and Illegal Weapons.\n",
"Should not\n",
"- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.\n",
"- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.\n",
"Can\n",
"- Discuss firearms and the arguments for and against firearm ownership.\n",
"O5: Self-Harm.\n",
"Should not\n",
"- Encourage people to harm themselves.\n",
"- Romanticize or condone self-harm or suicide.\n",
"- Provide information regarding the methods of suicide or self-harm.\n",
"- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.\n",
"Can\n",
"- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).\n",
"Should\n",
"- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.\n",
"<END UNSAFE CONTENT CATEGORIES>\n",
"\n",
"<BEGIN CONVERSATION>\n",
"\n",
"$prompt\n",
"\n",
"<END CONVERSATION>\n",
"\n",
"Provide your safety assessment for $agent_type in the above conversation:\n",
"- First line must read 'safe' or 'unsafe'.\n",
"- If unsafe, a second line must include a comma-separated list of violated categories.\"\"\"\n",
")\n",
"\n",
"LLAMA_PROMPT_TEMPLATE = Template(\n",
" f\"\"\"<<SYS>>\n",
"You are a helpful agent, answering questions of the user in a short and consise manner.\n",
Meta Llama Guard is a language model that provides input and output guardrails for LLM inference. For more details and model cards, please visit the main repository for each model, [Meta Llama Guard](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard) and Meta [Llama Guard 2](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard2).
This folder contains an example file to run inference with a locally hosted model, either using the Hugging Face Hub or a local path.
## Requirements
1. Access to Llama guard model weights on Hugging Face. To get access, follow the steps described [here](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard#download)
2. Llama recipes package and it's dependencies [installed](https://github.com/meta-llama/llama-recipes?tab=readme-ov-file#installing)
## Llama Guard inference script
For testing, you can add User or User/Agent interactions into the prompts list and the run the script to verify the results. When the conversation has one or more Agent responses, it's considered of type agent.
```
prompts: List[Tuple[List[str], AgentType]] = [
(["<Sample user prompt>"], AgentType.USER),
(["<Sample user prompt>",
"<Sample agent response>"], AgentType.AGENT),
(["<Sample user prompt>",
"<Sample agent response>",
"<Sample user reply>",
"<Sample agent response>",], AgentType.AGENT),
]
```
The complete prompt is built with the `build_custom_prompt` function, defined in [prompt_format.py](../../../src/llama_recipes/inference/prompt_format_utils.py). The file contains the default Meta Llama Guard categories. These categories can adjusted and new ones can be added, as described in the [research paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/), on section 4.5 Studying the adaptability of the model.
<!-- markdown-link-check-enable -->
To run the samples, with all the dependencies installed, execute this command:
Note: Make sure to also add the llama_guard_version if when it does not match the default, the script allows you to run the prompt format from Meta Llama Guard 1 on Meta Llama Guard 2
## Inference Safety Checker
When running the regular inference script with prompts, Meta Llama Guard will be used as a safety checker on the user prompt and the model output. If both are safe, the result will be shown, else a message with the error will be shown, with the word unsafe and a comma separated list of categories infringed. Meta Llama Guard is always loaded quantized using Hugging Face Transformers library with bitsandbytes.
In this case, the default categories are applied by the tokenizer, using the `apply_chat_template` method.
Use this command for testing with a quantized Llama model, modifying the values accordingly:
"You will be using [Replicate](https://replicate.com/meta/meta-llama-3-8b-instruct) to run the examples here. You will need to first sign in with Replicate with your github account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. You can also use other Llama 3 cloud providers such as [Groq](https://console.groq.com/), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), or [Anyscale](https://app.endpoints.anyscale.com/playground) - see Section 2 of the Getting to Know Llama [notebook](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Getting_to_know_Llama.ipynb) for more information.\n",
"\n",
"If you'd like to run Llama 3 locally for the benefits of privacy, no cost or no rate limit (some Llama 3 hosting providers set limits for free plan of queries or tokens per second or minute), see [Running Llama Locally](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb)."
"You'll set up the Llama 3 8b chat model from Replicate. You can also use Llama 3 70b model by replacing the `model` name with \"meta/meta-llama-3-70b-instruct\"."
"Next you will use the [Tavily](https://tavily.com/) search engine to augment the Llama 3's responses. To create a free trial Tavily Search API, sign in with your Google or Github account [here](https://app.tavily.com/sign-in)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75275628-5235-4b55-8033-601c76107528",
"metadata": {},
"outputs": [],
"source": [
"from tavily import TavilyClient\n",
"\n",
"TAVILY_API_KEY = getpass()\n",
"tavily = TavilyClient(api_key=TAVILY_API_KEY)"
]
},
{
"cell_type": "markdown",
"id": "476d72da",
"metadata": {},
"source": [
"Do a live web search on \"Llama 3 fine-tuning\"."
"In this notebook you'll learn how to build a powerful media generation pipeline in a few simple steps. More specifically, this pipeline will generate a ~1min long food recipe video entirely from just the name of a dish.\n",
"\n",
"This demo in particular showcases the ability for Llama3 to produce creative recipes while following JSON formatting guidelines very well.\n",
"\n",
"[Example Video Output for \"dorritos consomme\"](https://drive.google.com/file/d/1AP3VUlAmOUU6rcZp1wQ4v4Fyf5-0tky_/view?usp=drive_link)\n",
"Let's take a look at the high level steps needed to go from the name of a dish, e.g. \"baked alaska\" to a fully fledged recipe video:\n",
"1. We use a Llama3-70b-instruct LLM to generate a recipe from the name of a dish. The recipe is formatted in JSON which breaks down the recipe into the following fields: recipe title, prep time, cooking time, difficulty, ingredients list and instruction steps.\n",
"2. We use SDXL to generate a frame for the finished dish, each one of the ingredients, and each of the recipe steps.\n",
"3. We use Stable Video Diffusion 1.1 to animate each frame into a short 4 second video.\n",
"4. Finally we stitch all of the videos together using MoviePy, add subtitles and a soundtrack.\n",
"\n",
"## Pre-requisites\n",
"\n",
"### OctoAI\n",
"We'll use [OctoAI](https://octo.ai/) to power all of the GenAI needs of this notebook: LLMs, image gen, image animation.\n",
"* To use OctoAI, you'll need to go to https://octoai.cloud/ and sign in using your Google or GitHub account.\n",
"* Next you'll need to generate an OctoAI API token by following these [instructions](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token). Keep the API token in hand, we'll need it further down in this notebook.\n",
"\n",
"In this example we will use the Llama 3 70b instruct model. You can find more on Llama models on the [OctoAI text generation solution page](https://octoai.cloud/text).\n",
"\n",
"At the time of writing this notebook the following Llama models are available on OctoAI:\n",
"* meta-llama-3-8b-instruct\n",
"* meta-llama-3-70b-instruct\n",
"* codellama-7b-instruct\n",
"* codellama-13b-instruct\n",
"* codellama-34b-instruct\n",
"* llama-2-13b-chat\n",
"* llama-2-70b-chat\n",
"* llamaguard-7b\n",
"\n",
"### Local Python Notebook\n",
"We highly recommend launching this notebook from a fresh python environment, for instance you can run the following:\n",
"```\n",
"python3 -m venv .venv \n",
"source .venv/bin/activate\n",
"```\n",
"All you need to run this notebook is to install jupyter notebook with `python3 -m pip install notebook` then run `jupyter notebook` ([link](https://jupyter.org/install)) in the same directory as this `.ipynb` file.\n",
"You don't need to install additional pip packages ahead of running the notebook, since those will be installed right at the beginning. You will need to ensure your system has `imagemagick` installed by following the [instructions](https://imagemagick.org/script/download.php)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b38d7fe4-789d-4a2f-9f3b-a185d35fb005",
"metadata": {
"id": "b38d7fe4-789d-4a2f-9f3b-a185d35fb005"
},
"outputs": [],
"source": [
"# This can take a few minutes on Colab, please be patient!\n",
"# Note: in colab you may have to restart the runtime to get all of the\n",
"# dependencies set up properly (a message will instruct you to do so)\n",
"import platform\n",
"if platform.system() == \"Linux\":\n",
" # Tested on colab - requires a few steps to get imagemagick installed correctly\n",
"# 1. Recipe Generation with Langchain using a Llama3-70b-instruct hosted on OctoAI\n",
"\n",
"In this first section, we're going to show how you can use Llama3-70b-instruct LLM hosted on OctoAI. Here we're using Langchain, a popular Python based library to build LLM-powered application.\n",
"\n",
"[Llama 3](https://llama.meta.com/llama3/) is Meta AI's latest open source model in the Llama family.\n",
"\n",
"The key here is to rely on the OctoAIEndpoint LLM by adding the following line to your python script:\n",
"Then you can instantiate your `OctoAIEndpoint` LLM by passing in under the `model_kwargs` dictionary what model you wish to use (there is a rather wide selection you can consult [here](https://octo.ai/docs/text-gen-solution/getting-started#self-service-models)), and what the maximum number of tokens should be set to.\n",
"\n",
"Next you need to define your prompt template. The key here is to provide enough rules to guide the LLM into generating a recipe with just the right amount of information and detail. This will make the text generated by the LLM usable in the next generation steps (image generation, image animation etc.).\n",
"\n",
"> ⚠️ Note that we're generating intentionally a short recipe according to the prompt template - this is to ensure we can go through this notebook fairly quickly the first time. If you want to generate a full recipe, delete the following line from the prompt template.\n",
"```\n",
"Use only two ingredients, and two instruction steps.\n",
"```\n",
"\n",
"Finally we create an LLM chain by passing in the LLM and the prompt template we just instantiated.\n",
"\n",
"This chain is now ready to be invoked by passing in the user input, namely: the name of the dish to generate a recipe for. Let's invoke the chain and see what recipe our LLM just thought about."
"Given the name of a dish, generate a recipe that's easy to follow and leads to a delicious and creative dish.\n",
"\n",
"Use only two ingredients, and two instruction steps.\n",
"\n",
"Here are some rules to follow at all costs:\n",
"0. Respond back only as only JSON!!!\n",
"1. Provide a list of ingredients needed for the recipe.\n",
"2. Provide a list of instructions to follow the recipe.\n",
"3. Each instruction should be concise (1 sentence max) yet informative. It's preferred to provide more instruction steps with shorter instructions than fewer steps with longer instructions.\n",
"4. For the whole recipe, provide the amount of prep and cooking time, with a classification of the recipe difficulty from easy to hard.\n",
"\n",
"{format_instructions}\n",
"\n",
"Human: Generate a recipe for a dish called {human_input}\n",
"# 2. Generate images that narrate the recipe with SDXL hosted on OctoAI\n",
"\n",
"In this section we'll rely on OctoAI's SDK to invoke the image generation endpoint powered by Stable Diffusion XL. Now that we have our recipe stored in JSON object we'll generate the following images:\n",
"* A set of images for every ingredient used in the recipe, stored in `ingredient_images`\n",
"* A set of images for every step in the recipe, stored in `step_images`\n",
"* An image of the final dish, stored under `final_dish_still`\n",
"\n",
"We rely on the OctoAI Python SDK to generate those images with SDXL. You just need to instantiate the OctoAI ImageGenerator with your OctoAI API token, then invoke the `generate` method for each set of images you want to produce. You'll need to pass in the following arguments:\n",
"* `engine` which selects what model to use - we use SDXL here\n",
"* `prompt` which describes the image we want to generate\n",
"* `negative_prompt` which provides image attributes/keywords that we absolutely don't want to have in our final image\n",
"* `width`, `height` which helps us specify a resolution and aspect ratio of the final image\n",
"* `sampler` which is what's used in every denoising step, you can read more about them [here](https://stable-diffusion-art.com/samplers/)\n",
"* `steps` which specifies the number of denoising steps to obtain the final image\n",
"* `cfg_scale` which specifies the configuration scale, which defines how closely to adhere to the original prompt\n",
"* `num_images` which specifies the number of images to generate at once\n",
"* `use_refiner` which when turned on lets us use the SDXL refiner model which enhances the quality of the image\n",
"* `high_noise_frac` which specifies the ratio of steps to perform with the base SDXL model vs. refiner model\n",
"* `style_preset` which specifies a stype preset to apply to the negative and positive prompts, you can read more about them [here](https://stable-diffusion-art.com/sdxl-styles/)\n",
"\n",
"To read more about the API and what options are supported in OctoAI, head over to this [link](https://octoai.cloud/media/image-gen?mode=api).\n",
"\n",
"**Note:** Looking to use a specific SDXL checkpoint, LoRA or controlnet for your image generation needs? You can manage and upload your own collection of stable diffusion assets via the [OctoAI CLI](https://octo.ai/docs/media-gen-solution/uploading-a-custom-asset-to-the-octoai-asset-library), or via the [web UI](https://octoai.cloud/assets?isPublic=false). You can then invoke your own [checkpoint](https://octo.ai/docs/media-gen-solution/customizations/checkpoints), [LoRA](https://octo.ai/docs/media-gen-solution/customizations/loras), [textual inversion](https://octo.ai/docs/media-gen-solution/customizations/textual-inversions), or [controlnet](https://octo.ai/docs/media-gen-solution/customizations/controlnets) via the `ImageGenerator` API."
"# 3. Animate the images with Stable Video Diffusion 1.1 hosted on OctoAI\n",
"\n",
"In this section we'll rely once again on OctoAI's SDK to invoke the image animation endpoint powered by Stable Video Diffusion 1.1. In the last section we generated a handful of images which we're now going to animate:\n",
"* A set of videos for every ingredient used in the recipe, stored in `ingredient_videos`\n",
"* A set of videos for every step in the recipe, stored in `steps_videos`\n",
"* An videos of the final dish, stored under `final_dish_video`\n",
"\n",
"From these we'll be generating 25-frame videos using the image animation API in OctoAI's Python SDK. You just need to instantiate the OctoAI VideoGenerator with yout OctoAI API token, then invoke the `generate` method for each animation you want to produce. You'll need to pass in the following arguments:\n",
"* `engine` which selects what model to use - we use SVD here\n",
"* `image` which encodes the input image we want to animate as a base64 string\n",
"* `steps` which specifies the number of denoising steps to obtain each frame in the video\n",
"* `cfg_scale` which specifies the configuration scale, which defines how closely to adhere to the image description\n",
"* `fps` which specifies the numbers of frames per second\n",
"* `motion scale` which indicates how much motion should be in the generated animation\n",
"* `noise_aug_strength` which specifies how much noise to add to the initial images - a higher value encourages more creative videos\n",
"* `num_video` which represents how many output animations to generate\n",
"\n",
"To read more about the API and what options are supported in OctoAI, head over to this [link](https://octoai.cloud/media/animate?mode=api).\n",
"\n",
"**Note:** this step will take a few minutes, as each video takes about 30s to generate and that we're generating each video sequentially. For faster execution time all of these video generation calls can be done asynchronously, or in multiple threads."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ee69ac0-8cd1-4c5b-8090-e2b1aba9bbd9",
"metadata": {
"id": "4ee69ac0-8cd1-4c5b-8090-e2b1aba9bbd9"
},
"outputs": [],
"source": [
"# We'll need this helper to convert PIL images into a base64 encoded string\n",
"In this section we're going to rely on the MoviePy library to create a montage of the videos.\n",
"\n",
"For each short animation (dish, ingredients, steps), we also have corresponding text that goes with it from the original `recipe_dict` JSON object. This allows us to generate a montage captions.\n",
"\n",
"Each video having 25 frames and being a 6FPS video, they will last 4.167s each. Because the ingredients list can be rather long, we crop each video to a duration of 2s to keep the flow of the video going. For the steps video, we play 4s of each clip given that we need to give the viewer time to read the instructions.\n",
"<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/use_cases/RAG/HelloLlamaCloud.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"\n",
"## This demo app shows:\n",
"* How to run Llama 3 in the cloud hosted on Replicate\n",
"* How to use LangChain to ask Llama general questions and follow up questions\n",
"* How to use LangChain to load a recent web page - Hugging Face's [blog post on Llama 3](https://huggingface.co/blog/llama3) - and chat about it. This is the well known RAG (Retrieval Augmented Generation) method to let LLM such as Llama 3 be able to answer questions about the data not publicly available when Llama 3 was trained, or about your own data. RAG is one way to prevent LLM's hallucination\n",
"\n",
"**Note** We will be using [Replicate](https://replicate.com/meta/meta-llama-3-8b-instruct) to run the examples here. You will need to first sign in with Replicate with your github account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. You can also use other Llama 3 cloud providers such as [Groq](https://console.groq.com/), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), or [Anyscale](https://app.endpoints.anyscale.com/playground) - see Section 2 of the Getting to Know Llama [notebook](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Getting_to_know_Llama.ipynb) for more information."
]
},
{
"cell_type": "markdown",
"id": "61dde626",
"metadata": {},
"source": [
"Let's start by installing the necessary packages:\n",
"- sentence-transformers for text embeddings\n",
"- FAISS gives us database capabilities \n",
"- LangChai provides necessary RAG tools for this demo"
"Next we call the Llama 3 8b chat model from Replicate. You can also use Llama 3 70b model by replacing the `model` name with \"meta/meta-llama-3-70b-instruct\"."
"With the model set up, you are now ready to ask some questions. Here is an example of the simplest way to ask the model some general questions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "493a7148",
"metadata": {},
"outputs": [],
"source": [
"question = \"who wrote the book Innovator's dilemma?\"\n",
"answer = llm.invoke(question)\n",
"print(answer)"
]
},
{
"cell_type": "markdown",
"id": "f315f000",
"metadata": {},
"source": [
"We will then try to follow up the response with a question asking for more information on the book. \n",
"\n",
"Since the chat history is not passed on Llama doesn't have the context and doesn't know this is more about the book thus it treats this as new query.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b5c8676",
"metadata": {},
"outputs": [],
"source": [
"# chat history not passed so Llama doesn't have the context and doesn't know this is more about the book\n",
"followup = \"tell me more\"\n",
"followup_answer = llm.invoke(followup)\n",
"print(followup_answer)"
]
},
{
"cell_type": "markdown",
"id": "9aeaffc7",
"metadata": {},
"source": [
"To get around this we will need to provide the model with history of the chat. \n",
"\n",
"To do this, we will use [`ConversationBufferMemory`](https://python.langchain.com/docs/modules/memory/types/buffer) to pass the chat history to the model and give it the capability to handle follow up questions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5428ca27",
"metadata": {},
"outputs": [],
"source": [
"# using ConversationBufferMemory to pass memory (chat history) for follow up questions\n",
"We need to store our document in a vector store. There are more than 30 vector stores (DBs) supported by LangChain. \n",
"For this example we will use [FAISS](https://github.com/facebookresearch/faiss), a popular open source vector store by Facebook.\n",
"For other vector stores especially if you need to store a large amount of data - see [here](https://python.langchain.com/docs/integrations/vectorstores).\n",
"\n",
"We will also import the HuggingFaceEmbeddings and RecursiveCharacterTextSplitter to assist in storing the documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eecb6a34",
"metadata": {},
"outputs": [],
"source": [
"# Split the document into chunks with a specified chunk size\n",
"To store the documents, we will need to split them into chunks using [`RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) and create vector representations of these chunks using [`HuggingFaceEmbeddings`](https://www.google.com/search?q=langchain+hugging+face+embeddings&sca_esv=572890011&ei=ARUoZaH4LuumptQP48ah2Ac&oq=langchian+hugg&gs_lp=Egxnd3Mtd2l6LXNlcnAiDmxhbmdjaGlhbiBodWdnKgIIADIHEAAYgAQYCjIHEAAYgAQYCjIHEAAYgAQYCjIHEAAYgAQYCjIHEAAYgAQYCjIHEAAYgAQYCjIHEAAYgAQYCjIHEAAYgAQYCjIHEAAYgAQYCjIHEAAYgAQYCkjeHlC5Cli5D3ABeAGQAQCYAV6gAb4CqgEBNLgBAcgBAPgBAcICChAAGEcY1gQYsAPiAwQYACBBiAYBkAYI&sclient=gws-wiz-serp) on them before storing them into our vector database. \n",
"\n",
"In general, you should use larger chuck sizes for highly structured text such as code and smaller size for less structured text. You may need to experiment with different chunk sizes and overlap values to find out the best numbers.\n",
"\n",
"We then use `RetrievalQA` to retrieve the documents from the vector database and give the model more context on Llama 3, thereby increasing its knowledge.\n",
"\n",
"For each question, LangChain performs a semantic similarity search of it in the vector db, then passes the search results as the context to Llama to answer the question."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00e3f72b",
"metadata": {},
"outputs": [],
"source": [
"# use LangChain's RetrievalQA, to associate Llama 3 with the loaded documents stored in the vector db\n",
"from langchain.chains import RetrievalQA\n",
"\n",
"qa_chain = RetrievalQA.from_chain_type(\n",
" llm,\n",
" retriever=vectorstore.as_retriever()\n",
")\n",
"\n",
"question = \"What's new with Llama 3?\"\n",
"result = qa_chain({\"query\": question})\n",
"print(result['result'])"
]
},
{
"cell_type": "markdown",
"id": "7e63769a",
"metadata": {},
"source": [
"Now, lets bring it all together by incorporating follow up questions.\n",
"\n",
"First we ask a follow up questions without giving the model context of the previous conversation. \n",
"Without this context, the answer we get does not relate to our original question."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53f27473",
"metadata": {},
"outputs": [],
"source": [
"# no context passed so Llama 3 doesn't have enough context to answer so it lets its imagination go wild\n",
"result = qa_chain({\"query\": \"Based on what architecture?\"})\n",
"print(result['result'])"
]
},
{
"cell_type": "markdown",
"id": "833221c0",
"metadata": {},
"source": [
"As we did before, let us use the `ConversationalRetrievalChain` package to give the model context of our previous question so we can add follow up questions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "743644a1",
"metadata": {},
"outputs": [],
"source": [
"# use ConversationalRetrievalChain to pass chat history for follow up questions\n",
"**Note:** If results can get cut off, you can set \"max_new_tokens\" in the Replicate call above to a larger number (like shown below) to avoid the cut off.\n",