"Remember to add `--is-embedding` to the command."
"Remember to add `--is-embedding` to the command."
...
@@ -29,30 +29,83 @@
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 7,
"execution_count": 1,
"metadata": {},
"metadata": {},
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"Embedding server is ready. Proceeding with the next steps.\n"
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
"<strong style='color: #00008B;'><br> This cell combines server and notebook output. <br> <br> Typically, the server runs in a separate terminal, <br> but we combine the output of server and notebook to demonstrate the usage better.<br> <br> In our documentation, server output is in gray, notebook output is highlighted.<br> </strong>"
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
"SGLang provides an OpenAI compatible API for smooth transition from OpenAI services.\n",
"SGLang provides an OpenAI-compatible API for a smooth transition from OpenAI services. The full API reference is available at [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n",
"\n",
"This tutorial covers the following popular APIs:\n",
"\n",
"\n",
"- `chat/completions`\n",
"- `chat/completions`\n",
"- `completions`\n",
"- `completions`\n",
...
@@ -27,42 +29,99 @@
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 38,
"execution_count": 1,
"metadata": {},
"metadata": {},
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"Server is ready. Proceeding with the next steps.\n"
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
"<strong style='color: #00008B;'>Response: ChatCompletion(id='692899ebd3ea464dbb456008a7d60bf3', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\\n\\n1. **Country:** Japan\\n**Capital:** Tokyo\\n\\n2. **Country:** Australia\\n**Capital:** Canberra\\n\\n3. **Country:** Brazil\\n**Capital:** Brasília', refusal=None, role='assistant', function_call=None, tool_calls=None), matched_stop=128009)], created=1730106170, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=46, prompt_tokens=49, total_tokens=95, prompt_tokens_details=None))</strong>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
}
],
],
"source": [
"source": [
...
@@ -84,7 +143,8 @@
" temperature=0,\n",
" temperature=0,\n",
" max_tokens=64,\n",
" max_tokens=64,\n",
")\n",
")\n",
"print(response)"
"\n",
"print_highlight(f\"Response: {response}\")"
]
]
},
},
{
{
...
@@ -93,40 +153,35 @@
"source": [
"source": [
"### Parameters\n",
"### Parameters\n",
"\n",
"\n",
"The chat completions API accepts the following parameters (refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details):\n",
"The chat completions API accepts the same parameters as the OpenAI Chat Completions API. Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n",
"\n",
"- `messages`: List of messages in the conversation, each containing `role` and `content`\n",
"- `model`: The model identifier to use for completion\n",
"- `max_tokens`: Maximum number of tokens to generate in the response\n",
"- `temperature`: Controls randomness (0-2). Lower values make output more focused and deterministic\n",
"- `top_p`: Alternative to temperature. Controls diversity via nucleus sampling\n",
"- `n`: Number of chat completion choices to generate\n",
"- `stream`: If true, partial message deltas will be sent as they become available\n",
"- `stop`: Sequences where the API will stop generating further tokens\n",
"- `presence_penalty`: Penalizes new tokens based on their presence in the text so far (-2.0 to 2.0)\n",
"- `frequency_penalty`: Penalizes new tokens based on their frequency in the text so far (-2.0 to 2.0)\n",
"- `logit_bias`: Modify the likelihood of specified tokens appearing in the completion\n",
"- `logprobs`: Include log probabilities of tokens in the response\n",
"- `top_logprobs`: Number of most likely tokens to return probabilities for\n",
"- `seed`: Random seed for deterministic results\n",
"- `response_format`: Specify the format of the response (e.g., JSON)\n",
"- `stream_options`: Additional options for streaming responses\n",
"- `user`: A unique identifier representing your end-user\n",
"\n",
"\n",
"Here is an example of a detailed chat completion request:"
"Here is an example of a detailed chat completion request:"
" frequency_penalty=0.2, # Mild penalty for more natural language\n",
" frequency_penalty=0.2, # Mild penalty for more natural language\n",
" n=1, # Single response is usually more stable\n",
" n=1, # Single response is usually more stable\n",
" seed=42, # Keep for reproducibility\n",
" seed=42, # Keep for reproducibility\n",
" stream=True, # Keep streaming for real-time output\n",
")\n",
")\n",
"\n",
"\n",
"for chunk in response:\n",
"print_highlight(f\"Response: {response}\")"
" print(chunk.choices[0].delta.content or \"\", end=\"\")"
]
]
},
},
{
{
...
@@ -167,20 +220,34 @@
"\n",
"\n",
"### Usage\n",
"### Usage\n",
"\n",
"\n",
"Completions API is similar to Chat Completions API, but without the `messages` parameter. Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details."
"The Completions API is similar to the Chat Completions API, but it takes a `prompt` parameter instead of `messages`."
]
]
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 41,
"execution_count": 4,
"metadata": {},
"metadata": {},
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"Completion(id='a6e07198f4b445baa0fb08a2178ceb59', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. 2. 3.\\n1. United States - Washington D.C. 2. Japan - Tokyo 3. Australia - Canberra\\nList 3 countries and their capitals. 1. 2. 3.\\n1. China - Beijing 2. Brazil - Bras', matched_stop=None)], created=1730012328, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=9, total_tokens=73, prompt_tokens_details=None))\n"
"<strong style='color: #00008B;'>Response: Completion(id='eb486d0a32fd4384baba923f3bc17e8b', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. 2. 3.\\n1. United States - Washington D.C. 2. Japan - Tokyo 3. Australia - Canberra\\nList 3 countries and their capitals. 1. 2. 3.\\n1. China - Beijing 2. Brazil - Bras', matched_stop=None)], created=1730106172, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=9, total_tokens=73, prompt_tokens_details=None))</strong>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
}
],
],
"source": [
"source": [
...
@@ -192,7 +259,8 @@
" n=1,\n",
" n=1,\n",
" stop=None,\n",
" stop=None,\n",
")\n",
")\n",
"print(response)"
"\n",
"print_highlight(f\"Response: {response}\")"
]
]
},
},
{
{
...
@@ -201,44 +269,39 @@
"source": [
"source": [
"### Parameters\n",
"### Parameters\n",
"\n",
"\n",
"The completions API accepts the following parameters:\n",
"The completions API accepts the same parameters as the OpenAI Completions API. Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n",
"\n",
"- `model`: The model identifier to use for completion\n",
"- `prompt`: Input text to generate completions for. Can be a string, array of strings, or token arrays\n",
"- `best_of`: Number of completions to generate server-side and return the best one\n",
"- `echo`: If true, the prompt will be included in the response\n",
"- `frequency_penalty`: Penalizes new tokens based on their frequency in the text so far (-2.0 to 2.0)\n",
"- `logit_bias`: Modify the likelihood of specified tokens appearing in the completion\n",
"- `logprobs`: Include log probabilities of tokens in the response\n",
"- `max_tokens`: Maximum number of tokens to generate in the response (default: 16)\n",
"- `n`: Number of completion choices to generate\n",
"- `presence_penalty`: Penalizes new tokens based on their presence in the text so far (-2.0 to 2.0)\n",
"- `seed`: Random seed for deterministic results\n",
"- `stop`: Sequences where the API will stop generating further tokens\n",
"- `stream`: If true, partial completion deltas will be sent as they become available\n",
"- `stream_options`: Additional options for streaming responses\n",
"- `suffix`: Text to append to the completion\n",
"- `temperature`: Controls randomness (0-2). Lower values make output more focused and deterministic\n",
"- `top_p`: Alternative to temperature. Controls diversity via nucleus sampling\n",
"- `user`: A unique identifier representing your end-user\n",
"\n",
"\n",
"Here is an example of a detailed completions request:"
"Here is an example of a detailed completions request:"
]
]
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": 42,
"execution_count": 5,
"metadata": {},
"metadata": {},
"outputs": [
"outputs": [
{
{
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
" Space explorer, Captain Orion Blackwood, had been traveling through the galaxy for 12 years, searching for a new home for humanity. His ship, the Aurora, had been his home for so long that he barely remembered what it was like to walk on solid ground.\n",
"As he navigated through the dense asteroid field, the ship's computer, S.A.R.A. (Self-Aware Reasoning Algorithm), alerted him to a strange reading on one of the asteroids. Captain Blackwood's curiosity was piqued, and he decided to investigate further.\n",
"<strong style='color: #00008B;'>Response: Completion(id='fb23a12a15bc4137815b91d63b6fd976', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=\" Here is a short story about a space explorer named Astrid.\\nAstrid had always been fascinated by the stars. As a child, she would spend hours gazing up at the night sky, dreaming of what lay beyond our small planet. Now, as a renowned space explorer, she had the chance to explore the cosmos firsthand.\\nAstrid's ship, the Aurora, was equipped with state-of-the-art technology that allowed her to traverse vast distances in a relatively short period of time. She had been traveling for weeks, and finally, she had reached her destination: a distant planet on the edge of the galaxy.\\nAs she entered the planet's atmosphere, Astrid felt a thrill of excitement. She had never seen anything like this before.\", matched_stop=None)], created=1730106175, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=150, prompt_tokens=10, total_tokens=160, prompt_tokens_details=None))</strong>"
"Response: {'status_code': 200, 'request_id': 'request-2', 'body': {'id': 'request-2', 'object': 'chat.completion', 'created': 1730012333, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': '**What is Python?**\\n\\nPython is a high-level, interpreted programming language that is widely used for various purposes, including:\\n\\n* **Web Development**: Building web applications, web services, and web scraping.\\n* **Data Science**: Data analysis'}, 'logprobs': None, 'finish_reason': 'length', 'matched_stop': None}, 'usage': {'prompt_tokens': 39, 'completion_tokens': 50, 'total_tokens': 89}, 'system_fingerprint': None}}\n",
"<strong style='color: #00008B;'>Response: {'status_code': 200, 'request_id': 'request-2', 'body': {'id': 'request-2', 'object': 'chat.completion', 'created': 1730106176, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': '**What is Python?**\\n\\nPython is a high-level, interpreted programming language that is widely used for various purposes, including:\\n\\n1. **Web Development**: Building web applications and web services using frameworks like Django and Flask.\\n2. **Data Analysis and'}, 'logprobs': None, 'finish_reason': 'length', 'matched_stop': None}, 'usage': {'prompt_tokens': 39, 'completion_tokens': 50, 'total_tokens': 89}, 'system_fingerprint': None}}</strong>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<strong style='color: #00008B;'>Cleaning up files...</strong>"
"in your command line and wait for the server to be ready."
"in your command line and wait for the server to be ready."
...
@@ -34,23 +34,65 @@
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"Server is ready. Proceeding with the next steps.\n"
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
"<strong style='color: #00008B;'><br> Server and notebook outputs are combined for clarity.<br> <br> Typically, the server runs in a separate terminal.<br> <br> Server output is gray; notebook output is highlighted.<br> </strong>"
"print(\"Server is ready. Proceeding with the next steps.\")"
]
]
},
},
{
{
...
@@ -71,7 +113,30 @@
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"{\"id\":\"449710eb827c49c99b82ce187e912c2a\",\"object\":\"chat.completion\",\"created\":1729962606,\"model\":\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"LLM stands for Large Language Model. It's a type of artificial intelligence (AI) designed to process and generate human-like language. These models are trained on vast amounts of text data, allowing them to learn patterns, relationships, and context within language.\\n\\nLarge language models use various techniques, such as deep learning and natural language processing, to analyze and understand the input text. They can then use this understanding to generate coherent and context-specific text, such as:\\n\\n1. Responses to questions or prompts\\n2. Summaries of long pieces of text\\n3. Creative writing, like stories or poetry\\n4. Translation of text from one language to another\\n\\nSome popular examples of LLMs include:\\n\\n1. Chatbots: Virtual assistants that can understand and respond to user input\\n2. Virtual assistants: Like Siri, Alexa, or Google Assistant\\n3. Language translation tools: Such as Google Translate\\n4. Writing assistants: Like Grammarly or Language Tool\\n\\nThe key characteristics of LLMs include:\\n\\n1. **Scalability**: They can process large amounts of text data\\n2. **Flexibility**: They can be fine-tuned for specific tasks or domains\\n3. **Contextual understanding**: They can recognize context and nuances in language\\n4. **Creativity**: They can generate original text or responses\\n\\nHowever, LLMs also have limitations and potential drawbacks:\\n\\n1. **Bias**: They can perpetuate existing biases in the training data\\n2. **Misinformation**: They can spread misinformation or false information\\n3. 
**Dependence on data quality**: The quality of the training data directly affects the model's performance\\n\\nOverall, LLMs are powerful tools that can be used in various applications, from language translation and writing assistance to chatbots and virtual assistants.\"},\"logprobs\":null,\"finish_reason\":\"stop\",\"matched_stop\":128009}],\"usage\":{\"prompt_tokens\":47,\"total_tokens\":408,\"completion_tokens\":361,\"prompt_tokens_details\":null}}"
"{\"id\":\"0635a1c4d1d940f597b11482bed9595f\",\"object\":\"chat.completion\",\"created\":1730261683,\"model\":\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"LLM stands for Large Language Model. It's a type of artificial intelligence (AI) designed to process and understand human language. LLMs are trained on vast amounts of text data, allowing them to learn patterns, relationships, and context within language.\\n\\nLarge language models like myself use natural language processing (NLP) and machine learning algorithms to analyze and generate human-like text. This enables us to:\\n\\n1. **Answer questions**: Provide information on a wide range of topics, from general knowledge to specialized domains.\\n2. **Generate text**: Create coherent and contextually relevant text, such as articles, essays, or even entire stories.\\n3. **Translate languages**: Translate text from one language to another, helping to break language barriers.\\n4. **Summarize content**: Condense long pieces of text into shorter, more digestible summaries.\\n5. **Chat and converse**: Engage in natural-sounding conversations, using context and understanding to respond to questions and statements.\\n\\nLarge language models are typically trained on massive datasets, often consisting of billions of parameters and petabytes of text data. This training enables us to learn complex language patterns, nuances, and context, allowing us to provide helpful and informative responses.\\n\\nSome popular examples of large language models include:\\n\\n1. **BERT (Bidirectional Encoder Representations from Transformers)**: Developed by Google, BERT is a foundational model for many language understanding tasks.\\n2. **RoBERTa (Robustly Optimized BERT Pretraining Approach)**: A variant of BERT, developed by Facebook AI, which improved upon the original model's performance.\\n3. 
**Transformers**: A family of models developed by Google, which includes BERT and other related architectures.\\n\\nThese models have revolutionized the field of natural language processing and have many exciting applications in areas like:\\n\\n1. **Virtual assistants**: Like Siri, Alexa, or myself, which can understand and respond to voice commands.\\n2. **Language translation**: Enabling real-time translation of languages.\\n3. **Content generation**: Creating original text, such as articles, stories, or even entire books.\\n4. **Customer service**: Providing 24/7 support and answering common customer queries.\\n\\nI hope this helps you understand what a Large Language Model is and its capabilities!\"},\"logprobs\":null,\"finish_reason\":\"stop\",\"matched_stop\":128009}],\"usage\":{\"prompt_tokens\":47,\"total_tokens\":504,\"completion_tokens\":457,\"prompt_tokens_details\":null}}"
]
]
}
}
],
],
...
@@ -100,8 +165,22 @@
"name": "stdout",
"name": "stdout",
"output_type": "stream",
"output_type": "stream",
"text": [
"text": [
"ChatCompletion(id='6bbf20fed17940739eb5cd5d685fa29a', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\\n\\n1. **Country:** Japan\\n**Capital:** Tokyo\\n\\n2. **Country:** Australia\\n**Capital:** Canberra\\n\\n3. **Country:** Brazil\\n**Capital:** Brasília', refusal=None, role='assistant', function_call=None, tool_calls=None), matched_stop=128009)], created=1729962608, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=46, prompt_tokens=49, total_tokens=95, prompt_tokens_details=None))\n"