Scarica Large Language Models e più Dispense in PDF di Tecniche Di Intelligenza Artificiale solo su Docsity!
Large
Language
Models
Index
- Large Language Models
- Ollama
- OpenAI
- Claude
- Gemini
- User Interface
- AI Agents
- Tools
- Multi-modality
- HuggingFace
- Tokenizers
- Quantization
- Comparison
- Retrival Augmented Generation
- Vector embedding
- LangChain
- Create embeddings with Chroma
- Build a RAG pipeline
- Training
- Fine-tuning an OpenAI LLM
- Fine-tuning an HuggingFace LLM
- LoRA
- QLoRA
- Agentic AI
- Thanks
Panorama of LLM models
LLMs are advanced AI systems trained on massive datasets to understand and generate human-like text. They excel in tasks such as:
- Synthesizing Information: LLMs can combine insights from multiple sources to provide concise summaries, explanations, or comparisons. Example: Summarizing complex topics or extracting key points from large documents.
- Fleshing Out Skeletons: They help expand outlines or ideas into detailed content. Example: Developing blog posts, reports, or structured essays from an initial framework.
- Coding: LLMs assist in writing and debugging code, generating boilerplates, or even explaining algorithms. Example: Writing Python scripts, SQL queries, or REST API integration snippets. Limitations:
- Specialized Domains: LLMs may struggle with niche topics where training data is sparse or highly technical (e.g., quantum physics, niche medical diagnoses).
- Recent Events: Models trained on static datasets may lack knowledge of very recent developments unless explicitly updated.
- Mistakes: LLMs can confidently provide incorrect or misleading information (e.g. fabricating facts or errors in reasoning). Always verify outputs. Main Models on the Market:
- GPT (OpenAI): Renowned for its advanced capabilities in natural language understanding and generation (e.g. GPT-4, GPT-3.5).
- Claude (Anthropic): Focused on safety and alignment, designed for human-centric applications.
- Gemini (Google DeepMind): Combines text and multimodal capabilities with Google's vast infrastructure and search knowledge.
- Llama (Meta): Open-access LLMs tailored for researchers and developers (e.g., Llama 2).
- Perplexity : Specialized in search and conversational tasks, offering succinct and fact-based outputs.
LLMs under the hood
LLMs are powerful tools built on advancements in natural language processing (NLP) and deep learning, particularly the Transformer architecture introduced in the groundbreaking paper "Attention Is All You Need" (Vaswani et al. 2017). The Evolution of LLMs: LLMs have evolved significantly since the introduction of the Transformer architecture. Attention Is All You Need (2017) introduced the Transformer architecture, which revolutionized NLP by replacing recurrent neural networks (RNNs) and long short-term memory (LSTM) models with a structure based on self-attention mechanisms.
- Self-Attention: Allows the model to focus on relevant parts of an input sequence regardless of its length.
- Parallelization: Enabled faster training compared to sequential RNNs.
- Became the foundation for subsequent LLMs. LLMs workflow:
- Tokenization: Text is converted into smaller units called tokens (e.g. words, subwords, or characters). Example: Input: "The quick brown fox" Tokens: ["The", "quick", "brown", "fox"] OpenAI provide a good tool to see how tokenization is done: https://platform.openai.com/tokenizer
LLM evolution
Prompt Engineering: Guide LLM behavior via input instructions. Effective task- specific LLM outputs. Custom GPTs: Adapt GPTs for domain-specific applications. Models tailored to fields like law, medicine. Copilots Assist: users in workflows with intelligent tools. Workflow-specific assistance, e.g. coding or writing. Agentization: Create autonomous systems for complex tasks. Agents capable of planning, reasoning, and executing actions. GPT Models: OpenAI built upon the Transformer to create the Generative Pre- trained Transformer (GPT) series:
- GPT-1 (2018): Introduced pretraining on large text corpora followed by fine- tuning for specific tasks. Showcased the power of unsupervised pretraining for transfer learning.
- GPT-2 (2019): Scaled up the model size (1.5 billion parameters). Demonstrated impressive zero-shot learning abilities: generating coherent text from a prompt without task-specific fine-tuning. Initially not released fully due to concerns about misuse.
- GPT-3 (2020): Drastically increased parameters (175 billion). Exhibited few- shot and zero-shot learning capabilities, enabling task performance with minimal examples. Highlighted the potential for prompt engineering to guide the model's output without retraining.
- RLHF in ChatGPT (2022): Reinforcement Learning with Human Feedback (RLHF): Refined GPT-3.5 and GPT-4 to align model behavior with user expectations. Used human evaluators to fine-tune the model for generating safer, more aligned, and user-friendly responses.
Before going on
When working with Data Science models, you could be carrying out 2 very different activities:
- Training: when you provide a model with data for it to adapt to get better at a task in the future. It does this by updating its internal settings - the parameters or weights of the model. If you're Training a model that's already had some training, the activity is called "fine-tuning".
- Inference: when you are working with a model that has already been trained. You are using that model to produce new outputs on new inputs, taking advantage of everything it learned while it was being trained. Inference is also sometimes referred to as "Execution" or "Running a model". Let's see some models and how to call them in inference!
Ollama
Ollama is a framework that offers a lot of open source Large Language Models in an easy way. Download the model from: https://ollama.com/download and unzip the extracted file. Then, install the Ollama command line tools as suggested and run the LLM with: ollama run <MODEL_NAME> For example: Note that the first time a new model is runned, it takes time to download all the necessary parameters: For example, here I test the Mistral model for the first time:
- Generate Your API Input: Ollama expects structured messages in a JSON format: messages = [ {"role": "system", "content": "system message goes here"}, {"role": "user", "content": "user message goes here"} ]
- Authenticate and Call the API: Use the Python ollama library to call the local API. Authentication isn’t necessary since it runs locally. Here’s how to structure the call. For example: import ollama response = ollama.create( model="llama3.2", messages=messages ) The response content can be seen: print(response["response"])
OpenAI
The OpenAI models are easily accessible via the official site: https://chatgpt.com This guide outlines the core steps required to call OpenAI's API to build intelligent conversational systems or other generative AI applications.
- Generate an API key: Create an OpenAI account if you don't have one by visiting: https://platform.openai.com/ and follow the instructions to create an API key. Once its showed, save it immediately because you will no longer see it. No one, except you have to use this API key. OpenAI asks for a minimum credit to use the API. The API calls will spend against this $5. You can add your credit balance to OpenAI at Settings > Billing: https://platform.openai.com/settings/organization/billing/overview Note: disable the automatic recharge!
- Authentication: The API key authenticates your application with OpenAI's servers. Without it, you cannot access the service. Generate an API key from the OpenAI dashboard and store the key securely inside an .env file:
Claude
The Anthropic Claude models are easily accessible via their official site: https://claude.ai This guide outlines the core steps required to call Anthropic's API to build intelligent conversational systems or other generative AI applications.
- Generate an API Key: To begin, create an account on the Anthropic platform if you don’t already have one: https://console.anthropic.com/ Once registered, follow the instructions to create an API key. Save the key immediately after it's displayed, as you won't be able to view it again. Keep it private and secure—only you should use this API key. Anthropic requires an active billing account to use the API. You can manage your billing information under Account Settings > Billing. Be aware of any usage limits or costs associated with the API.
- Authentication: Your API key serves as a credential to authenticate requests to Anthropic's servers. Without it, you cannot access the service. Store your API key securely in an .env file for easy and safe access in your application: ANTHROPIC_API_KEY=your-api-key-here Use the dotenv library in Python to load the key programmatically: from dotenv import load_dotenv import os load_dotenv() api_key = os.getenv("ANTHROPIC_API_KEY")
- Build the API Input: Anthropic’s API expects input messages to be structured in a conversational format, similar to other APIs. For example: system_message = "This is a system-level instruction."}, user_prompt = [ {"role": "user", "content": "user message goes here."} ]
- Call the API: To interact with Claude, use the anthropic library (or HTTP requests if no SDK is available). Below is an example of querying the model: import anthropic client = anthropic.Client(api_key) result = claude.messages.create( model="claude-3-5-sonnet-20240620", system=system_message, messages=user_prompt ) The response content can be seen: print(message.content[0].text) Alternatively if we want to reproduce the typewriter animation of the response generation in our code, we can call: result = claude.messages.stream( ... ) and see the response as: with result as stream: for text in stream.text_stream: print(text, end="", flush=True)
- Build the API Input: Gemini’s API expects input messages to be structured in a conversational format. For example: system_message = "This is a system-level instruction." user_prompt = [ {"role": "user", "content": "user message goes here."} ]
- Call the API: To interact with Gemini, use the google.generativeai library. Below is an example of querying the model: import google.generativeai as genai gemini = genai.GenerativeModel( model_name='gemini-1.5-flash', system_instruction=system_message ) response = gemini.generate_content(user_prompt) The response can be seen: print(response.text) The typewriter effect can be obtained as: response = gemini.generate_stream(user_prompt) for chunk in response.text_stream: print(chunk, end="", flush=True)
User Interface
User interfaces (UIs) are essential for making AI models accessible to users, and frameworks like Gradio simplify the process of building interactive applications. Gradio is a Python library that allows developers to quickly create and deploy web- based UIs. import gradio as gr
- Basic Chat Interface Let's consider a function that handle the call to an LLM, for example to GPT: def stream_gpt(prompt): messages = [ {"role": "system", "content": system_message}, {"role": "user", "content": prompt} ] stream = openai.ChatCompletion.create( model='gpt-4o-mini', messages=messages, stream=True ) result = "" for chunk in stream: result += chunk.choices[0].delta.content or "" yield result This function build the message in the OpenAI API format. The user prompt is passed as function argument, while a system prompt is given. The GPT model is called and the result is given with a typewriter effect. We can build an interface as follow: view = gr.Interface( fn=stream_gpt, inputs=[gr.Textbox(label="Your message:")], outputs=[gr.Markdown(label="Response:")], allow_flagging="never" ) view.launch() Components:
- Function (fn): Specifies the backend function (stream_gpt) that processes user input and generates a response.
- Inputs: A Textbox is used to accept the user's message. The label "Your message:" describes the purpose of the input field. The content of the text box will be the input of the function component (in this case the user prompt).
- Outputs: A Markdown box displays the model’s response. Markdown allows formatted text, such as bold, italics, and links.
- allow_flagging: This disables the built-in Gradio flagging keeping the UI clean. Launch:
- view.launch() starts a local server where the interface can be accessed through a web browser.
Let's build an interface with a drop-down menu as additional input: view = gr.Interface( fn=stream_model, inputs=[ gr.Textbox(label="Your message:"), gr.Dropdown(["GPT", "Claude"], label="Select model", value="GPT") ], outputs=[gr.Markdown(label="Response:")], flagging_mode="never" ) view.launch() A Dropdown component is added, allowing the user to select between different models (e.g., GPT or Claude). The default value is set to "GPT."
AI Agents
LLM chatbots are remarkably efficient in conversations. For this reason they are perfectly suitable to build an AI assistant. They have to be:
- Friendly
- Mantain context during the conversation
- Subject expertise It's fundamental to well define:
- System prompt
- Context inside our user prompt
- Multi-shot prompting (if the past tokens are obtained from a conversation related to the topic, it's more probable that the future tokens will get the answer to our question)
Keep contest
One of the key feature to build an AI Assistant is to keep the context of the chat. The LLM has to know which exchange of questions and answers has been done during the chat in order to "remember" what has happened. With Gradio's ChatInterface, this process becomes streamlined, especially since recent updates allow Gradio to pass the conversation history in the OpenAI format directly, eliminating additional processing. In conversational AI, context is stored as a series of messages that represent the interaction between the user and the assistant. OpenAI expects this context in the following format: [ {"role": "system", "content": "system message here"}, {"role": "user", "content": "first user prompt here"}, {"role": "assistant", "content": "the assistant's response"}, {"role": "user", "content": "the new user prompt"}, ] The roles define the participant in the conversation:
- System: Provides instructions or sets the model’s behavior.
- User: Represents the user’s input.
- Assistant: Contains the AI’s responses. To ensure that responses are context-aware, we:
- Combine the system message with the conversation history.
- Add the latest user message before sending the request to OpenAI. The history process in Gradio is taken very easily: def chat(message, history): # Combine system message, history, and latest user message messages = [ {"role": "system", "content": system_message} ] + history + [ {"role": "user", "content": message} ]