# Recommended Models

ServerAssistantAI supports a wide range of language models from different providers, including premium and free options, allowing server owners to choose the model that best fits their needs. Here's what we recommend, listed from **most recommended to least**.

## Embedding Models

Embedding models are used to convert text data into numerical representations called embeddings. These embeddings capture the semantic meaning and relationships between different pieces of text. When the `documents/` directory is updated, the content is sent to the embedding API. The resulting embeddings are saved to the `cache/` directory, allowing the AI to find relevant context efficiently without reprocessing or making new API requests for each query.

<table data-full-width="false"><thead><tr><th width="141">Provider</th><th width="223">Embedding Model</th><th width="100">Pricing</th><th>Description</th></tr></thead><tbody><tr><td><a href="https://aistudio.google.com/app/apikey">Google-AIStudio</a></td><td><a data-footnote-ref href="#user-content-fn-1">text-embedding-004</a></td><td>Free &#x26; Paid</td><td>Google's state-of-the-art embeddings for semantic search and classification tasks.</td></tr><tr><td><a href="https://platform.openai.com/api-keys">OpenAI</a></td><td><a data-footnote-ref href="#user-content-fn-2">text-embedding-3-large</a></td><td>Paid</td><td>OpenAI's large embedding model for text similarity and retrieval tasks.</td></tr><tr><td><a href="https://github.com/settings/personal-access-tokens">Github-Models</a></td><td><a data-footnote-ref href="#user-content-fn-3">openai/text-embedding-3-large</a></td><td>Free &#x26; Paid</td><td>OpenAI's large embedding model for text similarity and retrieval tasks.</td></tr><tr><td><a href="https://dashboard.cohere.ai/api-keys">Cohere</a></td><td><a data-footnote-ref href="#user-content-fn-4">embed-english-v3.0</a></td><td>Free &#x26; Paid</td><td>English embedding model great for noisy data, enabling better retrievals for RAG.</td></tr><tr><td><a href="https://dashboard.cohere.ai/api-keys">Cohere</a></td><td><a data-footnote-ref href="#user-content-fn-5">embed-multilingual-v3.0</a></td><td>Free &#x26; Paid</td><td>Multilingual embedding model great for noisy data, enabling better retrievals for RAG.</td></tr><tr><td><a href="https://platform.openai.com/api-keys">OpenAI</a></td><td><a data-footnote-ref href="#user-content-fn-2">text-embedding-3-small</a></td><td>Paid</td><td>OpenAI's small embedding model for text similarity and retrieval tasks.</td></tr></tbody></table>

{% hint style="info" %}
ServerAssistantAI only sends information to the embedding API when changes are made to the `documents/` directory. If no changes are detected, the plugin will use the previously cached embeddings to reduce API calls.
{% endhint %}
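
In practice, this caching flow amounts to content-hash invalidation: fingerprint each document, and only call the embedding API when the fingerprint is new. The sketch below is a hedged illustration of that idea, not ServerAssistantAI's actual code; the `embed_fn` callback and the cache file layout are assumptions.

```python
import hashlib
import json
from pathlib import Path

DOCUMENTS_DIR = Path("documents")  # mirrors the plugin's documents/ folder
CACHE_DIR = Path("cache")          # mirrors the plugin's cache/ folder

def embed_document(path: Path, embed_fn) -> list:
    """Return the embedding for one document, hitting the API only on change.

    The cache key is a hash of the file's content, so an unchanged file
    always resolves to an existing cache entry and costs zero API calls.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.json"
    if cache_file.exists():                                # no change detected
        return json.loads(cache_file.read_text())          # reuse cached vectors
    vectors = embed_fn(path.read_text(encoding="utf-8"))   # one API request
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file.write_text(json.dumps(vectors))
    return vectors
```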

## Large Language Models

Large Language Models (LLMs) are powerful AI models that can understand and generate human-like text based on the input they receive. In ServerAssistantAI, when a user asks a question, the system retrieves relevant cached context from the embedding API results. This context, along with the user's question, is sent to the LLM to generate accurate and context-aware responses.
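
As a rough sketch, this retrieve-then-generate step can be pictured as ranking the cached chunks by cosine similarity against the question and prepending the best matches to the prompt. Everything below (the `embed_fn`/`chat_fn` wrappers, the system prompt wording, `top_k`) is a hedged illustration under assumed names, not the plugin's actual implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def answer(question, embed_fn, chat_fn, cache, top_k=3):
    """Retrieve the most relevant cached chunks, then ask the LLM.

    `cache` maps chunk text -> embedding vector (loaded from cache/),
    so only the question itself needs a fresh embedding call.
    """
    q_vec = embed_fn(question)
    ranked = sorted(cache, key=lambda chunk: cosine(q_vec, cache[chunk]), reverse=True)
    context = "\n\n".join(ranked[:top_k])  # best-matching documentation chunks
    return chat_fn([
        {"role": "system", "content": "Answer using only the provided server documentation."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ])
```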

<table><thead><tr><th width="142">Provider</th><th width="232">Large Language Model</th><th width="98">Pricing</th><th>Description</th></tr></thead><tbody><tr><td><a href="https://platform.openai.com/api-keys">OpenAI</a></td><td><a data-footnote-ref href="#user-content-fn-6">gpt-4o</a></td><td>Paid</td><td>OpenAI's most advanced model. Same intelligence as GPT-4 Turbo but 2x faster and 50% cheaper.</td></tr><tr><td><a href="https://github.com/settings/personal-access-tokens">Github-Models</a></td><td><a data-footnote-ref href="#user-content-fn-7">openai/gpt-4o</a></td><td>Free &#x26; Paid</td><td>OpenAI's most advanced model. Same intelligence as GPT-4 Turbo but 2x faster and 50% cheaper.</td></tr><tr><td><a href="https://console.anthropic.com/settings/keys">Anthropic</a></td><td><a data-footnote-ref href="#user-content-fn-8">claude-opus-4.1-20250805</a></td><td>Paid</td><td>Anthropic's upgraded model with enhanced agentic tasks, real-world coding, and reasoning capabilities.</td></tr><tr><td><a href="https://cloud.cerebras.ai/">CerebrasAI</a></td><td><a data-footnote-ref href="#user-content-fn-9">qwen-3-235b-a22b-instruct-2507</a></td><td>Free &#x26; Paid</td><td>Alibaba’s frontier-scale model served on Cerebras. Delivers state-of-the-art non-reasoning performance with increadible speed.</td></tr><tr><td><a href="https://console.anthropic.com/settings/keys">Anthropic</a></td><td><a data-footnote-ref href="#user-content-fn-8">claude-opus-4-20250514</a></td><td>Paid</td><td>Anthropic's most intelligent model with hybrid reasoning and 200K context window, excelling at coding and working autonomously for hours.</td></tr><tr><td><a href="https://aistudio.google.com/app/apikey">Google-AIStudio</a></td><td><a data-footnote-ref href="#user-content-fn-10">gemini-2.5-flash</a></td><td>Free &#x26; Paid</td><td>Google's fast thinking model optimized for speed and cost-efficiency with controllable thinking budgets for high-volume tasks.</td></tr><tr><td><a href="https://platform.openai.com/api-keys">OpenAI</a></td><td><a data-footnote-ref href="#user-content-fn-6">gpt-4.1</a></td><td>Paid</td><td>OpenAI's enhanced model with major coding improvements and 1 million token context window.</td></tr><tr><td><a href="https://github.com/settings/personal-access-tokens">Github-Models</a></td><td><a data-footnote-ref href="#user-content-fn-7">openai/gpt-4.1</a></td><td>Free &#x26; Paid</td><td>OpenAI's enhanced model with major coding improvements and 1 million token context window.</td></tr><tr><td><a href="https://console.groq.com/keys">Groq</a></td><td><a data-footnote-ref href="#user-content-fn-11">deepseek-r1-distill-llama-70b</a></td><td>Free &#x26; Paid</td><td>DeepSeek's efficient 70B <strong>reasoning</strong> model distilled from larger models.</td></tr><tr><td><a href="https://console.anthropic.com/settings/keys">Anthropic</a></td><td><a data-footnote-ref href="#user-content-fn-8">claude-sonnet-4-20250514</a></td><td>Paid</td><td>Anthropic's high-performance model with 200K context window, achieving 72.7% on SWE-bench with superior instruction following.</td></tr><tr><td><a href="https://platform.openai.com/api-keys">OpenAI</a></td><td><a data-footnote-ref href="#user-content-fn-6">gpt-4.1-mini</a></td><td>Paid</td><td>OpenAI's efficient model competitive with GPT-4o but 50% faster and 83% cheaper with 1 million token context window.</td></tr><tr><td><a href="https://github.com/settings/personal-access-tokens">Github-Models</a></td><td><a data-footnote-ref 
href="#user-content-fn-12">openai/gpt-4.1-mini</a></td><td>Free &#x26; Paid</td><td>OpenAI's efficient model competitive with GPT-4o but 50% faster and 83% cheaper with 1 million token context window.</td></tr><tr><td><a href="https://aistudio.google.com/app/apikey">Google-AIStudio</a></td><td><a data-footnote-ref href="#user-content-fn-13">gemini-2.0-flash-001</a></td><td>Free &#x26; Paid</td><td>Google's next-generation flash model with native tool use, superior speed, and 1M token context window for multimodal applications.</td></tr><tr><td><a href="https://aistudio.google.com/app/apikey">Google-AIStudio</a></td><td><a data-footnote-ref href="#user-content-fn-14">gemma-3-27b-it</a></td><td>Free</td><td>Google's open-source 27B parameter model optimized for instruction following and conversational AI with efficient performance.</td></tr><tr><td><a href="https://aistudio.google.com/app/apikey">Google-AIStudio</a></td><td><a data-footnote-ref href="#user-content-fn-15">gemini-1.5-pro</a></td><td>Free &#x26; Paid</td><td>Gemini 1.5 Pro offers long-context understanding, with a context window of up to 1 million tokens.</td></tr><tr><td><a href="https://console.groq.com/keys">Groq</a></td><td><a data-footnote-ref href="#user-content-fn-16">qwen/qwen3-32b</a></td><td>Free &#x26; Paid</td><td>Alibaba's hybrid model switching between thinking and fast modes, supporting 119 languages and competing with top proprietary models.</td></tr><tr><td><a href="https://dashboard.cohere.ai/api-keys">Cohere</a></td><td><a data-footnote-ref href="#user-content-fn-17">command-a-03-2025</a></td><td>Free &#x26; Paid</td><td>Cohere's most performant 111B model with 256K context, delivering 150% higher throughput.</td></tr><tr><td><a href="https://console.anthropic.com/settings/keys">Anthropic</a></td><td><a data-footnote-ref href="#user-content-fn-18">claude-3-sonnet-20240307</a></td><td>Paid</td><td>Claude 3.5 Sonnet delivers enhanced intelligence and speed, ideal for advanced tasks.</td></tr><tr><td><a href="https://platform.openai.com/api-keys">OpenAI</a></td><td><a data-footnote-ref href="#user-content-fn-19">gpt-4o-mini</a></td><td>Paid</td><td>OpenAI's most cost-efficient small model that’s smarter and cheaper than mostly all other paid models.</td></tr><tr><td><a href="https://github.com/settings/personal-access-tokens">Github-Models</a></td><td><a data-footnote-ref href="#user-content-fn-12">openai/gpt-4o-mini</a></td><td>Free &#x26; Paid</td><td>OpenAI's most cost-efficient small model that’s smarter and cheaper than mostly all other paid models.</td></tr><tr><td><a href="https://console.groq.com/keys">Groq</a></td><td><a data-footnote-ref href="#user-content-fn-11">llama-3.1-70b-versatile</a></td><td>Free &#x26; Paid</td><td>Meta's Llama 3.1 70B instruction-tuned model outperforms all open-source chat models and even closed-source models and using Groq, the inference speed is 250+ Tokens per second!</td></tr><tr><td><a href="https://platform.openai.com/api-keys">OpenAI</a></td><td><a data-footnote-ref href="#user-content-fn-19">gpt-4-turbo</a></td><td>Paid</td><td>OpenAI's GPT-4-Turbo model with 128K context, newer knowledge and more powerful than GPT-4.</td></tr><tr><td><a href="https://aistudio.google.com/app/apikey">Google-AIStudio</a></td><td><a data-footnote-ref href="#user-content-fn-20">gemini-1.5-flash</a></td><td>Free &#x26; Paid</td><td>Gemini 1.5 Flash has higher rate limits than Gemini 1.5 Pro and is Google's fastest, most cost-efficient model.</td></tr><tr><td><a 
href="https://console.anthropic.com/settings/keys">Anthropic</a></td><td><a data-footnote-ref href="#user-content-fn-21">claude-3-opus-20240307</a></td><td>Paid</td><td>Anthropic's most intelligent model with great performance on highly complex tasks.</td></tr><tr><td><a href="https://console.groq.com/keys">Groq</a></td><td><a data-footnote-ref href="#user-content-fn-11">llama3-70b-8192</a></td><td>Free &#x26; Paid</td><td>Meta's Llama 3 70B instruction-tuned model which outperforms almost all open-source chat models and using Groq, the inference speed is 300+ Tokens per second!</td></tr><tr><td><a href="https://console.anthropic.com/settings/keys">Anthropic</a></td><td><a data-footnote-ref href="#user-content-fn-22">claude-3-haiku-20240307</a></td><td>Paid</td><td>Claude 3 Haiku is Anthropic's fastest model for near-instant responsiveness.</td></tr><tr><td><a href="https://huggingface.co/settings/tokens">HuggingFace</a></td><td>meta-llama/Meta-Llama-3-8B-Instruct</td><td>Free</td><td>Meta's Llama 3 8B instruction-tuned model and outperforming many open-source chat models.</td></tr><tr><td><a href="https://huggingface.co/settings/tokens">HuggingFace</a></td><td>01-ai/Yi-1.5-34B-Chat</td><td>Free</td><td>01.AI's Yi-1.5 model, offering strong performance in coding, math, reasoning, and instruction-following.</td></tr><tr><td><a href="https://huggingface.co/settings/tokens">HuggingFace</a></td><td>mistralai/Mixtral-8x7B-Instruct-v0.1</td><td>Free</td><td>Mistral AI's 8x7B instruction-following model, version 0.1.</td></tr><tr><td><a href="https://platform.openai.com/api-keys">OpenAI</a></td><td><a data-footnote-ref href="#user-content-fn-23">gpt-3.5-turbo</a></td><td>Paid</td><td>OpenAI's GPT-3.5 model with a larger context size of 16K tokens.</td></tr></tbody></table>

{% hint style="warning" %}
Please note that each model may have specific requirements or considerations for optimal performance. When selecting a model, consider factors such as:

* System prompt configuration: Different models may require adjustments to the system prompt to achieve the best results. Check the documentation provided by the model's creators for tips on prompt engineering and configuration; a minimal comparison sketch follows this hint.
* Free open-source LLMs have a higher chance of hallucinating compared to paid models.
{% endhint %}
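
When trying out a candidate model, it can help to compare a few system prompt variants side by side before settling on one. The harness below is only a sketch, reusing the same hypothetical `chat_fn` wrapper assumed in the earlier examples:

```python
def compare_prompts(chat_fn, question, variants):
    """Print a candidate model's reply under each system prompt variant,
    making it easier to pick the best-behaved wording for your config."""
    for prompt in variants:
        reply = chat_fn([
            {"role": "system", "content": prompt},
            {"role": "user", "content": question},
        ])
        print(f"--- {prompt!r}\n{reply}\n")

# Example: check whether a stricter prompt reduces off-topic answers.
# compare_prompts(chat_fn, "How do I claim land?", [
#     "You are a helpful Minecraft server assistant.",
#     "Answer only from the server documentation; say you don't know otherwise.",
# ])
```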

[^1]: **Free** Rate Limits:

    * 2 Requests Per Minute
    * 32,000 Tokens Per Minute
    * 50 Requests Per Day

    **Paid** Rate Limits:

    * 360 Requests Per Minute
    * 2,000,000 Tokens Per Minute
    * 10,000 Requests Per Day

[^2]: **Tier 1** Rate Limits:

    * 3,000 Requests Per Minute
    * 1,000,000 Tokens Per Minute

[^3]: **Free Rate Limits (Low Tier):**

    * 15 Requests Per Minute
    * 150 Requests Per Day
    * 64,000 Input Tokens Per Request
    * 5 Concurrent Requests

    **Paid Rate Limits:**

    * Production-grade limits (see Azure documentation)

[^4]: **Free** Rate Limits:

    * 5 Requests Per Minute
    * 1,000 Total Requests Per Month
    * 100,000 Total Tokens Per Minute

    **Paid** Rate Limits:

    * 10,000 Requests Per Minute

[^5]: **Free** Rate Limits:

    * 5 Requests Per Minute
    * 1,000 Total Requests Per Month
    * 100,000 Total Tokens Per Minute

    **Paid** Rate Limits:

    * 10,000 Requests Per Minute

[^6]: **Tier 1** Rate Limits:

    * 500 Requests Per Minute
    * 30,000 Tokens Per Minute
    * No Daily Limits

    **Tier 2** Rate Limits:

    * 5,000 Requests Per Minute
    * 450,000 Tokens Per Minute
    * No Daily Limits

[^7]: **Free Rate Limits (High Tier):**

    * 10 Requests Per Minute
    * 50 Requests Per Day
    * 8,000 Input Tokens Per Request
    * 4,000 Output Tokens Per Request

    **Paid Rate Limits:**

    * Production-grade limits (see Azure documentation)

[^8]: **Tier 1** Rate Limits:

    * 50 Requests Per Minute
    * 30,000 Input Tokens Per Minute
    * 8,000 Output Tokens Per Minute

    **Tier 2** Rate Limits:

    * 1,000 Requests Per Minute
    * 450,000 Input Tokens Per Minute
    * 90,000 Output Tokens Per Minute

[^9]: **Free** Rate Limits:

    * 30 Requests Per Minute
    * 60,000 Tokens Per Minute
    * 14,400 Requests Per Day

    **Paid** Rate Limits:

    * Exploration Tier (see CerebrasAI documentation)

[^10]: **Free** Rate Limits:

    * 10 Requests Per Minute
    * 250,000 Tokens Per Minute
    * 250 Requests Per Day

    **Tier 1** Rate Limits:

    * 1,000 Requests Per Minute
    * 1,000,000 Tokens Per Minute
    * 10,000 Requests Per Day

    **Tier 2** Rate Limits:

    * 2,000 Requests Per Minute
    * 3,000,000 Tokens Per Minute
    * 100,000 Requests Per Day

[^11]: **Free** Rate Limits:

    * 30 Requests Per Minute
    * 6,000 Tokens Per Minute
    * 1,000 Requests Per Day

    **Paid** Rate Limits:

    * 1,000 Requests Per Minute
    * 300,000 Tokens Per Minute
    * 500,000 Requests Per Day

[^12]: **Free Rate Limits (Low Tier):**

    * 15 Requests Per Minute
    * 150 Requests Per Day
    * 8,000 Input Tokens Per Request
    * 4,000 Output Tokens Per Request

    **Paid Rate Limits:**

    * Production-grade limits (see Azure documentation)

[^13]: **Free** Rate Limits:

    * 15 Requests Per Minute
    * 1,000,000 Tokens Per Minute
    * 200 Requests Per Day

    **Tier 1** Rate Limits:

    * 2,000 Requests Per Minute
    * 4,000,000 Tokens Per Minute
    * No daily limit

    **Tier 2** Rate Limits:

    * 10,000 Requests Per Minute
    * 10,000,000 Tokens Per Minute
    * No daily limit

[^14]: **Free** Rate Limits:

    * 30 Requests Per Minute
    * 15,000 Tokens Per Minute
    * 14,400 Requests Per Day

[^15]: **Free** Rate Limits:

    * 2 Requests Per Minute
    * 32,000 Tokens Per Minute
    * 50 Requests Per Day

    **Paid** Rate Limits:

    * 360 Requests Per Minute
    * 2,000,000 Tokens Per Minute
    * 10,000 Requests Per Day

[^16]: **Free** Rate Limits:

    * 60 Requests Per Minute
    * 6,000 Tokens Per Minute
    * 1,000 Requests Per Day

    **Paid** Rate Limits:

    * 1,000 Requests Per Minute
    * 300,000 Tokens Per Minute
    * 500,000 Requests Per Day

[^17]: **Free** Rate Limits:

    * 20 Requests Per Minute
    * 1,000 Total Requests Per Month

    **Paid** Rate Limits:

    * 500 Requests Per Minute

[^18]: **Tier 1** Rate Limits:

    * 50 Requests Per Minute
    * 40,000 Tokens Per Minute
    * 1,000,000 Tokens Per Day

[^19]: **Tier 1** Rate Limits:

    * 500 Requests Per Minute
    * 30,000 Tokens Per Minute

[^20]: **Free** Rate Limits:

    * 15 Requests Per Minute
    * 1,000,000 Tokens Per Minute
    * 1,500 Requests Per Day

    **Paid** Rate Limits:

    * 1,000 Requests Per Minute
    * 2,000,000 Tokens Per Minute

[^21]: **Tier 1** Rate Limits:

    * 50 Requests Per Minute
    * 20,000 Tokens Per Minute
    * 1,000,000 Tokens Per Day

[^22]: **Tier 1** Rate Limits:

    * 50 Requests Per Minute
    * 50,000 Tokens Per Minute
    * 5,000,000 Tokens Per Day

[^23]: **Tier 1** Rate Limits:

    * 3,500 Requests Per Minute
    * 60,000 Tokens Per Minute
    * 10,000 Requests Per Day
