Our Large Language Model as a Service (LLMaaS) offering gives you access to cutting-edge language models, served from SecNumCloud-qualified infrastructure that is HDS-certified for healthcare data hosting: sovereign, and computed entirely in France. Benefit from high performance and optimal security for your AI applications. Your data remains strictly confidential: it is neither used for training nor stored after processing.

Simple, transparent pricing
1.8 €
per million input tokens
8 €
per million output tokens
8 €
per million reasoning tokens
0.01 €
per minute of transcribed audio *
Computed on infrastructure based in France, SecNumCloud-qualified and HDS-certified.
Note on the "Reasoning" price: this rate applies specifically to models classified as "reasoners" or "hybrids" (models with the "Reasoning" capability activated), when reasoning is active and only to the tokens generated by that activity.
* each minute started is billed in full
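To make the pricing concrete, here is a minimal cost-estimation sketch in Python. The rates are those listed above; the request figures (token counts, audio duration) are hypothetical inputs chosen for illustration, and the audio rounding follows the "each minute started" rule.

```python
import math

# Published rates in euros, as listed above.
PRICE_INPUT = 1.8 / 1_000_000      # per input token
PRICE_OUTPUT = 8 / 1_000_000       # per output token
PRICE_REASONING = 8 / 1_000_000    # per reasoning token (reasoner/hybrid models only)
PRICE_AUDIO_MINUTE = 0.01          # per transcribed minute; each started minute counts

def estimate_cost(input_tokens: int, output_tokens: int,
                  reasoning_tokens: int = 0, audio_seconds: float = 0.0) -> float:
    """Estimate the cost in euros of a single request."""
    cost = input_tokens * PRICE_INPUT + output_tokens * PRICE_OUTPUT
    cost += reasoning_tokens * PRICE_REASONING
    cost += math.ceil(audio_seconds / 60) * PRICE_AUDIO_MINUTE  # round up started minutes
    return cost

# Hypothetical request: 12k input tokens, 2k output, 3k reasoning, 90 s of audio.
print(f"{estimate_cost(12_000, 2_000, 3_000, 90):.4f} €")  # -> 0.0816 €
```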

Large models

Our large models offer state-of-the-art performance for the most demanding tasks. They are particularly well-suited to applications requiring a deep understanding of language, complex reasoning or the processing of long documents.

18 tokens/second

glm-4.7:358b

Versatile high-performance model designed by Zhipu AI, excellent for logical reasoning, multilingual comprehension and complex tasks.
Deployed with a context of 120,000 tokens. Ideal for in-depth analysis of long documents and intelligent assistants.
86 tokens/second

qwen3-omni:30b

Qwen3-Omni 30B is a native omnimodal model, capable of understanding text, image, video and audio in a single stream.
It supports multimodal inputs (Audio/Video) and offers advanced reasoning capabilities. Note: Audio output via API is not yet enabled.
104 tokens/second

gpt-oss:120b

OpenAI's state-of-the-art open-weight language model, offering solid performance with a flexible Apache 2.0 licence.
A Mixture-of-Experts (MoE) model with 120 billion parameters and around 5.1 billion active parameters. It offers a configurable reasoning effort and full access to the chain of thought.
29 tokens/second

llama3.3:70b

State-of-the-art multilingual model developed by Meta, designed to excel at natural dialogue, complex reasoning and nuanced understanding of instructions.
Combining remarkable efficiency with reduced computational resources, this model offers extensive multilingual capabilities covering 8 major languages (English, French, German, Spanish, Italian, Portuguese, Hindi and Thai). Its contextual window of 132,000 tokens enables in-depth analysis of complex documents and long conversations, while maintaining exceptional overall consistency. Optimised to minimise bias and problematic responses.
21 tokens/second

gemma3:27b

Google's revolutionary model offers an optimum balance between power and efficiency, with an exceptional performance/cost ratio for demanding professional applications.
With unrivalled hardware efficiency, this model incorporates native multimodal capabilities and excels in multilingual performance in over 140 languages. Its impressive contextual window of 120,000 tokens makes it the ideal choice for analysing very large documents, document research and any application requiring understanding of extended contexts. Its optimised architecture allows flexible deployment without compromising the quality of results.
104 tokens/second

qwen3-coder:30b

MoE model optimised for software engineering tasks with a very long context.
Advanced agentic capabilities for software engineering tasks, native support for a 250K token context, pre-trained on 7.5T tokens with a high code ratio, and optimised with reinforcement learning to improve code execution success rates.
104 tokens/second

qwen3-2507:30b-a3b

Enhanced version of Qwen3-30B's non-thinking mode, with improved general capabilities, knowledge coverage and user alignment.
Significant improvements in following instructions, reasoning, reading comprehension, mathematics, coding and tool use. Native context of 250k tokens.
148 tokens/second

qwen3-next:80b

Qwen3-Next 80B model, optimised for long contexts and reasoning, served via vLLM (A100).
A3B-Instruct variant configured with a context of up to 262k tokens, support for function calling, guided decoding (xgrammar) and speculative decoding (qwen3_next_mtp).
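Because this model is served via vLLM, structured output can typically be requested through vLLM's guided-decoding extension to the OpenAI-compatible API. The sketch below is illustrative only: the base URL is a placeholder, and the `guided_json` field is a vLLM-specific extension, not a documented guarantee of this service's interface.

```python
from openai import OpenAI

# Hypothetical endpoint and key; replace with the values from your account.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

# JSON schema the output must conform to (xgrammar-backed guided decoding in vLLM).
schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["title", "year"],
}

resp = client.chat.completions.create(
    model="qwen3-next:80b",
    messages=[{"role": "user", "content": "Extract the title and year: 'Dune (1965)'"}],
    extra_body={"guided_json": schema},  # vLLM extension, not standard OpenAI
)
print(resp.choices[0].message.content)  # e.g. {"title": "Dune", "year": 1965}
```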
43 tokens/second

qwen3-vl:30b

State-of-the-art multimodal model (Qwen3-VL) offering exceptional visual understanding and accurate temporal reasoning.
This Vision-Language model incorporates major innovations (DeepStack, MRoPE) for detailed analysis of images and videos. It excels at complex OCR, object detection, graph analysis, and spatio-temporal reasoning. Its architecture enables native understanding of video content and accurate structured extraction (JSON).
17 tokens/second

qwen3-vl:32b

High-performance variant of Qwen3-VL, optimised for the most demanding vision tasks.
Offers the same advanced capabilities as the 30B (DeepStack, MRoPE) with increased modelling capacity. Particularly effective for tasks requiring high visual analysis accuracy and deep contextual understanding. Supports text-timestamp alignment for video.
37 tokens/second

olmo3:7b

Reference fully open model, offering total transparency (data, code, weights) and remarkable efficiency.
OLMo 3-7B is a dense model optimised for efficiency (requiring 2.5 times fewer resources than Llama 3.1 8B for comparable performance). It excels particularly in mathematics and programming. With its 65k token window, it is ideal for tasks requiring full auditability.
19 tokens/second

olmo3:32b

The first fully open reasoning model at this scale, rivalling the best proprietary models.
OLMo 3-32B uses advanced architecture (GQA) to offer exceptional reasoning capabilities. It excels on complex benchmarks (MATH, HumanEvalPlus) and is capable of exposing its thought process (Think variant). It is the preferred choice for critical tasks requiring high performance and total transparency.
58 tokens/second

qwen3-2507:235b

Massive MoE model with 235 billion parameters, with only 22 billion active, offering cutting-edge performance.
Ultra-sparse Mixture-of-Experts architecture with 512 experts. Combines the power of a very large model with the efficiency of a smaller model. Excels at mathematics, coding, and logical reasoning.
31 tokens/second

qwen3-vl:235b

The most powerful multimodal model in the catalogue, combining cutting-edge visual understanding with exceptional reasoning capabilities.
This Vision-Language model excels at in-depth analysis of complex documents, multilingual OCR and reasoning about dense visual and textual content. It is designed for critical tasks requiring maximum accuracy and extensive contextual understanding.
31 tokens/second

ministral-3:14b

The most powerful member of the Ministral family, designed for complex tasks on local infrastructure.
Deployed with an extended context of 250k tokens. Excels at complex reasoning and coding while remaining efficient.
68.2 tokens/second

qwen3:14b

Balanced Qwen3 14B model, offering solid overall performance with good inference speed.
Excellent size/performance ratio. Capable of good-level reasoning and coding.
20 tokens/second

cogito:32b

Advanced version of the Cogito model, offering considerably enhanced reasoning and analysis capabilities, designed for the most demanding applications in terms of analytical artificial intelligence.
Designed to excel at complex tasks requiring superior depth of analysis, this model stands out for its ability to break down multidimensional problems and provide structured, well-argued answers. It incorporates advanced logic checking mechanisms to minimise hallucinations.
89 tokens/second

nemotron-3-nano:30b

NVIDIA model optimised for complex reasoning and the use of tools, deployed with an extended context.
Uses Nano V3 architecture. Excels at function calling, structured reasoning and analysis of long contexts.
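As an illustration of the function-calling capability highlighted here, the sketch below declares a tool in the standard OpenAI style. The endpoint URL and the `get_weather` tool are hypothetical, and this assumes the service exposes an OpenAI-compatible chat API.

```python
from openai import OpenAI

# Hypothetical endpoint; adapt to your account settings.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

# Declare a tool the model may call (standard OpenAI-style schema).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nemotron-3-nano:30b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# If the model decides to call the tool, the call appears here instead of text.
print(resp.choices[0].message.tool_calls)
```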

Specialised models

Our specialised models are optimised for specific tasks such as code generation, image analysis or structured data processing. They offer an excellent performance/cost ratio for targeted use cases.

50 tokens/second

ministral-3:3b

Mistral AI's cutting-edge compact model, designed for efficiency in local and edge deployments.
Despite its small size, this model offers surprising performance for conversational tasks and simple reasoning. Ideal for mobile devices.
55 tokens/second

ministral-3:8b

Mid-sized model in the Ministral family, offering an optimal balance between performance and resources.
Version 8B is more robust, capable of handling longer contexts and more complex reasoning, while remaining very fast.
53 tokens/second

gemma3:1b

Gemma 3 micro-model, ultra-fast and efficient.
Perfect for simple tasks, rapid classification or execution on highly constrained devices.
48.0 tokens/second

gemma3:4b

Compact Gemma 3 4B model, offering an excellent performance/size ratio.
Capable of decent reasoning and good language comprehension. A good candidate for more advanced local assistants.

qwen3-embedding:0.6b

Ultra-light Qwen3 embedding model, optimised for speed and efficiency on resource-limited infrastructures.
Offers an excellent compromise between semantic performance and speed of execution.
196.3 tokens/second

granite-embedding:278m

Ultra-compact IBM Granite embedding model, designed for maximum efficiency.
Ideal for semantic search tasks requiring minimal latency.

qwen3-embedding:4b

High-performance Qwen3-4B embedding model, offering deep semantic understanding and an extended context window.
Deployed with a context of 40,000 tokens for processing large documents.
171 tokens/second

bge-m3:567m

State-of-the-art multilingual embedding model (BGE-M3), offering exceptional semantic search capabilities in over 100 languages.
Deployed with a context of 8192 tokens. Supports dense, sparse and multi-vector search methods.
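For embedding models such as BGE-M3, a typical workflow retrieves dense vectors and compares them with cosine similarity. The sketch below assumes an OpenAI-compatible embeddings endpoint with a placeholder base URL; it shows dense retrieval only (sparse and multi-vector modes require model-specific tooling).

```python
from openai import OpenAI

# Hypothetical endpoint; adapt to your account settings.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

resp = client.embeddings.create(
    model="bge-m3:567m",
    input=["Sovereign cloud hosting", "Hébergement cloud souverain"],
)
a, b = (d.embedding for d in resp.data)

# Cosine similarity between the two (cross-lingual) sentences.
dot = sum(x * y for x, y in zip(a, b))
norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
print(f"cosine similarity: {dot / norm:.3f}")
```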
175 tokens/second

embeddinggemma:300m

Google's state-of-the-art embedding model, optimised for its size, ideal for search and semantic retrieval tasks.
Built on Gemma 3, this model produces vector representations of text for classification, clustering and similarity search. Trained on over 100 languages, its small size makes it perfect for resource-constrained environments.
9 tokens/second

gpt-oss:20b

OpenAI's open-weight language model, optimised for efficiency and deployment on consumer hardware.
A Mixture-of-Experts (MoE) model with 21 billion parameters and 3.6 billion active parameters. It offers configurable reasoning effort and agent capabilities.
52 tokens/second

qwen3-2507-think:4b

Qwen3-4B model optimised for reasoning, with improved performance on logic, maths, science and code tasks, and extended context to 250K tokens.
This 'Thinking' version has an increased thought length, making it ideal for highly complex reasoning tasks. It also offers general improvements in following instructions, using tools and generating text.
30 tokens/second

qwen3-2507:4b

Updated version of Qwen3-4B's non-thinking mode, with significant improvements in overall capabilities, extended knowledge coverage and better alignment with user preferences.
Significant improvements in following instructions, logical reasoning, reading comprehension, mathematics, coding and tool use. Native context of 250k tokens.
31 tokens/second

rnj-1:8b

Model 8B "Open Weight" specialising in coding, mathematics and science (STEM).
RNJ-1 is a dense model with 8.3B parameters trained on 8.4T tokens. It uses global attention and YaRN to provide a context of 32k tokens. It excels at code generation (83.5% HumanEval+) and mathematical reasoning, often outperforming much larger models.
64 tokens/second

qwen3-vl:2b

Ultra-compact multimodal Qwen3-VL model, bringing advanced vision capabilities to edge devices.
Despite its small size, this model incorporates Qwen3-VL technologies (MRoPE, DeepStack) to deliver impressive image and video analysis. Ideal for mobile or embedded applications requiring OCR, object detection or rapid visual understanding.
57 tokens/second

qwen3-vl:4b

Balanced Qwen3-VL multimodal model, offering robust vision performance with a small footprint.
Excellent compromise between performance and resources. Capable of analysing complex documents, graphics and videos with high accuracy. Supports structured extraction and visual reasoning.
46 tokens/second

qwen3:0.6b

Ultra-light Qwen3 model with 0.6 billion parameters, offering exceptional inference speed for fast, simple tasks.
Ideal for deployment on lightweight servers or as the first level of processing for complex workflows. Configured with a context of 40,000 tokens.
44 tokens/second

qwen3-vl:8b

Qwen3-VL multimodal model (8B), offering advanced vision performance with a reasonable footprint.
Version 8B of the Qwen3-VL model. Excellent compromise between performance and resources. Capable of analysing complex documents, graphics and video with high accuracy.
44 tokens/second

devstral:24b

Devstral 24B is an agentic LLM specialising in software engineering, co-developed by Mistral AI and All Hands AI.
Devstral excels at using tools to explore code bases, modify multiple files and drive engineering agents. Based on Mistral Small 3, it offers advanced reasoning and coding capabilities. Configured with Mistral-specific optimisers (tokenizer, parser).
23 tokens/second

devstral-small-2:24b

Second iteration of Devstral (Small 2), a cutting-edge agentic model for software engineering, deployed on Mac Studio with a massive context.
Optimised for exploring codebases, multi-file editing and tool use. Delivers code performance close to that of 100B+ models (SWE-bench Verified: 68%). Natively supports vision. Deployed with an extended context of 380k tokens to handle entire projects.
33 tokens/second

granite4-small-h:32b

IBM's MoE (Mixture-of-Experts) model, designed as a "workhorse" for everyday business tasks, with excellent efficiency for long contexts.
This hybrid model (Transformer + Mamba-2) with 32 billion parameters (9B active) is optimised for enterprise workflows such as multi-tool agents and customer support automation. Its innovative architecture reduces RAM usage by more than 70% for long contexts and multiple batches.
58 tokens/second

granite4-tiny-h:7b

IBM's ultra-efficient hybrid MoE model, designed for low latency, edge and local applications, and as a building block for agentic workflows.
This 7 billion parameter (1B active) model combines Transformer and Mamba-2 layers for maximum efficiency. It reduces RAM usage by over 70% for long contexts, making it ideal for resource-constrained devices and fast tasks such as function calling.
79 tokens/second

deepseek-ocr

DeepSeek's specialist OCR model, designed for high-precision text extraction with formatting preservation.
Two-stage OCR system (visual encoder + MoE 3B decoder) optimised for converting documents into structured Markdown (tables, formulas). Requires specific pre-processing (Logits Processor) for optimum performance.
22 tokens/second

medgemma:27b

MedGemma is one of Google's most powerful open models for understanding medical text and images, based on Gemma 3.
MedGemma is suitable for tasks such as generating medical imaging reports or answering natural language questions about medical images. It can be adapted for use cases requiring medical knowledge, such as patient interviewing, triage, clinical decision support and summarisation. Although its baseline performance is solid, MedGemma is not yet clinical-grade and will likely require further fine-tuning. Based on the natively multimodal Gemma 3 architecture, this 27B model incorporates a SigLIP image encoder pre-trained on medical data. It supports a context of 128k tokens and is served in FP16 for maximum precision.
27 tokens/second

mistral-small3.2:24b

Minor update to Mistral Small 3.1, improving instruction following and function-calling robustness, and reducing repetition errors.
This version 3.2 retains the strengths of its predecessor while making targeted improvements. It is better able to follow precise instructions, produces fewer infinite generations or repetitive responses, and its function calling template is more robust. In other respects, its performance is equivalent to or slightly better than version 3.1.

Model comparison

This comparison table will help you choose the model best suited to your needs, based on various criteria such as context size, performance and specific use cases.

Comparative table of the characteristics and performance of the various AI models available, grouped by category (large-scale models and specialist models).
Model Publisher Parameters Context (tokens) Vision Agent Reasoning Security Quick * Energy efficiency *
Large models
glm-4.7:358b Zhipu AI 358B 120000
qwen3-omni:30b Qwen Team 30B 32768
gpt-oss:120b OpenAI 120B 120000
llama3.3:70b Meta 70B 132000
gemma3:27b Google 27B 120000
qwen3-coder:30b Qwen Team 30B 250000
qwen3-2507:30b-a3b Qwen Team 30B 250000
qwen3-next:80b Qwen Team 80B 262144
qwen3-vl:30b Qwen Team 30B 250000
qwen3-vl:32b Qwen Team 32B 250000
olmo3:7b AllenAI 7B 65536
olmo3:32b AllenAI 32B 65536
qwen3-2507:235b Qwen Team 235B 130000
qwen3-vl:235b Qwen Team 235B 200000
ministral-3:14b Mistral AI 14B 250000
qwen3:14b Qwen Team 14B 131072
cogito:32b Deep Cogito 32B 32000
nemotron-3-nano:30b NVIDIA 30B 250000
Specialised models
ministral-3:3b Mistral AI 3B 250000
ministral-3:8b Mistral AI 8B 250000
gemma3:1b Google 1B 120000
gemma3:4b Google 4B 120000
qwen3-embedding:0.6b Qwen Team 0.6B 32768
granite-embedding:278m IBM 278M 8192
qwen3-embedding:4b Qwen Team 4B 40000
bge-m3:567m BAAI 567M 8192
embeddinggemma:300m Google 300M 2048
gpt-oss:20b OpenAI 20B 120000
qwen3-2507-think:4b Qwen Team 4B 250000
qwen3-2507:4b Qwen Team 4B 250000
rnj-1:8b Essential AI 8B 32000
qwen3-vl:2b Qwen Team 2B 250000
qwen3-vl:4b Qwen Team 4B 250000
qwen3:0.6b Qwen Team 0.6B 40000
qwen3-vl:8b Qwen Team 8B 250000
devstral:24b Mistral AI & All Hands AI 24B 120000
devstral-small-2:24b Mistral AI & All Hands AI 24B 380000
granite4-small-h:32b IBM 32B (9B active) 128000
granite4-tiny-h:7b IBM 7B (1B active) 128000
deepseek-ocr DeepSeek AI 3B 8192
medgemma:27b Google 27B 128000
mistral-small3.2:24b Mistral AI 24B 128000
Legend and explanation
✓ Functionality or capability supported by the model
✗ Functionality or capability not supported by the model
* Energy efficiency: indicates particularly low energy consumption (< 2.0 kWh/Mtoken)
* Quick: model capable of generating more than 50 tokens per second
Note on performance measures
The speed values (tokens/s) represent performance targets under real-world conditions. Energy consumption (kWh/Mtoken) is calculated by dividing the estimated power draw of the inference server (in watts) by the measured speed of the model (in tokens per second), then converting to kilowatt-hours per million tokens (a net division by 3.6). This method offers a practical comparison of the energy efficiency of different models and should be read as a relative indicator rather than an absolute measure of power consumption.
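As a worked example of this formula (the server power figure below is an assumed, illustrative value, not a published one):

```python
# kWh per million tokens: watts / (tokens/s) gives joules per token;
# x 1,000,000 tokens, then / 3,600,000 J per kWh -> a net division by 3.6.
def kwh_per_mtoken(server_watts: float, tokens_per_second: float) -> float:
    return server_watts / tokens_per_second / 3.6

# Hypothetical 500 W server running a model at 104 tokens/s:
print(f"{kwh_per_mtoken(500, 104):.2f} kWh/Mtoken")  # -> 1.34, under the 2.0 threshold
```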

Recommended use cases

Here are some common use cases and the most suitable models for each. These recommendations are based on the specific performance and capabilities of each model.

Multilingual dialogue

Chatbots and assistants capable of communicating in several languages, with automatic detection, context maintenance throughout the conversation and understanding of linguistic specificities.
Recommended models
  • Llama 3.3
  • Mistral Small 3.2
  • Qwen 3
  • OpenAI gpt-oss
  • Granite 4

Analysis of long documents

Processing of large documents (>100 pages): maintaining context throughout the text, extracting key information, generating relevant summaries and answering questions about specific parts of the content.
Recommended models
  • Gemma 3
  • Qwen3 Next
  • Qwen 3
  • Granite 4

Programming and development

Generating and optimising code in multiple languages, debugging, refactoring, developing complete functionalities, understanding complex algorithmic implementations and creating unit tests
Recommended models
  • DeepCoder
  • Qwen3 Coder
  • Granite 4
  • Devstral

Visual analysis

Direct processing of images and visual documents without OCR pre-processing, interpretation of technical diagrams, graphs, tables, drawings and photos with generation of detailed textual explanations of the visual content
Recommended models
  • DeepSeek-OCR
  • Mistral Small 3.2
  • Gemma 3
  • Qwen 3 VL

Safety and compliance

Applications requiring specific security capabilities: filtering of sensitive content, traceability of reasoning, GDPR/HDS compliance checks, risk minimisation, vulnerability analysis and compliance with sectoral regulations.
Recommended models
  • Granite Guardian
  • Granite 4
  • Devstral
  • Mistral Small 3.2
  • Magistral Small

Light and on-board deployments

Applications requiring a minimal resource footprint, deployment on capacity-constrained devices, real-time inference on standard CPUs and integration into embedded or IoT systems
Recommended models
  • Gemma 3n
  • Granite 4 tiny
  • Qwen 3 VL (2B)