Self-hosted AI inference API — 128K context, function calling, reasoning
SWE-Bench 80.2% · Multi-SWE-Bench 51.3%

| Endpoint | https://gpu-workspace.taile8dc37.ts.net/minimax/v1 |
|---|---|
| Auth | Authorization: Bearer YOUR_API_KEY |
| Model | minimax-m2.5 |
```bash
curl https://gpu-workspace.taile8dc37.ts.net/minimax/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.5",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
Standard OpenAI chat completions endpoint. Supports streaming (`"stream": true`) and function calling.

`GET /models` — list available models.

`GET /health` — health check; returns 200 when the server is ready.
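Since the endpoint is OpenAI-compatible, function calling uses the standard `tools` schema. A minimal client-side sketch with a hypothetical `get_weather` tool (the tool, its schema, and the `dispatch` helper below are illustrative, not part of this API):

```python
import json

# Hypothetical tool definition in the OpenAI-compatible `tools` schema,
# passed as the `tools` parameter of chat.completions.create(...).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Run the local function named in a returned tool call (illustrative stub)."""
    if tool_call["function"]["name"] == "get_weather":
        args = json.loads(tool_call["function"]["arguments"])
        # Stub result; in practice this string is sent back as a "tool" message.
        return f"Sunny in {args['city']}"
    raise ValueError("unknown tool")

# Shape of one entry in response.choices[0].message.tool_calls:
example = {"function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'}}
print(dispatch(example))  # Sunny in Berlin
```

The model decides when to emit a tool call; your code executes it locally and returns the result in a follow-up message.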
| Model ID | Context | Description |
|---|---|---|
| minimax-m2.5 | 128K | Recommended |
| MiniMaxAI/MiniMax-M2.5 | 128K | Full name alias |
| Input | $0.30 / 1M tokens |
|---|---|
| Output | $1.20 / 1M tokens |
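The per-token rates above translate directly into a cost estimate. A small sketch (the `estimate_cost` helper name is ours):

```python
# Published rates from the pricing table, expressed per token.
INPUT_RATE = 0.30 / 1_000_000   # USD per input token
OUTPUT_RATE = 1.20 / 1_000_000  # USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 100K-token prompt with a 10K-token completion:
print(f"${estimate_cost(100_000, 10_000):.3f}")  # $0.042
```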
| Max concurrent | 16 requests |
|---|---|
| Max context | 131,072 tokens (128K) |
| Timeout | 600 seconds |
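To stay inside the 16-request concurrency cap and the 600-second timeout, a client can gate its requests with a semaphore. A sketch assuming async request coroutines (the `run_bounded` helper is ours, not part of the API):

```python
import asyncio

MAX_CONCURRENT = 16    # server-side limit from the table above
REQUEST_TIMEOUT = 600  # seconds, matching the server timeout

async def run_bounded(factories):
    """Run coroutine factories without exceeding the concurrency cap."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def one(factory):
        async with sem:  # at most MAX_CONCURRENT requests in flight
            return await asyncio.wait_for(factory(), timeout=REQUEST_TIMEOUT)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(f) for f in factories))
```

Each factory would wrap one API call (e.g. via the async OpenAI client); exceeding the cap otherwise risks queuing or rejected requests server-side.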
MiniMax-M2.5 includes chain-of-thought reasoning in `<think>` blocks. The API separates reasoning into the `reasoning_content` field when available.
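When a client receives the reasoning inline rather than in the separate field, the `<think>` blocks can be stripped out manually. A fallback sketch (the `split_reasoning` helper name is ours):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split <think>…</think> reasoning out of a raw completion string.

    Returns (reasoning, content). Fallback for clients that see the
    reasoning inline instead of in the reasoning_content field.
    """
    reasoning = "\n".join(re.findall(r"<think>(.*?)</think>", text, re.DOTALL))
    content = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return reasoning.strip(), content.strip()
```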
```bash
export ANTHROPIC_BASE_URL="https://gpu-workspace.taile8dc37.ts.net/minimax/v1"
export ANTHROPIC_API_KEY="YOUR_API_KEY"
claude --model minimax-m2.5
```
```bash
export OPENAI_BASE_URL="https://gpu-workspace.taile8dc37.ts.net/minimax/v1"
export OPENAI_API_KEY="YOUR_API_KEY"
codex --model minimax-m2.5 "your prompt"
```
```bash
aider --openai-api-base https://gpu-workspace.taile8dc37.ts.net/minimax/v1 \
      --openai-api-key YOUR_API_KEY \
      --model openai/minimax-m2.5
```
Add to `~/.continue/config.json`:

```json
{
  "models": [{
    "title": "MiniMax-M2.5",
    "provider": "openai",
    "model": "minimax-m2.5",
    "apiBase": "https://gpu-workspace.taile8dc37.ts.net/minimax/v1",
    "apiKey": "YOUR_API_KEY"
  }]
}
```
In Cline settings: API Provider → OpenAI Compatible, Base URL → `https://gpu-workspace.taile8dc37.ts.net/minimax/v1`, Model ID → `minimax-m2.5`
| Base URL | https://gpu-workspace.taile8dc37.ts.net/minimax/v1 |
|---|---|
| API Key | Your API key |
| Model | minimax-m2.5 |
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gpu-workspace.taile8dc37.ts.net/minimax/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
```python
stream = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[{"role": "user", "content": "Write a Redis cache decorator."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
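Streamed deltas may carry reasoning as well as visible text. A hedged accumulator sketch, assuming deltas expose an optional `reasoning_content` attribute mirroring the non-streaming field (the `accumulate` helper is ours):

```python
def accumulate(deltas):
    """Collect streamed delta objects into (reasoning, content) strings.

    Assumes each delta exposes optional .reasoning_content and .content
    attributes; absent or None values are skipped.
    """
    reasoning, content = [], []
    for d in deltas:
        if getattr(d, "reasoning_content", None):
            reasoning.append(d.reasoning_content)
        if getattr(d, "content", None):
            content.append(d.content)
    return "".join(reasoning), "".join(content)
```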
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://gpu-workspace.taile8dc37.ts.net/minimax/v1",
  apiKey: "YOUR_API_KEY",
});

const response = await client.chat.completions.create({
  model: "minimax-m2.5",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
```
Ollama-style CLI for managing the server:
```bash
pip install -e .       # from the MiniMax-M2.5 repo
minimax run            # interactive chat with streaming
minimax ps             # server status + GPU usage
minimax serve          # start vLLM + LiteLLM
minimax stop           # stop all servers
minimax tui            # admin TUI (key management)
minimax auth login     # store API key
minimax setup claude   # configure Claude Code
```
Running on 8× NVIDIA H100 80GB with vLLM (tensor parallel + expert parallel).