DeepSeek V4 API
The DeepSeek V4 API ships two open-weight models — deepseek-v4-flash
(fast, low-cost) and deepseek-v4-pro (frontier reasoning and agentic
coding) — exposed through api.reapi.ai as a drop-in OpenAI-compatible Chat
Completions endpoint. Both bring a 1M-token context window, 384K max output,
thinking mode on by default, vision input, tool use, and context caching.
Current rates live on the model page and on api.reapi.ai/pricing.
Quick example
curl https://api.reapi.ai/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v4-flash",
"group": "default",
"messages": [
{ "role": "user", "content": "Hello" }
],
"stream": true,
"max_tokens": 4096,
"temperature": 0.7
}'

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.reapi.ai/v1",
)
stream = client.chat.completions.create(
    model="deepseek-v4-flash",  # or "deepseek-v4-pro"
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    max_tokens=4096,
    temperature=0.7,
    # `group` is an api.reapi.ai-specific extension, so it goes in extra_body.
    extra_body={"group": "default"},
)
# Print each streamed text delta as it arrives.
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)

import OpenAI from "openai";
const client = new OpenAI({
apiKey: "YOUR_API_KEY",
baseURL: "https://api.reapi.ai/v1",
});
const stream = await client.chat.completions.create({
model: "deepseek-v4-flash", // or "deepseek-v4-pro"
messages: [{ role: "user", content: "Hello" }],
stream: true,
max_tokens: 4096,
temperature: 0.7,
// `group` is an api.reapi.ai-specific extension, passed as an extra field in the request body.
// @ts-expect-error — not part of the OpenAI types
group: "default",
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Build the JSON request body.
	body, err := json.Marshal(map[string]any{
		"model": "deepseek-v4-flash", // or "deepseek-v4-pro"
		"group": "default",
		"messages": []map[string]string{
			{"role": "user", "content": "Hello"},
		},
		"stream":      true,
		"max_tokens":  4096,
		"temperature": 0.7,
	})
	if err != nil {
		log.Fatal(err)
	}
	req, err := http.NewRequest("POST",
		"https://api.reapi.ai/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer YOUR_API_KEY")
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	// Copy the raw SSE stream to stdout as it arrives.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}

Authentication
Every request needs a Bearer token. The DeepSeek V4 chat workspace lives on
the api.reapi.ai platform — sign in there to create a key and top up tokens.
- Open api.reapi.ai and sign in (or create an account).
- Generate an API key under API Keys.
- Top up tokens under Top Up (pay-as-you-go, billed in USD per 1M tokens — see api.reapi.ai/pricing).
Authorization: Bearer YOUR_API_KEY

The chat surface (api.reapi.ai) is a separate workspace from the
image/video/audio task gateway at reapi.ai/api/v1/*. Keys and balances
do not cross over: a key issued on reapi.ai/settings/apikeys will not
authenticate against api.reapi.ai/v1/chat/completions, and vice versa.
Models
The DeepSeek V4 family ships two variants. Both share the same endpoint,
request shape, 1M context window, and 384K max output — pick the variant
with the model field.
| model | Best for | Architecture |
|---|---|---|
| deepseek-v4-flash | Fast, low-cost everyday work: autocomplete, batch analysis, chat backends. Reasoning closely approaches Pro. | 284B total / 13B active (MoE) |
| deepseek-v4-pro | Frontier reasoning, complex debugging, and agentic coding. Rivals top closed-source models. | 1.6T total / 49B active (MoE) |
The legacy ids deepseek-chat and deepseek-reasoner map to
deepseek-v4-flash in non-thinking and thinking mode respectively. New
integrations should use the explicit deepseek-v4-flash /
deepseek-v4-pro ids.
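
For code that still passes the legacy ids, the mapping above can be captured in a small lookup. This is a minimal sketch: `resolve_model` is a hypothetical helper, not part of any SDK. It only reports the thinking mode each legacy id implies, because this guide does not document a request parameter for toggling it.

```python
# Legacy ids and the explicit id plus mode each one maps to, per the text above.
LEGACY_MODELS = {
    "deepseek-chat": ("deepseek-v4-flash", False),     # non-thinking mode
    "deepseek-reasoner": ("deepseek-v4-flash", True),  # thinking mode
}
CURRENT_MODELS = {"deepseek-v4-flash", "deepseek-v4-pro"}


def resolve_model(model_id):
    """Return (explicit_id, thinking) for a legacy or current model id.

    thinking is None for current ids, since the id alone does not imply a mode.
    """
    if model_id in LEGACY_MODELS:
        return LEGACY_MODELS[model_id]
    if model_id in CURRENT_MODELS:
        return (model_id, None)
    raise ValueError(f"unknown model id: {model_id!r}")
```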
Endpoint
POST https://api.reapi.ai/v1/chat/completions

Drop-in for the OpenAI SDKs: same request shape, same SSE wire format. Set
base_url to https://api.reapi.ai/v1. DeepSeek V4 also supports the
Anthropic API format natively; this guide documents the OpenAI-compatible
Chat Completions surface.
Request body
model — string, required
"deepseek-v4-flash" or "deepseek-v4-pro". Echoed back in the response
envelope.
messages — array, required
Conversation history as an array of message objects. Same shape as the OpenAI Chat Completions spec, plus content-parts for vision:
{
"role": "system" | "user" | "assistant" | "tool",
"content": "string OR content-parts array (text + image_url parts)"
}

Multi-turn history is sent in chronological order; the last message is the
one the model responds to. Do not echo a prior turn's reasoning_content
back into messages; strip it before the next request.
max_tokens — integer, default 4096
Upper bound on output tokens for this response, including the chain-of-thought when thinking mode is on. The synchronous API supports up to 384K output tokens — set it generously for long-form or reasoning-heavy outputs.
stream — boolean, default false
When true, the response is streamed as server-sent events (SSE) with
Content-Type: text/event-stream. Each event is a JSON delta in the OpenAI
format, terminated by a data: [DONE] line.
temperature — number, default 1
Sampling temperature. Lower values produce more deterministic output. Ignored while the model is in thinking mode.
top_p — number, default 1
Nucleus sampling cutoff. Ignored in thinking mode.
frequency_penalty / presence_penalty — number, default 0
Standard OpenAI repetition controls. Ignored in thinking mode.
tools / tool_choice — optional
Standard OpenAI tool-calling parameters. DeepSeek V4 ships dedicated agentic optimizations with reliable function calling and JSON output.
group — string, default "default"
api.reapi.ai-specific extension. Selects a token group on the gateway, which routes the request to a specific upstream channel pool. Omit if default routing is fine.
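
When `tools` is set, the model may answer with tool calls instead of text. The sketch below shows one way to run them and build the follow-up messages. It assumes the response follows the standard OpenAI tool-call shape (`message.tool_calls[].function.name` and a JSON-encoded `arguments` string), which this guide implies but does not show. `get_weather` and `run_tool_calls` are hypothetical names.

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would query a weather service."""
    return {"city": city, "forecast": "unknown"}


# Registry of callable tools, keyed by the name declared in `tools`.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message):
    """Execute each requested tool and return the `tool` messages to append."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            output = {"error": f"unknown tool {fn['name']}"}
        else:
            # Arguments arrive as a JSON-encoded string in the OpenAI format.
            output = handler(**json.loads(fn["arguments"] or "{}"))
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results
```

Append the assistant message and the returned `tool` messages to `messages`, then send the request again to get the final answer.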
Thinking mode
DeepSeek V4 runs in thinking mode by default: before the final answer it
produces a chain of thought, returned in a reasoning_content field at the
same level as content.
{
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "Let me work through this step by step...",
"content": "The final answer."
},
"finish_reason": "stop"
}
]
}

For latency-sensitive or simple calls, switch to non-thinking mode for
faster, cheaper responses. When thinking is on, the sampling parameters
(temperature, top_p, frequency_penalty, presence_penalty) have no
effect.
Strip reasoning_content from assistant messages before sending them back
in a follow-up request — the chain-of-thought from a previous turn is not
meant to be re-fed as input.
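
One way to apply this is to filter the history just before each request. This is a minimal sketch; `strip_reasoning` is a hypothetical helper name.

```python
def strip_reasoning(messages):
    """Return a copy of the history with reasoning_content removed.

    The input list and its message dicts are left unmodified.
    """
    return [
        {key: value for key, value in msg.items() if key != "reasoning_content"}
        for msg in messages
    ]
```

Because it returns new dicts, the full history (including the chain-of-thought) can still be kept locally for logging.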
Vision input (beta)
Send images alongside text via OpenAI content-parts:
{
"model": "deepseek-v4-pro",
"max_tokens": 4096,
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "What does this chart show?" },
{
"type": "image_url",
"image_url": { "url": "https://example.com/chart.png" }
}
]
}
]
}

Each image counts toward the input token budget based on its resolution.
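
The content-parts structure above can be assembled with a small helper. This is a sketch only: `image_message` is a hypothetical name, and sending more than one image in a single message is an assumption this guide does not confirm.

```python
def image_message(text, image_urls):
    """Build a user message with one text part followed by image_url parts."""
    parts = [{"type": "text", "text": text}]
    for url in image_urls:
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": parts}
```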
Context caching
DeepSeek V4 caches stable prompt prefixes automatically. When a request hits
the cache, the cached input tokens bill at a small fraction of the standard
input rate — a big saving for agent loops and chatbots that replay long
system prompts and tool schemas. No configuration is required; reuse the same
prefix across calls and the discount applies. The
usage.prompt_tokens_details.cached_tokens field reports how many input
tokens were served from cache.
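
To check whether a prefix is actually being reused, the cached share of the prompt can be read from `usage`. This is a minimal sketch: `cache_hit_ratio` is a hypothetical helper, and it treats a missing `prompt_tokens_details` object as zero cached tokens.

```python
def cache_hit_ratio(usage):
    """Return the fraction of prompt tokens served from cache (0.0 to 1.0)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens
```

A ratio that stays near zero across repeated calls usually means something early in the prompt changes between requests.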
Response shape
Non-streaming (stream: false)
{
"id": "chatcmpl-018f5a3a1b6e7d9f8c2b4d6e8f0a2c4e",
"object": "chat.completion",
"created": 1735000000,
"model": "deepseek-v4-flash",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 9,
"total_tokens": 21,
"prompt_tokens_details": {
"cached_tokens": 0
}
}
}

When thinking mode is on, message.reasoning_content carries the
chain-of-thought alongside content.
Streaming (stream: true)
Content-Type: text/event-stream. Each data: line is a JSON delta in the
OpenAI chunk format; the final event before [DONE] carries the
finish_reason (stop / length / tool_calls / content_filter).
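
For clients that read the stream without an SDK, the events can be accumulated as in the sketch below. It assumes already-decoded text lines and a single choice per chunk. It also assumes the streamed chain-of-thought arrives as `delta.reasoning_content`, which this guide does not state. `collect_stream` is a hypothetical name.

```python
import json


def collect_stream(lines):
    """Accumulate SSE lines into (content, reasoning, finish_reason)."""
    content, reasoning, finish_reason = [], [], None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            content.append(delta.get("content") or "")
            # Assumption: thinking-mode deltas use the reasoning_content key.
            reasoning.append(delta.get("reasoning_content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(content), "".join(reasoning), finish_reason
```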
Pricing
DeepSeek V4 is billed pay-as-you-go in USD against your api.reapi.ai
token balance. It bills along three dimensions — input tokens (cache miss),
input tokens (cache hit), and output tokens — and deepseek-v4-pro costs
more per token than deepseek-v4-flash. Current rates live on
api.reapi.ai/pricing and in the pricing card at the top of the model page.
Per-call bill:
billable_input = (prompt_tokens - cached_tokens) × input_rate / 1,000,000
cache_read_bill = cached_tokens × cache_hit_rate / 1,000,000
output_bill = completion_tokens × output_rate / 1,000,000
total_bill = billable_input + cache_read_bill + output_bill

Output tokens include the chain-of-thought when thinking mode is on. Failed requests are not charged.
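
The per-call formula translates directly into code. In this sketch the three rates are arguments in USD per 1M tokens; the caller supplies them from the pricing page, and none are hard-coded here. `call_cost` is a hypothetical helper name.

```python
def call_cost(usage, input_rate, cache_hit_rate, output_rate):
    """Return the USD cost of one call from its `usage` object.

    All three rates are USD per 1M tokens.
    """
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    billable_input = (usage["prompt_tokens"] - cached_tokens) * input_rate / 1_000_000
    cache_read_bill = cached_tokens * cache_hit_rate / 1_000_000
    output_bill = usage["completion_tokens"] * output_rate / 1_000_000
    return billable_input + cache_read_bill + output_bill
```

With placeholder rates of 1.00, 0.10, and 2.00 (illustrative only, not actual prices), a call with 1,000,000 prompt tokens, half of them cached, and 100,000 completion tokens costs 0.50 + 0.05 + 0.20 = 0.75 USD.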
Limits
| Limit | Value |
|---|---|
| Context window | 1M tokens |
| Max output per call | 384K tokens |
Streams that hit the output cap finish with finish_reason: "length";
call again with a continuation message if you need more text.
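
A continuation loop might look like the sketch below. `send` is a hypothetical caller-supplied function that posts `messages` to the endpoint and returns the first choice as a dict. The wording of the continuation message is an assumption, not a documented convention.

```python
def complete_with_continuation(send, messages, max_rounds=3):
    """Call `send` until the model stops for a reason other than the output cap."""
    history = list(messages)
    parts = []
    for _ in range(max_rounds):
        choice = send(history)
        text = choice["message"].get("content") or ""
        parts.append(text)
        if choice.get("finish_reason") != "length":
            break
        # Keep only role and content so reasoning_content is not re-fed.
        history.append({"role": "assistant", "content": text})
        # Assumption: a plain user turn is one workable continuation message.
        history.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(parts)
```

The `max_rounds` cap keeps a response that never finishes cleanly from looping indefinitely.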
Errors
The error envelope follows the OpenAI shape — HTTP status, plus a JSON body:
{
"error": {
"message": "...",
"type": "invalid_request_error",
"code": "..."
}
}

Common cases:
| Status | When | Notes |
|---|---|---|
| 400 | Bad request shape, unsupported param combo | Check the messages array and model id |
| 401 | Missing / invalid API key | Re-issue a key at api.reapi.ai |
| 402 | Insufficient balance | Top up at api.reapi.ai |
| 429 | Per-group rate limit hit | Back off, or move to a different group |
| 500 | Upstream / gateway error | Safe to retry; failed calls are not charged |
api.reapi.ai does not internally retry chat requests. Every customer call maps to exactly one upstream POST. If a network error reaches you, that is a one-for-one wire failure and a retry from your side is safe; the gateway will not double-bill.
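
Since the gateway leaves retries to the client, a small wrapper with exponential backoff is one option. This sketch retries only the two statuses the table marks as retryable (429 and 500). `send` is a hypothetical caller-supplied function returning `(status, body)`, and the delay values are illustrative defaults. Network exceptions are not handled here and would need their own `try`/`except`.

```python
import random
import time

# From the table above: 400, 401, and 402 need a fix, not a retry.
RETRYABLE = {429, 500}


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        # Exponential backoff with jitter: about 1s, 2s, 4s, ...
        sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay / 2))
```

The `sleep` parameter exists so the wait can be replaced in tests.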
Recipes
Minimum request
{
"model": "deepseek-v4-flash",
"max_tokens": 4096,
"messages": [
{ "role": "user", "content": "Summarise this in three sentences." }
]
}

Tool use (function calling)
{
"model": "deepseek-v4-pro",
"max_tokens": 4096,
"messages": [
{ "role": "user", "content": "What's the weather in Tokyo today?" }
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Look up the current weather for a city.",
"parameters": {
"type": "object",
"properties": { "city": { "type": "string" } },
"required": ["city"]
}
}
}
],
"tool_choice": "auto"
}

Long-context analysis
{
"model": "deepseek-v4-pro",
"max_tokens": 8192,
"messages": [
{ "role": "system", "content": "<a long, stable reference document>" },
{ "role": "user", "content": "List every mention of constraint X with line numbers." }
]
}

Keep the long reference block stable across calls so the cache-hit rate applies on subsequent requests.
When to pick Flash vs Pro
- deepseek-v4-flash: latency-sensitive, high-throughput, cost-sensitive work, such as in-IDE autocomplete, inline suggestions, CI code review, bulk summarization, and chat backends. Reasoning closely approaches Pro at a fraction of the price.
- deepseek-v4-pro: work where reasoning depth dominates, such as complex debugging, architecture planning, math/STEM, and long-horizon agentic coding.

Both models share one key, so you can route per request.
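
Per-request routing can be as simple as a lookup on a task label. The sketch below is illustrative only: the labels are hypothetical names derived from the examples above, and real routing rules depend on the workload.

```python
# Hypothetical task labels, grouped by the guidance above.
PRO_TASKS = {"complex_debugging", "architecture_planning", "math", "agentic_coding"}
FLASH_TASKS = {"autocomplete", "inline_suggestion", "ci_code_review",
               "bulk_summarization", "chat"}


def pick_model(task, default="deepseek-v4-flash"):
    """Return the model id for a task label, falling back to `default`."""
    if task in PRO_TASKS:
        return "deepseek-v4-pro"
    if task in FLASH_TASKS:
        return "deepseek-v4-flash"
    return default
```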
Tips
- Set max_tokens generously when thinking is on. The chain-of-thought counts toward the output budget; a low cap can truncate before the final answer.
- Strip reasoning_content before the next turn. Re-feeding a prior turn's chain-of-thought as input is not supported.
- Stream by default for chat UX. Streaming cuts perceived latency.
- Cache stable prefixes. Reuse the same system prompt and tool schemas across calls to bill repeated input at the low cache-hit rate.
- Route by difficulty. Send simple, high-volume calls to Flash and reserve Pro for the hardest reasoning, all on one key.