Same models, but DeepSeek made them up to 85% faster!

DeepSeek's new DSpark speeds up DeepSeek-V4 by 60 to 85% per user with no drop in answer quality. No new model, no new GPUs, just a smarter way to write the words.

DeepSeek shipped DSpark on June 27 and open-sourced it under MIT. No new model, no new GPUs required.

If you have ever watched an AI agent grind through a long task, you know the wait. The model writes one word, then the next, then the next, and you sit there staring at a spinner. On June 27, 2026, DeepSeek shipped something that cuts that wait hard. It is called DSpark, and it makes DeepSeek-V4 generate 60 to 85% faster for each user with no drop in quality. Same model, same answers, less waiting. They released the whole thing under an MIT license.

Why the wait happens at all

Modern models write text one word at a time (strictly, one token, which is a word or a piece of a word). To pick each new word, the model looks back at every word already written to see how they relate. For one sentence that is nothing. For a report that runs to tens of thousands of words, the amount of looking back is enormous.

Here is the surprising part: the slow bit is not the math. A modern GPU chews through the model's billions of parameters in a quick burst. The slow bit is memory. For every single word, the chip has to fetch the saved relationships for all the earlier words (the key-value cache), and while it fetches, it sits idle. The chip is fast. It spends most of its time waiting on memory.
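A back-of-envelope sketch makes the imbalance concrete. Every number in it is an illustrative assumption (parameter count, cache size, bandwidth, compute rate), not a DeepSeek-V4 or GPU specification. It only shows how the two costs are estimated.

```python
# Back-of-envelope: is one decoding step limited by math or by memory?
# All numbers are illustrative assumptions, not DeepSeek-V4 or GPU specs.

def decode_step_times(bytes_read, flops, mem_bandwidth, peak_flops):
    """Return (memory_seconds, compute_seconds) for generating one token."""
    return bytes_read / mem_bandwidth, flops / peak_flops

active_params = 10e9      # assumed parameters touched per token
bytes_per_param = 1       # assumed 8-bit weights
kv_cache_bytes = 5e9      # assumed saved context for a long prompt

bytes_read = active_params * bytes_per_param + kv_cache_bytes
flops = 2 * active_params  # rough rule: about 2 FLOPs per parameter per token

mem_s, compute_s = decode_step_times(
    bytes_read,
    flops,
    mem_bandwidth=3e12,   # assumed 3 TB/s of memory bandwidth
    peak_flops=1e15,      # assumed 1 PFLOP/s of low-precision compute
)

print(f"memory:  {mem_s * 1e3:.2f} ms per token")
print(f"compute: {compute_s * 1e3:.2f} ms per token")
print(f"memory takes {mem_s / compute_s:.0f}x longer than the math")
```

Under these assumed numbers the fetch costs about 5 ms per word and the math about 0.02 ms, which is the gap speculative decoding tries to fill with useful work.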

The old fix: let a fast intern guess ahead

The standard trick for this is speculative decoding. Instead of the big model slowly writing every word, a small fast model drafts several words ahead, and the big model checks the whole draft in a single parallel pass.

Picture a slow, expensive senior who signs off on everything one word at a time, and a fast intern who drafts a chunk for the senior to review. The senior accepts the words that match what they would have written, then pulls out a red pen at the first wrong word and rejects it and everything after it. The intern redrafts from there.

Because the big model always has the final say, the output is identical to what it would have written on its own. The speed is free. The quality is untouched. That is why speculative decoding is called lossless.
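Here is a minimal sketch of that loop in its simplest greedy form, where a drafted word is accepted only if it is exactly the word the big model would have picked. The toy "models" are hypothetical stand-ins that follow a fixed sentence. Real systems verify all drafted positions in one parallel pass and use a probabilistic acceptance rule when sampling.

```python
from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]


def speculative_round(target: NextToken, draft: NextToken,
                      prefix: List[Token], k: int) -> List[Token]:
    """Draft k tokens, keep the part the target agrees with, add one target token."""
    # The intern drafts k tokens ahead.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # The senior reviews. A real system scores all k positions in one
    # parallel pass; this loop checks them one by one for readability.
    accepted: List[Token] = []
    for token in drafted:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # red pen: the target's own token wins
            return accepted
        accepted.append(token)

    # Every drafted token matched, so the same pass yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             n_tokens: int, k: int = 4) -> List[Token]:
    out: List[Token] = []
    while len(out) < n_tokens:
        out.extend(speculative_round(target, draft, out, k))
    return out[:n_tokens]


# Hypothetical toy models: the target recites a fixed sentence,
# and the drafter gets every fourth word wrong.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()

def toy_target(ctx: List[Token]) -> Token:
    return SENTENCE[len(ctx) % len(SENTENCE)]

def toy_draft(ctx: List[Token]) -> Token:
    return "???" if len(ctx) % 4 == 3 else toy_target(ctx)

plain: List[Token] = []
for _ in range(len(SENTENCE)):
    plain.append(toy_target(plain))

spec = generate(toy_target, toy_draft, len(SENTENCE))
print(" ".join(spec))
print("identical to plain decoding:", spec == plain)
```

The drafter's mistakes never reach the output, because each round ends with a word chosen by the target itself.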

Why that fix kept breaking

The catch is the intern. You get two kinds, and they fail in opposite ways.

Type of drafter | How it writes | The problem
Autoregressive | One word at a time, so each word sees the last | Accurate, but slow
Parallel | A whole block of words at once | Fast, but the tail of the draft turns to mush
Careful and slow, or fast and wrong. Nobody had a clean way out.

That second failure has a name: suffix decay. A parallel drafter guesses every word in the block at once, so each later word is chosen without knowing which earlier words were actually picked. The first few words come out fine and the later ones fall apart. Trying to agree with you, it might write 'of course' correctly, or it might blend 'of course' with 'no problem' and produce 'of problem'.
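A toy calculation shows why picking each position on its own scrambles phrases. It assumes just two equally likely replies, 'of course' and 'no problem' (the second is an illustrative assumption), and a drafter that picks each position independently from its own word odds.

```python
from itertools import product

# Assumed toy world: two equally likely two-word replies.
valid_replies = {("of", "course"): 0.5, ("no", "problem"): 0.5}

# Per-position word odds, which is all an independent parallel guess can use.
marginals = [{}, {}]
for phrase, prob in valid_replies.items():
    for position, word in enumerate(phrase):
        marginals[position][word] = marginals[position].get(word, 0.0) + prob

# Probability of every two-word combination when positions are picked independently.
joint = {}
for first, second in product(marginals[0].items(), marginals[1].items()):
    joint[(first[0], second[0])] = first[1] * second[1]

scrambled = sum(p for phrase, p in joint.items() if phrase not in valid_replies)

for phrase, p in sorted(joint.items()):
    label = "ok" if phrase in valid_replies else "scrambled"
    print(" ".join(phrase), p, label)
print("chance of a scrambled phrase:", scrambled)
```

In this toy setup half of the drafts are mixtures such as 'of problem', even though every individual word was a reasonable guess for its position.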

What DSpark actually changed

DSpark starts with the fast parallel drafter and bolts a tiny corrector onto it, called a Markov head. A Markov process only looks at the current state to predict the next one and ignores everything before it. The Markov head does exactly that: it looks at the word the drafter just produced and nudges the next word toward what usually follows. Once it writes 'of', it biases the next word toward 'course'. That is what fixes suffix decay.
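The idea can be sketched with a toy example. The scores, the bias table, and the choice to add the bias to the drafter's scores are all assumptions for illustration. The article does not say how DSpark combines the two.

```python
def greedy_parallel(position_scores):
    """Pick the top-scoring word at each position independently."""
    return [max(scores, key=scores.get) for scores in position_scores]


def with_markov_head(position_scores, follow_bias):
    """Pick words left to right, nudging each pick by the previous pick only."""
    chosen = []
    prev = None
    for scores in position_scores:
        bias = follow_bias.get(prev, {})  # depends on the last word, nothing earlier
        adjusted = {word: s + bias.get(word, 0.0) for word, s in scores.items()}
        prev = max(adjusted, key=adjusted.get)
        chosen.append(prev)
    return chosen


# Hypothetical scores from a parallel drafter for a two-word reply.
position_scores = [
    {"of": 1.0, "no": 0.9},
    {"course": 1.0, "problem": 1.1},
]
# Hypothetical "what usually follows" bias table.
follow_bias = {"of": {"course": 0.5}, "no": {"problem": 0.5}}

naive = greedy_parallel(position_scores)
corrected = with_markov_head(position_scores, follow_bias)
print("parallel only:   ", " ".join(naive))
print("with Markov head:", " ".join(corrected))
```

The sequential step is a table lookup and an addition per word, which is why it can stay cheap compared with running a full drafter layer again.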

The obvious worry is that adding a sequential step drags the speed back down. It does not, because DeepSeek shrinks the head with a technique called low-rank factorization, at rank 256. Growing the draft from 4 words to 16 adds only 0.2 to 1.3% latency per round. In their tests a two-layer DSpark drafter beats a five-layer pure parallel one on every benchmark.
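A quick count shows what low-rank factorization buys. The sketch assumes the head behaves like a table with one entry per (previous word, next word) pair and assumes a 128,000-word vocabulary. Only the rank of 256 comes from the article.

```python
vocab_size = 128_000  # assumed vocabulary size, for illustration only
rank = 256            # rank reported for the Markov head

# A full table stores one number for every (previous word, next word) pair.
full_table = vocab_size * vocab_size

# Low-rank version: two thin matrices, vocab x rank and rank x vocab.
low_rank = vocab_size * rank + rank * vocab_size

print(f"full table: {full_table:,} numbers")
print(f"rank {rank}:   {low_rank:,} numbers")
print(f"reduction:  {full_table / low_rank:.0f}x smaller")
```

With these assumed sizes the factored head is 250 times smaller than the full table, so there is far less to read from memory on each step.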

Stopping the server from wasting work

A fast drafter is only half the job. In a real data center, thousands of users share one big model, and it can only verify so many words at once. A long draft full of bad guesses burns that shared capacity and makes everyone else wait in the queue.
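A rough estimate shows how quickly long drafts turn into wasted work. The sketch makes a simplifying assumption: each drafted word survives the check with the same independent probability. With suffix decay the real odds fall toward the end of the draft, so this is optimistic.

```python
def expected_accepted(accept_prob: float, draft_len: int) -> float:
    """Expected number of drafted words accepted before the first rejection."""
    # Word i is kept only if words 1..i all pass, which has probability accept_prob ** i.
    return sum(accept_prob ** i for i in range(1, draft_len + 1))


for accept_prob in (0.5, 0.9):
    for draft_len in (4, 16):
        kept = expected_accepted(accept_prob, draft_len)
        wasted = draft_len - kept
        print(f"p={accept_prob} draft={draft_len:2d} "
              f"kept={kept:5.2f} wasted={wasted:5.2f}")
```

At a 50% per-word acceptance rate, about one word of a 16-word draft survives on average, and the other 15 verification slots are capacity another user could have had.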

So DSpark adds a confidence head. For every word it drafts, it also outputs a score from 0 to 1 for how likely that word is to survive the big model's check. The moment the score drops below a threshold, the drafter stops and sends only the part it trusts. On chat prompts that lifted the acceptance rate from 45.7% to 95.7%. On math reasoning it went from 76.9% to 92.5%.
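The cut-off rule itself is simple. In this minimal sketch the 0.5 threshold and the example scores are hypothetical, since the article does not give the value DSpark uses.

```python
from typing import List, Sequence


def trim_draft(tokens: Sequence[str], confidences: Sequence[float],
               threshold: float = 0.5) -> List[str]:
    """Keep drafted tokens up to, but not including, the first low-confidence one."""
    kept: List[str] = []
    for token, confidence in zip(tokens, confidences):
        if confidence < threshold:
            break  # everything after a doubtful word is dropped as well
        kept.append(token)
    return kept


draft = ["of", "course", "I", "can", "help"]
scores = [0.97, 0.93, 0.41, 0.88, 0.90]  # hypothetical confidence-head outputs
print(trim_draft(draft, scores))
```

Words after the first doubtful one are dropped even when their own score is high, because the big model rejects everything after the first mismatch anyway.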

The draft length also flexes with the work. A math or coding answer is predictable, so it drafts long. An open-ended story could go a hundred ways, so it cuts early. It even watches GPU load in real time: during quiet hours it drafts longer for faster replies, and during busy hours it drafts shorter to keep the whole system from buckling.
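One way to picture the load-aware part is a draft budget that shrinks as the GPU fills up. The linear rule below is an assumption for illustration and is not DSpark's actual policy. The 4 and 16 word limits echo the draft lengths mentioned earlier.

```python
def draft_budget(gpu_load: float, min_len: int = 4, max_len: int = 16) -> int:
    """Map GPU load in [0, 1] to a maximum draft length (hypothetical linear rule)."""
    load = min(max(gpu_load, 0.0), 1.0)  # clamp out-of-range readings
    return round(max_len - load * (max_len - min_len))


for load in (0.0, 0.5, 1.0):
    print(f"load {load:.1f} -> draft up to {draft_budget(load)} words")
```

A real scheduler would also account for how predictable the current text is, which the confidence scores already capture.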

Guess fast, keep it honest, verify in one pass.

The numbers, and the honest catch

Per user, DSpark runs 60 to 85% faster on the V4-Flash model and 57 to 78% faster on V4-Pro, measured against DeepSeek's own previous system. Against Eagle3, the best drafter before it, DSpark accepts drafts about 30% longer on the open Qwen3 models.

The eye-catching number is total throughput under a hard rule. If you insist that every user gets at least 120 tokens per second, the old system falls over after a handful of users. DSpark, by knowing when to draft long and when to cut short, served up to 661% more total output at that target, and 406% more on V4-Pro at a target of 50 tokens per second.

Read those big numbers with care. The 661% is a best case, measured under strict speed targets DeepSeek chose, and no independent lab has reproduced it yet. The paper also ships as a PDF inside the code repo rather than on arXiv. Treat 661% as the ceiling and the 60 to 85% per-user figure as the number you are more likely to actually feel. Even discounted it is a real gain, and unusually, the code and the weights are both public, so you can time it yourself.

Why a build studio cares

We build agents and apps that make a lot of model calls, and latency is the tax on every one of them. A long research run, a code agent, a chat that streams its answer: each is bottlenecked by the exact memory-bound, one-word-at-a-time problem DSpark goes after. Faster words at the same quality means cheaper agents and interfaces that feel instant instead of laggy.

The part we keep coming back to is that DeepSeek pulled this off while short on resources, with far fewer people than the big labs and without the top Nvidia chips. The constraints forced the clever engineering, and then they gave it away. The DeepSpec toolkit that ships with it trains these drafters on open models like Qwen3 and Gemma, so the idea is not locked to DeepSeek's own model.

DSpark already ships inside DeepSeek-V4. Next step: read the writeup on MarkTechPost, skim the paper and code in the DeepSpec repo under its MIT license, or pull the DeepSeek-V4-Pro-DSpark model from Hugging Face and time it against your current setup. If token latency is hurting something you are building and you want a hand, write to us at hello@gattyworks.com.
