Why Most Single-3090 WhisperX + 24B Setups Fail
Hitesh Sondhi · June 3, 2026 · 12 min read
We’ve seen this movie before: someone says they’re fitting WhisperX large-v3 + a 24B model on a single RTX 3090, posts a triumphant screenshot, and leaves out the part where the system falls over the second you add real audio, real context, or real users.
The 3090 is a beast for the money. It’s also a liar if you don’t respect VRAM math.
A lot of “it works on my machine” setups only work because they quietly cap context to the floor, skip diarization, serialize everything so latency gets ugly, or run quantization so aggressive the LLM starts answering like it got hit in the head with a frying pan. We’ve built enough voice AI systems to know the difference between a demo and something you can actually ship.
Here’s the good news: you can run WhisperX large-v3 and a 24B LLM on one 3090.
Here’s the bad news: you don’t get everything.
Key Takeaways
- A single 24GB RTX 3090 can run WhisperX large-v3 and a 24B LLM, but only if you make hard tradeoffs on context length, batching, and concurrency.
- The cleanest path is usually 4-bit quantization for the LLM, strict context capping, and staged execution instead of pretending both models can hog VRAM at once.
- WhisperX is not just Whisper. Word timestamps, alignment, and diarization can turn a “working” setup into an out-of-memory festival.
- If you need low latency plus long context plus diarization plus concurrent sessions, one 3090 is the wrong hill to die on.
- For production, memory stability matters more than bragging rights. A slower system that doesn’t crash beats a “fast” one that explodes at 2 a.m.
The setup that looks fine right up until it catches fire
The most useful write-up we’ve seen on this topic is the reproducible recipe from Sika Mikaniko BG, which shows a practical approach to fitting WhisperX large-v3 and a 24B LLM on a single 3090 by capping context and being disciplined about memory (see Sources).
That’s the key idea.
Not magic. Not “CUDA tweak number 47.” Just accepting reality.
A 3090 gives you 24GB of VRAM. That sounds roomy until you load:
- WhisperX with large-v3
- alignment model overhead
- optional diarization pieces
- a 24B LLM
- KV cache for context
- framework overhead
- whatever fragmentation nonsense PyTorch decides to gift you that day
And then you wonder why the box starts wheezing.
Why “24B on 24GB” is already a compromise
Let’s say the quiet part out loud: a 24B model on a 3090 is almost never running in a luxurious full-precision setup. You’re in quantization country.
Usually 4-bit. Sometimes 5-bit if you’re lucky and the rest of the stack is lean. But if you’re trying to co-locate ASR and an LLM, 4-bit is the practical default. Otherwise you’re basically trying to move a couch through a dog door.
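The arithmetic behind that is short. Here is a rough sketch that counts weights only and assumes a flat bits-per-weight figure. Real quantization formats also store per-block scales, so actual files tend to run somewhat larger than this.

```python
GIB = 1024 ** 3


def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


# Weights only: no KV cache, no activations, no ASR model, no framework overhead.
for bits in (16, 8, 5, 4):
    print(f"24B at {bits:>2}-bit: ~{weight_gib(24, bits):.1f} GiB")
```

At 16-bit the weights alone come to roughly 45 GiB, which is why full precision is off the table. At 4-bit they drop to roughly 11 GiB, which leaves room for everything else on the list above.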
That’s not automatically bad. Quantized 24B models can still be very good. We’ve had solid results with carefully chosen quantized models in AI agent and custom model workflows when the task is narrow and the prompt discipline is tight.
But there’s a cost.
The cost is usually one or more of these:
- shorter usable context
- lower throughput
- more brittle output on complex reasoning
- slower prefill than people expect
- weird quality cliffs when prompt length grows
This is where most blog posts get suspiciously cheerful.
WhisperX is where your VRAM budget goes to die
People say “Whisper” when they mean “speech-to-text.” Fine. But WhisperX is a different operational animal because it’s not just transcription.
You’re often paying for:
- transcription
- forced alignment for word timestamps
- speaker diarization, if enabled
That stack is incredibly useful. It’s also heavier than the average “I ran faster-whisper in a notebook” benchmark implies.
If you only need rough transcript text, you can get away with less. If you need word-level timestamps and speaker-attributed transcripts, now you’re carrying extra luggage. On a single 3090, that luggage matters.
Here’s where it gets weird.
A lot of single-GPU builds “work” because they quietly disable the expensive parts users actually want in production.
What actually works on one 3090
The recipe we agree with is boring in the best way: cap context, quantize the LLM, and avoid overlapping peak memory phases (see Sources).
That means:
1. Treat context length as a budget, not a right
The biggest mistake is assuming that because a model advertises a large context window, you should use it. On a 3090, long context is like ordering dessert after you already maxed out the company card.
KV cache will eat you alive.
If you’re trying to run a 24B LLM alongside WhisperX large-v3, context capping is not a nice optimization. It’s the difference between “stable enough to test” and “CUDA out of memory after the third request.”
Our hot take: most teams should start with a brutally capped context and earn their way upward.
Not the other way around.
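To see why the cap matters, it helps to put a number on the KV cache. The sketch below uses the standard per-token formula (one key and one value vector per layer, per KV head). The layer count, KV head count, and head dimension are illustrative assumptions for a 24B-class model with grouped-query attention and a 16-bit cache, not the spec of any particular checkpoint. Read the real values from your model's config.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=40, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size for one sequence.

    The factor of 2 covers keys and values. All architecture defaults here
    are assumed placeholders; replace them with your model's actual config.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


for ctx in (2048, 4096, 8192, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / GIB:.2f} GiB")
```

Under these assumptions each token costs 160 KiB, so a 32K context is about 5 GiB for a single sequence, before any concurrency. Models without grouped-query attention pay several times more.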
2. Serialize the heavy stages
If your pipeline tries to transcribe audio while the LLM is already sitting on a giant active cache, you’re asking one 3090 to do two deadlifts at once.
Bad plan.
A more reliable pattern is:
- run ASR
- release as much memory as possible
- run LLM inference on the transcript
- only keep what absolutely must stay resident
That may increase end-to-end latency a bit, but it massively improves survivability. And survivability is underrated.
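The pattern above can be sketched as a small stage runner. `load_asr`, `transcribe`, `load_llm`, and `generate` are hypothetical callables you would supply yourself; nothing here is WhisperX's or any LLM runtime's API. The `torch` import is optional and only used to return cached blocks to the driver when PyTorch and CUDA are present.

```python
import gc

try:
    import torch  # optional; only used to release cached GPU memory
except ImportError:
    torch = None


def run_stage(load_fn, run_fn, payload):
    """Load a model, run one stage, then drop the model before returning."""
    model = load_fn()
    try:
        return run_fn(model, payload)
    finally:
        del model       # drop the only reference this function holds
        gc.collect()    # collect reference cycles that may still pin memory
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()  # release PyTorch's cached, unused blocks


def pipeline(audio, load_asr, transcribe, load_llm, generate):
    """ASR first, release, then LLM. The two peaks never overlap."""
    transcript = run_stage(load_asr, transcribe, audio)
    return run_stage(load_llm, generate, transcript)
```

This only frees memory if `run_fn` does not stash its own reference to the model. It also reloads weights on every request, which is where the extra latency comes from, so in practice you would keep the smaller model resident if the budget allows.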
Here’s how the pipeline usually needs to look if you want it to stay upright:
Audio Input → WhisperX Transcription → Alignment / Optional Diarization → Transcript Cleanup → Context Capping → 24B LLM Inference → Response / Summary / Agent Action
If you insist on overlapping stages for “real-time feel,” you’ll need tighter memory controls and usually a smaller LLM or a smaller ASR model. Physics remains annoyingly undefeated.
3. Be honest about diarization
Diarization is one of those features that product people love in demos and infra people fear in production.
Fair enough.
If your use case genuinely needs “Speaker 1 / Speaker 2” with decent segmentation, WhisperX’s diarization path can be worth it. But on a single 3090, it’s often the first thing we’d make optional. For many support, hospitality, and meeting-summary flows, a clean transcript with timestamps is more valuable than fancy speaker labels that tank throughput.
We learned this the hard way on voice-heavy systems: users forgive missing speaker labels faster than they forgive lag.
A practical memory strategy that doesn’t pretend VRAM is infinite
When we’re sizing systems like this, we think in terms of peak overlap, not just model size on paper.
That’s the trap.
A model might “fit” in isolation. Your pipeline doesn’t run in isolation.
Here’s the mental model:
- Static footprint: model weights, tokenizer/runtime overhead
- Dynamic footprint: activations, temporary buffers, batch effects
- Context footprint: KV cache for the LLM
- Pipeline extras: alignment, diarization, preprocessing, postprocessing
If those peaks overlap, your 24GB card stops being a 24GB card and starts being a pumpkin.
Here’s a simple way to picture the tradeoff space:
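A rough numeric version of that mental model, as a sketch. Every number below is a made-up placeholder for illustration, not a measurement of WhisperX or any specific LLM. The point is the difference between summing the peaks and taking the larger one.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    static_gib: float          # weights plus runtime overhead
    dynamic_gib: float         # activations, temporary buffers, batch effects
    context_gib: float = 0.0   # KV cache (LLM only)
    extras_gib: float = 0.0    # alignment, diarization, pre/postprocessing

    @property
    def peak_gib(self) -> float:
        return self.static_gib + self.dynamic_gib + self.context_gib + self.extras_gib


def peak_if_overlapped(stages):
    """Worst case: every stage hits its peak at the same moment."""
    return sum(s.peak_gib for s in stages)


def peak_if_serialized(stages):
    """Best case: each stage fully releases memory before the next starts."""
    return max(s.peak_gib for s in stages)


# Placeholder figures only. Measure your own stack before trusting any of them.
asr = Stage("asr", static_gib=4.0, dynamic_gib=3.0, extras_gib=2.0)
llm = Stage("llm", static_gib=13.0, dynamic_gib=2.0, context_gib=2.5)
CARD_GIB = 24.0

print("overlapped:", peak_if_overlapped([asr, llm]), "GiB of", CARD_GIB)
print("serialized:", peak_if_serialized([asr, llm]), "GiB of", CARD_GIB)
```

With these placeholders the overlapped case needs more than the card has, while the serialized case fits with room to spare. Real pipelines land somewhere in between, because not everything gets released between stages.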

The useful question isn’t “Can I load both?”
It’s “Can I load both, process realistic inputs, and survive repeated requests without fragmentation or OOM?”
Very different question.
Latency: where the dream usually dies
A lot of people are willing to accept some quality loss from quantization. Fewer people are emotionally prepared for the latency consequences.
Speech pipelines already have multiple steps:
- audio ingest
- chunking / preprocessing
- transcription
- alignment
- optional diarization
- transcript cleanup
- LLM prompt assembly
- generation
That’s a lot of moving parts on one GPU.
If you cap context aggressively, you can keep the LLM responsive. But then your downstream reasoning may lose long-call nuance. If you keep more transcript history, the LLM gets slower and more memory-hungry. This is the central tradeoff, and no amount of motivational posting on X changes it.
It’s like trying to pack for a two-week trip with a carry-on. You can bring shoes, or jackets, or camera gear. You can’t bring all three and still pretend the zipper will close.
The tradeoff table nobody wants to print
Here’s the version we’d actually use to make a decision.
| Goal | What to do | What you lose |
|---|---|---|
| Fit WhisperX large-v3 + 24B on one 3090 | 4-bit LLM, strict context cap, serialized stages | Long-context reasoning, concurrency |
| Better transcript quality | Keep WhisperX large-v3 and alignment | More VRAM pressure, slower pipeline |
| Speaker labels | Enable diarization selectively | Throughput, memory headroom |
| Faster LLM responses | Shorten prompt and transcript history | Less conversational memory |
| More stable production behavior | Lower batch sizes, free memory aggressively, avoid overlap | Peak speed, “benchmark glory” |
| Real-time-ish UX | Stream partial transcript, defer heavy LLM tasks | More pipeline complexity |
Our opinion: if your product depends on both long-context reasoning and rich speech metadata, stop forcing it onto one 3090.
That setup is a lab experiment, not a platform.
When a single 3090 is actually a good idea
We don’t want to overcorrect here. One 3090 can be a smart choice if:
- you’re building a prototype that needs local inference
- you’re cost-sensitive and can tolerate serialized workloads
- your transcript-to-LLM handoff is short and structured
- diarization is optional
- concurrency is low
- you value owning the stack over raw throughput
This is especially true for internal tools, single-user desktop workflows, or edge-ish deployments where cloud round trips are worse than local latency. That’s part of why we care so much about on-device AI and products like RunHotel: local systems can be excellent when the workload is designed honestly.
Designed honestly is doing a lot of work there.
When it’s a terrible idea
Here’s the blunt version.
A single 3090 is the wrong answer if you need:
- multiple concurrent users
- long meeting transcripts fed wholesale into the LLM
- diarization on every session
- low tail latency
- agentic workflows with tool calls and retries
- production reliability without babysitting
We’ve seen teams spend weeks shaving 400MB off a memory footprint when the real answer was “buy another GPU” or “use a smaller model.” That’s engineer catnip. It’s also sometimes terrible business.
Hot take: GPU Tetris is fun right up until it delays shipping by a month.
If you’re making a business decision, run the numbers first. Our AI cost estimator is a better starting point than vibes and Reddit optimism.
If you insist on doing it anyway, do this
If you’re determined to fit WhisperX large-v3 and a 24B LLM on one 3090, this is the order we’d recommend:
Start with the transcript path
Get WhisperX stable first with your actual audio lengths, language mix, and timestamp requirements.
Not benchmark clips. Real clips.
Add the LLM second
Load the 24B model in 4-bit and test with a hard context cap. Don’t start by chasing maximum context. Start by finding the point where the system remains stable over repeated runs.
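One way to "find the point where the system remains stable" is to walk context caps upward and stop at the first failure. In this sketch, `is_stable` is a hypothetical callable you provide: it should run one realistic request at the given cap and return False if the run fails, for example on an out-of-memory error.

```python
def find_stable_cap(candidates, is_stable, runs=10):
    """Return the largest context cap that survives `runs` repeated trials.

    Candidates are tried in ascending order and the search stops at the first
    cap that fails, so larger caps are never attempted after a failure.
    Returns None if even the smallest cap is unstable.
    """
    best = None
    for cap in sorted(candidates):
        if all(is_stable(cap) for _ in range(runs)):
            best = cap
        else:
            break
    return best
```

Something like `find_stable_cap([2048, 4096, 8192, 16384], is_stable)` then gives you a cap you have earned rather than one you hoped for. Use your real audio-derived prompts inside `is_stable`, not a toy prompt.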
Measure peak memory, not average memory
Average memory is comforting and mostly useless. Peaks are what kill you.
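A tiny sketch of why the average misleads. The samples are invented for illustration, and the 1 GiB reserve is an assumed safety margin, not a recommendation from the cited recipe. On PyTorch, peak figures like these could come from `torch.cuda.max_memory_allocated()`, reset between runs with `torch.cuda.reset_peak_memory_stats()`. Note that those only track PyTorch's own allocator, not other processes on the card.

```python
from statistics import mean


def memory_report(samples_gib, capacity_gib=24.0, reserve_gib=1.0):
    """Summarize memory samples against a card's usable capacity."""
    usable = capacity_gib - reserve_gib
    peak = max(samples_gib)
    return {
        "average_gib": mean(samples_gib),
        "peak_gib": peak,
        "headroom_gib": usable - peak,   # negative means over budget
        "fits": peak <= usable,
    }


# Invented samples: mostly calm, with one spike during an expensive phase.
samples = [14.0, 15.0, 23.5, 14.5, 15.0]
print(memory_report(samples))
```

The average here looks like a comfortable two-thirds of the card. The peak is what would actually take the process down.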
Make diarization opt-in
If a workflow truly needs it, enable it there. Don’t make every request pay the tax.
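In code, opt-in mostly means the default is off and the stage list is built per request. A minimal sketch, where `RequestOptions` and `plan_stages` are hypothetical names and the stage labels are just strings for your own dispatcher:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestOptions:
    word_timestamps: bool = True
    diarize: bool = False   # off unless the workflow explicitly asks for it


def plan_stages(opts: RequestOptions) -> list[str]:
    """Build the list of speech stages a single request will pay for."""
    stages = ["transcribe"]
    if opts.word_timestamps:
        stages.append("align")
    if opts.diarize:
        stages.append("diarize")
    return stages


print(plan_stages(RequestOptions()))
print(plan_stages(RequestOptions(diarize=True)))
```

The useful side effect is that diarization models only need to be loaded when a request actually lists that stage.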
Build transcript compaction early
Summaries, rolling windows, chunk selection, and prompt compression matter more than people think. A sloppy transcript handoff will punish both memory and latency.
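As one example of compaction, here is a rolling-window sketch: keep an optional running summary, then as many of the most recent transcript segments as fit in a token budget. The token counter is a crude whitespace split, used here only to keep the example self-contained. Swap in your model's real tokenizer before relying on the numbers.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words. An assumption, not a tokenizer."""
    return len(text.split())


def rolling_window(segments, budget_tokens, summary=""):
    """Return the summary (if any) plus the newest segments that fit the budget.

    Segments are walked newest-first and the walk stops at the first one that
    does not fit, so the kept segments are always a contiguous recent tail.
    """
    remaining = budget_tokens - approx_tokens(summary)
    kept = []
    for seg in reversed(segments):
        cost = approx_tokens(seg)
        if cost > remaining:
            break
        kept.append(seg)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return ([summary] if summary else []) + kept
```

How the summary gets produced is a separate problem. The point is that the handoff to the LLM has a hard ceiling that does not grow with call length.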
Test repeated requests
The first request is the demo. The tenth request is the truth.
This is where a lot of “reproducible” setups stop being reproducible.
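A repeated-request check can be as simple as the loop below. `run_once` and `read_peak_gib` are hypothetical hooks you would wire up yourself. On PyTorch, `read_peak_gib` might wrap `torch.cuda.max_memory_allocated()`. The drift tolerance is an arbitrary assumed default.

```python
def soak(run_once, read_peak_gib, requests=10, drift_tolerance_gib=0.25):
    """Run the same request repeatedly and record peak memory after each run.

    Returns (peaks, stable). `stable` is False when the last peak exceeds the
    first by more than the tolerance, which hints at fragmentation or a leak.
    """
    peaks = []
    for _ in range(requests):
        run_once()
        peaks.append(read_peak_gib())
    stable = peaks[-1] - peaks[0] <= drift_tolerance_gib
    return peaks, stable
```

If `stable` comes back False, the setup passed the demo and failed the truth test. Run it with realistic audio lengths, and let it run longer than ten requests before trusting it overnight.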
What we’d recommend to most teams
If your goal is to ship, not to win a forum argument, we’d usually recommend one of these paths:
- Use WhisperX large-v3 + a smaller LLM. Better speech quality, fewer memory gymnastics.
- Use a 24B LLM + lighter ASR path. Good if transcript perfection matters less than downstream reasoning.
- Split workloads across devices or services. Boring architecture. Great results.
- Keep one 3090 for development, not production. This is often the sane middle ground.
And if you’re building customer-facing speech systems, especially with action-taking agents, get help before you architect yourself into a corner. That’s exactly the kind of mess we handle in AI consulting, voice AI, and custom model work.
Most teams eventually end up making the same architecture decision: split the workload across more than one device.

One box is elegant. Two boxes sleep better at night.
The real answer
Yes, you can run WhisperX large-v3 and a 24B LLM on a single 3090. The cited reproducible recipe proves that with the right context capping and memory discipline, it’s possible (see Sources).
But “possible” is not the same as “good.”
That distinction matters.
If you’re building a prototype, a local tool, or a low-concurrency system, this setup can be smart and cost-effective. If you’re building a production voice product with real users and real latency expectations, you need to be ruthless about tradeoffs — or stop romanticizing the single-GPU setup.
Want help sizing the stack before you waste two weeks in VRAM hell? Talk to us at /contact. We like ambitious builds. We just prefer them without the 3 a.m. CUDA panic.
Sources
- Sika Mikaniko BG, “Fitting WhisperX large-v3 & a 24B LLM on one 3090: a reproducible context-capping recipe” — https://dev.to/sikamikanikobg/fitting-whisperx-large-v3-a-24b-llm-on-one-3090-a-reproducible-context-capping-recipe-22g0





