How to Vet a “Homegrown” LLM Before You Buy It
Mitesh Sondhi · June 15, 2026 · 12 min read
A procurement team asks for a sovereign LLM. A vendor sends back a model card, a benchmark table, and a line that says "trained locally for public sector use."
Your job isn't to admire the claim. Turn it into an API contract.
What are the inputs? Who produced the weights? Which upstream models are inside the artifact? What license terms survive the merge? Can the vendor reproduce the benchmark from a clean checkout? Without files, hashes, and scripts, you're not buying a model. You're buying a story.
Recent discussion around Rio de Janeiro's "homegrown" LLM is a useful warning. A GitHub issue on Nex-N2 alleges that Rio-3.5-Open-397B, presented as a homegrown model by Rio de Janeiro's municipal IT company, appears to be a merge involving an existing model rather than a clean local fine-tune as described publicly (Nex-N2 GitHub issue). We don't need to litigate the whole case here. One engineering lesson is enough: "homegrown" is not a technical property unless someone can prove provenance.
Buyers, especially governments and regulated companies, need that distinction. A merged model can be legitimate. A fine-tune can be legitimate. A locally trained model can be legitimate. Trouble starts when the commercial claim, the license obligations, and the actual weight lineage don't match.
Treat "homegrown" as a claim that needs an interface
In practice, the phrase "homegrown LLM" sounds like it describes where the intelligence came from. Usually, it doesn't.
Sometimes the training data was local. Sometimes the fine-tuning team was local. Deployment might be local. Or the base model is foreign, the adapter is local, and the final weights were merged on a government GPU cluster for three hours.
Those are different products.
As an API engineer, I like to make ambiguous product claims boring. Define the fields, then reject invalid payloads. For an LLM procurement, "homegrown" should be decomposed into fields like:
{
  "base_model": "name, commit, hash, license",
  "training_method": "pretrain | continued_pretrain | sft | dpo | merge | adapter",
  "training_data_origin": "datasets, collection process, exclusions",
  "weight_lineage": "parent models and merge recipe",
  "tokenizer_lineage": "source tokenizer and modifications",
  "evaluation_harness": "repo, commit, prompts, parameters",
  "deployment_boundary": "cloud region, on-prem, edge device",
  "license_obligations": "redistribution, attribution, commercial use"
}
When the seller can't populate that structure, don't move to security review yet. Identity verification is still open.
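To make "reject invalid payloads" concrete, here is a minimal sketch of a validator for that structure. The field names come from the structure above. The rules themselves (every field must be a non-empty string, and `training_method` must be exactly one of the listed values) are my assumptions, not a standard.

```python
from typing import Any

# Field names mirror the provenance structure above.
# The validation rules are illustrative assumptions.
REQUIRED_FIELDS = (
    "base_model",
    "training_method",
    "training_data_origin",
    "weight_lineage",
    "tokenizer_lineage",
    "evaluation_harness",
    "deployment_boundary",
    "license_obligations",
)

ALLOWED_METHODS = {
    "pretrain", "continued_pretrain", "sft", "dpo", "merge", "adapter",
}


def validate_provenance(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the payload is complete."""
    problems = []
    for field_name in REQUIRED_FIELDS:
        value = payload.get(field_name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty: {field_name}")
    method = payload.get("training_method")
    if isinstance(method, str) and method.strip() and method not in ALLOWED_METHODS:
        problems.append(f"unknown training_method: {method}")
    return problems
```

A vendor answer of "homegrown" in `training_method` fails this check, which is the point: the marketing noun has to resolve to a method you can audit.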
We use the same instinct on custom model projects: before arguing about accuracy, we ask what artifact we're actually responsible for. A model without lineage is like a container image with no Dockerfile, no base image digest, and no SBOM. It might run. Still, you shouldn't ship it into production.

The difference between a fine-tune, a merge, and a native model
A real audit starts by removing marketing nouns.
A base model is the upstream set of weights the vendor started from. Fine-tuning changes those weights by training on additional examples. Adapter fine-tuning, such as LoRA, trains a smaller set of parameters that can later be applied to the base model. A merge combines multiple models or adapters into a new checkpoint, often using tools that interpolate or combine weight tensors.
None of these is automatically bad.
Merges are popular because they can improve behavior quickly without paying the full cost of pretraining. The risk is that lineage disappears when the vendor doesn't disclose the recipe. A "national" model that is mostly an undisclosed upstream model plus a local instruction adapter may still be useful, but it shouldn't be sold as sovereign infrastructure without caveats.
Native pretraining is a different beast. It requires corpus design, tokenizer decisions, distributed training infrastructure, checkpoint management, and serious evaluation. When someone claims they trained a frontier-sized model from scratch, ask for operational evidence. Training logs, data mixture documents, optimizer settings, checkpoint schedules, failure reports, and compute invoices tell a different story than a launch deck.
Most buyers don't need to ask "is this pure?" Ask "do the claims match the artifact, and do the licenses permit our use?"
How to inspect the model artifact before anyone demos it
Start with the files.
Model cards are useful, but they're not evidence by themselves. Evidence lives in repository history, configuration files, tokenizer assets, safetensors metadata, adapter files, evaluation scripts, and dependency pins.
Ask for a frozen artifact package, not a mutable link. You want exact commit IDs and SHA-256 hashes for the weights, tokenizer files, config files, and any merge or conversion scripts.
A basic inspection should include:
- config.json, especially architecture, hidden size, layer count, attention heads, rope settings, and vocabulary size
- Tokenizer files, including tokenizer.json, merges, special tokens, and added vocabulary
- Weight shard names and sizes
- Safetensors metadata, if present
- Adapter configs, such as LoRA rank, alpha, target modules, and base model reference
- Merge scripts, mergekit YAML, or equivalent recipes
- Training and evaluation repo commits
- License files for every upstream component
Tokenizer assets are often where the story leaks. Teams may claim a local language model, but the tokenizer is unchanged from an upstream family. That isn't inherently a problem, yet it tells you the model likely inherits behavior, limits, and licensing constraints from that upstream family.
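A rough way to quantify "unchanged from an upstream family" is to compare token-to-ID maps. The sketch below assumes you have already extracted each vocabulary into a plain dict of token string to integer ID; how you do that depends on the tokenizer format, and the function name is hypothetical.

```python
def vocab_overlap(candidate_vocab: dict[str, int],
                  parent_vocab: dict[str, int]) -> dict:
    """Compare two token-to-id maps and summarize how much they share."""
    shared = candidate_vocab.keys() & parent_vocab.keys()
    union_size = len(candidate_vocab.keys() | parent_vocab.keys())
    # Tokens that exist in both vocabularies AND map to the same id.
    same_id = sum(1 for tok in shared if candidate_vocab[tok] == parent_vocab[tok])
    return {
        "jaccard": len(shared) / union_size if union_size else 0.0,
        "same_id_fraction": same_id / len(shared) if shared else 0.0,
        "added_tokens": sorted(candidate_vocab.keys() - parent_vocab.keys()),
    }
```

A high overlap with identical IDs, plus a short list of added tokens, suggests an inherited tokenizer with local additions. That is a lead for the lineage document, not a verdict.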
Architecture fingerprints help too. Matching layer count, hidden size, intermediate size, attention head layout, rope scaling, and tokenizer vocabulary doesn't prove copying. It does mean that "trained from scratch" needs stronger evidence.
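The fingerprint comparison can be sketched as a small diff over two parsed config files. The key names below are ones commonly seen in Hugging Face style config.json files, but treat the list as an assumption: architectures name these fields differently, and a missing key is not evidence either way.

```python
import json

# Commonly used config.json keys; adjust per architecture (assumption).
FINGERPRINT_KEYS = (
    "num_hidden_layers",
    "hidden_size",
    "intermediate_size",
    "num_attention_heads",
    "num_key_value_heads",
    "rope_theta",
    "vocab_size",
)


def load_config(path: str) -> dict:
    """Read a config.json file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def compare_configs(candidate: dict, parent: dict) -> dict:
    """Sort fingerprint keys into matching, differing, and missing buckets."""
    report = {"match": [], "differ": [], "missing": []}
    for key in FINGERPRINT_KEYS:
        if key not in candidate or key not in parent:
            report["missing"].append(key)
        elif candidate[key] == parent[key]:
            report["match"].append(key)
        else:
            report["differ"].append(key)
    return report
```

If everything matches except `vocab_size`, the likely story is an upstream architecture with added tokens, and the vendor's lineage document should say so.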
API teams should put this into a repeatable intake script. Parse configs. Hash files. Store lineage metadata in your model registry. Later, if the model appears in an agent, a voice stack, or a public service, you should be able to answer which upstream artifacts are in the call path.
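A minimal sketch of the hashing half of that intake script, using only the standard library. The manifest layout and the function names are my own choices; what matters is that every file in the frozen package gets a digest you can store in the registry and re-check later.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight shards are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(artifact_dir: str) -> dict:
    """Map every file under artifact_dir to its size and SHA-256 digest."""
    root = Path(artifact_dir)
    files = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            files[path.relative_to(root).as_posix()] = {
                "bytes": path.stat().st_size,
                "sha256": sha256_file(path),
            }
    return {"files": files}
```

Run it once at intake and again before deployment. If the two manifests differ, the artifact you reviewed is not the artifact you are shipping.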
Merge detection is not magic, but it's useful
Filenames rarely prove a merge. Numerical comparison does more work.
A common approach is to compare tensors between the candidate model and suspected parents. Load matching tensors, normalize them, and compute similarity metrics layer by layer. When large portions of the candidate are extremely close to an upstream model, that tells you something. A candidate that looks like a weighted combination of two known models across many layers tells you more.
A simple internal tool can start with:
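import torch
from safetensors.torch import load_file
from torch.nn.functional import cosine_similarity

# Load one weight shard from each model; the shard names are placeholders.
a = load_file("candidate/model-00001-of-000xx.safetensors")
b = load_file("suspected_parent/model-00001-of-000xx.safetensors")

# Compare only tensors that exist under the same name in both shards.
for name in sorted(set(a.keys()) & set(b.keys())):
    if a[name].shape != b[name].shape:
        continue  # different shapes cannot be compared element by element
    # Upcast to float32 so half-precision weights compare reliably.
    va = a[name].flatten().float()
    vb = b[name].flatten().float()
    score = cosine_similarity(va, vb, dim=0).item()
    if score > 0.999:
        print(name, score)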
This toy snippet won't settle a procurement dispute. It will tell you whether to ask harder questions.
Real audits compare multiple candidate parents, account for quantization, inspect deltas, and test whether the candidate can be reconstructed from a stated merge recipe. A vendor claiming "we fine-tuned Qwen-family weights on local public sector data" should show tensor similarity to the base model. A vendor claiming "we trained an independent model" has a harder explanation for near-identical tensors across broad layers.
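One reconstruction test is simple enough to sketch: check whether a candidate tensor is a straight linear interpolation of two suspected parents. This is a least-squares fit under the assumption of a single blend weight per tensor. Real merge recipes can be more complex, so a poor fit here rules out only this one recipe. The version below uses plain Python sequences to stay dependency-free; for real tensors you would vectorize the same arithmetic.

```python
from typing import Sequence


def fit_linear_merge(candidate: Sequence[float],
                     parent_a: Sequence[float],
                     parent_b: Sequence[float]) -> tuple[float, float]:
    """Fit candidate ~= alpha * parent_a + (1 - alpha) * parent_b.

    Returns (alpha, relative_residual). A small residual means the tensor
    is well explained by interpolating between the two parents.
    """
    if not (len(candidate) == len(parent_a) == len(parent_b)):
        raise ValueError("tensors must have the same number of elements")
    # Rearranged model: (candidate - parent_b) = alpha * (parent_a - parent_b)
    diff = [a - b for a, b in zip(parent_a, parent_b)]
    target = [c - b for c, b in zip(candidate, parent_b)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("parents are identical; alpha is not identifiable")
    alpha = sum(t * d for t, d in zip(target, diff)) / denom
    residual = sum((t - alpha * d) ** 2 for t, d in zip(target, diff)) ** 0.5
    scale = sum(c * c for c in candidate) ** 0.5
    return alpha, (residual / scale if scale else residual)
```

If the fitted alpha is stable across many layers and the residual stays near zero, ask the vendor for the merge recipe by name.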
Merges leave other footprints. Some layers may resemble one parent while others resemble another. Embedding and output layers may stay untouched. Adapter merges may preserve the base model almost exactly except in targeted modules. Don't punish reuse. Force reuse into the open.
License review has to follow the weights, not the slide deck
Licenses attach to upstream models, datasets, code, and sometimes generated artifacts. A merge may stack obligations.
Many "sovereign AI" pitches get sloppy here. They talk about hosting jurisdiction and local branding, while the actual artifact may inherit restrictions from a base model, an instruction dataset, an evaluation dataset, or a model merge. Lawyers can't review what engineers don't surface.
For each upstream component, require:
- Model name and version
- Source URL and commit or release
- License text
- Redistribution terms
- Commercial use terms
- Attribution requirements
- Acceptable use restrictions
- Any dataset-specific constraints
Don't accept "open source" as a license description. Open weights, source-available code, research-only terms, and permissive commercial licenses are not the same thing.
Product context matters too. If you're embedding a model in an on-device AI workflow, redistributing weights to customer hardware may trigger different obligations than calling a hosted endpoint. If you're building a voice AI system, recordings, transcripts, and fine-tuning data bring their own consent and retention constraints. For RunHotel, we care a lot about what can run locally and what data leaves the property because deployment boundaries are part of the product contract, not an afterthought.
A license audit should produce a yes, no, or conditional answer for each intended use. Any audit that produces a shrug should keep the model out of production.
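One way to keep the audit from ending in a shrug is to encode the license matrix and take the most restrictive answer per use. This is a sketch of the bookkeeping only: the component names and use labels in the example are hypothetical, and the verdicts themselves still have to come from legal review.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Ordered so that max() picks the most restrictive answer.
    YES = 0
    CONDITIONAL = 1
    NO = 2


def audit_use(license_matrix: dict[str, dict[str, Verdict]], use: str) -> Verdict:
    """Return the most restrictive verdict across all upstream components.

    A component with no recorded answer for `use` counts as NO, because an
    unreviewed obligation is the shrug case.
    """
    if not license_matrix:
        return Verdict.NO
    return max(terms.get(use, Verdict.NO) for terms in license_matrix.values())
```

For example, if a base model allows hosted inference but an instruction dataset has no recorded answer for redistribution, `audit_use(matrix, "redistribution")` returns `Verdict.NO` until someone reviews it.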
Benchmark claims need reproducible endpoints
Benchmark tables are the easiest part of an LLM launch to get wrong.
Not always maliciously. A team runs evaluation with one prompt template, samples with forgiving decoding settings, filters failed runs, compares against stale baselines, and publishes a clean table. By the time a buyer sees it, the table looks objective. It isn't.
Ask for the harness.
You want the exact evaluation code, prompt templates, dataset versions, model revision, tokenizer revision, inference settings, hardware, quantization mode, batch size, and random seeds where relevant. API models need one more check: did the benchmark use the same endpoint customers will call? A public API with safety filters, routing, caching, or a smaller distilled model behind it must be benchmarked through that path.
Web and API discipline helps here. Treat evals like integration tests. The request body should be visible. The response parser should be visible. Timeouts and retries should be visible. A benchmark that requires manual cleanup isn't one you can rely on for procurement.
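As a sketch of what "no manual cleanup" looks like in code, here is a strict response parser and scorer. The `ANSWER: <letter>` convention is a hypothetical prompt format chosen for illustration. The design choice that matters is that unparseable output counts as a failure and is reported, never quietly dropped.

```python
import re
from typing import Optional

# Hypothetical output convention: a line of the form "ANSWER: B".
ANSWER_PATTERN = re.compile(r"^ANSWER:\s*([A-D])\s*$", re.MULTILINE)


def parse_answer(response_text: str) -> Optional[str]:
    """Extract a multiple-choice answer, or None if the format is violated."""
    matches = ANSWER_PATTERN.findall(response_text)
    # Zero matches or conflicting matches both count as unparseable.
    if len(set(matches)) != 1:
        return None
    return matches[0]


def score_run(responses: list[str], gold: list[str]) -> dict:
    """Score a run; unparseable outputs are counted as failures."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold labels must align")
    parsed = [parse_answer(r) for r in responses]
    correct = sum(1 for p, g in zip(parsed, gold) if p == g)
    return {
        "total": len(gold),
        "correct": correct,
        "unparseable": sum(1 for p in parsed if p is None),
        "accuracy": correct / len(gold) if gold else 0.0,
    }
```

When a vendor's table and your rerun disagree, the `unparseable` count is often where the gap lives.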
For an agentic use case, raw benchmark scores matter less than tool-call reliability, refusal behavior, latency, and cost per completed task. We see this often in AI agent builds: a model with a better public reasoning score can be worse once it has to call APIs with strict schemas and recover from partial failures.
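Tool-call reliability is easy to measure once you decide what a valid call is. The sketch below assumes each tool call arrives as a JSON object with an `arguments` object inside it. That payload shape and the function names are assumptions for illustration, not any vendor's format.

```python
import json


def check_tool_call(raw: str, required_args: dict[str, type]) -> str:
    """Classify one raw tool-call payload as ok, invalid_json, or invalid_schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(call, dict):
        return "invalid_schema"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return "invalid_schema"
    # Every required argument must be present with the expected type.
    for name, expected_type in required_args.items():
        if not isinstance(args.get(name), expected_type):
            return "invalid_schema"
    return "ok"


def tool_call_reliability(raw_calls: list[str], required_args: dict[str, type]) -> float:
    """Fraction of calls that parse and satisfy the required argument types."""
    if not raw_calls:
        return 0.0
    ok = sum(1 for raw in raw_calls if check_tool_call(raw, required_args) == "ok")
    return ok / len(raw_calls)
```

Track this number per model revision alongside latency and cost per completed task. It tends to move when a vendor silently swaps what sits behind the endpoint.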
Compare vendors with a private eval set from your own workflows. Keep it small enough to inspect manually. A hundred high-signal tasks from your domain beat a leaderboard screenshot.
Sovereignty is an operational property
A model isn't sovereign because a press release says it is. Sovereignty shows up in operations.
Could you run it without a foreign API dependency? Are patches possible without asking the original vendor? Is training and safety data lineage inspectable? Are retention, deletion, and access controls enforceable? Could you reproduce the artifact from source materials if the vendor disappears?
For some teams, full sovereignty is overkill. A hosted frontier API with a strong data processing agreement may be the right decision. Public infrastructure, defense-adjacent systems, healthcare, finance, and citizen services often sit in a different risk category, where provenance isn't optional.
The mistake is buying a local label when what you need is control.
That is why we often start AI consulting engagements with a deployment and data-flow map before model selection. Where does the prompt originate? Where is it logged? What processor touches it? What model version answered it? Who can override it? Those questions sound bureaucratic until something fails and you need a clean incident timeline.
A sovereign LLM procurement should include a kill switch, version pinning, audit logs, exportable weights where legally allowed, and a tested fallback. Vendors that can't support those basics won't be saved by the model's origin story.
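Those controls fit in a thin wrapper around the model call. This is a minimal sketch under simple assumptions: both models are plain callables from prompt to text, the kill switch is a boolean, and the audit log is an in-memory list. A production version would persist the log and distinguish error types.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GuardedModel:
    """Wrap a primary model with a kill switch, pinned revisions, and a fallback."""
    primary: Callable[[str], str]
    fallback: Callable[[str], str]
    primary_revision: str
    fallback_revision: str
    killed: bool = False
    audit_log: list = field(default_factory=list)

    def _record(self, route: str, revision: str, outcome: str) -> None:
        self.audit_log.append({"route": route, "revision": revision, "outcome": outcome})

    def complete(self, prompt: str) -> str:
        if not self.killed:
            try:
                result = self.primary(prompt)
                self._record("primary", self.primary_revision, "ok")
                return result
            except Exception as exc:
                # Broad on purpose: any primary failure routes to the fallback.
                self._record("primary", self.primary_revision, f"error: {type(exc).__name__}")
        result = self.fallback(prompt)
        self._record("fallback", self.fallback_revision, "ok")
        return result
```

The audit log answers "what model version answered it?" for every call, including the ones the fallback served.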
Red flags that should slow the deal down
Some signals don't prove misconduct, but they justify a deeper audit.
Slow down when a vendor refuses to name upstream models while claiming strong benchmark parity with known open models. If the tokenizer and architecture match a well-known base model while the model card implies clean-room training, ask for evidence. A license file shorter than the dependency chain is another warning. Benchmark claims without scripts deserve no credit.
Also watch for sudden naming jumps. When a model appears with a giant parameter count, little training detail, no public checkpoints during development, and no reproducibility package, ask how it was built. Serious training leaves exhaust.
The Rio discussion is valuable because it turns a vague branding issue into a concrete engineering question: does the artifact match the claim? The GitHub issue alleges a mismatch between public positioning and model lineage (Nex-N2 GitHub issue). A buyer doesn't need to wait for internet consensus before improving their own process.
Make your procurement checklist boring. No lineage, no production pilot. No license map, no redistribution. No reproducible eval, no benchmark credit.
What we ask vendors before a pilot
For a serious LLM pilot, we ask for evidence before architecture diagrams.
The minimum package is a frozen model artifact, a lineage document, a license matrix, an evaluation harness, and an inference contract. The inference contract matters more than people think. It defines context length, supported parameters, streaming behavior, rate limits, error shapes, safety filter behavior, versioning, and deprecation policy.
A model endpoint that can't return structured, distinguishable error responses will cause pain in production. Application code needs to know whether a refusal, timeout, content filter, overload, or schema failure happened. If all failures return 500 model_error, you're signing up for guesswork.
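Here is a sketch of why distinct error codes matter to application code. The code names and the action assigned to each are illustrative assumptions. The point is that a dispatch table like this is impossible to write against a single generic failure.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    FIX_REQUEST = "fix_request"
    SURFACE_TO_USER = "surface_to_user"
    ESCALATE = "escalate"


# Hypothetical mapping from vendor error codes to client behavior.
ERROR_ACTIONS = {
    "rate_limited": Action.RETRY_WITH_BACKOFF,
    "inference_timeout": Action.RETRY_WITH_BACKOFF,
    "context_overflow": Action.FIX_REQUEST,
    "invalid_schema": Action.FIX_REQUEST,
    "safety_refusal": Action.SURFACE_TO_USER,
}


def action_for(error_code: str) -> Action:
    """Unknown or generic codes escalate, because the caller cannot tell what happened."""
    return ERROR_ACTIONS.get(error_code, Action.ESCALATE)
```

With a catch-all `model_error`, every failure lands in `ESCALATE`, which in practice means a human reading logs.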
Use something like this as a starting contract:
{
  "model_id": "vendor-model-name",
  "model_revision": "sha256-or-commit",
  "input_schema": "chat-completions-compatible-v1",
  "max_context_tokens": 0,
  "supports_json_schema": true,
  "streaming": true,
  "error_codes": [
    "rate_limited",
    "context_overflow",
    "safety_refusal",
    "invalid_schema",
    "inference_timeout"
  ],
  "data_retention": "none | configurable | vendor-defined",
  "upstream_lineage": "attached"
}
Then test it with your own workload. If you're unsure how model choice affects infrastructure cost, run rough scenarios through an AI cost estimator before committing to a deployment shape. Large local models can make sense, but idle GPUs are not sovereignty. They're just expensive furniture.
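The idle-GPU point is plain arithmetic, sketched below. Every input is something you supply from your own quotes and load tests. The function and parameter names are hypothetical, and the model ignores power, staffing, and storage.

```python
def cost_per_1k_requests(gpu_hourly_cost: float,
                         gpu_count: int,
                         requests_per_hour_at_full_load: float,
                         utilization: float) -> float:
    """Hardware cost per 1,000 served requests at a given average utilization.

    utilization is the fraction of full-load throughput actually used (0 to 1].
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if requests_per_hour_at_full_load <= 0:
        raise ValueError("full-load throughput must be positive")
    hourly_cost = gpu_hourly_cost * gpu_count
    served_per_hour = requests_per_hour_at_full_load * utilization
    return hourly_cost * 1000 / served_per_hour
```

Cost per request scales inversely with utilization in this model, so a cluster that sits at a tenth of its capacity costs ten times more per request than the same cluster fully loaded.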
Create a one-page model provenance checklist this week, then require every vendor or internal team to fill it out with hashes, license links, upstream model names, and benchmark scripts before any pilot goes live.





