Solo tre parole: benchmarking local LLMs

“Solo tre parole: non sei solo.”

Six Italian words. The first three announce “only three words”. The last three deliver “you are not alone”. Read it again, slowly. The same word “solo” opens and closes the sentence, but it does different work each time: first as the adverb “only”, then as the adjective “alone”. And the supposedly three-word reassurance, non sei solo, contains exactly three words. The sentence is about itself. It is also, quietly, a little gem¹.

I gave it to a small zoo of local language models running on a laptop, asking each to translate it into English while preserving every nuance it could. What followed was more interesting than the question deserved. One model invented wordplay that did not exist. Another saw the cleverness with full clarity, articulated it precisely, and then ignored it. A third produced a candidate translation so contorted it was almost charming. And the model that most clearly understood the puzzle took sixteen minutes to say so.

This is the story of that afternoon.

Why the phrase is sneakier than it looks

Translation, at its dullest, is word substitution. At its most interesting, it is constraint satisfaction under aesthetic pressure: keep the meaning, keep the tone, keep the rhythm, and if there is a clever trick in the source, ideally keep that too. Translating well is a highly skilled job.

Solo tre parole: non sei solo hides three small tricks in plain sight.

The first is lexical polysemy. Italian solo is doing two jobs in the same sentence: a quantifying adverb at the start, an existential adjective at the end. Same form, different role, different meaning. English has no single word that pulls double duty in quite the same way; we are forced to split the echo into two distinct lexical items, and the gentle internal rhyme of the original collapses.

The second is self-reference. The opening clause announces the length of the second clause, and the second clause delivers exactly that length. Non sei solo is genuinely three words. The sentence describes itself accurately. Most English candidate translations, like “you are not alone”, break this property: four words, not three. To preserve the self-reference, you need a contraction (“you’re not alone”) or some less natural construction.

The third is register. The sentence is intimate, minimalist, the kind of thing one writes on a postcard or sends as a message at a hard moment. It is not florid, it is not formal. Anything that translates the meaning but reaches for “you are not in solitude” misses the point entirely.

So that is the brief: hold polysemy, self-reference, and register together in six English words, or in whatever alternative you can find. Possible, but only just.

The setup

I tested everything through Ollama on an M4 MacBook Air with 24GB of RAM, using the same prompt across all models. I do have a PC with a better GPU and better cooling, but its GPU memory could not hold most of the models tested, which is why the passively cooled MacBook Air ended up serving as an LLM powerhouse.

The prompt asks the model to do five things, in order:

  1. Provide a literal gloss.
  2. Identify any wordplay, double meanings, self-referential structure, register choices, or cultural framing.
  3. Explain the genuinely hard parts.
  4. Offer two or three candidate translations, each with what it preserves and what it sacrifices.
  5. Pick a recommended translation and justify the choice.

The structure is deliberately fussy. Small local models tend to leap straight from “I see Italian” to “here is a translation”, missing every interesting layer along the way. Forcing a literal gloss first slows them down. Forcing multiple candidates with explicit trade-offs makes them name what they are giving up. Asking for justifications stops them waving their hands.

The lineup, more or less in order of running:

  • llama3.1:8b
  • granite4.1:8b
  • mistral-small3.2:24b
  • gemma4:e4b (small) and gemma4:26b (large), both with native thinking mode
  • deepseek-r1:8b and deepseek-r1:14b, both with native thinking
  • qwen3.6:27b, with thinking
  • gpt-oss:20b, with thinking

All Q4_K_M quantisation, except gpt-oss which uses MXFP4. So roughly comparable on the quantisation front, with one small caveat for the gpt-oss numbers.
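
The runs themselves were done interactively (the /set commands mentioned later in this post are REPL commands), but the same comparison is easy to script. Here is a minimal sketch using the ollama Python client, assuming the models above are already pulled and with a placeholder standing in for the full five-part prompt, which I am not reproducing here:

```python
# Minimal sketch: the same prompt across the whole lineup via the ollama Python client.
# Assumes `pip install ollama`, a running Ollama server, and the models already pulled.
import time
import ollama

MODELS = [
    "llama3.1:8b", "granite4.1:8b", "mistral-small3.2:24b",
    "gemma4:e4b", "gemma4:26b",
    "deepseek-r1:8b", "deepseek-r1:14b",
    "qwen3.6:27b", "gpt-oss:20b",
]

PROMPT = "..."  # the full five-part prompt described above

for model in MODELS:
    start = time.time()
    res = ollama.chat(model=model, messages=[{"role": "user", "content": PROMPT}])
    wall = time.time() - start
    # Field names follow the Ollama chat API; durations are reported in nanoseconds.
    rate = res["eval_count"] / (res["eval_duration"] / 1e9)
    print(f"=== {model}: {wall:.0f} s wall clock, {rate:.1f} tok/s ===")
    print(res["message"]["content"])
```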

How small models fail when asked to be clever

The most striking failure was llama3.1:8b. Asked to find subtlety in the source, it confidently told me that tre is phonetically similar to t’re, “which sounds like ‘there'”. This is invented. There is no such pun. The model, faced with a request to find wordplay, hallucinated wordplay rather than admit it could not find any.

This is the worst sort of small-model failure. A miss is recoverable; a fabrication looks like analysis and is not. If you do not speak the source language, you have no way to check. The model produced clean prose, confident structure, and made-up linguistics underneath.

granite4.1:8b did better: it identified the solo/solo polysemy, but its account of what the polysemy actually did in the sentence collapsed into incoherence. It missed the self-referential count entirely.

These are the small-model results in a nutshell: 8B parameters at Q4 quantisation does not appear to be enough capacity to hold polysemy, structural self-reference, and register all at once. Something has to give, and it does.

The analysis–translation gap

A more interesting failure showed up in the larger models. gpt-oss:20b is the cleanest example.

It saw the polysemy: “solo occurs twice, first as the adverb ‘only’, then as the adjective ‘alone'”. It saw the self-reference: “the phrase claims that the whole sentence consists of just three words”. Then in step 3, it noted, in plain English: “the English equivalent — ‘Only three words: you are not alone’ — has four words, so the exact numeric precision is lost.”

It saw the problem with full clarity. Then it proposed three candidates, none of which solved the problem, and recommended one that did not either.

deepseek-r1:14b showed the same shape. Sharp analysis, all candidates fail the count, recommendation flat.

This is more interesting than “didn’t see the problem”. These models did see it. They simply could not turn the seeing into a generation constraint. Identifying a problem and constructively satisfying it are, apparently, separate skills. Constraint identification looks like memorisation and pattern-matching; constraint satisfaction in English requires the model to feel its way to “you’re not alone” (counting the contraction as one word, which is the cleanest fix available) rather than describe its way there.

What thinking mode actually buys you

gemma4:26b with thinking mode enabled was the only model in the batch that caught everything and knew it had caught everything. Its analysis used phrases like “lexical echo” and “semantic mirror”. Its recommended translation, Just three words: you’re not alone, came with an explicit note: treating the contraction “you’re” as a single word preserves the 3:3 word count of the original. It did not stumble into the answer; it reasoned to it.

I then ran the same model with /set nothink. Same weights. Same prompt. Different answer.

The non-thinking version flatly stated, “there is no linguistic wordplay in the sense of puns.” This is wrong. With thinking off, Gemma’s failure mode collapsed neatly onto Mistral’s, missing the polysemy entirely.

That single comparison — same model, same query, thinking on versus off, opposite verdicts on whether wordplay even exists — is the cleanest demonstration I have seen of what reasoning-at-inference-time actually contributes. It is not just a quality boost. It is the ability to revise a first impression. Without thinking, the first pass is the answer, and a confident first pass can be confidently wrong.
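
The toggle is also scriptable, for anyone who wants to reproduce the comparison outside the REPL. A minimal sketch, assuming a recent ollama Python client that exposes a think flag (the interactive equivalent is /set think and /set nothink):

```python
import ollama

PROMPT = "..."  # the same five-part translation prompt

for think in (True, False):
    res = ollama.chat(
        model="gemma4:26b",
        messages=[{"role": "user", "content": PROMPT}],
        think=think,  # same weights, same prompt; only inference-time reasoning changes
    )
    print(f"--- thinking={'on' if think else 'off'} ---")
    print(res["message"]["content"])
```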

A small experiment with Mistral

mistral-small3.2:24b does not have native thinking. So I tried to fake it.

At Claude’s suggestion (see the credits at the bottom of this post), I added a “step 0” to the prompt:

Before writing the visible sections, work through the source carefully: list the words individually, check whether any word appears more than once, decide for each repeated word whether the meanings are the same or different, and check whether the announced word count matches any clause in the sentence.

With this addition, Mistral suddenly caught the solo/solo polysemy it had missed completely on its first pass. The capability was in the base model; what had been missing was the procedure for using it. The hypothesis was cleanly confirmed: Mistral could see the polysemy, it simply had not gone looking for it in a single pass.

There was a twist. With the new instruction, Mistral lost track of the self-referential count, which it had caught in the original run. As if attention is a budget: spend it forcing one feature, lose it on another. Whether that is a real effect or a coincidence on this single sentence, I genuinely cannot tell.

The same run also gave me a textbook example of confabulation under structural pressure. One of Mistral’s candidate translations claimed to “preserve the shift from ‘only’ to ‘alone'” while sacrificing “the explicit count of three words”, but the candidate phrase was Three words only: you’re not alone. The words “three” and “words” are right there. The count is preserved. The model invented a sacrifice that did not exist, because the prompt asked for three differentiated candidates and only two of them were genuinely different.
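
Worth noting: neither structural constraint needs a language model to verify. A few lines of Python, a hypothetical checker written after the fact rather than part of the test itself, are enough to catch both the word count and the repeated-word echo in any candidate:

```python
from collections import Counter

def check(candidate: str, announced_count: int = 3) -> dict:
    """Check the two mechanical constraints on a candidate translation."""
    # The claim "only three words" refers to the clause after the colon.
    after_colon = candidate.split(":", 1)[-1].strip()
    n_words = len(after_colon.split())
    # Strip punctuation before looking for repeated surface forms (the solo/solo echo).
    words = [w.strip(".,:!?").lower() for w in candidate.split()]
    repeated = sorted(w for w, n in Counter(words).items() if n > 1)
    return {
        "count_preserved": n_words == announced_count,
        "repeated_words": repeated,
    }

print(check("Only three words: you are not alone"))  # count broken (four words), no echo
print(check("Just three words: you're not alone"))   # count preserved, no echo
print(check("Three words alone: you're not alone"))  # count preserved, 'alone' echoed
```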

The cost of cleverness

Capability is one axis. Speed is another, and there were surprises here too.

Model                   Eval rate (tok/s)   Wall clock
gpt-oss:20b             25,6                40 s
gemma4:26b              25,6                53 s
llama3.1:8b             20,8                21 s
deepseek-r1:8b          19,0                1m 14s
granite4.1:8b           18,6                27 s
deepseek-r1:14b         10,8                1m 41s
mistral-small3.2:24b    7,2                 50 s
qwen3.6:27b             2,8                 15m 57s

A 24B model running at 7,2 tok/s on the same hardware as a 26B model running at 25,6 tok/s is not what parameter-count instinct would predict. The biggest single factor turns out to be embedding dimension. Mistral’s 5.120-wide embeddings cost roughly 3,3× more compute per token than Gemma’s 2.816-wide ones, and that ratio matches the speed gap almost exactly. On Apple Silicon, where memory bandwidth is the binding constraint for inference, narrow-and-deep beats wide-and-shallow.
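
The arithmetic behind that figure, under the assumption that per-token cost grows roughly with the square of the embedding width (ignoring depth and every other architectural difference):

```python
gemma_width, mistral_width = 2816, 5120
predicted = (mistral_width / gemma_width) ** 2  # ≈ 3.3x more work per token
observed = 25.6 / 7.2                           # ≈ 3.6x, the speed gap from the table
print(f"predicted {predicted:.1f}x vs observed {observed:.1f}x")
```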

qwen3.6:27b is more puzzling. It has the same 5.120 embedding width as Mistral and DeepSeek 14B, yet ran at 2,8 tok/s, far slower than width alone explains. At 17GB it is comparable in size to gemma4:26b, and right at the limit of what a 24GB MacBook Air can run without heavy memory pressure. Yet Gemma answered in under a minute, so the slowdown likely lies elsewhere: a deeper network, unoptimised inference paths in Ollama for that architecture, or overhead from its 262.144-token context length. Whatever the cause, sixteen minutes for a single-sentence translation is not interactive. Quality-wise it was strong; usability-wise, unusable.

And the commercial chatbots?

A fair sanity check: how do the cloud-hosted models do on the same prompt?

ChatGPT (free version, GPT-5.5 as of writing) and Perplexity in default mode both performed at roughly the level of mistral-small3.2 or granite4.1: identifying one of the two layers, missing the other, recommending a flat translation. Defaulting to a consumer-friendly model presumably trades depth for cost, and the trade-off shows.

Perplexity with the Sonar model reached the gemma4:26b level: both layers caught, count preserved. Gemini in Fast+Thinking mode matched it too. So far, no surprises.

Claude with Opus 4.7 in Adaptive mode (which appears to engage thinking) also matched gemma4:26b on the first pass. But when I pushed it to carry everything from its own analysis into the translation itself, rather than merely declaring trade-offs, it came back with something none of the other models had produced:

Three words alone: you’re not alone.

That is genuinely clever. The word “alone” appears twice, doing different work each time — first as a postpositive adverb meaning “merely” or “by themselves”, then as the predicate adjective meaning “solitary” — directly mirroring the solo/solo echo of the original. The post-colon clause is exactly three words. Register holds. It is the only translation across the entire test, local or commercial, that preserves all three constraints simultaneously.

The wider observation, perhaps: the gap between the best commercial cloud chatbot and the best local model on a MacBook Air is now smaller than the gap within either category. A well-chosen local model beats a default-mode commercial chatbot. And the difference between a thinking and non-thinking variant of the same model is larger than the difference between one good thinking model and another, regardless of where it runs.

Takeaways, more or less

A few things I will be carrying forward from this afternoon.

Parameter count is a poor proxy for almost everything. A 24B model can be slower than a 26B one and produce weaker analysis. Architecture, training, and inference mode all dominate. “Size class” is a simplification that hides every interesting variable.

Thinking mode does real work. When a task requires anything more than one-pass pattern-matching — counting, cross-referencing, constraint satisfaction — disabling thinking will silently cripple the model. The same gemma4:26b confidently denied wordplay existed without thinking, and confidently dissected it with thinking. If your local model supports /set think, leave it on for anything subtle.

Identification is not satisfaction. Several models cleanly described the problem and then produced answers that ignored their own description. Knowing the constraint and respecting it during generation are separate capabilities, and the second one is rarer.

Confabulation is the worst failure mode. Llama 3.1’s invented phonetic pun, and Mistral’s invented preserves/sacrifices justifications, are more dangerous than missing an answer. Missing leaves you uncertain; confabulating leaves you confidently wrong. Smaller models do it more, but no model is immune.

Local LLMs on a laptop are remarkable but not magical. A MacBook Air can now run models that catch literary wordplay in a foreign language. It can also run models that hallucinate confidently and sound convincing whilst doing it. The gap between those two modes is, increasingly, the more important question.

For what it is worth, my favourite translation remains the one gemma4:26b reasoned its way to:

Just three words: you’re not alone.

Three words. Precisely. With deliberate intent — and in a sentence about being seen, that is the whole point.


¹ I got the idea while reading this article from The Guardian: ‘Being human helps’: despite rise of AI is there still hope for Europe’s translators? by P. Oltermann.

The full prompt and raw model outputs are available on request.

Credits: The translation prompt and the draft of this post were developed in conversation with Claude Opus 4.7 (Anthropic). Any errors of judgement remain mine 😉.
