Advanced AI Models Produce Up to 50 Times More CO₂ Than Common LLMs — with a Catch
AI Implementation · 4 min read


Sam Carter

AI Strategy Consultant

November 7, 2025

A new study reveals a troubling trade-off: the more “thinking” we ask of advanced reasoning AI models, the steeper the environmental cost. In many cases, answers that require more logic, deliberation, or internal reasoning generate up to 50 times more carbon dioxide (CO₂) emissions than responses from more concise models (SciTechDaily; Frontiers; Live Science).

What did the researchers do?

  • A team from Hochschule München University of Applied Sciences in Germany compared 14 large language models (LLMs), ranging in size from 7 billion to 72 billion parameters.
  • They asked all these models the same 1,000 benchmark questions, divided roughly between multiple-choice and free-response formats. The questions spanned topics like abstract algebra, philosophy, high school mathematics, world history, and international law.
  • They distinguished between two kinds of models: “concise models” (which aim to give shorter, more direct answers) and “reasoning-enabled models” (which use internal steps, explanations, or “chain of thought” style reasoning).

Key findings:

Emission difference is huge:

  • Reasoning models produced many more “thinking tokens,” the internal pieces of text or logic that a model generates before giving its final answer. On average, reasoning-enabled models generated about 543.5 thinking tokens per question, while concise models generated only about 37.7. That’s roughly 14 times as many tokens.

Up to 50× emissions, but not always better answers:

  • These extra tokens translate directly into more computation, more energy, and more CO₂. The reasoning models in the study emitted up to 50 times the CO₂ of concise models when answering the same questions. But more tokens and more reasoning did not always mean dramatically higher accuracy; in many cases the returns diminished (Popular Science; Frontiers).
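
To make the scaling concrete, here is a minimal back-of-envelope sketch in Python. It assumes, purely for illustration, that inference emissions scale roughly linearly with the number of generated tokens; real emissions also depend on model size, hardware, and the data center’s energy mix, which is how the gap stretches to 50× in the study.

```python
# Back-of-envelope: how thinking tokens inflate per-question emissions.
# Assumption (illustrative only): emissions scale ~linearly with generated
# tokens; real figures also depend on model size, hardware, and grid mix.

THINKING_TOKENS_REASONING = 543.5  # avg per question (study figure)
THINKING_TOKENS_CONCISE = 37.7     # avg per question (study figure)

token_ratio = THINKING_TOKENS_REASONING / THINKING_TOKENS_CONCISE
print(f"Token ratio: {token_ratio:.1f}x")  # ~14.4x

# Under the linear assumption, a concise answer costing a hypothetical
# 1 g CO2e would cost ~14 g from a reasoning model -- before the larger
# per-token cost of bigger models widens the gap further.
grams_concise = 1.0  # hypothetical baseline, not a study figure
grams_reasoning = grams_concise * token_ratio
print(f"Estimated reasoning-model emissions: {grams_reasoning:.1f} g CO2e")
```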

Accuracy-sustainability trade-off:

  • The most accurate model in the study was a reasoning-enabled model called Cogito (70 billion parameters), reaching around 84.9% accuracy across the benchmark. But it also emitted about three times as much CO₂ as other large models that gave concise answers. Models that kept emissions under roughly 500 grams of CO₂ equivalent across the benchmark generally did not achieve over 80% accuracy (Frontiers; Anthropocene Magazine).
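
One way to operationalize this trade-off is to pick the most accurate model that fits an emissions budget. The sketch below is a minimal illustration of that idea; Cogito’s accuracy figure comes from the study, while its emissions value and the other model entries are hypothetical placeholders.

```python
# Pick the most accurate model within an emissions budget (grams CO2e
# across the 1,000-question benchmark). Cogito's accuracy is from the
# study; all emissions values and other rows are hypothetical.

models = [
    # (name, accuracy_percent, emissions_g_co2e)
    ("cogito-70b (reasoning)", 84.9, 1500.0),  # emissions illustrative
    ("concise-model-a",        78.0,  400.0),  # hypothetical
    ("concise-model-b",        72.5,  150.0),  # hypothetical
]

def best_within_budget(models, budget_g):
    """Return the most accurate model whose emissions fit the budget."""
    eligible = [m for m in models if m[2] <= budget_g]
    return max(eligible, key=lambda m: m[1]) if eligible else None

print(best_within_budget(models, budget_g=500.0))
# -> ('concise-model-a', 78.0, 400.0): consistent with the study's
#    pattern that models under ~500 g CO2e stayed below 80% accuracy.
```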

Subject matter matters:

  • Questions in abstract or philosophical areas (such as abstract algebra or philosophy) raised emissions by as much as six times compared with more straightforward topics like high school history, because harder topics push the models to generate more “thinking tokens.”

Real-world scale amplifies impact:

  • The researchers point out that while one prompt or question might not make much of a difference, multiplied across millions of uses the extra emissions add up. For example: asking DeepSeek’s R1 model 600,000 questions could produce CO₂ emissions roughly equivalent to a round-trip flight from London to New York. In contrast, a model like Qwen-2.5 could answer more than three times as many questions with similar accuracy for the same emissions (Frontiers).
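
To get a rough sense of the per-question cost behind that comparison: if we assume, for illustration only, that a round-trip economy flight from London to New York emits on the order of one tonne of CO₂ per passenger (a commonly cited ballpark, not a figure from the study), the 600,000-question comparison implies only a gram or two per question. The sketch below does that arithmetic and shows how it compounds at scale.

```python
# Rough per-question arithmetic behind the flight comparison.
# Assumption (not from the study): a round-trip London-New York flight
# emits on the order of 1,000 kg CO2 per passenger.

FLIGHT_KG_CO2 = 1000.0  # illustrative ballpark, per passenger round trip
QUESTIONS = 600_000     # DeepSeek R1 question count from the study

grams_per_question = FLIGHT_KG_CO2 * 1000 / QUESTIONS
print(f"~{grams_per_question:.1f} g CO2 per question")  # ~1.7 g

# Tiny per query, but it compounds at chatbot scale:
daily_queries = 1_000_000_000  # hypothetical global usage level
tonnes_per_day = daily_queries * grams_per_question / 1e6
print(f"~{tonnes_per_day:,.0f} tonnes CO2/day at that rate")
```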

Why this matters:

  • Environmental cost is rarely visible to users: when you type a question into an AI, you don’t see how much energy is being used behind the scenes. But the study shows it’s substantial and varies a lot depending on the model and the type of question.

  • Growing usage means a growing footprint: AI isn’t just for research labs anymore. Chatbots, virtual assistants, education platforms, content generation, and more all use LLMs at scale. Even small inefficiencies or extra emissions per question can aggregate massively.

  • Design decisions have consequences: model builders who emphasize reasoning, detailed chain-of-thought, or explanations are often doing so to improve answer quality. But those decisions come with trade-offs in energy, emissions, and sometimes response time.

What could be done:

Here are some possible mitigations and better practices, drawn from the study and expert commentary (a brief routing sketch follows the list):

  • Use concise models when possible: if the task doesn’t need deep reasoning or explanations (e.g. asking for simple facts), use a model or mode that gives direct answers.
  • Prompt engineering: asking for shorter answers, stating “in brief,” or “give me a summary” can reduce token generation, and hence emissions.
  • Limit reasoning-mode use to cases where it adds real value (e.g. a math proof, a philosophical argument, complex problem solving), rather than using it by default.
  • Model efficiency improvements: more efficient architectures, better hardware, optimized inference, and cleaner energy sources.
  • Transparency and metrics: more information from model developers on energy per query and CO₂-equivalent emissions, so users and organizations can make informed decisions.
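
Here is a minimal sketch combining the first three practices: route easy queries to a concise model, cap output length, and reserve a reasoning model for genuinely hard tasks. It uses the OpenAI Python SDK purely as a familiar example; the model names and the keyword heuristic are hypothetical placeholders, and any chat-style API with a token cap would work the same way.

```python
# Sketch: route easy queries to a concise model with a token cap;
# reserve a reasoning model for hard tasks. Model names and the
# keyword heuristic below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

HARD_HINTS = ("prove", "derive", "step by step", "philosophy", "algebra")

def answer(question: str) -> str:
    # Crude heuristic: only escalate to the reasoning model when the
    # question looks like it needs multi-step work.
    needs_reasoning = any(h in question.lower() for h in HARD_HINTS)
    resp = client.chat.completions.create(
        model="reasoning-model" if needs_reasoning else "concise-model",
        messages=[
            # A brevity instruction trims generated tokens on easy queries.
            {"role": "system", "content": "Answer briefly and directly."},
            {"role": "user", "content": question},
        ],
        # A hard cap on generated tokens bounds the energy per response.
        max_tokens=2000 if needs_reasoning else 150,
    )
    return resp.choices[0].message.content

print(answer("What year did the Berlin Wall fall?"))  # concise route
```

In practice the routing signal could come from a small classifier or from user intent rather than keywords; the point is simply that reasoning mode becomes an explicit, budgeted choice instead of the default.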

A human perspective:

It’s a sobering reminder that “smarter” isn’t always “better” in every dimension. As AI becomes more integrated into daily life, its unseen costs (energy, climate impact, resource usage) matter more. There’s a tension between wanting rich, detailed, human-like reasoning and being energy-efficient, much like driving a powerful car: more speed and more capability, but more fuel. Users, researchers, and companies will all need to balance expectations. Do I want the deepest possible answer, or an answer that’s “good enough” with a smaller environmental footprint? Are we okay with extra emissions for marginal gains in accuracy or detail?
