There are over 40 viable LLMs right now. Claude, GPT, Gemini, Llama, DeepSeek, Mistral, Grok — the list keeps growing. Every few weeks there's a new model claiming state-of-the-art on some benchmark, and every few weeks I find myself asking the same question: which one is actually better for my use case?
The standard workflow is painful. Open three browser tabs. Paste the same prompt. Wait. Squint at the outputs. Try to remember what the first model said while reading the third. Repeat for the next prompt.
I built yardstiq to fix this. It's an open-source CLI tool that sends one prompt to multiple models simultaneously, streams the responses side-by-side in your terminal, and gives you performance stats and optional AI judge scoring.
```bash
npx yardstiq "your prompt" -m claude-sonnet -m gpt-4o -m gemini-flash
```

After using it across hundreds of comparisons, here's what I found.
Why Compare LLMs Side-by-Side?
If you're building an AI-powered product, picking the wrong model costs you money, latency, and output quality. Benchmarks on leaderboards don't tell the full story — they measure standardized tasks, not your tasks.
Side-by-side comparison with your actual prompts gives you:
- Real performance data for your specific use case
- Cost visibility so you can optimize spend
- Latency metrics (time to first token, tokens per second) that affect user experience
- Quality differences that only show up with your domain-specific prompts
Setting Up LLM Comparisons With yardstiq
yardstiq requires zero installation. Run it directly with npx:
```bash
npx yardstiq "Explain quicksort in 3 sentences" -m claude-sonnet -m gpt-4o
```

You'll need API keys for the providers you want to compare. Set them as environment variables:
```bash
export ANTHROPIC_API_KEY=sk-...
export OPENAI_API_KEY=sk-...
export GOOGLE_GENERATIVE_AI_API_KEY=...
```

yardstiq supports 40+ models including Claude, GPT, Gemini, DeepSeek, Mistral, Grok, Llama, and any model available through Ollama for local inference.
Benchmarking 10 LLMs: Coding, Creative Writing, and Reasoning
I compared 10 models across three categories. For each, I ran 5 prompts and used yardstiq's AI judge feature (GPT-4.1) to score responses on a 1–10 scale while tracking performance metrics.
Coding: Writing a Production-Ready Rate Limiter
I asked each model to implement a token bucket rate limiter in Python with Redis backing, proper error handling, and tests.
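To make the task concrete, here's the kind of core logic the prompt asks for. This is a minimal in-memory sketch for illustration only, not any model's output; the actual prompt additionally required Redis backing, error handling, and tests.

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket. A production version would keep
    this state in Redis and refill atomically (e.g. via a Lua script)."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The interesting differences between models showed up precisely in the parts this sketch omits: atomicity under concurrent access and behavior when Redis is unreachable.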
Claude Sonnet consistently produced the most complete implementations — proper async handling, edge cases covered, meaningful tests. GPT-4o was close but leaned toward verbose code with more comments than substance. Gemini Pro produced code that looked correct but missed subtle concurrency issues.
The value story was surprising: DeepSeek V3.2 nearly matched GPT-4o quality at a fraction of the cost. Claude Haiku was the speed champion for quick scaffolding. Codestral punched above its weight on pure coding tasks.
Creative Writing: Sci-Fi Opening Paragraphs
This is where models diverged the most. Claude leaned literary — more metaphor, more interiority. GPT-4o was cinematic — action-oriented with sensory details front and center. Gemini Pro surprised me with less predictable word choices and occasionally genuinely interesting turns of phrase.
The budget models produced noticeably flatter prose. For creative work, premium models earn their cost.
Reasoning: Logic Puzzles and Multi-Step Deduction
The gap between models was smallest at the top and widest at the bottom. Claude Sonnet and GPT-4.1 both handled complex reasoning well. Reasoning-specialized models like DeepSeek R1 took longer but occasionally caught edge cases that general models missed.
Local models via Ollama were competitive on simpler tasks but fell apart on multi-step problems. Still, seeing a local model go head-to-head with cloud APIs in the same terminal — for free — is satisfying.
Advanced Features: AI Judge and Benchmark Suites
AI Judge Scoring
Add --judge to have a separate model evaluate and score each response with reasoning:
```bash
npx yardstiq "Implement an LRU cache" -m claude-sonnet -m gpt-4o --judge
```

This removes the subjectivity of eyeballing outputs and gives you consistent scoring criteria.
YAML Benchmark Suites
For systematic evaluation, define prompt suites in YAML:
```yaml
name: coding-eval
prompts:
  - category: algorithms
    text: "Implement an LRU cache in Python"
  - category: debugging
    text: "Find the bug in this code: ..."
```

Run the suite:
```bash
yardstiq benchmark run ./coding-eval.yaml -m claude-sonnet -m gpt-4o -m deepseek
```

Results export to JSON, Markdown, or self-contained HTML reports.
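Those exports make post-processing easy. Assuming a JSON report shaped roughly like the list below (the field names are my guess for illustration, not yardstiq's documented schema), averaging judge scores per model takes a few lines:

```python
import json
from collections import defaultdict

# Hypothetical report shape: a list of {model, category, score} entries
report = json.loads("""
[
  {"model": "claude-sonnet", "category": "algorithms", "score": 9},
  {"model": "gpt-4o", "category": "algorithms", "score": 8},
  {"model": "claude-sonnet", "category": "debugging", "score": 8},
  {"model": "gpt-4o", "category": "debugging", "score": 9}
]
""")

scores_by_model = defaultdict(list)
for entry in report:
    scores_by_model[entry["model"]].append(entry["score"])

averages = {m: sum(s) / len(s) for m, s in scores_by_model.items()}
print(averages)
```

Dumping results into version control alongside the YAML suite also gives you a cheap regression check when a provider silently updates a model.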
Local Model Comparisons With Ollama
Compare local models against cloud APIs at zero cost for the local side:
```bash
yardstiq "hello" -m local:llama3.2 -m local:mistral -m claude-sonnet
```

Key Takeaways From Hundreds of LLM Comparisons
There is no single best model. There's a best model for a specific task at a specific price point.
Here's my working framework after months of side-by-side testing:
- Coding tasks: Claude Sonnet for quality, DeepSeek for bulk generation at low cost
- Creative writing: Run Claude and GPT-4o side-by-side and pick the voice you prefer
- Quick questions: Claude Haiku or GPT-4o Mini at roughly a tenth of the cost
- Research and reasoning: GPT-4.1 or Claude Sonnet, with reasoning models for edge cases
Time to first token matters more than you think. When you're in flow, the difference between 300ms and 700ms TTFT feels significant. yardstiq's performance table made me much more aware of this and changed which models I default to for interactive use cases.
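If you're streaming responses, TTFT is easy to measure yourself: clock the gap between sending the request and receiving the first chunk. Here's a generic sketch against any token iterator (using a simulated stream, since the real call depends on your provider's SDK):

```python
import time
from typing import Iterable, Iterator

def simulated_stream() -> Iterator[str]:
    # Stand-in for a provider's streaming response
    time.sleep(0.05)  # network + model latency before the first token
    yield "Hello"
    for token in [",", " world"]:
        time.sleep(0.01)
        yield token

def measure_ttft(stream: Iterable[str]) -> tuple[float, str]:
    start = time.monotonic()
    it = iter(stream)
    first = next(it)                 # blocks until the first token arrives
    ttft = time.monotonic() - start  # time to first token, in seconds
    text = first + "".join(it)
    return ttft, text

ttft, text = measure_ttft(simulated_stream())
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Total tokens per second tells you how long the full answer takes; TTFT tells you how long the user stares at a blank screen. They often favor different models.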
Try yardstiq
yardstiq is MIT licensed and works with one command:
```bash
npx yardstiq "your prompt" -m claude-sonnet -m gpt-4o
```

If you're picking between models for a project, or just curious how they stack up, it takes about 10 seconds to get your first comparison.