Skip to main content

How GigRadar Chooses the AI That Writes Your Proposals

Written by Vadym O

GigRadar's AI bidder, Sardor, writes your Upwork cover letters and answers screening questions for you. A fair question to ask is: which AI is doing the writing, and how do you know it's any good?

The honest answer is that we don't take any model's word for it. Every model that's a candidate to write your proposals has to earn the job first — by scoring well on real Upwork jobs, judged the same strict way every time. This article explains exactly how that works.

In short: We give each AI model the same real jobs, have an independent judge score every proposal against a clear definition of "good," and run the whole thing repeatedly so the results hold up. The highest-scoring all-rounder is recommended by default — and you can always pick a different model to match how you bid.

The three pillars behind every model's grade: judged on defined criteria, handles AI traps and puzzles, run repeatedly


How we test the AI

The test is built to mirror what actually happens when you bid. For each model, we do three things:

  1. Start from a real job. We use a broad range of real Upwork job posts across different categories and skill sets — the same kinds of jobs you bid on every day.

  2. Generate a proposal. The model writes a full cover letter for that job, exactly as it would for you.

  3. Score it against a clear bar. We define up front what a strong proposal looks like. An independent AI judge then grades each letter on a 0-to-100 scale for how well it hits that bar — rewarding genuine relevance and personalization, and penalizing the signs of low-effort, machine-written applications.

Crucially, every proposal is graded by the same fixed, independent judge applying the same rubric — so all four models are measured on one consistent yardstick, never on gut feel. Repeat this across more than 1,200 graded proposals, and a clear, fair picture of each model emerges.

What every proposal has to nail

Two things separate a proposal that wins interviews from one that gets ignored. We score both.

The Score: is it actually a good proposal?

The Score (0–100) is the heart of the benchmark. The judge rewards proposals that are genuinely relevant and personalized to the specific job, and penalizes the tell-tale signs of low-effort, machine-written applications:

  • Unfilled placeholders (a literal "[Your Name]")

  • Emojis and generic filler

  • Made-up links that weren't in the job post

  • The wrong client or company name

  • The wrong language for the job

A higher Score means a proposal that reads like a thoughtful human wrote it for that exact client. This is the clearest differentiator between models.

AI traps & puzzles: does it stay sharp?

Some clients hide instructions inside their job posts — "begin your proposal with the word pineapple," or little puzzles designed to expose lazy, automated bidders. A strong model reads the post carefully, follows the genuine instruction, and refuses the bad-faith ones — instead of blindly falling for the trap.

We test how often each model navigates these correctly. The good news: every model we tested handles the large majority of them.

Why you can trust the Score

AI scoring can be slippery. Ask a model to grade the same proposal twice and a careless setup will give you two different answers — and suddenly your "results" are just noise. The judge is the most important part of this benchmark, so we built it to be strict and repeatable.

  • A clear definition of success. What counts as a strong proposal is defined in advance, as a fixed rubric. The judge applies that same rubric to every model, so "good" means the same thing every time — not whatever the judge feels like in the moment.

  • An independent, consistent judge. One dedicated AI judge grades every proposal, locked to its steadiest setting. Re-grade the same proposal and it returns the same Score. The grading is reproducible, not a coin flip.

  • Run repeatedly. Because the writing itself varies a little each time, we don't trust a single attempt. Each test is generated and scored multiple times over, and we only stand behind a result that holds up across the repeats.

  • Only real gaps count. Every Score carries a margin of error. If two models land inside it, we treat them as a tie rather than inventing a ranking — so the differences we report are real, not random.

That last point is why you'll sometimes see two models listed as effectively equal. That's not indecision; it's honesty about what the data can actually prove.

What we found

Here's the high-level picture from the latest benchmark (Score is the 0–100 judge grade; AI-trap handling is how reliably each model navigates hidden client instructions):

How the models compare on Score: GPT-5.1 71, Claude Sonnet 4.6 66, Claude Haiku 4.5 62, GPT-4o 54

Model

Score (0–100)

AI-trap handling

Best for

GPT-5.1

71 — highest

Strong

The default — best all-round results

Claude Sonnet 4.6

66

Strong

A strong Anthropic alternative

Claude Haiku 4.5

62

Strong

Fast, lightweight drafting

GPT-4o

54

Strong

A stable, familiar baseline

A few takeaways:

  • GPT-5.1 is the strongest all-rounder — clearly the highest-scoring model. It's the recommended default.

  • All models handle AI traps well. On this test the differences between them were inside the margin of error, so no single model stands out as dramatically better — they're all dependable here.

  • Newer isn't automatically better, and bigger isn't either. The rankings come from measurement, not reputation, which is the whole point of running the benchmark.

These numbers are a snapshot in time. We re-run the benchmark whenever models or prompts change, so the recommendation stays current.

Memory: two ways Sardor can write

You'll see Sardor offered in two modes, which differ by one thing — memory:

  • Sardor (memory off) — the base writer. It works only from the current job and your profile.

  • Laziza (memory on) — the same writer, plus your saved preferences. It checks each proposal against the style and rules you've taught it (for example, "keep it under 150 words" or "don't open with a generic greeting") and adjusts the letter to match.

Memory makes proposals more personal and on-brand. How much it helps depends on how much you've taught it — which is why both modes are available to you.

A note on honesty

No benchmark is perfect, and we'd rather be upfront about ours:

  • The Score is a relative signal — it's best for comparing models against each other, not as an absolute grade out of 100.

  • AI-trap handling is the hardest thing to measure precisely, which is why we report it as "all strong" rather than ranking models that are statistically tied.

  • Results are a point-in-time snapshot. As models improve, we re-test and update.

We think being clear about the limits is part of what makes the rest of the numbers trustworthy.

How to choose your model

  • Want the best results with no fuss? Stick with the recommended default — the highest-scoring model.

  • Want Sardor to follow your personal style rules? Turn on memory (Laziza).

  • Curious to experiment? Every tested model is available in your scanner settings. They've all cleared the same bar, so you're choosing between genuinely tested options — not marketing claims.

Frequently asked questions

Do you just use ChatGPT?

No. We test several leading models from different providers and pick the best performer, then keep re-testing as new models are released.

Are these Scores made up?

No. Every Score comes from running a real job through the model and having one consistent, independent judge grade the result against the same pre-defined rubric. We run each test repeatedly and only report differences large enough to be real.

Will my recommended model change over time?

It can. As models improve and new ones launch, we re-run the benchmark and update the lineup so you're always pointed at the best current option.

Can I override the recommendation?

Yes. The recommended model is a sensible default, but you're free to choose any available model in your scanner settings.

Did this answer your question?