What is the Skill Score?

The Skill Score is skillsdirectory.co's 0–10 rating of how good an AI-agent skill actually is. Version 2 leads with measured effectiveness — whether the skill produces materially better output than a frontier model alone — combined with safety, maintenance, and documentation. GitHub popularity is shown as context and given almost no weight.

How is the Skill Score calculated?

When a skill has been benchmarked: effectiveness 40%, safety 25%, maintenance 15%, documentation 15%, adoption (popularity) 5%. Effectiveness is the skill's win rate against the same model with a strong baseline prompt, judged blind. Until a skill is benchmarked the score is provisional — safety 40%, maintenance 30%, documentation 30%, popularity 0% — and labelled as such.

How do you measure whether a skill is actually better than ChatGPT or Claude alone?

We run a small suite of verifiable tasks twice: once with the skill, once with the same frontier model and a strong baseline prompt. A separate judge model compares the two outputs blind, with the order swapped to remove position bias. The skill's win rate over the baseline — with a confidence interval — is its effectiveness. A skill that ties the baseline is labelled 'no real edge': you'd get the same result from the model directly.

Can vendors pay for a higher Skill Score?

No. The score is computed from public signals and our own benchmark on a fixed schedule. Vendors cannot influence it. Sponsored placements exist, but they are labelled and sit outside organic results.

Methodology

How the Skill Score is computed

Methodology v2 · trust-score-v2 · Last reviewed 2026-05-28 · Authored by skillsdirectory

Every listing carries a Skill Score from 0 to 10 — our rating of how good the skill actually is. Version 2 is built around one question a star count can never answer: does this produce materially better output than the model you already have? We measure that directly by running the skill against a frontier-model baseline. Effectiveness leads the score; safety, maintenance, and documentation round it out; popularity is context, not quality. This page is the full rubric and the benchmark method — published, because a score whose method is hidden is just an opinion with a logo on it.

/01

The four dimensions

When a skill has been benchmarked, the Skill Score is a weighted composite of an effectiveness measurement plus three hygiene sub-scores, each rated 0 to 100. Adoption is added as a small, capped input for context. Until a skill is benchmarked, its score is provisional (see below).

Effectiveness · 40%

The headline signal, and the one a star count can never give you: does the skill produce materially better output than the same frontier model with a strong prompt and no skill? We run a suite of verifiable tasks with the skill and against a matched baseline, judge the two outputs blind (order-swapped to kill position bias), and take the skill's win rate over the baseline, with a confidence interval. A skill that merely ties the baseline scores ~50 here and is labelled no real edge. See the benchmark section below.

Safety · 25%

Does the skill execute arbitrary code? Does it ship credentials safely? Does it respect the agent's sandbox, or break out of it? Does the install path require piping a script from a stranger's domain into a shell? Static analysis reads the source code, not just the docs, and combines results with npm audit / package-registry security advisories where applicable.

Maintenance · 15%

Recency of the last commit, release cadence, ratio of closed to open issues, and contributor count over a 90-day window. A repo with one commit two years ago drops here even if everything else looks good — abandoned skills become risks when their dependencies bit-rot.

Documentation · 15%

Does the README explain what the skill does, when to use it, and how to install it cleanly? Are there working examples? Does the install path match the platform's conventions? Documentation is read by a language model and graded against a sample of representative user queries.

Adoption · 5% (context only)

GitHub stars, forks, and contributors, capped so popularity can never dominate. In v1 this was 20% of the score, which made the rating little more than a star count in a nicer font — the reason we pulled the old score. In v2 it is a tie-breaker at most, and in the provisional regime it carries zero weight, so a popular but unverified skill cannot buy a high score.

Benchmarked: 0.40·effectiveness + 0.25·safety + 0.15·maintenance + 0.15·documentation + 0.05·adoption.

Provisional (not yet benchmarked): 0.40·safety + 0.30·maintenance + 0.30·documentation. Popularity is zeroed, and the listing is clearly labelled provisional so you know effectiveness hasn't been measured yet. We never invent an effectiveness number. The band (High Quality, Solid, or Caution) comes from the overall score: ≥7.5 is High Quality, ≥5.5 is Solid, and anything lower is Caution. Both regimes rescale to the 0–10 display range.
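The two weightings and the band cut-offs can be sketched in a few lines of Python. This is a minimal illustration, not the production scorer: the function and constant names are hypothetical, and it assumes the 0–100 composite is rescaled to 0–10 by dividing by ten and that the band is read from the unrounded value, neither of which the rubric spells out.

```python
# Weights are taken from the rubric; everything else is illustrative.
BENCHMARKED_WEIGHTS = {
    "effectiveness": 0.40,
    "safety": 0.25,
    "maintenance": 0.15,
    "documentation": 0.15,
    "adoption": 0.05,
}
PROVISIONAL_WEIGHTS = {
    "safety": 0.40,
    "maintenance": 0.30,
    "documentation": 0.30,
}


def band_for(overall: float) -> str:
    """Map a 0-10 overall score to its band."""
    if overall >= 7.5:
        return "High Quality"
    if overall >= 5.5:
        return "Solid"
    return "Caution"


def skill_score(subscores: dict, benchmarked: bool) -> dict:
    """Combine 0-100 sub-scores into a 0-10 overall score and a band.

    When `benchmarked` is False, effectiveness and adoption are ignored
    even if present, matching the provisional regime.
    """
    weights = BENCHMARKED_WEIGHTS if benchmarked else PROVISIONAL_WEIGHTS
    for name in weights:
        value = subscores[name]
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    composite = sum(weight * subscores[name] for name, weight in weights.items())
    overall = composite / 10.0  # assumed rescale from 0-100 to 0-10
    return {
        "overall": overall,
        "band": band_for(overall),
        "provisional": not benchmarked,
    }
```

Calling `skill_score` with the same sub-scores in both regimes shows how much of a listing's rating depends on the measured effectiveness rather than on hygiene alone.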

/01b

The benchmark — lift over baseline

Effectiveness is the part GitHub can't replicate, so it is worth explaining in full. For each benchmarked skill we assemble a small suite of verifiable tasks in its capability — contract clauses with a known risk, invoices with known line items, code with a planted bug, datasets with a known answer. Then, for every task:

Arm A runs the skill (executed in a sandbox where we can, otherwise faithfully reproduced from its instructions — the page says which).
Arm B gives the same frontier model a strong, fair baseline prompt and no skill.
A separate judge model compares A and B blind, scoring correctness, completeness, specificity, and format. We run the comparison twice with the order flipped and only count a win if the same arm wins both times.
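The order-swap rule in the last step can be sketched as follows. It assumes a hypothetical `judge` callable that sees two anonymous outputs and answers "first", "second", or anything else for no preference; the real judge interface is not described on this page.

```python
from typing import Callable

# Hypothetical judge signature: (first_output, second_output) -> "first" | "second" | other
Judge = Callable[[str, str], str]


def order_swapped_verdict(skill_output: str, baseline_output: str, judge: Judge) -> str:
    """Return "skill", "baseline", or "tie" after judging in both orders.

    An arm wins only if the judge prefers it in both presentations; any
    disagreement between the two passes is treated as a tie.
    """
    pass_one = judge(skill_output, baseline_output)  # skill shown first
    pass_two = judge(baseline_output, skill_output)  # skill shown second
    winner_one = {"first": "skill", "second": "baseline"}.get(pass_one, "tie")
    winner_two = {"first": "baseline", "second": "skill"}.get(pass_two, "tie")
    return winner_one if winner_one == winner_two else "tie"
```

A judge that always prefers whichever output it sees first produces a tie on every task under this rule, which is exactly the position bias the swap is meant to neutralise.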

The skill's win rate over the baseline, with a 95% confidence interval, is its effectiveness. We pre-register the bar for “materially better” before running, so we can't move the goalposts. We also report reliability (does it win consistently across repeated trials, or only sometimes?) and cost — a 1% quality gain at 10× the cost is not a win.
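As an illustration of what a win rate with a 95% interval looks like on a small suite, here is a sketch using the Wilson score interval. The page does not say which interval method is used, so Wilson is an assumption, chosen because it behaves sensibly with few trials. How ties enter the count is also left to the caller.

```python
import math


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= wins <= trials:
        raise ValueError("wins must be between 0 and trials")
    p = wins / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))


def win_rate_report(wins: int, trials: int) -> dict:
    """Bundle the point estimate with its interval."""
    low, high = wilson_interval(wins, trials)
    return {"win_rate": wins / trials, "ci_low": low, "ci_high": high}
```

With a suite of twenty tasks the interval is wide, so a skill needs a clear majority of wins before the lower bound rises above 0.5, the value a skill that merely ties the baseline would score.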

Crucially, this catches the skill that is really just a good prompt. If the skill's edge disappears once we hand its own instructions to the bare model, we say so on the page: most of this skill's value is its prompt. That is the most honest, least gameable thing we can tell you, and it is the whole point of the number.

/02

What the score is measuring

The Skill Score answers "How good is this skill?", not "Can I trust this person?". It is a quality signal — our opinion, derived from public signals — about whether the skill does what it says, whether it does it well, and whether it will keep working as the surrounding ecosystem changes.

We are explicit about that framing because the score is not a security guarantee. The safety sub-score reads code statically and flags the most common patterns of trouble; it cannot catch a skill that does something dangerous only when triggered by a specific input. Treat the score as a strong signal that lets you skip the obviously bad skills and prioritise the obviously good ones — not as a substitute for reading the install command before running it.

/03

The signals we read

For each indexed skill we collect a fixed set of signals from the source repository and supporting registries. Nothing we publish on a skill page is invented; every fact comes from one of these places.

Source URL, license file, and topic tags from the canonical repository.
Stars, forks, contributor count, open and closed issue counts.
Last commit date and release-tag cadence.
The README and any marker file (SKILL.md, mcp.json, .cursorrules, .windsurfrules, etc.).
The skill's own source code, read for safety review.
Package-registry metadata (npm, PyPI) where the skill is also published.
Reviews left by verified installers through this site, once the skill has accumulated any.

/04

The flag system

A skill can carry a flag at one of three severities. Flags are surfaced at the top of the skill page so you see them before you copy an install command.

Info

Worth noting but not blocking. Examples: single-maintainer bus factor, recent ownership transfer, requires a paid third-party API to function.

Warning

The skill works but does something you should review before installing. Examples: executes scraped HTML in a Node VM, expects unrestricted credentials, runs as an admin user.

Critical

The skill is unsafe in its current state and should not be installed without changes. Examples: posts user-supplied data to an unidentified endpoint, ships a credential in source, vulnerable to a public CVE. Critical flags downgrade a skill to Tier C and remove it from the index.
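One way to model the three severities is an ordered enum, so flags can be sorted and checked uniformly. The type and function names here are hypothetical, and showing the most severe flag first is an assumption about display order rather than documented behaviour.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Flag:
    severity: Severity
    reason: str


def sort_for_display(flags: list) -> list:
    """Most severe first (assumed ordering for the top of the skill page)."""
    return sorted(flags, key=lambda flag: flag.severity, reverse=True)


def forces_tier_c(flags: list) -> bool:
    """A single critical flag is enough to downgrade the skill to Tier C."""
    return any(flag.severity is Severity.CRITICAL for flag in flags)
```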

/05

What we deliberately do not score

The effectiveness benchmark measures output quality on representative tasks, not your exact task. The score tells you whether a skill tends to beat the model alone and is safe to run; it can't predict whether it's the right fit for your specific workflow.

We grade output quality by blind comparison against a baseline — not aesthetic or stylistic preference. There is no “we liked the tone” sub-score.
We don't compare paid skills against free ones on cost alone. License and pricing are surfaced as facts, not weighted into the score.
We don't penalise low star counts on their own. A new skill with strong maintenance and clean documentation can sit at High Quality.
We never let a vendor pay for a higher score, and the score is recomputed on a schedule independent of any commercial relationship.

/06

The tier gate

The score feeds into a tier decision. Three tiers exist:

Tier A — listed and indexed by search engines. Requires complete metadata, a valid license, no critical flags, an overall score of 5.5 or higher, and at least one real-world signal (≥25 stars, recent commits within 180 days, presence in a known registry, or a verified submission).
Tier B — listed in our directory but noindex until it earns indexing. A real skill that hasn't yet accumulated enough signal to be ranked.
Tier C — rejected. Broken, empty, joke, malicious, near-duplicate of a canonical entry, or carrying a critical safety flag.

Borderline decisions go to a moderation queue rather than getting auto-published. That is the only manual step in the pipeline — every other decision is reproducible from the published rubric.
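The gate above can be read as a small decision function. The sketch below uses a hypothetical `Listing` record with illustrative field names. It treats "within 180 days" as inclusive and leaves out the moderation queue, because the page does not define what counts as borderline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    metadata_complete: bool
    license_valid: bool
    has_critical_flag: bool
    rejected: bool  # broken, empty, joke, malicious, or near-duplicate
    score: float  # overall Skill Score, 0-10
    stars: int
    days_since_last_commit: Optional[int]
    in_known_registry: bool
    verified_submission: bool


def has_real_world_signal(listing: Listing) -> bool:
    """At least one of the four real-world signals named in the Tier A rule."""
    recent_commit = (
        listing.days_since_last_commit is not None
        and listing.days_since_last_commit <= 180
    )
    return (
        listing.stars >= 25
        or recent_commit
        or listing.in_known_registry
        or listing.verified_submission
    )


def tier(listing: Listing) -> str:
    """Return "A", "B", or "C" for a listing."""
    if listing.rejected or listing.has_critical_flag:
        return "C"
    qualifies_for_a = (
        listing.metadata_complete
        and listing.license_valid
        and listing.score >= 5.5
        and has_real_world_signal(listing)
    )
    return "A" if qualifies_for_a else "B"
```

Note that a high score alone does not reach Tier A in this sketch: a listing with no stars, no recent commits, no registry presence, and no verified submission stays at Tier B.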

/07

Refresh cadence

Every Tier-A skill is re-scored at least weekly. Repository signals (stars, commits, issues) refresh nightly. Skill scans re-run when the source code's content hash changes, so a skill that ships a security fix sees its score move quickly. When a skill's tier or critical-flag status changes, the affected listing pages rebuild automatically.

The "Last verified" date on each skill page is the timestamp of its most recent successful refresh. If you see a date older than two weeks, the refresh job failed for that skill and we surface that as an info flag.
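Two of the checks described here, re-scanning when the content hash changes and flagging a stale "Last verified" date, can be sketched as below. SHA-256 over sorted file paths is an assumed hashing scheme, and all function names are hypothetical; the page only says that a content hash is compared.

```python
import hashlib
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(weeks=2)  # threshold taken from the text above


def content_hash(files: dict) -> str:
    """Hash a mapping of path -> file bytes, independent of insertion order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def needs_rescan(previous_hash: str, files: dict) -> bool:
    """True when the source no longer matches the last scanned hash."""
    return content_hash(files) != previous_hash


def is_stale(last_verified: datetime, now: datetime) -> bool:
    """True when the last successful refresh is more than two weeks old."""
    return now - last_verified > STALE_AFTER
```

In this sketch, a listing whose `is_stale` check returns true is the case the text describes as earning an info flag.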

/08

Honest limitations

A few things to keep in mind so the score is useful rather than oversold:

The safety review reads code statically. A skill that does dangerous things only when it sees a specific input may pass review and still be unsafe. Treat the score as a strong signal, not a guarantee.
Documentation grading uses a language model. We bias toward false positives there — a poorly-described skill is easier to live with than a missed-warning one.
For skills that aren't open source — paid Custom GPTs, OpenAI Apps, hosted Zapier templates — we score what we can see. The skill's actual implementation is closed, so the safety sub-score weights the platform's vendor-level controls rather than line-level static analysis.
Brand-new skills with no signal default to Tier B until they accumulate either real-user activity or a verified author submission. A new skill being at Tier B is not a judgment about its quality.