skillsdirectory

Methodology

How the Skill Score is computed

Every listing in this directory carries a numeric Skill Score from 0 to 10 — our proprietary rating of how good the skill actually is at what it claims to do. The score exists for one reason: to help you skip the dead, broken, or low-quality skills and reach a working one faster. This page is the full rubric, the signals it draws from, and the decisions it deliberately does not make.

/01

The four dimensions

The Skill Score is a weighted composite of four sub-scores, each rated 0 to 100. The weights and the signals each sub-score reads from are listed below. Nothing is secret — we publish the rubric because a score whose method is hidden is just an opinion with a logo on it.

Safety · 35%

Does the skill execute arbitrary code? Does it handle credentials safely? Does it respect the agent's sandbox, or break out of it? Does the install path require piping a script from a stranger's domain into a shell? Static analysis reads the source code, not just the docs, and combines the results with npm audit and package-registry security advisories where applicable.
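As an illustration of the install-path question above, here is a minimal sketch of one such static check. The pattern and the function name are hypothetical and written for this page only; this is not the rule set the pipeline runs, and a real check would cover far more variants.

```python
import re

# Hypothetical pattern: a curl or wget download whose output is piped
# straight into a shell, optionally through sudo.
PIPE_TO_SHELL = re.compile(
    r"\b(?:curl|wget)\b[^|\n]*\|\s*(?:sudo\s+)?(?:ba|z|da)?sh\b"
)


def pipes_remote_script_to_shell(install_command: str) -> bool:
    """Return True if the command downloads something and pipes it to a shell."""
    return PIPE_TO_SHELL.search(install_command) is not None


print(pipes_remote_script_to_shell("curl -fsSL https://example.com/install.sh | sh"))
print(pipes_remote_script_to_shell("npm install some-skill"))
```

A single regular expression like this misses many variants (a download saved to disk and executed later, or a pipe into another interpreter), which is one reason a check of this kind can only ever be one signal among several.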

Maintenance · 25%

Recency of the last commit, release cadence, ratio of closed to open issues, and contributor count over a 90-day window. A repo with one commit two years ago drops here even if everything else looks good — abandoned skills become risks when their dependencies bit-rot.

Documentation · 20%

Does the README explain what the skill does, when to use it, and how to install it cleanly? Are there working examples? Does the install path match the platform's conventions? Documentation is read by a language model and graded against a sample of representative user queries.

Adoption · 20%

GitHub stars, forks, and contributor count, capped logarithmically so a 50k-star repo doesn't blow out the rest of the score. Star count alone never gets a skill into Tier A — adoption is a smoothing signal, not a decision signal. The number is there to break ties between two otherwise similar skills.
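To show what a logarithmic cap does in practice, here is a minimal sketch. The exact curve is not stated on this page, so the log10 mapping, the 50,000-star saturation point, and the function name are all assumptions. The sketch also looks at stars only, whereas the sub-score described above reads forks and contributor count as well.

```python
import math


def adoption_subscore(stars: int, saturation: int = 50_000) -> float:
    """Map a star count to 0-100 on a log scale (assumed curve, not the real one)."""
    if stars <= 0:
        return 0.0
    # Anything above the saturation point earns no extra credit.
    capped = min(stars, saturation)
    return 100.0 * math.log10(1 + capped) / math.log10(1 + saturation)


for stars in (25, 500, 5_000, 50_000, 200_000):
    print(stars, round(adoption_subscore(stars), 1))
```

Under these assumptions a 500-star repository already earns more than half of the available adoption credit, and a 200,000-star repository scores the same as a 50,000-star one.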

The overall score is 0.35·safety + 0.25·maintenance + 0.20·documentation + 0.20·adoption, rescaled from the 0–100 sub-score range to the 0–10 display range. The band comes from the overall score: High Quality at ≥7.5, Solid at ≥5.5, and Caution below that.
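The composite and the band thresholds can be sketched in a few lines. The weights and thresholds come from this page. The function names are hypothetical, and dividing by 10 is an assumption about how the rescaling to the display range is done.

```python
from typing import Mapping

# Weights as published in the rubric; they sum to 1.0.
WEIGHTS = {
    "safety": 0.35,
    "maintenance": 0.25,
    "documentation": 0.20,
    "adoption": 0.20,
}


def overall_score(subscores: Mapping[str, float]) -> float:
    """Weighted composite of 0-100 sub-scores, rescaled to 0-10 (assumed: divide by 10)."""
    weighted = sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)
    return weighted / 10.0


def band(overall: float) -> str:
    """Map a 0-10 overall score to its display band."""
    if overall >= 7.5:
        return "High Quality"
    if overall >= 5.5:
        return "Solid"
    return "Caution"


example = {"safety": 90, "maintenance": 80, "documentation": 70, "adoption": 40}
print(round(overall_score(example), 2), band(overall_score(example)))
```

The example works out to 31.5 + 20 + 14 + 8 = 73.5 on the 0–100 scale, or 7.35 on the display scale, which lands in Solid. A skill with perfect safety, maintenance, and documentation but an adoption sub-score of 0 would reach 8.0 under the same arithmetic, so a low star count by itself does not keep a skill out of High Quality.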

/02

What the score is measuring

The Skill Score answers "How good is this skill?", not "Can I trust this person?". It is a quality signal — our opinion, derived from public signals — about whether the skill does what it says, whether it does it well, and whether it will keep working as the surrounding ecosystem changes.

We are explicit about that framing because the score is not a security guarantee. The safety sub-score reads code statically and flags the most common patterns of trouble; it cannot catch a skill that does something dangerous only when triggered by a specific input. Treat the score as a strong signal that lets you skip the obviously bad skills and prioritise the obviously good ones — not as a substitute for reading the install command before running it.

/03

The signals we read

For each indexed skill we collect a fixed set of signals from the source repository and supporting registries. Nothing we publish on a skill page is invented; every fact comes from one of these places.

  • Source URL, license file, and topic tags from the canonical repository.
  • Stars, forks, contributor count, open and closed issue counts.
  • Last commit date and release-tag cadence.
  • The README and any marker file (SKILL.md, mcp.json, .cursorrules, .windsurfrules, etc.).
  • The skill's own source code, read for safety review.
  • Package-registry metadata (npm, PyPI) where the skill is also published.
  • Reviews left by verified installers through this site, once the skill has accumulated any.

/04

The flag system

A skill can carry a flag at one of three severities. Flags are surfaced at the top of the skill page so you see them before you copy an install command.

Info

Worth noting but not blocking. Examples: single-maintainer bus factor, recent ownership transfer, requires a paid third-party API to function.

Warning

The skill works but does something you should review before installing. Examples: executes scraped HTML in a Node VM, expects unrestricted credentials, runs as an admin user.

Critical

The skill is unsafe in its current state and should not be installed without changes. Examples: posts user-supplied data to an unidentified endpoint, ships a credential in source, vulnerable to a public CVE. Critical flags downgrade a skill to Tier C and remove it from the index.
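One possible way to model the three severities is sketched below. The type names and the most-severe-first display order are assumptions made for illustration; this page only states that flags appear at the top of the skill page and that a critical flag forces Tier C.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Flag:
    severity: Severity
    reason: str


def sort_for_display(flags: Iterable[Flag]) -> List[Flag]:
    """Order flags most severe first (assumed display order)."""
    return sorted(flags, key=lambda flag: flag.severity, reverse=True)


def has_critical(flags: Iterable[Flag]) -> bool:
    """A single critical flag is enough to send a skill to Tier C."""
    return any(flag.severity is Severity.CRITICAL for flag in flags)


flags = [
    Flag(Severity.INFO, "single-maintainer bus factor"),
    Flag(Severity.CRITICAL, "ships a credential in source"),
]
print([flag.reason for flag in sort_for_display(flags)], has_critical(flags))
```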

/05

What we deliberately do not score

The Skill Score is about whether a skill will work and is safe. It is not a quality-of-output score for the agent that uses it, and it does not try to predict whether the skill is right for your specific use case.

  • We don't grade aesthetic preferences — there is no "good prompt" sub-score.
  • We don't compare paid skills against free ones on cost alone. License and pricing are surfaced as facts, not weighted into the score.
  • We don't penalise low star counts on their own. A new skill with strong maintenance and clean documentation can sit at High Quality.
  • We never let a vendor pay for a higher score, and the score is recomputed on a schedule independent of any commercial relationship.

/06

The tier gate

The score feeds into a tier decision. Three tiers exist:

  • Tier A — listed and indexed by search engines. Requires complete metadata, a valid license, no critical flags, an overall score of 5.5 or higher, and at least one real-world signal (≥25 stars, recent commits within 180 days, presence in a known registry, or a verified submission).
  • Tier B — listed in our directory but noindex until it earns indexing. A real skill that hasn't yet accumulated enough signal to be ranked.
  • Tier C — rejected. Broken, empty, joke, malicious, near-duplicate of a canonical entry, or carrying a critical safety flag.

Borderline decisions go to a moderation queue rather than getting auto-published. That is the only manual step in the pipeline — every other decision is reproducible from the published rubric.
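The tier rules above can be read as a small decision function. The sketch below is one such reading: the field names are hypothetical, the Tier C content checks are collapsed into a single boolean, and the routing of borderline cases to the moderation queue is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    overall: float                 # 0-10 Skill Score
    metadata_complete: bool
    license_valid: bool
    has_critical_flag: bool
    rejected_as_invalid: bool      # broken, empty, joke, malicious, or near-duplicate
    stars: int
    days_since_last_commit: int
    in_known_registry: bool
    verified_submission: bool


def has_real_world_signal(c: Candidate) -> bool:
    """Any one of the four published signals is enough."""
    return (
        c.stars >= 25
        or c.days_since_last_commit <= 180
        or c.in_known_registry
        or c.verified_submission
    )


def tier(c: Candidate) -> str:
    """Return 'A', 'B', or 'C' following the published tier gate."""
    if c.rejected_as_invalid or c.has_critical_flag:
        return "C"
    if (
        c.metadata_complete
        and c.license_valid
        and c.overall >= 5.5
        and has_real_world_signal(c)
    ):
        return "A"
    return "B"
```

Checking the Tier C conditions first mirrors the rule that a critical flag overrides everything else: a skill with an overall score of 9.0 and a critical flag still comes out as Tier C.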

/07

Refresh cadence

Every Tier-A skill is re-scored at least weekly. Repository signals (stars, commits, issues) refresh nightly. Skill scans re-run when the source code's content hash changes, so a skill that ships a security fix sees its score move quickly. When a skill's tier or critical-flag status changes, the affected listing pages rebuild automatically.
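The content-hash trigger can be pictured with the sketch below. SHA-256, the path-sorted traversal, and the function names are assumptions; this page does not say which hash the pipeline uses or which files it covers.

```python
import hashlib
from typing import Mapping


def content_hash(files: Mapping[str, bytes]) -> str:
    """Hash file paths and contents in a stable order (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")          # separator between path and content
        digest.update(files[path])
        digest.update(b"\0")          # separator between files
    return digest.hexdigest()


def needs_rescan(previous_hash: str, files: Mapping[str, bytes]) -> bool:
    """Re-run the skill scan only when the source content has changed."""
    return previous_hash != content_hash(files)


before = {"SKILL.md": b"# Example skill\n", "index.js": b"eval(input)\n"}
after = {"SKILL.md": b"# Example skill\n", "index.js": b"JSON.parse(input)\n"}
print(needs_rescan(content_hash(before), after))
```

Sorting the paths keeps the hash stable when the same files arrive in a different order, so only a real content change triggers a new scan.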

The "Last verified" date on each skill page is the timestamp of its most recent successful refresh. If you see a date older than two weeks, the refresh job failed for that skill and we surface that as an info flag.
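The two-week rule can be expressed as a simple comparison. The names below are hypothetical, and the sketch assumes timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(weeks=2)


def is_stale(last_verified: datetime, now: datetime) -> bool:
    """True when the last successful refresh is more than two weeks old."""
    return now - last_verified > STALE_AFTER


now = datetime(2025, 3, 20, tzinfo=timezone.utc)
print(is_stale(datetime(2025, 3, 1, tzinfo=timezone.utc), now))
print(is_stale(datetime(2025, 3, 15, tzinfo=timezone.utc), now))
```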

/08

Honest limitations

A few things to keep in mind so the score is useful rather than oversold:

  • The safety review reads code statically. A skill that does dangerous things only when it sees a specific input may pass review and still be unsafe. Treat the score as a strong signal, not a guarantee.
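  • Documentation grading uses a language model, and we let it err toward false positives: it may pass documentation that is weaker than it looks. We accept that trade-off because a poorly described skill is easier to live with than one whose safety warning was missed.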
  • For skills that aren't open source — paid Custom GPTs, OpenAI Apps, hosted Zapier templates — we score what we can see. The skill's actual implementation is closed, so the safety sub-score weights the platform's vendor-level controls rather than line-level static analysis.
  • Brand-new skills with no signal default to Tier B until they accumulate either real-user activity or a verified author submission. A new skill being at Tier B is not a judgment about its quality.

The full source for the score computation lives in packages/pipelines/src/stages/ in this project's repository. If you spot a scoring error on a specific skill, submit a correction — the moderation queue treats corrections at the same priority as new-skill submissions.