
How Skillful.sh Aggregates Data from 50+ Directories

Aggregating AI tool data from over fifty different directories involves crawling, normalization, deduplication, and enrichment. A look at how the process works and why it matters.

March 12, 2026 · Basel Ismail
Tags: skillful · aggregation · platform · data

The Source Landscape

The AI tool ecosystem is spread across dozens of directories, registries, and listing services. npm has thousands of MCP-related packages. GitHub hosts tens of thousands of repositories. Dedicated registries like Smithery, Glama, and mcp.so each maintain their own curated collections. Community-maintained awesome lists on GitHub provide yet another source. And new directories continue to emerge as the ecosystem grows.

Each source has its own strengths. npm provides download statistics and version history. GitHub provides code metrics, issue activity, and contributor data. Curated registries provide editorial assessment and compatibility information. Awesome lists reflect community preferences. No single source gives you the complete picture of any given tool.

The Crawling Process

Aggregating from 50+ sources requires systematic data collection. Each source has its own API (or lack thereof), its own data format, and its own rate limits. Some sources provide clean APIs that return structured data. Others require web scraping. A few provide data dumps. The collection infrastructure needs to handle all of these access patterns reliably.

Crawling frequency varies by source. Package registries that update frequently are crawled more often than community lists that change slowly. The goal is to stay current without overwhelming source servers or consuming excessive resources.
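Per-source scheduling can be as simple as an interval table checked against the last crawl time. The intervals below are made-up placeholders for illustration; real values would be tuned per source.

```python
from datetime import datetime, timedelta

# Illustrative per-source crawl intervals (not real production values).
CRAWL_INTERVALS = {
    "npm": timedelta(hours=6),            # package registries update often
    "github": timedelta(hours=12),
    "awesome-lists": timedelta(days=7),   # community lists change slowly
}

def is_due(source: str, last_crawled: datetime, now: datetime) -> bool:
    """A source is due for a crawl once its interval has elapsed;
    unknown sources fall back to a daily default."""
    interval = CRAWL_INTERVALS.get(source, timedelta(days=1))
    return now - last_crawled >= interval
```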

Normalization

Once data is collected from multiple sources, it needs to be normalized into a consistent format. This is where much of the complexity lives. Different sources describe the same tool differently. One might call it an "MCP server," another a "plugin," and a third a "tool." One might categorize it as "database" while another says "data" and a third says "developer tools."

Normalization involves mapping these different labels to a consistent taxonomy, standardizing field formats (dates, version numbers, URLs), and resolving conflicts when sources disagree. This process is partly automated (using pattern matching and heuristics) and partly manual (for ambiguous cases that require human judgment).

Deduplication

The same tool often appears in multiple directories, sometimes with slightly different names or descriptions. Deduplication identifies these duplicates and merges their records into a single, enriched entry. The merged entry combines the best metadata from each source: the most complete description, the most recent version number, and all the quality signals from every directory that lists it.

Deduplication is harder than it sounds. A tool named "pg-mcp" in one directory and "postgres-mcp-server" in another might be the same tool or two different tools. Repository URLs are the most reliable matching criterion, but not all directories include them. Name similarity, author matching, and description comparison all contribute to deduplication decisions, with manual review for uncertain cases.
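Those matching criteria can be combined as a short decision function: an exact repository-URL match wins outright, and otherwise weaker signals are blended into a confidence score. The weights and threshold here are arbitrary stand-ins, not Skillful.sh's actual tuning.

```python
from difflib import SequenceMatcher

def same_repo(a: dict, b: dict) -> bool:
    """Repository URL is the most reliable match when both records have one."""
    ua, ub = a.get("repo_url"), b.get("repo_url")
    return bool(ua) and ua.rstrip("/").lower() == (ub or "").rstrip("/").lower()

def match_confidence(a: dict, b: dict) -> float:
    """Blend weaker signals when repo URLs are missing; result is in [0, 1].
    Weights are illustrative."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    author_match = 1.0 if a.get("author") and a.get("author") == b.get("author") else 0.0
    return 0.7 * name_sim + 0.3 * author_match

def are_duplicates(a: dict, b: dict, threshold: float = 0.8) -> bool:
    if same_repo(a, b):
        return True
    # In practice, scores near the threshold would go to manual review.
    return match_confidence(a, b) >= threshold
```

So "pg-mcp" and "postgres-mcp-server" merge immediately if they point at the same repository, and fall back to fuzzy name and author comparison when they don't.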

Enrichment and Scoring

After normalization and deduplication, each tool's record is enriched with computed signals. Security scores are computed based on dependency analysis and maintenance activity. Popularity signals are computed from download counts, GitHub stars, and directory presence. Trending signals identify tools whose metrics are growing faster than average.
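A minimal sketch of two such computed signals, under assumed formulas (the real scoring is not published): raw counts are log-scaled so enormous packages don't dominate, and a tool is "trending" when its latest week outpaces its own running average.

```python
import math

def popularity_score(downloads: int, stars: int, directory_count: int) -> float:
    """Illustrative blend: log-scale raw counts, weight independent
    directory listings. Weights are arbitrary stand-ins."""
    return (
        0.4 * math.log10(downloads + 1)
        + 0.4 * math.log10(stars + 1)
        + 0.2 * directory_count
    )

def is_trending(weekly_counts: list[int], factor: float = 1.5) -> bool:
    """Trending if the latest week grew faster than the average of the
    preceding weeks by the given factor."""
    if len(weekly_counts) < 2:
        return False
    baseline = sum(weekly_counts[:-1]) / (len(weekly_counts) - 1)
    return weekly_counts[-1] > factor * baseline
```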

The cross-referencing itself becomes a quality signal. A tool that appears in five directories has been independently evaluated five times. A tool that appears in only one automated registry has had less scrutiny. This directory count is a simple but effective indicator of ecosystem confidence.

Serving the Data

The aggregated, normalized, deduplicated, and enriched data is served through Skillful.sh's search interface. Users can search by keyword; filter by type, category, security grade, and other facets; and sort by quality signals. The goal is to compress what would otherwise be hours of manual research across dozens of websites into seconds of searching on a single platform.

The data is updated regularly to reflect changes in the underlying sources. New tools appear as they're added to directories. Security scores update as dependencies change. Trending signals reflect the most recent growth patterns. The aggregated view stays current because the underlying collection and processing pipeline runs continuously.



Search 137,000+ AI tools on Skillful.sh. View AI ecosystem statistics.