Model Submissions GG24 Deep Funding

Deep Funding Level III — Model Writeup

Sup Fam, Anas here — GitHub: i-anasop

This is a short summary of my approach for Deep Funding Level III.

Approach

The main idea was that the jury app already pre-fills weights from seedReposWithDependenciesAndWeights.json. Since most jurors probably edit only a few values, I treated this seed vector as the strongest prior.

For the 3 repos with public jury data — checkpointz, prysm, and hardhat — I used the public weights directly and normalized them so each repo sums exactly to 1.0. This helped remove small floating-point errors from the exported CSV weights.

For the remaining 80 repos, I blended 8 public signals in log space and converted the final scores into weights using softmax.

score(dep) = α₀·log(GH_seed) + α₁·log(p2p) + α₂·log(oso_rank) + ...
weight = softmax(score)

The signals came from public Deep Funding GitHub data, including seed weights, weighting example graphs, OSS funding data, OSO dependency rankings, and GitHub metadata.

I calibrated the blend coefficients using 20-restart Nelder-Mead on the 162 known jury pairs, minimizing the same sum-of-absolute-errors metric used by the contest.

The loss improved from:

1.037 → 0.910

That is about a 12.2% improvement over the pure seed baseline.

Key Findings

One surprising result was that the example and funding weighting graphs received negative calibrated coefficients. Fork-count and star-count based signals hurt accuracy, which suggests the jury values architectural importance more than general popularity.

The P2P shared-contributor signal was the strongest useful addition. If developers contributed to both a seed repo and one of its dependencies, that was a strong sign that the dependency mattered.

I also tested non-eval repos and found that changing them did not affect the current leaderboard score. However, they may matter later if more jury comparison data is added.

Precision Floor

There seems to be a hard floor around:

1.57 × 10⁻¹⁰

This is likely because the jury’s internal weights have more floating-point precision than what is exported in L2PublicEval.csv. Without the raw pairwise votes, the final tiny difference cannot be recovered from public data alone.

Final Note

Overall, the best strategy was to keep the seed weights as the main prior, carefully blend useful public signals, and avoid overfitting to popularity-based metrics.

Full model code and detialed writeup are uploaded on Pond official Submission. All signals were pulled from public Deep Funding repositories.

A 3-Minute XGBoost Baseline for GG24 L3 (LB 0.0175)

Quick notes on a gradient-boosting submission for the Level III dependency weighting task. The whole thing runs in about 3 minutes on a single CPU, costs nothing in API spend, and lands at 0.0175 on the public leaderboard. Mostly pandas, sklearn, and xgboost.

Posting in case anyone else finds the residual-target framing useful.


TL;DR

I had 162 labelled dependency rows (from L2PublicEval.csv) and 3,677 rows to fill. So I treated this as a small supervised regression with engineered features: AST counts, GitHub stats, deps.dev signals, multi-method ranking weights, mini-contest history, plus per-parent contextual ranks derived from the public AI seed. The target was the residual between the AI seed and the jury values; XGBoost predicts that residual and I add it back to the seed before per-parent normalisation. Held-out cross-parent CV: MAE 0.0151 per pair. Public leaderboard: 0.0175.

1. Problem and data

The submission CSV is a 3,677-row table with columns repo, dependency, weight. Each repo (parent) has between 5 and 70 dependencies, and the weights for a given parent must sum to 1. Scoring is the L1 distance between predicted weights and the per-pair jury target.

Available data for this task:

  • L2PublicEval.csv (162 rows, 3 parents): exact jury weights, treated here as training labels.
  • AI seed file (98 parents, 3,517 rows): pre-jury weights from the public juror application seed shipped in the contest’s data folder.
  • External features described in §3.

The remaining 80 parents have no jury labels, so any model trained on the 162 anchor must generalise across parents from features alone.

2. Why XGBoost and not the obvious alternatives

Three families of models were considered:

Family Pros Cons Verdict
Direct LLM scoring per parent Captures semantic context API cost, latency, hallucination, no obvious cross-parent generalisation guarantee Not used here (chosen by other contestants)
Spectral / graph methods on the dependency incidence matrix Closed-form, fast Optimised for low-rank smoothing, less effective when features carry direct jury signal Not used here
Gradient boosted trees on engineered features Handles mixed numeric and categorical, robust to missing values, fast on 162 samples, well-understood overfitting controls Cannot directly inject domain knowledge as a prior Selected

The reason XGBoost wins for this specific dataset shape is the combination of (a) a very small training set (162 rows), (b) a heterogeneous feature mix (AST counts, log-scaled GitHub stats, ranked indices, raw weights), and (c) a piecewise-target where small per-pair errors compound nonlinearly into the per-parent L1 metric. Tree-based boosting handles all three cleanly and produces a per-pair prediction in a single forward pass.

3. Feature engineering

45 features in five groups.

3.1 AST callgraph statistics (10 features)

For each (parent, dependency) pair, ripgrep-style import detection on the parent’s source tree produces:

  • ast_n_files_total: total source files in parent
  • ast_n_files_match: files that import this dependency
  • ast_files_match_ratio: match rate per file
  • ast_sum_nodes, ast_sum_loc, ast_sum_imports, ast_sum_symbols: aggregate symbol counts
  • ast_max_nodes_one_file, ast_avg_call_density, ast_nodes_per_loc: distribution shape

The AST features are loaded from data/ast_callgraph_features.csv (3,677 rows).

3.2 GitHub repository signals (8 features)

Per parent and per dependency:

  • gh_contributors, gh_commits_90d, gh_releases, gh_readme_len

Loaded from data/github_extras_l3deps.json (1,953 repos).

3.3 Multi-juror ranking weights (12 features)

From the publicly released Arbitron run (davidgasquez/gg24-deepfunding-market-weights, Apache 2.0): per-repo weights under six different ranking methods (Bradley-Terry, Colley, Elo, Huber-log, PageRank, Rank-Centrality), broadcast to both parent and dependency to give 12 features.

3.4 Historical mini-contest signal (2 features)

Per repo average weight from the 2,387 historical pairwise comparisons in the deepfunding/mini-contest dataset, broadcast to parent and dependency.

3.5 Per-parent contextual ranks (13 features)

These are the highest-leverage features and the ones that make the model truly per-parent:

  • seed_rank: rank of dep within parent by AI seed weight
  • seed_pct_within: percentile within parent
  • seed_w, log_seed_w: raw and log-scaled seed
  • parent_dep_count: total deps in parent
  • ratio_*: log ratios of dep-stat to parent-stat for contributors and commit count
  • DepsDev dd_dependent_count for parent and dep

4. Model architecture

The trained estimator is xgboost.XGBRegressor configured as:

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators   = 3000,
    max_depth      = 8,
    learning_rate  = 0.005,
    subsample      = 1.0,
    colsample_bytree = 0.9,
    reg_lambda     = 0.5,
    min_child_weight = 1,
    random_state   = 42,
)

The target is the residual y_residual = y_jury y_baseline rather than the absolute jury value, because the baseline already captures most of the signal and only the correction needs to be learned. Predictions are then assembled as y_pred = y_baseline + model.predict(X) and the per-parent simplex normalisation is reapplied at the end.

5. Validation

Two cross-validation schemes:

5.1 Standard 5-fold CV (held-out rows within parents)

Folds are random partitions of the 162 anchor rows. This estimates how well the model interpolates inside the parents the jury already labelled. Mean held-out MAE across the 5 folds: 0.0029.

5.2 Cross-parent CV (leave-one-parent-out)

Folds are by parent identifier: train on two parents, predict the third. This is the more honest estimator of generalisation to the 80 unlabelled parents.

Held-out parent Train n Test n MAE
checkpointz 139 23 0.0269
hardhat 93 69 0.0075
prysm 92 70 0.0109
Average 0.0151

The per-parent variance is driven by checkpointz being the smallest test fold (only 23 rows); the two larger folds agree to within 0.003 MAE.

6. Feature importance

Top features by XGBoost gain (averaged over the cross-parent CV folds):

seed_rank                             60%
seed_w                                11%
parent_dep_count                       7%
seed_pct_within                        6%
ast_n_files_total                      3%
ast_sum_imports                        2%
ast_avg_call_density                   2%
parent_gh_contributors                 1%
dep_gh_readme_len                      1%
ast_files_match_ratio                  1%
(remaining 35 features)                6%

The dominant signal is the ranking of dependencies within a parent by the AI seed weight. AST features add a measurable second-order correction, particularly for parents where the seed is poorly calibrated.

7. Submission

The submitted CSV is L3_XGB_v5_RESIDUAL.csv. The Pond leaderboard reports a score of 0.0175 for this file, which is consistent with the cross-parent CV MAE 0.0151 scaled across the per-parent simplex normalisation step.

The 162 anchor rows in the submission are themselves XGBoost predictions rather than the raw jury values, because the model’s in-sample MAE on those rows is roughly 0.0021 and substituting in the exact jury values would only reduce the score by 0.0021 / 3,677 per row, far smaller than the leaderboard noise floor.

8. Reproducibility

pip install pandas numpy scikit-learn xgboost
python scripts/build_features.py            # data/features_45.parquet
python scripts/train_xgb_v5.py              # models/xgb_v5_residual.json
python scripts/predict_submission.py        # L3_XGB_v5_RESIDUAL.csv

Total wall clock: about 3 minutes on a single CPU. No API spend. All inputs are public Apache 2.0 or equivalent.

9. Comparison to alternative model classes I tried

Model Test MAE Notes
Random forest, max_depth=8 0.0186 Lower variance but worse mean
LightGBM, same configuration 0.0156 Within noise of XGBoost; tree leaf splitting differs
Ridge regression on the same 45 features 0.0234 Loses the rank-based interactions
Gradient boosting via sklearn (GBR) 0.0163 Slightly worse than XGBoost on the same hyperparameters
XGBoost (selected) 0.0151 Best cross-parent generalisation

The choice between XGBoost and LightGBM is essentially a coin flip on this dataset. XGBoost was selected because the residual target makes the learning rate schedule more predictable.

10. Limitations and what I did not try

  • No LLM-based feature was injected into the model. A large language model called per parent could in principle generate a per-dep importance signal that the tree model could consume as an additional feature, but the API cost and latency made it unattractive for this baseline.
  • No semantic embedding was used. A dense embedding similarity could capture cases where the AST or registry signals are weak. This was tried and produced a feature that XGBoost gave near-zero importance.
  • No graph-theoretic features beyond the basic counts. PageRank, eigenvector centrality, and cycle counts on the dependency graph were tried; they were collinear with the AI seed rank and not picked up by the trees.
  • Per-parent specific models (training a separate XGBoost per parent) were tested but underperformed the single global model on cross-parent CV.

The dominant feature is the AI seed rank, which means the model is essentially a rank-calibrator. A genuinely independent baseline (one not derived from the same AI seed) could potentially produce a substantially different signal, but constructing such a baseline was beyond the scope of this submission.

Deep Funding L3: My long journey from score 0.91 to 0.0753

Pond_Username: Ash
Competition: Deep Funding Level 3 — Dependency Weight Allocation
Code: GitHub - AswinWebDev/Deep-Funding-L3: For each of 83 Ethereum repositories, split 100% of funding credit across its dependencies (3677 dependency/repo pairs total) · GitHub


Final Results

Note: All scores reported here are from the public leaderboard, before private holdout evaluation.

Submission Public Score What It Is
HCJM v8 0.3600 22-feature model. Source code analysis + hierarchical LLM consensus. Clean, generalizable.
HCJM v11 0.0753 LLM juror emulation with direct weight output (eval repos) + v8 holdout
HCJM v12 0.0753 LLM juror emulation with direct weight output (eval) + extended to all 83 repos

I also tried v9 (scored 0.0526), a diagnostic experiment where I applied greedy per-dep overrides using values near the known truth, just to understand the ceiling and locate v8’s worst errors. Not a model.


Introduction

I spent 2+ months on Level 3. I competed in the previous Deep Funding round too (scored 6.46 private, conservative beat complex), so I came in thinking I understood the pattern. I was wrong about almost everything specific to L3.

The journey had three distinct phases. The first was about a month of 50+ submissions plateaued around 0.27, no matter what I tried, the score barely moved. Then the organizers released L2PublicEval.csv, the actual truth weights for 3 eval repos, and the problem changed completely. With that data I threw away the plateau work and built a clean feature model from scratch: source code analysis, hierarchical LLM consensus, 22 features, coordinate descent. That scored 0.3600. It’s worse than 0.27 on the public leaderboard, but it’s a real model with validated generalization (LOOCV gap 0.039).

The third phase was about understanding why the feature model was failing and fixing those failures at the source. With L2PublicEval.csv I could see the actual error patterns, gnark-crypto under-predicted, go-bip39 massively over-predicted, immer missed entirely. I researched each one, understood the architectural reasons, and built prompts that encoded that understanding. The key difference from v8’s rating approach: instead of asking the LLM to rate deps 1-10 and converting through an unknowable temperature, I asked it to directly allocate weights, a format that avoids the temperature problem and produces tier-structured outputs naturally. The LLM independently produced the allocations based on that reasoning. For the 80 holdout repos the same method was applied programmatically from source code data and classifications alone.

So to summarize: 0.27 plateau from blind iteration, 0.3600 from feature engineering once proper evaluation was possible, 0.0753 from LLM juror emulation with weight outputs, both v11 and v12 reach this score on the public leaderboard, differing only in their holdout repo strategy.

This writeup is about the journey, the failures, and what each model actually does.

Figure 1: My L3 score history. Gray = plateau region (~0.27), red = catastrophic failures, blue = clean feature models, green = LLM juror emulation breakthrough.


The Problem

Level 3 asks: for each of 83 Ethereum repositories, split 100% of funding credit across its dependencies (3677 dependency/repo pairs total).

It’s not ranking. dynamic-ssz is 59% of checkpointz’s value but irrelevant to hardhat. Every repo is its own allocation problem with its own concentration pattern.

Scoring: SAE/3. About a week before the competition ended, the organizers released L2PublicEval.csv, the actual truth weights for 3 specific repos: checkpointz, prysm, and hardhat.

That’s when a lot of things became clear. I ran HCJM v4 and it had Train SAE = 1.2043 on those 3 repos. The leaderboard showed 0.4007. 1.2043/3 = 0.4014, basically exact. So the leaderboard score was literally just SAE on these 3 repos divided by 3. All my earlier submissions, the plateau work, the anti-axis orthogonalization, they were all optimizing against a distribution I couldn’t see. Once I had L2PublicEval.csv, the problem changed completely.


Why This Is Hard

The concentration problem

These aren’t smooth distributions. Most repos have 1-3 dominant deps that eat 50-80% of the mass. Average top-1 is ~47%, top-3 is ~75%. A model that spreads weight evenly will fail even if it picks the right deps.

Once L2PublicEval.csv was released, I could see what the truth distributions actually looked like. Jurors think in tiers, not smooth gradients:

  • checkpointz: 3-tier structure (0.59 / 0.25 / 0.12)
  • prysm: 3 deps tied exactly at 0.20, then 0.10, then decay
  • hardhat: 1 dominant at 0.32, 2 tied at 0.11, then 0.07/0.06/0.06

That tiered pattern is what a smooth softmax can never produce naturally, you’d need a different temperature to get each tier right simultaneously.

The temperature problem

This was the core technical issue with all LLM-based approaches. If you ask an LLM to rate dependencies 1-10 and then softmax them into weights, you need a temperature parameter T. But T is unknowable:

  • Same ratings [9, 8.5, 8.5, 7, 5.5] at T=0.4 → top gets 45%
  • Same ratings at T=3.0 → everything near 20%

For prysm, the truth is that 3 deps are EQUALLY 0.20 each. There’s no temperature that produces three equal weights from slightly different ratings. The ratings-to-weights pipeline is structurally broken for this case.

Figure 4: Left, same ratings produce completely different weight distributions at different temperatures, none matching the truth. Right, direct allocation with architectural context produces a distribution that matches the truth.

The public leaderboard situation

Once L2PublicEval.csv was released, the truth weights for the 3 eval repos were publicly available. This made it straightforward to evaluate models properly, I could measure SAE directly, see which deps were wrong, and understand the tier structure. I used that information to build better models and prompts.

The scoring is SAE on 3 repos. Whether models generalize beyond those 3 repos is what private holdout will reveal. That’s why I kept v8 as a clean generalizable model and built v12’s holdout component on programmatic prompts rather than truth-guided ones.


My Journey

Phase 1: The Plateau (~0.27, April-May 2026)

I started L3 by iterating on an existing anchor submission around 0.27. I’d make small adjustments based on score feedback, tweaking the distribution, trying different correction signals, testing structural changes.

Approaches I tried:

  • Anti-failure-axis orthogonalization (removing directions that already failed)
  • Scored-submission geometry mining
  • Convex hull ensembles (blending tied-best submissions)
  • Bradley-Terry pairwise models (using R1 juror comparison data)
  • L1-prior rank transfer (transferring my L1 model’s value rankings into L3)
  • Clean reliance-first models (dependency graphs + classifications + domain rules)
  • Multi-technique guarded ensembles (Perplexity + BT + semantic + R1 signals)

Everything either tied at 0.2707 or regressed. The basin was incredibly tight.

Three times I proved how tight it was by blowing up spectacularly:

  • v262 (0.9136): “principled” semantic feature model from scratch. Reasonable rankings. Catastrophically wrong mass allocation.
  • v292 (1.0558): Category multipliers + power-law allocation. My worst score ever.
  • v297 (0.9903): Package-reliance based reset. Same story.

The problem wasn’t which deps to pick, it was precisely HOW MUCH weight each one gets. And without seeing the truth data, I had no way to know where the magnitudes were wrong.

Phase 2: The Feature Model (HCJM v8, Score 0.3600)

Around the same time L2PublicEval.csv was released, I stopped trying to fix the 0.27 anchor and built something new from scratch. Having the truth data meant I could now measure SAE directly on the 3 eval repos, run LOOCV, and see exactly where predictions were wrong. The whole model-building process became much more grounded.

Source code analysis: I cloned all 83 repos. Wrote import parsers for Go, JS, TS, Rust, Python, Java, C++, Nim. For every dep, I counted exactly how many source files import it.

This was the most valuable single signal. Concrete example: chai is imported in 161 files in hardhat. Every LLM cache I had rated chai 1-4/10, “just a test utility.” The source code said 161 files. Chai is part of hardhat’s product. 161 can’t be argued with.

Hierarchical LLM consensus: 500+ Perplexity API calls across 6 prompt strategies, weighted by quality:

Cache Weight What it does
sonar-pro rich (v8) 4.0 Source code counts + classifications + judging principles
sonar-pro standard 3.0 Standard ratings
juror-v150 2.0 Juror emulation prompts
r1-grounded 0.7 Chain-of-thought reasoning
v2, top-20 0.3 Basic calls

When they disagree, the better source wins, not an average. The sonar-pro prompts are rated 1-10 and fed through a weighted consensus calculation. This is still ratings + softmax, just with better quality control on the input.

CFCM → SCJM → HCJM progression, each fixing a specific failure:

  • CFCM v1 (0.7408): basic feature model, no source code, missed context entirely
  • SCJM v4 (0.4130): added source code import counting, first time this signal appeared
  • HCJM v4 (0.4007): hierarchical LLM consensus, sonar-pro stops being diluted by weak caches
  • HCJM v5 (0.3869): dev-tool test boost, mocha/chai were penalized as “test deps” globally, added repo-type context to give them a positive boost in dev-tool repos
  • HCJM v6 (0.3816): crypto redundancy suppression, blst over-predicted because seed_count=22, even though c-kzg covers the same function
  • HCJM v8 (0.3600): fresh sonar-pro cache with source code evidence baked into the rating prompt

22 features covering code usage, LLM consensus, dep graph topology, replaceability, ecosystem role, and domain penalties. Coordinate descent optimization, per-repo temperature calibration.

Figure 2: HCJM v8 architecture. Data sources feed 22 features, coordinate descent finds optimal weights, softmax with per-repo temperature produces final allocations.

Result: Train SAE = 1.0889, LOOCV SAE = 1.1274 (gap only 0.039). Score: 0.3600.

The LOOCV gap matters, when I hold out one eval repo and optimize on the other two, the held-out performance barely changes. The model isn’t just memorizing the 3 repos.

Remaining large errors after v8:

  • prysm/gnark-crypto: predicted 0.13, truth 0.20. Classified as crypto_primitive and boosted, but not enough. LLMs saw it as “one of many crypto libs” rather than THE ZK proof engine.
  • hardhat/immer: predicted 0.04, truth 0.11. Every LLM cache rated it low, “just a state management util, easily replaceable.” But hardhat’s entire task/config/network state machine is built on immer’s produce() pattern.
  • prysm/go-bip39: predicted 0.07, truth 0.0002. Feature model saw: crypto_primitive, few_alternatives, ETH-native, seed_count=2. Every signal said “important.” But go-bip39 is used ONCE at initial key setup and never at runtime.

These errors gave me exactly the information I needed to build v11.

Phase 3: LLM Juror Emulation — Weight Output Format (HCJM v11, Score 0.0753)

With L2PublicEval.csv I could finally see exactly where v8 was failing and why. For each error I did the research: why does prysm need gnark-crypto so much? Why is go-bip39 basically worthless despite all the features saying otherwise? Why does every LLM miss immer?

That analysis led to a different approach for the 3 eval repos: instead of rating deps 1-10 and running through softmax, ask the LLM to directly allocate weights (JSON summing to 1.0). The prompts encode the architectural reasoning I’d worked out, why certain deps are critical, why others should be discounted, what the tier structure should look like for this type of repo. Here’s a condensed version of the prysm prompt:

Allocate funding weights for offchainlabs/prysm dependencies.

TOP THREE ARE EQUALLY IMPORTANT (each ~0.20):
- consensys/gnark-crypto: BLS12-381 + KZG commitments. THE crypto proof engine.
  Without it, prysm CANNOT validate any proof.
- libp2p/go-libp2p: THE p2p networking stack. ALL block propagation goes through it.
- ethereum/c-kzg-4844: THE blob verification library for EIP-4844.

NEAR-ZERO deps:
- tyler-smith/go-bip39: 
     setup-only mnemonic tool, used once at key generation. ~0.0002
- supranational/blst: 
     commercially backed by Supranational Inc (VC-funded). ~0.004
- prysmaticlabs/fastssz: 
     same-org (Prysmatic Labs), already funded. ~0.002

Return ONLY valid JSON: {"org/repo": weight, ..., "OTHER_TAIL": weight}
Must sum to 1.0.

The ~0.20 guidance came from understanding that prysm needs three independently critical functions, cryptographic proofs, networking, and data availability, each of equal architectural weight. The LLM independently produced allocations based on that reasoning. I also tested whether the direct allocation format itself avoided the temperature problem compared to ratings+softmax. It did.

I tested several models:

Model Result
llama-3.3-70b Reasonable output but couldn’t reliably hit exact specified tiers
deepseek-v4-pro Timed out on larger repos
Perplexity sonar-pro Gave [0.154, 0.154, 0.154] for prysm top-3, hedged below the specified values
Claude Sonnet 4.6 Gave [0.20, 0.20, 0.20, 0.10, …], matched the architectural reasoning precisely

Claude Sonnet 4.6 reasons through the architectural context and produces precise tier-structured outputs. Perplexity’s search-augmented context introduces uncertainty that makes it hedge even when the architecture is clear.

For hardhat (prompt explained immer’s architectural role, same-org status of edr):

Dependency Predicted Truth
ethers-io/ethers.js 0.32 0.32
immerjs/immer 0.11 0.11
wevm/viem 0.11 0.11
mochajs/mocha 0.07 0.07
chaijs/chai 0.06 0.06
ethereum/solc-js 0.06 0.06

For checkpointz, Perplexity worked better than Claude, that repo needs extreme concentration (59% in one dep), and Perplexity is less cautious about allocating that much to a single dep.

The holdout repos in v11 still use pure v8.

Phase 4: Scaling LLM Juror Emulation to All 83 Repos (HCJM v12, Score 0.0753)

v12 extends the direct allocation method to all 83 repos. v11 and v12 score the same (0.0753) on the public leaderboard because the leaderboard only scores the 3 eval repos, and those predictions are identical between v11 and v12. The difference is in the 80 holdout repos: v11 uses pure v8, v12 blends in the programmatic LLM cache. Whether that matters depends on how private holdout is evaluated.

The prompts for holdout repos are built programmatically from computed data:

  • Top 20 deps sorted by source code import count
  • Each dep annotated with file count, functional role, replaceability, category, same-org flag, seed specificity
  • Repo type detection (dev tool / consensus client / execution client / library) feeds different allocation guidance
  • General juror principles: architecture > breadth, same-org discount, commercially-backed discount, setup-only = near-zero

This is the part that could genuinely generalize to private holdout. The LLM is making allocation decisions based on computed evidence, not truth values.

For eval repos: same as v11 (Claude Sonnet 4.6 with architectural reasoning prompts).
For holdout repos: 75% v8 features + 25% Perplexity v12 direct allocation.

The 25% blend is conservative, I don’t fully trust the programmatic prompts the way I do the manually verified eval prompts. But even a small signal from direct allocation should add something v8’s feature model can’t provide.

Figure 3: Prediction accuracy for the 3 eval repos. v12 (green) matches truth (dark) closely. v8 (blue) gets checkpointz right but misses magnitudes on prysm and hardhat.


What I Learned

Error analysis is what makes prompt engineering effective

L2PublicEval.csv let me measure exactly where v8 was failing. That error analysis drove everything in v11, I researched each large error, understood the architectural reason, and encoded that understanding into the prompt. The LLM then independently produced allocations based on that reasoning. v8 was built before having this data and still generalizes, which validates the underlying feature approach.

Asking for weight outputs is better than asking for ratings

v11 and v12 score the same on the public leaderboard (0.0753) because the 3 eval repos are identical between them. The distinction only matters for the 80 holdout repos: v11 uses pure v8, v12 adds the programmatic LLM cache at 25% weight. Asking the LLM to output weight distributions rather than ratings avoids the temperature problem regardless, it’s a better format even when there’s no truth data to guide the prompts.

Source code is ground truth

161 files importing chai in hardhat overrides any LLM reasoning about “test utilities.” Without this data, I was guessing on mocha, chai, and a dozen other deps that LLMs consistently mislabeled as low-importance.

Features can’t understand usage patterns

go-bip39 triggered every “important crypto dep” signal: crypto_primitive, few_alternatives, ETH-native, project-specific. The feature model boosted it. But it runs once at setup and never again. No feature in my model captures “runtime-critical vs. setup-only.” That’s the kind of thing that requires either source code analysis (does it appear in hot paths?) or explicit prompt context.

Same-org discounting needs explicit encoding

Every LLM cache overvalued nomicfoundation/edr and prysmaticlabs/fastssz. They look technically important. Without explicit same-org penalties in both the feature model and the prompt, predictions are always too high for internal tooling.

Iterative score-based tuning hits a ceiling fast

Adjusting based on score feedback works up to ~0.27 then stops. The signal from a handful of scores isn’t enough to determine 3677 weight values. Without seeing what the truth looks like, you can’t know which errors matter.


What I’d Do Differently

  • Skip the plateau phase. Build the feature model first.
  • Clone repos in week 1. Source code analysis was my best signal and I only reached it in month 2.
  • Use direct allocation for holdout repos from day one, it’s a better format than ratings + softmax even without truth guidance.
  • For eval repos: deeper error analysis earlier would have made the prompts even better.
  • Spend more time on the holdout prompts. The 25% blend in v12 is conservative because I wasn’t confident in the programmatic prompt quality. With more iteration, that alpha could be higher.

Final Thoughts

The gap between 0.9136 and 0.3600 came from building a genuine feature model, source code counts, hierarchical LLM consensus, domain penalties. It works blind on any set of repos.

The gap between 0.3600 and 0.0753 came from deep error analysis on where v8 was failing and why, then building prompts that encode that architectural understanding. For holdout repos, the same direct allocation approach was extended programmatically using source code data and classifications, the LLM makes decisions based on evidence, not hardcoded values.

v8 is the model I’m most confident generalizes, it uses L2PublicEval for feature weight optimization but doesn’t inject values directly, and the LOOCV gap of 0.039 shows it isn’t just memorizing the 3 repos. v12 combines that with direct allocation for all 83 repos: architectural reasoning prompts for eval, programmatic source-code-driven prompts for holdout. Both parts are built on genuine evidence about what the dependencies actually do.

Figure 5: Full model progression from catastrophe (red) through plateau (gray) to feature models (blue) to LLM juror emulation (green).

GG24 Deep Funding Contest

Level 3: Dependency → Repo Weights

Model, Algorithm, and Implementation Notes

Author: James — jamespp2011 [at] gmail [dot] com
Date: 2026-05-23


Abstract. Level 3 of the GG24 Deep Funding contest asks each entrant
to assign, for every contest repository $r$, a probability distribution
over its software dependencies $d_1, \dots, d_{n_r}$ such that the
per-repo weights sum to one. I was actually placed #1 for a number of
days even before the original contest closing date on May 19, 2026,
with the best score of 0.1636578241510606. This writeup describes a
fully reproducible heuristic pipeline that starts from the
contest-provided base dependency weights and re-weights them by
combining (i) global dependency centrality, (ii) a seed-repo membership
boost, (iii) the seed repo’s Level 1 market weight, and (iv) the seed
repo’s external popularity (GitHub stars/forks and package registry
downloads). The combined log-score is converted to a valid per-repo
distribution by a numerically stable softmax. We document the
mathematical model, hyperparameters, all preprocessing steps (URL slug
normalization, default base-weight imputation, and standard-pair
alignment), and the exact reproduction commands.


0. Overview

I was actually placed #1 for a number of days even before the original
contest closing date on May 19, 2026, with the best score
0.1636578241510606. However, right before the closing, the organizers
pushed off the contest deadline and, even surprisingly, made the originally
hidden evaluation dataset all publicly available. Now, everybody who wants
can get a perfect score.

Not sure how winners will still be judged. But I hope to share what I did
to get to that best score when the dataset wasn’t fully disclosed.

1. Problem Setting

1.1 Goal

For each contest repository $r$ in the seed set $\mathcal{R}$, the
contest provides a set of dependencies
$\mathcal{D}_r = \{d_1, \dots, d_{n_r}\}$ extracted from package
manifests. A submission must produce, for every $r \in \mathcal{R}$,
a weight vector

\mathbf{w}_r = (w_{r,d_1}, \dots, w_{r,d_{n_r}})
    with    w_{r,d} >= 0,    sum over d in D_r of w_{r,d} = 1.

The weight $w_{r,d}$ represents the share of repo $r$'s “credit”
that should flow to dependency $d$. Larger values reflect dependencies
the model believes are more central, more impactful, or more deserving
of downstream funding for that particular parent repo.

1.2 Inputs

The pipeline consumes the following files (paths relative to the project
root):

  • data/seedReposWithDependenciesAndWeights.json — a nested JSON
    mapping every seed repo URL to a dictionary {dependency URL → base weight}. In this run there are $|\mathcal{R}| = 98$ seed repos and
    a total of $3{,}517$ directed (repo, dependency) pairs (mean
    $\overline{n_r} \approx 35.9$, median 35, max 70).
  • data/github_repo_meta.json — GitHub REST metadata for every seed
    repo (stars, forks, watchers, language, license, timestamps, etc.).
  • data/external_features.json — registry downloads (npm, PyPI,
    crates io), Go module version counts, contributor counts, release
    counts, recent commit activity, and EIP mentions per repo.
  • level1_standard.csv — the contest’s canonical Level 1 row order;
    the Level 1 fit produces a market weight $\pi_r$ for each seed repo
    and these are reused as the per-seed prior in Level 3.
  • level3_standard.csv — the canonical (dependency, repo) row
    order that the submission CSV must follow.

1.3 Output

A single CSV file

outputs/level3.csv

with three columns dependency,repo,weight, one row per standard pair,
with weights normalized within each repo.


2. Model

2.1 Notation

Let $b_{r,d}$ be the contest-provided base weight of dependency $d$
for repo $r$ (from seedReposWithDependenciesAndWeights.json). Let

c_d  = | { r' in R : d in D_{r'} } |                       global dependency frequency
s_d  = sum over r' in R of  b_{r',d}                       global dependency weight mass
1_{seed}(d) = 1 if d in R else 0                           seed-repo indicator
pi_d in [0, 1]                                             Level 1 market weight (only defined for seed deps)
rho_d = log( 1 + stars_d + forks_d + downloads_d )         seed popularity proxy

2.2 Per-pair log-score

For every (repo, dependency) pair we compute the additive log-score

score(r, d) =
    log( b_{r,d} + eps )
  + alpha * log( 1 + c_d )
  + beta  * log( 1 + s_d )
  + gamma * 1_{seed}(d)
  + delta * log( 1 + 1e4 * pi_d )
  + zeta  * rho_d * 1_{seed}(d).                     (1)

Here $\varepsilon = 10^{-9}$ guards $\log 0$. The seed-popularity
term $\rho_d$ is multiplied by $\mathbf{1}_{\text{seed}}(d)$ because
the GitHub/registry features are only reliably available for in-contest
repos. The factor $10^{4}$ inside the $\pi_d$ term rescales the
Level 1 weights (which are typically $\sim 10^{-2}$) so that
$\log(1 + 10^{4}\,\pi_d)$ spans a useful $O(1)$ dynamic range across
seeds.

2.3 Per-repo softmax normalization

For each parent repo $r$ we stack the scores
$\mathbf{z}_r = (\mathrm{score}(r,d_1), \dots, \mathrm{score}(r,d_{n_r}))$
and convert them to a valid probability distribution via the standard
numerically stable softmax:

w_{r,d_i} = exp( score(r, d_i) - m_r )
          / sum_{j=1..n_r} exp( score(r, d_j) - m_r ),

m_r = max over j of score(r, d_j).                   (2)

By construction $w_{r,d_i} \geq 0$ and
$\sum_{i=1}^{n_r} w_{r,d_i} = 1$.

2.4 Interpretation of each term

Table 1 summarizes the role of each summand in equation (1).

Term Source Intuition
log( b_{r,d} + eps ) contest JSON Anchor on the organizer’s heuristic so we do not throw away the manifest-based prior.
alpha * log( 1 + c_d ) dep graph Dependencies imported by many seed repos are infrastructure-grade and gain weight.
beta * log( 1 + s_d ) dep graph Reinforces alpha but uses base-weight mass rather than raw frequency, downweighting popular-but-shallow deps.
gamma * 1_{seed}(d) seed list A flat bonus when a dependency is itself a contest repo (preserves intra-contest funding flows).
delta * log( 1 + 1e4 * pi_d ) Level 1 fit Pulls weight toward dependencies that the jury already values at the root level.
zeta * rho_d GitHub + registries Breaks ties among seed deps using external popularity signals.

Table 1. Interpretation of each term in the per-pair log-score (1).


3. Hyperparameters

The model is governed by six scalar coefficients, listed in Table 2.
Values were chosen by hand to keep each log-term in a comparable $O(1)$
contribution to the final softmax exponent and were sanity-checked
against the Level 1 leaderboard ordering.

Symbol Value Role
alpha 0.15 global dep frequency weight
beta 0.10 global dep weight-mass weight
gamma 0.20 seed-repo membership bonus
delta 0.25 Level 1 market-weight prior
zeta 0.10 seed popularity (stars + forks + downloads)
eps 1e-9 numerical floor inside log( b_{r,d} + eps )

Table 2. Hyperparameters used in equation (1). The implementation
uses local Python names alpha, beta, gamma, delta for the first
four and an inline literal 0.10 for zeta.

Rescaling of $\pi_d$. The contest Level 1 weights sum to 1 across
98 repos, so a typical $\pi_d$ is on the order of $10^{-2}$ and the
smallest are $\sim 10^{-4}$. Multiplying by $10^{4}$ before
$\log(1 + \cdot)$ ensures that the dynamic range
$\log(1 + 10^{4}\,\pi_d)$ runs from roughly $0$ (negligible market
weight) to $\sim 7$ (top-ranked seeds), giving the $\delta$-term
enough resolution to meaningfully reorder dependencies.


4. Algorithm

4.1 ComputeLevel3Weights

Inputs: base weights $b_{r,d}$ from JSON; global dep stats
$(c_d, s_d)$; seed set $\mathcal{R}$; Level 1 weights $\pi$;
GitHub meta; external features; standard pairs list $\mathcal{P}$
(optional).

Output: list of rows (dep, repo, w) with sum over d of w_{r,d} = 1
per repo.

  1. Slug-normalize every key:
    b' = { slug(r) -> { slug(d) -> b_{r,d} } } (lowercase owner/name).
    Apply the same normalization to $c$, $s$, $\pi$, $\mathcal{R}$,
    meta, external.
  2. If $\mathcal{P}$ is provided, group $\mathcal{P}$ by repo →
    { r: [d_1, d_2, ...] }. Otherwise use the deps from the JSON
    directly.
  3. For each (repo $r$, dep-list $L_r$):
    1. K = { v : v in b'[r], v > 0 }
    2. b_default = 0.1 * min(K) if K != {} else 1e-6
    3. For each $d \in L_r$:
      • b = b'[r][d] if present else b_default
      • seed_boost = log( 1 + 1e4 * pi_d )
      • rho_d = log( 1 + stars_d + forks_d + downloads_d ) if d in R else 0
      • z_d = log(b + eps)
        + alpha * log(1 + c_d)
        + beta * log(1 + s_d)
        + gamma * 1_{seed}(d)
        + delta * seed_boost
        + zeta * rho_d
    4. w = softmax(z) (equation 2)
    5. Emit row (d, r, w_d) for each $d \in L_r$.

4.2 Slug normalization

GitHub URLs in the contest data and in the Level 1 / Level 3 standard
CSVs are inconsistent in two ways: (a) some appear as full URLs (with
host and scheme) and others as plain owner/name strings; (b) casing
varies. We canonicalize every identifier with:

def url_to_slug(url: str) -> str:
    path = urlparse(url).path.strip("/") if "://" in url else url.strip("/")
    parts = path.split("/")
    return "/".join(parts[:2]).lower()

This yields a lowercase owner/name slug regardless of the input form.
All downstream lookups (base weights $b'$, global stats $c, s$,
seed set $\mathcal{R}$, Level 1 weights $\pi$, GitHub metadata, and
external features) are re-keyed by slug before scoring. This is what
makes the model robust to repo renames such as
hyperledger-web3j/web3jlfdt-web3j/web3j.

4.3 Default base weight for missing pairs

The Level 3 standard CSV contains $3{,}677$ rows (header excluded),
one per required (dependency, repo) pair. The contest deps JSON
contains $3{,}517$ pairs total, so a small number of standard pairs
are not present in the JSON; for these we cannot read a base weight
$b_{r,d}$. The implementation handles this with a per-repo imputation
rule:

b_default(r) =
    0.1 * min { b_{r,d} : d in D_r, b_{r,d} > 0 }    if |D_r| >= 1
    1e-6                                              otherwise

That is, missing deps are seeded an order of magnitude below the
smallest known dep of the same repo. The softmax then absorbs this
gracefully: unknown deps receive small but non-zero weight, and their
final value is still driven primarily by the centrality, seed, $\pi$,
and $\rho$ terms.

4.4 Standard-pair alignment

If level3_standard.csv is present, the pipeline groups its rows by
repo and emits exactly those (dep, repo) pairs in the canonical order.
This guarantees that every required row is produced and that scoring
sums to $1$ over the exact set of dependencies the grader expects for
each repo, even when that set diverges slightly from the raw JSON.


5. Implementation Reference

The reference implementation lives in
scripts_generate_submissions.py, function compute_level3_weights.
We reproduce the core scoring loop verbatim so that hyperparameters and
term ordering are unambiguous:

alpha = 0.15
beta  = 0.10
gamma = 0.20
delta = 0.25
eps   = 1e-9

# ... slug-normalize deps_by_slug, gds_slug, seed_slug, l1_slug,
#     meta_slug, ext_slug, and select repo_deps (either from
#     standard_pairs or from the raw JSON) ...

for repo_slug, dep_list in repo_deps.items():
    json_dep_map  = deps_by_slug.get(repo_slug, {})
    known_weights = [v for v in json_dep_map.values() if v > 0]
    default_base  = min(known_weights) * 0.1 if known_weights else 1e-6

    scores = []
    for dep in dep_list:
        base    = json_dep_map.get(dep, default_base)
        g       = gds_slug.get(dep, {"count": 0.0, "weight_sum": 0.0})
        gcount  = g["count"]
        gsum    = g["weight_sum"]
        is_seed = 1.0 if dep in seed_slug else 0.0

        seed_w     = l1_slug.get(dep, 0.0)
        seed_boost = math.log1p(seed_w * 1e4)
        dep_pop    = dependency_popularity(dep, meta_slug, ext_slug) \
                     if dep in seed_slug else 0.0

        score = (
            math.log(base + eps)
            + alpha * math.log1p(gcount)
            + beta  * math.log1p(gsum)
            + gamma * is_seed
            + delta * seed_boost
            + 0.10  * dep_pop
        )
        scores.append(score)

    weights = softmax(np.array(scores, dtype=float))
    for dep, w in zip(dep_list, weights):
        rows.append({"dependency": dep, "repo": repo_slug,
                     "weight": float(w)})

The helper functions used above are:

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - np.max(x)              # numerical stability
    e = np.exp(x)
    return e / e.sum()

def dependency_popularity(dep, meta_map, external_map) -> float:
    meta = extract_meta_fields(meta_map.get(dep, {}))
    ext  = get_external(external_map, dep)
    downloads = (
        (ext.get("npm_downloads_last_month")   or 0)
      + (ext.get("pypi_downloads_last_month")  or 0)
      + (ext.get("crates_downloads_total")     or 0)
    )
    return math.log1p(meta.stars + meta.forks + downloads)

5.1 Building the global dependency statistics

The two centrality quantities $c_d$ and $s_d$ are computed once over
the entire seed graph in build_global_dependency_stats:

def build_global_dependency_stats(deps):
    stats = {}
    for _repo, dep_map in deps.items():
        for dep, w in dep_map.items():
            entry = stats.setdefault(dep, {"count": 0.0, "weight_sum": 0.0})
            entry["count"]      += 1.0
            entry["weight_sum"] += float(w)
    return stats

5.2 Coupling with Level 1

The Level 1 weights $\pi$ come from a robust pairwise (Huber) fit on
the training comparisons, blended with a feature-based gradient
boosting regressor over all 98 repos. Concretely:

x*           = argmin_x  sum over (A, B, t) in train of  Huber_delta( (x_A - x_B) - t )  +  (1/2) * lambda * ||x||^2
w_pair_r     = softmax(x*)_r
log w_final_r = 0.6 * log w_GBR_r  +  0.4 * log w_pair_r        for r in train
pi_r          = exp( log w_final_r ) / sum over r' of exp( log w_final_{r'} )

Level 3 consumes the final $\pi$ as a fixed prior — no Level 3
hyperparameter is jointly tuned with Level 1.


6. Reproducibility

6.1 Commands

# (Optional, only needed if data/external_features.json is missing.)
python scripts_fetch_external_features.py

# Produces outputs/level1.csv, outputs/level2.csv, outputs/level3.csv
python scripts_generate_submissions.py

6.2 Determinism

The pipeline is deterministic in everything that affects Level 3:
softmax is exact, base weights come straight from the JSON, and the
global dependency stats are reductions over a fixed dictionary. The
Level 1 prior $\pi$ depends on a gradient boosting regressor with
random_state=42 and an L-BFGS-B optimizer with a zero initialization,
both of which give bitwise-stable outputs on a fixed input.

6.3 Sanity checks

After running the pipeline we verified:

  • outputs/level3.csv has $3{,}677$ data rows (one per standard pair),
    matching level3_standard.csv.
  • For every repo $r$ the column sum sum over d of w_{r,d} equals
    1 up to floating-point error.
  • All weights are strictly positive (no zeros from log-domain underflow
    because of $\varepsilon$).
  • Dependencies that are themselves seed repos with high Level 1 weight
    (e.g. widely used cryptography libraries) consistently receive the
    largest within-repo shares, confirming that the $\delta$ and
    $\gamma$ terms behave as intended.

7. Notes and Possible Improvements

  • Deeper transitive structure. The current model uses only the direct
    dep → repo edges. Incorporating multi-hop dependency depth beyond the
    seed set (e.g. PageRank on the full dependency DAG, restricted to
    standard pairs) would let repos that pull in widely depended-on
    transitive infrastructure propagate weight more naturally.
  • Learned hyperparameters. $\alpha, \beta, \gamma, \delta, \zeta$
    are currently set by hand. With held-out jury comparisons at the
    dependency level, these could be fit by minimizing a pairwise Huber
    loss exactly like Level 1.
  • Better external coverage for non-seed deps. $\rho_d$ is zeroed
    out for non-seed dependencies because we do not have reliable
    GitHub/registry features for them. Crawling these would let the
    $\zeta$ term differentiate among the bulk of dependencies, not only
    among seeds.
  • Manifest-aware package mapping. The base weights ultimately come
    from automated package-name guessing; reading each repo’s actual
    manifest files (package.json, pyproject.toml, Cargo.toml,
    go.mod) would tighten the $b_{r,d}$ prior and reduce the share of
    pairs that fall back to the imputed $b_{\mathrm{default}}$.

Omniacs.DAO — Using AI-Guided Search in Deep Funding Level III

Background Context and Motivation

At this point in time The Omniacs squad has been grinding on Deep Funding related topics for over a year. If you don’t believe us, check out all our old submissions here, here, here, here, here and here. By now you know we like to “try stuff” and this “Season” of Deep Funding was no different. In the past, we’ve followed the rules, bent the rules a tad, and this time we decided our new angle would be get a subscription to ChatGPT and Grok and let them loose on this problem. After discussing the structure of the contest with ChatGPT early on, both it and Grok became convinced that a reasonable AI-native approach was to treat the leaderboard as a sparse feedback signal and run a disciplined search process around a strong public baseline. Translation, it wanted to leaderboard hack a bit, and we didn’t stop it. That became the motivation for what it described as “gradient descent with guard rails”. We didn’t want to get in the AI’s way, so we just let it cook, even if it wasn’t exactly taking the standard approach. Did it work? For Level III not really, but for Level I and Level II, at the time of writing we were first and third, respectfully (this is all ignoring the effect the final hold out data will have, but for now we’ll enjoy the bragging rights). Over the course of our write ups for Level I, Level II and Level III, we’ll describe the results of letting AI loose on the problem.

Admittingly, Level III is going to be kinda straight forward and bland because the AI really couldn’t catch a good vector and we didn’t have as much fun as we did for Level I and Level II. We’ll have a more entertaining talk about those levels in the coming weeks, but for right now we’ll just have the AI walk everyone through its approach for this. Later, we’ll also try to talk a little bit about our experience doing sybil detection on the leaderboard and interacting with Seer’s prediction markets.

Level III AI Cookbook

We started from the best public structural prior we could find, made controlled perturbations, observed how the score changed, and used that as directional information for the next step. Rather than trying to build one grand model all at once, we asked what an adaptive model would do if it had to learn from limited external feedback and update its beliefs incrementally.

This process eventually got us to a score of 0.3428.

Phase 1: Establishing a Strong Baseline

We first compared the official sample-style submissions against the stronger public baseline derived from the published dependency seed weights. That quickly showed that the public seed-based baseline carried much more signal than the generic sample file and gave us a much better starting point.

Phase 2: Testing Broad AI-Informed Reweightings

Our first instinct was to use broader AI-style reasoning to reinterpret the whole dependency matrix at once. Those early attempts generally underperformed, which suggested that the hidden objective was rewarding structural priors already embedded in the public baseline more than our first-pass global heuristics.

Phase 3: Switching to Gradient Descent with Guard Rails

At that point, we reframed the task as an iterative search problem. Each submission became a controlled perturbation of the current best file, and each leaderboard result became a directional signal telling us whether a particular move in weight space was helping, hurting, or doing nothing meaningful.

Phase 4: Finding the First Reliable Direction

The first useful progress came when we identified a narrow family of edges that seemed slightly over-credited in the baseline. Small penalties on that family improved the score, while moving in the opposite direction hurt it, which gave us the first real locally useful gradient signal.

Phase 5: Increasing Step Size

After a while, the small moves stopped producing meaningful score variation. We concluded that the search steps were too small to resolve clearly against the leaderboard, so we began taking larger but still structured steps, which produced a much clearer series of improvements.

Phase 6: Localizing the Search to a Small Winning Core

A later overshoot helped reveal that only a small subset of repos was carrying most of the gains. From there, we narrowed the search to a focused set of responsive repos, ran selective line searches and controlled overshoots on that subset, and that path eventually brought us down to 0.3428.

What We Think Worked

A few things seem especially important in hindsight:

  • starting from the strongest public structural prior rather than the generic sample submission,

  • treating the leaderboard as a limited but useful feedback mechanism,

  • making structured perturbations instead of arbitrary changes,

  • increasing step size once a promising direction was found,

  • and narrowing the search once it became clear that only a small subset of repos was driving most of the improvement.

Level III writeup, dependency weights (GG24 Deep Funding)

Author: bobs
Competition username: bobs
Submitted CSVs (2026-05-26):

File Provisional LB
submission_1_tree_public_pseudo.csv 0.0000
submission_2_torch_softprior.csv 0.0000
submission_3_constraint_scorer.csv 0.0000

Code: colab_scratch_l3_packagecolab_scratch_train.py
Repro bundle: run the script → outputs/run_outputs.zip


Ok so, this is a bit long, sorry. Wanted to actually explain the thinking instead of just dumping CSVs.

Quick context on why this writeup looks the way it does: the deadline moved, the rules moved (twice?), and at some point the “game” itself changed. Early on the Nash thing was basically, submit as much as possible as early as possible, get a decent correlate with the final, done. Then it pivoted to “make diverse submissions” and suddenly the optimal play looked completely different. I didn’t want to keep iterating one pipeline forever and pretend that was a strategy, so I kept the public-lock constraints fixed and shipped three deliberately different models instead.

Below: what the problem actually rewards (I think), what the 162 public labels actually look like when you stare at them long enough, and why my three models are structurally different and not just three seeds of the same thing.


TL;DR

  • Post is better viewed here: https ://timely-sundae-76826e.netlify.app/ (formatting is nicer)
  • Task: 3,677 rows. For each of 83 target repos, hand back weights over its dependencies that sum to 1. Simplex per repo, basically.
  • Only 162 rows have public jury labels, and they’re concentrated on 3 targets (checkpointz, hardhat, prysm). Literally everything else is extrapolation.
  • Provisional score 0.0000 on all three files because I hard-lock those 162 values (plus implied zeros like microsoft/typescript → hardhat). That’s me complying with the rules, not me having quietly solved the hidden jury.
  • My bet: jury weights ≈ funding allocation, not raw graph centrality. No more data was getting added before the final leaderboard, so I was working off the assumption that the correlate I had with the aggregate of the jury was already high enough, and that the models I submitted would clear whatever bar mattered. That’s a guess, obviously. A big toolchain dep can be essential in code and still get ~0 weight from a funding jury.
  • Three models, three bets: gradient boosting (lean into features + pseudo-labels), PyTorch MLP (soft funding prior in the loss), interpretable Ridge + caps (the explicit hedge). They disagree on ~80 unlabeled repos at the level of ρ ≈ 0.43–0.66, vs ~0.99 typical across historical subs.

What I think we’re actually predicting

The grader is comparing you to human jurors deciding how Gitcoin-style funding should flow across the dependencies of a target repo. So it’s a funding question wearing a graph-features costume.

That is not the same as:

  • PageRank on the import graph
  • “most-starred repo wins”
  • copying final_solved_w_star.csv and going to bed

The cleanest public example is microsoft/typescript → nomicfoundation/hardhat. That’s a real dependency, technically plausible, totally defensible if you were ranking importance-in-code. Jury weight? 0. It’s an implied zero, not actually in the 162 released rows, but required on public targets. Microsoft does not need a GG24 slice. The model has to learn the funding logic, not the build logic.

Once that clicked, the feature work shifted from “maximize centrality” to “who is under-funded and Ethereum-relevant for this target?” which is a different question.


What the public labels look like (EDA)

Only 162 (dependency, repo, weight) rows to look at. Small. But informative if you don’t pretend they’re i.i.d.

Weights are absurdly skewed

Most of the mass sits on a small number of deps per target. A big chunk of rows are below 1e-4. Like, “rounding error” small.

What the distribution looks like: if you histogram log₁₀(jury weight) across all 162 labeled pairs, the bulk piles up below log₁₀(1e-4), so a large fraction of labeled deps are basically getting negligible funding share. Above that floor there’s a long right tail: a handful of deps per target hoover up most of the weight. It is not “split the pie evenly across imports.” It’s closer to winner-take-most plus a long tail of near-zero stragglers. Any model that hands back smooth, near-uniform weights across all deps in a repo will look fine on row count and be wrong on the actual jury geometry. You need sharp peaks plus a long tail of tiny values, not a gentle gradient.

Each public target has its own “shape”

Target # deps labeled Max weight Median % rows < 1e-4
ethpandaops/checkpointz 23 0.589 3.3e-4 43%
nomicfoundation/hardhat 69 0.320 4.4e-4 35%
offchainlabs/prysm 70 0.200 5.4e-4 21%

Checkpointz is way more concentrated than hardhat or prysm; few deps eat most of the pie.

What the concentration curves show: Lorenz-style. Plot cumulative jury mass vs fraction of dependencies. Checkpointz’s curve bows hardest; one dep (pk910/dynamic-ssz at 0.59) yanks the curve far above the diagonal early on, so the top few deps dominate immediately. Hardhat is flatter, top weight is 0.32 (ethers-io/ethers.js) and mass is spread across more deps before you hit the long tail. Prysm is the most “egalitarian” of the three. Max single weight is only 0.20, shared among several deps in the ~0.15–0.20 band, but it’s still not uniform; the bottom third of labeled rows are still below 1e-4. Translation: one softmax temperature does not fit all three repos equally well.

Who actually gets funded (top of the public slice)

What the top-weight bar charts show: for each public target, the top 8 labeled deps by jury weight form a clear hierarchy. Not a flat list.

Rough pattern I kept seeing:

  • checkpointz: pk910/dynamic-ssz (0.59), ethpandaops/beacon, attestantio/go-eth2-client
  • hardhat: ethers-io/ethers.js (0.32), immerjs/immer, wevm/viem
  • prysm: consensys/gnark-crypto, libp2p/go-libp2p, ethereum/c-kzg-4844 (each ~0.20)

On checkpointz, #1 dep is roughly #2. On hardhat, ethers.js leads but the next tier (immer, viem) is still real money. On prysm the top tier is a plateau: several crypto/protocol deps clustered together at similar weights, no runaway winner. That repo-specific shape is exactly why pooling all 162 rows to learn one global rule falls over.

Ethereum-native / project-salient deps beat generic toolchain noise, but the signal is repo-specific, which is the annoying part.

What the feature scatter plots show: scatter ethereum_alignment, gitcoin_alignment_score, dependency_out_degree, and PageRank against jury weight (symlog y), colored by target. On hardhat, higher ethereum_alignment on a dep visibly correlates with higher jury weight; ethers.js, viem, etc. sitting upper-right. Pool all three targets into one plot and the correlation weakens or even reverses for some features (Simpson’s paradox, basically). A feature that “works” on hardhat can be useless or misleading on checkpointz or prysm. Corporate flags: same story, sparse on 162 rows so “always zero Microsoft” is directionally correct, not a theorem. Graph centrality (out-degree, PageRank) has a weak monotonic relationship at best; high-centrality toolchain deps often sit at the bottom of the weight scale.

About w_star (the pseudo-labels)

The provided final_solved_w_star.csv is useful but you have to be a little careful with it:

What the w_star vs truth comparison shows: scatter w_star against jury user_weight on the 162 public rows, both axes log-scaled. Rank alignment is great, Spearman ρ is high; sort deps within a repo by w_star and you usually get roughly the right ordering vs the jury. Magnitudes are off though, the cloud sits systematically above or below the diagonal depending on the repo. w_star spreads mass differently than the jurors do, generally smoother or differently peaked. A model trained to minimize L1 against w_star on the hidden repos will get the ordering roughly right but can misallocate the total mass on individual deps. So I use w_star as weak supervision on the ~80 hidden target repos (good ordering prior) and never as ground truth. Public rows always use the actual jury values.

Why the leaderboard looks “stuck” at ~0 provisional

When I looked at historical submissions, pairwise correlation on unlabeled rows was usually ρ ≈ 0.99. Everyone is locking the same 162 rows and then nudging noise on the rest. So I intentionally built models that diverge where it actually matters:

What the submission correlation analysis shows: restrict to the ~3,515 non-public rows and compute pairwise Pearson correlation between my three submission vectors. Historical leaderboard submissions cluster near ρ ≈ 0.99, same public lock, tiny perturbations elsewhere. My three submissions land at 0.43–0.66 pairwise on that hidden slice, with total L1 distance in the 73–96 range depending on the pair. Tree ↔ constraint is the most divergent (ρ ≈ 0.43, L1 ≈ 96.4), and that’s intentional, not training noise.

Pair Pearson (non-public) Total L1 distance
tree ↔ torch 0.66 73.1
tree ↔ constraint 0.43 96.4
torch ↔ constraint 0.57 75.6

Submission 1 vs 3 is my deliberate hedge if the hidden jury penalizes hyperscalers harder than w_star is implying.


Data I used

Everything trains from scratch on local competition artifacts. I did not upload historical leaderboard CSVs as predictions or anything like that.

Official / context (validated at train time):

  • pairs_to_predict.csv, 3,677 rows, fixed order
  • L2PublicEval.csv, 162 jury weights
  • implied zeros on public targets (163 rows on those 3 repos; 1 famous zero is TypeScript→Hardhat)

Features (116 numeric columns after merges):

  • Graph: in/out degree, PageRank, inv-degree (pairs_with_features.csv)
  • GNN: cosine, L2, 16-dim dep embeddings (gnn_features.csv)
  • Jury flags: corporate, ethereum alignment, Gitcoin alignment (jury_features.csv)
  • L1 trial votes → per-dependency win rates / signed log-multipliers (previous_contest_train.csv)
  • Phase-2 ranking methods, AI repo tags, GitHub/tier-B metadata (opus/ folder)
  • Hand-built owner taxonomy (Microsoft/CNCF/golang/…) plus Ethereum keyword hits on slugs

Training frame sanity check from the runner:

{
"rows": 3677,
"target_repos": 83,
"dependency_repos": 1953,
"released_public_rows": 162,
"feature_count": 116
}

Shared pipeline (all three submissions)

Every model goes through the same post-processing. Only the learner and the cap aggressiveness change between subs.

  1. Predict centered log-weights per row (log target minus per-repo mean log target).
  2. Softmax within each target repo with temperature T tuned on public L1 before lock.
  3. Optional caps on “broad gated + low Ethereum signal” deps (mild or strict, depending on sub).
  4. lock_public: paste exact jury values onto all public-target rows; renormalize only the 80 hidden repos.
  5. Assert: 3,677 rows, weights sum to 1 per repo, public L1 = 0.

What the simplex validation shows: after lock_public, sum of weights per target repo should be exactly 1.0 for all 83 repos. Checked all 83 group sums post-lock: every repo lands at 1.0 within floating-point tolerance (max deviation ~3.8e-10). The three submissions overlap almost perfectly on this check because the lock step forces the same public slice; differences live entirely on the hidden repos after renormalization. Provisional 0.0000 on the portal is consistent with nailing the public slice. The grader visible to us is basically verifying the lock, not scoring the hidden ~3,515 pairs.

Why the portal score is 0.0000: the visible grader is basically just checking that you nailed the public slice. The final ranking is on the ~3,515 unlabeled pairs. That’s where the actual prize is decided.


The three models (what’s different)

I wanted three bets, not three seeds of the same bet. Each one is making a different claim about what the hidden jury cares about.

1. submission_1_tree_public_pseudo.csv, “trust the features + pseudo”

Learner: HistGradientBoostingRegressor on all 116 features (median impute, 650 trees, lr 0.035).
Sample weights: pseudo 0.8, public 80×. Jury rows dominate the loss by a lot.
Post-processing: temperature T = 0.95, no gated caps.

Before lock, public L1: 0.36 (best of the three)
Per repo: checkpointz 0.17 · hardhat 0.12 · prysm 0.08

Role: closest to w_star on hidden repos (mean per-repo L1 vs pseudo ≈ 1.0). If the hidden jury basically looks like the inverse solver, this is my anchor.


2. submission_2_torch_softprior.csv, “neural + soft anti-gate prior”

Learner: small MLP (128→96→48→1), AdamW, up to ~850 epochs, GPU if available.
Loss: weighted MSE on centered log-targets plus a penalty that pushes down logits on gated_low_eth rows (corporate/foundation/generic + low ETH signal). Soft, not hard zeros.
Inference nudge: -0.20 × gated_low_eth + 0.10 × funding_priority_soft
Sample weights: pseudo 0.55, public 110×
Post-processing: T = 1.05, mild caps (0.0025 / 0.0125 on gated-low-eth tiers)

Before lock, public L1: 0.43
checkpointz 0.14 · hardhat 0.11 · prysm 0.18

Role: middle ground. Still data-driven, but encodes some “funding allocator” logic directly in the loss. ρ ≈ 0.66 vs tree on hidden rows.


3. submission_3_constraint_scorer.csv, “interpretable hedge”

Learner: Ridge (α = 8) on ~20 interpretable features only. Graph, ETH signals, L1 vote stats, gate flags. Nothing fancy.

Then explicit score shifts (hand-tuned, all documented in code):

pred += 0.55 * funding_priority_soft
pred += 0.25 * same_owner
pred -= 1.15 * gated_low_eth
pred -= 0.35 * curated_sponsored_indie

Sample weights: pseudo 0.30, public 130×. Least trust in pseudo, most trust in the public shape.
Post-processing: T = 1.75 (softer distribution), strict caps (down to 0.0005 on gated-low-eth)

Before lock, public L1: 3.44 (yes, worst, that’s on purpose)
checkpointz 1.34 · hardhat 0.70 · prysm 1.39

Role: if the hidden jury turns out to be more allergic to toolchain/corporate deps than w_star implies, this is the out-of-distribution play. Lowest correlation with tree on hidden rows (ρ ≈ 0.43). It’s the one I’d be most embarrassed about if jurors love ethers.js-style toolchain, and most vindicated by if they really don’t.

What the model disagreement example shows: pick one non-public target repo and plot top-12 dependency weights from each submission. Tree and torch usually agree on the ranking of the top few ETH-native deps but disagree on how much mass each gets; tree concentrates more sharply (lower effective temperature). Constraint scorer systematically suppresses deps flagged as corporate/toolchain/generic-gated and boosts same-owner and funding-priority deps, even when the graph features would rank them lower. On repos where the dependency list mixes hyperscaler libraries with small Ethereum-native packages, tree might still hand non-trivial weight to the former; constraint often drives those toward the cap floor and redistributes mass to mid-tier protocol deps. That’s the hedge in concrete terms, not just different hyperparameters but different inductive bias on who deserves funding.


Validation (what I actually checked, honestly)

Leave-one-public-repo-out

Train on 2 of {checkpointz, hardhat, prysm}, tune temperature, measure L1 on the held-out one. Held-out L1 is not pretty (~1.1–2.0). Three repos with totally different concentration profiles aren’t really interchangeable. I still use LOO to compare model families, not to claim SOTA generalization.

Model Hold checkpointz Hold hardhat Hold prysm
tree 1.53 1.33 1.16
torch 1.30 1.10 1.30
constraint 1.41 1.98 1.41

Constraint falls apart hardest when hardhat is held out (1.98), which kinda makes sense given that hardhat’s feature/weight relationships are the clearest in the public slice, and constraint’s hand rules are partly tuned to the patterns visible there. Lesson noted.

After lock

Submission Public L1 before lock Public L1 after lock
tree 0.36 0
torch 0.43 0
constraint 3.44 0

All three pass row count, order, simplex, and exact public values.


What I’d do differently with more time

  • Per-repo temperature learned from labeled entropy (checkpointz wants a different sharpness than prysm, obvious in hindsight, didn’t have time to actually wire up).
  • Pairwise / Plackett–Luce on public rows instead of only pointwise L1 on weights. Would probably help.
  • More jury text. L1 trial reasoning is mostly “technical importance,” funding language is thin, but RAG over juror comments might help.
  • OSO / funding history features for “already funded” signal beyond owner heuristics.
  • Clearer frozen rules earlier. Less rework when public-lock semantics and handoff file names shifted mid-contest. Not blaming anyone, just a thing.

Reproducibility

cd colab_scratch_l3_package
pip install pandas numpy scikit-learn torch scipy
python colab_scratch_train.py --epochs 850 # full run + LOO
# outputs/submission_*.csv, metrics.json, run_outputs.zip

Seed: 20260526
Colab: colab_scratch_training.ipynb (upload package zip, run all, download run_outputs.zip)
Machine-readable metrics: outputs/metrics.json
Human summary from last train: outputs/RUN_SUMMARY.md


Files attached to this post

Artifact Purpose
submission_1_tree_public_pseudo.csv Boosted trees, no caps
submission_2_torch_softprior.csv MLP + soft gate prior, mild caps
submission_3_constraint_scorer.csv Ridge + explicit funding shifts, strict caps
colab_scratch_train.py Single entrypoint for all three
outputs/run_outputs.zip CSVs + LOO table + diversity + metrics

Closing thought

Level III honestly feels like 162 labeled points controlling a 3,677-row simplex, and provisional zero is the easy part of that. I tried to be honest about that here: one submission stays close to the community’s w_star geometry, one learns a soft funding prior in neural form, and one bets harder against centralized/toolchain deps where the public slice is already kind of hinting jurors say “important in code, not in funding.”

If the committee has feedback on whether that hedge is sensible or just overfit to three repos, I’d genuinely like to hear it. Like, that’s the part I’m least sure about and the part that’s hardest to validate from inside the data.

Thanks for running this. The problem is weird in a good way.

P.S. I don’t know how to upload my files, will figure it out after some rest.

bobs

Hello,

I’m duemelin

I wrote my submisssion as an html, you can find it here -

https:// idealistic-horse.staticdomains.app/deep

Deep Funding GG24 — Level III Model Submission Writeup

Author: duemelin


1. Executive Summary

This writeup documents my approach to the Deep Funding Level III Challenge, where the objective is to predict dependency weights for 3,677 dependency pairs across 83 parent repositories in the Ethereum ecosystem. This level focuses on Level 2 dependencies—the transitive dependencies of the core 98 Ethereum Level 1 repositories.

Key Achievements:

  • Comprehensive exploratory data analysis of the dependency graph
  • Feature engineering combining graph metrics, GNN embeddings, and domain-specific signals
  • Analysis of best-performing methodologies achieving scores as low as 0.1909

2. Competition Overview

Attribute Value
Level Level III (L2 Dependencies)
Prize Pool $5,000 (1st: $2,500 · 2nd: $1,500 · 3rd: $1,000)
Writeup Prize Share of $10,000 pool across all levels
Start Date March 9, 2026 (17:00 UTC)
End Date May 26, 2026 (11:59 UTC)
Evaluation Sum of Absolute Errors vs. Jury Weights

Task Definition

For each of the 83 parent repositories, predict the relative importance weight of each dependency:

dependency,repo,weight
djc/rustc-version-rs,0xmiden/miden-vm,0.017594
rustcrypto/sponges,0xmiden/miden-vm,0.010545
...

Hard Constraint: Σ weight = 1.0 for each unique parent repo.

Scoring Methodology

The competition uses a sophisticated scoring approach based on human jury pairwise comparisons:

  1. Jurors provide pairwise comparisons between repos (e.g., “solidity is 2× more important than geth”)
  2. Log-transform ratios to convert multiplicative relationships to additive differences
  3. Huber-loss minimization to recover latent importance scores (robust to outliers)
  4. Exponentiate to recover positive weights
  5. Evaluation: Sum of absolute errors between predicted and jury-derived weights

3. Exploratory Data Analysis

3.1 Dataset Overview

Dataset Rows Description
official_l3_pairs_to_predict_3677_rows.csv 3,677 Official prediction target
l2-predictions-example.csv 3,677 Example submission format
L2PublicEval.csv 162 Ground truth for 3 parent repos
pairs_with_features.csv 3,677 Graph structural features
jury_features.csv 3,677 Domain alignment features
gnn_features.csv 3,677 GNN embedding features
final_solved_w_star.csv 3,677 Inverse-optimized weights

3.2 L3 Prediction Target Analysis

Key Statistics:

Metric Value
Total dependency pairs 3,677
Unique parent repositories 83
Unique dependencies 1,953
Mean dependencies per parent 44.3
Median dependencies per parent 46
Min dependencies per parent 2
Max dependencies per parent 70

Distribution of Dependencies per Parent

count    83.000000
mean     44.301205
std      22.919123
min       2.000000
25%      24.000000
50%      46.000000
75%      70.000000
max      70.000000

Parent Repos with Most Dependencies (70 each):

  • blockscout/blockscout
  • chainsafe/lodestar
  • cyfrin/aderyn
  • foundry-rs/foundry
  • grandinetech/grandine
  • sigp/lighthouse
  • nomicfoundation/hardhat

Parent Repos with Fewest Dependencies:

Repository Dependencies
ipsilon/evmone 2
arkworks-rs/algebra 5
supranational/blst 8
a16z/halmos 9
trueblocks/trueblocks-core 10

3.3 Dependency Namespace Analysis

Top 15 Dependency Namespaces:

Namespace Count Domain
rustcrypto 126 Cryptographic primitives
rust-lang 87 Rust standard ecosystem
dtolnay 75 Rust utilities (serde, proc-macro)
ethereum 67 Ethereum-specific libraries
alloy-rs 57 Ethereum Rust tooling
tokio-rs 46 Async runtime
status-im 36 Status network libraries
microsoft 35 TypeScript and tooling
serde-rs 31 Serialization
rust-num 30 Numeric types
paritytech 29 Parity/Polkadot ecosystem
arkworks-rs 28 ZK-SNARK libraries
burntsushi 26 High-performance Rust libs
prettier 25 Code formatting
libp2p 23 P2P networking

3.4 Dependency Sharing Analysis

Cross-Repository Dependency Statistics:

Metric Value
Dependencies appearing in multiple parents 609 (31.2%)
Dependencies unique to single parent 1,344 (68.8%)

Most Commonly Shared Dependencies:

Dependency Parent Count Description
clap-rs/clap 21 CLI argument parser
microsoft/typescript 19 TypeScript compiler
rustcrypto/utils 17 Crypto utilities
serde-rs/serde 17 Serialization framework
definitelytyped/definitelytyped 17 TypeScript definitions
rustcrypto/traits 16 Crypto trait interfaces
eslint/eslint 15 JS linting
tokio-rs/tokio 14 Async runtime
ethers-io/ethers.js 14 Ethereum JS library

3.5 Ground Truth Analysis (L2 Public Labels)

The released public labels provide ground truth for 3 parent repositories:

ethpandaops/checkpointz (23 dependencies)

Dependency Weight % Share
pk910/dynamic-ssz 0.5892 58.92%
ethpandaops/beacon 0.2545 25.45%
attestantio/go-eth2-client 0.1242 12.42%
ethpandaops/ethwallclock 0.0161 1.61%
pkg/errors 0.0049 0.49%

Pattern: Single dominant dependency (58.9%) with rapid weight decay. Top 3 capture 96.79%.

offchainlabs/prysm (70 dependencies)

Dependency Weight % Share
consensys/gnark-crypto 0.2000 20.00%
libp2p/go-libp2p 0.2000 20.00%
ethereum/c-kzg-4844 0.2000 20.00%
libp2p/go-libp2p-pubsub 0.1000 10.00%
btcsuite/btcd 0.0363 3.63%

Pattern: Multiple dependencies share top positions (three-way tie at 20%).

nomicfoundation/hardhat (69 dependencies)

Dependency Weight % Share
ethers-io/ethers.js 0.3200 32.00%
immerjs/immer 0.1100 11.00%
wevm/viem 0.1100 11.00%
mochajs/mocha 0.0700 7.00%
nicolo-ribaudo/solc-js 0.0600 6.00%

Pattern: Clear dominant dependency (ethers.js at 32%), followed by secondary tier.


4. Feature Engineering

4.1 Graph Structural Features

From pairs_with_features.csv:

Feature Description Formula
dependency_pr PageRank of dependency Standard PageRank algorithm
dependency_out_degree Out-degree of dependency Count of outgoing edges
dependency_in_degree In-degree of dependency Count of incoming edges
model_1_uniform Uniform baseline 1/n per parent group
model_2_pagerank PageRank-based weight Normalized PageRank
inv_deg Inverse degree 1/(out_degree + 1)
model_3_inv_degree Normalized inverse degree Softmax of inv_deg

Sample Data (0xmiden/miden-vm):

Dependency PageRank Out-Degree Inv-Degree Weight
facebook/winterfell 0.000246 1 0.0285
ssheldon/rust-block 0.000246 1 0.0285
tokio-rs/loom 0.000246 1 0.0285
clap-rs/clap 0.000246 21 0.0026
serde-rs/serde 0.000246 17 0.0032

4.2 Jury Alignment Features

From jury_features.csv:

Feature Type Description
is_corporate_backed Binary 1.0 if backed by major corp (Facebook, Microsoft)
ethereum_alignment Float [0,1] Ethereum ecosystem specificity
gitcoin_alignment_score Float [0,1] Alignment with Gitcoin funding priorities
funding_utility_discount Float [0,1] Discount for corporate-backed projects

Key Insight: Dependencies from rustcrypto/* receive gitcoin_alignment_score = 0.6, while general utilities receive 0.0.

4.3 GNN Embedding Features

From gnn_features.csv:

  • 16-dimensional embeddings (gnn_dep_emb_0 through gnn_dep_emb_15)
  • Similarity metrics:
    • gnn_cosine: Cosine similarity between dependency and parent embeddings
    • gnn_l2: L2 distance between embeddings

Sample GNN Cosine Similarities:

Dependency Parent Cosine Sim
luser/strip-ansi-escapes 0xmiden/miden-vm 0.758
facebook/winterfell 0xmiden/miden-vm 0.758
rust-random/rand 0xmiden/miden-vm 0.747
djc/rustc-version-rs 0xmiden/miden-vm 0.728

4.4 Inverse-Optimized Weights (w*)

From final_solved_w_star.csv — weights computed by solving the inverse optimization problem on public labels:

Sample Solved Weights (0xmiden/miden-vm):

Dependency Solved w*
0xpolygonmiden/crypto 0.2364
dtolnay/syn 0.2094
blake3-team/blake3 0.0912
amanieu/parking_lot 0.0809
rust-num/num-traits 0.0455
rayon-rs/rayon 0.0438

Key Insight: The solved weights show a much flatter distribution than raw graph metrics, with cryptographic dependencies receiving higher weights.


5. Analysis of Best-Performing Approaches

5.1 Leaderboard Performance Summary

Based on the reference submissions bundle:

Submission Score Method
dq3_v10_ANTI_sparse_s09_a030 0.1909 Anti-gradient descent
dq3_v10_ANTI_sparse_s09_a020 0.1915 Anti-gradient descent
anchor_0p1884 0.1884 Anchor-based optimization
codex_u016_top03_anti 0.1893 Codex ensemble
dq3_v10_ANTI_sparse_s09_a010 0.1924 Anti-gradient descent

5.2 Key Methodological Insights

A. Anti-Gradient Descent

The best-performing approach uses anti-gradient descent — iteratively adjusting weights in the direction that minimizes error on the public evaluation set:

# Pseudocode
for iteration in range(max_iters):
    error = evaluate(current_weights, public_labels)
    gradient = compute_gradient(current_weights, public_labels)
    current_weights -= learning_rate * gradient
    # Apply sparsity constraint (s=0.9 means 90% sparsity)
    current_weights = apply_sparsity(current_weights, sparsity=0.9)

Key Hyperparameters:

  • Sparsity parameter s=0.9: Concentrates weight on top 10% of dependencies
  • Alpha parameters (a0030, a0020): Learning rate multipliers
  • Temperature scaling for softmax normalization

B. Ensemble Methods

Multiple successful approaches use ensemble techniques:

  1. Median Ensemble: Take median prediction across multiple models
  2. Bootstrap Ensemble: Train models on bootstrap samples, average predictions
  3. Stack Ensemble: Train meta-learner on out-of-fold predictions

C. Temperature-Scaled Softmax

Critical lesson from failed experiments:

DO NOT USE STANDARD SOFTMAX — it creates spiky distributions that incur catastrophic penalties under Huber loss.

Instead, use temperature-scaled softmax with T = 25:

w_i = exp(score_i / T) / Σ_j exp(score_j / T)

Higher temperature produces flatter distributions that match jury expectations.

5.3 Failed Approaches (Lessons Learned)

Approach Score Why It Failed
GitHub Stars Heuristic 0.4545 Popularity ≠ Systemic Criticality
Semantic Cross-Encoder 0.6773 Softmax spikes, overfitting on 98 samples
Pure Market Prior 0.4400 Market traders ≠ Expert jury
ELO Exploit 0.4269 Phase 2 ELO ≠ Phase 1 ground truth

6. Methodology

6.1 Mathematical Framework

Following the Deep Funding whitepaper:

Step 1 — Pairwise Ratio Prediction:
For each pair (i, j) within a parent group, estimate:

r_ij = importance(i) / importance(j)

Step 2 — Log Transform:

d_ij = log(r_ij)

Step 3 — Incidence Matrix Construction:
Build matrix A ∈ ℝ^(m×n) where:

  • A[k, i] = +1 (repo i is numerator)
  • A[k, j] = -1 (repo j is denominator)

Step 4 — Huber-Robust IRLS Optimization:

x* = argmin_x Σ_k L_δ((Ax)_k - d_k)

where L_δ(r) = {
    ½ · r²            if |r| ≤ δ
    δ · (|r| - ½δ)    if |r| > δ
}

Step 5 — Scale Recovery:

w_i = exp(x_i*)

Step 6 — Normalization:

w_i ← w_i / Σ_j w_j

6.2 Feature-Based Model Pipeline

Input: pairs_to_predict.csv
   ↓
Feature Engineering:
   • Graph features (PageRank, degree)
   • GNN embeddings + cosine similarity
   • Jury alignment features
   ↓
Model Training:
   • XGBoost/LightGBM regressor
   • Custom Huber loss approximation
   • K-Fold CV on proxy target
   ↓
Post-Processing:
   • Temperature-scaled softmax (T=25)
   • Lock public label weights
   • Per-parent normalization
   ↓
Output: submission.csv

6.3 Validation Strategy

  1. Public Label Locking: Fix weights for the 162 rows with known ground truth
  2. Per-Parent Sum Validation: Ensure Σw = 1.0 for each parent
  3. Distribution Shape: Match weight distribution to ground truth patterns (long-tail, not spiky)

7. Complete Parent Repository List

Click to expand full list of 83 parent repositories
# Repository Deps Org
1 0xmiden/miden-vm 69 0xmiden
2 a16z/halmos 9 a16z
3 a16z/helios 66 a16z
4 aestus-relay/mev-boost-relay 41 aestus
5 alloy-rs/alloy 16 alloy
6 apeworx/ape 38 ape
7 argotorg/fe 61 argotorg
8 argotorg/hevm 12 argotorg
9 argotorg/solidity 13 argotorg
10 argotorg/sourcify 63 argotorg
11 arkworks-rs/algebra 5 arkworks
12 axiom-crypto/snark-verifier 49 axiom
13 blockscout/blockscout 70 blockscout
14 certora/certoraprover 66 certora
15 chainsafe/bls 29 chainsafe
16 chainsafe/lodestar 70 chainsafe
17 commit-boost/commit-boost-client 37 commit-boost
18 consensys/gnark-crypto 11 consensys
19 consensys/teku 49 consensys
20 cyfrin/aderyn 70 cyfrin
21 deepfunding/dependency-graph 27 deepfunding
22 defillama/chainlist 15 defillama
23 defillama/defillama-adapters 44 defillama
24 dl-solarity/solidity-lib 38 dl-solarity
25 edb-rs/edb 70 edb
26 erigontech/erigon 70 erigon
27 erigontech/silkworm 17 erigon
28 espressosystems/jellyfish 15 espresso
29 eth-infinitism/account-abstraction 28 eth-infinitism
30 ethdebug/format 70 ethdebug
31 ethereum/consensus-specs 19 ethereum
32 ethereum/eips 43 ethereum
33 ethereum/execution-apis 15 ethereum
34 ethereum/go-ethereum 67 ethereum
35 ethereum/js-ethereum-cryptography 70 ethereum
36 ethereum/web3.py 13 ethereum
37 ethers-io/ethers.js 24 ethers
38 ethpandaops/checkpointz 23 ethpandaops
39 ethstaker/eth-docker 12 ethstaker
40 ethstaker/ethstaker-deposit-cli 51 ethstaker
41 evmts/tevm-monorepo 59 evmts
42 flashbots/mev-boost 46 flashbots
43 flashbots/mev-boost-relay 33 flashbots
44 flashbots/rbuilder 70 flashbots
45 foundry-rs/foundry 70 foundry
46 grandinetech/grandine 70 grandine
47 holiman/goevmlab 37 holiman
48 hyperledger/besu 46 hyperledger
49 ipsilon/evmone 2 ipsilon
50 l2beat/l2beat 70 l2beat
51 lambdaclass/ethrex 70 lambdaclass
52 lambdaclass/lambda_ethereum_consensus 47 lambdaclass
53 lambdaclass/lambdaworks 41 lambdaclass
54 nethereum/nethereum 32 nethereum
55 nethermindeth/juno 70 nethermind
56 nethermindeth/nethermind 52 nethermind
57 nomicfoundation/hardhat 70 nomic
58 offchainlabs/prysm 70 offchainlabs
59 offchainlabs/stylus-sdk-rs 70 offchainlabs
60 openzeppelin/openzeppelin-contracts 33 openzeppelin
61 otterscan/otterscan 70 otterscan
62 paradigmxyz/reth 61 paradigm
63 powdr-labs/powdr 49 powdr
64 protofire/solhint 39 protofire
65 remix-project-org/remix-project 70 remix
66 risc0/risc0-ethereum 70 risc0
67 safe-global/safe-smart-account 24 safe
68 scaffold-eth/scaffold-eth-2 48 scaffold-eth
69 shazow/whatsabi 17 shazow
70 sigp/lighthouse 70 sigp
71 status-im/nimbus-eth2 48 status
72 succinctlabs/op-succinct 70 succinct
73 succinctlabs/rsp 70 succinct
74 succinctlabs/sp1 70 succinct
75 supranational/blst 8 supranational
76 swiss-knife-xyz/swiss-knife 70 swiss-knife
77 taikoxyz/taiko-mono 70 taiko
78 trueblocks/trueblocks-core 10 trueblocks
79 vyperlang/titanoboa 26 vyper
80 vyperlang/vyper 10 vyper
81 wealdtech/ethdo 26 wealdtech
82 wevm/viem 28 wevm
83 wighawag/hardhat-deploy 20 wighawag

8. Key Insights & Recommendations

8.1 What Works

  1. Anti-Gradient Descent with high sparsity (s=0.9) achieves best scores (~0.19)
  2. Temperature scaling (T=25) prevents distribution spikes
  3. Public label locking ensures perfect score on known ground truth
  4. Graph-based features (PageRank, degree) capture structural importance
  5. Ensemble methods reduce variance and improve robustness

8.2 What Doesn’t Work

  1. GitHub popularity metrics (Stars/Forks) — measures mindshare, not criticality
  2. Standard Softmax — creates catastrophic spikes under Huber loss
  3. Zero-shot LLM inference — overfits without proper distribution mapping
  4. Direct ELO mapping — Phase 2 data doesn’t match Phase 1 ground truth

8.3 The Core Insight

Systemic Criticality ≠ Popularity

A critical Ethereum consensus client with 3,000 stars may be far more important than a popular frontend library with 160,000 stars. The jury evaluates ecosystem importance, not developer mindshare.


9. Submission Files

File Rows Columns Validation
submission_level3.csv 3,677 dependency, repo, weight Σ weight = 1.0 per parent

10. Reproducibility

Environment

Python 3.10+
pandas >= 2.0
numpy >= 1.24
scipy >= 1.10
xgboost >= 1.7
lightgbm >= 3.3
torch >= 2.0 (optional for MLP)

Key Scripts

  • anti_gradient.py — Anti-gradient descent optimizer
  • ensemble_model_with_cache.py — Ensemble training pipeline
  • eval_fun.py — Evaluation and scoring utilities
  • inverse_v4_zipf.py — Inverse optimization solver

Hi, i’m koonhred, my submission is hosted here: https: //leafy-arithmetic-c0e4c2. netlify.app/

GG24 Deep Funding — Level 3 Writeup · Part 1: Exploratory Data Analysis

Before fitting any model, we need to understand the shape of the prediction surface. This part is purely about the data: what the 3,677 pairs are, where the supervision actually lives, and which structural features any sane Level-3 model has to respect.

Each finding below ends with a boxed hypothesis that directly motivates a modeling decision in Part 2. The EDA is organized around the question: “what does the data tell us we should do?”


1. TL;DR

Dimension Finding Modeling consequence
Task 3,677 (parent, dep) pairs; 83 parents; 1,953 deps; per-parent sum-to-1 83 independent within-parent allocation problems
Supervision Only 3/83 parents labeled (162 L2 pairs); median label coverage of cold-start parents = 1.8% Must use shared feature space — per-parent fitting impossible
Label shape 5+ orders of magnitude; log-linear R² ≈ 0.96; Zipf s ≈ 1.78 Model in log-space; Bradley-Terry is the natural family
Loss 21.6% of pairwise log-ratios exceed 5
Truncation Hard cap at K=70 deps/parent; 25/83 parents at cap Don’t model the missing tail — it’s not in the prediction set
Commodity deps clap, serde, typescript in 13–21 parents; Ethereum deps carry 50–200x more weight Semantic dep classification, not raw frequency, drives correction
Graph 22 dual-role repos; near-fully-connected bipartite graph (95.4%); PPR weakly predictive Graph features are usable but need per-parent correction
Language 66% same-language edges; Rust (7 parents) has zero labeled parents Language is a strong grouping signal; Rust is the biggest transfer risk
Uncertainty Head deps have wider prediction intervals than tail Budget modeling effort on the head — tail follows log-linear trend

2. The competition (Level 3 framing)

The Deep Funding challenge asks model builders to allocate weights across an open-source Ethereum dependency graph. Level 3 is the dependency-graph layer: for each parent repo, distribute weight across that parent’s actual on-graph dependencies, in proportion to the value those dependencies contribute to the parent.

Submissions are scored using a Huber loss on log-scale differences of pairwise jury judgments — i.e., what the model needs to get right is relative log-magnitude between any two dependencies of the same parent, robust to outlier opinions. Numeric scale is per-parent and weights sum to 1 within a parent group.

This framing has two immediate consequences for EDA:

  1. Everything interesting lives in log space. Anything we plot in linear units will under-state the bulk of the dynamic range.
  2. Independence between parent groups. Errors don’t propagate across parents, so we can think of L3 as 83 independent within-parent ranking problems, joined only by shared dependency features.

3. The data

Two files in scope for this analysis:

File Rows Cols What it is
official_l3_pairs_to_predict_3677_rows.csv 3,677 dependency, repo The competition prediction set — one row per (parent, dependency) pair that needs a weight.
released_public_labels_L2PublicEval_162_rows.csv 162 repo_url, dep_url, user_weight Publicly released jury-derived weights from the Level 2 eval set, on the same pair grammar as L3.

Quick integrity checks:

  • Zero missing values, zero duplicate rows in either file.
  • All 162 L2 pairs are a strict subset of L3 pairs (pair-level intersection = 162). The L2 file is therefore a directly-usable training oracle for the three parents it covers — not a separate evaluation universe with its own grammar.
  • Per-parent L2 weights sum to 1.0000 in all 3 groups (verified to 4 decimal places). Normalization is already done for us.

4. Findings

4.1 Parents are heavy-tailed in dependency count — and tail-truncated at K = 70

Plotting the number of dependencies per parent (sorted descending, log scale) reveals a smooth decay with a hard ceiling at 70. Of 83 parents, 25 sit at exactly the cap.

Stat Value
Parents 83
Median deps/parent 46
25th / 75th percentile 24 / 70
Max 70
Parents at the cap (70) 25 / 83

The hard ceiling at 70 is the most consequential structural fact in the dataset. About 30% of parents have had their long tail truncated by the organizers before the prediction set was published. Any model whose value comes from estimating obscure tail dependencies will have nothing to show for that work — there’s no row to attach the prediction to.

Conversely, the 58 parents with fewer than 70 deps likely have all of their meaningful dependencies in the prediction set, which is the regime where calibration on the bulk of the distribution matters most.

Bucket-level shape:

Bucket # parents
2 – 5 2
6 – 20 17
21 – 50 29
51 – 70 35

No singletons and no super-fat parents above the cap — a fairly homogeneous regime of medium-sized groups. The smallest: ipsilon/evmone (2 deps), arkworks-rs/algebra (5), supranational/blst (8) — tight C++/Rust crypto projects where the dependency list really is short.

4.2 The top of the dependency-count distribution

The top 20 parents by dependency count are dominated by client implementations and developer frameworks:

Rank Parent # deps
1–15 (tied at cap) chainsafe/lodestar, blockscout/blockscout, sigp/lighthouse, nethermindeth/juno, offchainlabs/stylus-sdk-rs, offchainlabs/prysm, nomicfoundation/hardhat, remix-project-org/remix-project, risc0/risc0-ethereum, flashbots/rbuilder, l2beat/l2beat, lambdaclass/ethrex, grandinetech/grandine, foundry-rs/foundry, ethereum/js-ethereum-cryptography 70
16 ethereum/go-ethereum 67
17 argotorg/sourcify 63
18 ethereum/consensus-specs 62
19 certora/CertoraProver 57
20 nethereum/nethereum 56

These are consensus clients, execution clients, L2 stacks, and tooling hubs. For these parents, the top-K cap is most likely to be binding. Strategy: budget more modeling effort on the head of each parent’s distribution — the top-5 deps probably absorb >50% of weight even before fitting.

4.3 Most dependencies live under exactly one parent — but a small commodity tail is everywhere

About 1,200 of 1,953 dependencies appear under exactly one parent. A small set appears under many:

Dep # parents What it is
clap-rs/clap 21 Rust CLI parser
microsoft/typescript 19 TS compiler
definitelytyped/definitelytyped 17 TS type definitions
serde-rs/serde 17 Rust serialization
rustcrypto/utils 17 Rust crypto primitives
eslint/eslint 15 JS linter
tokio-rs/tokio 14 Rust async runtime
rust-random/rand 14 Rust RNG
prettier/prettier 13 JS formatter

These most-shared dependencies are not Ethereum-specific — they’re language-ecosystem commodities. A naïve PageRank prior will rank them near the top of every parent. Expect systematic downward correction vs. a graph-only baseline.

4.4 L2 supervision: rare but extremely informative

L3 has 3,677 pairs to predict. L2 has 162 labeled pairs. All 162 are a strict subset of L3 — the overlap is exact.

The L2 public label set covers 3 parents: offchainlabs/prysm (70 deps), nomicfoundation/hardhat (69 deps), ethpandaops/checkpointz (23 deps). 80 of 83 parents are cold-start.

4.5 The L2 label distribution: 5+ orders of magnitude per parent

For all three labeled parents, weights drop from 0.2–0.6 at the top to 1e-5 to 1e-6 at the tail, on a roughly log-linear slope:

Parent n Top-1 share Top-3 share Gini Entropy (nats)
ethpandaops/checkpointz 23 0.589 0.968 0.900 1.08
offchainlabs/prysm 70 0.200 0.600 0.868 2.45
nomicfoundation/hardhat 69 0.320 0.540 0.868 2.45
Mean 0.370 0.703 0.879

Three observations:

  1. The decline is approximately log-linear within each parent — exactly what a Bradley-Terry-style latent-value model produces.
  2. Smaller parents concentrate more aggressively (checkpointz top-1 = 0.59 vs prysm top-1 = 0.20). Mechanical: more deps to distribute over means lower top share.
  3. The bottom 30–40% of deps carry weight on the order of 1e-4 to 1e-6. Under Huber-on-log-ratio loss, getting the order of magnitude right for these matters as much as getting the top-1 share right.

4.6 The DAG structure: 22 repos are both parents and dependencies

alloy-rs/alloy            ethereum/go-ethereum     supranational/blst
arkworks-rs/algebra       ethereum/web3.py         succinctlabs/sp1
consensys/gnark-crypto    ethers-io/ethers.js      vyperlang/vyper
ethereum/eips             nomicfoundation/hardhat  wevm/viem
ethereum/execution-apis   openzeppelin/o-contracts wighawag/hardhat-deploy
eth-infinitism/account-abstraction  protofire/solhint
ethdebug/format           shazow/whatsabi
a16z/halmos               argotorg/sourcify

This gives Level 3 a genuine multi-level DAG structure — usable for cross-level graph features and consistency constraints.

4.7 Organizational coverage

The 83 parents span 60 distinct GitHub organizations:

Owner # parent repos
ethereum 6
argotorg 4
flashbots, lambdaclass, succinctlabs 3 each
consensys, defillama, erigontech, chainsafe, offchainlabs, ethstaker, a16z, nethermindeth, vyperlang 2 each

The 39 single-repo orgs account for 47% of parents. Org-level features are usable but only as a weak signal.


5. Deep-Dive: Hypothesis-Generating Analyses

Every section below ends with a Hypothesis box that Part 2 will reference. The goal: make every modeling decision traceable to an EDA finding.

5.1 Rank-weight curve fitting: log-linear wins decisively

For each labeled parent, we fit three functional forms to the rank-weight relationship in log-space:

Model Functional form checkpointz R² prysm R² hardhat R² Mean R²
Log-linear (Bradley-Terry) log(w) = a + b·rank 0.954 0.965 0.982 0.967
Power-law w = a · rank^(-s) −0.465 −0.527 −0.187 −0.393
Exponential w = a · exp(−λ·rank) 0.076 −3.384 −10.664 −4.657

Log-linear dominates. Both power-law and exponential have negative R² in log-space (worse than predicting the mean). The data’s generating process is consistent with a latent-value model where log-differences between items are approximately constant per rank increment.

Log-linear fit parameters:

  • checkpointz: slope = −0.506
  • prysm: slope = −0.120
  • hardhat: slope = −0.142

The slope magnitude inversely tracks group size — smaller groups decay faster, consistent with the concentration analysis in §4.5.

Hypothesis A1. The weight distribution within each parent is generated by a latent-value process where log(w) is linear in rank. Bradley-Terry is the correct model family; log-space is the natural representation.

5.2 Pairwise log-ratios: the case for Huber over MSE

We computed all C(n,2) pairwise log-ratios within each labeled parent — 5,014 pairs total:

Statistic Value
Total pairwise log-ratios 5,014
Range [−1.45, +12.44]
log-ratio
log-ratio

Over a fifth of all pairwise comparisons involve log-ratios exceeding 5 — i.e., one dependency is >150x more important than the other. Under MSE, these extreme pairs would each contribute ~25x more loss than a median pair, completely dominating the gradient. Huber loss with delta ≈ 1.35 caps their influence at ~6x a median pair.

Hypothesis A2. Huber loss is not merely the competition’s eval metric — the label distribution has exactly the extreme-pair structure Huber was designed for. Any model trained under MSE would overfit to the top-1 / bottom-1 pair and underfit the informative middle range.

5.3 Ecosystem clustering: parents share deps in interpretable groups

Computing pairwise Jaccard similarity of dependency sets across all 83 parents and applying hierarchical clustering reveals clear ecosystem groups despite very low mean similarity (0.019):

Cluster Parents Theme
Rust ZK / Proving miden-vm, lambdaworks, powdr, risc0-ethereum, stylus-sdk-rs, snark-verifier Rust ZK stack
MEV Relay aestus/mev-boost-relay, flashbots/mev-boost-relay, checkpointz, ethdo Go MEV infra
Go Execution go-ethereum, mev-boost, goevmlab Go core EL
Solidity Tooling hardhat, openzeppelin, safe-smart-account, scaffold-eth-2, account-abstraction, dl-solarity TS/Sol dev tools
Go Consensus erigon, prysm Go CL clients
JS Crypto chainsafe/bls, js-ethereum-cryptography JS crypto primitives

Key statistics:

  • Mean pairwise Jaccard (off-diagonal): 0.019 — most parents are mostly independent
  • Max pairwise Jaccard: 0.805aestus-relay/mev-boost-relay vs flashbots/mev-boost-relay (they’re forks)
  • Total clusters at Jaccard > 0.15: 11 multi-parent clusters containing 37 parents; 46 singletons

Hypothesis B1. Parents within the same ecosystem cluster share enough deps that weight priors learned from one parent should transfer to cluster-neighbors. Cluster membership is a usable grouping variable for regularization.

5.4 Cross-parent weight correlation: transfer works through features, not identities

Only 8 dependencies appear in >=2 labeled parents. For those shared deps, the Spearman correlation of weights across parents is effectively zero:

Parent pair Shared deps Spearman rho p-value
checkpointz vs prysm 8 −0.048 0.91

This is a negative result for naive identity-based transfer (“dep X has weight 0.01 in prysm, so give it 0.01 in every parent”). The same dep plays different roles in different parent stacks. ethers.js is central to hardhat (weight 0.32) but peripheral to prysm (which is Go-native).

Hypothesis B2. Direct weight transfer by dep identity fails. Transfer must operate through a shared feature space (language, role, structural position) rather than through “this dep got weight X in parent Y, so give it weight X everywhere.”

5.5 Label coverage: 42% of cold-start parents share zero deps with the labeled set

For each of the 80 unlabeled parents, we computed what fraction of their deps also appear in at least one labeled parent:

Coverage threshold # cold-start parents
> 50% 1
> 30% 13
> 10% 29
= 0% (total isolation) 34
Median 1.8%

34 parents share zero deps with the labeled set. For these, even feature-based transfer from L2 labels provides no direct signal — the model must generalize from entirely disjoint dependency vocabularies.

Hypothesis B3. Feature-based transfer is necessary but fragile: ~42% of parents have zero dep-identity overlap with the labeled set. The model needs features that generalize without shared vocabulary — structural position, language, commodity-vs-domain classification.

5.6 Commodity score: raw frequency is the wrong signal

We defined commodity score = (number of parents a dep appears under) / max. Correlating with L2 weights:

Parent Spearman rho Direction
checkpointz −0.23 Slight negative (expected)
prysm +0.40 Positive (unexpected)
hardhat +0.28 Positive (unexpected)

The sign flips because ecosystem-important deps (ethers.js, go-ethereum, openzeppelin) are both high-frequency and high-weight — they appear in many parents because they’re genuinely central to Ethereum, not because they’re generic language commodities. Raw cross-parent frequency conflates “valuable ecosystem hub” with “ubiquitous language utility.”

Hypothesis C1. Raw frequency across parents is a poor commodity signal — it conflates value-carrying ecosystem hubs with low-value language utilities. The correction needs a semantic classification (Section 5.7), not a frequency threshold.

5.7 Ethereum-specific deps carry 50–200x more weight than commodities

We classified deps into three categories using name heuristics:

  • Ethereum-specific (contains eth, evm, solidity, beacon, etc.): 35 deps across L2
  • Commodity (owned by serde-rs, clap-rs, microsoft, eslint, etc.): 16 deps
  • Other: 111 deps

Mean weight by class within each labeled parent:

Parent Ethereum mean Commodity mean Ratio
checkpointz 0.164 — (no commodity deps)
prysm 0.026 0.0006 43x
hardhat 0.052 0.0007 74x

Ethereum-specific deps carry 1–2 orders of magnitude more weight consistently across parents. The classification is coarse but the signal is unambiguous.

Hypothesis C2. A binary Ethereum-vs-commodity feature provides a strong prior multiplier. For cold-start parents, Ethereum-specific deps should receive ~50x higher initial weight than commodity language deps.

5.8 Concentration scales predictably with group size

Across the 3 labeled parents, entropy and Gini are well-described by simple parametric relationships:

Parent n Entropy (nats) Max entropy (ln n) Gini
checkpointz 23 1.08 3.14 0.900
prysm 70 2.45 4.25 0.868
hardhat 69 2.45 4.23 0.868

Fitted relationships:

  • Entropy ≈ 1.24 * ln(n) − 2.80 (R = 1.000)
  • Gini ≈ −0.029 * ln(n) + 0.99 (R = −1.000)

With only 3 data points these fits are illustrative, not definitive — but the direction is unambiguous: larger groups spread weight more evenly. The fitted Gini for n=2 is 0.97 (near-deterministic), for n=70 it’s 0.87 (still highly concentrated).

Hypothesis D1. Concentration is a predictable function of group size. For cold-start parents, we can set the prior decay slope from n alone — steep for small parents (s ≈ 2.3), moderate for large ones (s ≈ 1.5).

5.9 The distribution follows Zipf with s ≈ 1.5–2.3

Fitting Zipf(s) to each labeled parent’s cumulative weight share:

Parent n Best-fit Zipf s
checkpointz 23 2.29
prysm 70 1.53
hardhat 69 1.52
Mean 1.78

Smaller parents decay faster (higher s). In all three cases, the top 10% of deps absorb approximately 80% of total weight.

Key cumulative share thresholds:

  • checkpointz: top 3 deps hold 96.8% of weight; top 5 hold 98.4%
  • prysm: top 3 deps hold 60.0% of weight; top 10 hold 83.2%
  • hardhat: top 3 deps hold 54.0% of weight; top 10 hold 80.1%

Hypothesis D2. Within-parent weight distributions follow Zipf with s inversely related to group size. A Zipf(s) prior with s = f(n) provides a principled initial weight vector for all 83 parents, including the 80 cold-start ones.

5.10 Bipartite graph: near-fully-connected, non-random degree structure

Building the full bipartite graph (83 parents, 1953 deps, 3677 edges):

Property Value
Nodes 2,014
Edges 3,677
Connected components 2
Giant component 1,922 nodes (95.4%)
Second component 92 nodes (4.6%)

The graph is near-fully-connected — one giant component plus a single isolated cluster. The degree-degree correlation between parent degree and mean neighbor (dep) degree is moderate, meaning high-degree parents don’t necessarily connect to high-degree deps. Degree alone isn’t a sufficient structural feature.

Dep degree distribution (how many parents each dep appears under) follows a heavy-tailed pattern in log-log space, consistent with preferential attachment in dependency graphs.

Hypothesis E1. The graph is structurally non-random — structural position (betweenness, clustering coefficient) carries signal beyond raw degree. Graph features should be computed on the full bipartite graph, not per parent.

5.11 Dual-role repos: heavyweight parents are heavyweight deps

Among the 22 dual-role repos, several have direct L2 weight observations:

Repo # deps (as parent) # parents (as dep) Max L2 weight
ethers-io/ethers.js 24 14 0.320
consensys/gnark-crypto 11 7 0.200
wevm/viem 28 7 0.110
ethereum/go-ethereum 67 9 0.011
supranational/blst 8 10 0.004
nomicfoundation/hardhat 70 10 0.000083
protofire/solhint 39 5 0.000015

ethers.js is the #1 weighted dep in hardhat (0.320) and simultaneously appears as a dependency of 14 other parents. Repos that are central in the ecosystem tend to be both large parents and important deps.

Spearman correlation between “# deps as parent” and “# parents as dep” is rho = −0.23 — a slight negative correlation, meaning very large parents (e.g., hardhat with 70 deps) aren’t necessarily the most-depended-upon. The most-depended-upon repos tend to be medium-sized focused libraries (ethers.js, alloy, blst, gnark-crypto).

Hypothesis E2. Dual-role repos carry cross-level consistency constraints. If a repo is a heavyweight dep, it’s likely also a major parent — and its own dependency weights provide indirect signal about how to weight it under other parents.

5.12 Personalized PageRank: predictive but insufficient

We seeded personalized PageRank from the 6 ethereum/* parent nodes (ethereum/consensus-specs, ethereum/eips, ethereum/execution-apis, ethereum/go-ethereum, ethereum/js-ethereum-cryptography, ethereum/web3.py) and computed PPR for every node.

Correlation with L2 weights:

Parent Spearman rho n (deps with PPR > 0) R² (log-log)
checkpointz 0.45 5 0.009
prysm −0.10 16 0.029
hardhat 0.23 69 0.014

PPR captures broad ecosystem relevance but not within-parent importance — the correlation is weak or even slightly negative. This is expected: PPR ranks nodes by global centrality, but the jury asks “how important is dep X to this specific parent”, which depends on the parent’s stack and mission.

Hypothesis E3. Personalized PageRank is a useful feature but not a sufficient model. It over-ranks globally central nodes (commodity effect from Section 5.6) and under-ranks niche-but-critical deps. Use as one feature among many, not as the baseline prediction.

5.13 Language homophily: parents overwhelmingly depend on same-language deps

Using name-based heuristics to infer primary language, we built a parent-language x dep-language co-occurrence matrix:

Parent lang \ Dep lang Rust TypeScript Go Python Sol/Vyper Unknown
Rust (7 parents) 0.50 0.00 0.00 0.00 0.00 0.50
TypeScript (1 parent) 0.00 0.26 0.00 0.00 0.03 0.71
Go (1 parent) 0.00 0.01 0.19 0.00 0.00 0.79
Python (1 parent) 0.00 0.00 0.00 0.08 0.00 0.92
Sol/Vyper (3 parents) 0.00 0.18 0.02 0.00 0.02 0.79
Unknown (70 parents) 0.18 0.05 0.04 0.01 0.00 0.72

(Values are row-normalized: fraction of each parent language’s edges going to each dep language.)

Key statistics:

  • 66.4% of all 3,677 edges connect same-language nodes
  • Rust parents have zero TypeScript dependencies; TS parents have zero Rust dependencies
  • The “Unknown” category is large (70/83 parents, 1579/1953 deps) because name heuristics are conservative — GitHub API language metadata would close this gap

Parent repo counts by inferred language:

  • Unknown: 70 | Rust: 7 | Solidity/Vyper: 3 | TypeScript: 1 | Go: 1 | Python: 1

Hypothesis F1. Language is a strong grouping variable — parents depend overwhelmingly on same-language deps. Weight distributions likely differ by language ecosystem (Rust deps decay differently than TS deps), justifying language-stratified priors.

5.14 Language coverage gap: Rust is the biggest cold-start risk

Language Total parents Labeled Unlabeled
Unknown 70 2 68
Rust 7 0 7
Solidity/Vyper 3 0 3
TypeScript 1 1 0
Go 1 0 1
Python 1 0 1

The labeled set covers TypeScript (hardhat) and “Unknown” (prysm, checkpointz — both actually Go, classified Unknown by our heuristics). Rust has 7 parents and zero labeled representatives. Given the Rust ecosystem’s distinct dependency graph structure (Cargo crate conventions, rustcrypto/*, serde-rs/*, tokio-rs/* namespaces), this is the single biggest language-coverage gap.

Hypothesis F2. Rust-ecosystem parents are the highest transfer risk. The model should either (a) gather Rust-specific priors from external signals (crate download counts, lib.rs metadata), or (b) explicitly flag Rust parents as high-uncertainty in the ensemble.

5.15 Bootstrap prediction intervals: the head is where modeling effort pays off

For each labeled parent, we ran 100 bootstrap iterations: hold out 20% of deps, fit log-linear on 80%, predict the held-out weights. The 90% prediction interval width (in log-space) by rank position:

Parent Head interval (top-5 mean) Tail interval (bottom-5 mean) Tail/Head ratio
checkpointz 0.41 0.28 0.7x
prysm 0.24 0.16 0.7x
hardhat 0.15 0.11 0.7x

Head deps have ~1.4x wider prediction intervals than tail deps. This is the opposite of the naive expectation (“tail is harder to predict”) — it happens because head deps are high-leverage points that deviate from the log-linear trend. When the bootstrap removes a top-3 dep, the fitted line swings; when it removes a tail dep, virtually nothing changes.

Implication: the tail is well-approximated by log-linear extrapolation with low variance. The head is where model choice actually matters — getting the top-3 ranking right dominates the Huber loss because those pairs generate the most pairwise comparisons.

Hypothesis G1. Modeling effort should concentrate on correctly ranking the head deps (top 5–10 per parent). The tail can be approximated by a log-linear extrapolation. An ensemble or geometric-mean hedging strategy should be applied at the head, where prediction uncertainty is highest.


6. Synthesis: EDA-to-Model Traceability

Section Finding Hypothesis Modeling decision Evidence
5.1 Log-linear R² = 0.97 A1 Model in log-space; Bradley-Terry Strong
5.2 21.6% extreme log-ratios A2 Train with Huber, not MSE Strong
5.3 Clear ecosystem clusters B1 Cluster-aware regularization Moderate
5.4 Cross-parent weight rho ≈ 0 B2 Feature-based transfer, not identity Strong
5.5 42% parents share 0 deps with labels B3 Features must generalize without shared vocab Strong
5.6 Raw frequency rho has wrong sign C1 Don’t use raw frequency as commodity score Strong
5.7 Eth deps 50–200x heavier C2 Binary Ethereum-vs-commodity feature Strong
5.8 Entropy ≈ 1.24 * ln(n) − 2.80 D1 Size-dependent prior decay slope Suggestive
5.9 Zipf s ≈ 1.5–2.3 inversely with n D2 Zipf prior for cold-start init Moderate
5.10 95.4% giant component E1 Graph features on full bipartite graph Moderate
5.11 Dual-role repos heavy in both roles E2 Cross-level consistency regularizer Suggestive
5.12 PPR rho ≈ 0.2 (weak) E3 PPR as one feature, not baseline Strong
5.13 66% same-language edges F1 Language-stratified priors Moderate
5.14 Rust: 7 parents, 0 labeled F2 Flag Rust as high transfer risk Strong
5.15 Head has 1.4x wider intervals G1 Focus modeling on head; log-linear for tail Moderate

7. What this implies for modeling (preview of Part 2)

The Part-2 writeup will operationalize the hypotheses above. The headline plan:

  1. Initialize with Zipf(s) prior where s = f(n) per parent (Section 5.9).
  2. Classify deps as Ethereum-specific vs commodity using semantic heuristics (Section 5.7). Apply a prior multiplier (~50x) to Ethereum-classified deps.
  3. Compute features: per-dep GitHub activity, language, ecosystem cluster membership (Section 5.3), structural graph position (Section 5.10), commodity score corrected for ecosystem hubs (Section 5.6), dual-role indicator (Section 5.11).
  4. Fit Bradley-Terry with Huber loss in log-space (Sections 5.1, 5.2) on the 162 L2 labels, with features as priors, learned jointly across the three labeled parents.
  5. Transfer to 80 cold-start parents via the shared feature space (Sections 5.4, 5.5). Apply language-aware grouping (Section 5.13), with extra caution for Rust parents (Section 5.14).
  6. Focus ensemble/hedging on the head (top 5–10 per parent) where prediction uncertainty is highest (Section 5.15). Let the tail follow log-linear extrapolation.
  7. Renormalize per parent to sum to 1.
  8. Sanity-check against EDA invariants: per-parent Gini in [0.7, 0.95], log-linear decay, no commodity dep in top-1, concentration consistent with group size.

8. Reproducibility

All numbers and tables were produced by Python scripts run against the two input CSVs as released. Stack: pandas, numpy, matplotlib, scipy, networkx. No external data joins — all results are intrinsic to the two released files.

# minimal repro for headline numbers
import pandas as pd
import numpy as np
from scipy import stats

l3 = pd.read_csv("official_l3_pairs_to_predict_3677_rows.csv")
l2 = pd.read_csv("released_public_labels_L2PublicEval_162_rows.csv")
l2["repo"] = l2["repo_url"].str.replace("https://github.com/", "")
l2["dep"]  = l2["dep_url"].str.replace("https://github.com/", "")

# verify shape
assert l3.shape == (3677, 2)
assert l3["repo"].nunique() == 83
assert l3.groupby("repo").size().max() == 70

# log-linear fit (A1)
for parent in l2["repo"].unique():
    sub = l2[l2["repo"]==parent].sort_values("user_weight", ascending=False)
    ranks = np.arange(1, len(sub)+1, dtype=float)
    sl, it, r, p, se = stats.linregress(ranks, np.log(sub["user_weight"]))
    print(f"{parent}: R²={r**2:.3f}, slope={sl:.4f}")

9. Open questions (feedback welcome)

  1. Is the 70-cap deliberate or an artifact? If the organizers intentionally truncated, then “predict zero for missing tail deps” is a hidden modeling assumption baked into the eval.
  2. Will more L2 / private-eval labels be released closer to the deadline? With supervision at 3/83 parents, the marginal value of even 5 more labeled parents would be very high.
  3. Can GitHub API language metadata close the “Unknown” gap in Sections 5.13–5.14? Our name heuristics classify 70/83 parents as Unknown. GitHub’s primary_language field would likely bring this to fewer than 10.
  4. Is the Zipf exponent truly a function of n, or is it ecosystem-specific? Three data points suggest s ≈ f(n), but it could be that Go parents (prysm, checkpointz) simply have different concentration than TS parents (hardhat), and n is a confound.

Hello,

I cannot post images, could I please get permission for that? Otherwise the forum reader experience will be not ideal and i have to externally link to a website

Deep Funding L3 — what I actually did, what I learned, what I’d change

A version of this writeup with all seven charts embedded is at:

delicate-sun-7afd.carlbarr422.workers.dev

If you read it in the forum the figures are described in prose. If you want to actually look at the score-vs-Gini scatter or the correlation heatmap, the site has them.


I entered Deep Funding L3 on April 26 and stopped submitting on May 26. In between I uploaded 44 CSVs. My scores went from 1.5435 on day one, to 0.1877 on day 22, to 0.0000 on day 29. I want to write down what happened, because the part of the experience worth remembering isn’t the modeling — it’s that I spent three of those four weeks playing a different game than I thought I was playing.

I’m writing this from notes, the submission CSVs themselves, and a long back-and-forth I had with an LLM trying to make sense of it all after the fact.


What the competition asked for

Deep Funding is run by SingularityNET with Ethereum Foundation as co-host. The prize pool for the level I entered is a few thousand dollars plus writeup prizes.

The actual task in L3 is: 83 parent GitHub repositories (things like nomicfoundation/hardhat, offchainlabs/prysm, 0xmiden/miden-vm), each with a list of dependencies. 3,677 (parent, dependency) pairs in total. 1,953 unique dependencies across all of them. For each pair you predict a weight between 0 and 1, and the weights per parent have to sum to exactly 1.

The catch is the ground truth. A human jury votes on which dependencies “contribute more value” to each parent. They don’t release the jury data. You only get a single error number back per submission. A scoring metric, and a leaderboard.

You can submit 3 times per day. So three probes per day to a hidden function. That’s the game.


The pie is unevenly sliced

Before doing any modeling I stared at the L2 example file (l2-predictions-example.csv) which shares the exact same 3,677 pairs as L3 — just with sample weights filled in. Across those 3,677 weights:

Statistic Value
Mean 0.0226
Median 0.0178
Max 0.7755
Skewness 9.03
Excess kurtosis 187.87
Gini coefficient 0.457

This is what a heavy-tailed distribution looks like. The mean is 1/46 because each parent has about 46 dependencies and the weights sum to 1. The interesting part is the spread — skewness of 9, excess kurtosis of 188. A normal distribution has excess kurtosis of 0. Log-normal would still be tractable. Goodness-of-fit tests reject even log-normality at p ≈ 10⁻³⁷.

Chart on the site: a four-panel diagnostic of the weight distribution. The linear histogram is useless — all mass collapses into the first bin. The log-scale histogram with a KDE overlay shows a unimodal hump with thick tails. The ECDF and Q-Q plot confirm the heavy-tailedness — the Q-Q curve bends in both tails.

The implication is simple: some dependencies get the bulk of each parent’s allocation, and most get crumbs. The biggest single weight in the sample is chfast/intx getting 0.7755 of ipsilon/evmone’s budget. If you plot this on a linear axis you see one huge spike at zero and nothing else useful. Log axis is mandatory.


Most dependencies are alone in the world

This is a bipartite graph. 83 parents on one side, 1,953 dependencies on the other. About 69% of dependencies appear in exactly one parent. The median dependency has degree 1. A few utility libraries connect lots of parents — nomicfoundation/hardhat itself shows up as a dependency of 80 other parents, ethereum/go-ethereum in 76, openzeppelin/openzeppelin-contracts in 39 — but those are the exception.

Chart on the site: the dependency-side degree distribution (log y-axis) and a Lorenz curve of edge concentration. The degree-1 bar is dominant. The Lorenz curve sits well below the diagonal — most dependencies contribute almost no connectivity and a small minority contribute most of it.

What this means practically is that cross-parent transfer learning is structurally limited. If you build a feature-based model that learns “what makes a dependency get high weight in any parent”, you do okay on the ~30% of dependencies that show up in multiple parents and collapse to a near-uniform prior on the 70% that don’t. The right approach is parent-conditional — fit per-parent allocations and share information only where the graph supports it.

I did not start there. I started worse.


My first six submissions were embarrassingly bad

I first submitted on April 26 at 16:06. submission_even_blend.csv. It scored 1.5423. Twenty minutes later I tried submission_pure_uniform.csv. 1.5435. Both were essentially “give every dependency equal weight per parent” — the dumbest non-broken thing you can submit.

# Filename Score Date
1 submission_even_blend.csv 1.5423 Apr 26
2 submission_pure_uniform.csv 1.5435 Apr 26
3 probe_iter1_pagerank.csv 1.5203 Apr 26
4 baseline_oso_p2p.csv 1.5435 Apr 27
5 seedReposWithDependencyWeights.csv 0.8366 Apr 27
6 true_phase2_exact_zeros.csv 0.3457 Apr 27

The jump from #5 to #6 is the lesson here. They are eight minutes apart. The score dropped from 0.84 to 0.35 because submission #6 respected something I’d missed: some weights are documented to be exactly zero. There’s a rule that microsoft/typescript’s dependency on nomicfoundation/hardhat is 0. There are a few similar gotchas. Just enforcing those — without changing the model at all — cut my error in half.

If I were starting over I would read the entire competition documentation, list every special-case rule, and submit a “uniform but respect the rules” baseline first. That single sentence — “respect the rules” — is worth about a 50% improvement. I will not forget this for the next competition.


Finding the shape of the problem

Over the next two weeks I made roughly sixteen more submissions, mostly scoring 0.27 to 0.37. The filenames are an archaeological record of what I tried:

  • candidate_sparse_top3.csv, candidate_sparse_top3_aggressive.csv — give the top-3 dependencies almost all the weight per parent
  • antiortval.csv, ortvaldesc.csv — orthogonal-value-based scoring
  • next-seer-g105.csv, next-seer-g110.csv — graph tilt parameter sweeps
  • digging_for_solcjs.csv, solp35tri.csv, big_weight_blst_probe.csv — probing specific outlier dependencies, including the blst cluster where the supranational repo allocates 25% to rustcrypto/utils
  • submission_l3_ray_t052.csv, submission_l3_ray_t060.csv — ray-based scoring with temperature sweeps
  • submission_l3_pair_core_h45.csv — pair-core extraction

By May 9 my best was 0.2671. By May 12 I’d broken below 0.21. I felt good about it. I should not have.


The plateau

For the next six days — May 12 through May 18 — I made nineteen more submissions, all scoring between 0.19 and 0.24. The filenames record the desperation:

submission_l3_corrected_tight_t030.csv     0.2029
submission_l3_corrected_tight_t020.csv     0.2093
submission_l3_corrected_tight_t040.csv     0.2123
v6_tight_t0150.csv                         0.2011
v6_tight_t0200.csv                         0.2016
dq3_v8_reg_t0400.csv                       0.1943
letsgo.csv                                 0.1909
dq3_v10_ANTI_sparse_s09_a0250_localproj    0.1912
dq3_v10_ANTI_sparse_s09_a0400              0.1905
from1915_continue_anti_s0050               0.1909
perrepo_checkpointz_to_a0500               0.1906
sub_20260518_30.csv                        0.1883
sub_20260518_32.csv                        0.1877  ← best

Every t0150 vs t0200 is a different softmax temperature. Every s09_a0400 is a different sparsity and alpha. The ANTI prefix is anti-corporate weighting — explicitly down-weighting libraries that look like they came from big corporate engineering teams, on the theory that the jury was a community of independent devs who would value smaller community projects.

Chart on the site: a step plot of my personal-best score over time. The curve falls fast through the first week, then crawls almost horizontally from May 9 through May 18, then drops to zero on May 26. The plateau is visible at a glance.

I was sweeping parameters around an architecture that had already plateaued. Across those nineteen submissions I improved my score by 0.013 — about a percent and a half. I was not actually getting better. I was tuning.

At this point I was working with three different agentic coding environments in parallel — Cursor, Devin, Codex — and a few of my own scripts. The filenames have that fingerprint: cursor_v* directories from Cursor sessions, devin_* from Devin’s sandbox, codex_*_score_0p1893_* from Codex (Codex helpfully bakes the score directly into the filename). Each environment was running its own variant of the same plateau-tuning. Twenty hours of compute across three agents was not finding the thing I was missing.


The announcement that changed everything

A few days before the competition deadline, the organizers released a file called released_public_labels_L2PublicEval — 162 (repo, dependency, weight) rows for 3 of the 83 parent repos: ethpandaops/checkpointz, nomicfoundation/hardhat, and offchainlabs/prysm.

That file is the actual leaderboard scoring set. The score I had been chasing for a month was not error against all 3,677 pairs. It was error against those 162 rows.

The disclosure was framed as a levelling measure. Some people had been probing the leaderboard heavily for weeks under the 3-per-day cap. A few had farmed multiple accounts to probe more. The organizers released the scoring set so latecomers and rule-followers wouldn’t be at a structural disadvantage to leaderboard-farmers.

The instant practical consequence: once you know which 162 rows are scored, the rational submission pastes those 162 truths verbatim and fills the other 3,515 rows with whatever model you want, then renormalizes each parent’s weights to sum to 1. That submission gets 0.0000 on the leaderboard. Perfect zero error on the only rows being scored.

I had not realized this until the disclosure. For a week I had been refining a model from 0.19 to 0.1877 through finer and finer parameter sweeps. None of that work was visible to the leaderboard — and not because the model was bad. Because the leaderboard wasn’t measuring what I thought.


All my final submissions scored 0.0000

I waited eight days after the disclosure before submitting anything new. I’m honestly not sure why I waited — partly to process what had happened, partly to talk to a few people about whether the obvious strategy was the right one.

On May 26 at 11:56 I uploaded three submissions in a 20-second window:

Filename Score Time
submission_flavor1_xgboost.csv 0.0000 11:56:32
submission_flavor2_pytorch.csv 0.0000 11:56:42
submission_flavor3_scipy.csv 0.0000 11:56:52

All three paste the 162 disclosed truths verbatim. All three renormalize per parent. All three hit the floor.

But none of them are the same submission on the 3,515 hidden rows:

  • Flavor 1 is XGBoost on graph and GNN features with softmax temperature 0.4
  • Flavor 2 is a PyTorch MLP with an anti-corporate penalty
  • Flavor 3 is SciPy SLSQP per-repo with a corporate cap of 0.005

Chart on the site: the full 44-submission trajectory as a colored scatter, with four era bands shaded behind it. Era I (red, ≥1.5) sits at the top, Era II (gold, ~0.30) drops down through May 3 to 9, Era III (blue, ~0.19) hugs the lower band from May 12 to 18, and Era IV (oxblood, ≈0.0000) sits on the floor on May 26.

The leaderboard cannot tell the three flavors apart — they all paste the same 162 truths. The final ranking, computed against the rest of the jury data when the competition closes, will tell them apart.

This is what the leaderboard looks like after the disclosure: a one-bit signal. Either you pasted the 162 truths (0.0000) or you didn’t (anything > 0). All the interesting model competition has moved to the 3,515 rows nobody can see.


What actually correlated with score

I went back and computed structural features of 36 of my 44 scored submission CSVs — the seven I’m missing are either deleted intermediates or were uploaded by a collaborator I lost track of. For each one I computed Gini, entropy, P99 of weights, median per-parent dominance, skewness, and exact-zero count, then correlated each with the actual leaderboard score on the 33 non-floor submissions:

Structural feature Pearson ρ with score
Gini coefficient on hidden rows −0.977
Mean per-parent entropy +0.958
Median per-parent dominance −0.930
P99 of weights −0.952
Skewness +0.937
Exact-zero count −0.019

ρ = −0.977 between Gini and score is enormous. The more concentrated my allocation was, the better it scored. Every related feature confirms this — higher entropy (more uniform) is worse, lower median dominance is worse, smaller P99 is worse.

Chart on the site: score plotted against Gini for all 36 submissions, colored by era. The shape is a clear monotonic descent — uniform-ish submissions cluster at the top right with high scores, concentrated submissions cluster at the bottom left near 0.19. A small green band highlights the Era III sweet spot at Gini 0.886-0.889. The post-disclosure flavors are anomalies on the y=0 axis at three different Gini values.

The only feature that didn’t predict score was the count of exact-zero weights. ortvaldesc.csv had 2,762 zeros out of 3,677 rows and scored 0.4178. My best concentrated-but-not-extreme submission (dq3_v10_ANTI_sparse_s09_a0400.csv) had 25 zeros and scored 0.1905. Going to zero indiscriminately doesn’t help. What helps is putting real, calibrated mass on the right few dependencies per parent.

Looking at the Gini trajectory across my campaign:

Era Mean Gini Score range
I. Baseline 0.50 1.52 – 1.54
II. Structural 0.90 0.27 – 0.42
III. Refined 0.888 (narrow band) 0.19 – 0.24
IV. Post-disclosure mixed (0.29 – 0.89) 0.0000

The plateau is just Gini convergence. All 19 of my Era III submissions have Gini between 0.886 and 0.889 — a band 0.003 wide. I had stopped exploring the structural axis and was just tuning within a fixed regime. The plateau wasn’t because the model had stopped improving. The plateau is because I had stopped exploring the kind of model.

What I should have done — and what I would do next time — is deliberately step off the plateau by trying a Gini-0.95 ultra-concentrated submission and a Gini-0.70 hedge submission, just to see what the score surface looked like in those regions. Instead I kept sweeping temperatures. The flavors I submitted post-disclosure (Gini 0.30, 0.29, and 0.89) span the structural axis, but that was after the disclosure made the score uninformative anyway.


The portfolio I assembled for final upload

In parallel with the three flavor submissions, I assembled a portfolio of 17 candidate CSVs across six methodological lanes for the final upload deadline, on the assumption that the final ranking is determined on the 3,515 hidden rows. The portfolio:

Family Members What’s in it
Flavor (original) flavor1_xgboost, flavor3_scipy The two of my three submissions that landed (flavor2 lost its CSV)
Recommended uploads new1_anti_corp_heuristic, new2_graph_dirichlet, new3_public_only_gbm Built specifically for distinct hidden-row lanes: token DB + regex anti-corp (no w_star), PageRank + Ethereum tilt, HistGBM on the 162 rows only
Devin scratch tree_public_pseudo, torch_softprior, constraint_scorer Tree on public+pseudo, Torch MLP with soft prior, ridge with strict caps
Statistical stat_a_institutional, stat_b_jury_bradley_terry, stat_c_wstar_orthogonal Institutional prior, Bradley-Terry jury extrapolation, w*-orthogonal residual
Cursor variants cursor_v1_tree, cursor_v2_ridge_graph, cursor_v3_prior_blend Three from a Cursor agentic session
Fresh fresh_choice_pl, fresh_funding_need, fresh_spectral_salience Plackett-Luce choice model, funding-need heuristic, spectral salience

All 17 pass the same verification: paste the 162 truths, simplex error below 10⁻⁹, the microsoft/typescript special case respected. All 17 score 0.0000 on the leaderboard.

The question is how different they actually are on the 3,515 hidden rows.


How different are the 17 submissions, really

Two ways to measure: concentration (Gini) and pairwise correlation.

Gini on hidden rows ranges from 0.29 (fresh_choice_pl) to 0.95 (new1_anti_corp_heuristic). So the portfolio spans from “I don’t really know, hedge uniformly” to “I have strong opinions about a handful of dependencies and very low opinions about the rest.”

Chart on the site: horizontal bars of each submission’s Gini, color-coded by family. Recommended (new1/2/3) bars are deep oxblood, flavor bars are blue, Devin scratch is green, statistical is gold, cursor is mauve, fresh is grey. The spectrum runs from 0.29 to 0.95 visibly across the chart.

The pairwise Pearson correlation matrix on hidden rows tells a more useful story than the Gini ranking. Three clusters fall out:

  1. Tree/GBM supercluster. flavor1_xgboost, new3_public_only_gbm, all three cursor variants, fresh_choice_pl, and fresh_spectral_salience — seven submissions correlate with each other at ρ between 0.85 and 1.00 on hidden rows. The methodological labels suggest variety. The numbers say they’re essentially the same submission. They all rest on tree or regression backbones trained against similar pseudo-labels and they all stay close to uniform.

  2. Anti-corporate axis. new1_anti_corp_heuristic and new2_graph_dirichlet correlate at ρ = 0.93 with each other and ρ ≈ 0.18 to 0.33 with the tree/GBM cluster. fresh_funding_need is on this axis too (ρ ≈ 0.66 to 0.77). This is the most distinct lane.

  3. Statistical middle ground. stat_a/b/c and the three devin_* submissions form a third loose cluster with intra-cluster correlations of ρ ≈ 0.4 to 0.7.

Chart on the site: a 17×17 lower-triangular heatmap of pairwise correlations. The tree/GBM block in the bottom-right is solid dark red — ρ near 1 — and immediately reveals the redundancy. The smaller new1-new2 hot spot at the top-left is visually distinct. The middle is moderate pinks and oranges.

Effective dimensionality of my 17-submission portfolio is more like three than seventeen. The right upload triple — one from each cluster — is:

  • new1_anti_corp_heuristic (anti-corporate, Gini 0.95)
  • new2_graph_dirichlet (statistical middle, Gini 0.79)
  • new3_public_only_gbm (tree/GBM, Gini 0.31)

That’s the upload set.


What this all means

A few things I want to write down so I don’t forget them next time.

The leaderboard is a research artifact. Watch its trajectory, not just its current value. My 44 submissions tell a much cleaner story than any single score does — four eras, a plateau, a disclosure event, a post-disclosure hedging move. None of that is visible from a single number.

Read the rules before building anything. I skipped this and lost about a week. The special-case zeros, the simplex constraint, the daily cap, the eventual scoring set — these are all in the documentation or in the rules. The cost of one careful read-through is much less than the cost of finding the rules through trial and error.

Heavy-tailed data needs log axes. If your histogram is just a spike at zero, switch to log immediately. I had to be reminded of this and it cost me hours.

For bipartite data with severe degree skew, model per-parent. Cross-parent transfer is structurally limited if 70% of the smaller-side nodes are singletons. Accept that and build accordingly.

Portfolio thinking beats single-model thinking when the evaluation is hidden. Six methodological families, but really three independent lanes. Upload one from each, not all of them and not just your favorite.

When the rules change, re-cast immediately. The eight-day gap between my last Era III submission and my first Era IV is the gap between the disclosure and my decision to start over. I could have re-cast within a day if I’d been paying attention.

The leaderboard can be a one-bit signal. Once the 162 rows were disclosed it collapsed to “did you paste the truths or didn’t you.” All the interesting model competition relocated to rows nobody could see. The competition design itself was part of the problem, and the move that mattered most for my eventual ranking was a strategic decision (which three to upload), not a modeling decision.

One more thing worth noting. I also worked on this with three different agentic coding environments — Cursor, Devin, Codex — plus my own scripts. They each produced different submissions and the receipts are in the filenames. Most of the convergent tree/GBM cluster of my portfolio came from those agents working with the same pseudo-labels. The most distinct submission (new1_anti_corp_heuristic) came from me directly, building a regex + token DB pipeline that didn’t lean on any pseudo-labels at all. Worth remembering: the agentic tools converge on similar answers when they’re trained on similar context, and the diversification gain comes from working outside that shared context.


What I’d do differently

If I were starting Deep Funding L4 tomorrow:

  1. Spend the first two days reading every document, listing every special-case rule, asking organizers about the scoring protocol (specifically: will a scoring set be disclosed late?). Don’t submit anything.
  2. Submit a uniform baseline. Submit a “respect special-case rules but otherwise uniform” baseline. Submit one seed-based heuristic. Three submissions, day one, just to calibrate.
  3. Start modeling on day three, with parent-conditional priors as the structural commitment.
  4. Track submissions in a spreadsheet from day one. Filename, model description, hyperparameters, score, Gini, dominant-dependency choices per parent.
  5. When the leaderboard score stops improving for three consecutive submissions, stop tuning and step off the architectural axis. Try something structurally different.
  6. Build a portfolio across methodological families from week two, not week four. Each new model is a probe. Keep them all.
  7. Watch for the scoring-set disclosure. If it comes, immediately switch all submission slots to “paste truths + diverse hidden-row strategies.”

The last one I would have missed without the disclosure being explicit. Whether L4 will run the same way I don’t know. But the meta-question — what is the leaderboard actually measuring — is the one I’ll be checking against from now on.


Footnotes

A few things I’m hand-waving in the body.

  • The 0.1877 best non-zero score is mean absolute error per row over the 162 disclosed rows, on a quantity that ranges 0 to 1. So my predictions were off by an average of about 0.19 per scored row.
  • The MAE under the per-parent uniform allocation on those same 162 rows is about 0.0285. That’s the meaningful theoretical floor on the hidden 3,515 rows too, if the disclosed subset is representative.
  • The L2 example submission (l2-predictions-example.csv) has near-zero correlation with the 162 disclosed labels — Pearson ρ ≈ −0.02. This isn’t because the L2 sample is a bad model; it’s because the L2 sample is a template that pre-dates the disclosure and was never engineered to fit the disclosed rows.
  • The Gini values I report are over each submission’s 3,515 hidden-row weights. If I included the 162 truth-pasted rows the Gini values would compress because all submissions share those rows.
  • The competition documentation calls the L1 task “98 open source repos” and L3 expands to 83 parents × 1,953 dependencies. The numbers differ across levels.
  • The bundle of submissions I analysed for the empirical postmortem section is the actual zipped working directory from my drive — 1,005 files, 67 modeling scripts, 237 CSVs. The 7 missing scored submissions are intermediates I deleted at some point during cleanup.

Full writeup with all seven charts at:

delicate-sun-7afd.carlbarr422.workers.dev

Level 3 Submission for GG24 Deep Funding

Public split score 0.199722255456065

Author: Xavier Olah — cougarhead2003@gmail.com

Pond Username: cougarhead2003

Pond Leaderboard Placement: 51


TL;DR. My Level 3 entry is a learned model, not a heuristic. A

21-dimensional feature vector is fed to a shallow gradient-boosting

regressor trained directly on the public evaluation file

(L2PublicEval.csv). The model’s raw predictions are then geometrically

blended with a small heuristic anchor that encodes Ethereum-specific

domain knowledge — a 95/5 split that trades a sliver of in-sample

accuracy for robustness to distribution shift on the private slice.

Final per-repo weights are produced by plain L1 normalization (no

softmax). The same scoring rule is used both during training and at

submission time, so there is no train/serve skew.


1. What the metric is, and why it cares

The grader scores each parent repository $r$ with


err(r) = sum over d in D_r of | y_{r,d} - w_hat_{r,d} |,

w_hat_{r,d} = s_{r,d}

/ sum over d' in D_r of s_{r,d'}.

where $s_{r,d}$ is whatever raw score the submission emitted for the

pair $(r,d)$, and $y_{r,d}$ is the held-out jury weight. Per-repo

errors are averaged across the public set of parents to produce

l2_weight_error. Two consequences shape every modelling choice:

  1. The metric is invariant to per-repo scale, so the model is free to

output any positive number; only relative magnitudes inside a parent

matter.

  1. Errors compound within a parent. A single mis-weighted dependency

on a parent with few deps moves the per-repo error much more than the

same mistake on a parent with many. Spreading risk is therefore worth

more than chasing the largest dep.


2. Walk through a single pair

It is easier to describe what the pipeline does by following one

(dep, repo) pair through it. Suppose ethpandaops/beacon appears as

a dependency of prysmaticlabs/prysm.

  1. Normalize. Both URLs are reduced to lowercase owner/name via

norm_github. Renames such as lfdt-web3j/web3j collapse cleanly.

  1. Featurize. The pair becomes a 21-vector containing membership

flags (is the dep in our hand-curated Ethereum set?), organization

features (does the dep org match the repo org?), frequency statistics

(how often does the dep appear across all parents?), GitHub signal

(stars and forks from github_data.json), lexical features (does the

dep name share a token with the repo name? does it contain words from

a small Ethereum vocabulary?), and the value of CURATED_PRIOR when

present.

  1. Score. The same vector is passed to a gradient-boosting regressor

trained on the public CSV; we get a single number r_hat = model(x).

  1. Blend. A heuristic score h (built from the same features but

composed multiplicatively rather than additively) is multiplied in:

s = r_hat^0.95 * h^0.05. The 95/5 split is what makes this

submission conservative.

  1. Normalize. For each parent we divide by the row sum so that

sum over d of w_hat_{r,d} = 1. No softmax, no temperature scaling.

Design choice — why no softmax?

Softmax couples weights nonlinearly through the largest score in a

parent; a single outlier dep can wash out the rest. Since the grader

penalizes L1 deviation, we want the output of the model to be the

actual relative claim on the parent, not its exponential.

Sum-normalization preserves that relationship exactly.


3. The 21 features in one table

| Group | Feature | Source |

| ---------- | ------------------------------------------ | ------------- |

| membership | is the dep in GENERIC_DEPS? | static list |

| membership | is the dep in ETH_DEPS? | static list |

| org | dep org == repo org | string split |

| org | dep org in ETH_ORGS | static list |

| org | dep org in LANG_TOOL_ORGS | static list |

| graph | log(1 + dep_freq) | full pair set |

| graph | 1 / (1 + dep_freq) | full pair set |

| graph | dep appears only once | full pair set |

| graph | dep appears more than 20 times | full pair set |

| graph | log(1 + org_freq) | full pair set |

| graph | log(1 + repo dep count) | full pair set |

| lexical | count of Ethereum keywords in the dep name | token match |

| lexical | token overlap between dep and repo names | token split |

| curated | raw value of CURATED_PRIOR | hand list |

| curated | dep is in CURATED_PRIOR | hand list |

| heuristic | log(heuristic_score) | feature mix |

| github | log(1 + stars) | GitHub API |

| github | log(1 + forks) | GitHub API |

| lexical | dep name length | string |

| lexical | dep name contains JS-ecosystem token | token match |

| lexical | dep name contains lint/format token | token match |

The heuristic score (row 16) is itself a multiplicative cocktail:


# heuristic_score sketch

s = 1

if dep in GENERIC_DEPS: s *= 0.03

if dep in ETH_DEPS: s *= 20

if dep_org == repo_org: s *= 5

if dep_org in ETH_ORGS: s *= 3

s *= 1 + 2*ethereum_keyword_count

s *= 1 + CURATED_PRIOR.get(dep, 0)/10

if dep_name shares a token with repo_name: s *= 3

return max(s, 1e-12)

It is intentionally included both as a feature for the regressor

and as a separate signal we multiply back at the very end (see §5).

The model can ignore the feature; the multiplicative anchor cannot.


4. The supervised core

We use scikit-learn’s GradientBoostingRegressor configured for heavy

regularization:


GradientBoostingRegressor(

random_state = 20260517,

n_estimators = 200,

max_depth = 2,

learning_rate = 0.04,

min_samples_leaf= 2,

)

The configuration is dictated by data size: the public eval file has

only ~300 labelled rows after the join, so an unconstrained tree

ensemble overfits in seconds. max_depth=2 forces every tree to

capture at most a two-feature interaction; learning_rate=0.04 with

200 estimators trades a little training time for a smoother loss

surface and reliable early-stopping behaviour. The deterministic

random_state is the build date.

Training proceeds in three steps:

  1. Build the design matrix. 21 features per row, N rows equal to

the size of level3_pairs_to_predict.csv.

  1. Align labels. Rows for which the public file has a jury weight

are kept; everything else is masked out before fit().

  1. Predict everywhere. The trained model scores every row in the

design matrix, public-labelled or not, and the result is floored at

1e-30 to keep ratios stable.

Design choice — why train on the public split directly?

The contest evaluates Level 3 with a single objective applied

identically to the public and the private slice. There is no separate

validation function we can be smarter about, and no held-out

leaderboard inside the public split, so the most faithful training

signal is the public split itself. We pay the cost of risking overfit

to it; the heuristic blend (§5) is what buys back the safety margin.


5. The conservative blend s = r_hat^0.95 * h_tilde^0.05

After training, we still have two estimators per row: the GBR raw score

r_hat and the heuristic h from §3. The conservative submission

takes the geometric blend


s_{r,d} = r_hat_{r,d}^{0.95} * h_tilde_{r,d}^{0.05},

where h_tilde is the heuristic with the CURATED_PRIOR multiplier

divided back out — so the blend does not double-count what the GBR has

already learned about hand-curated deps. The two exponents were not

fit; they are a deliberate 95/5 stake, anchoring on the model while

preserving a sliver of inviolable domain prior.

Design choice — what does the 5% buy?

On the public split, the pure model (`model_power=1.0,

heuristic_weight=0.0`) and the conservative blend score similarly —

often within a fraction of a percent of each other on

l2_weight_error. The reason to ship the blend is not the

public-split number but the private slice: the heuristic carries a

forced floor for deps the GBR has never seen (e.g. rare Ethereum

infrastructure libraries that happen to be missing from the public

labels), and a forced ceiling for boilerplate (everything in

GENERIC_DEPS). Both behaviours are robust to whatever the private

set looks like.


6. Result

| Variant | Recipe | l2_weight_error |

| ---------------- | --------------------------- | ----------------------------- |

| heuristic only | h, no model | 0.2087 ± run-to-run noise |

| model only | GBR raw, normalized | competitive with conservative |

| conservative | r_hat^0.95 * h_tilde^0.05 | 0.199722255456065 |

The reported public score for the conservative entry is

0.199722255456065. The grader output captured at submission time is

reproduced verbatim below.


7. What did not make it in

  • Per-repo softmax. First instinct was to keep the contest-friendly

softmax normalization; in practice it pushed mass too aggressively

onto the single highest-scoring dep, which is exactly the failure mode

the L1 grader penalizes.

  • Adding Level-1 priors. Re-using the Level-1 fit as a per-repo

prior helped Level-1 itself but hurt Level-3, because the parent-level

signal does not transfer well to per-dependency proportions when most

of the variance comes from the within-repo composition.

  • GBR on log-targets. Modelling the jury weights in log space

sounded principled (output is positive, span is wide) but the model

started over-shrinking small weights toward zero, increasing L1 error

on the long tail of deps that get tiny but nonzero credit.

  • XGBoost. Tried briefly. With 21 features and 300 training rows

XGBoost offers no measurable lift over sklearn’s GBR, while adding a

dependency we did not want at submission time.


8. Run book


# from solution/

python fetch_github_data.py # only if github_data.json is missing

python l3_solution.py # writes the conservative submission

python evaluate.py # prints l2_weight_error on the public split

Output file: solution/level3_l2-predictions-conservative.csv — three

columns (dependency, repo, weight), one row per required pair, with

the per-repo column sum equal to 1 up to floating-point.


9. Closing thoughts

The submission is intentionally small: 21 features, one shallow tree

ensemble, a multiplicative heuristic anchor, and a per-row normalization

that the grader can verify in seconds. There are obvious next steps —

a transitive-dependency graph, learned blend weights, package-registry

features for non-seed dependencies — but none of them moved the public

score in our experiments, and we preferred shipping a model that fits

in two short Python files over one we could not fully explain in a few

pages.

Gitcoin Grants Round 24 — Level 2 Dependency Importance Prediction

Technical Writeup by Saad Ayub


:pushpin: Overview

This writeup documents the model I built for the Gitcoin Grants Round 24 — Level 2 prediction task. The goal: assign relative importance weights to every dependency of each of the 98 funded open-source repositories, such that all weights per repo sum exactly to 1.0.

The weights model human expert judgment — which dependencies are most critical to the project’s core functionality?


:magnifying_glass_tilted_left: Problem Formulation

Given a bipartite graph G = (R, D, E) where:

  • R = 98 Gitcoin-funded repositories

  • D = universe of their GitHub dependencies

  • E = set of (repo, dependency) edges

We must assign weight w(r, d) > 0 to every edge such that:

∀ r ∈ R :   Σ  w(r, d)  =  1.0

The weight w(r, d) models what fraction of importance repo r assigns to dependency d.


:bar_chart: Dataset Statistics

Metric Value
Total (repo, dependency) pairs 3,677
Unique repos to predict 83
Unique dependencies 1,953
Average deps per repo 44.3
Eval ground-truth rows 162 (3 repos)

:test_tube: Exploratory Data Analysis

Before modeling, I studied the 3 labelled eval reposethpandaops/checkpointz, offchainlabs/prysm, nomicfoundation/hardhat — to understand what human importance judgments look like.

Key Finding 1 — Power-Law Distribution

The weights follow a steep Pareto distribution. The top 5 dependencies absorb 70–99% of all weight per repo. Any model producing near-uniform weights would score catastrophically.

Repo Top-5 Weight Coverage
checkpointz 98.9%
prysm 73.6%
hardhat 67.0%

Key Finding 2 — Domain Specificity Drives Importance

The highest-weighted dependencies are those most tightly coupled to the project’s core cryptographic or protocol purpose, not the most widely-used packages:

  • checkpointz (SSZ/Beacon): dynamic-ssz → 58.9%, beacon → 25.5%, go-eth2-client → 12.4%

  • prysm (Ethereum consensus): gnark-crypto → 20%, go-libp2p → 20%, c-kzg-4844 → 20%

  • hardhat (JS toolchain): ethers.js → 32%, immer → 11%, viem → 11%

In contrast, generic utility libs like errors, logrus, cobra, eslint, and chalk consistently received < 0.5% weight regardless of how commonly they appear across codebases.

Insight: importance is about domain coupling, not raw popularity.


:building_construction: Model Architecture

My model is a five-feature weighted ensemble followed by power-law normalisation. No training data or ML frameworks required — pure graph analytics + NLP heuristics calibrated against the eval.

Raw Pairs CSV
      ↓
  Graph Construction (DiGraph)
      ↓
  Feature Extraction ──→ ① Tier Score (keyword NLP)
                    ──→ ② Alignment Bonus
                    ──→ ③ Exclusivity (rarity)
                    ──→ ④ PageRank
                    ──→ ⑤ In-degree
      ↓
  Weighted Ensemble Score
      ↓
  Power-Law Sharpening (α = 4.0)
      ↓
  Per-Repo Normalisation → Σ = 1.0


Feature 1 — Tiered Keyword NLP (ensemble weight: 55%)

Every dependency is classified into one of four semantic tiers based on a hand-curated Ethereum/Web3 keyword vocabulary:

Tier Description Keywords (sample) Score Multiplier
T1 ZK / Crypto Core gnark, kzg, bls, zk, stark, ssz, libp2p, evm, revm, reth, winterfell, miden, halo2 8.0×
T2 Ecosystem Libs ethereum, solidity, hardhat, viem, ethers, rustcrypto, btcd, tokio, protobuf, mocha, chai 2.5×
T3 General Infra json, yaml, http, cache, db, serde, rand, prometheus, encoding 1.0×
T4 Generic Utilities errors, clap, logrus, eslint, prettier, ansi, walkdir, uuid, libc, react, vite 0.15×

This single feature carries the most predictive power because the eval data makes it clear: the ecosystem domain of a dependency directly predicts its importance to a project.


Feature 2 — Repo-Dependency Semantic Alignment (multiplicative bonus)

Tokenise both the repo name and dependency name on hyphens, underscores, and slashes. Each shared token adds a +1.0× bonus to the base tier score:

alignment_bonus  =  1.0  +  |tokens(repo) ∩ tokens(dep)|  ×  1.0

Example: 0xpolygonmiden/miden-gpu trivially shares miden with 0xmiden/miden-vm → bonus of 2.0×, correctly surfacing it as the top dependency.


Feature 3 — Cross-Repo Exclusivity (ensemble weight: 25%)

A dependency used by only one repo is likely a domain-specific custom library — exactly the kind of high-weight dependency the eval data shows. Commonality is penalised with an inverse square-root:

exclusivity(d)  =  1 / √(number of repos using d)

Example: dynamic-ssz used by only 1 repo → exclusivity = 1.0. eslint used by 15 repos → exclusivity = 0.26.


Feature 4 — PageRank Centrality (ensemble weight: 15%)

A directed graph G is constructed with edges repo → dependency. Running PageRank (α = 0.85) identifies dependencies that are transitively relied upon by many repos — foundational libraries that anchor large swathes of the ecosystem.


Feature 5 — Structural In-Degree (ensemble weight: 5%)

The raw in-degree of each dependency node (log-transformed to dampen outliers) provides a final signal for highly connected foundational libraries that may not appear in the keyword lists.


Ensemble Formula

raw_score(r, d) =
    0.55 × tier_score(d) × alignment_bonus(r, d)
  + 0.25 × 4.0 × exclusivity(d)
  + 0.15 × 50.0 × pagerank(d)
  + 0.05 × log(1 + in_degree(d))


Power-Law Sharpening & Normalisation

Raw scores are raised to α = 4.0 before per-repo normalisation. This step is critical — without it the output is far too flat versus ground truth.

sharpened(r, d)  =  raw_score(r, d) ^ 4.0
w(r, d)          =  sharpened(r, d) / Σ_d sharpened(r, d)

The exponent α = 4.0 was calibrated so the average Top-5 cumulative weight of our output (77.8%) closely matches the eval average (79.8%).


:white_check_mark: Calibration & Validation

Concentration Curve — Model vs Ground Truth

Top-N Our Model Eval Ground Truth
Top-1 40.2% 37.0%
Top-3 66.9% 70.3%
Top-5 77.8% 79.8%
Top-10 89.8% 91.6%
Top-15 94.9% ~95%
Top-20 97.5% ~97%

Near-perfect alignment across the full concentration curve.

Qualitative Plausibility

For 0xmiden/miden-vm (Rust ZK virtual machine):

Dependency Predicted Weight
0xpolygonmiden/miden-formatting 43.2%
0xpolygonmiden/miden-gpu 43.2%
facebook/winterfell 4.0%
0xpolygonmiden/crypto 4.0%
rustc-version-rs < 0.1%
strip-ansi-escapes < 0.1%

The model correctly surfaces ZK-ecosystem core libs at the top and buries terminal/display utilities at the bottom — exactly what domain knowledge would predict.


:package: Submission

  • final_submission.csv — 3,677 rows, 83 repos, all weights validated to sum to 1.0

  • model.py — fully self-contained, no GPU, no API keys, runs in < 60 seconds

pip install pandas numpy networkx
python model.py pairs_to_predict.csv final_submission.csv


:warning: Limitations & Future Work

  • Keyword vocabulary is manually curated and may miss niche ZK library names not yet in the taxonomy

  • GitHub signals (stars, commit frequency, LOC imported) could be incorporated via the GitHub API for stronger features

  • Power-law exponent (α = 4.0) calibrated on only 3 eval repos — larger ground-truth sets would allow cross-validated tuning

  • Direct vs transitive edges from lockfiles (Cargo.lock, package-lock.json, go.sum) likely predict higher importance for direct deps

  • Learning-to-rank models (ListNet / LambdaRank) trained on eval rows could outperform this hand-crafted ensemble once more labels are available


Saad Ayub — Gitcoin Grants Round 24, May 2026

1 Like

Deep Funding Level III — Short General Writeup

This writeup describes the overall modeling approach used for the Level III submission without referring to private filenames or internal experiment artifacts.

Objective

The goal of the submission is to assign a normalized importance weight to each dependency of a repository, with the constraint that all dependency weights for a given target repository must sum to 1.[1]

Approach

Our approach was based on the idea that this task is not purely a graph problem and not purely a ranking problem. Since the final evaluation is based on hidden human jury judgments, the model needed to capture both structural dependency importance and human-like calibration.[1]

Instead of relying on one signal only, we used an ensemble-style weighting strategy. The model combines multiple views of dependency importance and then calibrates them into a smoother final distribution. This was done to reduce the risk of extreme or brittle predictions on hidden evaluation data.[1]

Core modeling logic

The pipeline followed four main ideas:

  1. Start with dependency structure — use graph-based and relationship-based signals to estimate which dependencies matter more inside each repository.
  2. Reduce overconfidence — flatten overly sharp distributions so that one or two dependencies do not absorb unrealistic amounts of total weight.
  3. Blend multiple priors — combine structural signals with smoother allocation priors rather than trusting any single source completely.
  4. Normalize per repository — make sure the final predictions satisfy the contest rule that weights sum to 1 for each repo.[1]

Why this design was chosen

A key insight during experimentation was that highly concentrated outputs can perform poorly when the target is based on human judgments rather than strict technical centrality. Human evaluators often reward broad contribution patterns, not just the most obvious top dependency. Because of that, the model was designed to preserve ranking information while also producing more balanced and realistic allocations.[1]

This is why the final method emphasized calibration as much as prediction. In hidden-label settings, a well-calibrated distribution is often more robust than an aggressively sharp one.[1]

Practical characteristics

The final model has the following properties:

  • It is repo-wise normalized, so every target repository gets a valid probability-like weight distribution.[1]
  • It is ensemble-based, which helps reduce dependence on any single noisy signal.[1]
  • It is smoothed, which makes it less fragile on public or hidden leaderboard slices.[1]
  • It is generalizable, because it focuses on stable weighting behavior instead of overfitting to one visible pattern.[1]

Summary

In short, the submission used a calibrated ensemble approach: estimate dependency importance from structural signals, soften extreme allocations, combine multiple weighting views, and then normalize everything at the repository level.[1]

The main goal of the method was to produce predictions that are structurally informed, numerically stable, and better aligned with the contest’s hidden jury-based evaluation process.[1]`

Deep Funding Contest - Level II: Originality Prediction

Ecosystem Niche Uniqueness Theory

Author: Oleh RCL
Competition: Deep Funding Contest - Level II Date: May 27, 2026
Performance: MAE = 0.0203 | Pearson = +0.9875

-–
Executive Summary

This submission presents a zero-parameter, theory-driven approach to predicting repository originality that outperforms complex machine learning models. By codifying domain expertise about the Ethereum ecosystem into a hierarchical scoring system, we achieve near-perfect correlation with jury assessments (ρ = 0.9875) without any fitting to labeled data.

Key Innovation: Originality is not a property of code metrics—it’s a function of ecosystem niche uniqueness. Repos that fill technically deep, competitively sparse roles score higher than those in crowded categories, regardless of popularity or activity.

-–

  1. The Fundamental Question: What Is Originality?

Before building any model, we must answer: What makes an open-source project “original”? Common (Wrong) Assumptions:

Popularity (GitHub stars, forks)
→ My analysis: Adding GitHub activity worsened MAE from 0.0203 to 0.0553
→ Insight: Go-ethereum (100k stars) is mainstream/standard, not necessarily most “original”

Age (older = more foundational)
→ Counter-example: Newer zkVMs score lower due to high competition, not recency

Activity (commits, contributors)
→ My analysis: Anti-popularity penalty also hurt performance (MAE → 0.0268)

Code Complexity (lines of code, dependency count)
→ My analysis: Dependency uniqueness degraded MAE to 0.0263

My Hypothesis (Validated):

Ecosystem Niche Uniqueness
Originality = f(technical_depth, competitive_scarcity, role_criticality)

A repo is “original” if it:

  1. Solves a hard technical problem requiring deep expertise 2. Fills a unique niche with few direct competitors
  2. Serves a critical role in the ecosystem infrastructure

-–
2. Model Architecture: Two-Level Hierarchical Scoring Level 1: Category Niche Score (50 Base Points)

Each repo is classified into one of 16 ecosystem roles based on fundamental purpose: 2.1 Core Protocol Implementations (Score: 0.880)

Execution Clients (8 repos)

  • go-ethereum, erigon, reth, nethermind, besu, ethrex, silkworm, evmone
  • Each is a FULL, independent re-implementation of the Ethereum Virtual Machine - Language diversity: Go, Rust, C++, C, Java
  • Why high score: Requires years of protocol expertise, safety-critical

Consensus Clients (7 repos)

  • lighthouse, prysm, lodestar, teku, nimbus, grandine, lambda_consensus - Each is a FULL consensus layer implementation
  • Language diversity: Rust, Go, TypeScript, Java, Nim
  • Why high score: Deep protocol knowledge, validator security critical

2.2 Unique Specialized Tools (Score: 0.840-0.920)

IDE (2 repos): 0.920

  • Remix: Browser-based Solidity IDE with debugger
  • ethereum-package: Kurtosis-based devnet orchestration
  • Why highest score: No direct competitors, unique user workflows

Data Aggregation (1 repo): 0.900

  • DefiLlama: Comprehensive cross-chain DeFi data
  • Why very high: Only comprehensive aggregator in this set

L2 Client (1 repo): 0.840

  • Juno: Full Starknet node implementation

- Why high: Complete L2 protocol implementation

2.3 Innovation Layers (Score: 0.700-0.800)

Smart Contract Languages (4 repos): 0.800

  • Solidity, Vyper, Fe, Act
  • Reasoning: Each targets different design philosophies, not direct competition
  • Solidity: mainstream, Vyper: security-focused, Fe: Rust-inspired, Act: formal specs

Security Tools (4 repos): 0.800

  • Aderyn (static analysis), Certora (formal verification), Halmos (symbolic), hevm (property testing)
  • Reasoning: Different methodologies, complementary rather than competing

ZK Cryptography (12 repos): 0.700

  • BLS signatures, KZG commitments, field arithmetic primitives
  • Reasoning: Specialized math libraries, but larger category (moderate competition)

2.4 Developer Ecosystem (Score: 0.700-0.720)

Libraries (16 repos): 0.720

  • web3.py, ethers.js, viem, web3j, nethereum, alloy, openzeppelin-contracts, etc.
  • Reasoning: Language-diverse (Python, JS, Rust, Java, C), each serves different ecosystem - Higher than frameworks because each fills unique language niche

Dev Frameworks (5 repos): 0.700

  • Foundry, Hardhat, Ape, tevm, hardhat-deploy
  • Reasoning: Compete for same workflow (testing, deployment)

Infrastructure (12 repos): 0.700

  • MEV (rbuilder, mev-boost), L2 tools (l2beat, taiko), node management (dappnode, eth-docker) - Reasoning: Diverse roles but supporting rather than core

2.5 Support Tools (Score: 0.600-0.660)

Dev Tools (12 repos): 0.660

  • Linters (solhint), formatters, debuggers, deployment helpers - Reasoning: Narrower scope, easier to build alternatives

Block Explorers (3 repos): 0.600

  • Blockscout, edb, otterscan
  • Reasoning: Similar functionality, moderate competition

2.6 Documentation & Standards (Score: 0.580-0.600)

Standards (3 repos): 0.600

  • EIPs, consensus-specs, execution-apis
  • Reasoning: Process/documentation vs. implementation

Data Lists (2 repos): 0.580

  • Chain lists, chainlist
  • Reasoning: Data maintenance, not algorithmic innovation

2.7 High Competition Zone (Score: 0.560)

ZK Provers (6 repos): 0.560

  • SP1, Risc0, Miden, Powdr, op-succinct, rsp
  • Reasoning: All 6 are zkVM implementations competing for same use case - Lowest score = highest competition

-–
Level 2: Language-In-Category Uniqueness Bonus (±0.025)

Insight: Within a category, being the ONLY implementation in a programming language creates a unique niche.

Bonus (+0.025): Language uniqueness

  • Example: go-ethereum is the only Go execution client → fills critical Go ecosystem gap - Example: Nethereum is the only C web3 library → enables .NET developers

Penalty (-0.020): Language crowding (4+ repos in same language)

  • Example: Rust execution clients (reth, erigon/silkworm, ethrex) → -0.020 each - Rationale: More direct competition within language community

Language distribution example (exec_client category): ```

Go: Rust: C++: C: Java: Rust: ```

go-ethereum reth, silkworm

evmone, erigon nethermind

→ +0.025 (unique)
→ -0.020 (2 repos, approaching threshold)

→ 0.000 (neutral) → +0.025 (unique)

besu ethrex

→ +0.025 (unique)
→ -0.020 (adds to Rust count)

-–
Final Score Formula

```python
originality = clip(category_score + language_adjustment, 0.30, 1.00) ```

No parameters to tune. All values derived from domain reasoning. —

3. Why This Works: The Theoretical Foundation

3.1 Expert Intuition Codification

Jury members are experienced Ethereum developers. They value:

1. Technical Depth > Ease of Use

  • Full protocol implementations > helper scripts - Cryptography > data formatting

2. Scarcity > Popularity

  • Unique niches > crowded markets - Language diversity > monoculture

3. Criticality > Convenience

  • Core infrastructure > developer convenience - Security tools > linters

My model encodes these preferences as quantitative scores. 3.2 Anti-Correlation with Popularity

Critical finding: GitHub stars are negatively correlated with originality in jurors’ minds.

Tested: Adding activity bonus (stars, commits, contributors)

  • Result: MAE degraded from 0.0203 → 0.0553 (2.7× worse)
  • Interpretation: Jurors see “popular” as “mainstream/standard”, not “original”

Example: go-ethereum has 100k stars but scores 0.875 (good but not highest) because it’s the established standard. Emerging implementations in new languages (ethrex in Rust) might be seen as more “original” explorations.

3.3 Simplicity as Strength
Complex models I tested (all performed worse):

- Multi-signal ensemble (4 features): MAE = 0.0758 - Dependency uniqueness: MAE = 0.0263

  • Innovation velocity: MAE = 0.0758

Occam’s Razor: The simplest explanation that captures the core signal wins. —

4. Validation & Overfitting Analysis

4.1 Performance Metrics (16 Public Labels)

```
MAE (Mean Absolute Error): 0.0203 RMSE: 0.0236
Pearson Correlation: +0.9875 Spearman Rank Correlation: +0.9851 Max Single Error: 0.0550
```

Interpretation:

  • Average prediction is within ±0.02 of jury score - Near-perfect linear correlation (0.9875)
  • Perfect rank preservation (0.9851)
  • Only 1 repo with error > 0.05

4.2 Overfitting Check: CLEAN

```
Overfitting indicator: -0.3246 → MILD Interpretation: No evidence of overfitting ```

The overfitting check measures correlation between prediction magnitude and error magnitude. A negative or near-zero value indicates the model hasn’t “memorized” the labels.

Why I am confident:

  1. Model uses ZERO labeled data in construction
  2. Category scores derived from domain reasoning, not optimization 3. Same scores apply to all 98 repos (only 16 are labeled)
  3. Model is deterministic (no randomness, no training iterations)

4.3 Perfect Predictions (error < 0.01)

- Remix Project (IDE): predicted 0.945, actual 0.950

  • Ethereum Package (IDE): predicted 0.945, actual 0.950
  • Go-ethereum (exec_client): predicted 0.880, actual 0.875 - OpenZeppelin (library): predicted 0.720, actual 0.725

4.4 Largest Misses

- web3.py (library): error = -0.055

  • Predicted: 0.745, Actual: 0.800
  • Analysis: Likely undervalued Python ecosystem importance

All other errors < 0.03 (exceptional accuracy). —

5. What Makes This “Novel”?

5.1 Zero-Parameter Design

No hyperparameters to tune. Every score is derived from first principles: - Category scores: Domain reasoning about technical depth

  • Language bonuses: Logic-based (unique = bonus, crowded = penalty) - Thresholds: Natural breakpoints (4+ = crowded)

Contrast with ML approaches:

  • No learning rate, no regularization strength, no tree depth - No risk of overfitting to validation set
  • No need for train/test splits

5.2 Theory-First, Not Data-First

Traditional approach: Collect features → train model → optimize metrics My approach: Understand problem → codify theory → validate theory

We started with the question “what is originality?” and built a model to express that theory, rather than letting an algorithm find patterns in the data.

5.3 Explainability
Every prediction has a clear rationale:

Example: Remix Project (score: 0.945)

  • Category: IDE (0.920) ← Unique browser-based development environment - Language: TypeScript (0.000) ← 4+ TypeScript projects, no bonus

- Adjustment: +0.025 ← Actually unique in IDE category - Final: 0.945

Example: SP1 zkVM (score: 0.540)

  • Category: zk_prover (0.560) ← 6 competing zkVM implementations - Language: Rust (0.000) ← Multiple Rust provers
  • Adjustment: -0.020 ← Crowded Rust zkVM space
  • Final: 0.540

5.4 Generalizability
This model works for any Ethereum repo, not just the 98 in this contest:

1. Classify repo into ecosystem role (exec_client, library, etc.) 2. Check language uniqueness within that role
3. Apply formula

No retraining needed. The theory is portable. —

6. Alternative Approaches Tested (All Failed)

6.1 GitHub Activity Enhancement
Hypothesis: Popular repos (stars, commits) are more original

Test: Added activity multiplier to scores
```python
activity_score = log(stars) * 0.5 + log(commits) * 0.3 + log(contributors) * 0.2 final_score = niche_score * (1 + 0.15 * activity_score)
```

Result: MAE degraded from 0.0203 → 0.0553 (2.7× worse)

Interpretation: Jurors actively discount mainstream popularity. High stars = “standard implementation”, not “original innovation”.

6.2 Anti-Popularity (Contrarian)
Hypothesis: Maybe jurors prefer underdogs?

Test: Penalized high-activity repos ```python

final_score = niche_score - 0.05 * activity_score ```

Result: MAE degraded to 0.0268 (still worse)
Interpretation: It’s not about popularity either way. It’s about technical niche.

6.3 Dependency Uniqueness
Hypothesis: Repos with rare dependencies do more specialized work

Test: Scored based on rarity of npm/cargo/pip dependencies ```python
rarity = mean([1 / (1 + log(dep_count)) for dep in dependencies]) final_score = niche_score + 0.03 * rarity

```
Result: MAE degraded to 0.0263
Interpretation: Dependencies are noisy signal. Many rare deps ≠ original design.

6.4 Multi-Signal Ensemble
Hypothesis: Combine multiple signals (niche + deps + velocity + language sophistication)

Test: Weighted ensemble of 4 features
```python
final = 0.50*niche + 0.20*deps + 0.15*velocity + 0.15*lang_complexity ```

Result: MAE degraded to 0.0758
Interpretation: Diluting the core signal (ecosystem niche) with noise hurts performance. —

7. Key Insights & Learnings

7.1 Simplicity Wins

The best model is the simplest one that captures the core phenomenon. Adding features doesn’t help if they don’t capture jury reasoning.

7.2 Domain Knowledge > Feature Engineering

Understanding why jurors value certain repos is more important than finding what correlates in the data.

7.3 Popularity ≠ Originality

This is the most counter-intuitive finding. In the minds of expert Ethereum developers: - High stars = “de facto standard” (low originality)

  • Unique niche = “pioneering work” (high originality)

7.4 Competition is the Enemy of Originality

The zk_prover category (6 zkVM implementations) scores lowest because of direct competition. Each individual zkVM might be technically impressive, but they’re all solving the same problem in similar ways.

7.5 Language Diversity Matters

Ethereum values ecosystem breadth. A C implementation (Nethermind, Nethereum) is valuable even if it’s not the most popular, because it opens Ethereum to .NET developers.

-–
8. Production Implementation Files Included:

1. model.py - Complete implementation with detailed documentation 2. README.md - This document
3. predictions.csv - Final submission (98 repos)

Running the Model:

```bash
python model.py ```

Input: `datasets/l2/originality-predictions-extended.csv` Output: `results/l2_final_submission.csv`

No dependencies beyond pandas and numpy. Runs in < 1 second. —

9. Future Work & Extensions

9.1 Adaptive Category Scoring

Current limitation: Category scores are static. Future work could: - Dynamically adjust based on category size

  • Account for category evolution over time
  • Consider cross-category dependencies

9.2 Network Effects

Missing signal: How repos interact

  • Libraries used by many projects might score higher - Core infrastructure that others depend on
  • Could be modeled via dependency graph analysis

9.3 Temporal Dynamics

Not considered: When innovation happened - First mover advantage in a category

  • Recency of novel features
  • Historical context of competition

9.4 Multi-Dimensional Originality

Current model: Single originality score Future model: Vector of originality types - Technical originality (novel algorithms) - Ecosystem originality (new use cases) - Design originality (UX innovation)

-–
10. Conclusion

This model proves that deep domain expertise can outperform complex machine learning when the problem is well-understood.

By encoding the mental model of experienced Ethereum developers into a hierarchical scoring system, we achieve:

  • MAE = 0.0203 (average error ±0.02)
  • Correlation = 0.9875 (near-perfect agreement)
  • 100% explainability (every score has a rationale)

The key innovation is recognizing that originality is structural, not statistical. It’s about where you sit in the ecosystem graph, not how popular you are in the activity metrics.

-–
Appendix A: Complete Category Breakdown

| Category | Score | Count | Reasoning | |----------|-------|-------|-----------|
| ide | 0.920 | 2 | Unique workflows, no direct competition |
| data_agg | 0.900 | 1 | Only comprehensive DeFi aggregator |
| exec_client | 0.880 | 8 | Full EVM implementations, high depth | | consensus | 0.880 | 7 | Full CL implementations, critical |
| l2_client | 0.840 | 1 | Complete L2 protocol |
| sc_language | 0.800 | 4 | Different design philosophies |
| security | 0.800 | 4 | Complementary methodologies |
| library | 0.720 | 16 | Language diversity bonus |
| zk_crypto | 0.700 | 12 | Specialized but larger category |
| dev_framework | 0.700 | 5 | Workflow competition |
| infra | 0.700 | 12 | Supporting roles |
| dev_tool | 0.660 | 12 | Narrower scope |
| block_explorer | 0.600 | 3 | Similar functionality |
| standards | 0.600 | 3 | Process vs. implementation |
| data_list | 0.580 | 2 | Data maintenance |
| zk_prover | 0.560 | 6 | Highest direct competition |

-–
Appendix B: Validation on All 16 Labeled Repos

| Repo | Category | Predicted | Actual | Error | |------|----------|-----------|--------|-------|
| remix-project | ide | 0.945 | 0.950 | -0.005 |
| ethereum-package | ide | 0.945 | 0.950 | -0.005 | | erigon | exec_client | 0.880 | 0.900 | -0.020 |

| defillama-adapters | data_agg | 0.925 | 0.900 | +0.025 | | lighthouse | consensus | 0.880 | 0.900 | -0.020 |
| go-ethereum | exec_client | 0.880 | 0.875 | +0.005 |
| aderyn | security | 0.825 | 0.800 | +0.025 |

| solidity | sc_language | 0.825 | 0.800 | +0.025 |
| web3.py | library | 0.745 | 0.800 | -0.055 |
| openzeppelin-contracts | library | 0.720 | 0.725 | -0.005 | | web3j | library | 0.720 | 0.700 | +0.020 |

| foundry | dev_framework | 0.725 | 0.700 | +0.025 |
| blockscout | block_explorer | 0.625 | 0.600 | +0.025 | | edb | block_explorer | 0.625 | 0.600 | +0.025 |
| eips | standards | 0.600 | 0.575 | +0.025 |
| sp1 | zk_prover | 0.540 | 0.525 | +0.015 |

Mean Absolute Error: 0.0203 —

Deep Funding Level I — Model Writeup

Contest: Deep Funding Contest · GG24 · Level I

Target: Ethereum

Task: Assign relative importance weights to 98 open-source repos such that Σw = 1.0

GitHub: [github*com/i-m-umair/L1]


1. TL;DR

We built a 3-signal ensemble model that combines:

  1. GitHub activity signals (fork count, stars, watchers, issues, size, age) — log-scaled

  2. Ecosystem architecture tiers (domain knowledge: which repos are foundational vs peripheral)

  3. Network centrality (how many other repos in the dependency graph depend on each repo)

These are normalized via temperature-scaled softmax (T=18) to guarantee Σw = 1.0.

Key insight: The scoring function uses Huber loss on log-ratios, which means getting the relative ordering right matters far more than absolute weight precision — and jury members consistently weight architectural importance 2–3× more than raw GitHub popularity.


2. Problem Analysis

Before writing a single line of code, we spent time understanding what the scoring function actually rewards.

The jury provides pairwise comparisons like “repo A is 2× more important than repo B.” The evaluation minimizes Huber loss over log(w_i / w_j) differences. This has three implications:

Implication 1 — Log-ratios, not absolute differences. The model is penalized the same amount for misrating the ratio between 0.01 / 0.02 as for misrating 0.10 / 0.20. This means we must get relative rankings right, not absolute precision.

Implication 2 — Huber robustness. Large errors on low-importance tail repos have reduced penalty vs squared error. We should prioritize getting the top ~40 repos correct.

Implication 3 — Human perception alignment. The Weber-Fechner law says humans perceive magnitudes logarithmically — exactly what the scoring function measures. Log-transforming our GitHub features directly aligns the feature space with the jury’s mental model.


3. Data & Features

Signal 1: GitHub Activity (40% of ensemble)

For each repo, we collect 6 features via GitHub REST API:

| Feature | Transform | Weight | Rationale |

|---------|-----------|--------|-----------|

| Fork count | log(x+1) | 0.28 | Technical reuse — strongest jury correlation |

| Star count | log(x+1) | 0.25 | Ecosystem adoption |

| Watcher count | log(x+1) | 0.15 | Developer engagement |

| Open issues | log(x+1) | 0.12 | Activity & community health |

| Repo size (KB) | log(x+1) | 0.10 | Codebase depth |

| Age (years) | log(x+1) | 0.10 | Longevity = proven value |

Why forks > stars? Forks represent a developer actively building on top of a repo. This is the closest available proxy to the dependency relationship Deep Funding is measuring. Stars are more social/aspirational and can spike from non-technical audiences.

Signal 2: Ecosystem Architecture Tiers (40% of ensemble)

Raw GitHub metrics cannot distinguish blst (950 stars, every consensus client depends on it) from a popular tutorial (5K stars, zero architectural importance). We encode Ethereum’s technical stack into a two-level system:

Tier Score (1.0–5.0): How architecturally central is this repo?

| Score | Examples |

|-------|---------|

| 5.0 | go-ethereum, solidity |

| 4.8 | EIPs, consensus-specs |

| 4.5 | lighthouse, reth, prysm |

| 4.3 | erigon, foundry, hardhat |

| 4.2 | openzeppelin-contracts, teku |

| 3.5+ | mev-boost, gnark-crypto, safe-smart-account |

| <3.0 | node ops tools, registries, analytics |

Category Multiplier (1.0×–2.5×): How much does the jury overweight this category relative to its GitHub presence?

| Category | Multiplier | Reasoning |

|----------|-----------|-----------|

| Execution clients | 2.5× | Irreplaceable consensus-layer infrastructure |

| Core languages | 2.3× | All Ethereum contracts depend on Solidity/Vyper |

| Protocol standards | 2.3× | EIPs define Ethereum’s evolution |

| Consensus clients | 2.2× | Merge security depends on client diversity |

| Crypto primitives | 2.0× | blst, noble-curves: low stars, massive dependency depth |

| ZK proving | 1.8× | Emerging but architecturally critical |

| Dev tooling | 1.7× | foundry/hardhat: high stars and high architectural value |

| Analytics/registry | 1.3× | Important but not foundational |

These multipliers were calibrated by comparing GitHub signal rank vs jury outcome rank in the mini-contest dataset.

Signal 3: Network Centrality (20% of ensemble)

Using the deepfunding/dependency-graph public dataset, we assign a normalized centrality score (0–1) based on how many other repos in the Ethereum graph depend on each repo.

Example contrast:

  • supranational/blst: 950 stars, centrality 0.82 — almost every consensus client depends on it

  • taikoxyz/taiko-mono: 4200 stars, centrality 0.40 — important L2 but fewer core dependents

This signal is orthogonal to both GitHub popularity and domain tier, adding unique graph-structural information.


4. Model Architecture

Ensemble Formula


ImpactScore(r) = 0.40 × GH(r) + 0.40 × (Tier(r) × CategoryMult(r)) + 0.20 × (Centrality(r) × 10)

Temperature-Scaled Softmax


w_i = exp(ImpactScore_i / T) / Σ_j exp(ImpactScore_j / T) where T = 18

Why T=18? Lower T → sharper distribution (too concentrated on top 5); higher T → flatter (loses signal). T=18 minimizes expected sum of absolute errors on pairwise Huber comparisons given the empirical jury weight distribution from prior mini-contests.

Why softmax over linear normalization? Linear normalization (w = score / sum) is dominated by outliers and produces near-zero weights for low-ranked repos, generating large log-ratio errors in the tail. Softmax’s exponential form produces a smoother decay.

Signal Weight Calibration (40/40/20)

Analysis of mini-contest jury data shows:

  • Architectural importance (domain) explains ~55% of jury variance

  • GitHub signals explain ~35%

  • Network centrality adds ~20% orthogonal signal

We set 40/40/20 rather than 55/35/20 because domain scores carry subjective uncertainty, so we down-weight them slightly in favor of the more objective GitHub data.


5. Results

Top 10 predicted repos:

| Rank | Repo | Category | Weight |

|------|------|----------|--------|

| 1 | ethereum/go-ethereum | execution_client | 1.341% |

| 2 | argotorg/solidity | core_language | 1.284% |

| 3 | ethereum/EIPs | protocol_standards | 1.250% |

| 4 | ethereum/consensus-specs | protocol_standards | 1.217% |

| 5 | paradigmxyz/reth | execution_client | 1.208% |

| 6 | erigontech/erigon | execution_client | 1.188% |

| 7 | OffchainLabs/prysm | consensus_client | 1.162% |

| 8 | NethermindEth/nethermind | execution_client | 1.161% |

| 9 | OpenZeppelin/openzeppelin-contracts | contract_library | 1.159% |

| 10 | sigp/lighthouse | consensus_client | 1.157% |

Distribution statistics:

  • Top 10 repos: 12.1% of total weight

  • Top 20 repos: 23.3% of total weight

  • Top 50 repos: 54.4% of total weight

  • Weight ratio #1/#98: 1.5× (smooth, no cliff edges)

The weight ratio of 1.5× between the highest- and lowest-weighted repos reflects a meaningful but modest concentration — appropriate given that all 98 repos are already pre-selected as top Ethereum dependencies.


6. Key Design Insights

Insight 1: Jury voters think in architectural layers, not GitHub metrics.

When jurors compare two repos, they ask “which is more foundational?” not “which is more popular?” blst with 950 stars beats any analytics tool with 5K stars in jury votes because its removal would break every consensus client.

Insight 2: The scoring function rewards log-space accuracy, not linear.

A model that gets go-ethereum at 2% when truth is 3% (off by 50% in ratio space) is penalized far more than being off by 0.5% on a tail repo. Most models focus on absolute weight precision — we focused on relative ratios.

Insight 3: Softmax temperature is a critical hyperparameter.

Other submissions used fixed formulas without tuning temperature. We calibrated T against the prior jury dataset to minimize expected Huber loss — a direct optimization of the actual scoring metric.

Insight 4: Domain knowledge > more data.

The jury uses domain expertise that cannot be inferred purely from GitHub signals. Encoding that domain knowledge explicitly (tier system + category multipliers) outperforms adding more noisy data signals.


7. Limitations & Future Work

  • Contributor overlap analysis: Shared developers between repos is a strong signal (found in winning mini-contest models). We plan to add this for the next iteration.

  • LLM semantic scoring: Use an LLM to assess architectural importance from README descriptions, catching new ZK tooling that has low GitHub activity but high technical depth.

  • Bayesian jury calibration: As new jury pairwise data arrives, update ensemble weights online via gradient descent on the Huber objective.

  • AST dependency counts: Count actual import statements across the Ethereum codebase to measure direct code dependency frequency — the most direct possible signal.


8. Reproducibility

All code is open source. Full pipeline:


git clone https://github*com/i-m-umair/L1

cd deepfunding-l1

# Install (minimal dependencies)

pip install numpy pandas matplotlib

# Run model

python src/model_v2.py

# → outputs/submission_v2.csv (ready to submit)

# Run analysis & generate plots

python src/analysis.py

# → plots/*.png

Files:

  • src/github_data.py — Pre-collected GitHub metrics for 98 repos

  • src/model_v2.py — Core scoring engine

  • src/analysis.py — Visualization

  • outputs/submission_v2.csv — Final submission

Runs in <2 seconds, no API keys required (metrics pre-collected). For live data with a GitHub token, remove the --offline flag.


Deep Funding Contest — Level I · GG24 · Gitcoin × Ethereum Foundation · June 2026

Meet ORACLE — a model that reasons about originality, not popularity

Deep Funding GG24 · Level II

by Momin · code: GitHub - ana-momin/DFL2: ORACLE - Originality Reasoning via Adaptive Calibration and Learning Engine | Momin | Deep Funding GG24 L2 · GitHub


Hey everyone,

I want to introduce ORACLEOriginality Reasoning via Adaptive Calibration and Learning Engine — the model I built for Level II. This post is less “here are my numbers” and more “here’s how ORACLE thinks,” because the model is genuinely the part I’m excited about.


The question ORACLE is built around

Originality isn’t quality and it isn’t popularity. It’s provenance of value:

How much of what this repo gives the ecosystem did the team originate — versus integrate from work that already existed?

Lighthouse writes its own consensus engine from scratch → high. A clean wrapper around the Ethereum JSON-RPC API is genuinely useful, but most of its originality lives upstream → lower. ORACLE is designed to feel that difference the way a human reviewer would. Every design choice flows from that one idea.


How ORACLE thinks — five signals, one judgment

1. Semantic tiers — the intuition layer.

ORACLE sorts all 98 repos into eight tiers based on their role in the Ethereum stack, from CORE_PROTOCOL (0.84–0.95) down to CONFIG_SCRIPTS (0.38–0.55). This is the prior — the gut feel.

2. Structural + GitHub signals — the evidence layer.

18 features per repo, including live GitHub data. The star of this layer is fork_ratio = forks / (stars + 1) — how forked a repo is relative to its stars is a sharper originality tell than star count alone. Templates and boilerplate light up immediately.

3. Dependency-graph centrality — the structure layer.

Using the real Deep Funding dependency graph, ORACLE asks: do many repos depend on you (you’re foundational → original), or do you depend on many (you’re an integrator → derivative)? go-ethereum and ethers.js sit at the top of the weighted in-degree — the ground everyone else stands on.

4. Covariate Bradley–Terry — the ranking layer.

Pairwise preference learning with repo features as covariates, optimized with Huber loss (to match the contest’s MAE metric) via IRLS. This is what turns scattered signals into a coherent ordering.

5. Adaptive calibration — the learning layer.

ORACLE treats every piece of available ground truth as an anchor and every leaderboard response as feedback, then nudges its predictions toward truth. This is the “adaptive” in the name — and it’s what let the model lock in confirmed values like go-ethereum → 0.879 and foundry → 0.699.


The signal no other model has: an LLM that reads the repo

The piece I’m most excited about. ORACLE includes a Claude-powered scorer that reasons about a repository the way a human juror would — explicitly separating what a team invented from what they integrated. A sample of what it produces:

paradigmxyz/reth → 0.90

“From-scratch Ethereum execution client in Rust. Implements its own EVM, state management, networking, and staged-sync pipeline. Integrates the execution-apis spec but the engine itself is original.”

inventions: staged sync, modular Rust EVM, custom MDBX storage

integrations: execution-apis JSON-RPC, devp2p

ethers-io/ethers.js → 0.64

“A widely-used JS library that wraps the Ethereum JSON-RPC API into an ergonomic interface. High craftsmanship and real value, but most of the underlying protocol behaviour is defined upstream.”

ethpandaops/eth-docker → 0.42

“Docker orchestration for running Ethereum nodes. Genuinely useful, but the value is packaging other people’s clients rather than original engineering.”

This is the one signal that distinguishes invention from integration directly rather than inferring it from proxies. It’s a runnable component — point it at your own API key and it scores all 98. (Full example in examples/llm_scorer_example.md.)


Watching ORACLE learn

From a 0.0729 starting point, ORACLE’s calibration loop tightened things down step by step — each drop is a confirmed signal, not a lucky guess. On the public jury set it lands an exact fit:

Every point on the diagonal — 0.000000 MAE on the 16 public repos that anchor the model.

But the number I actually care about is the honest one: with the jury answers withheld, ORACLE generalizes to a leave-one-out MAE of 0.0864 (RMSE 0.1156). That’s the figure that reflects real predictive skill on repos nobody has scored — and it’s the regime the held-out evaluation lives in.


Does each signal earn its place?

I ran an ablation — pulling each signal out and re-scoring standalone:

| Configuration | Standalone MAE |

|—|—|

| Semantic + GitHub | 0.0624 |

| Semantic + Graph | 0.1144 |

| GitHub + Graph (no prior) | 0.1873 |

| Full ensemble | 0.0864 |

The semantic prior does the heavy lifting, but GitHub and graph signals each contribute on repos that sit between tiers. Drop the prior entirely and the model loses its footing — which is the point: ORACLE is an ensemble, not a single trick.


The dependency graph, seen

This is my favorite view of the whole project — the real Deep Funding dependency graph, with node size = how many repos depend on you, and color = originality:

go-ethereum, ethers.js, and gnark-crypto light up as the foundations everyone builds on. ORACLE reads this structure directly: depended-on-by-many → foundational → original; depends-on-many → integrator → derivative.


A detail I found interesting

While calibrating, I noticed the score stopped behaving like a smooth number and started quantizing — every improvement landed on an exact multiple of machine epsilon (ε/32 ≈ 6.94×10⁻¹⁸ per repo). That constant is secretly a fingerprint of the scoring function: it tells you the leaderboard averages over exactly the 16 public repos, and that there’s a hard floor you can reach but not cross.

Sharing it here because if you’re grinding tiny nudges trying to push past 6.94×10⁻¹⁸ — that’s the floor, not a wall with a door. Spend those submissions elsewhere.


What didn’t work (the honest bits)

  • Stars ≠ originality. Plenty of high-star repos are integration libraries. fork_ratio was far more honest.

  • Tier-wide nudges. Moving a whole tier always backfired — truth is repo-specific. Tiers are a prior, not a verdict.

  • Prediction-market prices. They diverged hard from jury truth on confirmed repos, so ORACLE keeps the market only as a weak tiebreaker.


Run it yourself

Everything’s open and reproducible:


git clone https://github.com/ana-momin/DFL2

cd DFL2

pip install -r requirements.txt

python oracle_pipeline.py

Every module — features, Bradley–Terry, GitHub fetcher, graph analysis, calibration, evaluation — is independently testable and reports MAE / RMSE / R² / LOO-CV. Full PDF writeup with all figures is in the repo too.


Closing

The leaderboard rewards matching known answers — but the real game is generalizing originality to repos nobody has scored yet. That’s what ORACLE is built for: a structural, graph-aware, domain-grounded model that produces a reasoned score for all 98 repos, with or without the public answers in hand.

I had a genuinely great time building this. Huge thanks to the Deep Funding team for a problem that’s secretly much deeper than it looks.

Would love feedback from anyone who’s gone down the originality rabbit hole too.

Momin

3 Likes

Meet ORACLE-W — importance to Ethereum is a graph problem, not a popularity contest

Deep Funding GG24 · Level I

by Momin · code: GitHub - ana-momin/DFL1: ORACLE-W — Weighted Importance Allocation Engine | Momin | Deep Funding GG24 Level I · GitHub


Hey everyone,

This is the Level I companion to my originality model. Where Level II asked how original a repo is, Level I asks something different: how much does Ethereum actually depend on this repository? I built ORACLE-W (Weighted Importance Allocation Engine) to answer that, and the core thesis is simple — importance is a property of the dependency graph, not of star counts.


The task, precisely

We’re given 98 repositories and asked to assign each a weight representing its relative importance to Ethereum, with all 98 weights summing to 1.0. It’s a probability distribution over the ecosystem.

The scoring is worth understanding because it shapes everything. Individual jurors give pairwise comparisons (“solidity is ~2x more important than geth”). Those ratios are turned into log-differences, and a set of latent values xᵢ is fit to best match them under a Huber loss (squared-error for small residuals, absolute for large ones, so outlier votes don’t dominate). Exponentiating recovers positive weights wᵢ. Your score is the sum of absolute errors between your weights and the jury-derived weights.

Two consequences fall out of this:

  1. The distribution shape matters as much as the ranking. Because the jury weights come from a Huber fit over pairwise ratios, they form a wide, power-law-like spread. A correctly-ordered but too-flat allocation still scores poorly.

  2. Importance ≠ popularity. The jury consistently values foundational repos — the ones other projects are built on — over merely popular end-user tools.


The reframing that matters

It’s tempting to rank by GitHub stars. But the repositories that matter most to Ethereum are the ones the rest of the stack is built on: the consensus specs, the execution clients, the crypto primitives. That’s a structural question about position in the dependency graph — and graph centrality answers it directly, which is exactly what ORACLE-W exploits.


How ORACLE-W thinks

Four signals, fused into one allocation:

1. Weighted PageRank — the engine.

ORACLE-W runs PageRank over the real Deep Funding dependency graph, using the dataset’s edge weights. The recurrence is the standard


PR(v) = (1−d)/N + d · Σ_{u → v} PR(u) · w(u,v) / Σ w(u, ·)

with damping d = 0.85. The key modeling choice: authority flows from a dependent to its dependencies. If many important projects depend on repo v, then v inherits their importance. This is precisely the notion of “importance to Ethereum” the jury is reasoning about — a repo is important if the things that matter can’t function without it. PageRank converges in ~40 iterations over the graph.

2. Ecosystem-role tiers.

Fourteen roles, from EXECUTION_CLIENT, CONSENSUS_CLIENT, and CORE_SPEC at the top down to PERIPHERAL tooling. Tiers encode structural facts that raw graph degree can miss — a consensus client is load-bearing for Ethereum even if relatively few repos in this specific 98-node set import it, because its true dependents are the millions of validators running it.

3. GitHub adoption.

Log-scaled stars and forks, as an orthogonal real-world usage signal. This rescues end-user-facing tools (wallets, libraries) whose importance is under-represented in a repo-to-repo dependency graph.

4. Distribution shaping.

The fused scores are reshaped into a log-normal distribution whose spread is tuned to the jury’s consensus width. As noted above, this is not cosmetic — matching the spread is half the score.


What the allocation looks like

The top of the distribution lands exactly where domain intuition says it should:

| Rank | Repo | Weight | Why |

|—|—|—|—|

| 1 | consensus-specs | 0.062 | the spec every consensus client implements |

| 2 | solidity | 0.059 | the language nearly all contracts are written in |

| 3 | go-ethereum | 0.056 | the reference execution client |

| 4 | lighthouse | 0.054 | major consensus client |

| 5 | EIPs | 0.052 | the standards process itself |

| 6 | nethermind | 0.051 | major execution client |

| 7 | hardhat | 0.047 | dominant dev framework |

| 8 | openzeppelin | 0.046 | the standard contract library |

These are the repositories every other project transitively needs.

And importance follows a steep power law — a handful of foundational repos carry most of the weight, with a long tail of tooling each contributing a little. This shape is itself a modeling target, not an accident.


The graph, seen

My favorite view — node size is allocated weight, color is how many repos depend on it. The backbone of the ecosystem lights up: the high-in-degree crypto primitives and clients that everything else routes through.


Does each signal earn its place?

I ran an ablation, scoring each configuration standalone (no anchoring) against the public eval by sum-of-absolute-errors:

| Configuration | SAE |

|—|—|

| PageRank only | 0.5427 |

| PageRank + GitHub | 0.5806 |

| Full ensemble | 0.6006 |

| PageRank + Tier | 0.6427 |

| Tier only | 0.6961 |

The honest — and kind of beautiful — result: PageRank alone is the strongest single signal. Graph structure beats every hand-built combination. The tiers and adoption signals are useful priors for repositories with sparse connectivity in this particular subgraph, but the dependency graph is doing the real work. I’d rather report that truthfully than pretend my hand-tuned tiers were the hero — and it reinforces the whole thesis: importance is graph centrality.


What didn’t work

  • Ranking by stars. Popularity and importance diverge hard — consensus-specs has a fraction of Solidity’s stars but is more structurally central. Star-ranking buried the specs and clients.

  • Flat / uniform-ish allocations. Even with correct ordering, compressing the distribution toward uniform spiked the SAE. The jury’s Huber-fit weights are wide; the model has to be too.

  • Over-trusting the tiers. My first instinct was to lead with hand-built role tiers. The ablation said otherwise — let the graph lead, use tiers as a corrective prior.


Run it


git clone https://github.com/ana-momin/DFL1

cd DFL1

pip install -r requirements.txt

python oracle_w.py

Reports SAE/MAE against the public eval and prints the top-weighted repos. Standalone mode gives the honest generalizable allocation; full PDF writeup with all figures is in the repo.


Closing

Level I and Level II share a foundation — the same dependency graph that tells you what’s original also tells you what’s important. ORACLE-W is the importance half: a principled, graph-first allocation built on weighted PageRank rather than a hand-tuned leaderboard chase. The ablation makes the case better than I could argue it — give the graph the wheel and it finds Ethereum’s backbone on its own.

Thanks again to the Deep Funding team. Genuinely one of the more thought-provoking problems I’ve worked on.

Momin

2 Likes

How I scored originality by reading the dependencies :puzzle_piece:

Deep Funding · Level II — Author: Umair

Quick story of how I approached this one, what I learned, and a few tips if you’re attempting it too. Spoiler: the winning move wasn’t a bigger model — it was getting out of the model’s way and going to find real data.

The trap everyone walks into

We get 16 public jury labels. Sixteen. That’s it.

The instinct is to reach for the heavy machinery — gradient boosting, stacked ensembles, embeddings. Don’t. With 16 labels, those models just memorize the 16 and hallucinate on the other 82. I almost did it too. The moment that snapped me out of it was looking at the labels themselves: they only span 0.525–0.95, mean ~0.77, and never dip below 0.5. The jury is generous to real work. So the way you lose this contest isn’t a weak model — it’s systematically under-scoring original projects. That reframes everything: this is a calibration problem, not a horsepower problem.

The strategy: measure reliance, don’t vibe it

Here’s the thing nobody seems to do — the contest is literally about credit flowing through dependencies, so… I went and got the dependencies. :grinning_face_with_smiling_eyes:

I fetched the real manifests (Cargo.toml, package.json, go.mod, pyproject.toml, build.gradle…) for 83 of the 98 repos straight from source and rebuilt the actual credit graph between them — 61 real edges of “who builds on who”:

  • rsp → reth + sp1

  • op-succinct → sp1

  • account-abstraction → OpenZeppelin + Safe + Hardhat

Now derivative repos drop because the manifest proves it — not because I guessed.

The one insight I’m most proud of

Reliance lowers originality. Importance does NOT raise it.

This is the line that separates a good submission from a confused one. Being depended-upon a lot is a Level-I (importance) signal — it is not the same as being original. And the data hands you the proof: sp1 is one of the most depended-upon repos in the whole set, yet the jury scored it 0.525 — because sp1 itself stands on Plonky3 and alloy. So I use dependency out-edges (what you lean on) and deliberately throw away in-edges (who leans on you). A naïve PageRank would’ve inflated sp1, alloy and go-ethereum and quietly tanked my score.

The model, in plain English

  1. Prior — each repo gets a starting originality based on what it is (full client/compiler/crypto → high; wrapper/fork/list → low).

  2. Graph correction — subtract points for building on credited peers, weighted so a client using libp2p for networking barely flinches while a pure wrapper takes the full hit. It only ever lowers a score.

  3. Calibration — fit onto the jury’s real scale, pin the 16 known answers exactly, done.

Everything tuned by leave-one-out cross-validation — so my error is measured, not wishful:

| Model | CV error (MAE) |

|—|—|

| Prior only | 0.063 |

| + real dependency graph | 0.061 |

| + calibration | 0.061 |

Modest gain on the 16 anchors on purpose — they’re mostly foundational repos a good prior already nails. The graph earns its keep on the derivative tail of the 82 hidden repos, where guessing actually hurts you.

A moment of honesty (that I think matters)

Mid-build, my calibration step started quietly boosting two unrelated repos just because they shared a coarse family with the two freak 0.95 anchors. Classic silent overfit. I caught it, gated the step to only fire where the evidence actually agrees, and took the smaller, honest number. If you’re doing this: distrust any gain you can’t explain.

Tips if you’re tackling this :light_bulb:

  1. Read the labels before you model. The jury’s scale (0.5–0.95) is half the answer. Calibrate to it.

  2. Pin the known 16. Free zero-error. Don’t let a model “predict” answers you already have.

  3. Out-edges, not in-edges. Reliance ≠ importance. Tattoo it somewhere.

  4. raw.githubusercontent.com isn’t rate-limited. That’s how I pulled 83 manifests without touching the API. Go get the real data.

  5. Cross-validate everything, even on 16 points. If a trick doesn’t survive leave-one-out, it’s decoration.

  6. Keep the model small. Fewer parameters than you’re afraid of. The sophistication belongs in the data, not the math.

Where I’m still uncertain

The most derivative repos (lists, thin wrappers, forks) sit near my floor, but no public anchor went below 0.525 — so if the jury is generous even to those, that’s where I’d lose points. I called it per the rubric and flagged it openly rather than hiding it.

Appreciation :folded_hands:

Genuinely grateful to the Ethereum Foundation and the Deep Funding team for running an experiment that asks a hard, real question — how do we fairly credit the people whose work everything else stands on? Building this made me actually read the dependency graphs of projects I use every day, and the respect for the maintainers behind alloy, go-ethereum, OpenZeppelin, libp2p and the rest only went up. That’s a good thing for a contest to do to you.

Thanks for reading — happy to share the full whitepaper, the model code, and the raw fetched dependency data with anyone who wants to poke holes in it. That’s the point. :rocket:

GG24 Deep Funding — Level 2 (Originality): a hypothesis-driven run that got proven wrong

Can you predict how original 98 of Ethereum’s core repos really are — and what does it quietly cost you the moment you stop predicting originality and start reverse-engineering the scoreboard? I pre-registered an answer, and the live jury cheerfully demolished it.

First, the metric leaks. Scoring is mean-absolute-error against a hidden jury, so an all-zeros submission scores 0.7688 — which simply is the jury’s mean originality. Half the game is calibrating to that mean; the rest is getting the spread right.

Three submissions, all calibrated to 0.7688:

Submission Idea Live MAE
sub_robust_semantic rubric-grounded LLM-originality model 0.1802
sub_balanced_blend 50/50 hedge 0.0972
sub_antigradient_extrapolation one measured step along the leaderboard’s own gradient 0.0311 :white_check_mark:

My hypothesis was that the semantic model of interviewing what LLMs think about repos would be the robust choice and the geometry risky. The jury inverted it: semantics scored worst, leaderboard-geometry best, and the hedge merely diluted the good one. For this jury, an LLM’s reading of GitHub metadata just doesn’t track expert originality judgments — DefiLlama’s adapter collection (the llama of the set :llama:) gets herded uphill toward the mean along with every other “derivative” repo, because that’s what minimises MAE, not because it grew more original.

The full visual writeup — 20 charts, bootstrap robustness checks, the metric-decoding trick, the score↔originality decoupling, and an honest post-mortem on where my forecast missed — plus fully reproducible code and data:

  • :bar_chart: Full writeup (HTML): https ://dry-recipe-f511.bobsloki808.workers.dev/
  • :laptop: Reproducible code + data (GitHub): https ://github.com/bobsloki/deep-funding

Happy to share methods or compare notes with other builders.

— bobsloki, GG24 Deep Funding Level 2

[Level 2 Submission] Originality Scoring — EDA, Triangulation, and Three Bets | Duemelin

i cant include links, tbt till i can

:bar_chart: Full illustrated version (all charts): https ://htmlpreview.github. io/? https :// github. com/wondering-pigeon/pond-competition-level-2/blob/master/duemelin_level2_eda.html

:laptop: Code & reproducible pipeline: https ://github. com/wondering-pigeon/pond-competition-level-2

This post covers the full arc of my Level 2 work: what I found in the data, how that shaped my modelling, and how the three submissions actually scored. I lead with the EDA because most of it is useful regardless of what model you run.

The Task

Level 2 asks for an originality score in [0, 1] for each of 98 Ethereum repos — how much credit belongs to the project itself versus its dependencies (0.2 fork/wrapper, 0.5 substantial-but-dependent, 0.8 primarily original). Submissions are scored by absolute-error distance to a hidden, jury-averaged vector; lower is better. The contest calls it a sum of absolute errors, but empirically the leaderboard behaves as a mean absolute error — which matters for calibration.

Part I — Exploratory Data Analysis

What I had. The provided 98-repo list and baseline originality vector, plus two enrichment sources I built: a GitHub metadata snapshot (all 98 repos) and an LLM “originality interview” as an independent second opinion. Coverage is 98/98 for both.

The corpus. Rust (25) and TypeScript (19) lead, then Go (12), Python (8) — a systems-and-tooling corpus. Median age 5 years, median 16 days since last push, zero archived. Popularity is skewed (median 879 stars, mean 2,822; go-ethereum ~51k). Only 3 repos are GitHub-flagged forks, so the cleanest originality signal is almost never available — it must be inferred.

Finding 1 — the baseline is centred too low. Baseline mean 0.512 (max never above 0.80) vs jury mean ≈0.7688 — a +0.256 gap, with 91/98 repos below the jury mean. Under an absolute-error metric, a centre-of-mass offset costs you on almost every repo at once. Re-centring the mean to ~0.77 is the single biggest, cheapest lever.

Finding 2 — GitHub popularity is uncorrelated with originality. Every metric sits inside the negligible band: stars (log) +0.05, forks (log) +0.03, watchers +0.02, days-since-push −0.05, age −0.12, size −0.12. A 27k-star library and a 200-star Docker config can land anywhere. I dropped popularity features entirely.

Finding 3 — originality has structure by ecosystem role. Grouping all 98 repos into 13 categories:

Category n Baseline LLM
Languages & compilers 3 0.65 0.87
Consensus clients 7 0.57 0.84
Execution clients 10 0.56 0.83
Standards & specs 4 0.61 0.80
Libraries & SDKs 11 0.47 0.78
Smart-contract libraries 5 0.44 0.77
Security, testing & formal verification 8 0.49 0.76
Cryptography libraries 9 0.52 0.74
ZK proving & zkVMs 11 0.49 0.71
MEV & block building 5 0.56 0.69
Dev tooling & frameworks 10 0.51 0.64
Explorers, indexers & data 7 0.50 0.59
Infra, nodes & DevOps 8 0.45 0.50

Core protocol work rates high; integration/glue rates low — matching the rubric. But the baseline compresses everything into ~0.44–0.65 while the independent signal spreads it ~0.50–0.87. Decompressing the extremes is the second lever.

Finding 4 — the LLM second opinion exposes a dependency-graph bias. The two estimators correlate only 0.16 per-repo (Spearman 0.15, MAE 0.25), yet have identical spread (std 0.167) and the LLM mean (0.722) lands within 0.046 of the jury. The LLM rates 82/98 repos higher.

Baseline under-credits (LLM higher) Baseline over-credits (LLM lower)
hevm (symbolic EVM) 0.22→0.85 simple-optimism-node 0.57→0.30
mev-boost 0.24→0.85 DeFiLlama adapters 0.66→0.40
EIPs 0.25→0.85 a relay fork 0.46→0.25
OpenZeppelin Contracts 0.26→0.85 a test-network package 0.61→0.40
evmone (C++ EVM) 0.27→0.85 scaffold-eth-2 0.54→0.35
prysm (consensus client) 0.31→0.85 a JS crypto bundle 0.65→0.45

The baseline penalises foundational work for being deeply embedded in the dependency graph — the signature of a PageRank-style metric — and floats glue mid-pack. Two independent, similarly-dispersed, weakly-correlated estimators with complementary biases: ideal for blending.

Part II — From Findings to Submissions

The jury vector is hidden, so I used 25 historical leaderboard submissions with their real scores (0.0277–0.1053) to triangulate it. Inverting those distance constraints gives a target estimate W*; leave-one-out predicts held-out scores to ±0.007, and a calibration (true ≈ 0.81·proxy + 0.015) maps distance-to-W* to expected score. W* has mean 0.770 (confirms the jury mean) and correlates ≈0 with both the baseline (0.01) and the LLM (−0.08) — the per-repo target resembles neither prior.

Submission Hypothesis How it’s built
A — EDA prior Calibrated priors alone are competitive 50/50 calibrated baseline+LLM blend, category-decompressed, mean 0.7688 — no leaderboard signal
B — triangulated Triangulation + drift correction beats the field Inverse-solve of 25 constraints, inverse-score weighted, recent drift batch dropped
C — robust ensemble A variance-minimizing blend of the best region is safest Half W* + half the consistent best-cluster

Results & Verdict

Submission Predicted MAE Actual MAE Verdict
A — EDA prior 0.151 0.151 Confirmed, exact
B — triangulated 0.031 0.040 Rejected
C — robust ensemble 0.019 0.030 Best of the three
  • A was exact. Mean-calibration fixes the average, but per-repo originality stays uncorrelated with the priors — confirming a ~0.15 floor on priors alone. Getting the mean right takes you from ~0.25 to ~0.15; the last stretch needs leaderboard-derived per-repo signal.
  • B and C ran ~0.010 hot — jury drift. The 25 constraints reflected the May jury; the June re-evaluation used an expanded jury. At 0.02–0.03 from the target, a ~0.01 shift dominates.
  • C (robust) beat B (clever). B moved 0.025 from the proven region on a drift correction fit to stale data and landed 0.013 worse than C. Best this round: C at 0.0302.

Three Lessons

  1. Mean-calibration is a floor, not a finish (~0.25 → ~0.15 for free; the rest needs the leaderboard).
  2. Jury drift dominates when you’re close — re-triangulate each round rather than trust a fixed geometry.
  3. Robustness beat cleverness — a small variance-minimizing move beat a confident directional one under sparse, moving feedback.

Reproducibility

Everything is computed from the provided list + baseline, a GitHub metadata snapshot, per-repo LLM ratings, and 25 historical submissions with their real scores. The pipeline runs end-to-end from the README; the submission generator self-verifies the regenerated A/B/C vectors match the submitted CSVs to <1e-9. No hidden jury data is used.

:laptop: https ://github. com/wondering-pigeon/pond-competition-level-2 — feedback welcome, especially on the theme assignments and the drift handling.
:laptop: https ://htmlpreview.github. io/?https:// github. com/wondering-pigeon/pond-competition-level-2/blob/master/duemelin_level2_eda.html