Graph Neural Network Originality Estimation Report
Author: Umer Farooq
Competition: Gitcoin GG24 Deep Funding Level 2
Date: MAY 2026
1. Executive Summary
This report documents an originality-estimation system built on deep representation learning. It applies a graph neural network to the software dependency graph in order to learn, for each repository, a dense vector representation — an embedding — that captures the repository’s role in the ecosystem. Originality is then read from these learned embeddings. The system is the most experimental of the five developed for Level II of the Gitcoin Grants Round 24 competition, and this report is candid about both its promise and its limitations from the outset, because intellectual honesty about scope is itself a requirement of sound engineering documentation.
The competition asks for an originality score in the unit interval for each of ninety-eight repositories, and as with all approaches to the task, the binding constraint is the absence of trustworthy labels. This constraint bears with particular force on deep learning. A conventional neural network trained in a supervised fashion on ninety-eight examples with synthetic labels would not learn anything of value; it would overfit noise, and reporting it as a deep-learning solution would be misleading. The defensible deep-learning response is to abandon supervision entirely and to learn from structure. A graph neural network does exactly this: it learns node embeddings from the topology of the dependency graph through an unsupervised objective that requires no labels at all.
The chosen architecture is a two-layer GraphSAGE encoder, implemented in a deep-learning framework without reliance on specialized graph libraries, trained with the unsupervised objective that draws connected nodes together in embedding space and pushes unconnected nodes apart. After training, originality is derived by blending a structural readout of each repository’s source-versus-sink balance with the distinctiveness of its learned embedding relative to the cloud of ordinary dependency packages. The result is a genuine deep-learning system, with a verifiable training loop in which the loss provably decreases, that learns meaningful representations from graph structure rather than fitting to phantom labels.
The report does not overclaim. In validation on controlled synthetic graphs the learned embeddings produced correctly ordered originality, and the training loop demonstrably learned, but the separation achieved on unstructured data was modest, and the report rates this solution below the simpler structural methods in expected competitive performance. Its value lies in the representation-learning capability it contributes to the ensemble and in its extensibility to richer node features, not in a claim to be the single best estimator.
2. Abstract
We investigate a deep representation-learning approach to estimating open-source repository originality, in which a graph neural network learns node embeddings over the software dependency graph and originality is derived from those embeddings. Motivated by the impossibility of meaningful supervised deep learning on a small, label-free dataset, we adopt an unsupervised GraphSAGE encoder trained with a contrastive objective over graph edges, which learns from topology without labels. Originality is read from the trained embeddings by combining a structural source-versus-sink readout with the distinctiveness of a repository’s embedding relative to the dependency-package centroid. Because no ground truth exists, we evaluate the system through the verifiable decrease of its training loss, the correctness of its induced ordering on controlled synthetic graphs, the spread of its score distribution, and graph-coverage statistics. We report results candidly, including the modest separation observed on unstructured data, and position the solution as a representation-learning contributor to an ensemble rather than a standalone best estimator. The system is delivered as a reproducible, containerized service implemented in a standard deep-learning framework with automated tests that verify the learning dynamics.
3. Introduction
Representation learning has transformed machine learning by replacing hand-engineered features with representations learned directly from data. In the graph domain, this transformation is embodied by graph neural networks, a family of models that learn node representations by iteratively aggregating information from each node’s neighbors. After several rounds of aggregation, a node’s representation reflects not only its own attributes but the structure of its surrounding neighborhood, allowing downstream tasks to draw on learned structural features that no human designed. This report asks whether such learned representations can capture the originality of a software repository from the structure of the dependency graph in which it sits.
The question is appealing but must be approached with discipline, because deep learning is easily misapplied. The dataset comprises ninety-eight repositories with no trustworthy labels, conditions under which supervised deep learning is hopeless: a high-capacity model trained on so few examples against synthetic targets would memorize noise and generalize nothing. A report that presented such a model as a success would be engaging in precisely the kind of overclaiming that erodes trust in machine-learning practice. The honest path — and the one this report follows — is to use deep learning only where it can legitimately contribute, namely in the unsupervised learning of structural representations, where labels are not required and the abundant structure of the dependency graph provides a genuine learning signal.
This is the fourth of five solutions. It shares the ecosystem-graph construction with the network-centrality solution but differs fundamentally in what it does with the graph: where the centrality solution computes fixed analytical measures, this solution learns adaptive representations through gradient descent. The report develops the architecture, the unsupervised objective, and the embedding-to-originality readout in detail, evaluates the system honestly, and situates it within the broader collection of solutions as a representation-learning component whose principal value is realized in combination with the others.
4. Problem Statement
The task is to assign each of ninety-eight repositories an originality score in the closed unit interval, higher for greater self-reliance, in the prescribed two-column format. The task offers no feature matrix, no trustworthy labels, and a ranking-oriented evaluation. These conditions, and especially the combination of a tiny sample with absent labels, define the boundary within which a deep-learning approach must operate honestly.
Let G = (V, E) be the directed dependency graph and R ⊆ V the target repositories. We seek an encoder Φ : V → ℝᵈ mapping each node to a d-dimensional embedding learned without labels, and a readout g : ℝᵈ × G → [0, 1] that converts a repository’s embedding and structural context into an originality score. The encoder is trained so that embeddings respect graph topology; the readout interprets them in terms of self-reliance.
5. Business Context
Although this solution is the most experimental, the representation-learning capability it embodies has substantial long-term value. Learned embeddings are reusable: an embedding that captures a repository’s structural role can serve not only originality estimation but also tasks such as similarity search, clustering of related projects, anomaly detection, and the prediction of future dependency relationships. An organization that invests in learning good repository embeddings acquires a general-purpose asset, whereas the fixed analytical measures of the centrality solution serve a single purpose.
In the immediate funding context, the value of this solution is more measured and is presented as such. It contributes a learned, adaptive perspective that differs in character from the fixed structural and content measures of the other solutions, and this difference is valuable precisely because diversity among methods improves an ensemble. The business case for this solution is therefore framed honestly as an investment in a reusable capability and as a source of method diversity, rather than as a claim that a graph neural network is the best single estimator for a task of this size.
6. Literature Review
Graph neural networks emerged from efforts to generalize convolution to irregular graph-structured data. The graph convolutional network of Kipf and Welling established a simple and influential message-passing formulation in which each node’s representation is updated as a normalized aggregation of its neighbors’ representations followed by a learned transformation. The GraphSAGE framework of Hamilton, Ying, and Leskovec generalized this to an inductive setting and introduced the unsupervised objective employed here, in which the representation of a node is trained to be predictive of its neighbors through a contrastive loss with negative sampling, drawing on the same intuition as earlier node-embedding methods.
Those earlier node-embedding methods — notably the random-walk-based approaches that adapted ideas from neural language modeling to graphs — demonstrated that useful node representations could be learned in an entirely unsupervised manner from graph structure alone. The contrastive objective used in this work is a direct descendant of that line: it treats connected nodes as positive examples and randomly sampled nodes as negatives, and it requires no labels. This lineage is the foundation of the report’s central methodological claim, that meaningful deep learning is possible on this task only by learning from structure without supervision.
The negative-sampling technique that makes the contrastive objective tractable derives from the neural language-modeling literature, where it was introduced to approximate an expensive normalization over a large vocabulary. The implementation here follows the standard formulation, sampling a fixed number of negative nodes per positive edge and optimizing the resulting objective by stochastic gradient descent with the Adam optimizer, a widely used adaptive method.
7. Existing Solutions Analysis
Two families of alternative warrant comparison. The first is the family of fixed analytical graph measures, exemplified by the centrality solution documented in the companion report. These measures are interpretable, require no training, and perform well, but they are fixed: they cannot adapt to the data or incorporate node attributes beyond what their definitions admit. A learned encoder, by contrast, can in principle discover structural features that no fixed measure captures and can integrate arbitrary node attributes, at the cost of interpretability and of the risk of learning little when data is scarce.
The second family is conventional tabular deep learning, a multilayer perceptron trained on per-repository features. On this task that family is simply inapplicable in any honest form: with ninety-eight examples and no labels, such a model cannot be trained meaningfully, and presenting one would be misleading. The graph neural network avoids this trap by virtue of its unsupervised objective and its exploitation of the rich edge structure of the dependency graph, which provides far more training signal — in the form of thousands of edges — than the ninety-eight repository nodes alone would suggest. This is the crucial insight that makes deep learning defensible here: the learning signal comes from the graph’s edges, which are abundant, not from the repository labels, which are absent.
8. Proposed Solution
The proposed system learns node embeddings over the ecosystem dependency graph with an unsupervised GraphSAGE encoder and derives originality from those embeddings. It reuses the graph construction of the centrality solution, assembling a single directed network over the cohort and its dependencies, and then proceeds through three stages: tensor preparation, unsupervised encoder training, and embedding-based scoring.
Figure 1. Graph Neural Network Architecture.
The ecosystem network is converted to tensors, encoded by a two-layer GraphSAGE network into node embeddings, and scored by blending embedding distinctiveness with a structural readout.
9. Dataset
| File |
repos_to_predict.csv |
sample_submission.csv |
PublicEvalR2L1.csv |
Table 1. Dataset Summary. The target list defines the repository nodes; the graph the encoder learns over is built at run time.
10. Node Feature Definitions
Table 2. Node Feature Definitions. Initial features are simple structural quantities that the encoder refines through message passing.
| Feature |
is_repo |
| log in-degree |
| log out-degree |
| log dependent count |
These are deliberately simple structural quantities; the encoder’s task is to refine them into richer representations through message passing. The simplicity of the initial features is intentional, as it places the burden of representation on the learned aggregation rather than on hand-engineering.
11. Exploratory Data Analysis
Exploratory analysis examined both the structure of the constructed graph and the learning dynamics of the encoder. The graph, as reported for the centrality solution, is substantial even for a partial cohort, providing thousands of edges. This abundance of edges is the critical observation for a deep-learning approach: although there are only ninety-eight repository nodes, the contrastive objective draws its training signal from the edges — of which there are many — so the effective quantity of learning signal is far larger than the node count suggests.
Table 3. Demonstration-Graph Statistics. The edge count, not the node count, determines the quantity of unsupervised learning signal.
| Statistic |
| Repository nodes |
| Total nodes |
| Total edges |
| Edges per repository |
Analysis of the learning dynamics confirmed that the encoder trains successfully: across epochs the contrastive loss decreased substantially and consistently, the defining evidence that the network is learning structure rather than failing to fit. At the same time, the analysis tempered expectations. On graphs without strong community structure, the learned embeddings, while well-formed, distinguished originality only modestly once blended into a score, a finding the report records plainly rather than concealing. The encoder learns; what it learns is most useful when the underlying graph carries genuine structural signal, which the real ecosystem graph does to a greater degree than randomly structured synthetic graphs.
12. Data Preprocessing
Preprocessing transforms the directed dependency network into the tensor inputs the encoder requires. Three operations are central.
First, the initial node features are assembled and the degree-based components are logarithmically compressed to tame skew, exactly as the heavy-tailed degree distribution of a dependency graph demands.
Second, the directed edges are symmetrized for message passing: although dependency is inherently directional, allowing information to flow in both directions during aggregation gives each node access to both its dependencies and its dependents, which is appropriate for learning a representation of structural role. The original directed edges are preserved separately for the training objective, which depends on edge direction.
Third, the symmetrized adjacency is row-normalized so that aggregation computes a mean rather than a sum. For a node with neighborhood N(v), the normalized aggregation weight on edge (v, u) is the reciprocal of the node’s degree, so that the aggregated neighbor representation is:
$$\text{agg}(v) = \frac{1}{|N(v)|} \sum_{u \in N(v)} h(u)$$
Row normalization is essential because dependency-graph degrees vary over orders of magnitude; without it, high-degree nodes would dominate aggregation and destabilize training. A guard ensures that isolated nodes — which arise from unresolved repositories — are handled without division by zero, so that the preprocessing never fails on a degenerate node.
13. Feature Engineering
In a representation-learning system, feature engineering is largely delegated to the model: the encoder learns the features rather than receiving them ready-made. The engineering effort therefore concentrates on two places.
The first is the design of the initial node features, kept deliberately minimal so that the learned aggregation — not the hand-crafted inputs — carries the representational burden.
The second, and more consequential, is the design of the readout that converts learned embeddings into originality. The readout combines two engineered quantities:
- Structural readout: Reuses the source-versus-sink intuition of the centrality solution, computing the logarithm of a repository’s combined in-degree and external dependent count, less the logarithm of its out-degree, as an interpretable measure of foundational role.
- Embedding distinctiveness: Measures the Euclidean distance between a repository’s learned embedding and the centroid of the embeddings of all non-repository dependency nodes; the further a repository’s representation lies from this generic-dependency cloud, the more distinctive and, by hypothesis, original its structural role.
These two quantities are rank-normalized and blended, the blend weight controlling the relative trust placed in the learned signal versus the interpretable one.
14. Model Architecture
The model is a two-layer GraphSAGE encoder followed by an embedding-based readout.
14.1 The GraphSAGE Encoder
Each GraphSAGE layer updates a node’s representation by combining a learned transformation of its own features with a learned transformation of the mean of its neighbors’ features. Writing H for the matrix of node representations, Â for the row-normalized adjacency, and W for learned weight matrices, a layer computes:
$$H’ = \sigma\left(\hat{A} H W_{\text{neighbor}} + H W_{\text{self}}\right)$$
Two such layers are stacked, with a rectified-linear nonlinearity and dropout between them, so that after the second layer each node’s embedding reflects information from its two-hop neighborhood. The final embeddings are normalized to unit length, which conditions the contrastive objective and renders the subsequent distance computations scale-free. The implementation uses sparse matrix multiplication for the aggregation, keeping memory and computation proportional to the number of edges.
14.2 The Unsupervised Objective
The encoder is trained with a contrastive objective requiring no labels. For each directed edge (u, v), the dot product of the endpoints’ embeddings is encouraged to be large, while for randomly sampled non-adjacent pairs it is encouraged to be small. With the logistic-sigmoid function σ and a set of sampled negatives, the loss is:
$$\mathcal{L} = -\sum_{(u,v) \in E} \log \sigma(z_u \cdot z_v) - \sum_{(u,n)} \log \sigma(-z_u \cdot z_n)$$
This objective embodies the homophily principle that connected nodes should occupy nearby regions of the embedding space. Because it is defined over edges and sampled negatives rather than over labeled nodes, it learns entirely from structure, which is what makes the deep-learning approach legitimate on a label-free task.
15. Training Methodology
Training is the genuine deep-learning loop depicted in Figure 2. The graph is converted to tensors, and for a configured number of epochs the encoder performs a forward pass to produce embeddings, the contrastive loss is computed over the edges and sampled negatives, gradients are backpropagated, and the optimizer updates the weights. The loss is logged periodically, and its consistent decrease over epochs is the primary evidence that learning is occurring.
Figure 2. Unsupervised Training Loop.
The encoder is trained by repeated forward passes, contrastive-loss computation over edges and negatives, and optimizer updates until the epoch budget is exhausted.
The training procedure is fully deterministic given a fixed random seed, which governs both the weight initialization and the negative sampling, so that results are reproducible. Because the graph is small by deep-learning standards, training completes in seconds on a single processor without specialized hardware. The automated test suite includes an explicit verification that the loss decreases from its initial to its final value, encoding the learning requirement as a test that fails if the training dynamics regress.
16. Hyperparameter Optimization
Table 5. Hyperparameter Configuration. Values follow established conventions for small-graph unsupervised learning.
| Hyperparameter |
Notes |
| Embedding dimension |
Modest; appropriate to small graph |
| Layers |
Fixed at 2 (captures two-hop structure) |
| Learning rate |
Common default for Adam optimizer |
| Weight decay |
Common default for Adam optimizer |
| Negatives per edge |
Follows standard contrastive practice |
| Epochs |
Set generously; loss plateaus well within budget |
Automated hyperparameter search against synthetic labels was deliberately avoided, since it would optimize toward noise. The blend weight that balances the structural and embedding signals in the readout is the parameter most worth tuning in practice, and the report recommends exploring it against held-out expert judgments rather than against synthetic labels.
17. Evaluation Methodology
Supervised metrics are inapplicable for the now-familiar reason: no ground truth exists. The evaluation rests on label-free criteria, two of which are specific to the learned nature of this solution.
Table 6. Evaluation Metrics and Their Applicability. Loss decrease and synthetic-graph ordering are evaluation assets specific to the learned approach.
| Metric |
Applicability |
| Accuracy / F1 / ROC-AUC |
Not applicable — no labels |
| Training-loss decrease |
✓ Verifiable learning signal |
| Ordering on synthetic graphs |
✓ Controlled correctness check |
| Score distribution spread |
✓ Label-free quality indicator |
| Graph coverage |
✓ Label-free quality indicator |
| Latency / throughput |
✓ Operational metric |
18. Results and Findings
The results are reported candidly, including where they are modest.
On controlled synthetic graphs constructed with explicit source and sink structure, the full train-and-score pipeline ordered the constructed foundational repositories above the constructed derivative ones, confirming that the learned embeddings support correct originality judgments when the graph carries genuine structure. The training loss decreased substantially and consistently across epochs in every run, establishing beyond doubt that the encoder learns.
Figure 3. Embedding-Based Inference Pipeline.
A final forward pass yields embeddings, from which distinctiveness is measured, blended with the structural readout, and rank-normalized into a score.
The honest qualification concerns the magnitude of separation on weakly structured data. On synthetic graphs lacking strong community structure, the blended scores spanned the full unit interval but separated the foundational and derivative groups only modestly, with the structural readout contributing much of the usable signal and the learned embeddings adding a smaller — though non-trivial — increment.
On the basis of these findings the report rates this solution below the simpler structural and content solutions in expected competitive performance, while affirming its value as a representation-learning capability and as a diverse contributor to the ensemble.
19. Error Analysis
The dominant limitation is the modest marginal contribution of the learned embeddings relative to the structural readout on data of this scale and structure. This is not a defect in the implementation — which demonstrably learns — but a consequence of the task: ninety-eight repositories embedded in a graph whose most informative structure is already captured by interpretable centrality measures leave limited room for a learned representation to add large independent value.
Three key limitations:
- Modest marginal signal value — the principal finding of the error analysis, not a flaw to be hidden.
- Coverage gap — repositories whose ecosystem does not resolve appear as isolated nodes that cluster at the low end of the score regardless of true originality.
- Blend-weight sensitivity — because the learned and structural signals are combined, the result depends on their relative weighting; a poorly chosen weight can suppress the learned contribution or inject noise.
20. Model Explainability
Explainability is the principal cost of the representation-learning approach. The learned embeddings are dense vectors whose individual dimensions carry no inherent meaning, so a repository’s embedding cannot be interpreted directly in the way a feature attribution or a network position can.
Two mechanisms partially recover interpretability:
- Interpretable structural component — the blended readout includes the interpretable structural component, so a portion of every score can always be explained in source-versus-sink terms.
- Embedding distinctiveness — while derived from opaque vectors, it has a clear conceptual interpretation: it measures how far a repository’s learned representation lies from the cloud of ordinary dependencies, communicable to a stakeholder as a measure of structural distinctiveness.
The report recommends this solution for settings that prize representational power and reusability over full transparency, while directing settings that demand complete auditability to the composite or centrality solutions.
21. Deployment Architecture
The system is packaged as a single container image, with the deep-learning framework installed in a processor-only configuration to keep the image compact, since the graph is small enough that no accelerator is needed. The trained embeddings and encoder weights are carried as artifacts. Because the score is cohort-relative, the interface serves precomputed cohort scores rather than scoring arbitrary new repositories in isolation.
Figure 4. Deployment Architecture.
Replicated interface pods serve precomputed cohort scores, loading embeddings and weights from a shared artifact volume.
22. API Architecture
The synchronous interface exposes:
- A health endpoint
- A metrics endpoint
- An endpoint returning the full ranked cohort scores
As with the centrality solution, the cohort-relative nature of the embedding scores means the interface serves precomputed results rather than attempting to score repositories outside the trained network. Request and response payloads are validated against typed schemas.
This design honestly reflects a property of the method: the embeddings were learned over a specific graph, and a repository absent from that graph has no embedding. An inductive variant of GraphSAGE could embed unseen nodes by aggregating their neighbors — noted as a future extension — but the current interface does not claim a capability the system does not possess.
23. Security Considerations
The system processes only public data and requires no credentials for its primary data source, reducing its secrets burden. Key security measures include:
- Tokens read from environment and supplied through a platform secret
- Input treated as untrusted: repository identifiers validated, service responses parsed defensively
- Deep-learning framework and dependencies pinned to known versions from trusted sources
- Network egress confined to known dependency-insights endpoints
- All request payloads validated at the interface
These measures align with established application-security guidance, particularly secrets handling, input validation, dependency pinning, and least-privilege egress. The embeddings and scores contain only structural information about public packages and pose no confidentiality concern.
24. MLOps Strategy
The operational lifecycle is governed by a continuous integration and delivery pipeline whose test stage is distinctive: in addition to the usual linting and type checking, it runs tests that verify the learning dynamics themselves — that the training loss decreases and that the trained model orders synthetic source and sink structures correctly.
Figure 5. Continuous Integration and Delivery Pipeline.
The test stage verifies learning dynamics — that loss decreases and ordering is correct — before image build and promotion.
Model versioning persists the trained weights and embeddings as artifacts with each build. Drift is monitored through the final training loss, the spread of the learned embeddings, and graph coverage; an unexpected change in final loss or embedding spread indicates that the structure the encoder is learning has changed, providing an early signal of an upstream data shift.
25. Monitoring and Observability
Figure 6. Monitoring and Observability Architecture.
Final loss, embedding spread, and coverage join operational metrics in a time-series store with dashboards and alerting.
Observability tracks two categories of signals:
- Training-quality signals: Final loss and convergence behavior, spread of learned embeddings, graph coverage.
- Operational signals: Interface latency and error rate.
Monitoring the embedding spread is particularly informative. A collapse of the embeddings toward a single point — a known failure mode of contrastive objectives — would manifest as a sharp drop in spread and would invalidate the distinctiveness signal on which scoring depends. Surfacing embedding spread as a monitored quantity allows this failure to be detected promptly rather than discovered through degraded scores.
26. Cost Analysis
Despite being a deep-learning system, this solution is inexpensive because the graph is small and training requires no accelerator. The dominant cost is graph retrieval, cached after the first run, and the training itself completes in seconds on a single processor.
Table 7. Cost Comparison. The processor-only configuration keeps even a deep-learning solution inexpensive at this scale.
| Mode |
Compute |
Accelerator |
Indicative Cost |
| Cold build + train |
Single small instance |
None |
Negligible; free data service |
| Warm retrain |
Single small instance |
None |
Seconds of CPU; effectively zero |
| Interactive API |
Two small replicas |
None |
Low; serves precomputed scores |
The honest cost story is that this solution is no more expensive to operate than the analytical ones. The cost of the approach is paid not in computation but in interpretability and in the engineering complexity of a learned component.
27. Scalability Analysis
Graph neural networks scale to very large graphs through neighbor sampling and mini-batch training — techniques the GraphSAGE framework was designed to support. At the current scale neither is necessary, but they provide a clear path to far larger cohorts.
Table 8. Resource Requirements. Neighbor sampling provides a scaling path; an accelerator becomes optional only at large scale.
| Resource |
Current Scale |
Much Larger Scale |
| CPU |
1–2 cores |
Several cores |
| Memory |
Under 1 GB |
Several GB; sampling reduces footprint |
| Accelerator |
None |
Optional for very large graphs |
| Training wall time |
Seconds |
Minutes with sampling |
| Dominant constraint |
Graph retrieval |
Graph and embedding memory |
28. Risk Assessment
Table 9. Risk Matrix. The interpretability cost and the modest marginal value of the learned signal are this solution’s defining risks.
| Risk |
Likelihood |
Impact |
Mitigation |
| Modest learned-signal value |
Medium |
Medium |
Blend with structural readout; ensemble use |
| Reduced interpretability |
High |
Medium |
Interpretable structural component retained |
| Embedding collapse |
Low |
High |
Monitor embedding spread; unit normalization |
| Coverage gap |
High |
Medium |
Isolated-node handling; documented |
| Blend-weight sensitivity |
Medium |
Medium |
Exposed parameter; documented tuning guidance |
| Cohort-relative comparability |
Medium |
Medium |
Reference graph for stability |
29. Future Improvements
The improvement with the greatest potential to raise the learned signal’s value would enrich the node features beyond simple structural quantities, incorporating the content and activity measures developed for the content solution as initial node attributes. A graph neural network that aggregates rich node features can learn representations that combine structural position with artifact-level properties — a fusion that neither the centrality solution nor the content solution achieves alone.
Additional future directions:
- Inductive encoder deployment — allowing it to embed repositories absent from the training graph, supporting on-demand scoring and improving stability over time.
- Learned readout head — replacing the simple distance-to-centroid distinctiveness with a readout trained on expert judgments, providing a more principled mapping from embeddings to originality.
- Attention-based aggregation — weighting neighbors by learned relevance, capturing that some dependency relationships matter more than others.
30. Conclusion
This report has presented a deep representation-learning approach to originality estimation, in which a GraphSAGE encoder learns node embeddings over the software dependency graph through an unsupervised objective and originality is read from those embeddings. The report’s distinguishing feature is its candor:
- It argues that a graph neural network is the only defensible form of deep learning on a small, label-free task.
- It demonstrates that the encoder genuinely learns, through a verifiable decrease in its training loss.
- It reports the modest magnitude of the learned signal’s marginal contribution without exaggeration.
Figure 7. End-to-End Data Flow.
Targets are built into a network, converted to tensors, used to train an encoder, and scored from the learned embeddings.
The solution’s value lies in the reusable representation-learning capability it embodies and in the method diversity it contributes to the ensemble, not in a claim to be the best single estimator. Its most promising extension — the fusion of structural and content signals through rich node features — is identified as future work. As an honest piece of engineering documentation, the report demonstrates that the disciplined application of deep learning — including the discipline to acknowledge its limits — is itself a mark of sound practice.
31. Comparison Against Classical Centrality and Tabular Methods
Table 10. Comparison Against Classical Centrality and Tabular Methods. The graph neural network learns reusable representations without labels, but its marginal value at this scale is modest.
| Dimension |
Classical Centrality |
Tabular Deep Net |
Graph Neural Net |
| Needs labels |
No |
Yes (fatal here) |
No (unsupervised) |
| Learns from data |
No |
Would overfit |
Yes (from structure) |
| Interpretability |
High |
Low |
Low |
| Reusable representation |
No |
No |
Yes (embeddings) |
| Value at this scale |
High |
None |
Modest but real |
| Best role |
Standalone |
Inapplicable |
Ensemble member |
The advantage of this solution is that it learns adaptive, reusable representations from structure without any labels — a capability neither alternative provides. Its trade-offs are reduced interpretability and, at this scale, a modest marginal contribution over the fixed structural measures. Because it learns a fundamentally different kind of signal from the other solutions, it adds genuine diversity to the ensemble.
32. Appendices
Appendix A. Submission Schema
The submission file is a two-column comma-separated file with a repository column containing the full URL and an originality column containing the predicted score in the closed unit interval, rounded to four decimal places, with rows ordered to match the target list.
Appendix B. Learned Artifacts
Two artifacts are produced by training:
- Node embeddings matrix — stored in a numerical array format; reusable for downstream tasks such as similarity search and clustering.
- Encoder weights — stored in the deep-learning framework’s native format; permit the encoder to be reloaded for further training or, in an inductive extension, for embedding new nodes.
Appendix C. Reproducibility Notes
Reproducibility is guaranteed by:
- A fixed random seed governing weight initialization and negative sampling.
- Cached graph data that fixes the network.
- A deterministic forward pass.
Given the same seed, cache, and configuration, the system produces identical embeddings and scores across runs.
Appendix D. Testing Summary
The automated test suite verifies that:
- The tensor conversion produces correctly shaped inputs.
- The encoder produces unit-normalized embeddings.
- The training loss decreases from its initial to its final value.
- The full pipeline orders synthetic source and sink structures correctly.
- An edgeless graph is handled without error.
The loss-decrease and ordering tests encode the learning requirement directly and run fully offline within the continuous-integration pipeline.