State of FDD Season 12 Review

DisruptionJoe · January 27, 2022, 5:40pm

State of the Fraud Detection & Defense workstream - Season 12

In this State of FDD post, we are sharing an overview of the items our Stewards will find most important to know when considering our Season 13 (2/1 - 4/30) budget proposal.

TL;DR

Anti-Sybil = A+ | 3.5/4 Deliverables + 6 Extra Credit
Policy = A | 2/3 Deliverables + 2 Extra Credit
Evaluation Squads = A+ | 3/3 Deliverables + 2 Extra Credit
Evolution = B | 3/4 Deliverables
Research = B | 4/5 Deliverables + 2 Extra Credit
Operations = B- | 2/3 Deliverables + 1 Extra Credit
User Support = Transferred to DAOops

Season 12 (Q4) Financials

FDD Q4 Budget Tracking

P/L will be available prior to 3/15
FDD has $395k at current market value in its possession. $395k less $271k (reserves) less $81k (last 3 epoch payments) means $43k in treasury management gains and/or under budget performance during the quarter.

(We will see with the P/L, but I think we made about $75k in treasury management gains and went over budget by about $30k. The overage was warranted: it was caused by us adding the red team simulation squad, starting to build our source council for FDD-OS, and increasing the number of human evaluations.)
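For anyone who wants to check the arithmetic, here is a minimal sketch using only the figures quoted above (in thousands of USD). The variable names are made up for illustration, and the gains and overage numbers are the rough estimates from the parenthetical, not final P/L values.

```python
# Figures quoted in the post, in thousands of USD.
holdings_k = 395        # current market value held by FDD
reserves_k = 271        # reserves
epoch_payments_k = 81   # last 3 epoch payments

# What is left after setting aside reserves and the last 3 epoch payments.
surplus_k = holdings_k - reserves_k - epoch_payments_k
print(f"Surplus: ${surplus_k}k")

# Rough estimates from the post (to be confirmed once the P/L is available).
est_treasury_gains_k = 75
est_over_budget_k = 30
est_net_k = est_treasury_gains_k - est_over_budget_k
print(f"Estimated net from gains minus overage: ${est_net_k}k")
```

The two results ($43k and roughly $45k) are in the same range, which is what we would expect if the estimates are close.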

Special Acknowledgements

@Sirlupinwatson For handling FDD onboarding and bringing in a 300% increase in human evaluations completed

@omnianalytics For helping the research stream with data analytics and connecting research to other parts of the stream to apply insights to critical issues

@nollied For driving results and organization in our data storage layer and bringing our advisors into more active roles in FDD

A reminder of WHY we are here

DisruptionJoe · January 27, 2022, 5:41pm

Season 13 Budget Proposal will be reviewed by our multisig keyholder oversight council tomorrow (1/28/22)

It will then be posted to the forum before 2/4/22.

Sirlupinwatson · January 27, 2022, 6:35pm

Thank you @DisruptionJoe!

Really happy to keep working closely on driving the Human Evaluation and the contributor experience.
Looking back at the evolution of the Human Evaluation, I can clearly state that we made incredible progress with the Machine Learning model & Features, and that makes me proud of the contributions I can bring to the DAO.

This quarter we are adding an application that should scale into an interface where we can trigger actionable insights. At the same time, we might be able to reduce the cost while increasing the number of Human Evaluations and the overall quality, which is a win/win/win situation.

samspurlin · January 27, 2022, 6:54pm

“…This was warranted over budget cause by us adding the red team simulation squad, starting to build our source council for FDD-OS with Sam Spurlin…”

I just want it to be very clear that I was paid a few hundred dollars (~40 $GTC) for the series of workshops I designed and facilitated over the course of several weeks. I’m uncomfortable to see my name listed as part of the reason why FDD was over budget by $30k. It’s important to me that it’s publicly understood I didn’t show up and start siphoning thousands of dollars from this team.

DisruptionJoe · January 27, 2022, 7:02pm

Absolutely true Sam… Sorry for that! I was intending to list the line items that had variance from the original budget.

I’m going to adjust that line, as it reads wrong: it sounds as if the payments were to you rather than for operationalizing some decisions we made after learning from you.

kyle · January 31, 2022, 3:29am

Thanks, Joe -

I haven’t seen an FDD report for GR12 yet. Is that something that is in the works? I would love to better understand if the fraud tax is continuing to trend downward (kudos for the GR11 results!).

For those who haven’t read the attached docs, I want to note that I love this intent to narrow FDD’s focus again. I would love to make sure the core tenets of the workstream are not lost as FDD has expanded.

Now that these functions are secured with accountable parties, we aim to narrow our focus. Season 13 theme will be devoted to defending the grants mechanism from sybil and collusion attacks, ensuring that grants are screened for eligibility in a credibly neutral way, and building FDD in a sustainable way.

Can you explain what the

Flagging Efficiency Estimate which was 140% for GR12

means? I can’t seem to find more details around why we moved to this instead of the tax (despite us not paying out the tax last time).

I really love the way the workstream is now structured into the three groups/streams. It’s a helpful way to think about the work FDD is doing.

One criticism of the OKRs for these groups, however (and this is merely feedback to consider): they are mostly output measures, not outcome measures. I would love to see why we need 3 soft commits on data warehouse; what do you imagine the outcome of that to be? Or why contributors need to run the ASOP end to end… what are you hoping that accomplishes for the DAO? The Ground Control focus seems like it’s to confirm the satisfaction of FDD contributors is high, but once again picks output metrics (though it does say a metric will be defined in GR14, which I applaud) instead of simply stating we want contributor NPS to be above an 8 (this is just an example), then letting the stream figure out the best “how” to hit that.

As I dig deeper into the doc, I find this really great nugget:

There are also the cases where a judgment was made which saved the community from paying matching pool funds to an ineligible grant. This total is near $100,000.

I would have loved to learn more somewhere publicly.

My summary, and tl;dr - the focus areas of each squad feel bloated to me (as an outsider, and someone who doesn’t quite understand why decentralization, or 5 people in a council for each team, are necessary for what feels like focused work), and I am surprised that for multiple rounds now the coordination with and feedback from your customers (GPG, PGF) has been missing or nascent.

I wonder if there might be a goal or OKR that mentions the satisfaction of your customers is high (Those hosting and running grants rounds).

danlessa · January 31, 2022, 1:41pm

Hey Kweiss

We did not compute the Fraud Tax as a KPI because it has several problems as a Fraud Detection metric, notably:

It does not measure the performance of the sybil detection efforts
It is not informative of the actual sybil incidence over the contributions and users
It has an implicit meaning that fraud should be dealt with by performing payoffs so that grants always get the maximum denominator

The points above are enough to move that metric out of the scope of the core function of Fraud Detection. I do agree that it is an intriguing metric, but its scope is more related to a Business / Community Intelligence effort than to sybil detection itself, especially considering that it is not a trivial metric to compute (e.g. the QF algorithms and inputs must be reproduced exactly as described on the platform, and the data feeds should be clean).

As for the “Flagging Efficiency Estimate”, this is a more suitable metric, because it is defined as being: (% of sybil incidence according to flags) / (% of sybil incidence according to survey). In other words, this is measuring how much the combined flagging process is being “efficient” in regards to what we’ve surveyed through humans.

An FEE above 100% means that the entire process is flagging more than it would have if we had flagged using humans only. Less than 100% means that it is flagging less (which was the case in GR11). What does that mean? There are some interpretations:

The detection pipeline has more info and it is able to detect sybil behaviour better / with more confidence than humans alone
Humans are under flagging, eg., they do not have as much info as the FDD process, or are being conservative in their answers.

FEE is more suitable as a KPI because it gives us something we can use to refine the process and to criticise the actual Fraud Detection outcome objectively.
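To make the definition concrete, here is a minimal sketch of the ratio described above. The function names and the input shares are hypothetical, chosen only so that the ratio comes out at 140%; they are not the actual GR12 incidence numbers, and this is not the pipeline's real code.

```python
def flagging_efficiency_estimate(flagged_share: float, surveyed_share: float) -> float:
    """FEE = (% sybil incidence according to flags) / (% sybil incidence according to survey)."""
    if surveyed_share <= 0:
        raise ValueError("surveyed_share must be positive")
    return flagged_share / surveyed_share


def interpret(fee: float) -> str:
    """Plain-language reading of an FEE value."""
    if fee > 1.0:
        return "the combined process flags more than the human survey alone"
    if fee < 1.0:
        return "the combined process flags less than the human survey alone"
    return "the combined process flags as much as the human survey alone"


# Hypothetical inputs: 7% incidence according to flags, 5% according to the survey.
fee = flagging_efficiency_estimate(flagged_share=7.0, surveyed_share=5.0)
print(f"FEE = {fee:.0%}: {interpret(fee)}")
```

Both inputs only need to be in the same unit (percent or fraction), since FEE is a ratio.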

Pfed-prog · January 31, 2022, 1:56pm

Not sure if it is optimal to base the efficiency of the flagging process only on human intelligence (survey). Why not instead train human intelligence to find inappropriate behaviour?
From seeing pipeline in action, there is a lot of cherry-picking. Especially in regards to favouring existing grants. Here is my data analysis post: Lifetime Gitcoin Grants Data Analysis and Hypothesis Testing - #2 by Pfed-prog

The core issue is scalability. There is a disproportionate number of new grants on the platform. Hence the strong positive correlation between round_number and grant_id.

However, there is definitely a missing piece of analysis to what extent new grants are actually new users and not old users evolving to cheat the system.

danlessa · January 31, 2022, 2:24pm

Pfed-prog, it is a matter of fact that the current KPI (as well as any potential alternative KPI) has limitations. I tend to believe that we should in due time define a 2nd or even a 3rd KPI for the detection pipeline; however, we must be careful to make sure that:

They should consist of a single number for the entire round
They’re able to provide objective and actionable feedback about the sybil detection performance
The assumptions behind them are well understood, and the limits of their validity are clear

Provided that those are met, and given that evaluation cherry-picking is a topic of concern, we should try to discover what a good summary metric for Human Evaluation Performance would be. Maybe this is something to be continued on the data analysis post, or in a new one altogether.

DisruptionJoe · January 31, 2022, 5:56pm

Thanks for a well thought out response.

Here it is: Grants Round 12 Governance Brief

Can you explain what the

Flagging Efficiency Estimate which was 140% for GR12

means? I can’t seem to find more details around why we moved to this instead of the tax (despite us not paying out the tax last time).

Hopefully that helps explain. I’ve asked Danilo with blockscience to explain here too.

I’ve added outcomes to the Key Results now. Originally, the O was the outcome and the KRs are the metrics or binary results to determine if we are moving towards the outcome. I’d love to hear some of your thoughts on good OKRs for FDD. I think the outcomes listed are clear outcomes and the key results are indeed metrics or output functions.

Most of our focus has been on the job at hand and we have definitely been light on communicating that work. We will work on a standard formatting for an FDD dashboard of sorts this quarter in our “Mandate Delivery” squad.

I figured on this forum the addressable audience does understand “why decentralization”, but here is the quick explanation. Single points of failure are easily corrupted and remove optionality for all downstream participants of a system. Gitcoin grants is building public infrastructure for funding public goods.

This public goods funding process has two critical weaknesses which can be defended (sybil & collusion), but they require subjective reasoning. By expanding the inputs to both systems, we remove the ability for one actor to corrupt the system. At the same time, these algorithms learn to think like the community rather than the builder of the system.

I’m not sure what this is referencing? Our source council is the outcome owners of all the lower level outcomes. There is one council which currently has 9 members.

This would be great to include. GPG has expanded significantly now so I hope they will have the time to provide us feedback in the future. I even suggested having Nate, the GPG Grants Product Manager on our multisig to better connect us (This may not work because we want to separate the company and DAO).

Public Goods Funding workstream began running the Grants Operations for GR12, which was a core team function before. I ran the grants operations from GR8 to GR10. Then I was included in the weekly meetings for GR11. This is to say I had a consistent connection to them, but we could step up our post round satisfaction efforts directed to PGF.

Our mandate delivery squad will be looking into better metrics to more clearly communicate all these points during this upcoming season.

We will need to work with the PGF team in the future and will include them in our runbook going forward.

kyle · February 2, 2022, 8:01pm

Thanks for the reply. The additional details are helpful! Some quick replies inline on open questions I still have (you have answered a bunch of them).

I am not sure I am in a position to offer any advice on WHAT the OKRs should be. Assuming you are defining those appropriately, I am trying to offer my opinion on HOW to measure success. So, instead of binary actions (we did 3 things), we would try to measure the outcome we care about (sentiment of the contributors continues to increase). In the former case, the items you defined (the output) may all turn out to be the wrong things to do… and it doesn’t leave much room for you to adjust if you learn that. Whereas, if you were to outline “improve the sentiment of contributors by 10% QoQ” (measurement to be defined as part of the quarter), it gives you room to adjust and discover the best things to do.

I will go on a tangent here… when I worked at another startup that focused on reducing people’s energy use, we built a huge software platform with lots of goals for development, but we learned that mailing a report to people was more effective than the software. We were stuck with being a software company and having software-focused OKRs, so we never made the pivot to what was actually going to work and solve the problem most efficiently.

I would love to dive into this. You are making a determination on how to prevent this, by scaling and adding redundancy. I am honestly uncertain if this is the best approach. Perhaps instead of a team of 5 to review and ensure no one person is corrupt, you have a single person perform actions in a highly transparent way (don’t trust, verify) with an escalation path.

Thank you again for engaging in discourse here. I know these questions may be seen as attacking the proposal, but I really do have immense respect for and appreciate your engagement to help me understand more. I am chiming in and trying to understand because I care.

DisruptionJoe · February 3, 2022, 11:25am

I wish more people would take the time to thoughtfully comment on our proposal and accompanying documents. Thank you.

This part is helpful. The updated top-level OKRs are hopefully slightly more aligned with this. The example you gave is helpful.

I’m still not understanding where you are seeing a 5 person council for each thing. There is the FDD source council, which consists of the single leads of all the outcomes FDD has. I’ll be writing up a post to describe our system soon. (Working on the budget proposal now.)

In terms of the autonomy of single actors delegated authority by the collective vs councils or steward votes, I think this is a matter of how many people are represented. I think bottom-up legitimacy is designed via thoughtful consideration balancing subsidiarity and parpolity.

Basically, if you are governing a small group, then a single leader elected via consent is fine. As this grows and multiple leaders (or outcome owners in FDD) emerge, then these elected leaders form a council. This is similar to what we are doing with the CSDO at the workstream level. I think it is a fractal structure where there is always a level of collective ownership that is delegated to an elected leader who is accountable.

In Season 14, part of me removing myself will include having our source council impose by-laws for FDD including term limits for representation at the higher level.

Aaron Swartz on the Parpolity System

kyle · February 3, 2022, 3:52pm

There is likely an over-index here on the 5 people. I recall seeing praise given that we grew the number of people reviewing grants by 300% and I was unsure if this was a success case or failure case, i.e., why do we need to hire 300% more if there is transparency and an escalation path? In a world with finite resources to grow and foster Gitcoin, is hiring 300% more something that should be celebrated, or should we be optimizing for lean teams with transparency and an escalation path so that we are good stewards of the DAO treasury? This is mostly a rhetorical question, but I am curious about the sentiment of others as well.

Thanks for the updates to the OKRs, I will take a look!

DisruptionJoe · February 4, 2022, 2:35pm

Ohhhh I see. That was growing 300% for human evaluations of accounts flagged as sybil. We performed over 6,000!

The human evaluations should involve as many participants as possible because we want the ML algorithm to be dynamically updating to “think” like the community. Our long term metric for this is lowering the cost/evaluation while raising the inter-reviewer reliability.
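As a rough illustration of those two metrics, here is a minimal sketch. It assumes simple pairwise percent agreement as the reliability measure, which is only one of several possible choices and not necessarily what FDD uses. The function names, account IDs, labels, and dollar amounts are all hypothetical toy values.

```python
from itertools import combinations


def cost_per_evaluation(total_cost_usd: float, n_evaluations: int) -> float:
    """Average spend per completed human evaluation."""
    if n_evaluations <= 0:
        raise ValueError("n_evaluations must be positive")
    return total_cost_usd / n_evaluations


def percent_agreement(labels_by_account: dict[str, list[bool]]) -> float:
    """Share of reviewer pairs that gave the same sybil/not-sybil label to the same account."""
    agree = 0
    total = 0
    for labels in labels_by_account.values():
        for a, b in combinations(labels, 2):
            total += 1
            agree += int(a == b)
    if total == 0:
        raise ValueError("need at least one account with two or more labels")
    return agree / total


# Hypothetical toy data: True means the reviewer judged the account to be sybil.
toy_labels = {
    "acct_1": [True, True, True],
    "acct_2": [True, False],
    "acct_3": [False, False],
}
agreement = percent_agreement(toy_labels)
unit_cost = cost_per_evaluation(total_cost_usd=1200.0, n_evaluations=600)
print(f"agreement = {agreement:.0%}, cost per evaluation = ${unit_cost:.2f}")
```

Percent agreement does not correct for agreement by chance, so a chance-corrected statistic may be a better fit once the labelling setup is settled.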

For the grant reviews, we want to decentralize that input too in the long run. Mostly to have an open and permissionless system, but that is not solved with increasing budget. It will be by building out the GIA software to allow anyone to participate.

For example, in GR12 we paid $15k to 9 reviewers to get through the approvals. Our experiment with GIA allowed us to get 53 reviewers for the disputes within a few days using POAPs and $0 budget. Using the GIA tool for all approvals will allow us to lower this budget while involving more of the community!

bobjiang · February 7, 2022, 1:13am

Thanks, Joe, for the detailed review of Season 12!

Although user support has moved to DAOops, I do believe we worked hard and still have a lot of room for improvement.

We will keep improving user support for Gitcoin.

ZER8 · February 18, 2022, 5:00pm

Hello, the grant review squad did not grow by 300%; the Human Evaluations grew by 300% (those are periodic evaluations done by members of the Gitcoin DAO, not only people from the FDD).

But, glad you asked about our little grant review squad. The squad remained roughly constant (7 people during GR11 with 4 active reviewers, and 7 people during GR12 with 7 active reviewers), all while the number of grants grew. We actually increased our output (by a lot) and also focused on quality during GR12, all while remaining fair and ethical with our contributors.
All the grants were reviewed by at least 2-3 people in GR12 AND all the grants were approved within 48 hours after the round started (this efficiency wasn’t even possible before GR12 because we created these processes while also operating).

Because decentralization is also the goal, we are working to decentralize and enhance the grant review process even more before GR13.