Date: Tue, 20 Jun 2017 04:12:52 -0400
From: Roger Dingledine
Subject: Re: Privacy-Preserving Longevity Study of Hidden Services

--- My first thoughts ---

Initial thoughts on angles to consider:

A) The traditional question for this group: Is their methodology safe enough? Do they provide enough detail and specificity for us to decide whether it's safe?

B) Assuming yes, do we have faith that they can build and implement and deploy the thing they describe? This piece is interesting, because the bad-relays team already identified and kicked out their relays from the network, since they looked like an unidentified Sybil attack (and then Donncha contacted them, since some of the relays were from neu, and then a few days later they sent us this pdf). I think ultimately they should get the bad-relays team to be comfortable with the plan (else the bad-relays team will quite reasonably wonder what the next Sybil attack is for, and try to disrupt it). And I think we here can play a big role in either reassuring the bad-relays team or not doing that.

C) What other steps should they take when deploying their experimental relays, like labelling their relay nicknames, setting contactinfo, setting myfamily, etc? Maybe there's a set of best practices we can invent and then recommend. We might also choose to recommend that they go public about the experiment, before they do it -- unless they have a compelling need for secrecy, e.g. because it would mess up the experiment, and I don't see one here?

D) Do we think their mechanism is measuring things correctly, and measuring the right things? That is, if they collect things and compute them as they describe, will they indeed get the results they think they'll get? Part A is "is it safe to do", and part D is "will it actually work".

E) Is it worthwhile, that is, how valuable are the outcomes they're aiming for? That is, what do we think about the risk (A) vs the accuracy (D) vs the benefit (E)?
F) They seem to have some weird assumptions in their hypothesis, e.g. "Short-lived hidden services could indicate not to be legitimate domains, as compared to long-lived domains." Many short-lived services could be things other than websites, such as OnionShare addresses. The HSDirs can't distinguish what protocol the onion service speaks. These sorts of issues aren't killers, but it would be polite of us to point them out while we're noticing them.

G) What did I leave out?

And finally, I'll note that this submission has a lot of overlap with what I would expect to see in a hypothetical future PrivCount submission, so here we are with a chance to set the precedent well. :)

--- Anonymous Reviewer 2 ---

Motivation

- Why would short-lived hidden services denote illegitimate domains? OnionShare and Ricochet are legitimate applications that likely have short-lived hidden services.
- How would an unusual lifetime identify a hidden service?

Data Collection

- The protocol isn't actively secure. For example, consider a malicious HSDir or client that "marks" each hash-table entry by adding in some value that is a unique multiple of a base value larger than the largest expected count. Other well-known active attacks can be used as well.
- Malicious inputs can arbitrarily increase the counts.
- How many parties are controlling the HSDirs? Three?
- Are the HSDirs running as normal? Will they run only for the lifetime of the study or are they more stable? How many HSDirs will be controlled by any one entity?
- Can the output be made noisy? The data has the flavor of "anonymized" data, which can frequently be deanonymized by an adversary with auxiliary information.
- For how long will measurement occur before aggregation?
- Who is in control of the measurement study? Can that entity set the measurement interval arbitrarily short (thus eliminating any aggregation over time) or otherwise change the measurement parameters to defeat privacy protections (e.g.
  by modifying the key/identities of the participants)?
- Will the protocol implementation be made publicly available? Will it receive any scrutiny outside of the implementor(s)?

Overall, the risk seems minimal against the most likely threats (passive observation, post-hoc compulsion). Reasonable steps are taken to secure individual and intermediate data, and the output should be aggregated to a fairly high degree. However, I do worry that this is a bit of security theater, as it doesn't seem unlikely that the measurement will suffer from easily exploitable weaknesses that eliminate its purported security properties, such as

1. Control of crucial measurement parameters by a single entity
2. Active attacks that can be easily run by any single party, *including malicious clients*
3. Common implementation oversights/shortcuts (e.g. not using/verifying long-term public keys, use of an insecure broadcast protocol, using a language such as Python that doesn't support secure deletion of keys)

I do also worry about the validity of claims that can be made from this measurement study. How big is the hash table? If there are lots of collisions, then the apparent lifetimes will actually be the sum of lifetimes of many colliding services. You should be able to bound the chance that this case occurs or detect when it does. Also, it seems as if the protocol couldn't tell the difference between an onion service that frequently publishes its descriptor (e.g. due to frequently-changing Introduction Points) and one that is around for a long time. Those are very different cases.

--- Anonymous Reviewer 3 ---

Recommendations: Correctly marking relays as family, adding contact info, a public page describing the study and research protocol and linking it in the contact info for the relays.

Question of sniffing onions for discovery versus using other discovery methods. This is a question of how much is gained by measuring "private onion sites" versus only measuring "public onion sites"?
Limiting to public onions without sniffing can be done as in prior work: http://s3.eurecom.fr/docs/www17_darktracing.pdf

--- My meta-review putting the above together ---

I think the discussion comes down to three points for analysis:

(A) Is your plan more dangerous than you think? That is, did we find new risks in the proposed protocol / methodology?

Reviewer 2 identified some issues where a malicious component of your system, e.g. one of the relays, or any client, could influence the resulting data. They also suggested adding noise into the aggregated output. These sound like good points, either for modifying the protocol before you do the experiment, or at least for acknowledging in the paper. Having good answers to Reviewer 2's methodology clarifying questions seems smart, especially for item (C) below.

Overall, the consensus is that it's pretty low risk: the safety board people are ok with the research, especially once you've thought through the analysis from Reviewer 2.

(B) Are you on track to being able to answer your research questions, if you do the proposed experiments?

This one is trickier. I think there are real concerns about whether you would be able to answer your research questions as currently posed -- short-lived onion services could be OnionShare users, Ricochet users, or something else. It's a poor assumption that they're all websites, and it gets especially poor when you're grabbing them at the HSDirs because nobody knows even what fraction of onion services are websites or Ricochet or whatever. I think you should rethink whether you'll be able to answer your research questions this way, because I suspect you won't.

That said, ultimately this is a safety board, so technically our perspectives on this part are out of scope and you don't need to care about them. :)

(C) What are our recommendations for how to best deploy these relays in the real Tor network while keeping the network operators happy?
I think Reviewer 3's recommendations here are a great start: set your MyFamily lines correctly -- one family for all three research groups -- and set each ContactInfo accurately too, and include a URL in the ContactInfo to a page that describes who you are, what you're doing, why it's useful, and why your methodology is as safe as you can make it.

The reason it's not workable to convince only the directory authority operators in private is that there's a community of people on the tor-relays list who are hunting for Sybils and other anomalies, and there's a good chance they will find your relay family after a while, and I expect the directory authority operators won't want to be in the position then of saying "yes, we know about this, but don't worry, you don't need to know."

All of this said, assuming you want to proceed, I will volunteer to be the mediator to explain to the other directory authority operators why your plan seems to be a safe enough plan. I can't speak for all of them or predict what they'll want to learn, but I'm optimistic we'd be able to find some way forward.

--Roger