Technology preview: Private contact discovery for Signal

moxie0 on 26 Sep 2017

At Signal, we’ve been thinking about the difficulty of private contact discovery for a long time. We’ve been working on strategies to improve our current design, and today we’ve published a new private contact discovery service.

Using this service, Signal clients will be able to efficiently and scalably determine whether the contacts in their address book are Signal users without revealing the contacts in their address book to the Signal service.

Background

Signal is social software. We don’t believe that privacy is about austerity, or that a culture of sharing and communication should mean that privacy is a thing of the past. We want to enable online social interactions that are rich and expressive in all the ways that people desire, while simultaneously ensuring that communication is only visible to its intended recipients.

This squares technology with intent: when someone shares a photo with their friends, their intent is to share it with their friends. Not the service operator, ad networks, hackers, or governments.

Social software needs a social graph

The building block of almost all social software is the social graph. For people to be able to use software like Signal, they have to know how to contact their friends and colleagues using Signal.

This raises two related concerns:

  1. Building a social graph isn’t easy. Social networks have value proportional to their size, so participants aren’t motivated to join new social networks which aren’t already large. Very few people want to install a communication app, open the compose screen for the first time, and be met by an empty list of who they can communicate with.
  2. We don’t want the Signal service to have visibility into the social graph of Signal users. Signal is always aspiring to be as “zero knowledge” as possible, and having a durable record of every user’s friends and contacts on our servers would obviously not be privacy-preserving.

Almost all social software has to deal with the first problem. Most solve it by leveraging an existing social graph, such as Facebook (“Sign in with Facebook”), but that means the social graph is “owned” by Facebook.

Instead, Signal began by using the social graph that already lives on everyone’s phones: the address book. Rather than a centralized social graph owned by someone else, the address book is distributed and user-owned. Additionally, having the social graph already on the device means that the Signal service doesn’t need to store a copy of it. Any time someone installs or reinstalls Signal, their social graph is already available locally.

Contact discovery

In order to turn the address book on a device into a social graph for Signal, the Signal client needs to be able to determine which of the device’s contacts are Signal users. In other words, the client needs to ask the service “Here are all my contacts, which of them are Signal users?”

Since we don’t want the Signal service to know the contents of a client’s address book, this question needs to be asked in a privacy-preserving way.

Traditionally, in Signal that process has looked like:

  1. The client calculates the truncated SHA256 hash of each phone number in the device’s address book.
  2. The client transmits those truncated hashes to the service.
  3. The service does a lookup from a set of hashed registered users.
  4. The service returns the intersection of registered users.
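That legacy flow can be sketched as runnable code. The 10-byte truncation length, the hex encoding, and the in-memory registry are illustrative assumptions here, not Signal's actual wire format:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class TruncatedHashDiscovery {
  // Client side: hash a phone number and keep only the first 10 bytes
  // (the truncation length is an illustrative assumption).
  static String truncatedHash(String phoneNumber) {
    try {
      MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
      byte[] digest = sha256.digest(phoneNumber.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (int i = 0; i < 10; i++) hex.append(String.format("%02x", digest[i]));
      return hex.toString();
    } catch (java.security.NoSuchAlgorithmException e) {
      throw new AssertionError(e);  // SHA-256 is always available
    }
  }

  // Server side: intersect the submitted hashes with the registry of
  // hashed registered users.
  static List<String> findRegistered(Set<String> registeredHashes,
                                     List<String> clientHashes) {
    List<String> results = new ArrayList<>();
    for (String hash : clientHashes) {
      if (registeredHashes.contains(hash)) results.add(hash);
    }
    return results;
  }

  public static void main(String[] args) {
    Set<String> registry = new HashSet<>();
    registry.add(truncatedHash("+14155550111"));
    registry.add(truncatedHash("+14155550122"));

    List<String> request = Arrays.asList(
        truncatedHash("+14155550111"),   // registered
        truncatedHash("+14155550199"));  // not registered

    System.out.println(findRegistered(registry, request).size()); // prints 1
  }
}
```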

The obvious problem with this method is that the hash of a user identifier can almost always be inverted. Regardless of whether the identifier is a phone number, a user name, or an email address, the “keyspace” of all possible identifiers is too small.
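The keyspace problem is easy to demonstrate: inverting a truncated hash is just a scan over candidate identifiers. The range below is deliberately tiny so the demo runs instantly, but the same loop over all ~10^10 phone numbers is entirely feasible for an attacker:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class HashInversion {
  static byte[] truncatedHash(long phoneNumber) {
    try {
      MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
      byte[] digest = sha256.digest(
          Long.toString(phoneNumber).getBytes(StandardCharsets.UTF_8));
      return Arrays.copyOf(digest, 10);  // truncation length is illustrative
    } catch (java.security.NoSuchAlgorithmException e) {
      throw new AssertionError(e);
    }
  }

  // Recover a phone number from its leaked truncated hash by scanning
  // the candidate keyspace and re-hashing each candidate.
  static long invert(byte[] leakedHash, long rangeStart, long rangeEnd) {
    for (long candidate = rangeStart; candidate < rangeEnd; candidate++) {
      if (Arrays.equals(truncatedHash(candidate), leakedHash)) return candidate;
    }
    return -1;  // not found in this range
  }

  public static void main(String[] args) {
    long secret = 14155550123L;
    byte[] leaked = truncatedHash(secret);
    System.out.println(invert(leaked, 14155550000L, 14155560000L)); // prints 14155550123
  }
}
```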

This method of contact discovery isn’t ideal because of these shortcomings, but at the very least the Signal service’s design does not depend on knowledge of a user’s social graph in order to function. This has meant that if you trust the Signal service to be running the published server source code, then the Signal service retains no durable knowledge of a user’s social graph, even if it is hacked or subpoenaed.

Trust but verify

Of course, what if that’s not the source code that’s actually running? After all, we could surreptitiously modify the service to log users’ contact discovery requests. Even if we have no motive to do that, someone who hacks the Signal service could potentially modify the code so that it logs user contact discovery requests, or (although unlikely given present law) some government agency could show up and require us to change the service so that it logs contact discovery requests. More fundamentally for us, we simply don’t want people to have to trust us. That’s not what privacy is about.

Doing better is difficult. There is a range of options that don’t work, like bloom filters, encrypted bloom filters, sharded bloom filters, private information retrieval, or private set intersection. What if, instead, there were just a way for clients to verify that the code running on our servers was the code they wanted to be running on our servers, and not something that we or a third party had modified?

Modern Intel chips support a feature called Software Guard Extensions (SGX). SGX allows applications to provision a “secure enclave” that is isolated from the host operating system and kernel, similar to technologies like ARM’s TrustZone. SGX enclaves also support a feature called remote attestation. Remote attestation provides a cryptographic guarantee of the code that is running in a remote enclave over a network.

SGX was originally designed with DRM applications in mind, so most SGX examples imagine an enclave running on a client. This would allow a server to stream media content to a client enclave with the assurance that the client software requesting the media is the “authentic” software that will play the media only once, rather than custom software that has reverse engineered the network API call and will publish the media as a torrent.

However, we can invert the traditional SGX relationship to run a secure enclave on the server. An SGX enclave on the server-side would enable a service to perform computations on encrypted client data without learning the content of the data or the result of the computation.

SGX contact discovery

Private contact discovery using SGX is fairly simple at a high level:

  1. Run a contact discovery service in a secure SGX enclave.
  2. Clients that wish to perform contact discovery negotiate a secure connection over the network all the way through the remote OS to the enclave.
  3. Clients perform remote attestation to ensure that the code which is running in the enclave is the same as the expected published open source code.
  4. Clients transmit the encrypted identifiers from their address book to the enclave.
  5. The enclave looks up a client’s contacts in the set of all registered users and encrypts the results back to the client.

Since the enclave attests to the software that’s running remotely, and since the remote server and OS have no visibility into the enclave, the service learns nothing about the contents of the client request. It’s almost as if the client is executing the query locally on the client device.

Unfortunately, doing private computation in an SGX enclave is more difficult than it may initially seem.

Building a secure SGX service

An SGX enclave runs in hardware-encrypted RAM, which prevents the host OS from being able to see an enclave’s memory contents (such as the contact information a client transmits to the enclave). However, the host OS can still see memory access patterns, even if the OS can’t see the contents of the memory being accessed.

That presents a number of potentially fatal problems. For example, consider an enclave function that is written as follows:

private List<Long> findRegisteredUsers(Set<Long> registeredUsers,
                                       List<Long> clientContacts)
{
  List<Long> results = new LinkedList<>();

  for (long clientContact : clientContacts) {
    if (registeredUsers.contains(clientContact)) {
      results.add(clientContact);
    }
  }

  return results;
}

The above code iterates over the list of contacts a client submits and checks a local hash table of all registered users in order to determine whether each of the client’s contacts is a registered user or not.

Even with encrypted RAM, the server OS can learn enough from observing memory access patterns in this function to determine the plaintext values of the contacts the client transmitted!

Consider that the OS probably knows the layout of the hash table containing all registered users. This is because it can watch it being built (the enclave has to get the list of all registered users from the OS or some other untrusted source), and those values need to be able to change in real time (also provided by an untrusted source) as new users sign up. Even if the OS didn’t know the layout of the hash table, it would still be able to map it out by submitting its own requests to the enclave through the normal enclave interface and observing where the enclave reads into the hash table for those known values.

With that knowledge, the OS only has to watch which hash table memory addresses the enclave reads from when processing a client request.
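A small simulation shows how much an observed bucket index alone reveals. The table size and the modulo bucket function are illustrative, but the point holds for any bucket function the observer can replay:

```java
import java.util.*;

public class AccessPatternLeak {
  static final int NUM_BUCKETS = 8;  // illustrative table size

  // The bucket function is effectively public: the OS watched the table
  // being built from the (unencrypted) registered-user list.
  static int bucketOf(long user) {
    return (int) (user % NUM_BUCKETS);
  }

  // The observer replays the bucket function over the known registered
  // users to recover the candidate set for an observed bucket index.
  static List<Long> candidatesFor(List<Long> registeredUsers, int observedBucket) {
    List<Long> candidates = new ArrayList<>();
    for (long user : registeredUsers) {
      if (bucketOf(user) == observedBucket) candidates.add(user);
    }
    return candidates;
  }

  public static void main(String[] args) {
    List<Long> registeredUsers = new ArrayList<>();
    for (long u = 14155550000L; u < 14155550100L; u++) registeredUsers.add(u);

    // The enclave looks up one client contact. The OS can't read the
    // value, but it sees which bucket's memory was touched.
    long clientContact = 14155550042L;
    int observedBucket = bucketOf(clientContact);

    List<Long> candidates = candidatesFor(registeredUsers, observedBucket);
    System.out.println(candidates.contains(clientContact)); // prints true
    System.out.println(candidates.size()); // prints 13 (narrowed down from 100)
  }
}
```

With a larger table (more buckets than users), each bucket typically holds a single registered user, and the observed access identifies the client's contact exactly.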

It gets worse! The amount of encrypted RAM available to an enclave is limited to 128MB, which isn’t enough to store a large data set of registered users. SGX makes it possible to store the set of all registered users outside the enclave and access that memory from within the enclave, but then the OS really knows what the layout of the registered users hash table is, since they’re not even in encrypted RAM.

Oblivious RAM

Given the above example, it’s clear that standard solutions deployed inside an SGX enclave won’t be enough to provide meaningful protection for what we’re trying to build. This class of problems has been studied under the discipline of Oblivious RAM (ORAM). The objective of ORAM is to define a CPU<->RAM interface such that the RAM continues to service memory requests for the CPU while learning nothing about the CPU’s memory access pattern.

There are some elegant generalized ORAM techniques, like Path ORAM, but unfortunately they don’t work well for this problem. Most perform best when applications have a relatively small number of keys that map to large values, whereas we have an extremely large number of keys and zero-sized values. The more complicated attempts to provide generalized ORAM for data sets like ours, such as Recursive Path ORAM, don’t scale very well and are difficult to build for concurrent access.

If we think about solutions specific to our problem, one possibility is simply to make things inefficient:

private List<Long> findRegisteredUsers(List<Long> registeredUsers,
                                       List<Long> clientContacts)
{
  List<Long> results = new LinkedList<>();

  for (long clientContact : clientContacts) {
    for (long registeredUser : registeredUsers) {
      if (registeredUser == clientContact) {
        results.add(clientContact);
      }
    }
  }

  return results;
}

This function still has problems, but it solves our initial concern by linearizing everything. Instead of a lookup into a hash table of all registered users, this code does a full linear scan across the entire data set of all registered users for every client contact submitted. By touching the entire memory space consistently for each client contact, the OS can’t learn anything from the access patterns. However, with a billion registered users, this would obviously be way too slow.

It’s possible to speed up by inverting the original approach instead:

private List<Long> findRegisteredUsers(List<Long> registeredUsers,
                                       Set<Long> clientContacts)
{
  List<Long> results = new LinkedList<>();

  for (long registeredUser : registeredUsers) {
    if (clientContacts.contains(registeredUser)) {
      results.add(registeredUser);
    }
  }

  return results;
}

This is much faster. The above code still iterates across the entire set of registered users, but it only does so once for the entire collection of submitted client contacts. By keeping one big linear scan over the registered user data set, access to unencrypted RAM remains “oblivious,” since the OS will simply see the enclave touch every item once for each contact discovery request.

The full linear scan is fairly high latency, but by batching many pending client requests together, it can be high throughput.

However, there are still problems. If the OS knew the content or the layout of the clientContacts hash table, this wouldn’t be oblivious.

Oblivious hash table construction

The “normal” way to construct a hash table of client contacts would be:

private Set<Long> constructClientContactsTable(List<Long> clientContacts) {
  Set<Long> results = new HashSet<>();

  for (long clientContact : clientContacts) {
    results.add(clientContact);
  }

  return results;
}

The above code iterates over the list of client contacts that we receive in an encrypted request and puts them in a hash table.

Obviously, this won’t work, since it reveals to the OS which hash buckets contain elements (the hash table memory addresses that received writes) and which are empty (the hash table memory addresses that didn’t receive writes). As the findRegisteredUsers function iterates over the list of all registered users, the OS knows which registered user the enclave is checking (it’s in unencrypted memory), and can then observe whether the read into the clientContacts hash table references a memory address that it saw a write to during the constructClientContactsTable process.

However, if the clientContacts hash table could be constructed obliviously instead, we wouldn’t have to worry about non-oblivious accesses to it.

There are two primary ways that observers from outside the enclave can monitor memory access patterns: page faults and memory bus snooping. In the former, the OS causes every memory access to page fault, which reveals (only) the address of the page the memory address is in. That results in a 4KB level of memory access granularity the OS is able to observe. However, if someone were to physically attach an FPGA to the memory bus, they could observe memory access at the granularity of a cache line.

To make hash table construction oblivious, we start with a bucketed hash table that looks like this:

 ________   ________         ________
|Bucket 0| |Bucket 1|  ...  |Bucket N|
 --------   --------         --------
|B01||B01| |B01||B01|       |B01||B01|
 ---  ---   ---  ---         ---  ---
|B02||B02| |B02||B02|       |B02||B02|
 ---  ---   ---  ---         ---  ---
|...||...| |...||...|       |...||...|
 ---  ---   ---  ---         ---  ---
|B64||B64| |B64||B64|       |B64||B64|
 ---  ---   ---  ---         ---  ---

Each logical “bucket” (Bucket 0, Bucket 1, …, Bucket N) of the hash table is composed of two cache lines (B01, B02, …, B64), one for storing client queries and one for storing lookup results. In order to populate the hash table with all of the client contacts in the request batch, the enclave iterates over each cache line in the hash table:

for (i = 0; i < requestCacheLines.length; i++) {
  for (clientContact : clientContacts) {
    if (getHashIndex(clientContact) == i) {
      requestCacheLines[i][nextAvailable] = clientContact;
    } else {
      requestCacheLines[i][dummySlot] = clientContact;
    }
  }
}

Rather than iterating over the client contacts and adding them to the appropriate place in the hash table, the above code iterates over every cache line, and then checks to see which client contacts should be stored there. If there are N total client contacts in the full request batch, each cache line in the hash table receives N writes during this process.

This is a very inefficient way to construct a hash table, but after the hash table is constructed, the OS is left with no knowledge of what the hash table contains or how it is laid out. This means that it can be used just like a normal hash table in the critical path of the contact discovery lookup without revealing anything critical to observers.
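The construction loop can be modeled as runnable code. The bucket count, bucket capacity, modulo hash, and dummy slot below are illustrative assumptions, and for clarity this sketch uses an ordinary `if`, whereas a real enclave would make even the slot selection branchless:

```java
import java.util.*;

public class ObliviousTable {
  static final int NUM_BUCKETS = 16;     // illustrative
  static final int BUCKET_CAPACITY = 12;

  static int hashIndex(long contact) {
    return (int) (contact % NUM_BUCKETS);
  }

  // Populate the table by iterating over every bucket and, for each
  // bucket, over every contact: each bucket receives exactly
  // contacts.size() writes, so an observer counting writes per bucket
  // learns nothing about where contacts actually landed.
  static long[][] construct(List<Long> contacts) {
    // One extra "dummy" slot per bucket absorbs the writes for
    // contacts that do not belong in that bucket.
    long[][] buckets = new long[NUM_BUCKETS][BUCKET_CAPACITY + 1];
    int[] nextFree = new int[NUM_BUCKETS];

    for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
      for (long contact : contacts) {
        if (hashIndex(contact) == bucket) {
          buckets[bucket][nextFree[bucket]++] = contact;
        } else {
          buckets[bucket][BUCKET_CAPACITY] = contact;  // dummy write
        }
      }
    }
    return buckets;
  }

  // Lookup afterwards works like a normal bucketed hash table; the
  // dummy slot is never consulted.
  static boolean contains(long[][] buckets, long contact) {
    for (int slot = 0; slot < BUCKET_CAPACITY; slot++) {
      if (buckets[hashIndex(contact)][slot] == contact) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    List<Long> contacts = Arrays.asList(14155550111L, 14155550122L, 14155550133L);
    long[][] table = construct(contacts);
    System.out.println(contains(table, 14155550122L)); // prints true
    System.out.println(contains(table, 14155550999L)); // prints false
  }
}
```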

Basic recipe

There are many other concerns when building a secure enclave. Since branches can potentially be observed through memory access patterns, critical sections should not contain branches. Memory access patterns should always look the same for all outcomes, with care and attention similar to “constant time” programming considerations in cryptographic software.
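A branchless comparison of the sort described above can be sketched as follows. This is a simplified illustration of the technique, not the enclave's actual code, and real constant-time guarantees also depend on what the compiler and hardware do with it:

```java
public class ConstantTimeCompare {
  // Branchless 64-bit equality: returns all-ones (-1L) when a == b,
  // 0 otherwise, with no data-dependent branch.
  static long eqMask(long a, long b) {
    long diff = a ^ b;                    // zero iff equal
    // For any nonzero diff, (diff | -diff) has its sign bit set, so the
    // unsigned shift yields 1 iff a != b.
    long nonZero = (diff | -diff) >>> 63;
    return nonZero - 1;                   // 0 -> -1 (equal), 1 -> 0
  }

  // Branchless select: returns x when mask is all-ones, y when mask is 0.
  static long select(long mask, long x, long y) {
    return (x & mask) | (y & ~mask);
  }

  public static void main(String[] args) {
    System.out.println(eqMask(42, 42));                  // prints -1
    System.out.println(eqMask(42, 43));                  // prints 0
    System.out.println(select(eqMask(7, 7), 100, 200));  // prints 100
    System.out.println(select(eqMask(7, 8), 100, 200));  // prints 200
  }
}
```

Composing `eqMask` and `select` lets the enclave compare a registered user against every contact in a cache line and write a result in all cases, so the memory trace is identical whether or not anything matched.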

Using the oblivious hash table construction method above, we can put together a general recipe for doing private contact discovery in SGX without leaking any information to parties that have control over the machine, even if they were to attach physical hardware to the memory bus:

  1. The contact discovery service passes a batch of encrypted contact discovery requests to the enclave.
  2. The enclave decrypts the requests and uses the oblivious hash table construction process to build a hash table containing all submitted contacts.
  3. The enclave iterates over the list of all registered users. For each registered user, the enclave indexes into the hash table containing the batch of client contacts and does a constant-time comparison against every contact in that cache line.
  4. The enclave writes to the “results” cache line for that same hash index, regardless of whether there was a successful compare or not.
  5. After iterating through the entire set of registered users, the enclave builds an oblivious response list for each client request in the batch.
  6. The enclave encrypts the appropriate response list to each requesting client, and returns the batch results to the service.
  7. The service transmits the encrypted response back to each requesting client.
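The recipe above can be modeled end to end as a plain simulation. The enclave boundary, attestation, and encryption are omitted, the oblivious table construction from the previous section is elided for brevity, and all sizes and the modulo hash are illustrative:

```java
import java.util.*;

public class BatchDiscoverySim {
  static final int NUM_BUCKETS = 8;       // illustrative sizes
  static final int SLOTS_PER_BUCKET = 4;

  static int hashIndex(long n) { return (int) (n % NUM_BUCKETS); }

  static List<Long> discover(List<Long> registeredUsers, List<Long> clientContacts) {
    // Step 2 stand-in: place the decrypted batch into a bucketed table
    // (non-obliviously here, for brevity).
    long[][] requests = new long[NUM_BUCKETS][SLOTS_PER_BUCKET];
    long[][] hits = new long[NUM_BUCKETS][SLOTS_PER_BUCKET];
    int[] fill = new int[NUM_BUCKETS];
    for (long c : clientContacts) {
      int b = hashIndex(c);
      requests[b][fill[b]++] = c;
    }

    // Steps 3-4: one linear scan over all registered users; for each,
    // compare against every slot of its bucket, and write the result
    // row whether or not anything matched.
    for (long user : registeredUsers) {
      int b = hashIndex(user);
      for (int slot = 0; slot < SLOTS_PER_BUCKET; slot++) {
        boolean hit = (requests[b][slot] == user) && user != 0;
        hits[b][slot] |= hit ? user : 0;  // a write occurs either way
      }
    }

    // Step 5: assemble the response list for the batch.
    List<Long> response = new ArrayList<>();
    for (long c : clientContacts) {
      int b = hashIndex(c);
      for (int slot = 0; slot < SLOTS_PER_BUCKET; slot++) {
        if (requests[b][slot] == c && hits[b][slot] == c) response.add(c);
      }
    }
    return response;
  }

  public static void main(String[] args) {
    List<Long> registered = Arrays.asList(14155550111L, 14155550122L, 14155550133L);
    List<Long> contacts   = Arrays.asList(14155550122L, 14155550999L);
    System.out.println(discover(registered, contacts)); // prints [14155550122]
  }
}
```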

By pushing the inefficiencies into the setup and teardown sections of batch processing, the critical section remains fast enough for the entire process to scale to over a billion users.

Oblivious limitations

The OS is still capable of learning minor things from careful observation. The hash table containing a batch of client contacts is composed of buckets that each contain space for 12 contacts. Using the oblivious hash table construction process, the OS does not know how many elements (if any) are in a bucket, or what those elements may be. However, the OS does know that there can’t be more than 12 contacts in a hash bucket.

As the enclave iterates across the set of all registered users while processing a batch, the OS might see the enclave index into the cache lines for a single bucket more than 12 times. If there were 13 users that correspond to the same hash index, the OS knows that it’s not possible for all 13 of those users to have been in the client request, since they wouldn’t have fit into the hash bucket. However, the OS doesn’t learn whether any subset of them were or were not in the client request, which is the oblivious property that we desire.

View source

We’ve put all of this together into a full contact discovery service, scalable to billions of users, that allows Signal users to discover their social graph without revealing their contacts to the Signal service. Like everything we build, the contact discovery service is open source.

The enclave code builds reproducibly, so anyone can verify that the published source code corresponds to the MRENCLAVE value of the remote enclave.

Check it out and let us know what you think! This is a beta technology preview, but we’ll be deploying the service into production and integrating it into clients as we finish testing over the next few months.

Acknowledgments

Huge thanks to Jeff Griffin for doing the heavy lifting and writing all the enclave code, Henry Corrigan-Gibbs for introducing us to SGX, Raluca Ada Popa for explaining ORAM state of the art to us, and Nolan Leake for systems insight.