
Analyzing Sensitive Data


Published by
Pablo García
Director of Research and Development (R&D)

Although we create a secure environment when analyzing our clients' data — both from a technological standpoint (data encrypted at rest and in transit, hardened devices, perimeter security, and DLP tools, among others) and a legal one (NDAs with the company and involved technicians) — there are additional science- and technology-based measures that provide extra layers of security, and those are the ones we propose to review in this article.

The advice is to always remember the city of Gondor and its "rings of defense": if one barrier is breached, there must be another containment barrier ready to stop the attack or leak.

The great promise for protecting this type of data is “homomorphic encryption” (encryption that allows performing operations on encrypted data and then decrypting the result), but it is still far from practical beyond a few specific cases with algorithms that are only partially homomorphic. Therefore, we neither use it nor waste time trying to.

The first practical case I want to discuss with you is differential privacy. Suppose we are analyzing very sensitive data, like a non-anonymous survey, or data that can be deanonymized with a correlation database.

A small correlation database, something as simple as an old bank statement or phone records (in my case, I worked for a long time with tokenized phone data), is a good candidate for deanonymizing data. Creating a recognizable pattern in the data is enough: for example, calling a number every 2 hours and hanging up within the first 5 seconds. Doing this four times creates a pattern that almost certainly allows individualizing that number in a tokenized (anonymized) database and, from there, triggering a network effect (who the person calls the most, and so on).
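This pattern attack can be sketched in a few lines of Python. Everything here is synthetic and hypothetical (the token names, the record layout): the point is simply that four short, evenly spaced calls are enough to single out one token among thousands of records.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta

# Synthetic tokenized call records: (token, timestamp, duration in seconds).
random.seed(1)
start = datetime(2024, 1, 1, 9, 0)
records = [("tok_%03d" % random.randrange(50),
            start + timedelta(minutes=random.randrange(8 * 60)),
            random.randrange(10, 300))        # ordinary calls last 10s or more
           for _ in range(2000)]

# The attacker plants the pattern: four calls to the target's number,
# exactly 2 hours apart, each hung up within the first 5 seconds.
records += [("tok_victim", start + timedelta(hours=2 * i), 4) for i in range(4)]

# Re-identification: look for tokens with 4+ very short calls ~2 hours apart.
short_calls = defaultdict(list)
for token, ts, duration in records:
    if duration <= 5:
        short_calls[token].append(ts)

suspects = []
for token, times in short_calls.items():
    times.sort()
    gaps = [b - a for a, b in zip(times, times[1:])]
    if len(times) >= 4 and all(abs(g - timedelta(hours=2)) < timedelta(minutes=5)
                               for g in gaps):
        suspects.append(token)

print(suspects)  # ['tok_victim']
```

Once the token is individualized, the network effect follows: the attacker can read off that token's most frequent contacts from the same anonymized database.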

So, just tokenizing and adding some entropy to the data is not enough, but fortunately, we can apply differential privacy.
Let’s suppose a very simple case: survey analysis. Suppose the worst case — the surveys, although made anonymous, have been deanonymized. How do we protect ourselves?

We will protect the question “Do I feel comfortable with my boss?”, which has yes or no answers. So, we take the survey data and leave half of it as it is. For the other half, for each answer, we flip a coin: if it’s heads, we leave the data as is; if it’s tails, we invert the value.

This way, someone studying the survey knows that 75% of the data is true and 25% is inverted. The statistical result remains significant, but if someone takes a survey, deanonymizes it, and says “This answer is Pablo’s!!” they won’t know my real answer, since the data is true with a 75% probability and false with a 25% probability. In this way, we have protected the privacy of the person who completed the survey.
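This coin-flipping scheme is the classic randomized-response building block of differential privacy. A minimal sketch (with simulated survey data) shows both halves of the trick: each answer is reported truthfully with probability 3/4, and since the analyst knows that ratio, the true aggregate rate can be recovered from the noisy one.

```python
import random

def randomized_response(true_answer):
    """Leave 'half' the surveys as-is; for the other half, flip a coin:
    heads keeps the answer, tails inverts it. Net: truthful 75% of the time."""
    if random.random() < 0.5:      # the half left untouched
        return true_answer
    if random.random() < 0.5:      # coin flip: heads, keep
        return true_answer
    return not true_answer         # coin flip: tails, invert

def estimate_true_rate(reported_yes_rate):
    # observed = 0.75 * p + 0.25 * (1 - p)  =>  p = (observed - 0.25) / 0.5
    return (reported_yes_rate - 0.25) / 0.5

# Simulated survey: 100,000 respondents, true "yes" rate of 30%.
random.seed(42)
true_answers = [random.random() < 0.30 for _ in range(100_000)]
reported = [randomized_response(a) for a in true_answers]
observed = sum(reported) / len(reported)
print(round(estimate_true_rate(observed), 2))  # close to 0.30
```

The statistical result survives the noise, but any single deanonymized answer remains deniable: "Pablo's" recorded answer is only true with 75% probability.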

Now, let’s analyze another technique, and for that, let’s take things to the extreme: suppose two rival companies, say, two local banks, competing but facing a shared challenge: stopping credit card fraud. Obviously, someone with data from both banks could train a better fraud detection algorithm, and the customers of both banks would be happier, freer, and would trust their banks more (a clear indicator that what we want to do is ethical, since it increases customers’ trust in banks along with their well-being and freedom).

So, how can we train a fraud detection algorithm between Bank A and Bank B when they don’t trust each other even a little?

We previously talked about encryption at rest (encrypted disk) and in transit (encrypted transport with TLS, Transport Layer Security), but there is a third type: encryption of data in use, in the computer’s memory and graphics card. This is known as encrypted processing or confidential computing, and it is built on what is technically called a “secure enclave.”

The process works like this:

  • An algorithm is created, for example, training a machine learning model that reads files and produces a trained model.
  • That algorithm (the entire working environment) is digitally signed.
  • Specialists from both banks review and approve the environment and are assured it cannot be altered because any alteration invalidates the digital signature.
  • Each bank encrypts its data.
  • The system starts running and requests the key to access the encrypted data from each bank. They validate the digital signature of the algorithm and verify that it is running in a secure enclave (clouds and secure enclave frameworks have a service known as an “attestation service” that confirms it is running in an encrypted environment). If validated, they provide the key to read their part of the data.
  • It doesn’t matter if someone has access to the machine’s hypervisor, is the OS admin, or has unrestricted access to the server console; even if they do a memory dump, they will never be able to see what is running inside the secure enclave because all memory and processor access are encrypted.
  • The algorithm runs, trains, delivers a result, and the entire environment disappears. Thus, when it finishes running, each bank has a better fraud model but has never seen—and will never be able to see—the competitor’s data.
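The key-release step above can be illustrated with a toy sketch. Real attestation services (Intel SGX, AMD SEV, or the cloud providers’ frameworks) use hardware-rooted signatures; here a shared HMAC key stands in for that root of trust, and all names are hypothetical. The point is the logic: each bank checks the signed measurement against the environment its specialists approved, and releases its data key only on an exact match.

```python
import hashlib
import hmac
import secrets

# Stand-in for the attestation service's hardware root of trust (hypothetical).
ATTESTATION_KEY = secrets.token_bytes(32)

def measure(code):
    """The 'measurement': a hash of the exact environment that will run."""
    return hashlib.sha256(code).hexdigest()

def attest(code):
    """Attestation service: returns the measurement plus a signature over it."""
    m = measure(code)
    return m, hmac.new(ATTESTATION_KEY, m.encode(), hashlib.sha256).digest()

def release_key(expected_measurement, report, data_key):
    """A bank releases its data key only if the signature is valid and the
    measurement matches the environment its specialists reviewed and approved."""
    m, sig = report
    expected_sig = hmac.new(ATTESTATION_KEY, m.encode(), hashlib.sha256).digest()
    if hmac.compare_digest(sig, expected_sig) and m == expected_measurement:
        return data_key
    return None

approved_env = b"train_fraud_model_v1"   # the reviewed, digitally signed environment
expected = measure(approved_env)

print(release_key(expected, attest(approved_env), b"bank-A-key"))   # b'bank-A-key'
print(release_key(expected, attest(b"tampered_env"), b"bank-B-key"))  # None
```

Any alteration to the environment changes its measurement, so the tampered version never receives a key, exactly as in the signed-environment step above.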

These models are already widely used in medicine to send data between countries for diagnostics, guaranteeing not only that the data is never observed but also that it is never used for any other purpose: once the diagnostic execution ends, the environment and the data are no longer usable.

Finally, I invite you to rethink your data analytics processes, because beyond common tools, science (as in the case of differential privacy) or technology (as in confidential computing) removes barriers and makes it possible to use data with adequate levels of protection and trust, even if we don’t trust the party on the other side one bit.