Scientists at MIT have created a new way to protect the personal data used to build machine learning models while still keeping those models accurate.
Imagine that a group of scientists has created a machine learning model (think of it as a kind of computer brain) that can tell whether a patient has cancer from images of their lungs.
They want to share this model with hospitals everywhere so doctors can start using it for diagnosis.
But here’s the challenge. To teach their model how to predict cancer, they showed it millions of real lung scan images, a process known as ‘training’ the model.
These images contain sensitive patient data, and a hacker could potentially extract some of that data from the shared model.
Scientists can make it harder for a hacker to guess the original data by adding ‘noise’, kind of like adding a layer of static to a TV channel. However, too much noise can mess up the model’s accuracy, so they want to add as little as possible.
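To make the ‘static’ analogy a bit more concrete, here is a minimal Python sketch of the general noise-adding idea (not the MIT method itself): a made-up set of patient values, a released average, and Gaussian noise whose scale controls the trade-off between privacy and accuracy. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive data: one measurement per patient (made up for illustration).
patient_values = rng.normal(loc=120.0, scale=15.0, size=1_000)

def release_with_noise(values, noise_scale):
    """Release the average of the data with Gaussian 'static' added.

    More noise makes the original data harder to infer,
    but also makes the released number less accurate.
    """
    true_average = values.mean()
    return true_average + rng.normal(scale=noise_scale)

print("little noise :", release_with_noise(patient_values, noise_scale=0.1))
print("lots of noise:", release_with_noise(patient_values, noise_scale=10.0))
```

The more noise you add, the less the released number reveals about any one patient, and the less useful it becomes.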
The MIT researchers have come up with a technique that allows them to add just the right amount of noise to ensure the sensitive data stays safe but the model still works well.
They came up with a new privacy metric called Probably Approximately Correct (PAC) Privacy, which helps them determine the smallest amount of noise needed.
The best part? This system doesn’t need to know how the model works or how it was trained, so it’s easy to use with different models and applications.
The MIT team found that the amount of noise PAC Privacy needs to keep data safe is much less than what other methods require.
This could help people build machine learning models that can keep the data they’re trained on hidden, while still being accurate.
“PAC Privacy uses the uncertainty or randomness of the sensitive data in a clever way, and this lets us add, in many cases, a lot less noise. This system lets us understand the characteristics of any data processing and make it private automatically without unnecessary changes,” says Srini Devadas, an MIT professor who co-authored a new paper on PAC Privacy.
One cool aspect of PAC Privacy is that a user can specify how confident they want to be about their data’s safety right from the start.
For example, maybe they want to be sure that a hacker won’t be more than 1% confident that they’ve successfully recreated the sensitive data to within 5% of its actual value. The PAC Privacy system automatically tells the user the best amount of noise to add to achieve those goals.
However, PAC Privacy doesn’t tell the user how much accuracy the model will lose once the noise is added. Also, since it involves training a machine-learning model on different parts of the data over and over again, it can be quite demanding on computer resources.
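Here is a rough, hypothetical sketch of what that repeated-subsample workflow could look like in Python. It follows the article’s description (run the algorithm many times on random portions of the data, measure how much the output varies, and scale the noise to that variation), but the toy algorithm, the safety_factor, and the exact scaling are illustrative assumptions, not the formula from the PAC Privacy paper, and the mapping from the user’s confidence and accuracy targets to the noise scale is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sensitive dataset and a black-box "algorithm" whose output we
# want to release (here, simply the column averages).
data = rng.normal(size=(2_000, 3))

def algorithm(dataset):
    return dataset.mean(axis=0)

def estimate_noise_scale(dataset, n_trials=200, subsample=1_000, safety_factor=3.0):
    """Re-run the algorithm on random subsets of the data, measure how much
    the output varies, and scale the noise to that spread.

    The subsample-and-measure idea follows the article's description;
    safety_factor and the scaling rule are illustrative assumptions.
    """
    outputs = np.array([
        algorithm(dataset[rng.choice(len(dataset), size=subsample, replace=False)])
        for _ in range(n_trials)
    ])
    spread = outputs.std(axis=0)      # per-coordinate variability
    return safety_factor * spread     # more variability -> more noise

noise_scale = estimate_noise_scale(data)
private_output = algorithm(data) + rng.normal(scale=noise_scale)
print("estimated noise scale:", noise_scale)
print("privatized output    :", private_output)
```

In this toy version, the more the algorithm’s output swings between subsets, the more noise gets added, which is exactly what the stability improvements discussed next would help reduce.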
Future improvements could involve making the user’s machine-learning training process more stable, meaning it doesn’t change much when different data is used.
This would mean less variance between different outputs, so the PAC Privacy system would need to run fewer times to identify the best amount of noise, and it would also need to add less noise.
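To see why stability would help, here is a small illustrative experiment (again a construction for this article, not taken from the paper): a more heavily regularized, and therefore more stable, trainer produces weights that vary less across random subsets of the data, so a PAC-Privacy-style procedure would have less output spread to cover with noise.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical regression dataset standing in for the user's training data.
X = rng.normal(size=(2_000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + rng.normal(scale=1.0, size=2_000)

def train(X_sub, y_sub, alpha):
    """Closed-form ridge regression; a larger alpha means stronger
    regularization, i.e. a more 'stable' training procedure."""
    d = X_sub.shape[1]
    return np.linalg.solve(X_sub.T @ X_sub + alpha * np.eye(d), X_sub.T @ y_sub)

def weight_spread(alpha, n_trials=100, subsample=500):
    """Average per-coordinate spread of the trained weights across
    random subsets of the data."""
    weights = []
    for _ in range(n_trials):
        idx = rng.choice(len(X), size=subsample, replace=False)
        weights.append(train(X[idx], y[idx], alpha))
    return np.std(weights, axis=0).mean()

# The more stable trainer varies less between subsets, so a noise-calibration
# step like the one sketched above would add less noise to its output.
print("spread, light regularization:", weight_spread(alpha=0.1))
print("spread, heavy regularization:", weight_spread(alpha=500.0))
```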
Ultimately, this research from MIT could lead to more accurate machine learning models that can better protect our sensitive data, making it a real win-win for technology and privacy.