Privacy-preserving machine learning
In today's digital age, vast amounts of data are constantly generated and processed with every action we take on our devices, and this has made privacy a paramount concern. Machine learning algorithms thrive on large datasets, and user privacy has often been sacrificed for the sake of more advanced algorithms and models. Fortunately, even though privacy-preserving machine learning is still not always the norm, several approaches have emerged to balance extracting valuable insights from data with protecting individuals' sensitive information. In this blog post, we will provide an overview of some of these approaches. Specifically, we will discuss differential privacy, federated learning, homomorphic encryption, and secure multi-party computation, describing their benefits and limitations. We hope this will be a good starting point before jumping into more advanced resources, such as this survey paper by researchers from IBM and the University of Pittsburgh.
Differential privacy is a mathematical framework that protects individuals' data by ensuring that the presence or absence of any specific data point does not significantly impact the outcome of an analysis (or query). In other words, it aims to prevent an adversary from identifying whether a specific individual's data was included in a dataset, and it provides robust privacy guarantees while still enabling accurate analysis of aggregated data. The core idea is the introduction of controlled noise into the computation of query responses: by injecting calibrated noise, the contribution of individual data points is obscured while meaningful aggregate analysis remains possible (meaning that any conclusions drawn from the dataset remain valid regardless of whether an individual's data is included or excluded).
Code for generating differentially private statistics over datasets is available in the following GitHub repository from Google.
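To make the mechanism concrete, here is a minimal sketch of the Laplace mechanism for a counting query; the `dp_count` helper is our own illustrative name, not part of the Google library mentioned above. A count has sensitivity 1 (adding or removing one record changes it by at most 1), so Laplace noise with scale 1/ε yields ε-differential privacy.

```python
import random

def dp_count(records, predicate, epsilon):
    """Noisy count of matching records under epsilon-differential privacy.

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon is sufficient to satisfy the epsilon-DP guarantee.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two exponentials is Laplace-distributed with scale 1/epsilon
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Hypothetical dataset of ages; the true count of people aged 40+ is 3
ages = [23, 35, 41, 29, 52, 38, 47, 31]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))  # noisy value near 3
```

Note that averaging many such noisy answers would recover the true count, which is exactly why repeated queries consume the privacy budget: each released answer spends some ε.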
Privacy Preservation: it provides a strong privacy guarantee by ensuring that individual data points cannot be distinguished or linked to specific individuals within a dataset.
Quantifiable Privacy: it offers a quantifiable measure of privacy (through the concept of privacy budget or privacy loss), which allows for fine-tuning the level of privacy protection according to specific requirements.
Robustness: it is robust against various types of attacks (such as statistical inference or auxiliary-information attacks), even in the presence of adversaries with background knowledge or access to external information sources.
Trade-off with Utility: the introduced noise can reduce the accuracy or utility of the analysis or computations performed, so finding the right trade-off between privacy and utility becomes a crucial consideration: adding more noise to protect privacy may lead to a loss of valuable insights or accuracy.
Difficulty in Implementation: it requires careful design and calibration of noise parameters. Also, it may require a deep understanding of the underlying algorithms and data.
Limitations in Query Types: it works well for aggregate queries or computations over a dataset, such as counting, averaging, or histograms. However, it is limited when applied to certain types of queries or computations that involve specific patterns or correlations within the data.
Interpretability: The noise introduced can make it more difficult to interpret and understand the reasoning behind specific outcomes or results, as it might obscure the direct causal relationship between inputs and outputs.
Federated learning is a decentralised approach to machine learning where training is performed on local devices (or edge servers) instead of a single location where all the data is centralised. Models are trained collaboratively across devices, with each device contributing its local data to the process. A central server coordinates the training by aggregating the local model updates and distributing the improved model back to the devices, without ever accessing the data, which stays on the edge devices. This approach allows for privacy-preserving machine learning (i.e., users retain control over their data) while enabling global model improvements. Here is the link to a federated learning library (PySyft) available on GitHub.
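As a hypothetical illustration of the idea, here is a minimal sketch of one round of federated averaging (FedAvg) for a one-parameter linear model. The `local_update` and `server_round` names are our own, and a real framework such as PySyft handles far more, but the key property is visible: the server only ever receives model weights, never raw data.

```python
def local_update(weights, local_data, lr=0.1):
    """One gradient step of a linear model y = w*x on a client's own data."""
    w = weights
    grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
    return w - lr * grad

def server_round(global_w, clients):
    """Average the clients' locally updated weights, weighted by data size.

    The server sees only weights; the (x, y) pairs never leave the clients.
    """
    total = sum(len(data) for data in clients)
    return sum(local_update(global_w, data) * len(data) for data in clients) / total

# Three clients, each holding private (x, y) pairs drawn from y = 2x
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(0.5, 1.0), (1.5, 3.0)]]
w = 0.0
for _ in range(50):
    w = server_round(w, clients)
print(round(w, 3))  # 2.0, the true slope
```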
Privacy Preservation: it keeps the raw data decentralised and local to the devices, which ensures that sensitive data remains on the user's device or within a specific organisation, reducing the risk of data breaches or unauthorised access.
Data Ownership and Control: users retain ownership and control over their data (which remains on their devices), while still having the possibility of leveraging it by participating in the model training.
Efficient Resource Utilisation: it reduces the need for data transfer to a central server, minimising bandwidth and storage requirements. It also distributes the computation load across multiple devices, which leads to efficient resource utilisation (indeed, federated learning is suitable even in scenarios with limited computational resources).
Collaborative Learning: it enables collaboration among multiple parties without sharing sensitive data. Several entities can collectively improve a shared model by training it on their respective local data: this allows for collective intelligence and knowledge sharing while respecting data privacy.
Communication Overhead: the repeated exchanges between the central server and the participating devices during training can incur significant communication overhead, which might result in slower convergence or increased latency.
Heterogeneous Data: different devices may have varying distributions or characteristics of data, and these disparities can introduce challenges in achieving consistent and representative models across all devices. Some methods (such as data sampling or weighting schemes) exist and need to be put in place to mitigate this issue.
Lack of Centralised Data: the data remains decentralised on the edge devices, and there may be instances where access to a representative portion of the dataset is limited.
Model Security and Integrity: since the model parameters are distributed and shared among multiple devices, ensuring the security and integrity of the model becomes crucial. Robust security measures need to be put in place to detect and block adversarial attacks or attempts to manipulate the model.
Homomorphic encryption is an encryption scheme that allows computations to be performed directly on encrypted data, without the need for decryption. In other words, it enables data to remain encrypted throughout the computation process, which i) ensures privacy and ii) allows useful operations to be performed on the data. Unlike federated learning, it is an approach to privacy-preserving machine learning that can be used on data centralised in a single location.
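As a concrete toy example, here is a from-scratch sketch of the Paillier cryptosystem, a partially homomorphic scheme in which multiplying two ciphertexts adds the underlying plaintexts. The primes here are deliberately tiny and the code is for illustration only: real deployments use vetted libraries and 2048-bit keys.

```python
import random
from math import gcd

# Toy Paillier keypair with tiny hard-coded primes (illustration only)
p, q = 293, 433
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1
mu = pow(lam, -1, n)  # modular inverse used during decryption

def encrypt(m):
    """Encrypt plaintext m (0 <= m < n) with fresh randomness r."""
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Recover the plaintext via Paillier's L function: L(x) = (x - 1) // n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Multiplying ciphertexts (mod n^2) adds the plaintexts: 15 + 27 = 42
a, b = encrypt(15), encrypt(27)
print(decrypt((a * b) % n2))  # 42, computed without ever decrypting a or b
```

Paillier is additively homomorphic only; as noted below, different homomorphic schemes support different operations, with fully homomorphic encryption supporting both addition and multiplication at a much higher computational cost.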
Privacy Preservation: it ensures that sensitive data remains encrypted at all times, even during computations. Thus the data is never exposed in its decrypted form.
Secure Outsourcing: it enables secure outsourcing of computations to third-party service providers or cloud environments, without revealing the underlying data. This allows organisations to leverage external computing resources while maintaining control over their sensitive information.
Flexible and Generalised Computations: it supports a wide range of mathematical operations, allowing for complex computations to be performed on encrypted data. This includes addition, multiplication, comparison, and more, making it suitable for various machine learning algorithms and analytical tasks. Importantly, there are different types of homomorphic encryption, and not all have the same capabilities.
Interoperability: it can be implemented using standardised cryptographic libraries and protocols, making it compatible with existing cryptographic infrastructure.
Performance Overhead: it is computationally intensive, resulting in increased processing time and resource requirements compared to traditional computations on unencrypted data.
Limited Functionality: although it supports various operations, it may not (efficiently) cover the entire spectrum of computations required for certain applications, such as deep learning architectures with non-linear activation functions.
Key Management: it requires secure key management practices: encryption keys must be kept safe and accessible only to authorised parties. This can be a challenge when multiple parties are involved, or when the encrypted data needs to be processed by different entities.
Secure Multi-Party Computation
Secure Multi-Party Computation (MPC) is a cryptographic technique that enables multiple parties to jointly compute a function on their private inputs without revealing them to each other. In the context of machine learning, MPC allows for collaborative model training without sharing individual data points, similarly to federated learning.
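One of the simplest building blocks of MPC is additive secret sharing, sketched below with illustrative names and an arbitrary prime modulus: each party splits its private input into random shares that individually reveal nothing, yet the shares can be combined to compute a joint sum.

```python
import random

MOD = 2**61 - 1  # prime modulus for share arithmetic (illustrative choice)

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to it mod MOD.

    Any subset of fewer than n shares is uniformly random and
    therefore reveals nothing about the secret.
    """
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

# Three parties with private inputs jointly compute their sum
inputs = [10, 20, 12]
# Each party shares its input; party j holds the j-th share of every input
all_shares = [share(x, 3) for x in inputs]
partial = [sum(s[j] for s in all_shares) % MOD for j in range(3)]
# Only when the partial sums are combined does the total emerge
print(sum(partial) % MOD)  # 42, with no party ever seeing another's input
```

Real MPC protocols build on this primitive (plus techniques such as Beaver triples for multiplication) to evaluate arbitrary functions, which is where the computational and communication costs discussed below come from.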
Privacy Preservation: MPC ensures that each party's private inputs remain secret throughout the computation process. No single party gains access to the complete set of inputs, protecting the privacy of sensitive information. This makes MPC particularly useful in scenarios where data sharing is restricted due to legal, competitive, or confidentiality reasons.
Trust and Collaboration: MPC allows multiple parties to collaborate and jointly analyse their data without the need to share it explicitly. This enables organisations or entities to work together, leveraging the collective knowledge and data resources while maintaining privacy and trust between parties. It promotes cooperation in situations where data owners are hesitant to share their sensitive information.
Flexibility in Computation: MPC supports a wide range of computations, including mathematical operations and algorithms used in machine learning and data analysis. Parties can collectively compute complex functions while preserving the privacy of their inputs. This flexibility allows for privacy-preserving analytics on sensitive data across multiple domains and applications.
Resistance to Attacks: MPC provides strong security guarantees against various types of attacks, including collusion attacks, where multiple parties collude to reveal private information. The cryptographic protocols and techniques employed in MPC ensure that even with partial knowledge of other parties' inputs, no additional information can be inferred.
Computation Overhead: it can introduce significant computational overhead compared to traditional computations on centralised data, due to the encryption, decryption, and communication involved.
Communication Complexity: it requires the participating parties to exchange encrypted messages and perform secure computations, which can be a limiting factor, particularly in scenarios with a large number of parties or limited network bandwidth.
Scalability: the scalability of MPC depends on the number of parties involved and the complexity of the computations performed. As the number of parties increases or the complexity of computations grows, the computational and communication overhead can become more significant, potentially limiting the feasibility of MPC in certain scenarios.
Privacy-preserving machine learning has the potential to reconcile the (seemingly) contradictory goals of leveraging data for insights and training and protecting individuals' data and privacy. The approaches described above are some of the possible techniques to find this balance and, as machine learning technology continues to advance, it is crucial to keep in mind its privacy implications and develop privacy-preserving techniques accordingly.