Advanced PII detection and anonymization with Hugging Face Transformers and Amazon SageMaker
repository philschmid/advanced-pii-huggingface-sagemaker
Personally identifiable information (PII) is any data that could potentially identify a specific individual, i.e. distinguish one person from another. Below are a few examples of PII:
- Name
- Address
- Date of birth
- Telephone number
- Credit Card number
Protecting PII is essential for personal privacy, data privacy, data protection, information privacy, and information security. With just a few bits of an individual's personal information, thieves can create false accounts in the person's name, incur debt, create a falsified passport or sell a person's identity to a criminal.
Transformer models are changing the world of machine learning, starting with natural language processing (NLP), and now, with audio and computer vision. Hugging Face’s mission is to democratize good machine learning and give anyone the opportunity to use these new state-of-the-art machine learning models.
Models like BERT, RoBERTa, T5, and GPT-2 have captured the NLP space and achieve state-of-the-art results across almost all NLP tasks, including text classification, question answering, and token classification.
In this blog, you will learn how to use state-of-the-art Transformers models to recognize, detect and anonymize PII using Hugging Face Transformers, Presidio & Amazon SageMaker.
What is Presidio?
Presidio (Origin from Latin praesidium ‘protection, garrison’) helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text and images such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.
- From Presidio Documentation
By default, Presidio uses spaCy for PII identification and extraction. In this example, we are going to replace spaCy with a Hugging Face Transformer to perform PII detection and anonymization.
Presidio already supports 24 PII entities out of the box, including CREDIT_CARD, IBAN_CODE, EMAIL_ADDRESS, US_BANK_NUMBER, US_ITIN...
We are going to extend these 24 entities with transformers to include LOCATION, PERSON & ORGANIZATION. It is also possible to use any other "entity" extracted by the transformers model.
You will learn how to:
- Setup Environment and Permissions
- Create a new transformers based EntityRecognizer
- Create a custom inference.py including the EntityRecognizer
- Deploy the PII service to Amazon SageMaker
- Request and customization of requests
Let's get started! 🚀
1. Setup Environment and Permissions
Note: we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow, if you haven't installed it yet.
Install git and git-lfs
Permissions
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
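Outside of SageMaker Studio or Notebook Instances, the role usually has to be looked up by name. The following is a minimal sketch of that lookup, not the exact code of this post. The function name and the role name sagemaker_execution_role are placeholders, and the imports are deferred so that sagemaker and boto3 are only needed when the function is called:

```python
def resolve_execution_role(fallback_role_name="sagemaker_execution_role"):
    """Return the ARN of the IAM role SageMaker should assume.

    `fallback_role_name` is a hypothetical placeholder; replace it with a
    role in your account that has the required SageMaker permissions.
    """
    import sagemaker  # deferred: only needed when the function is called

    try:
        # Works inside SageMaker Studio and Notebook Instances.
        return sagemaker.get_execution_role()
    except ValueError:
        # Local environment: get_execution_role typically raises ValueError
        # here, so we look the role up by name via IAM instead.
        import boto3

        iam = boto3.client("iam")
        return iam.get_role(RoleName=fallback_role_name)["Role"]["Arn"]
```

The returned ARN is what gets passed as the role when the model is created later on.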
2. Create a new transformers based EntityRecognizer
Presidio can be extended to support the detection of new types of PII entities and to support additional languages. These PII recognizers can be added via code or ad-hoc as part of the request.
- The EntityRecognizer is an abstract class for all recognizers.
- The RemoteRecognizer is an abstract class for calling external PII detectors. See more info here.
- The abstract class LocalRecognizer is implemented by all recognizers running within the Presidio-analyzer process.
- The PatternRecognizer is a class for supporting regex and deny-list-based recognition logic, including validation (e.g., with checksum) and context support. See an example here.
For simple recognizers based on regular expressions or deny-lists, we can leverage the provided PatternRecognizer.
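As a rough illustration, a deny-list recognizer and a regex recognizer could be built as sketched below. The entity names TITLE and NUMBER, the deny-list values and the helper name build_pattern_recognizers are example choices, not something this post prescribes. The presidio_analyzer import is deferred so that the regex itself can be tried with the standard library alone:

```python
import re

# Example values for illustration only (assumptions of this sketch).
TITLE_DENY_LIST = ["Mr.", "Mrs.", "Miss"]
NUMBER_REGEX = r"\d+"


def build_pattern_recognizers():
    """Create one deny-list-based and one regex-based recognizer."""
    from presidio_analyzer import Pattern, PatternRecognizer  # deferred import

    titles = PatternRecognizer(supported_entity="TITLE", deny_list=TITLE_DENY_LIST)
    numbers = PatternRecognizer(
        supported_entity="NUMBER",
        patterns=[Pattern(name="numbers_pattern", regex=NUMBER_REGEX, score=0.5)],
    )
    return titles, numbers


# The regular expression can be sanity-checked without Presidio installed.
print(re.findall(NUMBER_REGEX, "I live in 510 Broad st."))
```

With presidio-analyzer installed, calling titles.analyze(text, entities=["TITLE"]) on the returned recognizer yields the matched spans.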
To create a Hugging Face Transformer recognizer, you have to create a new class deriving from EntityRecognizer and implementing a load and an analyze method.
For this example, the __init__ method will be used to "load" our model using the transformers.pipeline for token-classification.
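A minimal sketch of such a recognizer is shown below, under a few assumptions: the label mapping PER/LOC/ORG matches the label set of the NER model you use, and the helper names select_entities and build_hf_recognizer are made up for this example. The class is created inside a factory function only so that the presidio_analyzer and transformers imports can be deferred, which lets the mapping logic be tried on its own:

```python
# Assumed mapping from the NER model's labels to Presidio entity names;
# adjust it to the label set of the model you actually use.
LABEL_TO_PRESIDIO = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}


def select_entities(predictions, entities=None):
    """Turn aggregated token-classification output into Presidio-style spans.

    `predictions` is the list of dicts a transformers pipeline returns when
    an aggregation strategy is set (keys: entity_group, score, start, end).
    """
    spans = []
    for pred in predictions:
        entity_type = LABEL_TO_PRESIDIO.get(pred["entity_group"])
        if entity_type is None:
            continue  # label we do not map, e.g. MISC
        if entities and entity_type not in entities:
            continue  # the caller did not ask for this entity
        spans.append(
            {
                "entity_type": entity_type,
                "start": int(pred["start"]),
                "end": int(pred["end"]),
                "score": float(pred["score"]),
            }
        )
    return spans


def build_hf_recognizer(model_id_or_path, aggregation_strategy="simple"):
    """Create an EntityRecognizer backed by a token-classification pipeline."""
    # Deferred imports: the helper above works without either library.
    from presidio_analyzer import EntityRecognizer, RecognizerResult
    from transformers import pipeline

    class HFTransformersRecognizer(EntityRecognizer):
        def __init__(self):
            # "Load" the model here, before the base class calls load().
            self.pipeline = pipeline(
                "token-classification",
                model=model_id_or_path,
                aggregation_strategy=aggregation_strategy,
            )
            super().__init__(
                supported_entities=sorted(set(LABEL_TO_PRESIDIO.values())),
                supported_language="en",
            )

        def load(self):
            # Nothing to do: the pipeline is created in __init__.
            pass

        def analyze(self, text, entities=None, nlp_artifacts=None):
            spans = select_entities(self.pipeline(text), entities)
            return [RecognizerResult(**span) for span in spans]

    return HFTransformersRecognizer()
```

The recognizer returned by build_hf_recognizer can then be registered with an AnalyzerEngine through analyzer.registry.add_recognizer(...).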
If you want to learn more about how you can customize or create recognizers, you can check out the documentation.
3. Create a custom inference.py including the EntityRecognizer
To use the custom inference script, you need to create an inference.py script. In this example, we are going to overwrite the model_fn to load our HFTransformersRecognizer correctly and the predict_fn to run the PII analysis.
Additionally, we need to provide a requirements.txt in the code/ directory to install presidio and other required dependencies.
create inference.py
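A possible shape for such a script is sketched below; it is not the exact script of this post. It assumes the recognizer lives in a hypothetical module hf_recognizer.py next to inference.py (pasting the class directly into inference.py avoids that import), and that requests carry the optional parameters entities and anonymize, which are names chosen for this sketch:

```python
def model_fn(model_dir):
    """Called once at startup: build the analyzer and anonymizer engines."""
    # Deferred imports keep predict_fn importable without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    # Hypothetical module holding the transformers-based recognizer.
    from hf_recognizer import build_hf_recognizer

    analyzer = AnalyzerEngine()
    analyzer.registry.add_recognizer(build_hf_recognizer(model_dir))
    return {"analyzer": analyzer, "anonymizer": AnonymizerEngine()}


def predict_fn(data, model):
    """Run PII detection, or anonymization when the request asks for it."""
    text = data["inputs"]
    parameters = data.get("parameters") or {}
    # None means "look for every entity the analyzer supports".
    entities = parameters.get("entities")

    results = model["analyzer"].analyze(text=text, entities=entities, language="en")

    if parameters.get("anonymize", False):
        anonymized = model["anonymizer"].anonymize(text=text, analyzer_results=results)
        return {"anonymized": anonymized.text}
    return {"found": [result.to_dict() for result in results]}
```

Returning both engines from model_fn means they are created once per worker rather than on every request.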
create requirements.txt
4. Deploy the PII service to Amazon SageMaker
Before you can deploy the PII service to Amazon SageMaker, you need to create a model.tar.gz with the inference script and the model.
You need to bundle the inference.py and all model artifacts, e.g. pytorch_model.bin, into a model.tar.gz. The inference.py script will be placed into a code/ folder. We will use git and git-lfs to easily download our model from hf.co/models and upload it to Amazon S3 so we can use it when creating our SageMaker endpoint.
As the base model for the recognizer, the example will use Jean-Baptiste/roberta-large-ner-english.
- Download the model from hf.co/models with git clone.
- Copy inference.py into the code/ directory of the model directory.
- Create a model.tar.gz archive with all the model artifacts and the inference.py script.
- Upload the model.tar.gz to Amazon S3.
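The copy, archive and upload steps can be sketched in Python as follows, assuming the model repository has already been cloned with git. The function names are made up for this example, and the upload helper needs the sagemaker SDK and AWS credentials, so its import is deferred:

```python
import shutil
import tarfile
from pathlib import Path


def build_model_archive(model_dir, code_files, archive_path="model.tar.gz"):
    """Copy the given scripts into code/ and pack the model directory."""
    model_dir = Path(model_dir)
    archive_path = Path(archive_path)
    code_dir = model_dir / "code"
    code_dir.mkdir(exist_ok=True)
    for src in code_files:  # e.g. inference.py and requirements.txt
        shutil.copy(src, code_dir / Path(src).name)

    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.iterdir()):
            if path.name == ".git" or path.resolve() == archive_path.resolve():
                continue  # skip repository metadata and the archive itself
            # Artifacts must sit at the archive root, not under a folder name.
            tar.add(path, arcname=path.name)
    return archive_path


def upload_model_archive(archive_path, s3_uri):
    """Upload the archive; `s3_uri` looks like s3://<bucket>/<prefix>."""
    from sagemaker.s3 import S3Uploader  # deferred: requires the sagemaker SDK

    return S3Uploader.upload(local_path=str(archive_path), desired_s3_uri=s3_uri)
```

The S3 URI returned by the upload is what the model class expects as its model data location.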
After you have uploaded the model.tar.gz archive to Amazon S3, you can create a custom HuggingFaceModel class. This class will be used to create and deploy our SageMaker endpoint.
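A minimal sketch of this step is given below. The framework versions and the instance type are assumptions that have to match a container combination supported by your version of the sagemaker SDK, and both function names are invented for this example:

```python
def huggingface_model_kwargs(model_data, role, transformers_version="4.17",
                             pytorch_version="1.10", py_version="py38"):
    """Collect the arguments for HuggingFaceModel (versions are assumptions)."""
    return {
        "model_data": model_data,  # S3 URI of the uploaded model.tar.gz
        "role": role,  # IAM role ARN with SageMaker permissions
        "transformers_version": transformers_version,
        "pytorch_version": pytorch_version,
        "py_version": py_version,
    }


def deploy_pii_endpoint(model_data, role, instance_type="ml.g4dn.xlarge"):
    """Create the model and deploy it to a real-time endpoint."""
    from sagemaker.huggingface.model import HuggingFaceModel  # deferred import

    model = HuggingFaceModel(**huggingface_model_kwargs(model_data, role))
    return model.deploy(initial_instance_count=1, instance_type=instance_type)
```

Calling deploy_pii_endpoint starts billable infrastructure, so it only makes sense once the archive and role are in place.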
5. Request and customization of requests
The .deploy() method returns a HuggingFacePredictor object which can be used to request inference.
Simple detection request
Detect only specific PII entities
Anonymizing PII entities
Hello, my name is <PERSON> and I live in <LOCATION>. I work as a software engineer at <ORGANIZATION>. You can call me at <PHONE_NUMBER>. My credit card number is <CREDIT_CARD> and my crypto wallet id is <CRYPTO>.
On <DATE_TIME> I visited <URL> and sent an email to <EMAIL_ADDRESS>, from the IP <IP_ADDRESS>. My passport: 191280342 and my phone number: <PHONE_NUMBER>. This is a valid International Bank Account Number: <IBAN_CODE>. Can you please check the status on bank account 954567876544? <PERSON>'s social security number is <PHONE_NUMBER>. Her driver license? it is 1234567A.
Anonymizing only specific PII entities
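The four request variants above differ only in the optional parameters field of the payload. The sketch below builds all four, assuming the inference script reads the keys entities and anonymize from parameters (names chosen for this sketch, so they must match your own inference.py); the sample sentence is made up:

```python
def build_request(text, entities=None, anonymize=False):
    """Build the JSON payload for the PII endpoint."""
    payload = {"inputs": text}
    parameters = {}
    if entities:
        parameters["entities"] = list(entities)  # restrict to these entities
    if anonymize:
        parameters["anonymize"] = True  # return anonymized text, not spans
    if parameters:
        payload["parameters"] = parameters
    return payload


text = "Hello, my name is David Johnson and I live in Maine."

simple = build_request(text)
only_some = build_request(text, entities=["PERSON", "LOCATION", "ORGANIZATION"])
anonymized = build_request(text, anonymize=True)
anonymized_some = build_request(text, entities=["PERSON", "LOCATION"], anonymize=True)

# With a deployed endpoint, each payload is sent the same way, e.g.:
# result = predictor.predict(data=anonymized_some)
```

Keeping the payload construction in one helper makes it easy to see which parameters a given request actually sends.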
Delete model and endpoint
To clean up, we can delete the model and endpoint.
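A small sketch of the cleanup, assuming predictor is the object returned by .deploy(); the wrapper function name is made up for this example:

```python
def delete_endpoint_resources(predictor):
    """Delete the SageMaker model first, then the endpoint serving it."""
    # The model is removed first, while the endpoint that references it
    # still exists and can be used to look it up.
    predictor.delete_model()
    predictor.delete_endpoint()
```

After both calls, the endpoint no longer incurs costs.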
6. Conclusion
We successfully created our Transformers-based PII detection and anonymization service with Hugging Face Transformers and Amazon SageMaker.
The service can either detect or directly anonymize the payload we send to the endpoint. The service is built on top of open-source libraries, including transformers and presidio, to keep full control of how detection and anonymization are done.
This is a huge benefit compared to services like Amazon Comprehend, which are non-customizable, opaque black-box solutions.
This solution can easily be extended and improved by improving the transformers model used, e.g. to identify job titles like “software engineer”, or by adding a new pattern recognizer, e.g. for the German personal number.
The code can be found in this repository philschmid/advanced-pii-huggingface-sagemaker
Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.