Automating Document Processing for a Healthcare Use Case with AWS Textract & Comprehend Medical

By Blessing Adesiji

October 4, 2022   ·   8 min read

Protecting Extracted Health Information with AWS Textract & Comprehend


In this article, we will learn how to automate document processing workflows to meet compliance obligations for sensitive healthcare information. Beyond compliance, protecting health information allows the data to be used in other applications without the risk of leaking private information.

This tutorial shows how to mask personal health information (PHI) in clinical notes using AWS ML and AI services. We first extract the text from an image of a clinical note with Amazon Textract and save it to a local file or an S3 bucket. We then run named-entity recognition (NER) with the Amazon Comprehend Medical API and mask the detected entities.

Note that this tutorial shows a quick proof-of-concept for learning purposes, and additional work may be required for your specific uses.

If you need help with intelligent document processing for healthcare, financial, insurance, and mortgage use cases, Mantium can help you deploy an enterprise-ready solution.


Prerequisites

  • AWS account
  • Python programming knowledge
  • Experience using Jupyter notebooks

Set Up your AWS Environment

To write and run our code, we will use a Jupyter notebook. Follow the steps below to set up the S3 bucket and the notebook instance.

Create an S3 Bucket and Folder

We need an S3 bucket to hold the raw documents and store the processed text documents. To create an S3 bucket, navigate to the S3 Console (you can search for S3 services) and click on Create Bucket.

Add a bucket name of choice, select an AWS Region, and create a bucket by accepting the remaining default settings.

If your bucket name contains the word “sagemaker”, you will not need to create an IAM policy in the following step, and you can proceed directly to attaching policies to your notebook instance.

After creating the bucket, navigate to the bucket page, click the Create folder button to create a folder, provide the folder name, and accept default options.

Create an IAM policy for your SageMaker Notebook to read from your new S3 bucket

If you did NOT name your S3 bucket using “sagemaker,” as described above, then you need to create a new IAM policy to attach in a later step.

  1. Go to the IAM console screen, then proceed to “Policies”
  2. Choose ‘create policy’
  3. Choose any name and description for the policy, but remember the name; you will need it after the step “Create an Amazon SageMaker Jupyter Notebook Instance.” We will call it “MyNoteBookS3Policy” below
  4. Paste the following policy in as JSON
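  5. Review and Save

The policy has one statement for the bucket itself and one for the objects in it. Replace mynewbucket with the name of your bucket. The action names shown here are an assumption (a typical list, read, and write set); adjust them to the access your notebook needs.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::mynewbucket"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": ["arn:aws:s3:::mynewbucket/*"]
    }
  ]
}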

Create an Amazon SageMaker Jupyter Notebook Instance

Navigate to the SageMaker management console; on the left side, click on Notebook and click on Notebook instances. On the page, click the create notebook instance button in the top right corner.

On the next page, provide a name for your notebook instance and select a suitable instance type (a Free Tier-eligible type is fine). Under IAM role, you can create a new IAM role or use an existing one.

Complete the creation steps by accepting the remaining default settings.

Attach Policies to Notebook Instance

On the Notebook Instance page, click on the instance you just created. Navigate to the Permissions and Encryption page, and click on the IAM role ARN.

On the Add Permissions dropdown, click on Attach Policies. After this, search for Textract in the search bar, and select the policy options to access all Amazon Textract APIs.

We need to repeat the above for each of the following policies:

ComprehendFullAccess, ComprehendMedicalFullAccess, AmazonTextractFullAccess, and MyNoteBookS3Policy (if you created it).

Click on the Attach Policies button when you are done. You have now set up your AWS environment.
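The console steps above can also be scripted. The following is a minimal sketch, assuming you know the name of the notebook's IAM role and are allowed to modify it; the helper name `attach_policies` and the role name in the usage comment are hypothetical. It relies on the IAM client's `attach_role_policy` call and the ARNs of the AWS managed policies named in this section.

```python
def attach_policies(iam_client, role_name, policy_arns):
    """Attach each policy ARN to the role and return the ARNs that were attached."""
    attached = []
    for arn in policy_arns:
        # attach_role_policy takes the role name and a single policy ARN
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
        attached.append(arn)
    return attached

# AWS managed policies named in this tutorial.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonTextractFullAccess",
    "arn:aws:iam::aws:policy/ComprehendFullAccess",
    "arn:aws:iam::aws:policy/ComprehendMedicalFullAccess",
]

# Usage (needs IAM permissions; "MyNotebookRole" is a placeholder):
# import boto3
# attach_policies(boto3.client("iam"), "MyNotebookRole", MANAGED_POLICY_ARNS)
```

A customer-managed policy such as MyNoteBookS3Policy has an ARN that includes your account ID, so it would need to be added to the list separately.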

Let’s write some code.


Notebook Setup & Files

Navigate to the Notebook Instance page, and click on either Open Jupyter or Open JupyterLab.

Here is the image file that we will work with in this tutorial; it is an image file of an example clinical note (for educational purposes).

Download and upload this file to your Notebook.

Import Libraries

We will use several libraries in this tutorial, so let’s import them. You can copy the code below into a code cell.

import json
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.s3 import S3Uploader, S3Downloader

Create a Session & S3 Setup Bucket Path

A session stores the configuration state and allows you to create service clients and resources. The code snippet below allows us to create a state and bucket path.

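region = boto3.Session().region_name
role = get_execution_role()
# default_bucket() returns the SageMaker default bucket for this account and
# region; assign your own bucket name here to use the bucket you created.
bucket = sagemaker.Session().default_bucket()
prefix = "phi-masking"
bucket_path = "https://s3-{}.amazonaws.com/{}".format(region, bucket)
endpoint_url = "https://textract.{}.amazonaws.com".format(region)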

Create a Client

Here we will create a client by using the service name, which is textract, then pass the region name, and endpoint_url parameters.

textract_client = boto3.client(service_name = 'textract', region_name = region,
                     endpoint_url = endpoint_url)

Load the Image File

The following code snippet reads the image file into a byte array. We read raw bytes because the DetectDocumentText API accepts the document either as image bytes (the AWS SDK takes care of the base64 encoding for you) or as a reference to an Amazon S3 object.

documentName = "health_notes.png"
with open(documentName, 'rb') as file:
    img_file = file.read()  # raw bytes of the image
    bytes_arr = bytearray(img_file)
    print('Image file is loaded', documentName)
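The two input forms mentioned above (image bytes or an S3 object) map to two shapes of the `Document` argument. Here is a small sketch of building either one; the helper name `build_document_arg` and the bucket and key values are hypothetical, chosen only for illustration.

```python
def build_document_arg(image_bytes=None, bucket=None, key=None):
    """Build the Document argument for detect_document_text.

    Pass either raw image bytes, or an S3 bucket name together with an object key.
    """
    if image_bytes is not None:
        return {"Bytes": bytes(image_bytes)}
    if bucket and key:
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    raise ValueError("Provide image_bytes, or both bucket and key")

# A local file read into memory (only a few header bytes here, as a stand-in).
local_doc = build_document_arg(image_bytes=bytearray(b"\x89PNG"))

# A file that already lives in S3 (placeholder bucket and key).
s3_doc = build_document_arg(bucket="mynewbucket", key="phi-masking/health_notes.png")
print(s3_doc)
```

Referencing the S3 object avoids loading the image into the notebook's memory, which can be convenient when the raw documents are already stored in your bucket.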

Detect Document

The DetectDocumentText API detects text in the input document and returns the detected text in an array of Block objects.

Note that DetectDocumentText is a synchronous operation: the call returns only once the text has been detected.

The following code snippet shows how to use the API to detect the text in the input document and generate a response of Block objects.

response = textract_client.detect_document_text(Document={'Bytes': bytes_arr})
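To get a feel for the shape of the response, it helps to walk the Block objects by hand. The sketch below uses a small hand-written sample in place of a real Textract response (the sample content and the helper name `lines_from_blocks` are made up for illustration); it keeps only the blocks whose `BlockType` is `LINE` and joins their `Text`.

```python
def lines_from_blocks(response):
    """Collect the text of every LINE block, in the order returned."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]

# Hand-written stand-in for a Textract response: PAGE, LINE and WORD blocks.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Patient: John Doe"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Patient:"},
        {"BlockType": "LINE", "Id": "l2", "Text": "Age: 64"},
    ]
}

joined = " ".join(lines_from_blocks(sample_response))
print(joined)
```

WORD blocks repeat the text of the lines they belong to, which is why the sketch filters on LINE only.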

Textract Response Parser

You will notice that the response is not in a convenient format; it is an array of Block objects. We need to pull the useful text out of the response using the Textract Response Parser (TRP). TRP parses the JSON response returned by Amazon Textract.

In your notebook, install TRP with this command:

!pip install amazon-textract-response-parser

Run the code below to parse the JSON response.

from trp import Document
doc = Document(response)
page_string = ''
# Concatenate every detected line, separated by spaces
for page in doc.pages:
    for line in page.lines:
        page_string += " "
        page_string += str(line.text)
print(page_string)

After running the code, you will see the extracted text as the output.

Save Extracted Text to a File

With the following code, you can save the extracted text to a txt file before we mask personal health information.

text_data = 'health_notes.txt'
doc = Document(response)
with open(text_data, 'w', encoding='utf-8') as f:
    for page in doc.pages:
        page_string = ''
        for line in page.lines:
            page_string += " "
            page_string += str(line.text)
        # write one line of text per page
        f.write(page_string + "\n")

Save Extracted Text to S3

The following code saves the extracted text to the S3 bucket. Ensure that you pass the created path with the set prefix.

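# Read the saved text back (one entry per page) as a quick sanity check
with open(text_data, "r", encoding="utf-8") as fi:
    raw_texts = [line.strip() for line in fi.readlines()]
s3 = boto3.resource('s3')
# Upload under the prefix defined earlier, e.g. phi-masking/health_notes.txt
s3.Bucket(bucket).upload_file(text_data, "{}/{}".format(prefix, text_data))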
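Since an S3 "folder" is just a key prefix, it can help to build the object key in one place. This is a small sketch with hypothetical helper names (`s3_key`, `s3_uri`); it only manipulates strings and does not call AWS.

```python
import posixpath

def s3_key(prefix, filename):
    """Join a prefix and a file name into an S3 object key using '/' separators."""
    return posixpath.join(prefix.strip("/"), posixpath.basename(filename))

def s3_uri(bucket_name, key):
    """Format the s3:// URI for an object."""
    return "s3://{}/{}".format(bucket_name, key)

key = s3_key("phi-masking", "health_notes.txt")
print(key)
print(s3_uri("mynewbucket", key))  # "mynewbucket" is a placeholder bucket name
```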

Detect Entities with Comprehend Medical API

The DetectEntitiesV2 operation inspects the clinical text for a variety of medical entities and returns details about each one, including the entity category, its location in the text, and a confidence score.

With the code below, we will use the detect_entities_v2 method to detect medical entities in the English-language text (page_string).

You can print the entities object to see the returned results.

comprehendmedical_client = boto3.client(service_name='comprehendmedical')
entities = comprehendmedical_client.detect_entities_v2(Text=page_string)
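Comprehend Medical returns every kind of medical entity it finds, not only PHI, so you may want to narrow the list before masking. The sketch below filters a hand-written sample response (made up for illustration) down to entities in the `PROTECTED_HEALTH_INFORMATION` category with a score above a threshold. The helper name `phi_entities` and the 0.5 threshold are assumptions, not values from the service documentation.

```python
def phi_entities(response, min_score=0.5):
    """Return entities in the PHI category whose confidence score is at least min_score."""
    return [
        entity
        for entity in response.get("Entities", [])
        if entity.get("Category") == "PROTECTED_HEALTH_INFORMATION"
        and entity.get("Score", 0.0) >= min_score
    ]

# Hand-written stand-in for a detect_entities_v2 response.
sample_entities = {
    "Entities": [
        {"Text": "John Doe", "Category": "PROTECTED_HEALTH_INFORMATION",
         "Type": "NAME", "Score": 0.99, "BeginOffset": 8, "EndOffset": 16},
        {"Text": "64", "Category": "PROTECTED_HEALTH_INFORMATION",
         "Type": "AGE", "Score": 0.95, "BeginOffset": 20, "EndOffset": 22},
        {"Text": "aspirin", "Category": "MEDICATION",
         "Type": "GENERIC_NAME", "Score": 0.98, "BeginOffset": 40, "EndOffset": 47},
        {"Text": "May", "Category": "PROTECTED_HEALTH_INFORMATION",
         "Type": "DATE", "Score": 0.20, "BeginOffset": 50, "EndOffset": 53},
    ]
}

for entity in phi_entities(sample_entities):
    print(entity["Type"], entity["Text"], entity["Score"])
```

Raising the threshold masks less text but with more certainty; lowering it errs on the side of masking more.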

Masking Detected Entities

Now that we have detected the PHI in the extracted text, the next thing is to anonymize the information. For this tutorial, we will write a simple method that uses the detected entity’s generic name to mask the detected information.

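substrings = []
start = 0
# Walk the entities in document order, keeping the text between them
# and substituting each entity's type (e.g. NAME, AGE) for its text.
for entity in sorted(entities["Entities"], key=lambda e: e["BeginOffset"]):
    substrings.append(page_string[start:entity["BeginOffset"]])
    substrings.append(entity["Type"])
    start = entity["EndOffset"]
substrings.append(page_string[start:])  # text after the last entity

masked_text = "".join(substrings)
print(masked_text)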
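To try the masking idea without calling AWS, here is a self-contained sketch that runs on a made-up sentence. The helper names (`mask_entities`, `make_entity`) and the sample text are hypothetical; the entity dictionaries only mimic the `BeginOffset`, `EndOffset`, and `Type` fields of a Comprehend Medical response.

```python
def mask_entities(text, entities):
    """Replace each entity's span in text with the entity's Type label."""
    pieces = []
    start = 0
    # Process entities left to right so the offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"]):
        pieces.append(text[start:entity["BeginOffset"]])
        pieces.append(entity["Type"])
        start = entity["EndOffset"]
    pieces.append(text[start:])  # remainder after the last entity
    return "".join(pieces)

def make_entity(text, phrase, entity_type):
    """Build a sample entity by locating phrase in text (EndOffset is exclusive)."""
    begin = text.index(phrase)
    return {"BeginOffset": begin, "EndOffset": begin + len(phrase),
            "Type": entity_type, "Text": phrase}

sample_text = "Patient John Doe is 64 years old."
sample_phi = [
    make_entity(sample_text, "64", "AGE"),         # deliberately out of order
    make_entity(sample_text, "John Doe", "NAME"),
]

print(mask_entities(sample_text, sample_phi))
```

The sketch assumes entity spans do not overlap; overlapping spans would need to be merged before masking.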

In the results below, you will notice that we replaced each entity with its generic name. For example, John Doe was replaced with NAME, 64 was replaced with AGE, and so on.


This tutorial used AWS Textract to extract text from an image document. We then used Amazon Comprehend Medical to detect personal health information and other entities in the extracted text. Finally, we masked the data with the detected entities’ names.

If you or your organization would like help productionizing a system like this, please reach out to us through this link


Blessing Adesiji
With a Bachelor of Science in Petroleum Engineering, Blessing is a self-taught Data Scientist and Software Engineer. He enjoys educating Mantium users on how to build AI applications as a Developer Relations Engineer. When he is not writing code and tutorials, he enjoys exercising in the gym and playing and watching football.
