

Posted on October 25, 2022 by

Categories: AWS


At Inawisdom, we frequently productionize machine learning models created for our clients. In this blog, I’ll go through some of the details of how we do this, offer some advice, and share what we’ve learned from looking under the hood to see how Amazon SageMaker implements endpoints for making predictions.

Amazon SageMaker Endpoints: What are they?

If you’ve read our previous posts or seen this video, you already know that we at Inawisdom are huge fans of Amazon SageMaker. It offers several useful features that help data scientists gather data and develop models. What I really like, though, is how SageMaker handles the next stage of applying machine learning: using trained model artifacts to generate batch or real-time predictions. This is where an Amazon SageMaker endpoint comes in. It is a fully managed service that lets you make real-time inferences through a REST API, removing the hassle of managing your own EC2 instances, loading artifacts from S3, wrapping the model in a thin REST service, attaching GPUs, and much more. This is fantastic because it lets you deploy a fully functional solution with a single click or command. Here is a sample deployment for XGBoost from a notebook in Amazon SageMaker:
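A minimal sketch of what such a deployment involves, assuming the low-level boto3 SageMaker client rather than the notebook's exact code. The function names, the `ml.m4.xlarge` default, the `AllTraffic` variant name, and every ARN or URI passed in are illustrative assumptions, not values from the original notebook.

```python
"""Sketch: deploy a trained model behind a SageMaker endpoint.

All names, ARNs, and URIs used with these helpers are hypothetical placeholders.
"""


def build_deploy_requests(name, image_uri, model_data_url, role_arn,
                          instance_type="ml.m4.xlarge", instance_count=1):
    """Return the payloads for CreateModel, CreateEndpointConfig, CreateEndpoint."""
    model = {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,              # Docker image holding the inference code
            "ModelDataUrl": model_data_url,  # model artifact written to S3 by training
        },
    }
    endpoint_config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }
    endpoint = {"EndpointName": name, "EndpointConfigName": f"{name}-config"}
    return model, endpoint_config, endpoint


def deploy(sagemaker_client, **kwargs):
    """Send the three requests through a boto3 SageMaker client, in order."""
    model, endpoint_config, endpoint = build_deploy_requests(**kwargs)
    sagemaker_client.create_model(**model)
    sagemaker_client.create_endpoint_config(**endpoint_config)
    return sagemaker_client.create_endpoint(**endpoint)
```

With real credentials this would be called as `deploy(boto3.client("sagemaker"), name=..., image_uri=..., model_data_url=..., role_arn=...)`. In a notebook, the SageMaker Python SDK's `deploy` method on an estimator wraps the same three calls into one.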

Production Burden

A straightforward deployment like the one described above will get you up and running for certain workloads, and you can then sit back and watch predictions being made. However, production workloads that are high-throughput or mission-critical need much more consideration. Currently, you pay on-demand pricing for an Amazon SageMaker endpoint based on how many hours the instances behind it are running (including idle time). So, as with EC2 instances, the following must be taken into account and balanced:

Cost optimization: Choose an instance type for the endpoint that meets your baseline usage requirements (with or without Elastic GPU).

Elastic scaling: To handle swings between low and high usage, the number of instances behind the endpoint must scale in and out with demand.

High availability: Because black swan events such as Availability Zone failures can happen, mission-critical systems must be prepared to handle them gracefully. High availability also covers the smooth rollout of model updates and the option to roll back to a previous stable version.
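As an illustration of the elastic scaling point, here is a rough sketch of a target-tracking policy for an endpoint variant using Application Auto Scaling through boto3. The helper names, the capacity bounds, the target of 1000 invocations per instance, and the cooldowns are all assumptions chosen for the example; suitable values depend on your own load tests.

```python
"""Sketch: target-tracking auto scaling for a SageMaker endpoint variant.

Capacity bounds, target value, and cooldowns are illustrative assumptions.
"""


def build_scaling_requests(endpoint_name, variant_name="AllTraffic",
                           min_instances=2, max_instances=6,
                           invocations_per_instance=1000.0):
    """Return payloads for RegisterScalableTarget and PutScalingPolicy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Bounds on how far the variant may scale in and out.
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    # Keep the average invocations per instance near the target value.
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations-target",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing instances
            "ScaleOutCooldown": 60,   # seconds to wait before adding more
        },
    )
    return target, policy


def apply_scaling(autoscaling_client, **kwargs):
    """Register the target and attach the policy via a boto3 client."""
    target, policy = build_scaling_requests(**kwargs)
    autoscaling_client.register_scalable_target(**target)
    return autoscaling_client.put_scaling_policy(**policy)
```

The client here would be `boto3.client("application-autoscaling")`. A minimum of two instances is assumed so that the endpoint is not left on a single instance, which ties the scaling setting back to the high availability concern.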

The Fundamental Characteristics of SageMaker Endpoints

To optimize and tune Amazon SageMaker endpoints, we need to learn more about their nature and how they are used. First off, AWS is open about the fact that SageMaker uses Docker for training jobs and endpoints; further information can be found at:

  • AWS: The code for your algorithm inference
  • AWS: The code for your algorithm inference (how the container serves requests)
  • AWS: The code for your algorithm inference (run image)
  • SageMaker API on AWS (CreateModel)

The use of Docker immediately raises the following questions:

  • How many underlying EC2 instances are used, and does each one run one or more Docker containers?
  • Is ECS or Kubernetes used? Do I need to become an expert in Docker?
  • How quickly do instances start and stop?
  • How do instances consume network resources when they live inside a VPC? For instance, can the number of instances exhaust a VPC’s network addresses?
  • How isolated are my models from one another? Docker uses soft CPU and memory limits, doesn’t it?
  • If containers are bin-packed or redistributed, will I experience problems?

The Investigation

I ran several tests to look under the hood of an Amazon SageMaker endpoint and examine all of these questions. Here is how I did it:

I added three /20 subnets (4091 addresses each) to my dedicated VPC in eu-west-1, one in each of the three Availability Zones.
After deploying an Amazon SageMaker notebook instance into the VPC in AZ 1a, I downloaded this example XGBoost notebook.
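The figure of 4091 addresses per subnet follows from the /20 prefix: 4096 addresses, minus the five that AWS reserves in every subnet. A small sketch of the arithmetic, assuming a hypothetical `10.0.0.0/16` VPC range (the actual CIDR blocks used in the test are not given):

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr):
    """Number of addresses left for network interfaces in an AWS subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


# Hypothetical VPC range, carved into /20 subnets, one per Availability Zone.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=20))[:3]

for az, subnet in zip(("eu-west-1a", "eu-west-1b", "eu-west-1c"), subnets):
    print(az, subnet, usable_addresses(str(subnet)))
```

Each subnet starts with 4091 free addresses, so any drop in the available count during the test can be attributed to network interfaces created inside the VPC.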

I modified the notebook’s model settings to use the VPC as follows:
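The exact change made in the notebook is not reproduced here. As a hedged sketch, at the API level placing a model inside a VPC amounts to adding a `VpcConfig` block to the CreateModel request. The helper name and the subnet and security group IDs below are hypothetical placeholders.

```python
def with_vpc_config(create_model_request, subnet_ids, security_group_ids):
    """Return a copy of a CreateModel payload with a VpcConfig block added."""
    request = dict(create_model_request)  # shallow copy; the input is left untouched
    request["VpcConfig"] = {
        "Subnets": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }
    return request


# Hypothetical identifiers: one subnet per Availability Zone, one security group.
base_request = {"ModelName": "demo-xgboost"}
vpc_request = with_vpc_config(
    base_request,
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"],
    security_group_ids=["sg-dddd4444"],
)
print(vpc_request["VpcConfig"])
```

When working with the SageMaker Python SDK instead, the estimator accepts `subnets` and `security_group_ids` arguments that serve the same purpose.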

  • After completing all the steps up to “Create endpoint,” I noted the number of addresses available in the subnets.
  • I then ran the “Create endpoint” step, waited for the endpoint to come online, and again noted the addresses available in the VPC.
  • Next, I created a new AWS Lambda function that receives the content of an HTTP POST request and sends it to the Amazon SageMaker endpoint URL. When the endpoint returns a response, the Lambda function passes that response back to the invoker.
  • After that, I set up an AWS API Gateway API with a regional endpoint, secured with an API key. Within it, I created a resource with an AWS Lambda integration that invokes the Lambda function from the previous step.
  • I downloaded, installed, and configured serverless-artillery to load-test my API Gateway.
  • From serverless-artillery, I launched the load test.
  • Every 15 minutes for two hours, I recorded data from CloudWatch and the number of available addresses in the VPC.
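The Lambda function in the steps above can be sketched roughly as follows. This is not the original function: it assumes an API Gateway Lambda proxy integration (so the POST body arrives in `event["body"]`), a CSV payload, and an endpoint name supplied through a hypothetical `ENDPOINT_NAME` environment variable. It calls the endpoint through the `sagemaker-runtime` client's `invoke_endpoint` rather than posting to the URL directly.

```python
import os

# Hypothetical endpoint name; in practice set as a Lambda environment variable.
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "demo-xgboost")

_runtime = None


def _runtime_client():
    """Create the boto3 client lazily so it is reused across warm invocations."""
    global _runtime
    if _runtime is None:
        import boto3  # available by default in the AWS Lambda Python runtime
        _runtime = boto3.client("sagemaker-runtime")
    return _runtime


def handler(event, context, runtime=None):
    """Forward the POST body to the endpoint and return its prediction."""
    runtime = runtime or _runtime_client()
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",   # assumption: the model accepts CSV rows
        Body=event["body"],
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": prediction}
```

The optional `runtime` argument exists only so the handler can be exercised locally with a stub; Lambda itself calls it with `event` and `context`.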

Please be aware of the following:

Due to a strict service limit on the AWS account I was using, the ml.t2.medium instance type was used. This highlights a crucial point: Amazon SageMaker’s service limits are quite restrictive. Once you know the type of instance you will be using and how many instances an endpoint will require, you should proactively request a limit increase. I would suggest doing so before commissioning any model into production.

An API key, rather than IAM roles, was used for API Gateway to make integrating with serverless-artillery as simple as possible.

The XGBoost model underwent a two-hour load test that started slowly and then sustained a high level of demand. I validated this, and the fact that I was using two instances, by checking the endpoint’s overall CPU utilization in Amazon CloudWatch. The graph is shown here: