Step Functions with API Gateway integration, good idea?
I’m currently testing the AWS Step Functions direct integration with Amazon API Gateway. The aim is to make an API call directly from my state machine, rather than from an intermediary Lambda function. Potentially, this can improve performance and reduce cost too. Let’s have a look.
In a registration workflow, I need to validate the address provided by the user with a third-party API. In my case, I’m using an open API from the French government: https://adresse.data.gouv.fr/api-doc/adresse. The process is quite simple:
- Call the external API with the input from the user (street, city, postal code).
- Assert there is a valid address corresponding to that input: at least one result, with a confidence score > 0.82.
- Retrieve the “official” label for this address.
Using a Lambda function
The naive approach is to have a Lambda function that will request this API and perform the validation logic:
Let’s have a look at the function code (in Python):
Nothing fancy here, just use requests to query the API, retrieve a result and check it is correct.
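A minimal sketch of what such a function could look like. It is not the original listing: it uses the standard library's `urllib` in place of `requests` so it runs without dependencies, and the endpoint URL, the event keys (`street`, `postal_code`, `city`) and the GeoJSON response shape (`features[].properties.score` and `.label`) are assumptions to check against the API documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumed search endpoint and threshold; adjust to the API documentation.
API_URL = "https://api-adresse.data.gouv.fr/search/"
MIN_SCORE = 0.82


def validate(payload: dict) -> dict:
    """Return the official label if the best match is confident enough."""
    features = payload.get("features", [])
    if not features:
        return {"valid": False, "reason": "no match"}
    # Pick the candidate with the highest confidence score.
    best = max(features, key=lambda f: f["properties"]["score"])
    props = best["properties"]
    if props["score"] <= MIN_SCORE:
        return {"valid": False, "reason": "low score", "score": props["score"]}
    return {"valid": True, "label": props["label"], "score": props["score"]}


def handler(event: dict, context=None) -> dict:
    """Lambda entry point; the event keys are hypothetical."""
    query = f"{event['street']} {event['postal_code']} {event['city']}"
    url = f"{API_URL}?{urllib.parse.urlencode({'q': query, 'limit': 1})}"
    with urllib.request.urlopen(url, timeout=5) as response:
        payload = json.load(response)
    return validate(payload)
```

Keeping `validate` separate from the HTTP call means the business logic can be unit tested with a plain dictionary, without touching the network.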
This solution is perfectly valid and gives you complete control over the API call and the business logic applied to the result.
But what if we could call our external API directly from the Step Functions state machine and get rid of this Lambda function? This has been possible since November of last year (source)…
Using API Gateway
You can actually invoke an API Gateway endpoint directly from Step Functions. To do so, you first need to create an HTTP proxy to the third-party API in API Gateway, and then have the state machine invoke it:
1. Create the HTTP Proxy
Using AWS Cloud Development Kit (CDK), you can define a simple HTTP API and a proxy like this:
2. Invoke the API from Step Functions
Using CDK, you can use the CallApiGatewayHttpApiEndpoint construct from @aws-cdk/aws-stepfunctions-tasks, just like that:
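Whatever the CDK code looks like, the task it produces in the state machine definition relies on the `arn:aws:states:::apigateway:invoke` resource. As a rough illustration, here is that task written as a Python dictionary in Amazon States Language form. The endpoint host, the `/search` path, the `$.address` input field and the state names are hypothetical.

```python
import json


def build_call_api_state(api_endpoint: str, next_state: str) -> dict:
    """Build an ASL Task state that calls an API Gateway endpoint directly."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::apigateway:invoke",
        "Parameters": {
            "ApiEndpoint": api_endpoint,  # host name only, no scheme
            "Method": "GET",
            "Path": "/search",  # hypothetical proxy route
            # Query parameter values are arrays; States.Array wraps the input string.
            "QueryParameters": {"q.$": "States.Array($.address)"},
            "AuthType": "NO_AUTH",
        },
        "Next": next_state,
    }


state = build_call_api_state(
    "abc123.execute-api.eu-west-1.amazonaws.com",  # hypothetical endpoint
    "Is address valid?",
)
print(json.dumps(state, indent=2))
```

The task output exposes the API response under `ResponseBody`, alongside `StatusCode` and `Headers`, which is what any following state would read from.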
And here we are!
Wait, wait, wait! Where is the business logic? There is no validation of the address here. Let’s complete the state machine.
3. Add the address validation logic
Indeed, we need to add a few states in order to validate the address and return the expected result:
And here is the CDK code to accomplish this:
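To give an idea of what those extra states amount to in the state machine definition, here is one possible shape, sketched as a Python dictionary in Amazon States Language form. It is an assumption, not necessarily the exact states used in this workflow: the state names, the `InvalidAddress` error and the JSON paths into `ResponseBody` are all hypothetical.

```python
from typing import Dict, List


def build_validation_states(min_score: float = 0.82) -> Dict[str, dict]:
    """States that check the first API result and extract its official label."""
    first = "$.ResponseBody.features[0]"
    return {
        "Is address valid?": {
            "Type": "Choice",
            "Choices": [{
                # Both conditions must hold: a result exists and its score is high enough.
                "And": [
                    {"Variable": first, "IsPresent": True},
                    {"Variable": f"{first}.properties.score",
                     "NumericGreaterThan": min_score},
                ],
                "Next": "Return official label",
            }],
            "Default": "Invalid address",
        },
        "Return official label": {
            "Type": "Pass",
            "Parameters": {"label.$": f"{first}.properties.label"},
            "End": True,
        },
        "Invalid address": {
            "Type": "Fail",
            "Error": "InvalidAddress",
            "Cause": "No address matched with a sufficient score",
        },
    }


def dangling_targets(states: Dict[str, dict]) -> List[str]:
    """List transition targets that do not point at a defined state."""
    targets = []
    for state in states.values():
        targets += [c["Next"] for c in state.get("Choices", [])]
        targets += [state[k] for k in ("Next", "Default") if k in state]
    return [t for t in targets if t not in states]
```

Each of these states is a billed transition in a standard workflow, which matters for the cost comparison.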
I’m using HTTP API because it offers up to 60% reduction in latency over REST API (source) and it is also cheaper: $1.11/million requests vs $3.50/million requests (source). And I don’t need the additional features provided by the REST API (throttling, caching, …). Have a look here for the differences.
Regarding Lambda, I’m using the arm architecture for the same reason: up to 34% better price performance (source). I’ve also used lambda-power-tuning, an open source tool that executes the Lambda function with different memory configurations, in order to find the optimal one. The execution time heavily depends on the 3rd party API, so after running lambda-power-tuning multiple times, it looks like the best configuration is 768MB.
Using a dataset of thousands of addresses, I wrote a load test (in python) that starts the execution of each state machine (in a random order, to avoid a potential cache effect from the 3rd party API). Running the test for about 20 minutes, I have the following execution times for my state machines (with Lambda in blue and API Gateway in orange):
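The core of such a load test can be sketched as follows. This is an illustration rather than the original script: `start_execution` is a hypothetical callable supplied by the caller (in practice it could wrap the boto3 Step Functions client's `start_execution` call), and the state machine identifiers are placeholders.

```python
import random
from typing import Callable, Iterable, List, Optional


def run_load_test(
    addresses: Iterable[str],
    state_machines: List[str],
    start_execution: Callable[[str, str], None],
    seed: Optional[int] = None,
) -> int:
    """Start one execution per (state machine, address) pair, in random order."""
    rng = random.Random(seed)
    jobs = [(sm, addr) for addr in addresses for sm in state_machines]
    # Shuffling avoids sending the same address twice in a row,
    # which could let the third-party API answer from a cache.
    rng.shuffle(jobs)
    for state_machine, address in jobs:
        start_execution(state_machine, address)
    return len(jobs)


# Usage with a stand-in that only records the calls.
calls = []
count = run_load_test(
    ["8 bd du Port Amiens", "1 rue de Rivoli Paris"],
    ["with-lambda", "with-api-gateway"],
    lambda sm, addr: calls.append((sm, addr)),
    seed=42,
)
```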
➔ Except during the peak in the middle, the difference is not obvious, but looking closer, the one with API Gateway is slightly faster than the one with Lambda (by less than 100 ms). Small advantage for the direct integration with API Gateway.
Indicated prices are for the Ireland region.
In order to calculate the price of the Lambda function, I need to know its execution time. Using CloudWatch Logs Insights, I can find the p50 duration (median), which is more representative than the average duration:
Let’s use 125ms as a baseline for our calculation. Based on the pricing page, the formula is the following (M = number of million requests, t = the execution time and m = the memory):
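A sketch of this calculation in Python, keeping the same symbols (M, t, m). The request price of $0.20 per million is an assumption taken from the standard Lambda pricing; the other constants are the free tier and the Arm price per GB-second.

```python
REQUEST_PRICE_PER_MILLION = 0.20       # USD, assumed Lambda request price
ARM_PRICE_PER_GB_SECOND = 0.0000133334  # USD, Arm architecture
FREE_REQUESTS_MILLIONS = 1              # free tier: 1 million requests / month
FREE_GB_SECONDS = 400_000               # free tier: 400,000 GB-seconds / month


def lambda_monthly_cost(M: float, t_ms: float, m_mb: float) -> float:
    """Monthly Lambda cost in USD for M million requests of t_ms each at m_mb memory."""
    request_cost = max(0.0, M - FREE_REQUESTS_MILLIONS) * REQUEST_PRICE_PER_MILLION
    # GB-seconds = invocations * duration in seconds * memory in GB
    gb_seconds = M * 1_000_000 * (t_ms / 1000) * (m_mb / 1024)
    compute_cost = max(0.0, gb_seconds - FREE_GB_SECONDS) * ARM_PRICE_PER_GB_SECOND
    return request_cost + compute_cost


print(round(lambda_monthly_cost(5, 125, 768), 2))  # 1.72
```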
Notes to understand this formula:
- The AWS Lambda free tier includes 1 million free requests per month and 400,000 GB-seconds of compute time per month.
- $0.0000133334 is the price per GB-second for the arm architecture (vs $0.0000166667 for x86).
For example with M=5 million requests a month, t=125ms, and m=768, it would give $1.72.
Looking at API Gateway, you pay $1.11 / million requests for the HTTP API. So for 5 million, it will be $5.55.
Looks like Lambda has some advantage here but let’s see with more requests:
We can see that at some point (around 17 million requests / month), the direct integration with API Gateway becomes cheaper than using a Lambda function. 17 million may seem like a lot, but depending on the business, you can get there quickly.
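The crossover point can be estimated with a small script. This is a sketch under the same assumptions as the rest of the comparison: Ireland prices, a 768 MB Arm function, an assumed $0.20 per million Lambda requests, and no free tier applied to the HTTP API.

```python
from typing import Optional


def lambda_cost(M: float, t_ms: float = 125, m_mb: float = 768) -> float:
    """Monthly Lambda cost in USD, free tier included (M in millions of requests)."""
    request_cost = max(0.0, M - 1) * 0.20  # assumed request price
    gb_seconds = M * 1_000_000 * (t_ms / 1000) * (m_mb / 1024)
    return request_cost + max(0.0, gb_seconds - 400_000) * 0.0000133334


def http_api_cost(M: float) -> float:
    """Monthly HTTP API cost in USD at $1.11 per million requests."""
    return M * 1.11


def break_even_millions(t_ms: float = 125, limit: int = 100) -> Optional[int]:
    """First whole number of million requests where Lambda costs more than the HTTP API."""
    for M in range(1, limit + 1):
        if lambda_cost(M, t_ms) > http_api_cost(M):
            return M
    return None


print(break_even_millions(125))  # 17
```

The duration is a parameter because the crossover is very sensitive to it: the longer the function runs, the earlier the HTTP API becomes the cheaper option.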
Also, my Lambda function is pretty quick. If I do the math with 250ms instead of 125ms, for example, API Gateway becomes more attractive at around 4 million requests. Anyway, don’t spend days comparing and trying to optimize, because at the end of the month, the biggest bill is you, the engineer who builds this, and by far!
I can see some of you grumbling in the dark: “grumpf 😡, you forgot Step Functions!”
Indeed, we need to evaluate the price of Step Functions, as there are 2 additional states to handle the validation logic:
The price for a standard workflow is $25 / million state transitions. We have 3 transitions with the Lambda integration and 5 with the API Gateway integration. And that makes a huge difference. For example, with 5 million executions, it would cost $375 for Lambda and $625 for API Gateway. The Lambda function and HTTP API prices are insignificant in comparison. And the gap obviously widens as the number of requests increases:
While the performance is slightly better for the direct integration with API Gateway, the price is significantly higher. So in this case it doesn’t make sense: the price/performance ratio is clearly in favor of the integration with Lambda.
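The arithmetic behind these figures, as a short sketch. It ignores the Step Functions free tier and only counts state transitions.

```python
PRICE_PER_MILLION_TRANSITIONS = 25.0  # USD, standard workflow


def standard_workflow_cost(executions_millions: float, transitions: int) -> float:
    """Monthly cost in USD of a standard workflow, counting state transitions only."""
    return executions_millions * transitions * PRICE_PER_MILLION_TRANSITIONS


print(standard_workflow_cost(5, 3))  # 375.0 (Lambda integration)
print(standard_workflow_cost(5, 5))  # 625.0 (API Gateway integration)
```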
Wait! That’s all? What was all the fuss about?
Edit (23/11/2021): I forgot (my bad) to mention express workflows, which are much cheaper than standard ones and probably better suited to heavy load (see blog post). The p50 execution time of my workflow is 292ms with API Gateway and 580ms with Lambda. The price for express workflows is $1 / million requests, plus a few cents per GB-hour. In our case, as the workflow with API Gateway is faster than the other, it will be cheaper in all cases, despite the number of states.
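A rough sketch of that comparison. The $0.06 per GB-hour rate and the 64 MB memory figure are assumptions used for illustration (check the Step Functions pricing page for your region), billing granularity is ignored, and the Lambda and HTTP API charges are left out.

```python
def express_workflow_cost(
    executions_millions: float,
    duration_ms: float,
    memory_mb: float = 64,            # assumed memory billed per execution
    price_per_gb_hour: float = 0.06,  # assumed duration price in USD
) -> float:
    """Monthly cost in USD of an express workflow: requests plus duration."""
    request_cost = executions_millions * 1.00  # $1 per million requests
    seconds = executions_millions * 1_000_000 * (duration_ms / 1000)
    gb_hours = seconds * (memory_mb / 1024) / 3600
    return request_cost + gb_hours * price_per_gb_hour


with_api_gateway = express_workflow_cost(5, 292)
with_lambda = express_workflow_cost(5, 580)
print(round(with_api_gateway, 2), round(with_lambda, 2))
```

With express workflows the number of states no longer appears in the formula: only the request count and the duration do, which is why the faster workflow comes out ahead.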
In this article, I’ve shown that the direct integration with API Gateway in Step Functions is quite competitive with the use of an intermediary Lambda function, both in terms of performance and potentially in terms of pricing. So if you need to retrieve some data from an API, it makes sense to use this direct integration. Be sure to use the HTTP API, which provides lower latency and better prices than the REST API.
But when it comes to implementing some business logic, we saw that adding states to our standard workflow to replace a few lines of code from a Lambda function has a dramatic impact on the overall cost. Furthermore, it is still limited in comparison to what can be achieved with a Lambda function. And you also need to take into consideration the testability of the business logic: it is far easier and faster to test a function with some unit tests than to run integration tests on a fully deployed state machine.
I’m not saying not to use Step Functions: it’s a great service, I love it, and even more so since the SDK integration arrived a few weeks ago. Just pay attention to how you use it and to the states you add. If you start implementing some “complex” business logic, consider adding a Lambda function to your workflow instead.
Edit (23/11/2021): If you can leverage express workflows instead of standard ones (see the comparison here; the main difference is the limitation to 5 minutes of execution), it is much more affordable, and direct integration makes total sense, even with a few additional states. I really love Step Functions!
I’m writing another article on this, hopefully for the AWS blog, stay tuned…
Source code is available on GitHub.