menez.io | webhooks are hard

It’s been a while since I began toying with the idea of hosting Soundbot on AWS Lambda. Going with a serverless platform seemed like the perfect fit for implementing a simple assistant bot. The bot would receive a short instruction in the form of a message, the Lambda would spring into action, do whatever it had been told to, report back to the user with some sort of confirmation, and die quietly.

It was a match made in heaven: no servers to maintain, no polling for unread messages, no load balancing; all my needs neatly taken care of by the magic of Amazon Web Services.

And you know what? I was mostly right.

serverful vs serverless

I have been using the (very nice) Telegram Bot API for this particular experiment, which offers two different ways of retrieving messages sent to the chatbot:

The first one is the garbage way for plebs I’ve been trying to avoid. By calling the getUpdates method on your bot’s private URL, you get back a JSON object containing all the messages it has received since you last checked for updates:

{
  "ok": true,
  "result": [
    {
      "update_id": 901915146,
      "message": {
        "message_id": ...,
        "from": {
          "id": ...,
          "is_bot": false,
          "first_name": "Vinicius",
          "last_name": "Menezio",
          "username": "...",
          "language_code": "pt-BR"
        },
        "chat": {
          "id": ...,
          "first_name": "Vinicius",
          "last_name": "Menezio",
          "username": "...",
          "type": "private"
        },
        "date": 1520212244,
        "text": "What's up, my bot?"
      }
    },
    ...
  ]
}

So I could set up a 24/7 server and poll for these updates every so often, responding to them in batches as they came in. Except this made no sense at all for my use case: my bot would only respond to me, and I would be pinging it at most a few times a day, so leaving a server running all the time would be way overkill.
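For reference, here is a rough sketch of what that polling loop could look like, using only the standard library. The helper names (fetch_updates, next_offset, poll_forever) and the handle callback are made up for illustration; the only things taken from the Bot API are the getUpdates method, its offset and timeout parameters, and the "ok"/"result" envelope shown above.

```python
import json
import time
import urllib.parse
import urllib.request

API_ROOT = "https://api.telegram.org/bot{token}/{method}"


def fetch_updates(token, offset=None, timeout=30):
    """Call getUpdates once and return the list stored under "result"."""
    params = {"timeout": timeout}
    if offset is not None:
        params["offset"] = offset
    url = API_ROOT.format(token=token, method="getUpdates")
    url += "?" + urllib.parse.urlencode(params)
    # Give the socket a little longer than the long-poll timeout.
    with urllib.request.urlopen(url, timeout=timeout + 5) as response:
        payload = json.load(response)
    return payload["result"] if payload.get("ok") else []


def next_offset(updates, current=None):
    """Updates count as confirmed once offset is higher than their update_id."""
    if not updates:
        return current
    return max(update["update_id"] for update in updates) + 1


def poll_forever(token, handle, pause=1.0):
    """Hypothetical main loop: this is the process that has to stay up 24/7."""
    offset = None
    while True:
        updates = fetch_updates(token, offset)
        for update in updates:
            handle(update)
        offset = next_offset(updates, offset)
        time.sleep(pause)
```

The loop itself is trivial; the cost is that something has to keep running it around the clock.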

The second way is just what I was looking for. With the setWebhook method, I could give Telegram a URL to call whenever a new message came in. I could hook this up to an HTTP endpoint on AWS and put the bot’s code in a Lambda function, so that it would only run when it was strictly needed. All very efficient, very cheap, and very straightforward.

Or so I thought.

the catch

The thing is, I can’t hook my Telegram bot up to AWS Lambda directly because Lambdas are not accessible from outside of AWS. Instead, I need a second Amazon service to do the hooking up for me: AWS API Gateway. The whole point of this service is to act as an endpoint for when you have stuff outside AWS trying to talk to stuff inside it.

And while this was confusing at first, it wasn’t a deal breaker by any means. I just had to set up an API Gateway instance, point it to my Lambda and all would be well. Sorta.

You see, a Lambda on AWS is technically a single function. In practice it really isn’t though, because this function can call many other functions, stored or not in other files, defined or not by different classes. So it is as much a “single function” as any program that has a main() function as an entry point.

However, it is a single function in that API Gateway expects it to behave as one. Any HTTP request made to a Gateway that’s hooked up to a Lambda gets its response from the return value of the Lambda. As such, the entry function of the Lambda has to be formatted like so:

lambda.py

def lambda_handler(event, context):

    # actually handle the message

    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json"
        },
        "body": "all is good"
    }
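The "actually handle the message" part has to start by digging the Telegram update out of event. A minimal sketch, assuming the Gateway is set up with Lambda proxy integration, in which case the raw request body arrives as a string under event["body"] (extract_message is a name I made up for this example):

```python
import base64
import json


def extract_message(event):
    """Return (chat_id, text) from a webhook event, or None if there is no text message."""
    body = event.get("body") or "{}"
    # Proxy integration flags binary payloads; JSON bodies normally arrive as plain text.
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    update = json.loads(body)
    message = update.get("message")
    if not message or "text" not in message:
        return None
    return message["chat"]["id"], message["text"]
```

Returning None for anything that isn’t a plain text message lets the handler still answer with a 200 without trying to process it.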

Now, here’s the catch: since there’s an HTTP request waiting for a response the whole time the Lambda is running, this execution time can’t really be arbitrarily long, otherwise the request could time out.

API Gateway has a default (and maximum) timeout of 29 seconds. And Telegram will wait for a minute or so for a 200 HTTP response, before firing again at the webhook. So, if my Lambda finishes in 30 seconds or more, even if it responds with a code 200, it’s too late, because that request has been dropped and another one is about to be made, just to wait again for 30 more seconds for a new OK response which will never come, before making yet another request. This means Telegram keeps setting off my Lambda indefinitely (I mean, I assume there is a limit, but I never let it run long enough to reach it), and my inbox ends up looking like this:

Whoops.

To fix it I would have to delete the webhook, manually consume the pending updates, and reinstate the webhook. A huge pain in the neck.
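Those three steps can be sketched roughly as follows. deleteWebhook, getUpdates and setWebhook are real Bot API methods; call_telegram and reset_webhook are hypothetical helpers written for this example, and the token and webhook URL are whatever your own bot uses.

```python
import json
import urllib.parse
import urllib.request


def call_telegram(token, method, **params):
    """Hypothetical helper: POST form-encoded params to one Bot API method."""
    url = "https://api.telegram.org/bot{}/{}".format(token, method)
    data = urllib.parse.urlencode(params).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=30) as response:
        return json.load(response)


def reset_webhook(webhook_url, call):
    """Delete the webhook, drain the backlog, then register the webhook again."""
    call("deleteWebhook")
    offset = None
    dropped = 0
    while True:
        params = {} if offset is None else {"offset": offset}
        updates = call("getUpdates", **params).get("result", [])
        if not updates:
            break
        dropped += len(updates)
        # Asking for max(update_id) + 1 confirms everything fetched so far.
        offset = max(u["update_id"] for u in updates) + 1
    call("setWebhook", url=webhook_url)
    return dropped
```

With a real token this would be called as reset_webhook(url, functools.partial(call_telegram, token)). Passing call in as an argument also makes it possible to try the logic against a fake before pointing it at Telegram.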

the cheat

So, how do I get around this issue? I couldn’t really do anything to guarantee my Lambda would finish within the timeout every time, since it’s doing pretty intensive audio and video processing work (as explained in my last post). To prevent Telegram from spinning out of control I would have to let it know everything was fine before actually finishing the task at hand. But since the HTTP response is tied to the return value of the Lambda function, and since Amazon kills my Lambda instance and its whole environment as soon as it returns (preventing me from doing anything clever with threading or spawning a second process in the background), I was out of luck.

Or was I?

Well, I knew I couldn’t start another process, since it would die with the Lambda, but there was nothing in the rulebook saying ~~a dog can’t play basketball~~ I couldn’t run a second Lambda instance.

This second instance would act as an intermediary between API Gateway and the actual guts of my bot, like so:

Making use of the Lambda.Client.invoke() method I can launch a Lambda from within another Lambda. And by using the "Event" invocation type, I get to do so asynchronously. So the Lambda function triggered by my API Gateway webhook ends up looking like this:

dummy_lambda.py

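import json

import boto3

# Created once per container so warm invocations can reuse it.
lambda_client = boto3.client("lambda")


def lambda_handler(event, context):
    # "Event" queues the call and returns immediately, without
    # waiting for actual_lambda to finish.
    lambda_client.invoke(
        FunctionName="actual_lambda",
        InvocationType="Event",
        Payload=json.dumps(event),
    )

    # Answer Telegram right away so it does not retry the webhook.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": "all is good"
    }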

Meanwhile, the rest of my code is hidden away inside the actual_lambda package, in another Lambda entirely.
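To give an idea of the receiving end, here is a stripped-down sketch of what the entry point of actual_lambda might look like. It is not the real bot code: build_reply and the "done" text are placeholders, and it assumes the forwarded event still carries the Telegram update as a JSON string under "body", as it would with Lambda proxy integration.

```python
import json


def build_reply(event, text="done"):
    """Build sendMessage parameters for the chat that sent the forwarded update."""
    update = json.loads(event.get("body") or "{}")
    chat_id = update["message"]["chat"]["id"]
    return {"chat_id": chat_id, "text": text}


def lambda_handler(event, context):
    # `event` is the same dict API Gateway handed to the first Lambda,
    # because it was forwarded unchanged as the invocation payload.
    # ... the slow audio and video work would go here ...
    reply = build_reply(event)
    # Posting `reply` to the Bot API's sendMessage method is left out of this sketch.
    return reply
```

Since the invocation is asynchronous, nobody is waiting on this function’s return value, so any confirmation has to reach the user through a separate call to Telegram.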

Now Telegram gets its precious code 200 response almost immediately upon accessing the webhook, and my lengthy process can run in the background, no longer pressured by unreasonable time constraints.