Webhooks are important in modern web architectures, and are a fantastic alternative to costly polling, WebSockets, or server-sent events. They allow your application (the producer) to notify another application (the consumer) when an event occurs.
Unfortunately, there are some pitfalls that teams run into when aiming to build world-class systems like those deployed by companies such as Stripe and Shopify. It’s important to explore these in order to avoid building a flawed (or even insecure) system.
Sending in the Request Handler
The simplest webhook system (and the first one some teams build) isn't much of a system at all: it just performs an HTTP request from within the request handler or controller of your application server.
In the best case, this delays your application's response (by waiting for the webhook to finish up). In the worst case, an exception could be thrown by a failed webhook request—possibly causing an otherwise-healthy application request to return an internal server error.
Queueing
A production webhook system should enqueue a job or an event to perform webhook requests. Libraries like Oban, Sidekiq, and Celery are often used to move tasks to an asynchronous background worker. If those libraries don't fit your application, consider building directly on top of a queue like RabbitMQ, Redis (Pub/Sub or Lists), Kafka, Amazon SQS, or Google Cloud Pub/Sub. Whatever you do, use a queue.
By moving the webhook request into a queueing system, you will be able to improve the performance and reliability of your web applications—all while ensuring that webhook requests are eventually made.
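As a rough illustration of the shape of this pattern, here is a minimal in-process sketch using Python's standard library. The `queue.Queue` stands in for a real job queue, and the names `handle_order_created`, `send_webhook`, and the endpoint URL are hypothetical. A production system would use a durable queue so that jobs survive a restart.

```python
import queue
import threading

# In-process stand-in for a real job queue (Sidekiq, Celery, SQS, ...).
webhook_jobs = queue.Queue()
delivered = []  # records what the worker "sent", for illustration only


def send_webhook(url, payload):
    # Placeholder for the real HTTP request to the consumer.
    delivered.append((url, payload))


def worker():
    while True:
        job = webhook_jobs.get()
        if job is None:  # sentinel value: shut the worker down
            webhook_jobs.task_done()
            break
        try:
            send_webhook(*job)
        except Exception:
            # A failed delivery stays in the worker; it can never turn
            # an otherwise-healthy application request into a 500.
            pass
        finally:
            webhook_jobs.task_done()


def handle_order_created(order_id):
    # The request handler only enqueues; it never waits on the consumer.
    payload = {"type": "order.created", "id": order_id}
    webhook_jobs.put(("https://consumer.example/hooks", payload))
    return 201


threading.Thread(target=worker, daemon=True).start()
```

The handler returns as soon as the job is enqueued, so a slow or failing consumer no longer affects the response time of the producer.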
Fire and Forget
A naive webhook system makes a single request to an endpoint, ignoring the response. While this can be a good first step for prototyping, it is far from "production-grade."
Retry
A production webhook system should retry until the receiver returns a 200-level success status code, or until a certain number of retries has been reached.
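The stopping rule can be sketched in a few lines. In this sketch `send` is a hypothetical callable that performs the request and returns the HTTP status code, and `max_attempts` is an assumed limit. A real system would schedule each retry as its own delayed job rather than looping in place.

```python
def deliver_with_retries(send, max_attempts=5):
    """Call send() until it returns a 2xx status or attempts run out.

    Returns the attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status = send()
        except OSError:
            # Connection refused, timeout, DNS failure, and similar.
            status = None
        if status is not None and 200 <= status < 300:
            return attempt
    return None  # gave up; the caller can park the event for inspection
```

For example, a consumer that answers 500, 503, and then 200 is considered delivered on the third attempt.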
Fixed Retries
It's a fact of life—requests don't always succeed. No system has 100% uptime. The consumers of your webhooks are no exception.
So, you build a retry system. It sends the request every five minutes until there is a success.
Great, right?
Wrong. Sometimes errors are temporary. A request triggered a few milliseconds later could succeed. Don't punish your customers by forcing them to wait minutes for an event that could be received more quickly. It could even lead to an increase of out-of-order events (a topic that is explored later on in this guide).
On the flip side, making too many requests could cause a flood of traffic that makes a consumer's recovery more difficult. It could be a waste of your system's valuable (and expensive) outbound bandwidth.
Backoff
A production webhook system should leverage a backoff strategy in its retries. A common strategy is an exponential backoff, in which the retry delays increase exponentially.
The specific backoff timing configuration should be tailored to the expectations of your customers or norms within your industry. Test them out. Iterate based on data that you gather. It can also be valuable to introduce jitter to the backoff, in order to prevent a thundering herd.
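One common way to combine the two ideas is exponential backoff with "full jitter", where the delay is drawn at random between zero and an exponentially growing ceiling. The sketch below assumes a base delay of one second and a cap of one hour; both values are placeholders to tune for your own consumers.

```python
import random


def backoff_delay(attempt, base=1.0, cap=3600.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based).

    The ceiling doubles with every attempt (base * 2^(attempt - 1)) up
    to `cap`. The actual delay is drawn uniformly from [0, ceiling] so
    that many failing deliveries do not all retry at the same instant.
    """
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(0, ceiling)
```

With these defaults the ceilings are 1, 2, 4, 8, ... seconds, so the first retries happen almost immediately while later ones spread out, up to the one-hour cap.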
Allow Insecure Recipients
A naive webhook system will send a request to any registered URL, even if the endpoint is insecure. It doesn't validate the recipient's TLS certificate, if there even is one.
Require Secure Recipients
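A production webhook system should require that the recipient uses a secure endpoint. This pattern is simple to implement. Just follow these two steps: only accept endpoint URLs that use HTTPS, and validate the recipient's TLS certificate on every request.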
Services like Let’s Encrypt have made insecure HTTP endpoints a thing of the past. These days, it is easy and free to get a TLS certificate issued. There's no excuse to send your data to an insecure endpoint.
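A minimal sketch of enforcing this with Python's standard library follows. The function name `validate_endpoint_url` is hypothetical; `ssl.create_default_context()` is the standard way to get a context that verifies both the certificate chain and the hostname.

```python
import ssl
from urllib.parse import urlsplit


def validate_endpoint_url(url):
    """Reject endpoint URLs that are not HTTPS (run at registration time)."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("webhook endpoints must use https://")
    if not parts.hostname:
        raise ValueError("webhook endpoint URL has no host")
    return url


# The default context requires a valid certificate chain and checks that
# the certificate matches the hostname. Pass it to the HTTP client that
# sends the webhook, and never disable verification per consumer.
tls_context = ssl.create_default_context()
```

Checking the scheme when the endpoint is registered gives the consumer immediate feedback, while certificate validation at send time catches certificates that expire later.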
Unstoppable Webhooks
A naive webhook system has no facility for stopping or pausing requests and retries to a consumer's endpoint. Messages are enqueued, sent, and retried indefinitely, even if the consumer asks for the barrage of requests to be stopped while they are down for maintenance or experiencing an outage.
Stoppable, Restartable Webhooks
A production webhook system allows for a consumer's configuration to be updated such that requests can be discarded instead of hammering the consumer's endpoint. This should be configurable on-demand, without requiring a re-deployment of the producer application.
Ideally, this system also allows for the consumer to be re-enabled—at which point new events will be sent as they were before.
Pausable, Resumable Webhooks
A better production webhook system allows for a consumer's configuration to be updated such that requests can be paused. Like stoppable webhooks, this should be configurable on-demand, without requiring a re-deployment of the producer application.
Unlike stoppable webhooks, the events enqueued during the pause should not be discarded. These events should be stored—either in the queue or in a database—for a future time when the consumer resumes webhooks.
When the consumer's webhooks are resumed, the system should send the events that were enqueued during the pause. Special care should be taken to ensure that the paused events are not sent simultaneously, as it could be a substantial volume of traffic that could harm the consumer. Use a queue, use retries, use backoff, and get those events sent out!
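The pause, hold, and gradual drain behavior described above could look something like the sketch below. The class `ConsumerDelivery` and its method names are hypothetical, and the held events live in memory here, whereas a real system would keep them in the queue or a database.

```python
from collections import deque


class ConsumerDelivery:
    """Per-consumer delivery state with pause and resume."""

    def __init__(self, send):
        self.send = send      # callable that delivers one event
        self.paused = False
        self.held = deque()   # events retained while paused

    def publish(self, event):
        # While paused, or while a backlog is still draining, hold the
        # event so that it is neither discarded nor sent out of order.
        if self.paused or self.held:
            self.held.append(event)
        else:
            self.send(event)

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def drain(self, batch_size=10):
        """Send up to batch_size held events, oldest first.

        Returns the number of events still held. Calling this on a
        schedule releases the backlog gradually instead of all at once.
        """
        if self.paused:
            return len(self.held)
        for _ in range(min(batch_size, len(self.held))):
            self.send(self.held.popleft())
        return len(self.held)
```

Draining in small batches is one simple way to avoid flooding a consumer that has just come back online; rate limiting per consumer would be another.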
Irrelevant Events
A naive webhook system will send every type of event to every consumer. As new event types are added, those too will be sent out. Hopefully the consumer can gracefully handle events of a type they are not familiar with. Hopefully it's not a waste of time and money to perform all of these useless requests.
Relevant Events
A production webhook system allows consumers to be configured with a set of relevant event types. Events that are not on the list aren't sent to the consumer.
Fewer events will be sent through your system, fewer requests will be made to the consumer, and less of your precious data will stream out of your database into the hands of a third party.
In addition to making the system more efficient, there is a potential for improved security. This is similar to adding scopes to an API. Knowing and controlling the types of data exposed to each consumer can be important for regulatory, compliance, and data breach response purposes.
As with many of these patterns, the list of events relevant to a consumer should be configurable on-demand—without requiring a re-deployment of the producer application.
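At its core this is a lookup from event type to subscribed consumers. The sketch below keeps the subscriptions in a dictionary for brevity; the consumer names and event types are made up, and in practice the mapping would be loaded from a database or configuration store so it can change without a deployment.

```python
# Hypothetical per-consumer subscriptions.
subscriptions = {
    "consumer_a": {"invoice.paid", "invoice.failed"},
    "consumer_b": {"customer.created"},
}


def recipients_for(event_type, subscriptions):
    """Return the consumers that subscribed to this event type."""
    return sorted(
        consumer
        for consumer, event_types in subscriptions.items()
        if event_type in event_types
    )
```

An event type that nobody subscribed to yields an empty list, so no job is enqueued and no data leaves the system.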
Unsigned Request
A naive webhook system might send requests without any way to assert authenticity to the consumer. This means that anyone on the internet could make a POST request to the consumer’s endpoint—and be trusted.
Pre-Shared Key
A production webhook system could send a pre-shared key (uniquely generated for each consumer) in a header of each request. By doing this, the attack surface is dramatically reduced—only callers who know the key will be trusted by the consumer.
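On the consumer side, checking the key might look like the sketch below. The header name `X-Webhook-Key` is an assumption for illustration. The comparison uses `hmac.compare_digest` so that it takes the same time regardless of how many characters match.

```python
import hmac


def is_trusted(headers, expected_key):
    """Check the pre-shared key presented in the request headers."""
    presented = headers.get("X-Webhook-Key", "")
    # Constant-time comparison avoids leaking the key through timing.
    return hmac.compare_digest(presented.encode(), expected_key.encode())
```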
Signed Requests
A better production webhook system uses a key to sign the body it intends to send, using an algorithm like HMAC SHA-256. This signature is to be sent in a header, similar to the pre-shared key pattern above. When receiving the request, the consumer will perform the same signing ceremony and compare the result with the request’s signature header.
This pattern has similar security properties to the pre-shared key, but the key is not sent in the request. This ensures that key material is not present in any request logs.
To protect against replay attacks, a timestamp can be added to the request headers and signature input. Once the request is validated, a consumer can decide to only permit requests with a recent timestamp. A similar approach is used in Amazon S3 presigned URLs.
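A minimal sketch of signing and verification with a timestamp, using Python's standard `hmac` and `hashlib` modules, is shown here. The message layout (`timestamp.body`) and the five-minute tolerance are assumptions made for this example, not a standard; producer and consumer only need to agree on them.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # assumed replay window


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Producer side: HMAC SHA-256 over the timestamp and the raw body."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, timestamp, body, signature, now=None):
    """Consumer side: recompute the signature and check freshness."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old (or too far in the future): possible replay
    expected = sign(secret, timestamp, body)
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Because the timestamp is part of the signed message, an attacker cannot refresh an old request by editing the timestamp header. The consumer should verify against the raw request bytes, since re-serializing parsed JSON may change the body.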
Unversioned Payload Schemas
A naive webhook system will send requests with the latest schema to all consumers. When breaking changes are made to the objects being sent, it’s the consumer’s responsibility to figure out what you changed and deal with it.
Versioning
A production webhook system has payload schemas that evolve with the application, but establishes a new version any time there is a breaking change.
Each consumer has a schema version associated with its configuration. This version can be updated on-demand, without requiring a re-deployment of the producer application. The consumer is free to update their code to accommodate new payloads before having to receive them.
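One way to structure this is a renderer per schema version, selected by the version stored on the consumer's configuration. Everything in the sketch below is hypothetical: the date-style version labels, the order fields, and the breaking change itself (moving from a decimal amount to integer cents).

```python
def render_v1(order):
    # Original schema: amount as a decimal number.
    return {"id": order["id"], "amount": order["amount_cents"] / 100}


def render_v2(order):
    # Breaking change: integer cents plus an explicit currency.
    return {
        "id": order["id"],
        "amount_cents": order["amount_cents"],
        "currency": order["currency"],
    }


RENDERERS = {"2023-01-01": render_v1, "2024-01-01": render_v2}

# Stored with each consumer's configuration; changeable on demand.
consumer_versions = {"consumer_a": "2023-01-01", "consumer_b": "2024-01-01"}


def payload_for(consumer_id, order):
    """Build the webhook payload in the consumer's chosen schema version."""
    version = consumer_versions[consumer_id]
    return {"version": version, "data": RENDERERS[version](order)}
```

Including the version in the payload lets the consumer confirm which schema they are receiving while they migrate.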
Breaking changes can be harder to make in webhook systems than typical APIs, since there is no “caller” and there are no headers on which to choose a version. By allowing the consumer to change the schema version they receive on their own schedule, your team can confidently roll out changes without being blocked on consumers (or breaking integrations).
It should be noted that versions can (and should) be deprecated according to the needs of your business. It can be expensive keeping every old version around forever, so be sure to encourage your consumers to update.
Assuming the System Works
The importance of reliable webhook systems should be clear. They are the key to partnerships with other technology companies. They are valuable utilities that unlock countless off-platform capabilities. They are the basis of successful ecosystem plays that reduce customer churn.
A naive webhook system is unmonitored. It is written once, and assumed to be reliable and function with low latency. Escalated support tickets are the only thing that causes the engineering team to look into issues.
Customers know about webhook outages before your team does.
Webhook Monitoring
A production webhook system emits a few key metrics that can be observed to ensure that the system hasn’t completely fallen over.
These metrics typically include queue depth and request volume.
When the queue depth gets too high for too long, consumers might experience delayed event deliveries. If request volume suddenly changes, it could be an indication that an existing webhook-triggering path has changed—sometimes for the worse.
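The "too high for too long" condition can be expressed as a check over recent queue-depth samples. In this sketch the function name, the threshold, and the number of consecutive samples are placeholders; most metrics platforms offer an equivalent alert rule, so the code is only meant to make the condition concrete.

```python
def depth_alert(samples, threshold, sustained):
    """True if the last `sustained` samples all exceed `threshold`.

    `samples` is a list of queue depths, oldest first, and `sustained`
    must be at least 1. A single spike does not trigger the alert.
    """
    recent = samples[-sustained:]
    return len(recent) == sustained and all(d > threshold for d in recent)
```

Requiring several consecutive samples above the threshold keeps short bursts from paging anyone while still catching a queue that is not draining.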
Active Webhook Monitoring
The best production webhook systems use Deliver—the active webhook monitoring solution.
We take an active approach to monitoring your webhook system, triggering events via API requests and observing the event (or lack thereof).
When your webhook system fails to deliver, a notification is sent to your engineering team so they can diagnose the issue—before your customers notice.