
Guarding the Gate: Mitigating DoS Attacks on Large Language Models

By Ryan Sevey

August 10, 2023   ·   11 min read

In the realm of artificial intelligence, Large Language Models (LLMs) like OpenAI’s GPT-4 have revolutionized our ability to generate human-like text on a massive scale. With the ability to process and generate extensive amounts of text, these models have found applications across industries, from business and academia to creative arts and more. However, this power comes with risks, particularly in the form of Denial of Service (DoS) attacks.

This article explores the vulnerability of systems utilizing LLMs to potential DoS scenarios and outlines strategies to mitigate such risks. While the focus is on OpenAI’s API, the insights and lessons drawn are applicable to any large language model, whether self-hosted or provided through a cloud service. The considerations span across varying degrees of severity, with unique nuances based on the hosting model. For instance, hosting your own LLM might lower the overall risk in terms of cost per request, but it is essential to recognize that there will still be associated costs and potential vulnerabilities.


Consider the table below, which illustrates OpenAI’s rate limits. We will concentrate on the constraints applicable to paid accounts after 48 hours of usage.

At first glance, the limit of 350,000 Tokens Per Minute (TPM) may seem substantial. However, without proper safeguards, an attacker could exploit the model’s ability to handle ~32,000 tokens per request (for the 32k versions of GPT-4) or ~8,000 tokens for most other models. A calculated assault would need only 44 requests of 8,000 tokens per minute to render the system inoperable for other users (44 × 8,000 = 352,000 tokens, just over the limit). This equates to less than one request per second, a surprisingly feasible task when compared to the seemingly daunting task of making 3,500 requests per minute.

Moreover, the cost implications are significant. At GPT-4’s pricing of $0.06 per 1,000 tokens, a determined attacker could incur charges of roughly $21 per minute (350,000 tokens × $0.06 per 1,000), compounding to a staggering $30,240 per day.
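The arithmetic above can be sketched directly. A quick back-of-the-envelope script, using the limits and prices quoted in this article (OpenAI’s actual figures change over time):

```python
import math

# Figures quoted above; OpenAI's real limits and prices change over time
TPM_LIMIT = 350_000          # tokens per minute (paid account after 48 hours)
TOKENS_PER_REQUEST = 8_000   # typical maximum for non-32k models
PRICE_PER_1K_TOKENS = 0.06   # GPT-4 pricing, USD

# Requests needed each minute to saturate the token budget
requests_to_saturate = math.ceil(TPM_LIMIT / TOKENS_PER_REQUEST)

# Cost to the victim if the attacker sustains the full token budget
cost_per_minute = TPM_LIMIT / 1000 * PRICE_PER_1K_TOKENS
cost_per_day = cost_per_minute * 60 * 24

print(requests_to_saturate)  # 44
print(cost_per_day)          # 30240.0
```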

These figures illuminate the profound vulnerability inherent in such systems, necessitating a comprehensive approach to security. The subsequent sections of this article will delve into various mitigation techniques to fortify these critical tools against potential abuse.

Practical Demonstration: Mitigating DoS Vulnerabilities in LLMs

To fully grasp the potential vulnerabilities and the measures that can be taken to mitigate them, we’ve created a collection of sample applications that demonstrate different scenarios of denial-of-service risks and their solutions.

The following code snippet provides a basic example of a Flask application integrating OpenAI’s GPT-4, configured for a maximum token size of 32,000. This minimal configuration represents a typical usage case but has very limited safeguards against DoS attacks:

from flask import Flask, request, jsonify
import openai

# Replace with your OpenAI API key
openai_api_key = 'YOUR-API-KEY-HERE'

# Configure OpenAI
openai.api_key = openai_api_key

app = Flask(__name__)

@app.route('/ask', methods=['POST'])
def ask_gpt4():
    question = request.json.get('question', '')

    # Note: len() counts characters, which is only a rough proxy for tokens
    if len(question) > 32000:
        return jsonify(error="Question exceeds maximum token size"), 400

    try:
        # GPT-4 models are served through the chat completions endpoint
        response = openai.ChatCompletion.create(
            model="gpt-4-32k",  # (gpt-4, gpt-3.5-turbo, etc.)
            messages=[{"role": "user", "content": question}]
        )
        return jsonify(answer=response.choices[0].message.content)
    except Exception as e:
        return jsonify(error=str(e)), 500

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080)

The code serves as a starting point for understanding how LLMs might be deployed and how they might be exploited. It does not include important protections such as rate limiting, adaptive rate limiting, or monitoring and alerting.

For those interested in delving deeper into various security measures, the full code repository, including examples of different mitigation steps, can be found at MantiumAI’s GitHub Repository. The examples provide hands-on insight into applying real-world security practices to protect LLM-enabled applications from potential denial-of-service attacks.

By studying and implementing these examples, developers and security professionals can gain a practical understanding of the risks involved and the ways to mitigate them, enhancing the security posture of applications utilizing large language models.


Rate Limiting

Rate limiting is an essential tactic to prevent DoS attacks by limiting the number of requests a user (or IP address) can make within a specified time frame. By setting a threshold, you can ensure that a single actor cannot overwhelm the system. Different strategies can be applied, such as fixed limits, dynamic limits based on user behavior, or tiered limits based on user type or subscription level.

from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(app, key_func=get_remote_address)

@app.route('/ask', methods=['POST'])
@limiter.limit("10 per minute")
def ask_gpt4():
    # Your code here
    ...

Rate limiting offers flexibility, allowing you to tailor the limits based on the specific needs and risks of your application. It is also compatible with various backend stores, enabling more complex setups with distributed rate limiting across multiple instances.
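The tiered-limits idea mentioned above can also be sketched without any framework. Below is a hypothetical fixed-window counter (the tier names and quotas are invented for illustration); in production you would let Flask-Limiter or your gateway handle this:

```python
import time
from collections import defaultdict

# Hypothetical per-tier quotas, in requests per minute (illustrative values)
TIER_LIMITS = {"free": 10, "pro": 60}

class TieredRateLimiter:
    """Fixed-window rate limiter keyed by client, with per-tier quotas."""

    def __init__(self):
        # key -> (window_index, request_count)
        self.windows = defaultdict(lambda: (-1, 0))

    def allow(self, key, tier="free", now=None):
        now = time.time() if now is None else now
        window = int(now // 60)
        start, count = self.windows[key]
        if start != window:  # new minute: reset the counter
            start, count = window, 0
        count += 1
        self.windows[key] = (start, count)
        return count <= TIER_LIMITS.get(tier, TIER_LIMITS["free"])
```

A free-tier client is cut off after 10 requests in the same minute, while a pro client gets 60; swapping the fixed window for a sliding window or token bucket changes the burst behavior but not the idea.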

For more detailed examples, including more complex configurations and strategies, please refer to the code repository at Mantium’s GitHub Repository.

Monitoring and Alerting

Monitoring involves keeping a close eye on relevant metrics, such as request rates, token usage, error rates, and response times. By collecting and analyzing these data points, unusual or suspicious patterns can be detected, often before they escalate into a full-blown attack.

Alerting adds an additional layer of responsiveness. By configuring automated alerts based on specific thresholds or anomalies, administrators can be promptly notified of potential issues. This enables rapid response, often minimizing damage or preventing an attack altogether.

Here’s an example of how you might set up monitoring and alerting within a Flask application:

import logging
from flask import request

@app.before_request
def log_request():
    logging.info(
        f"Request from {request.remote_addr}: {request.path}, "
        f"Tokens used: {len(request.json.get('question', ''))}"
    )

Integration with Monitoring Tools: You can integrate with monitoring platforms like Prometheus, Datadog, etc. to gather and visualize metrics.

Alerting Configuration: Depending on your monitoring system, you can set up alerts for unusual patterns, like a sudden spike in request rates or error rates. This can be configured within the monitoring tool’s interface.
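As a sketch of the alerting idea, here is a minimal sliding-window spike detector in plain Python. The threshold and window are illustrative; in practice the equivalent rule would live in Prometheus, Datadog, or a similar tool:

```python
from collections import deque

class SpikeDetector:
    """Fires when more than `threshold` events land inside a sliding window."""

    def __init__(self, threshold, window_seconds=60):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()

    def record(self, now):
        """Record one request at time `now`; return True if an alert should fire."""
        self.events.append(now)
        # Drop events that have aged out of the window
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        return len(self.events) > self.threshold
```

Wired into a request-logging hook, `record(time.time())` returning True could page an on-call engineer or automatically tighten rate limits.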

For a more extensive example, check out the full code in Mantium’s GitHub Repository.

Content-Length Limitations

Security is often about recognizing that sometimes, less is more. By limiting the size of the requests, we can prevent an attacker from sending excessively large requests that might consume a disproportionate amount of resources, effectively creating a Denial of Service (DoS) attack. Content-Length limitations serve as a helpful supplement to other mitigation strategies, adding a barrier against attempts to overwhelm the system.

Here’s an example of how to implement Content-Length limitations in a Flask application:

from flask import Flask, request, jsonify
from flask_limiter import Limiter
import openai

app = Flask(__name__)

# Set up rate limiting
limiter = Limiter(app, key_func=lambda: request.remote_addr, default_limits=["5 per minute"])

# Content-Length limit in bytes; 32,000 tokens of English text should
# generally fit within this, though the exact byte-to-token ratio varies
MAX_CONTENT_LENGTH = 128 * 1024

@app.before_request
def content_length_limit():
    if request.content_length and request.content_length > MAX_CONTENT_LENGTH:
        return jsonify(error="Content-Length exceeds maximum size"), 400

In this example, the content_length_limit function checks if the incoming request’s content length exceeds a predefined maximum size. If it does, it returns an error response, effectively blocking any request that’s too large.
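Choosing the byte cap itself involves a rough conversion: English text averages on the order of four bytes per token, so a hypothetical sizing helper (the heuristic constant is an assumption, not an exact rule) might look like:

```python
# Rough heuristic: ~4 bytes of English text per token; real ratios vary
# by language and content, so treat this as an upper-bound estimate only.
AVG_BYTES_PER_TOKEN = 4

def estimated_content_cap(token_budget, overhead_bytes=1024):
    """Estimate a Content-Length cap (bytes) for a given token budget,
    leaving some headroom for JSON framing around the question text."""
    return token_budget * AVG_BYTES_PER_TOKEN + overhead_bytes

# A 32,000-token budget maps to roughly 128 KB of request body
print(estimated_content_cap(32_000))  # 129024
```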

For the full code and more in-depth examples, visit the Mantium’s GitHub Repository.

Adaptive Rate Limiting

Adaptive rate limiting offers a dynamic response that goes beyond simple request quotas, adjusting the rate limits based on observed behavior and other contextual information. This allows for more precise control and can detect and mitigate attacks more effectively, without unduly limiting legitimate users.

Here’s an example of how to implement adaptive rate limiting in a Flask application using the Flask-Limiter package:

from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import time

app = Flask(__name__)

# Basic in-memory store to track request times by IP
request_times = {}

# Flask-Limiter setup
limiter = Limiter(app, key_func=get_remote_address)

def adaptive_rate_limit():
    remote_addr = get_remote_address()
    now = time.time()

    # Track request times in a list
    if remote_addr not in request_times:
        request_times[remote_addr] = []
    request_times[remote_addr].append(now)

    # Remove request times older than the 60-second window
    request_times[remote_addr] = [t for t in request_times[remote_addr] if now - t < 60]

    # Calculate requests per minute
    rpm = len(request_times[remote_addr])

    # Tighten the limit as the observed request rate climbs
    if rpm > 30:
        return "1 per minute"
    elif rpm > 20:
        return "5 per minute"
    else:
        return "10 per minute"

This approach analyzes the pattern of requests from individual IPs, and based on the observed rate of requests, dynamically adjusts the allowed requests per minute. It recognizes potential attack patterns and automatically tightens the rate limits, offering an agile defense mechanism that responds to the threat landscape in real time.

For more comprehensive details, including the integration with the OpenAI model, refer to the complete code sample on GitHub.

API Gateway

An API Gateway serves as a critical control point in managing and protecting access to internal resources, particularly when dealing with Large Language Models. An API Gateway can act as a buffer. It can be configured with rate limiting, throttling, and request validation rules to control the number and type of requests that reach the internal resources. By monitoring and filtering incoming traffic, the API Gateway can detect and block suspicious patterns or excessive requests from individual IP addresses or regions. This ensures that only legitimate and well-formed requests are processed, preventing an attacker from overloading the system and maintaining the availability and responsiveness of the Large Language Model for legitimate users.

WAFs (Web Application Firewalls)

Web Application Firewalls are a critical component of any web application security architecture. WAFs can inspect incoming requests, filter out malicious traffic, and provide real-time protection against various attack vectors, including SQL injections, Cross-Site Scripting (XSS), and more. By employing rule-based or anomaly-based detection mechanisms, WAFs can effectively mitigate the risk of attacks, ensuring that only legitimate requests reach the application.
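To make the rule-based idea concrete, here is a deliberately simplified, hypothetical filter in the spirit of a WAF signature. Real WAFs such as ModSecurity or the managed cloud offerings ship with far richer, battle-tested rule sets and should be preferred over hand-rolled patterns:

```python
import re

# Crude illustrative signatures only; trivially bypassable in practice
BLOCKED_PATTERNS = [
    re.compile(r"(?i)\bunion\b.*\bselect\b"),  # naive SQL-injection signature
    re.compile(r"(?i)<script\b"),              # naive XSS signature
]

def is_suspicious(body: str) -> bool:
    """Return True if the request body matches any blocked pattern."""
    return any(p.search(body) for p in BLOCKED_PATTERNS)
```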

Setting Billing Limits

In a world where many applications rely on third-party APIs and cloud-based resources, cost control is paramount. Setting billing limits is a practical way to manage costs and protect against unexpected spikes in usage that may be indicative of an attack. Most cloud providers offer tools to set daily, weekly, or monthly budget caps, providing an automatic brake that can halt potential financial damage. This is especially relevant for APIs like OpenAI’s, where an uncontrolled flood of requests can translate into significant expenses.

However, it’s crucial to understand that billing limits, while effective in containing financial risks, won’t stop a denial-of-service (DOS) attack from occurring. If an attack triggers these limits, the application may become unavailable when the cap is reached, as further requests would be blocked. Thus, billing limits should be seen as part of a broader strategy to control costs, rather than a standalone solution to prevent attacks. Combining billing limits with other mitigation techniques can provide a more robust defense against both the operational and financial risks associated with DOS attacks.
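Because provider-side caps behave this way, some teams layer an in-application spend guard on top. Below is a minimal sketch; the cap and pricing are illustrative, and a real implementation would persist state and reset it daily:

```python
class BudgetGuard:
    """Refuse requests once the estimated spend would exceed a daily cap."""

    def __init__(self, daily_cap_usd, price_per_1k_tokens):
        self.cap = daily_cap_usd
        self.price = price_per_1k_tokens
        self.spent = 0.0

    def charge(self, tokens):
        """Return True and record the cost if the request fits in the budget."""
        cost = tokens / 1000 * self.price
        if self.spent + cost > self.cap:
            return False  # over budget: reject rather than absorb the cost
        self.spent += cost
        return True
```

Unlike a provider-side billing cap, this guard lets you choose what happens at the limit, such as degrading to a cheaper model instead of going dark.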

Content Delivery Networks (CDNs)

CDNs serve as a buffer between the user and the application, distributing content and traffic through a globally distributed network of servers. More than just a performance enhancer, CDNs can act as a security shield. Many CDNs come with built-in protections against Distributed Denial of Service (DDoS) attacks and other malicious activities. By absorbing and mitigating these attacks, CDNs ensure that your application remains available and performs optimally, even under adverse conditions.


Defending against denial-of-service attacks, especially in the context of large language models, is a multifaceted challenge that requires a layered approach. From rate limiting and content-length restrictions to adaptive measures and the use of WAFs, CDNs, and billing controls, each strategy plays a vital role in the overall defense.

The key is to understand the unique requirements and risks associated with your particular application or service and to deploy a tailored combination of these measures. By doing so, you can ensure not only the integrity and availability of the service but also the financial well-being of the organization.

These examples and more can be explored in detail in the provided sample applications that demonstrate various mitigation steps in practice. These samples offer a hands-on way to understand and implement these strategies, further securing your use of large language models like OpenAI’s.

By staying vigilant, continuously monitoring, and adapting to the evolving threat landscape, you can build resilient systems that stand strong against the persistent and ever-changing threats of the digital age.

Check out the code examples for this blog post.


Ryan Sevey
CEO & Founder Mantium
