Inspired by the rate limiter chapter of the book System Design Interview, I want to share my understanding of several rate limiting algorithms and implement each of them in Python.

Rate limiting is a crucial technique in system design that controls the rate of requests to a service. This post covers five common rate limiting algorithms:

  • Token Bucket
  • Leaky Bucket
  • Fixed Window Counter
  • Sliding Window Log
  • Sliding Window Counter

1. Token Bucket

The token bucket algorithm maintains a bucket of tokens, where:

  • The bucket starts with a maximum number of tokens (bucket size)
  • Tokens are added to the bucket at a constant rate (tokens per second)
  • When a request arrives:
    • If there are enough tokens, one is consumed and the request is allowed
    • If there aren’t enough tokens, the request is rejected
  • The bucket never exceeds its maximum size
  • This allows for burst traffic up to the bucket size, then smooths out to the refill rate
# token_bucket_demo.py
import time
import threading

class TokenBucket:
    def __init__(self, tokens_per_second: int, bucket_size: int):
        """
        Initialize a token bucket rate limiter.
        
        Args:
            tokens_per_second: Rate at which tokens are added to the bucket
            bucket_size: Maximum number of tokens the bucket can hold
        """
        self.tokens_per_second = tokens_per_second
        self.bucket_size = bucket_size
        self.tokens = bucket_size
        self.last_refill = time.time()
        self.lock = threading.Lock()

    def _refill(self):
        """
        Refill the bucket with tokens based on elapsed time.
        Called from consume() with the lock already held, so it must not
        try to acquire the (non-reentrant) lock again.
        """
        now = time.time()
        time_passed = now - self.last_refill
        new_tokens = time_passed * self.tokens_per_second
        self.tokens = min(self.bucket_size, self.tokens + new_tokens)
        self.last_refill = now

    def consume(self, tokens: int = 1) -> bool:
        """
        Try to consume tokens from the bucket.
        
        Args:
            tokens: Number of tokens to consume (default: 1)
            
        Returns:
            bool: True if tokens were consumed, False if not enough tokens
        """
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

if __name__ == "__main__":
    # Create a token bucket with 2 tokens per second and a bucket size of 5
    bucket = TokenBucket(tokens_per_second=2, bucket_size=5)

    # Simulate 20 requests
    for i in range(20):
        if bucket.consume():
            print(f"Request {i + 1}: Allowed")
        else:
            print(f"Request {i + 1}: Rejected")
        time.sleep(0.2)  # Sleep for 0.2 seconds between requests

Run the code:

python token_bucket_demo.py
Request 1: Allowed
Request 2: Allowed
Request 3: Allowed
Request 4: Allowed
Request 5: Allowed
Request 6: Allowed
Request 7: Allowed
Request 8: Rejected
Request 9: Allowed
Request 10: Rejected
Request 11: Allowed
Request 12: Rejected
Request 13: Rejected
Request 14: Allowed
Request 15: Rejected
Request 16: Allowed
Request 17: Rejected
Request 18: Rejected
Request 19: Allowed
Request 20: Rejected

Pros

  • Simple to implement and understand: The token bucket algorithm is conceptually straightforward, making it easy to implement and maintain. Its behavior is predictable and easy to reason about.
  • Allows for burst traffic: By maintaining a bucket of tokens, it can handle sudden spikes in traffic up to the bucket size, which is useful for applications with variable load patterns.
  • Memory efficient: Only needs to track a single token count and last refill time, making it very lightweight in terms of memory usage.
  • Smooths out traffic spikes: After handling burst traffic, it naturally smooths out to the refill rate, preventing sustained high load on the system.
  • Good for APIs and general-purpose rate limiting: Its balance of burst handling and smoothing makes it suitable for most API rate limiting scenarios.

Cons

  • Can be too permissive with burst traffic: The ability to handle bursts might be undesirable in systems that need strict rate limiting.
  • May not be suitable for strict rate limiting: The burst capability means it can’t guarantee a strict maximum rate over very short time periods.
  • Requires careful tuning of bucket size and refill rate: Finding the right balance between burst capacity and refill rate requires careful consideration of the application’s needs.
  • Not ideal for distributed systems without additional coordination: In distributed environments, maintaining consistent token counts across multiple instances requires additional coordination mechanisms.
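
To make the first two cons concrete, here is a minimal sketch with a fake clock injected for determinism (`FakeClockTokenBucket` is illustrative, not the class above): a full bucket admits `bucket_size` requests back-to-back, then settles to exactly the refill rate.

```python
class FakeClockTokenBucket:
    def __init__(self, tokens_per_second: int, bucket_size: int):
        self.rate = tokens_per_second
        self.size = bucket_size
        self.tokens = float(bucket_size)
        self.last = 0.0

    def consume(self, now: float) -> bool:
        # Refill from elapsed fake time, capped at the bucket size.
        self.tokens = min(self.size, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = FakeClockTokenBucket(tokens_per_second=2, bucket_size=5)
# Burst: six back-to-back requests at t=0 drain the full bucket.
burst = [bucket.consume(0.0) for _ in range(6)]
print(burst)  # [True, True, True, True, True, False]
# Steady state: one request every 0.5s matches the 2/s refill exactly.
steady = [bucket.consume(0.5 * i) for i in range(1, 5)]
print(steady)  # [True, True, True, True]
```

Whether that opening burst of 5 is acceptable is precisely the tuning question raised above.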

2. Leaky Bucket

The leaky bucket algorithm:

  • Requests arrive at the bucket at any rate
  • The bucket has a maximum capacity (queue size)
  • Requests are processed at a constant rate (leak rate)
  • When a request arrives:
    • If there’s space in the queue, it’s added
    • If the queue is full, the request is rejected
  • A background thread continuously drains the queue at the fixed rate
# leaky_bucket_demo.py
import time
import threading
from queue import Queue

class LeakyBucket:
    def __init__(self, rate_per_second: int, capacity: int):
        """
        Initialize a leaky bucket rate limiter.
        
        Args:
            rate_per_second: Rate at which requests are processed
            capacity: Maximum number of requests that can be queued
        """
        self.rate_per_second = rate_per_second
        self.capacity = capacity
        self.queue = Queue(maxsize=capacity)
        self.lock = threading.Lock()
        self.processing = False
        self.stop_event = threading.Event()

    def start_processing(self):
        """
        Start the background thread that processes requests at a constant rate.
        This method must be called before adding requests.
        """
        self.processing = True
        self.process_thread = threading.Thread(target=self._process_requests)
        self.process_thread.daemon = True
        self.process_thread.start()

    def _process_requests(self):
        """
        Background thread that processes requests at a constant rate.
        This method runs continuously until stop_processing is called.
        """
        while not self.stop_event.is_set():
            with self.lock:
                if not self.queue.empty():
                    request_id = self.queue.get()
                    print(f"Processing request {request_id}...")

            time.sleep(1.0 / self.rate_per_second)

    def add_request(self, request_id: int) -> bool:
        """
        Try to add a request to the bucket.

        Args:
            request_id: The ID of the request to add

        Returns:
            bool: True if request was added to queue, False if queue is full
        """
        with self.lock:
            if self.queue.full():
                return False
            self.queue.put(request_id)
            return True

    def stop_processing(self):
        """
        Stop the background processing thread.
        This method should be called when the rate limiter is no longer needed.
        """
        self.stop_event.set()
        self.processing = False
        if hasattr(self, 'process_thread'):
            self.process_thread.join()

if __name__ == "__main__":
    # Create a leaky bucket that processes 2 requests per second
    # with a maximum queue size of 5
    bucket = LeakyBucket(rate_per_second=2, capacity=5)
    bucket.start_processing()

    try:
        # Simulate 20 requests
        for i in range(20):
            if bucket.add_request(i + 1):
                print(f"Request {i + 1}: Added to queue")
            else:
                print(f"Request {i + 1}: Queue full, request rejected")
            time.sleep(0.2)  # Sleep for 0.2 seconds between requests

        # Wait for all requests to be processed
        time.sleep(5)
    finally:
        bucket.stop_processing()

Run the code:

python leaky_bucket_demo.py
Request 1: Added to queue
Request 2: Added to queue
Request 3: Added to queue
Processing request 1...
Request 4: Added to queue
Request 5: Added to queue
Processing request 2...
Request 6: Added to queue
Request 7: Added to queue
Request 8: Queue full, request rejected
Processing request 3...
Request 9: Added to queue
Request 10: Queue full, request rejected
Processing request 4...
Request 11: Added to queue
Request 12: Queue full, request rejected
Request 13: Queue full, request rejected
Processing request 5...
Request 14: Added to queue
Request 15: Queue full, request rejected
Processing request 6...
Request 16: Added to queue
Request 17: Queue full, request rejected
Request 18: Queue full, request rejected
Processing request 7...
Request 19: Added to queue
Request 20: Queue full, request rejected
Processing request 9...
Processing request 11...
Processing request 14...
Processing request 16...
Processing request 19...

Pros

  • Provides a constant output rate: Guarantees a consistent processing rate, which is crucial for systems that need predictable throughput.
  • Smooths out traffic bursts: By queuing requests and processing them at a fixed rate, it effectively eliminates traffic spikes.
  • Good for network traffic shaping: Its constant output rate makes it ideal for network traffic management and bandwidth control.
  • Predictable behavior: The fixed processing rate makes system behavior more predictable and easier to plan for.
  • Memory efficient when queue size is limited: With a bounded queue size, it uses a predictable amount of memory.

Cons

  • Can cause request delays: Requests may have to wait in the queue before being processed, potentially increasing latency.
  • May drop requests when queue is full: Unlike token bucket, it doesn’t allow bursts beyond the queue capacity.
  • Not suitable for bursty traffic patterns: The constant processing rate might be too restrictive for applications with naturally bursty traffic.
  • Requires background processing thread: The continuous processing of requests adds complexity to the implementation.
  • More complex implementation than token bucket: Managing the queue and processing thread requires more code and careful synchronization.
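
The last two cons can be avoided with a thread-free variant: instead of a real queue and a worker, track only the theoretical arrival time of the next request (the idea behind the Generic Cell Rate Algorithm used in traffic shaping). A minimal sketch with the time passed in explicitly; `VirtualLeakyBucket` is an assumption for illustration, not part of the implementation above.

```python
class VirtualLeakyBucket:
    def __init__(self, rate_per_second: int, capacity: int):
        self.interval = 1.0 / rate_per_second  # time to leak one request
        self.capacity = capacity
        self.tat = 0.0  # theoretical arrival time of the next leak

    def allow(self, now: float) -> bool:
        tat = max(self.tat, now)
        # Reject if the virtual queue already holds `capacity` requests.
        if tat - now >= self.capacity * self.interval:
            return False
        self.tat = tat + self.interval
        return True

bucket = VirtualLeakyBucket(rate_per_second=2, capacity=5)
# 7 simultaneous arrivals at t=0: capacity is 5, so two are dropped.
results = [bucket.allow(0.0) for _ in range(7)]
print(results)  # [True, True, True, True, True, False, False]
```

The trade-off: this version only decides admit/reject at arrival time; the caller, not a worker thread, is responsible for actually spacing out the work.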

3. Fixed Window Counter

The fixed window counter algorithm:

  • Time is divided into fixed windows (e.g., 1 second)
  • Each window has a maximum request limit
  • When a request arrives:
    • If the current window’s counter is below the limit, increment and allow
    • If the counter is at the limit, reject the request
  • At the start of each new window, the counter resets to zero
# fixed_window_counter_demo.py
import time
import threading

class FixedWindowCounter:
    def __init__(self, max_requests: int, window_size: int):
        """
        Initialize a fixed window counter rate limiter.
        
        Args:
            max_requests: Maximum number of requests allowed in a window
            window_size: Size of the time window in seconds
        """
        self.max_requests = max_requests
        self.window_size = window_size
        self.window_start = time.time()
        self.counter = 0
        self.lock = threading.Lock()

    def _reset_window(self):
        """
        Reset the counter if the current window has expired.
        This method is called before each request check.
        """
        now = time.time()
        if now - self.window_start >= self.window_size:
            self.window_start = now
            self.counter = 0

    def allow_request(self) -> bool:
        """
        Check if a new request is allowed.
        
        Returns:
            bool: True if request is allowed, False if rate limit exceeded
        """
        with self.lock:
            self._reset_window()
            if self.counter < self.max_requests:
                self.counter += 1
                return True
            return False

if __name__ == "__main__":
    # Create a fixed window counter rate limiter
    # with a maximum of 5 requests per 1-second window
    limiter = FixedWindowCounter(max_requests=5, window_size=1)

    # Simulate 20 requests
    for i in range(20):
        if limiter.allow_request():
            print(f"Request {i + 1}: Allowed")
        else:
            print(f"Request {i + 1}: Rejected")
        time.sleep(0.1)  # Sleep for 0.1 seconds between requests

Run the code:

python fixed_window_counter_demo.py
Request 1: Allowed
Request 2: Allowed
Request 3: Allowed
Request 4: Allowed
Request 5: Allowed
Request 6: Rejected
Request 7: Rejected
Request 8: Rejected
Request 9: Rejected
Request 10: Rejected
Request 11: Allowed
Request 12: Allowed
Request 13: Allowed
Request 14: Allowed
Request 15: Allowed
Request 16: Rejected
Request 17: Rejected
Request 18: Rejected
Request 19: Rejected
Request 20: Rejected

Pros

  • Very simple to implement: Requires minimal code and is easy to understand and maintain.
  • Memory efficient: Only needs to track a single counter and window start time, using very little memory.
  • Low computational overhead: Simple counter operations make it very fast and lightweight.
  • Easy to understand and debug: The fixed window concept is intuitive and straightforward to reason about.
  • Good for simple use cases: Perfect for basic rate limiting needs where exact precision isn’t required.

Cons

  • Allows bursts at window boundaries: The reset of counters at window boundaries can lead to request bursts, potentially causing system overload.
  • Less accurate than sliding window approaches: The fixed windows can lead to inaccuracies in rate limiting, especially around window boundaries.
  • Can be unfair if traffic is unevenly distributed: Traffic patterns might be unfairly limited due to the fixed window nature.
  • Not suitable for strict rate limiting: The burst potential makes it unsuitable for scenarios requiring precise rate control.
  • May cause thundering herd problems: The window reset can cause multiple requests to be processed simultaneously, potentially overwhelming the system.
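
The boundary-burst con is easy to demonstrate with a fake clock. This `FakeClockFixedWindow` is a sketch that aligns windows to multiples of `window_size` (unlike the class above, which restarts the window on demand) so the boundary is explicit:

```python
class FakeClockFixedWindow:
    def __init__(self, max_requests: int, window_size: int):
        self.max_requests = max_requests
        self.window_size = window_size
        self.window = -1
        self.counter = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_size)
        if window != self.window:  # crossed a window boundary: reset
            self.window = window
            self.counter = 0
        if self.counter < self.max_requests:
            self.counter += 1
            return True
        return False

limiter = FakeClockFixedWindow(max_requests=5, window_size=1)
late = [limiter.allow(0.9) for _ in range(5)]   # tail of window 0
early = [limiter.allow(1.0) for _ in range(5)]  # head of window 1
# All 10 are allowed within 0.1 simulated seconds, double the "5/second" limit.
print(late + early)
```

Ten allowed requests in a tenth of a second is exactly the overload the sliding window algorithms below are designed to prevent.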

4. Sliding Window Log

The sliding window log algorithm:

  • Maintains a log of all request timestamps
  • When a request arrives:
    • Remove timestamps older than the window size
    • Count remaining timestamps
    • If count is below limit, add new timestamp and allow
    • If count is at limit, reject the request
  • The window slides continuously with time
  • Provides exact counting within any time window
# sliding_window_log_demo.py
import time
import threading
from collections import deque

class SlidingWindowLog:
    def __init__(self, max_requests: int, window_size: int):
        """
        Initialize a sliding window log rate limiter.
        
        Args:
            max_requests: Maximum number of requests allowed in the window
            window_size: Size of the sliding window in seconds
        """
        self.max_requests = max_requests
        self.window_size = window_size
        self.requests = deque()
        self.lock = threading.Lock()

    def _clean_old_requests(self):
        """
        Remove timestamps of requests that are outside the current window.
        This method is called before each request check.
        """
        now = time.time()
        while self.requests and now - self.requests[0] >= self.window_size:
            self.requests.popleft()

    def allow_request(self) -> bool:
        """
        Check if a new request is allowed.
        
        Returns:
            bool: True if request is allowed, False if rate limit exceeded
        """
        with self.lock:
            self._clean_old_requests()
            if len(self.requests) < self.max_requests:
                self.requests.append(time.time())
                return True
            return False

if __name__ == "__main__":
    # Create a sliding window log rate limiter
    # with a maximum of 5 requests per second
    # and a window size of 1 second
    limiter = SlidingWindowLog(max_requests=5, window_size=1)

    # Simulate 20 requests
    for i in range(20):
        if limiter.allow_request():
            print(f"Request {i + 1}: Allowed")
        else:
            print(f"Request {i + 1}: Rejected")
        time.sleep(0.1)  # Sleep for 0.1 seconds between requests

Run the code:

python sliding_window_log_demo.py
Request 1: Allowed
Request 2: Allowed
Request 3: Allowed
Request 4: Allowed
Request 5: Allowed
Request 6: Rejected
Request 7: Rejected
Request 8: Rejected
Request 9: Rejected
Request 10: Rejected
Request 11: Allowed
Request 12: Allowed
Request 13: Allowed
Request 14: Allowed
Request 15: Allowed
Request 16: Rejected
Request 17: Rejected
Request 18: Rejected
Request 19: Rejected
Request 20: Rejected

Pros

  • Most accurate rate limiting: Maintains exact counts of requests within any time window, providing precise rate limiting.
  • No burst traffic allowed: The sliding window nature prevents any form of burst traffic, ensuring consistent load.
  • Precise control over request timing: Exact timestamps allow for fine-grained control over request processing.
  • Fair distribution of requests: The continuous sliding window ensures fair treatment of all requests regardless of timing.
  • Good for high-accuracy requirements: Ideal for systems where precise rate limiting is critical.

Cons

  • Memory intensive: Storing all request timestamps can consume significant memory, especially under high load.
  • Higher computational overhead: Processing and cleaning up timestamps requires more CPU resources.
  • More complex implementation: Managing the timestamp log and window sliding adds complexity to the code.
  • Not suitable for high-throughput systems: The memory and computational overhead make it less ideal for very high-volume systems.
  • May need cleanup of old timestamps: Requires periodic cleanup of old timestamps to prevent memory growth.
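
A quick back-of-envelope on the memory con, assuming CPython where a boxed float costs roughly 24 bytes plus an 8-byte deque slot (both figures are rough assumptions, not measurements of the code above):

```python
bytes_per_entry = 24 + 8          # rough CPython float object + deque slot
rps = 10_000                      # sustained requests per second
window_seconds = 60               # sliding window size
entries = rps * window_seconds    # 600,000 timestamps kept live
mib = entries * bytes_per_entry / 1024**2
print(f"{mib:.1f} MiB")           # roughly 18 MiB for a single client key
```

Multiply by thousands of client keys and the log dominates memory, which is what motivates the counter-based approximation in the next section.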

5. Sliding Window Counter

The sliding window counter algorithm:

  • Maintains two counters: current window and previous window
  • When a request arrives:
    • Update window boundaries if needed
    • Calculate weighted count from both windows
    • If total is below limit, increment current window and allow
    • If total is at limit, reject the request
  • Uses weighted average to approximate sliding window
  • Provides good accuracy with minimal memory usage
# sliding_window_counter_demo.py
import time
import threading

class SlidingWindowCounter:
    def __init__(self, max_requests: int, window_size: int):
        """
        Initialize a sliding window counter rate limiter.
        
        Args:
            max_requests: Maximum number of requests allowed in the window
            window_size: Size of the sliding window in seconds
        """
        self.max_requests = max_requests
        self.window_size = window_size
        self.current_window = 0
        self.previous_window = 0
        self.window_start = time.time()
        self.lock = threading.Lock()

    def _reset_window(self):
        """
        Roll the window counters based on elapsed time.
        This method is called before each request check.
        """
        now = time.time()
        elapsed = now - self.window_start
        if elapsed >= 2 * self.window_size:
            # An entire window passed with no requests: both counts are stale.
            self.previous_window = 0
            self.current_window = 0
            self.window_start = now
        elif elapsed >= self.window_size:
            self.previous_window = self.current_window
            self.current_window = 0
            self.window_start = now

    def allow_request(self) -> bool:
        """
        Check if a new request is allowed using weighted counters.
        
        Returns:
            bool: True if request is allowed, False if rate limit exceeded
        """
        with self.lock:
            self._reset_window()
            current_time = time.time()
            window_progress = (current_time - self.window_start) / self.window_size
            
            # Weighted count of previous window
            weighted_count = self.previous_window * (1 - window_progress)
            total_count = weighted_count + self.current_window
            
            if total_count < self.max_requests:
                self.current_window += 1
                return True
            return False

if __name__ == "__main__":
    # Create a sliding window counter rate limiter
    # with a maximum of 5 requests per second
    # and a window size of 1 second
    limiter = SlidingWindowCounter(max_requests=5, window_size=1)

    # Simulate 20 requests
    for i in range(20):
        if limiter.allow_request():
            print(f"Request {i + 1}: Allowed")
        else:
            print(f"Request {i + 1}: Rejected")
        time.sleep(0.1)  # Sleep for 0.1 seconds between requests

Run the code:

python sliding_window_counter_demo.py
Request 1: Allowed
Request 2: Allowed
Request 3: Allowed
Request 4: Allowed
Request 5: Allowed
Request 6: Rejected
Request 7: Rejected
Request 8: Rejected
Request 9: Rejected
Request 10: Rejected
Request 11: Allowed
Request 12: Rejected
Request 13: Allowed
Request 14: Rejected
Request 15: Allowed
Request 16: Rejected
Request 17: Allowed
Request 18: Rejected
Request 19: Allowed
Request 20: Rejected

Pros

  • Memory efficient: Only needs two counters regardless of request volume, making it very memory efficient.
  • More accurate than fixed window: The weighted average approach provides better accuracy than simple fixed windows.
  • Good balance of accuracy and performance: Offers a practical compromise between precision and resource usage.
  • Suitable for distributed systems: The simple counter-based approach works well in distributed environments.
  • Handles burst traffic better than fixed window: The weighted average helps smooth out traffic better than fixed windows.

Cons

  • More complex than fixed window: The weighted average calculation adds complexity compared to simple fixed windows.
  • Still less accurate than sliding window log: While more accurate than fixed window, it’s not as precise as the sliding window log.
  • Requires careful tuning of window size: Finding the optimal window size requires consideration of the specific use case.
  • May need synchronization in distributed systems: While suitable for distributed systems, it still requires proper synchronization mechanisms.
  • More computational overhead than fixed window: The weighted average calculations add some computational overhead.
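
To see why the demo output alternates between Allowed and Rejected after the first window rolls over, here is the weighted count from allow_request worked by hand with the demo's numbers (limit 5, 1-second window):

```python
previous_window = 5     # 5 requests landed in the previous window
current_window = 1      # 1 request so far in the current window
window_progress = 0.3   # we are 0.3s into the 1s window

weighted = previous_window * (1 - window_progress)  # 5 * 0.7 = 3.5
total = weighted + current_window                   # 3.5 + 1 = 4.5
print(total < 5)  # True: one more request is still allowed
```

As the window progresses, the previous window's 5 requests decay linearly out of the weighted total, freeing capacity at roughly the target rate, which is why admissions come through one at a time rather than in a burst.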

Further Reading