Using Memcached As A Distributed Lock From Within Django

I recently needed to lock a critical section of code to a single thread, despite running django on several servers. After a bit of searching I came across the popular idea of using memcached’s “add” command, which succeeds if the key doesn’t already exist and fails otherwise. Access to a given key is serialized, so you can build a nice distributed lock on top of it. I found a version of this idea wrapped up as a python context manager, started using it, and it seemed to work great. You just do something like:

with dist_lock(my_key):
    # Critical section
    pass

Even though you might be running on different machines, the memcached “add” command ensures that only one thread at a time gets into the critical section per value of “my_key”.
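
The semantics are easy to check from a django shell (the key and values here are arbitrary):

from django.core.cache import cache

cache.add('my_key', 1, 120)   # truthy: the key didn't exist, so it was stored
cache.add('my_key', 2, 120)   # falsy: the key already exists, nothing stored
cache.delete('my_key')        # once deleted, the next add() succeeds again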

Then we got sporadic odd bug reports. Should have tested things a little more. The code I’d found has a subtle bug: if a thread never actually acquires the lock, it still ends up releasing it.
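
The version I’d found looked roughly like this (reconstructed, not the original code, with the acquire/release helpers shown further below):

import contextlib

@contextlib.contextmanager
def dist_lock(key, attempts=1, expires=120):
    key = '__d_lock_%s' % key
    try:
        _acquire_lock(key, attempts, expires)  # raises if the lock is never acquired
        yield
    finally:
        # Bug: this runs even when _acquire_lock() raised, deleting a lock
        # that some other thread legitimately holds.
        _release_lock(key)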

Here’s my attempt at fixing up the problem. (Note the addition of the “got_lock” variable):

import time
import logging
import contextlib
import random
from django.core.cache import cache as django_cache

class MemcacheLockException(Exception):
    pass

@contextlib.contextmanager
def memcache_lock(key, attempts=1, expires=120):
    key = '__d_lock_%s' % key

    got_lock = False
    try:
        got_lock = _acquire_lock(key, attempts, expires)
        yield
    finally:
        # Only release the lock if we actually acquired it; otherwise we'd
        # delete a lock held by some other thread.
        if got_lock:
            _release_lock(key)

def _acquire_lock(key, attempts, expires):
    for i in xrange(attempts):
        # add() succeeds only if the key doesn't already exist, so at most
        # one caller can hold the lock at a time.
        stored = django_cache.add(key, 1, expires)
        if stored:
            return True
        if i != attempts - 1:
            # Exponential backoff with a bit of jitter before retrying.
            sleep_time = (((i + 1) * random.random()) + 2 ** i) / 2.5
            logging.debug('Sleeping for %s while trying to acquire key %s',
                          sleep_time, key)
            time.sleep(sleep_time)
    raise MemcacheLockException('Could not acquire lock for %s' % key)

def _release_lock(key):
    django_cache.delete(key)

I use it like this:

try:
    with memcache_lock(my_key):
        # Critical section
        pass
except MemcacheLockException:
    # Never got the lock
    pass

This seemed to work great! Then we got odd sporadic bugs. Should have tested more.

It turns out django’s use of memcached isn’t entirely optimal. For each request, django creates a new client connection to your set of memcached servers. That means a new TCP connection (assuming you’re using TCP) to each memcached server in your cluster. Even though having more memcached servers typically helps with scale, this behavior can become a performance hit under high load: if you have 5 memcached servers, each django request creates 5 TCP connections (and then destroys them when the request is over). I’ll have another blog post soon on how I’ve mitigated that issue.
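
You can see where this happens in django itself. Paraphrased from django 1.x (the exact code varies by version), the cache framework tears down memcached connections at the end of every request, so the next request has to reconnect:

# Paraphrased from django/core/cache/__init__.py (django 1.x).
from django.core import signals

if hasattr(cache, 'close'):
    # For the memcached backends, close() calls disconnect_all() on the
    # underlying client, dropping the connection to every server.
    signals.request_finished.connect(cache.close)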

Anyway, under high load weird things can happen. Resources get scarce, CPU or IO becomes a bottleneck, and connections time out. I’m using the python-memcached library, a pure python interface to memcached. It’s perfectly good, but it wasn’t designed with this environment in mind.

During a memcached call the library needs to figure out which memcached server to use, based on a hash of the key. Suppose the key hashes to memcached_1, and suppose that at that moment attempts to connect to memcached_1 fail. The code then rehashes the key, probably landing on a different memcached server, and tries to connect again. Here’s the relevant method from python-memcached:

    def _get_server(self, key):
        if isinstance(key, tuple):
            serverhash, key = key
        else:
            serverhash = serverHashFunction(key)

        for i in range(Client._SERVER_RETRIES):
            server = self.buckets[serverhash % len(self.buckets)]
            if server.connect():
                #print "(using server %s)" % server,
                return server, key
            serverhash = serverHashFunction(str(serverhash) + str(i))
        return None, None

This is probably ok in a situation where you’ve configured multiple memcached servers and one is down for a long time, but it’s not ok for the occasional network blip. Suppose two requests come in under load, both reach the critical section of the memcached lock, and one hashes the key to memcached_1 while the other hashes it to memcached_2. No locking will occur, and two threads will run the same critical section at the same time for the same key. Actually, we don’t have to suppose this will happen. It will happen. It happened.
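
To make that concrete, here’s the kind of interleaving we hit (server and key names are made up):

# Request A: _get_server('__d_lock_job42') -> memcached_1, connect() fails
#            rehash -> memcached_2, connect() succeeds
#            add() on memcached_2 succeeds: "lock acquired"
#
# Request B: _get_server('__d_lock_job42') -> memcached_1, connect() succeeds
#            add() on memcached_1 succeeds: "lock acquired"
#
# Both requests are now inside the critical section for the same key.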

There are probably a bunch of ways to fix this, but what I wanted was a consistent hash: if key1 hashes to memcached_1, it should always hash to memcached_1, and if I can’t talk to memcached_1, fail the request back to the caller. So I simply decided not to retry. Here’s how I made that happen.

First, I created a new cache backend by cribbing from django’s MemcachedCache. When the backend is constructed, I reach into the python-memcached library and override a few constants.

from django.core.cache.backends.memcached import BaseMemcachedCache

class MemcachedCache(BaseMemcachedCache):
    """A (slightly hacked) implementation of a cache binding using python-memcached"""
    def __init__(self, server, params):
        import memcache

        # Monkey patch the memcache library so it doesn't retry on failure to
        # connect to memcached servers. If we allow it to retry on connection
        # failures it uses a different hash on subsequent attempts, and this
        # can lead to us using a different memcached server for a key than
        # would normally be used. Future reads of that key can thus return
        # missing or wrong data.
        memcache.Client._SERVER_RETRIES = 1  # try exactly one server, no rehashing
        memcache._SOCKET_TIMEOUT = 20        # socket timeout, in seconds
        memcache._DEAD_RETRY = 0             # retry a failed server immediately rather than blacklisting it

        super(MemcachedCache, self).__init__(server, params,
                                             library=memcache,
                                             value_not_found_exception=ValueError)
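
With those constants in place, a connection failure surfaces as a failed cache call instead of a silent rehash. For the lock, that means (key name made up):

# With _SERVER_RETRIES = 1, _get_server() tries exactly one bucket and gives
# up, so add() returns a falsy value when that one server is unreachable.
stored = django_cache.add('__d_lock_job42', 1, 120)
if not stored:
    # Either another thread holds the lock, or the single server this key
    # hashes to is unreachable. Either way we stay out of the critical
    # section, and _acquire_lock() will eventually raise MemcacheLockException.
    pass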

Then, I use this new cache backend in my settings.py file instead of the typical one provided by django:

CACHES = {
    'default': {
        #'BACKEND' : 'django.core.cache.backends.memcached.MemcachedCache',
        'BACKEND' : 'apps.utils.memcached.MemcachedCache',
        'LOCATION': [
            '10.1.1.101:11211',
            '10.1.1.102:11211',
            '10.1.1.103:11211',
        ],
    }
}

And voila! A hardened distributed lock using memcached for use within django.

A couple of things to think about with this approach:

  • Disabling retries in python-memcached means memcached calls will fail whenever the code can’t connect. If you’re doing long maintenance where one or more of your memcached servers is down for an extended period, consider either removing them from LOCATION or restoring retries (see the sketch after this list).
  • Memcached can crash (although I’ve never seen this in production), be restarted, or run out of memory and evict your keys/values. Any of these at exactly the wrong moment will undermine this distributed locking strategy.
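
For the first point, one option is to make the retry count configurable instead of hard-coding it, so rehashing can be temporarily restored during maintenance windows. A minimal sketch, using a made-up MEMCACHE_SERVER_RETRIES setting:

import memcache
from django.conf import settings

# Hypothetical setting (not built into django): default to a single attempt,
# but allow python-memcached's retry/rehash behavior to be restored when a
# server will be down for a while.
memcache.Client._SERVER_RETRIES = getattr(settings, 'MEMCACHE_SERVER_RETRIES', 1)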