Generating Random Tokens (in Python)

Update 1/13: After reading the comments and thinking about it some more, I think binascii.hexlify(os.urandom(n)) is the easiest way to generate random tokens, and random = random.SystemRandom(); ''.join(random.choice(alphabet) for _ in range(n)) is better when you need a string that contains only characters from a specific alphabet. Pyramid uses the former approach; Django uses the latter.
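As a minimal sketch of the two approaches mentioned in the update, assuming Python 3 (where hexlify returns bytes that need decoding). The function names hex_token and alphabet_token and the default alphabet are made up for illustration:

```python
import binascii
import os
import random
import string

# Hypothetical default alphabet: ASCII letters and digits.
DEFAULT_ALPHABET = string.ascii_letters + string.digits

# SystemRandom draws from the OS entropy source (os.urandom).
_sysrand = random.SystemRandom()


def hex_token(n):
    """Return 2 * n hex characters encoding n random bytes."""
    return binascii.hexlify(os.urandom(n)).decode("ascii")


def alphabet_token(n, alphabet=DEFAULT_ALPHABET):
    """Return n characters, each chosen uniformly from alphabet."""
    return "".join(_sysrand.choice(alphabet) for _ in range(n))


print(hex_token(16))       # 32 hex characters
print(alphabet_token(12))  # 12 alphanumeric characters
```

Using a separate name such as _sysrand also avoids rebinding the name of the random module.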

I’m working on a web site where I need to generate random CSRF tokens. After digging around a bit, I found os.urandom(n), which returns “a string of n random bytes suitable for cryptographic use.” Okay, that sounds good… except that it can include bytes that aren’t “web safe”.

So I needed a way to encode the output of urandom. I poked around some more and saw binascii.hexlify(data) being used for this purpose (in Pyramid). For some reason, though, I thought it would be “clever” to hash the output from urandom like so: hashlib.sha1(os.urandom(128)).hexdigest().

What I like about this is that no matter how many bytes you request from urandom (assuming more bytes means more entropy), you always end up with a 40 character string that’s safely encoded.
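To make the length difference concrete, here is a small sketch (Python 3 assumed). Keep in mind that SHA-1 always produces a 160-bit digest, so the hashed token can never carry more than 160 bits of randomness, however many bytes go in.

```python
import binascii
import hashlib
import os

raw = os.urandom(128)

hashed = hashlib.sha1(raw).hexdigest()
encoded = binascii.hexlify(raw).decode("ascii")

print(len(hashed))   # 40, regardless of how many bytes were hashed
print(len(encoded))  # 256, two hex characters per input byte
```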

I’m not sure if this provides any real benefit though (in terms of increased security). Are there better ways to generate random tokens?

Another thought I had was to use bcrypt.gensalt() and use its output as is. It uses urandom to generate the initial salt, which is then encoded, and it also returns a fixed number of bytes (29).

On a slightly related note, I recently needed to generate a new PIN. My first thought was to reuse a PIN I use elsewhere, but of course that’s a bad idea. My second thought was to use KeePassX to generate one. I happened to have a calculator sitting next to me (one with big buttons); I closed my eyes and banged on it a bit to generate the PIN.

15 thoughts on “Generating Random Tokens (in Python)”

  1. Solution is simple then. Find a small gnome and small calculator and a way of it sending the results of its calculator banging back into python.

    It’s completely NSA-proof too, unless the gnome is an NSA plant.

  2. I am simply doing uuid.uuid4() to generate a unique ID every time I need one. Seems to be doing the trick properly.

    1. I thought about using uuid4, but I couldn’t tell from a quick scan of the manual how random it is. I guess I could check the source…

      1. uuid4 uses a system generator (e.g., libuuid) if available, then falls back to os.urandom, then falls back to random.randrange, so I like this option: uuid.uuid4().hex, which gives you a 32-character string.

        1. I had a need to create random and non-repeating strings of 7 lowercase alphanumeric characters. It appears that uuid.uuid4().hex is not nearly as good as random.sample in this case. random.sample and random.SystemRandom().sample are seemingly equal. Test here:

          Probably the 7-character limitation is the reason.

          1. It makes sense that truncating UUID.hex wouldn’t be as good for this since you lose randomness by truncating. I tried a different approach just for fun and got slightly better results:

            import os
            import string

            alphabet = string.ascii_lowercase + string.digits
            alphabet_len = len(alphabet)
            def my_rand():
                # Python 3: iterating over bytes yields ints in range(256)
                return ''.join(alphabet[b % alphabet_len] for b in os.urandom(7))

            When I ran your test script with this added, I got these results:

            Generating 1000000 sample alphanumeric strings of 7 symbols.
            Number of repetitions:
            uuid4().hex: 1856
            Plain random: 15
            SystemRandom: 11
            my_rand: 7

            This is just demonstrating that urandom is more random than the other options, but it’s a little puzzling because SystemRandom uses urandom. Maybe it has something to do with how SystemRandom.sample() works.

            Edit 1: sample() returns a list of unique characters, which explains why it doesn’t work as well.

            Edit 2: My algorithm above isn’t that great because it doesn’t select uniformly from the alphabet (due to the way the random bytes are mapped into the alphabet).

          2. Maybe it’s because hex strings are limited to 16 possible characters and the other two methods you devised allow 36 possible characters.

  3. For my Python RSA implementation I’ve used os.urandom(); reviewers have found it to be secure enough too. I would use neither hashing nor uuid4, as “they result in a nice string” hardly covers any security requirement. I think that hexlify’ing the result of os.urandom() would be the best of the above options.

    1. Does hashing a random value decrease its randomness?

      Edit: I guess it does, since hashing maps bytes from a larger range to bytes in a smaller range. With 16 ** 40 = 1461501637330902918203684832716283019655932542976 possible digests, it probably doesn’t matter much for my original use case, but strictly speaking, I think you’re right.

    1. The Django version (by default) limits the character set to just ASCII alphanumerics, which reduces the number of possibilities (probably not by an amount that really matters for CSRF tokens, though):

      >>> (26 + 26 + 10) ** 12  # Django's get_random_string(12)
      >>> 256 ** 12  # binascii.hexlify(os.urandom(12))

      If you don’t need characters from a specific alphabet, I think binascii.hexlify(os.urandom(n)) is the best (simplest) choice.

      1. binascii.hexlify outputs hex, so the alphabet is only 16 characters and the output is twice the length of the input to hexlify. So now you’ve got 24 characters, not 12, for that number of possibilities.

        1. The “alphabet” urandom uses is 0-255. hexlify encodes those raw bytes using hex characters, but that doesn’t affect the number of possibilities, although it does affect the length of the “web safe” string, as you pointed out.

          Using ASCII alphanumerics, the byte range is something like 48-57,65-90,97-122. Strings generated from that alphabet can be encoded in a slightly shorter form (e.g., 12 vs 18 bytes for roughly the same number of possibilities).

          So I guess there’s a trade-off between the length of the encoded string vs the number of possibilities, but for my use case a few bytes don’t really matter.

  4. There is a Python module “shortuuid”, which you can use with uuid4() to provide a 22-character string instead of 32 if that is a concern. I use it a lot for verification tokens where I want the uniqueness/randomness of the uuid4().
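
Two points from the thread above can be checked with a little arithmetic: how large each token space is, and why mapping random bytes into a 36-character alphabet with % is not uniform. This is a rough sketch; the helper name token_bits is invented for illustration.

```python
import math
import string
from collections import Counter


def token_bits(alphabet_size, length):
    """Bits of randomness in `length` symbols drawn uniformly
    from an alphabet of `alphabet_size` symbols."""
    return length * math.log2(alphabet_size)


print(round(token_bits(62, 12), 1))  # 12 ASCII alphanumerics: 71.5
print(token_bits(256, 12))           # 12 random bytes: 96.0
print(token_bits(16, 40))            # 40 hex characters (SHA-1): 160.0

# Modulo bias: 256 byte values do not divide evenly into 36 symbols.
alphabet = string.ascii_lowercase + string.digits
hits = Counter(alphabet[b % len(alphabet)] for b in range(256))
print(hits["a"], hits["e"])  # 8 7
```

Since 256 = 7 * 36 + 4, the first four letters (a through d) each correspond to eight byte values while every other character corresponds to seven, so those four come up slightly more often. Picking characters with random.SystemRandom().choice(alphabet) avoids this.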
