# Const-time Modular Inversion Using CRT

Modular inversion is an important component in many cryptographic computations, notably in number-theoretic public key cryptosystems like RSA and ECDSA. In such uses, we must both perform the computation as quickly as possible and also in const-time, that is without any software-observable side channels which leak information about the inputs or output. Otherwise it is possible to attack computations such as RSA key generation or ECDSA signature generation, and recover the secret key.

# Simple and hardware friendly RSA threshold signatures

A $(n,t)$ threshold signature schemes allow splitting a key into $n$ pieces, in such a way that at least $t < n$ participants must jointly use their key shards in order to generate a valid signature.

Many techniques for RSA threshold signatures have been developed. Currently published techniques require either a trusted dealer, or use of a distributed key generation algorithm. In addition, the signers must perform a non-standard RSA signature; that is, signing a message with a private exponent which is not equal to $e^{-1} \bmod n$. Both requirements prevent using standard hardware such as HSMs or smartcards to protect the key shards.

I discovered a technique for $n$-of-$n$ RSA signatures where both keys and signatures can be computed using standard cryptographic hardware.

# How Not To Do BLS Signatures

The BLS signature scheme has several interesting properties, namely that the signatures are very short compared to any other known scheme, and it affords a simple implementation of threshold signatures and signature aggregation. For these reasons it has been of some interest especially in cryptocurrencies which can make good use of these properties.

# Bit manipulations using BMI2

Intel's Haswell design (expected in 2013) will include a new instruction set called BMI2 with various fun bit manipulation instructions. Particularly of note are the pext and pdep instructions which are essentially bit-level gather/scatter operations. Combining two pext operations results in the GRP instruction described in Efficient Permutation Instructions for Fast Software Cryptography, where the authors show how to implement bit level permutations using a variety of instructions. Perhaps not coincidentally at least one of the authors now works at Intel.

# Using std::async for easy parallel computations

C++0x, the next major revision of C++, includes a number of new language and library facilities that I am greatly looking forward to, including a standard thread interface. Initially the agenda for C++0x had included facilities built on threads, such as a thread pool, but as part of the so-called 'Kona compromise' (named after the location of the committee meeting where the compromise was made) all but the most basic facilities were deferred for a later revision.

# Converting Line Endings in InnoSetup

I recently packaged botan using InnoSetup, an open source installation creator. Overall I was pretty pleased with it - it seems to do everything I need it to do without much of a hassle, and I'll probably use it in the future if I need to package other programs or tools for Windows.

# The Case For Skein

After the initial set of attacks on MD5 and SHA-1, NIST organized a series of conferences on hash function design. I was lucky enough to be able to attend the first one, and had a great time. This was the place where the suggestion of a competition in the style of the AES process to replace SHA-1 and SHA-2 was first proposed (to wide approval). This has resulted in over 60 submissions to the SHA-3 contest, of which 14 have been brought into the second round.

Of the second round contenders, I think Skein is the best choice for becoming SHA-3, and want to explain why I think so.

# 4x4 integer matrix transpose in SSE2

The Intel SSE2 intrinsics has a macro _MM_TRANSPOSE4_PS which performs a matrix transposition on a 4x4 array represented by elements in 4 SSE registers. However, it doesn't work with integer registers because Intel intrinsics make a distinction between integer and floating point SSE registers. Theoretically one could cast and use the floating point operations, but it seems quite plausible that this will not round trip properly; for instance if one of your integer values happens to have the same value as a 32-bit IEEE denormal.

However it is easy to do with the punpckldq, punpckhdq, punpcklqdq, and punpckhqdq instructions; code and diagrams ahoy.

# Speeding up Serpent: SIMD Edition

The Serpent block cipher was one of the 5 finalists in the AES competition, and is widely thought to be the most secure of them due to its conservative design. It was also considered the slowest candidate, which is one major reason it did not win the AES contest. However, it turns out that on modern machines one can use SIMD operations to implement Serpent at speeds quite close to AES.

# Inverting Mersenne Twister's final transform

The Mersenne twister RNG 'tempers' its output using an invertible transformation:

unsigned int temper(unsigned int x)
{
x ^= (x >> 11);
x ^= (x << 7) & 0x9D2C5680;
x ^= (x << 15) & 0xEFC60000;
x ^= (x >> 18);
return x;
}


The inversion function is:

unsigned int detemper(unsigned int x)
{
x ^= (x >> 18);
x ^= (x << 15) & 0xEFC60000;
x ^= (x << 7) & 0x1680;
x ^= (x << 7) & 0xC4000;
x ^= (x << 7) & 0xD200000;
x ^= (x << 7) & 0x90000000;
x ^= (x >> 11) & 0xFFC00000;
x ^= (x >> 11) & 0x3FF800;
x ^= (x >> 11) & 0x7FF;

return x;
}


This inversion has been confirmed correct with exhaustive search.

# Optimizing Forward Error Correction Coding Using SIMD Instructions

Forward error correction (FEC) is a technique for handling lossy storage devices or transmission channels. A FEC code takes k blocks of data and produces an additional m blocks of encoding information, such that any set of k of the blocks (out of the k+m total) is sufficient to recover the original data. One can think of RAID5 as a FEC with arbitrary k and m fixed at 1; most FEC algorithms allow wide latitude for the values that can be sent, allowing the code to be adjusted for the reliability expectations and needs of the particular channel and application. For instance, the Tahoe distributed filesystem splits stored files using k of 3 and m of 7, so as long as at least 30% of the devices storing the file survive, the original file can be recoved.

# On Syllable's /dev/random

Inspired by the recent FreeBSDarc4random vulnerability, I've been taking a look at the random number generators used by various libraries and operating systems.

# Serious Weakness in GNU Classpath/gcj PRNG; DSA keys are compromised

GNU Classpath is an open source implementation of the Java class libraries used by gcj, the GNU Compiler for Java. One component of the Java library is JCE, the Java Cryptography Extensions (so called because originally it was not bundled with the JVM due to United States export restrictions), which provides the basic crypto features one would expect (ciphers, hashing, signatures) for Java applications. I found a rather interesting bug that compromised all RSA and DSA keys used with GNU classpath.

# The More Things Change...

"Anyone who considers arithmetic methods of producing random digits is, of course, in a state of sin." - John von Neumann, 1951

On an Ubuntu forum I caught a reference to a C++ library called JUCE, which is one of those all-inclusive C++ libraries along the lines of POCO or GNU Common C++. One thing I noticed was that it includes a few cryptographic operations, including RSA key generation, so I decided to take a peek at the latest release as of this writing, 1.46.

# A Failure Case in a Linux Random Number Generator

The Linux kernel implements a random number generator called a Tausworthe generator, in the file lib/kernel32.c. The kernel uses this generator for a variety of non-cryptographic purposes, such as calculating network delays and random ports numbers, choosing a random element to drop from full caches, and many other places where a randomized algorithm is useful. While looking through this source, I found some cases where it could fail quite dramatically.

# Roman Proverbs Applicable to Software

Quod non est in actis, non est in mundo. ("What is not in the documents does not exist")

# Insurance, Evaluation, Risks

The bond insurers MBIA and Ambac are going bankrupt because they wrote insurance for mortgage backed securities which are now failing at rates far higher than they had estimated. This is a pretty common problem with insurance; humans tend to be really bad at estimating or pricing risk.

# Racing in Java

Reading the documentation for Java's File object, I was astounded to find that the Java designers managed to replicate one of the best known file system race conditions for no good reason: the functions canRead and canWrite are essentially the Java equivalents of the access function, which is so well known to be a security hole that the Linux man page actually warns that:

Using access() to check if a user is authorized to e.g. open a file before actually doing so using open(2) creates a security hole, because the user might exploit the short time interval between checking and opening the file to manipulate it.

While OpenBSD provides the less ambiguous caveat that:

access() is a potential security hole and should never be used.

I was reading the man page for Linux fcntl(2), because I've never used it and was curious what exactly it could do. For a couple of hours this afternoon, I thought I had perhaps found a security vulnerability in the design, this post is to trace my logic and describe what I learned.

# Search Based Filesystem

I was using a friend's OS X machine briefly, and at one point thought that how the "Movies" folder worked was it acted as an index into the actual filesystem, using some search technology or another to find all of your movies and present them to you. Presenting search results as a directory of files instead of a list (especially with movies, since they are normally a single file that are more or less independent of each other) seemed so obviously user-friendly that I figured that had to be what Apple would do. (Apparently not: my friend said it's just a plain old directory with actual files in it). But my misconception planted the idea that this would be pretty useful!

# Python Format String Annoyance

Python's format string operator is useful, but if you wish to provide arguments to it you must do so all at once. This makes some situations harder to deal with, including this one wherein I am annoyed at being unable to compact some Python code manipulating a MySQL database.

# Huffman Encoding of Phone Contacts

Every mobile phone I've ever used or heard of sorts their contact lists in alphabetical order based on name.

# Algorithmic Complexity Attacks on Allocators

A few years back some researchers presented the concept of performing denial of service through algorithmic complexity attacks, which essentially cause pathological behavior in data structures like hash tables through carefully chosen inputs.

# Finding Equivalences of Boolean Function

A fairly common class of functions in crypto are functions mapping {0,1}3 onto {0,1}. In particular, these show up a lot in hash functions derived from MD4, including MD5, SHA-1, RIPEMD, and SHA-512. These range in complexity from simple three-term expressions like "(A xor B xor C)" to functions like "((A and B) or (C and (A or B)))". One interesting and important difference between these two functions becomes very important when you consider how to implement these functions on an x86 (or x86-64) processor. The x86 uses two-operand instructions, and has very few registers, so computing something like "(A and B) xor (not(A) and C)", which requires two temporaries (one to hold A and B, the other (not(A) and C)) might require you to spill values to the stack. Often, that means a major performance hit. Finding an alternate form for this function that only requires fewer temporary registers could be a major benefit. Obviously finding these equivalences could be done by hand, but having a computer do it seemed both faster and more interesting.

# Fun with assembly

"If you can explain how you do something, then you're very very bad at it."
-- John Hopfield

The Monotone folks have been doing some profiling and performance work of late. One thing that came out of that was the finding that Botan's SHA-1 implementation was causing a bottleneck; because Monotone identifies everything via hashes, there are times where it needs to hash many (many) megabytes of source data, and the faster that happens, the better. Since low-level C++ wasn't cutting it, I felt that it was time to try my hand at x86 assembly again.