Shared-Dictionary Compression for HTTP/S

(x-post from LinkedIn Engineering Blog)

The HTTP protocol has been the glue that holds together the Web, mobile apps, and servers. In a world of constant innovation, HTTP/1.1 appeared to be the only safe island of consistency until the announcement of HTTP/2. Yet, even with HTTP being as robust and efficient as it is, there is still room for improvement, and that is what this post is about. LinkedIn’s Traffic Infrastructure team is making LinkedIn faster by exploring ways in which HTTP can be improved.

Traditionally, HTTP communications are compressed using either the gzip or deflate algorithms. Of the two, gzip strikes the balance of being aggressive enough to reach a good compression ratio while not having ridiculous CPU requirements. Algorithms such as the Burrows-Wheeler transform (popularized through bzip2) offer higher degrees of compression, but have higher CPU requirements. Until recently it was generally accepted that gzip is the best way to compress HTTP traffic. Yet there is an alternative to using computationally intense compression algorithms to transfer less data over the wires – this alternative is to send less data in the first place.

Enter SDCH

SDCH (pronounced “Sandwich”) is an HTTP/1.1-compatible extension that reduces the required bandwidth through the use of a dictionary shared between the client and the server. It was proposed by four Google employees over five years ago. It relies on a relatively simple mechanism that has the potential to drastically reduce transmission sizes. Currently Google Chrome and other browsers based on the Chromium open-source browser project support SDCH, but until recently Google was the only Internet site to support SDCH on the server. In this post, we will share an overview of how it works and our experience implementing it, and perhaps shed some light on why its use isn’t widespread despite its remarkable potential.

SDCH is very simple at first glance. Instead of compressing every page as if it were an independent document, it takes into consideration that sites have many common phrases occurring across multiple pages – navigation bars, copyrights, company names, etc. In addition to phrases, modern pages share a good deal of CSS files, JavaScript code, etc. Once those common elements have been downloaded, there is no need to transfer them over and over again. Instead, the common elements are enumerated in a dictionary. If the browser and the server have the same revision of the dictionary, the server can transfer only short keys instead of long values, and the browser will use the dictionary to substitute the phrases back for the keys.
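The substitution idea can be sketched in a few lines of Python. This is a toy illustration only – real SDCH uses the VCDIFF delta format rather than naive string replacement, and the dictionary entries here are made up:

```python
# Toy model of shared-dictionary compression: both sides hold the same
# dictionary of long common strings, so the wire carries short keys instead.
DICTIONARY = {
    "&1": '<nav class="global-nav">',
    "&2": "Copyright 2015 LinkedIn Corporation",
}

def encode(page: str) -> str:
    """Server side: replace long common phrases with short keys."""
    for key, phrase in DICTIONARY.items():
        page = page.replace(phrase, key)
    return page

def decode(payload: str) -> str:
    """Browser side: substitute the phrases back for the keys."""
    for key, phrase in DICTIONARY.items():
        payload = payload.replace(key, phrase)
    return payload

page = '<nav class="global-nav">...</nav> Copyright 2015 LinkedIn Corporation'
wire = encode(page)       # much shorter than the original page
restored = decode(wire)   # identical to the original page
```

The round trip is lossless as long as both sides hold the same dictionary revision – which is exactly why the negotiation described below matters.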

LinkedIn is committed to the privacy of its users, and therefore we rely on browsers advertising support for SDCH over HTTPS. Chromium does that; other browsers will follow suit.

Negotiations

For SDCH to work, several conditions have to be met. First, both the browser and the server must support SDCH. Second, both the browser and the server must share the same revision of the dictionary – using two different dictionaries will result in broken pages due to missing style sheets, invalid JavaScript, or unfinished sentences. These conditions are negotiated through HTTP headers as the browser makes a GET request. In plain English, the negotiation can be described as follows:

browser: Hi! I need to GET /page.html. And BTW, I support SDCH

server: Hi! Here is the page. And BTW, here’s the URL to get the SDCH dictionary!

… (browser downloads dictionary in the background) …

browser: Hi! I need to GET /another.page.html. And BTW, I support SDCH – here is my client-hash (see below)

server: Hi! Here is the page! And BTW, since your dictionary is up to date, the page is SDCH encoded!

In both phases, the client advertises SDCH support with the Accept-Encoding: sdch header. If the client has a dictionary, it advertises that with the Avail-Dictionary: client-hash header. We will talk about the hash generation later.

Once the browser is done fetching the dictionary, it is expected to advertise the availability of this dictionary by including the Avail-Dictionary header with the next request. This header contains a unique identifier for the downloaded dictionary called the “client hash”. At this point the server can start sending SDCH-compressed traffic (marked with the Content-Encoding: sdch header). The SDCH-compressed payload is prefixed by a “server hash”. The two hashes are derived from the SHA-256 digest of the dictionary, which allows the browser and the server to agree that they are using the same dictionary.
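As we understand the SDCH proposal, the client hash is the URL-safe base64 encoding of the first 48 bits of the dictionary’s SHA-256 digest, and the server hash is the encoding of the next 48 bits. A Python sketch of the derivation (the function name is ours):

```python
import base64
import hashlib

def sdch_hashes(dictionary: bytes):
    """Derive (client_hash, server_hash) from a dictionary, per our reading
    of the SDCH proposal: URL-safe base64 of the first and second 48 bits
    of the dictionary's SHA-256 digest."""
    digest = hashlib.sha256(dictionary).digest()
    client_hash = base64.urlsafe_b64encode(digest[0:6]).decode()
    server_hash = base64.urlsafe_b64encode(digest[6:12]).decode()
    return client_hash, server_hash
```

Six bytes encode to exactly eight base64 characters with no padding, so both hashes come out eight characters long.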

From this point on the server can start SDCH compression: it substitutes the long common strings included in the dictionary with shorter representations, includes sdch in the Content-Encoding header, and prepends the body with the server hash.

The client receives the response and decodes the content by reversing what the server did.

Making Sandwiches

SDCH is so simple at first glance that it makes one wonder why it isn’t more popular. After all, all you need to do is generate a dictionary of long common strings found in your static content and have a web server that supports SDCH. How hard can that be? Let’s find out.

First, you need to generate a dictionary of long common strings from your static content. Femtozip, an open-source tool by Garrick Toubassi, seems to be the only available solution at this point. We ran into a performance issue, fixed it, and contributed the patch back. Running SDCH in production requires regenerating the dictionary fairly often, ideally every time we deploy a front-end application. That comes at the price of extra deployment time – about 7 hours for our production dataset. At LinkedIn we deploy up to 3 times a day, so we opted for “near-line” dictionary generation: we generate the dictionary in the background every two weeks. At this cadence we can benefit from SDCH compression without paying the overhead at every deployment.

The second challenge is versioning of static content. Every company has a different process for versioning and building its static content, and this step may be challenging for some, as dictionary generation can take a very long time. For websites with a large collection of static content, generating the dictionary can take days. A map-reduce approach could potentially speed up dictionary generation, but that is beyond the scope of this post. Let’s assume, for the sake of simplicity, that you don’t have that many files and haven’t had any issues with this step; now you have a dictionary for your static content and you are serving it from your main site under the “/my_dictionary” partial URL.

The third thing you need is a web server that supports SDCH. You are mostly out of luck on this one, as there are currently only a few publicly available solutions for serving SDCH-encoded content, so you will likely need to code your own module/plugin/web server for this purpose. Since we extensively use Apache Traffic Server at LinkedIn, our experiment involves a custom Traffic Server plugin that uses the atscppapi and open-vcdiff libraries. We are in the process of open-sourcing the plugin.

Challenges

Femtozip outputs a dictionary that can be used for SDCH with minor modifications: you need to prepend the SDCH dictionary headers so that the browser knows on which domain and under which paths the dictionary is valid.

We discovered that Chrome aggressively blacklists domains and does not tell you why your domain is blacklisted, which turned debugging into an arduous task. Fortunately, the Yandex Browser has a tab under its net-internals page that gives you error codes, which you can conveniently look up in sdch_manager.h in the Chromium source code.

What we discovered is that if the browser advertises its dictionary using the Avail-Dictionary header, the server is expected to encode the response body with SDCH no matter what. Even if the response doesn’t contain sdch in the Content-Encoding header, the browser will still try to decode the received body, fail, and blacklist your domain from further SDCH use until the browser restarts. If the server doesn’t want to encode with SDCH while the client has advertised an available dictionary, the server has to include the X-Sdch-Encode: 0 header in the response to explicitly indicate that the response body is not SDCH-encoded.
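The rule we learned the hard way can be summarized in a small sketch (a hypothetical helper, not our plugin’s actual code): once the client sends Avail-Dictionary, the server must either SDCH-encode the response or explicitly opt out, or the browser will blacklist the domain.

```python
def sdch_response_headers(request_headers: dict, can_sdch_encode: bool) -> dict:
    """Decide which SDCH-related headers a response must carry."""
    headers = {}
    if "Avail-Dictionary" not in request_headers:
        return headers  # client advertised no dictionary; nothing to declare
    if can_sdch_encode:
        # body will be the server hash followed by the SDCH-encoded payload
        headers["Content-Encoding"] = "sdch"
    else:
        # explicitly opt out so the browser does not attempt to decode
        headers["X-Sdch-Encode"] = "0"
    return headers
```

Forgetting the `else` branch is exactly the failure mode that gets a domain silently blacklisted until the browser restarts.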

Results

The preliminary results of our SDCH initiative at LinkedIn have been nothing short of remarkable. We generated two dictionaries: one for 1282 CSS files and one for 6225 JavaScript files. Each dictionary was about 700kB, and less than 100kB when compressed and served with gzip. The data we collected compares gzip with sdch+gzip. When sdch and gzip are combined, we have seen additional compression as high as 81% on certain files compared to gzip alone. Average additional compression across our static content was about 24%. It takes us about 400 microseconds on average to encode a single static content file on the fly. Since our static content is cached through CDNs, it only needs to be encoded every once in a while. To ensure that CDNs could distinguish a page encoded with one dictionary from one encoded with another, we also needed to add Avail-Dictionary to our Vary response header. We think SDCH has a lot of potential and we expect to see a lot of new developments very soon.
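One plausible reading of the “additional compression” numbers above (our assumption of the metric; the function name is ours):

```python
def additional_compression(gzip_bytes: int, sdch_gzip_bytes: int) -> float:
    """Fraction by which sdch+gzip shrinks a file beyond what gzip alone
    achieves: 1 - (sdch+gzip size) / (gzip-only size)."""
    return 1.0 - sdch_gzip_bytes / gzip_bytes

# e.g. a file that is 100 kB under gzip but 19 kB under sdch+gzip
# shows roughly 81% additional compression
ratio = additional_compression(100_000, 19_000)
```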

Credits and Final Notes

Dzmitry Markovich championed the idea of using SDCH at LinkedIn. The plugin was developed by Can Selcik, during his internship at LinkedIn. It is worth mentioning that the specs have changed quite a bit since the initial announcement of SDCH. Randy Smith, Chris Bentzel and the Chromium developers have been extremely helpful and none of this would be possible without them.


Singletrack as a scene for growth

TL;DR: another piece of introspection.

Looking back, it has been an interesting year from the racing and training perspective. Lots of introspection, illusion-breaking, and reframing the question of why racing is important to me.

I would like to highlight this question.

Do I race because of boredom? Is it my inherent competitive drive that pushes me? Is it my way to find an outlet for stress and negative emotions?

To a degree, all the explanations above are true. Yet all of them miss the point.

I race because I want to build a system where I can constantly control my personal growth. And I need this control so that I can constantly improve. And I need to constantly improve because I ultimately believe there is no stable point: either we are improving or we are losing our edge. There is no such thing as “reaching a stable point and relaxing”. Life is a dynamic balance.

As I type this, I hear a voice in my head: “Wait a second, Rhynolite. At any given second there is somebody you know who has hit a plateau and is trying hard to push through it. Whiskey Tango Foxtrot?”

To answer this, I would like to first recall the wonderful world of the Planiverse, which I read about 30 years ago in the amazing book “The New Mathematical Diversions” by Martin Gardner. In the Planiverse, a world created by A. K. Dewdney, everything is flat. The earth is flat (yet round), and the people are flat. More than that, men always face right, and women always face left. This makes dating really handy, but what if a teenage boy wants to have a serious talk with his father? We can only hope the dad did enough waiter’s walks to be able to hold the boy upside down for the duration of the talk… or maybe they should use a flat version of Skype… I digress.

Well, if we were living in the Planiverse, and we were single-dimensional, and everything were measured on a single plane, then of course the model of “Dynamic Balance” would ensure we never saw any plateau.

However, we are multidimensional (or at least meant to be), and we can happily improve in one direction while losing in another… projected onto our favorite activity, that would look like no progress at all.

On the other hand, by projecting progress onto a simple system – such as bicycle racing – I can see it clearly, and I can develop the basic skills I need to keep myself from stagnating and eventually becoming irrelevant.

Tearing down connections

Summary

Recently a colleague was debugging server software and was looking for a way to simulate a dropped TCP connection. The easiest way to drop a TCP connection is to terminate either the server or the client process – once the process is terminated, the operating system will “clean up” all resources used by the process, including open TCP connections.

In my colleague’s case, though, killing either the client or the server was not desirable. What she was looking for was a way to terminate a connection and have the client establish a new one.

The approach we took was crafting a “TCP reset” packet and sending it to the server. The server was led to believe that the client had just hung up on it, and a new connection was established, which was exactly what we were looking for.

The gritty details

The TCP protocol provides two ways of closing connections. The normal way requires each party (the client and the server) to send a FIN packet, which tells the other party (the server and the client, respectively) that no more data is going to come. By sending a FIN packet each way, the parties agree that the conversation is done and all that was to be said has been said.

This rule, called “the 4-way teardown”, exists to ensure that the nice abstraction of a TCP connection is maintained over the messy world of retransmits, delays, and dropped bits that is common in networking.

Every rule has its exception, though. In the case of the “send a FIN after you’re done talking” rule, the exception is RST, which is the equivalent of “shut the flock up“. An RST means that one side received data it was not prepared to accept.

An RST packet was exactly what we were looking for. So the first question was: where do we send this packet?

Each TCP connection has two endpoints, described by the IP address (that is the machine, or rather the network card on the machine) and the TCP port number (this is a channel on the local machine).

The TCP port number on a server is well-known and is advertised as part of the server’s URL. A client, on the other hand, picks up a random (ephemeral in TCP jargon) TCP port number.

A non-obvious design choice is to send the RST packet to a server, and not to a client. This allows us to use the well-known port number on the server, rather than the one on the client.

Once the endpoint is identified, the task is to send a TCP RST segment to it, addressed by the IP address and the TCP port. While historically crafting TCP segments has been done by opening a raw socket and building the bits and checksums of the segment by hand, we used the “hping” utility instead.

The first version of the script did not work, as we missed an important caveat: the sequence numbers had to line up.

The TCP protocol ensures integrity of transmitted data by associating a position in the stream with every chunk of data it transmits. If a chunk of data has to be re-sent, the stream position will be re-sent along with the data, so that the receiving end knows what part of the data is being resent.

The TCP protocol calls these position identifiers “sequence numbers”. By default, a TCP endpoint will only accept packets that are in the expected range of sequence numbers, and will reject anything that would fall outside of the range.

These numbers do not have to be guessed, however. If packets are going to be injected from one of the endpoints, one can intercept the connection and inject the RST segment into a valid TCP window.

The following scriptlet does the work, provided you have tcpdump and hping installed on your system:

#!/usr/bin/env perl

use Getopt::Long ;

$DEV="" ;

$PORT=1899 ;

$ADDR="";

$DEBUG = 0 ;

$result = GetOptions( "port=i" => \$PORT,
		      "dev=s" => \$DEV,
		      "addr=s" => \$ADDR,
		      "dbg=i" => \$DEBUG) ;

chomp($DEV);

if ( $DEV eq "" && $ADDR eq "" ) {
	print "Usage: must specify either device or server address. Exiting\n " ;
	exit (1) ;
} elsif ( $DEV eq "" && $ADDR ne "") {
	open (NETSTAT_NR, "netstat -nr|");

	while (<NETSTAT_NR>) {
		my $nline = $_ ;
		if ( $nline =~ /(\w+\.\w+\.\w+\.\w+)\W+([\w\.\:\#]+)\W+(\w+)\W+(\d+)\W+(\d+)\W+(\w+).*/ ) {
			my ($route_dest,$route_gw, $route_flags,$route_refs,$route_use,$route_netif) =
				($1,$2,$3,$4,$5,$6) ;
			if ($DEBUG) {
				print "route dst: $route_dest\n";
				print "route gw: $route_gw\n";
				print "route flags: $route_flags\n";
				print "route refs: $route_refs\n";
				print "route use: $route_use\n";
				print "route netif: $route_netif\n";
				print " ----- \n\n";
			}
			chomp ($route_netif);
			if ( $ADDR eq $route_dest) {
				$DEV = $route_netif ;
				last ;
			}
		}
	}

	close(NETSTAT_NR);
}

print "DEV == $DEV\n" ;

$pkt_count=3; # number of RST packets to send

open (TCPDUMP, " tcpdump -l  -i ${DEV} -A -S -n tcp port ${PORT}  |") ;

print "Waiting for traffic on ${PORT}..." ;

while (<TCPDUMP>) {
	my ($line) = ($_);

	if ( $line =~ /\d+:\d+:\d+.\d+\WIP\W\d+\.\d+\.\d+\.\d+.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): P (\d+):(\d+)\(\d+\) ack (\d+) win (\d+).*/) {

		my ($src_port,$dst_ip,$dst_port,$seq_num,$next_seq_num,$ack_num,$win) = ($1,$2,$3,$4,$5,$6,$7);

		if ( $DEBUG ) {
			print "src port: $src_port\n";
			print "target ip: $dst_ip\n";
			print "dst port: $dst_port\n";
			print "seq num: $seq_num\n";
			print "next seq num: $next_seq_num\n";
			print "ack num: $ack_num\n";
			print "window: $win\n" ;
		}

		my $cmd = "hping -c $pkt_count -s $src_port -p $dst_port -M $next_seq_num -L $ack_num -R  $dst_ip" ;

		print "$cmd\n" ;

		system($cmd);

		last ; # exit the listen loop, we found our target
	}
}
END { close (TCPDUMP) ; }

Zero, One, Else

Sometimes, there are only three numbers that matter.

Stepping from Zero to One is when a component is being introduced. Stepping from One to Zero is when a component is being eliminated. Of these systems engineering processes, the latter is preferred, but the former is the reality. Still, these two processes are usually well defined and are driven by functionality.

The Else is the realm of non-functional requirements, the gate through which scalability problems enter the scene. Even if Else is only 4 today, tomorrow it may grow to 57.

The Else is where concurrency becomes a problem. The Else is where one should decide which way to scale – up or sideways. The Else is when no one can speak for sure without referring to historical data. The Else is the 2 am call when all is down and nothing works because memory was upgraded and the cache miss ratio is through the roof.

Stack traces on Linux/x86

While reading Stack Overflow I recalled a little snippet I wrote 4 years ago.

This snippet allows the programmer to get a stack trace in a C environment on Linux/x86 whenever a memory access problem happens. x86_64 is not supported yet.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* memset */
#include <signal.h>
#include <execinfo.h>
#include <unistd.h>

#define __USE_GNU
#include <ucontext.h>

typedef void signal_specific_handler(int signum, siginfo_t *pinfo, void *arg) ;

struct sigaction_pack
{
 struct sigaction         current ;
 struct sigaction         original ;
 signal_specific_handler *pspecific ;
} ;

/* we are instrumenting the first 32 signals. We are NOT instrumenting the real-time signals */
static struct sigaction_pack g_pbs_signal_actions[32] ;
static size_t max_pbs_size = 32 ;

static void* g_a_activation_record_list[64] ;
static size_t max_activation_record_count = 64 ;
static size_t activation_record_count ;

/* This is a global variable set at program start time.  It marks the  highest used stack address.  */
extern void *__libc_stack_end;

/*------------------------------------------------------------------------------
 *
 *   A short refresh on the x86 assembly
 *   -----------------------------------
 *
 *    %eip -> current instruction
 *
 *    %esp -> top of the stack (the used element)
 *
 *   This is the stack layout we see with every stack frame.
 *
 *           +-----------------+
 *   %esp -> | local variable  |
 *           | local variable  |
 *           +-----------------+
 *   %ebp -> | %ebp last frame |-+
 *           |                 | |
 *           | return address  | |
 *           |(old instruction | |
 *           |   pointer )     | |
 *           +-----------------+ |
 *           | parameters 1..  | |
 *      +----| parameters 2... |--+
 *      |    | parameters---   |
 *      |    +-----------------+
 *      |    | local variable  |
 *      |    | local variable  |
 *      |    +-----------------+
 *      + -> | %ebp last frame----....
 *           |                 |
 *           | return address  |
 *           +-----------------+
 *
 *
 */

/**
 * Activation record layout
 * Used to parse stack contents while unwinding the stack.
 */
struct activation_record_layout
{
 struct activation_record_layout * next;
 void *return_address;
};

/**
 * Unwind the stack at any location.
 * This function works like the backtrace function provided by the GNU libc,
 * with the only difference that this function doesn't unwind _current_ stack.
 * Instead, it takes values of the relevant registers (EIP, EBP and ESP) as
 * parameters.
 * Thus, this function is suited for being called from a signal handler.
 */
int
rhyno_backtrace (
   void **activation_frame_array,
   int    max_size,
   void  *eip,
   void  *ebp,
   void  *esp)
{
 /* used to parse the stack */
 struct activation_record_layout *current;

 /* keeps track on visited stack trace */
 int count = 0;

 /* save the EIP into the frames list */
 activation_frame_array[count++] = eip ;

 /* Initialize the frame list with EBP */
 current = (struct activation_record_layout *) ebp;

 while (count < max_size)
 {
 /* If we reach past stack pointer, or end of stack segment,
 * we have problem (possible memory overrun.)
 */
 if ((void *) current < esp || (void *) current > __libc_stack_end)
 break;

 /* Save the return address of frame i into the frame list */
 activation_frame_array[count++] = current->return_address;

 /* Advance the frame list pointer */
 current = current->next;
 }

 /* How many frames we had collected */
 return count;
}

void sa_segv_specific_hanlder(int signum, siginfo_t *pinfo, void *arg)
{
 /*
 Example for SEGV...
 */
 printf("  Segmentation violation: Reason: %d: %s\n",
 pinfo->si_code,
 pinfo->si_code == SEGV_MAPERR ?
 "Address not mapped" : pinfo->si_code == SEGV_ACCERR ?
 "Invalid permissions" : "?????") ;
 printf("    Offending address: %p\n", pinfo->si_addr ) ;
}

void sa_bluescreen_handler(int signum, siginfo_t *pinfo, void *arg)
{

 /*
 siginfo_t {
 int      si_signo;  / * Signal number * /
 int      si_errno;  / * An errno value * /
 int      si_code;   / * Signal code * /
 pid_t    si_pid;    / * Sending process ID * /
 uid_t    si_uid;    / * Real user ID of sending process * /
 int      si_status; / * Exit value or signal * /
 clock_t  si_utime;  / * User time consumed * /
 clock_t  si_stime;  / * System time consumed * /
 sigval_t si_value;  / * Signal value * /
 int      si_int;    / * POSIX.1b signal * /
 void *   si_ptr;    / * POSIX.1b signal * /
 void *   si_addr;   / * Memory location which caused fault * /
 int      si_band;   / * Band event * /
 int      si_fd;     / * File descriptor * /
 }
 */
 printf("Got deadly signal %d, following information (%p) is available:\n",
 signum,pinfo ) ;

 printf(" Signal %d (code %d, errno %d) send by process %d (uid %d)\n",
 pinfo->si_signo,
 pinfo->si_code,
 pinfo->si_errno,
 pinfo->si_pid,
 pinfo->si_uid ) ;

 /* Call message specific handler.
 For example, the segmentation fault handler will
 print the offending address
 */
 if (g_pbs_signal_actions[signum].pspecific != NULL)
 {
 (*g_pbs_signal_actions[signum].pspecific)(signum,pinfo,arg) ;
 }

 if ( arg == NULL )
 {
 printf("No register/stack information is available\n") ;
 }
 else
 {
 printf("Important CPU Registers:\n" ) ;
 printf("CS:  %012p DS:  %012p  SS:  %012p \n",
 ((ucontext_t*)(arg))->uc_mcontext.gregs[REG_CS],
 ((ucontext_t*)(arg))->uc_mcontext.gregs[REG_DS],
 ((ucontext_t*)(arg))->uc_mcontext.gregs[REG_SS]
 ) ;

 printf("EIP: %012p EBP: %012p  ESP: %012p\n",
 ((ucontext_t*)(arg))->uc_mcontext.gregs[REG_EIP],
 ((ucontext_t*)(arg))->uc_mcontext.gregs[REG_EBP],
 ((ucontext_t*)(arg))->uc_mcontext.gregs[REG_ESP]
 ) ;

 printf("\n\nStack Trace:\n") ;

 /*------------------------------------------------------------------------------
 * Collect activation records in format similar to that of
 * glibc's `backtrace', but use the foreign stack.
 */
 activation_record_count = rhyno_backtrace(
 g_a_activation_record_list,
 max_activation_record_count,
 (void*)(((ucontext_t*)(arg))->uc_mcontext.gregs[REG_EIP]),
 (void*)(((ucontext_t*)(arg))->uc_mcontext.gregs[REG_EBP]),
 (void*)(((ucontext_t*)(arg))->uc_mcontext.gregs[REG_ESP])
 ) ;

 /*------------------------------------------------------------------------------
 * Use internal glibc reporting mechanism to convert
 * activation records to symbolic names (requires -rdynamic linker flag)
 */
 backtrace_symbols_fd(g_a_activation_record_list,activation_record_count,1) ;

 } /* register information */

 /* propagate the signal to the original handler */
 if ( g_pbs_signal_actions[signum].original.sa_sigaction != NULL )
 {
 printf("\n\nPropagating the signal %d to the original handler\n",
 signum) ;
 g_pbs_signal_actions[signum].original.sa_sigaction(signum,pinfo,arg)  ;
 }
 else
 {
 printf("No original handler is registered for signal %d, continuing\n",
 signum ) ;
 }
}

int register_signal_handlers()
{
 int signum ;
 int result = -1 ;

 for ( signum = 0 ; signum < max_pbs_size ; ++signum)
 {
 /* zero the signal action */
 memset(&g_pbs_signal_actions[signum].current,0,sizeof (struct sigaction)) ;

 /* register the generic handler */
 g_pbs_signal_actions[signum].current.sa_sigaction = sa_bluescreen_handler ;

 /* turn on the flag that tells the runtime to invoke the extended signal handler,
 see man sigaction for details */
 g_pbs_signal_actions[signum].current.sa_flags |= SA_SIGINFO ;

 /* Register the signal handler, store the original signal handler for propagation */
 result = sigaction(signum,&g_pbs_signal_actions[signum].current,
 &g_pbs_signal_actions[signum].original) ;
 /* Report result */
 printf ("-- Registered signal handler for signal %d, result %d, handler %p original handler %p \n", signum,result,
 g_pbs_signal_actions[signum].current.sa_sigaction,
 g_pbs_signal_actions[signum].original.sa_sigaction) ;
 }

 g_pbs_signal_actions[SIGSEGV].pspecific = sa_segv_specific_hanlder ;

 return result ;
}

/* Function prototype. Here we will do THINGS */
int malicious_code(int) ;

int
main (int argc, char* argv[], char* envp[])
{
 int sleep_interval = 300 ;
 /* Very important - register the handlers */
 if ( argc > 1 )
 register_signal_handlers() ;

 sleep(sleep_interval) ;
 malicious_code(10) ;

 return 0 ;
}

int malicious_code(int arg)
{
 if ( ! arg )
 {
 int * garbage_pointer = (int*)441231324;
 return *((int*)(garbage_pointer)) ;
 }
 else
 {
 return malicious_code(--arg) ;
 }
}

Data Leak Prevention and Birthday Attacks

Data Leak Prevention is a set of technologies used to monitor information leaving the corporate network and to take action when sensitive information “leaves the building”. Detection of sensitive information is usually done using pattern matching with checksums (e.g. credit card numbers, which carry quite a bit of redundant information for validation), content-proximity heuristics (e.g. US social security numbers, which carry no redundant information for validation and, without taking context into the equation, can be mistaken for Canadian driver license numbers), digital fingerprints, and combinations of the above.

A Birthday Attack is a way to meddle with digital signatures that greatly reduces effective key strength. This attack is based on (and takes its name from) the Birthday Paradox. It resorts to slightly tweaking two documents (by inserting commas, reordering words, and using synonyms) – one which is considered to be correct, and one which is not – until both have the same digital signature. Once the digital signatures collide, one can use the “good” document (e.g. a request for a day off) to generate the digital signature, and then pass off the “bad” document (e.g. an approval for a fund transfer) as legitimate. The Birthday Attack halves the effective key depth, turning 64-bit digital signatures into 32-bit ones.
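The effect is easy to demonstrate on a deliberately weakened fingerprint: truncate SHA-256 to 16 bits and tweak two documents by appending whitespace and punctuation until they collide. This is a toy model (the function names and tweak strategy are ours), but it shows why roughly 2^(n/2) attempts suffice where brute force would need 2^n:

```python
import hashlib

def truncated_digest(text: str, bits: int) -> int:
    """Truncate SHA-256 to `bits` bits to model a weak fingerprint."""
    digest = hashlib.sha256(text.encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

def birthday_collision(doc_a: str, doc_b: str, bits: int = 16):
    """Tweak two documents (here simply by appending spaces and commas)
    until their truncated fingerprints collide. With an n-bit fingerprint
    this takes on the order of 2^(n/2) variants of each document."""
    tries = 1 << (bits // 2 + 2)
    seen = {}
    for i in range(tries):
        variant = doc_a + " " * i
        seen[truncated_digest(variant, bits)] = variant
    for i in range(tries):
        variant = doc_b + "," * i
        fp = truncated_digest(variant, bits)
        if fp in seen:
            return seen[fp], variant  # two documents, one fingerprint
    return None

pair = birthday_collision("request for a day off", "approval for fund transfer")
```

With a 16-bit fingerprint, a thousand or so variants of each document are virtually certain to produce a collision; real signatures just move the same arithmetic to larger exponents.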

Returning to the DLP problem: corporations usually have some sort of legal banners, footers, headers, and sidebars, which comprise all kinds of disclaimers, terms, conditions, and threats. As these are usually not considered sensitive, good fingerprinting schemas will contain “black lists” of sensitive fingerprints along with “white lists” of non-sensitive fingerprints.

If the potential “extruder” – the person who wants to get the information out of the organization and get away with it – can slightly influence the white lists, he can use the Birthday Attack to mask sensitive bits as legalese, which provides plenty of opportunity for meddling by being inhumanly long and unreadable. Once the fingerprint for the sensitive information collides with the fingerprint for the legalese, the “extruder” can safely circumvent the DLP system.

The DLP system, in order to prevent this kind of attack, should do at least one of following:

  • Disallow meddling with white lists (keep lawyers on track)
  • Use hash functions with sufficient key depth, so that the fingerprints remain cryptographically resistant to birthday attacks. For example, an effective strength of 2^64 is pretty strong; this suggests using a hash function with a key depth of 128 bits.
  • Combine fingerprints with context information. Seeing a fingerprint from a banner in the middle of otherwise neutral text should set some bits on.

Autonomous Software, circa 1997

Take your software, put it in a box, throw it out of the window.

Sometimes, throwing the software out of the window is not an act of frustration. Sometimes, it is just part of the requirements. In particular, if you put some tricky hardware inside the box and throw it hard enough, it may reach other planets.

However, one must make sure the software works while being thrown out of the window, so it can reach its final destination.

I am referring, of course, to the Pathfinder project. The Authoritative Account on the infamous software failure is a must read for software engineers.