Tag Archives: programming

Zero, One, Else

Sometimes, there are only three numbers that matter.

Stepping from Zero to One is when a component is being introduced. Stepping from One to Zero is when a component is being eliminated. Of these systems engineering processes, the latter is preferred, but the former is the reality. Still, these two processes are usually well defined, and are driven by functionality.

The Else is the realm of non-functional requirements, the gate through which scalability problems enter the scene. Even if Else is only 4 today, tomorrow it may grow to 57.

The Else is where concurrency becomes a problem. The Else is where one should decide which way to scale – up or sideways. The Else is when no one can speak for sure without referring to historical data. The Else is the 2 am call when everything is down and nothing works because memory was upgraded and the cache miss ratio is through the roof.

Unicode Byte Order Detection

Once in a while I run into the following problem: given a stream of bytes, which are known to represent Unicode characters, what is the correct byte order?

Assuming that the byte stream is a valid Unicode stream, we can clearly see that if character Ci is in a certain byte order, then the next character, Ci+1, will be in the same byte order as well, unless a byte order mark (BOM) specifies the byte order for the following bytes. Moreover, assuming the Unicode byte stream represents a character stream, there is a certain probability that adjacent characters belong to the same word.

Words of the same text are usually written in the same language, which implies the same Unicode code page. In other words: in human text, the number of different code pages will be low.

This is an important property, which we are going to exploit.

This number of different code pages clearly depends on the document: for a math text in English, some words will be in English, some in Greek, and some will be special symbols. On the other hand, a biology text in Romanian will include Romanian words intermixed with biology terms in Latin (probably in italics).

The location of the code page bytes within each character is governed by the byte order used by the Unicode byte stream. By determining which bytes are used for code pages, we can detect the byte order.

The algorithm is easy:

  1. Take a sampling buffer, _buf, from the stream. The size of _buf should be large enough for reliable detection, but not so large as to slow the algorithm down. 64 bytes sounds like a good guess.
  2. For the first pass: 
    1. Allocate 2 hash sets: LO_BYTE and HI_BYTE
    2. Iterate over _buf in 2-byte offset increments (stored in offset). Insert _buf[offset] into HI_BYTE and _buf[offset+1] into LO_BYTE 
  3. For the second pass:
    1. Allocate 2 hash sets: LO_NIBBLE and HI_NIBBLE
    2. Iterate over _buf in 4-byte offset increments. Insert (_buf[offset]&0xFF)<<8 | (_buf[offset+1]&0xFF) into HI_NIBBLE and (_buf[offset+2]&0xFF)<<8 | (_buf[offset+3]&0xFF) into LO_NIBBLE 
  4. Choose the hash set best with the lowest number of distinct entries, nbest.
    1. If nbest > EXPECTED_NUMBER_OF_LANGUAGES, the encoding is probably UTF-8
    2. Otherwise, if best is LO_BYTE, the encoding is UTF-16LE 
    3. Otherwise, if best is HI_BYTE, the encoding is UTF-16BE
    4. Otherwise, if best is HI_NIBBLE, the encoding is UTF-32BE
    5. Otherwise, if best is LO_NIBBLE, the encoding is UTF-32LE
    6. Otherwise, check your implementation – it probably contains a bug.
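The steps above can be sketched in Java roughly as follows. The set names follow the post; the EXPECTED_NUMBER_OF_LANGUAGES value of 8 is my own arbitrary guess, and ties between sets are broken in the order the post lists the checks:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class ByteOrderSniffer {

    // Assumption: above this many distinct "code page" values at a fixed
    // position, we give up on fixed-width encodings and guess UTF-8.
    static final int EXPECTED_NUMBER_OF_LANGUAGES = 8;

    public static String detect(byte[] buf) {
        // First pass: bytes at even and odd offsets (2-byte stride).
        Set<Integer> hiByte = new HashSet<>(), loByte = new HashSet<>();
        for (int i = 0; i + 1 < buf.length; i += 2) {
            hiByte.add(buf[i] & 0xFF);
            loByte.add(buf[i + 1] & 0xFF);
        }
        // Second pass: 16-bit halves at a 4-byte stride.
        Set<Integer> hiNibble = new HashSet<>(), loNibble = new HashSet<>();
        for (int i = 0; i + 3 < buf.length; i += 4) {
            hiNibble.add((buf[i] & 0xFF) << 8 | (buf[i + 1] & 0xFF));
            loNibble.add((buf[i + 2] & 0xFF) << 8 | (buf[i + 3] & 0xFF));
        }
        int best = Math.min(Math.min(hiByte.size(), loByte.size()),
                            Math.min(hiNibble.size(), loNibble.size()));
        if (best > EXPECTED_NUMBER_OF_LANGUAGES) return "UTF-8";
        // Ties are broken in the order the checks are listed in the post.
        if (loByte.size() == best) return "UTF-16LE";
        if (hiByte.size() == best) return "UTF-16BE";
        if (hiNibble.size() == best) return "UTF-32BE";
        return "UTF-32LE";
    }

    public static void main(String[] args) {
        String text = "byte order detection by code page frequency";
        System.out.println(detect(text.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(detect(text.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Note that for English text in UTF-16BE the high (code page) byte is nearly always 0x00, so HI_BYTE collapses to a single entry while the other sets stay large.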

Of course, if the BOM (byte order mark) is present, the whole issue of detecting the encoding becomes moot.
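For completeness, the BOM check itself is a plain prefix test. A minimal sketch; the only subtlety is that the UTF-32 marks must be tested before the UTF-16 ones, because FF FE is a prefix of the UTF-32LE mark:

```java
public class BomSniffer {
    // Returns the encoding implied by a byte order mark, or null if absent.
    public static String fromBom(byte[] b) {
        if (b.length >= 4 && b[0] == 0 && b[1] == 0
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) return "UTF-32BE";
        if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0 && b[3] == 0) return "UTF-32LE";
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) return "UTF-8";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
        return null; // no BOM: fall back to the statistical detection above
    }
}
```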

Lessons learned from interviewing developers – Part I

After interviewing many software developers, I felt the need to write down the common mistakes people make.

Multithreading

Unfortunately, many software developers have trouble with multithreading.

Take the deadlock concept, for example. This is what happens when you have a circular dependency between multiple threads (or within one thread, in really complicated cases). Thread A waits for thread B, B waits for C… which in turn waits for A.

Easy… but somehow, many people have a problem relating the deadlock to a cycle.

“A deadlock is when a thread tries to take a lock, but it is already taken!” – is one popular wrong answer. 

This is wrong, because this is not a deadlock. This is normal flow control between two threads.
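To make the cycle concrete, here is a sketch of my own (not from the interviews): two threads take two monitors in opposite order, and the JVM's ThreadMXBean confirms the resulting lock-wait cycle. The latch only forces the unlucky interleaving to happen every time:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockDemo {

    // Provokes an A->B / B->A lock cycle and returns how many threads
    // the JVM reports as deadlocked.
    public static int provokeAndCount() throws InterruptedException {
        final Object lockA = new Object();
        final Object lockB = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        spawn(lockA, lockB, bothHoldFirst);
        spawn(lockB, lockA, bothHoldFirst);   // opposite order closes the cycle
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] dead = mx.findDeadlockedThreads();
        while (dead == null) {                // poll until the cycle is visible
            Thread.sleep(50);
            dead = mx.findDeadlockedThreads();
        }
        return dead.length;
    }

    private static void spawn(Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();            // signal: first lock is held
                try { latch.await(); } catch (InterruptedException e) { return; }
                synchronized (second) { }     // never reached: the cycle is complete
            }
        });
        t.setDaemon(true);                    // let the JVM exit with both threads stuck
        t.start();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + provokeAndCount());
    }
}
```

If either thread took the locks in the same order as the other, no cycle could form – which is exactly the classic lock-ordering cure.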


Some people do know what a deadlock is, but they tend to confuse it with another problem in multithreaded applications – the race condition. A race condition occurs when multiple threads access the same resource, and the final state of the result depends on the order in which it was accessed. That order is, unfortunately, undefined – unless we define it with our own flow control, implemented using synchronization mechanisms. So a race condition is a problem where the result depends on undefined behaviour. A race condition is not a deadlock – it is a completely different beast. One rule of thumb: a race condition happens when there is not enough locking, while a deadlock is a sign that there is one lock too many.
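A minimal illustration of the lost-update form of a race, in Java (my own example): the plain counter is incremented with a non-atomic read-modify-write, so its final value depends on thread interleaving, while the AtomicInteger makes the order irrelevant:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    // Runs 4 threads, each incrementing both counters 100,000 times.
    // Returns {safeCount, unsafeCount}.
    public static int[] run() throws InterruptedException {
        AtomicInteger safe = new AtomicInteger(); // atomic read-modify-write
        int[] unsafe = {0};                       // plain int: lost updates possible
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    unsafe[0]++;                  // read, add, write back: not atomic
                    safe.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        return new int[] { safe.get(), unsafe[0] };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counts = run();
        System.out.println("safe   = " + counts[0]); // always 400000
        System.out.println("unsafe = " + counts[1]); // usually less: lost updates
    }
}
```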

Another manifestation of too many locks is starvation. Acquiring a lock is an expensive operation that has no immediate positive impact on the software. So when we acquire lots of locks, we are wasting CPU time. With enough waste, our threads can run low on CPU time.

One particular case of starvation happens when we start prioritizing threads. When threads are prioritized, the ones with higher priority should run faster than those with lower priority. But what does it mean to run faster? The CPU tends to run at the same speed all the time, unless it is overheated or put to sleep. To make a thread run faster, we should schedule it to run more often. However, there is a trap: if a high priority thread needs to acquire a resource that is held by a lower priority thread, the high priority thread will wait until the low priority thread – which is not scheduled to run as often as the high priority thread requires – releases the resource. This means that the high priority thread effectively becomes a low priority thread while it waits for the resource. This is, of course, the dreaded priority inversion problem.
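The blocking half of this can be shown deterministically in Java. A full priority inversion additionally needs a medium priority thread starving the low priority one, which is scheduler-dependent; this sketch (my own, with latches forcing the interleaving) only demonstrates that MAX_PRIORITY is powerless while another thread holds the monitor:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class PriorityInversionSketch {
    // Returns the observed event order.
    public static List<String> run() throws InterruptedException {
        final Object resource = new Object();
        CountDownLatch lowHoldsLock = new CountDownLatch(1);
        CountDownLatch letLowFinish = new CountDownLatch(1);
        List<String> events = new CopyOnWriteArrayList<>();

        Thread low = new Thread(() -> {
            synchronized (resource) {
                lowHoldsLock.countDown();
                try { letLowFinish.await(); } catch (InterruptedException e) { return; }
                events.add("low releases resource");
            }
        });
        low.setPriority(Thread.MIN_PRIORITY);

        Thread high = new Thread(() -> {
            synchronized (resource) {   // priority cannot help while the monitor is held
                events.add("high acquires resource");
            }
        });
        high.setPriority(Thread.MAX_PRIORITY);

        low.start();
        lowHoldsLock.await();           // low now owns the monitor
        high.start();
        while (high.getState() != Thread.State.BLOCKED) Thread.sleep(10);
        events.add("high is blocked despite MAX_PRIORITY");
        letLowFinish.countDown();       // only the low priority thread can unblock high
        low.join();
        high.join();
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        run().forEach(System.out::println);
    }
}
```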

I don’t really expect a Java programmer to be an expert in priority inversion, but too many people mess up deadlocks….

JPcap problem on Windows XP

Yesterday I spent a few hours being unable to use JPcap.


JPcap is a nice Java wrapper around winpcap/libpcap. It allows you to subscribe to a network interface and receive the network packets that go through it.


In order to subscribe to an interface, you need to know the interface name. JPcap provides the static method String[] net.sourceforge.jpcap.capture.PacketCapture.lookupDevices()

You can then choose one of the interfaces and use it in the non-static method net.sourceforge.jpcap.capture.PacketCapture.open(String interfaceName, boolean promiscuous)

Very easy, except it doesn’t work on Windows XP.

After skimming through a few forums that suggested double-quoting all backslashes in the device name (silly), using Unix device names (optimistic), or using root privileges (very optimistic), it turned out that indeed there was a bug.

On Windows XP the interface name, as reported by PacketCapture.lookupDevices(), consists of two names – the internal name, followed by a newline, followed by a user-readable name.

One should split the reported name and use only the internal name.

For example:

String internalDeviceName = deviceName.split("\n")[0];

packetCapture_.open(internalDeviceName);
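The splitting can live in a small helper. The class, the method name, and the device string below are my own made-up examples, not part of JPcap; the real names come from PacketCapture.lookupDevices():

```java
public class DeviceName {
    // On Windows XP, JPcap's lookupDevices() returns entries of the form
    // "<internal name>\n<human readable description>"; only the part
    // before the newline is accepted by PacketCapture.open().
    public static String internalName(String reported) {
        return reported.split("\n")[0];
    }
}
```

A hypothetical call site would then be packetCapture_.open(DeviceName.internalName(devices[0]), true) – on non-XP systems, where there is no embedded newline, the helper simply returns the name unchanged.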