Sunday 10 February 2013

Burrows-Wheeler transform (BWT)

 Burrows Wheeler Transform

I would ordinarily explain what the BWT is but to be honest I wouldn't be able to put it into better words than the article on wikipaedia so I am not going to try. So what then is this blog post about? Well simply I am reacquainting myself with Groovy after a long time of not doing any and decided to see what a simple (and I mean really simple) implementation would look like.

One of the major differences as far as I can see is that I've not included an EOF indicator/marker in the resultant string, instead I've indicated the row that the original data will exist on once the data is reconstituted from the encoded data.

First of all take the test string and rotate it storing each rotation:
def list = [args[0]]
(1..<args[0].size()).each { list.add( args[0][it..-1] + args[0][0..it-1] ) }


Saving this in a file called BurrowsWheeler.groovy typing this on the command line
groovy BurrowsWheeler.groovy "this is a test"
produces...absolutley nothing, however if we add list.each { println it } it all becomes far clearer with;

this is a test
his is a testt
is is a testth
s is a testthi
is a testthis
is a testthis
s a testthis i
a testthis is
a testthis is
testthis is a
testthis is a
estthis is a t
stthis is a te
tthis is a tes

The next thing the algorithm needs is to have these rows sorted, well that's simple try list.sort() and since we only have a bunch of text the default Comparator is fine, so that's stage two complete, add a  list.each { println it } to be sure.  A note on this simply printing the list will show everything that's in it, however it looks better when displayed using this closure method.


a testthis is
is a testthis
testthis is a
a testthis is
estthis is a t
his is a testt
is a testthis
is is a testth
s a testthis i
s is a testthi
stthis is a te
testthis is a
this is a test
tthis is a tes



So now on to the final stage of the compression, get the last column and the row number that the original data exists on.

def lastline = ""
def num
(0..<list[0].size()).each { lastline += list[it][-1]; (list[it]==args[0]) ? num = it : num }


I am sure there is  a simpler way to do this , just as I was sure that I could skip the sort stage by having a SortedList that would ensure the contents were sorted as items were added, but I lost patience looking for it :) 

Add a quick println "last column: ${lastline}, match num: ${num}" and we get;
last column: ssa tt hiie ts, match num: 12

As you can plainly see no compression has yet taken place, in fact by adding the row number we've increased the amount of data we're storing.  The point in the BWT is that we've now stacked far more of the duplicated characters together and can therefore use some run length encoding to compress it.


Uncompress / Reconstitute data

What good is making this data if we can't get back the original data, well it's as simple as taking the process in reverse.  So starting with the string  ssa tt hiie ts and row number 12.

def lastline =  ssa tt hiie ts
def anotherlist = []
(0..<lastline.size()).each {
   (0..
< lastline.size()).each { anotherlist[it] = lastline[it] + (anotherlist[it]?:'') }
   anotherlist.sort()
}


Printing this out gives;
0      a testthis is
1      is a testthis
2      testthis is a
3     a testthis is
4     estthis is a t
5     his is a testt
6     is a testthis
7     is is a testth
8     s a testthis i
9     s is a testthi
10    stthis is a te
11    testthis is a
12    this is a test
13    tthis is a tes


And sure enough our original data is right there on row 12.  This is by no means meant to be an industrial bwt it's simply just an example of educational groovy, hope you enjoy it.
 
Stack Overflow profile for Richard Johnson at Stack Overflow, Q&A for professional and enthusiast programmers