store numbers compactly in readable strings

Hey. While working on my masters project with a friend, we stumbled upon a minor puzzle. The storage we were going to use was designed to store only string values. But we wanted to store triples of large integers, so just writing them as decimals on strings would use more space than necessary. A number up to (2^32)-1 (i.e., 4294967295) can be stored on mere 32 bits, but when encoded in UTF-8 it takes 80 bits.

Well, reducing the space we needed to store those triples by half could help the project, so I looked around for any tools that could encode numbers as readable strings. Something like Base64 encoding, but that didn’t pad the numbers so much that the resulting string isn’t that smaller than the decimal number. Also, I took the chance to write that in Ruby, as I’m working with it now, and publish my first gem.

Decimal numbers are our way of representing values on base 10, that is, using 10 symbols. Base64, as the name says, uses 64 symbols. The chosen symbols are readable characters – all 26 alphabet letters in upper and lower case, the numbers 0 to 9, plus “+” and “/”. To represent a number what I had to to then was to actually change the base of the number from 10 to 64. This way the number 0 would be “A”, 1 would be “B”, 50 is “y”, 64 is “BA” and so on.

After experimenting a bit, mostly taking care with string building, I noticed that the same code could be used to any set of symbols. And it was a good perspective: there are actually 95 readable characters on the good old ASCII table. So instead of using just the 64 characters of Base64, I also kept a 95 characters set around for even better usage of space.

Code written, I got to the task of setting it up as a gem. It was actually simple, pretty straightforward as in the guide. You keep your code on the lib folder, add gemspec on the Gemfile and create a gemspec file. After that, you create and account on RubyGems, get the API key and then gem build, gem push and that’s it, gem published.

Really sweet. Now anyone who wishes to use it just have to run “gem install num_coder”, and run any of the examples described on the project README. For instance, you could get a list of numbers in the billions and encode it in a single string and back:

> NumCoder.fixed_encode95 [1234567890,1876543290,6758493021], 5
=> “/.y5M7#c69r|iNl”
> NumCoder.fixed_decode95 “/.y5M7#c69r|iNl”, 5
=> [1234567890, 1876543290, 6758493021]

Each number using five characters instead of the expected ten. The representation could be even more compact using an even larger character set for symbols, going for the rest of the UTF-8 range. But then it wouldn’t be that readable, depending on the platform. Halving the space will do for now!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s