Most programmers today get to work at an extraordinary level of abstraction. This is obvious to anyone developing in JavaScript or Python. However, even ‘low level’ code often sits on top of the tremendous abstraction provided by the OS, drivers, and the hardware itself: a modern hard drive has more RAM (cache) than my first Windows 95 computer did.
That is why, whenever a developer starts talking to me about runtime or “scaling up” hardware, I ask about the bits and the bytes.
It’s tempting to look at the bandwidth or storage a system requires today and work from there: “We currently use X per user. For twice as many users we’ll need 2X. Perhaps with some engineering we could shave off 20%.”
This mode of thinking locks you into past inefficiencies. It’s lazy, and sometimes that’s ok: development time is expensive. However, when data costs start to affect the bottom line you need a better approach.
It’s time to go back to the basics.
Bits and Bytes
When evaluating data requirements, it’s important to start from the basics: what is the message you’re trying to encode?
- A single bit can represent a single binary (true or false) state.
- A byte, 8 bits, can represent a number from 0-255, or equivalently:
  - Any one of 256 states
  - Any combination of 8 binary states
- Text is most commonly encoded in UTF-8, which requires at least 1 byte per character but may use up to 4 bytes for characters outside the basic ASCII range, such as those used in other languages.
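If you want to poke at these building blocks yourself, here is a minimal Python sketch. The flag values and sample characters are arbitrary choices of mine, picked only for illustration.

```python
# One byte holds 8 bits, so it can take 2**8 distinct values (0-255).
states_per_byte = 2 ** 8

# The same byte can instead be read as 8 independent true/false flags.
# These flag values are arbitrary, chosen only for illustration.
flags = [True, False, False, True, False, False, False, True]
packed = 0
for flag in flags:
    packed = (packed << 1) | int(flag)  # shift left, then append the next flag

# UTF-8 uses 1 byte for ASCII characters and 2-4 bytes for everything else.
ascii_len = len("a".encode("utf-8"))              # Latin letter
accented_len = len("\u00e9".encode("utf-8"))      # e with acute accent
cjk_len = len("\u65e5".encode("utf-8"))           # a CJK character
emoji_len = len("\U0001F600".encode("utf-8"))     # an emoji

print(states_per_byte, packed)
print(ascii_len, accented_len, cjk_len, emoji_len)
```

The eight flags fit in a single value between 0 and 255, which is exactly the "8 binary states" reading of a byte.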
To provide some examples:
A phone number
- The text “1-202-303-4004” would require 15 bytes as a null-terminated UTF-8 string (14 characters plus the terminator).
- The number 12023034004 requires just 34 bits to represent in binary. A typical encoding would round this up to 8 bytes (int64), but 5 bytes would also be sufficient.
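The phone number figures are easy to check. This is a small Python sketch, not a recommended storage format; the variable names are my own.

```python
phone_text = "1-202-303-4004"
phone_number = 12023034004

# Text form: one byte per ASCII character, plus one byte for a null terminator.
text_bytes = len(phone_text.encode("utf-8")) + 1

# Integer form: the minimum number of bits, then rounded up to whole bytes.
bits_needed = phone_number.bit_length()
min_bytes = (bits_needed + 7) // 8

# Round-trip through the smallest byte encoding to confirm nothing is lost.
packed = phone_number.to_bytes(min_bytes, "big")
restored = int.from_bytes(packed, "big")

print(text_bytes, bits_needed, min_bytes, restored == phone_number)
```

The text form takes three times the space of the 5-byte integer, before you even consider how the surrounding record is encoded.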
This blog post
- is about 6,000 characters long. The plain text would require 6KB.
- is embedded on a web site with pictures and formatting. This requires about 113KB to be downloaded by your browser.
- requires you to scroll through 3-4 screens to read it. If you have a full HD monitor, that’s 1920 * 1080 pixels * 3 bytes per pixel * 4 screens = 24,883,200 bytes ≈ 25MB of data to be sent to your monitor… more if you include the transitions while you are scrolling.
These examples demonstrate how quickly data requirements change in different formats. Just “prettying up” this blog post with HTML and an image increases the content size by more than an order of magnitude. Going to the on-screen representation adds another two orders of magnitude.
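The same comparison can be worked through in a few lines of Python. The sizes are the rough figures quoted above, and I am assuming 4 screens of scrolling and treating 1KB as 1,000 bytes.

```python
import math

plain_text_bytes = 6_000   # about 6,000 characters of plain text
web_page_bytes = 113_000   # about 113KB downloaded by the browser

# One full HD frame: width * height * 3 bytes per pixel (24-bit colour).
frame_bytes = 1920 * 1080 * 3
screen_bytes = frame_bytes * 4  # assuming 4 screens of scrolling

# How many times larger each representation is than the previous one.
web_ratio = web_page_bytes / plain_text_bytes
screen_ratio = screen_bytes / web_page_bytes

print(f"{screen_bytes:,} bytes on screen")
print(f"web page vs text: {web_ratio:.0f}x, about 10^{math.log10(web_ratio):.1f}")
print(f"screen vs web page: {screen_ratio:.0f}x, about 10^{math.log10(screen_ratio):.1f}")
```

The on-screen total comes to 24,883,200 bytes, roughly 25MB. Each step in presentation multiplies the size even though the message itself has not changed.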
Now you try
Q: How much data would you need to represent the results of a survey with 32 optional multiple choice questions, each with 3 answers?
A: Each question has a total of 4 states: three answers plus unanswered. You need 2 bits to encode one of four states (00, 01, 10, 11), so you would need 32 * 2 = 64 bits in total. 64 bits = 8 bytes.
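To show that the 8-byte answer is achievable in practice, here is one possible packing in Python. The function names and the choice of 0 for "unanswered" are my own assumptions, not part of any standard.

```python
NUM_QUESTIONS = 32
BITS_PER_ANSWER = 2  # assumed coding: 0 = unanswered, 1-3 = the chosen answer


def pack_survey(answers):
    """Pack 32 answers (each 0-3) into a single 64-bit integer."""
    if len(answers) != NUM_QUESTIONS:
        raise ValueError("expected exactly 32 answers")
    packed = 0
    for index, answer in enumerate(answers):
        if not 0 <= answer <= 3:
            raise ValueError("each answer must be between 0 and 3")
        # Question i occupies bits 2*i and 2*i + 1.
        packed |= answer << (index * BITS_PER_ANSWER)
    return packed


def unpack_survey(packed):
    """Recover the list of 32 answers from the packed integer."""
    return [(packed >> (i * BITS_PER_ANSWER)) & 0b11 for i in range(NUM_QUESTIONS)]


answers = [0, 1, 2, 3] * 8  # sample responses, for illustration only
encoded = pack_survey(answers).to_bytes(8, "little")
decoded = unpack_survey(int.from_bytes(encoded, "little"))

print(len(encoded), decoded == answers)
```

Because every question uses all four states, there is no slack left to squeeze out: 8 bytes is the floor for this survey unless some answer patterns are known to be far more likely than others.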
The Big Picture
“we may well tell hym agayne, that he can not se the wood for the trees.”
–Thomas More, 1533
Of course, focusing on the bits isn’t always the right thing to do when you’re solving a big problem. There’s a reason formats like JSON are popular: they are easy to encode, decode, and debug. Their self-describing nature greatly simplifies backward and forward compatibility. Don’t come up with an arcane binary format just to save a few pennies or even a few dollars.
Instead, focus on the areas where there are significant costs. If you’re unsure what constitutes “significant”, a good starting point might be around 5% of your salary over the same time period.
Next, look at the potential savings based on a rough estimate of the most efficient theoretical solution. If that number is significant, start to consider options for improvement.
- The biggest wins often come from eliminating data that was not required in the first place. Do you have a graph that pulls a whole month of data each time someone scrolls by one day?
- Moving from one standard format (e.g. JSON or XML) to another slightly more specialized one (e.g. MsgPack) can often save 50-80% and be implemented with minimal effort.
- A fully custom solution often yields the largest gains, but also takes the most effort to develop and maintain.
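To get a feel for the gap between a self-describing format and a fully custom one, here is a rough Python comparison using only the standard library. The sensor record, its field names, and the fixed binary layout are all hypothetical, invented for this illustration.

```python
import json
import struct

# Hypothetical sensor reading, invented purely for illustration.
reading = {"sensor_id": 1042, "timestamp": 1700000000, "temperature": 21.5}

# Self-describing: the field names travel with every single record.
json_bytes = json.dumps(reading).encode("utf-8")

# Custom fixed layout (an assumption of this sketch):
# "<" little-endian with no padding, "H" uint16, "I" uint32, "f" float32.
LAYOUT = "<HIf"
binary_bytes = struct.pack(
    LAYOUT, reading["sensor_id"], reading["timestamp"], reading["temperature"]
)

# Decoding only works if the reader already knows the layout.
sensor_id, timestamp, temperature = struct.unpack(LAYOUT, binary_bytes)

print(len(json_bytes), len(binary_bytes))
```

The binary record is 10 bytes, several times smaller than the JSON, but it has lost everything that made the JSON easy to debug and evolve. That is the trade-off to weigh against the savings.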
Sometimes you have to walk away: it would be fun to write that custom encoder, but the three-week effort just isn’t worth it to save $60/month. Other times you find that a small change can cut a major expense in half, and pay for itself almost overnight.
The key is going back to basics. Unlock your mind from the way things are done now, and look for opportunities.
Put it into action
- Consider the largest expense your company or department has related to data.
- Take a look at that data and consider how efficiently it is represented.
- If you find a gap greater than 2x, spend 15-60 minutes considering how you could close it. If you have some ideas you think would pay off in less than 6 months, chat with your team lead.
Discuss
Use the comments section below to share your successes, and let me know what you think. Do you have some great ideas I didn’t mention above?