Friday, June 10, 2011

A plea to all software developers: Use Units

Have you ever come across configuration files that read:

MAX_BUFFER_SIZE = 4096
MAX_TIMEOUT     = 20

WTF!!

4096 what?? Bits, bytes, kilobytes, megabytes?

20 what?? Milliseconds, seconds?

Please, please make the units part of the config option's name and say:

MAX_BUFFER_SIZE_BYTES = 4096
MAX_TIMEOUT_MILLISEC  = 20

5 comments:

Stallon said...

Shouldn't the unit be along with the value?

The .ini parsers I've come across use a default unit in case only a value is specified. You can override that by suffixing M, G, K, ...

MAX_BUFFER_SIZE = 4096 (default, in Kilobytes)
MAX_BUFFER_SIZE = 4096M (here Megabytes)

Maybe it's just me, but MAX_TIMEOUT_MILLISEC just doesn't sound right for a variable/constant name ;)

Dhruv Matani said...

@Stallon, you are mostly right in noticing that the unit should in fact be part of the value (on the right-hand side). However, most people (including me) don't spend much time writing tools to parse config files and validate the values, so it's just convenient to use the language's (Python/Ruby/JavaScript) hash map as a config file. For example, on Node.js you would do this:

config.js:

module.exports = {
  MAX_BUFFER_SIZE: 12,
  MIN_BUFFER_SIZE: 6
};

If, on the other hand, you wanted to put the units on the RHS, it would look like:

module.exports = {
  MAX_BUFFER_SIZE: "12MB",
  MIN_BUFFER_SIZE: "6MB"
};

Even then, the problem remains - what is the difference between small (b) and capital (B), which one is bytes, and which is bits?

Also, the config parser now needs to do extra processing on the values (12MB and 6MB in this case) before the consumer can use them. Furthermore, anyone reading the code at the point of usage has absolutely NO IDEA what the units of MIN_BUFFER_SIZE and MAX_BUFFER_SIZE are...
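
A minimal sketch of the extra processing described above. The names parseSizeBytes and MULTIPLIERS are hypothetical, and the rules it applies (suffixes are case-insensitive, B always means bytes and never bits, multipliers are powers of 1024) are assumptions chosen for illustration, not the behaviour of any particular config parser:

```javascript
// Hypothetical helper: convert a size string such as "12MB" into bytes.
// Assumptions: case-insensitive suffixes, "B" means bytes (never bits),
// and each step up is a factor of 1024.
const MULTIPLIERS = { B: 1, KB: 1024, MB: 1024 ** 2, GB: 1024 ** 3 };

function parseSizeBytes(text) {
  // Require an explicit suffix so that a bare "12" is rejected, not guessed at.
  const match = /^(\d+)\s*([KMG]?B)$/i.exec(String(text).trim());
  if (!match) {
    throw new Error('Unrecognized size: ' + text);
  }
  return Number(match[1]) * MULTIPLIERS[match[2].toUpperCase()];
}
```

Even with a helper like this, the b-versus-B ambiguity is only settled by a convention the reader has to know about, which is the point being made above.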

Astro said...

But such a practice makes identifiers much longer! It may worsen readability.

Bytes are the most common unit, as a byte is the smallest addressable quantity of memory. For larger units:
KB = 1024;
MB = 1024 * KB;
CDROM_SIZE = 650 * MB;

Timeouts are often uniform depending on the environment: milliseconds in JavaScript and Erlang for example, float values for seconds somewhere else...

Dhruv Matani said...

@Astro, in general I agree with your observation that it increases the length of the identifiers and that worsens the readability of the code where it occurs. It would be great if projects can maintain consistent units across the board when it comes to config. variables, but at times this doesn't happen and it leads to confusion.

Even in the example you mention, where timeouts are in ms, the next logical question is whether floating point values are permitted (fractions of a millisecond, not of a second). In a statically typed language such as C this wouldn't be a problem, but in dynamically typed languages it turns out to be a point of confusion (though not in the specific case of the timeout value). That apart, it would be nice if projects used the same units that the underlying runtime uses, but this rarely happens. For example, suppose you are providing an HTTP request library in JavaScript, and you really don't want to bother offering millisecond precision, since most people will only ever round the timeout to the closest second. You have 2 options:
[1] Have the API accept the timeout in seconds and multiply it by 1000 behind the scenes, or
[2] Expect the users to pad every input with 3 zeros just because the API expects ms.
What would you choose?
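
For illustration, option [1] might look roughly like the sketch below. The names secondsToMs and withTimeout are hypothetical and are not taken from any real HTTP library; the only runtime call used is the standard setTimeout, which takes its delay in milliseconds:

```javascript
// Hypothetical wrapper for option [1]: the public API takes seconds,
// and the conversion to the runtime's milliseconds happens in one place.
const MS_PER_SECOND = 1000;

function secondsToMs(timeoutSeconds) {
  if (!Number.isFinite(timeoutSeconds) || timeoutSeconds < 0) {
    throw new TypeError('timeoutSeconds must be a non-negative number');
  }
  // Round so that fractional seconds still yield a whole number of ms.
  return Math.round(timeoutSeconds * MS_PER_SECOND);
}

function withTimeout(timeoutSeconds, onTimeout) {
  // setTimeout expects milliseconds, so convert at the boundary.
  return setTimeout(onTimeout, secondsToMs(timeoutSeconds));
}
```

Putting the unit in the parameter name (timeoutSeconds) keeps the call site readable even though the runtime underneath works in ms.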

Just to quote an example, on this page, what do you think the units of io.sort.mb are?

http://wiki.apache.org/hadoop/HowToConfigure

megabytes or bytes? If they are MB, why are they different from the rest of the units (which are in bytes)?

Another option on this page:
http://hadoop.apache.org/common/docs/r0.18.3/hadoop-default.html

hadoop.logfile.size 10000000

Is that 10000000 bytes or lines?

This config explicitly mentions MB
fs.inmemory.size.mb 75

whereas this doesn't (is in bytes)
io.file.buffer.size 4096

dfs.blockreport.intervalMsec 3600000
vs.
dfs.heartbeat.interval 3
vs.
fs.s3.sleepTimeSeconds 10
vs.
mapred.tasktracker.expiry.interval 600000

Q. Which of the above are in seconds?

Compare them to the MySQL config options, which are so beautifully done: http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html. Every unit is consistent and follows the suffix convention that Stallon mentioned above.

In general I agree with your final observation that if consistent units are maintained throughout the project, then mentioning units is unnecessary.

Dhruv Matani said...

After having thought more about it, I think a better approach is the one that Astro has mentioned:

* Omit units from the configuration parameter
* Maintain consistent units across the project
* Document the default units for each type of quantity (e.g. bytes for buffer sizes, ms for intervals, etc.)
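
As a rough sketch of what this convention could look like in a Node.js-style config (the option names and values are reused from the examples earlier in this thread, and the KB/MB constants follow Astro's comment; none of this comes from a real project):

```javascript
// config.js (sketch). Unit conventions, documented once for the whole project:
//   sizes     -> bytes
//   intervals -> milliseconds
const KB = 1024;
const MB = 1024 * KB;

const config = {
  MAX_BUFFER_SIZE: 12 * MB, // bytes, per the convention above
  MIN_BUFFER_SIZE: 6 * MB,  // bytes
  MAX_TIMEOUT: 20           // milliseconds
};
```

In a real config.js the object would be exported (for example via module.exports) so that the rest of the project can require it.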

I shall be moving to this convention for all the projects I've worked on and shall work on in the future.

Thanks @Stallon and @Astro for taking the time out and sending in your feedback!!