Tuesday, June 01, 2010

Python generators and the dict.items() function

During the course of yesterday's session, there was an example of iterating over a dictionary(dict) in python. The code looks something like this:

d = {
"name": "chuck norris",
"age": "positive infinity",
}
for k,v in d.items():
print "Key: " + str(k) + ", Value: " + str(v)

The d.items() function returns a list which will take up a non-trivial amount of memory for large dicts that we wish to iterate over. For simplicity's sake if we assume that each pointer takes up 8 bytes and the python tuple type takes up 32 bytes(say) and there are 1 million(106) entities in the dict(d), then we land up with 106 x 56 (Two 8 byte pointers, each to each the key and the value objects, 8 for a pointer to the tuple and 32 for the tuple object itself) which is about 56MB of memory used for just iterating over the dictionary.

If however, we use generators, then we can save all that memory.
However, you can directly print the result of d.items(). Printing a generator object doesn't do anything spectacular, so we will need to create a proxy object to print the result of generation if string coercion is requested. The code for doing so is shown below. Notice that we use the .keys() function within our custom generator. We won't need to use it if we really were the dict object and had handles to the internal variables.

d = {
"name": "chuck norris",
"age": "positive infinity",
}

def dict_items(d):
"""
Returns an object which can be iterated over in a for loop
as well as printed. This is achieved by returning an object
which is both iterable as well as convertible to a string.
However, iteration involves using a generator for fetching
the next value in the sequence
"""
def dict_generator():
dkeys = d.keys()
for k in dkeys:
yield (k, d[k])
dg = dict_generator()

def dgen_to_str():
return str(d.items())

class Dummy(object):
def __getattr__(self, attr):
return getattr(dg, attr)
def __str__(self):
return dgen_to_str()
def __iter__(self):
return dg

proxy = Dummy()
return proxy

diter = dict_items(d)
print diter

for k,v in diter:
print "Key: " + str(k) + ", Value: " + str(v)


However, the major difference comes when you try to do something like this:

for k,v in d.items():
d.clear()
print "Key: " + str(k) + ", Value: " + str(v)

The code above will work as expected and print all the elements of the dict(d)


for k,v in diter:
d.clear()
print "Key: " + str(k) + ", Value: " + str(v)

However, the code above will throw an exception since our generator caches all the keys and dereferences the dict(d) to get the value for the corresponding key.

So, applications that mutate the dict while iterating over it won't quite work as expected.

No comments: