Optimization, Correctness, and Laziness

Nov 3, 2017 18:57 · 1607 words · 8 minutes read meta

Premature optimization is the root of all evil – Donald Knuth

This quote is really memorable; “root of all evil” sticks in the mind. There’s a temptation to interpret it as something simple like:

Don’t optimize before you have to

Or maybe even simpler

Don’t optimize unless you have to

Which can lead to a dangerous thought

Don’t optimize at all now, do it later

The full quote for comparison is

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. – Donald Knuth

Knuth’s concern is that people are constantly optimizing trivial things that should not be optimized: the 97% of code that already runs as efficiently as it needs to. The last sentence is just as important: in the 3% case you should optimize, because that optimization is important for your implementation to be correct. If your profiler says the algorithm runs too slowly for your expected use case, it’s time to think about optimizations.

Define Optimize

What I really want to talk about here, though, is not just this quote but the whole concept of optimization. There are multiple types of optimization, and if you work only from the pithy version of the Knuth quote it’s easy to pass up something that doesn’t fit the 3% case but that also shouldn’t be written off as premature optimization. The dictionary definition of optimization is:

The action of making the best or most effective use of a situation or resource.

But this still leaves room for interpretation. Hand-tuning assembly or rewriting some critical part of your application in C is an optimization. Using the correct data structure is also an optimization. Not using a linear search over a large collection when a binary search would be more efficient is another optimization. Those last two optimizations are likely not premature; they are necessary for the code to be correct. Below is a quick sketch of the search point, followed by two examples from production.
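The data here is made up and the standard-library bisect module just stands in for “use a better algorithm”; this isn’t code from either example:

import bisect

# A made-up sorted collection of ids
user_ids = sorted(range(0, 1_000_000, 7))

def linear_contains(ids, target):
    # O(n): touches every element in the worst case
    return any(i == target for i in ids)

def binary_contains(ids, target):
    # O(log n): requires the list to be kept sorted
    pos = bisect.bisect_left(ids, target)
    return pos < len(ids) and ids[pos] == target

print(linear_contains(user_ids, 700_007), binary_contains(user_ids, 700_007))

Both calls return the same answer; the difference is how much work it takes to get there once the collection is large.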

Unnecessary Work

This is a problem I ran into in production that kept myself and another engineer working late on a Friday evening. We were working on a basic CRUD application with some specific business related functionality that uses MongoDB as a backend. There’s some functionality that requires a set of subdocuments to be kept in order so they can be displayed on the frontend. The overall document is structured like:

{
    name: "",
    description: "",
    // Some other information
    documentTypeOne: [{
        name: "",
        description: "",
        order: 1,
        active: false
    }, {
        name: "",
        description: "",
        order: 2,
        active: true
    }],
    documentTypeTwo: [{
        name: "",
        description: "",
        order: 1,
        active: true
    }]
}

The order field is necessary; you can’t just rely on the position in the array (take my word for it). The problem here is that when you insert a new subdocument you need to update the order field of every other subdocument in the list. When this feature was originally implemented, it was assumed that insertions would be infrequent and that there would only ever be a few subdocuments, say 100 at the most. So the code for updating the order looked like:

# Insert the new subdocument with its correct order

for subdocument in documentTypeOne:
    update_one(subdocument, {"$set": {"order": newOrder}})

The idea is that you insert the new subdocument with its correct order and then update each existing subdocument individually with the new order it should have. Not as a batch. This is obviously inefficient and could be improved, but the code made it into production regardless.
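For comparison, here’s one way the loop could have been batched into a single round trip. This is only a sketch on top of the pseudocode above, assuming pymongo; collection and filter_for() are hypothetical stand-ins for however the real code addressed each subdocument:

from pymongo import UpdateOne

# Hypothetical: build one bulk operation per subdocument instead of
# issuing an individual update_one call for each.
operations = [
    UpdateOne(filter_for(subdocument), {"$set": {"order": new_order}})
    for new_order, subdocument in enumerate(documentTypeOne, start=1)
]
if operations:
    collection.bulk_write(operations)  # One round trip to the database

Batching wouldn’t have fixed the real problem described next, but it would have made the inefficiency far less painful.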

Problem

Elsewhere, a client of this API started inserting subdocuments at a significantly higher rate than usual, around 500-1000 per day. After the first or second day of this, everything was still working correctly, but as time wore on the API started behaving more and more strangely. As more subdocuments were added, more and more work was needed to update the order field of all the subdocuments that came before them. Eventually things got to the point where the load average on the database server hit 80 (over 15 minutes).

Digging into the logs, the problem became apparent. The collection in question contained a few thousand documents and whenever clients using the API would request new documents to be created, every one of those thousands of documents had to be updated. So maybe 5000 individual update statements. To make matters worse, the client would often request new subdocuments to be inserted in bunches, maybe 20-30 at a time. The DB was totally bogged down doing 100-150k updates.

Digging in more, we discovered something truly horrific: the update statements were not even necessary! For the document in question, some of the subdocuments have to be ordered and some do not. The order field is used only to display things in the UI, and for some subdocument types the domain doesn’t require an order field at all. It just exists and gets updated for no reason.

Solution

If an update isn’t needed, don’t perform one:

# Insert the new subdocument with the correct order

if documentType != "documentTypeOne":
    for subdocument in document[documentType]:
        update_one(subdocument, {"$set": {"order": newOrder}})

Predictably, after deploying the patch, the load on the DB dropped to 2.

Wrong Algorithm

There’s another application I maintain that has a use case for processing files as they come into a certain directory. There’s a watch set with inotify on that directory, and when a file comes in we want to conditionally do something if a file of a corresponding type exists in the same directory.
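For context, the watch itself can be set up with something like the following. This is a sketch using the watchdog library rather than the actual code, and /incoming is a hypothetical path; processFile, shown next, is what runs for each new file:

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class IncomingFileHandler(FileSystemEventHandler):
    # Hand each newly created file to processFile (defined below)
    def on_created(self, event):
        if not event.is_directory:
            processFile(event.src_path)

observer = Observer()
observer.schedule(IncomingFileHandler(), "/incoming")  # hypothetical watched directory
observer.start()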

import os

def processFile(path):
    files = os.listdir(os.path.dirname(path))
    correspondingFile = [f for f in files if ...]  # Some complex conditional here
    if correspondingFile:
        # Do some action
        pass

Every file coming into the directory gets processed with this processFile function. When the code was originally written, it was assumed that the directory would include only a few thousand files (maybe 10k at most) and indeed that’s what the code was tested with. Testing went fine and the code was deployed to production.

Problem

Much later, our team noticed that the application was running unusually slowly. Logging showed that execution was proceeding into the processFile function and then just stalling. After listing the directory and counting the files, we got a nasty surprise: 173k files, way more than the 10k maximum predicted.

Running the code on a sample directory of 173k files confirmed that the problem was in the processFile implementation. Not only was the listdir call slow, iterating over its result was slow as well. Even worse, this work had to be repeated for every file that came into the directory. So just like in the previous example, the more work you create, the worse off you are; it’s a compounding problem.
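A rough back-of-the-envelope calculation shows why it compounds. Assuming, for simplicity, that every one of those files arrived through this code path and the directory only ever grew:

# Each arriving file scans everything already in the directory, so the total
# work is roughly 1 + 2 + ... + (n - 1) directory entries examined.
n = 173_000
total_scanned = n * (n - 1) // 2
print(f"{total_scanned:,} entries examined in total")  # roughly 15 billion

Even if each entry takes only a microsecond to examine, that’s hours of accumulated work just deciding whether a corresponding file exists.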

Solution

Just like last time there’s a relatively simple fix. Instead of doing a scan, keep track of every file we’ve seen so far and then do a lookup:

import os

files_seen_so_far = {}

def processFile(path):
    if os.path.basename(path) in files_seen_so_far:
        # Do the complex operation
        pass
    else:
        # Get the path into a key format that we can look up later
        key_to_add = ...
        files_seen_so_far[key_to_add] = path

In this case the problem was doing a sequential scan when a constant-time lookup would suffice. There’s no need to call listdir or iterate at all.

Correctness vs Optimization

Coming back to the quote

Premature optimization is the root of all evil

Was the original code, at the time it was written, right or wrong? It’s true that it was not prematurely optimized. It was optimized exactly at the point we found out it needed to be more performant, when our assumptions were invalidated in a production environment. At that point we only had a little bit of time to get a patch out that would make the code performant enough.

I would argue that the code was not just suboptimal, it was not even correct. Stated another way, nobody has ever said:

Premature correctness is the root of all evil

Optimizations fall on a spectrum. In the cases above, there was literally no reason not to make the algorithms more optimal. We’re not talking about rewriting the code in C or building a system that can process 10 thousand transactions a second when you only get a million a day. Premature optimization of that kind is a waste of time and resources, that’s true. But when you implement something that is going to be executed 97% of the time for your expected workload, it’s worth asking:

  1. Is this workflow guaranteed to be executed only a specific number of times, and is that number high or low?
  2. If it’s executed an unbounded number of times, what is the actual behavior and what are the performance characteristics?
  3. If the performance characteristics are bad, how can it be optimized and what is the cost of that optimization? Are you sure?

In cases like the above, “optimization” costs maybe 30 minutes of extra thought and a minute or two of refactoring. The real cost ended up being several hours of debugging late in the evening and hacky patches that will require more work to refactor after deployment.

Premature optimization can be a problem, but there is no such thing as premature correctness. And “premature optimization” cannot be used as a defence for failing to implement something correctly.