When will it be done? Who knows?

Several months ago an exceptional former manager of mine planted a seed in my mind about how to determine project timelines. We were discussing the timeline of a particularly complex calendaring project and he suggested that it may be impossible to determine such a timeline. At the time, the idea of my manager telling me that something like this was impossible struck me as odd. It flew in the face of everything I had learned in my software engineering classes. I had already discarded most of the things I learned in those classes as not being compatible with the real world, but the concept of mapping out timelines remained. I felt that an experienced manager should be able to determine an approximate completion date.

The more I’ve thought about this, the more I believe that my former manager was right in more ways than he realized. The only way to determine an accurate timeline is to draw on previous experience so you can predict the future. The problem with software development is that there is almost never any previous experience to draw from. If you’re repeating an old project or something very close to it, then the work is really already done and you don’t need to bother with much development. Otherwise, you’re treading into the unknown. How long does it take to build the unknown? It’s a shot in the dark at best.

I still believe it’s a good idea to have target dates. They provide something to work toward and keep productivity high. Be prepared to meet those targets with reduced feature sets, though, as certain pieces of the technology don’t work out or prove to be more complicated than originally expected (which is often the case). Don’t get rid of good workers because they couldn’t predict the impossible. Definitely don’t make the mistake of promoting a hard launch before you have a product that works! It doesn’t take much research to realize that this approach fails more often than not.

Posted by chrisp Wed, 18 Jul 2007 16:45:00 GMT


Paging all data exports

There’s always a need for importing, exporting, and synchronizing data between various applications, especially in corporate environments.

As data has grown by orders of magnitude in my databases, the amount of resources required to run these processes has shot up dramatically. Many databases are now 10 times larger and exports that once took 2 or 3 minutes now take a good hour or more.

Think about that for a second and you’ll notice something strange. Shouldn’t 10X more data only take 20-30 minutes to export rather than an hour?

I’ve recently had to go back and make changes to some (of my own) legacy code doing such an export, and I was so aggravated by the excessive run times that I decided to do something about it.

After looking into it, I figured out that the problem I was running into was excessive garbage collection. By pushing gigabytes of data into memory, not only was the OS working overtime to allocate all of that, but the garbage collector was having to run constantly in an effort to keep memory use down. When I tried running the process on my MacBook that “only” has 2 GB of RAM, the process ate up all 2 GB and virtually starved itself to death. I finally gave up and killed it.

The solution may seem obvious, but I really haven’t come across it before:

Page all data that gets pulled.

I always thought of data paging as something done to improve the user experience and not a method of improving performance when you need to work on all of the data (quite the opposite really). While it does slow things down for a small number of results, it really makes a HUGE difference as your application grows.

As a Rubyist, I now rely on my relatively new friend paginating_find to meet my paging needs. I set up an enumerated version of my export process to pull pages of 1000 records at a time. Experiment with the number of results per page until you get your desired performance. A lower number of results per page will tax your database, while a higher number will tax your web server and increase your memory requirements. I found that at 1000 records the database would carry about 15% of the load, and about 100MB of memory would be used.
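
To give a rough idea of the shape of a paged export, here is a minimal sketch in plain Ruby. It is not my actual export code and it does not use the paginating_find API. RECORDS, fetch_page, each_page and export are all hypothetical names made up for illustration, and the in-memory array only stands in for a database table. The point is that only one page of records is held at a time.

```ruby
require "stringio"

# Hypothetical stand-in for a database table (an assumption for this sketch).
RECORDS = (1..2_500).map { |i| { id: i, name: "record-#{i}" } }

# Hypothetical page fetcher. With a real database this would be a
# LIMIT/OFFSET style query issued through your pagination library.
def fetch_page(page, per_page)
  RECORDS[(page - 1) * per_page, per_page] || []
end

# Yields one page of rows at a time, so earlier pages can be garbage
# collected instead of the whole result set sitting in memory at once.
def each_page(per_page)
  page = 1
  loop do
    rows = fetch_page(page, per_page)
    break if rows.empty?
    yield rows
    break if rows.size < per_page # a short page means it was the last one
    page += 1
  end
end

# Writes every record to io as a tab-separated line, one page at a time.
# Returns the number of pages that were fetched.
def export(io, per_page: 1000)
  pages = 0
  each_page(per_page) do |rows|
    rows.each { |row| io.puts([row[:id], row[:name]].join("\t")) }
    pages += 1
  end
  pages
end

out = StringIO.new
pages = export(out, per_page: 1000)
```

The per_page argument is the knob described above: smaller pages mean more round trips to the database, larger pages mean more memory held per iteration.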

What was the performance difference? The process has gone from 60+ minutes per run to just over 7 minutes. Wow!

Posted by chrisp Sat, 27 Jan 2007 02:56:00 GMT