GitLab Upgrade: Postmortem

This is an article I drafted and forgot about. It dates from January 2016.

This week at work we had an upgrade fiasco. If we want to lay some blame out there, I’ll take 50% for inadequate testing procedures, and lay the other 50% on the GitLab development team for making a breaking change to a versioned API. The basic issue is that they ripped some functionality out of a particular API with the idea of integrating it into the main API. Not only did they not replicate all of the functionality, but they also crippled the old API, leaving it mostly non-functional. Did I mention that this change is completely undocumented, to the point where the current documentation still describes the old API behavior? You can read my ticket for this issue here: https://gitlab.com/gitlab-org/gitlab-ce/issues/5599. As of this writing, it seems my issue is not being taken seriously.

The upgrade took place on Tuesday. Classes start on Monday. New Year’s fell on Friday this year, so we had a short week. What’s a boy to do? I wasn’t interested in rolling back, for two reasons. 1) By the time the issue came to light, there had been a bunch of activity on the system. Restoring the pre-upgrade database dump from several hours prior would have introduced data loss, which I think is a bigger evil than broken functionality. 2) GitLab has a very aggressive upgrade cycle: they release a new version monthly. This wouldn’t be so bad, except that once a new version drops, the previous versions are abandoned. As I only upgrade the system four times per year (at quarter breaks) [the GitLab project has since started back-porting security patches a couple of versions], I always start the quarter on the latest version, which seems the most responsible approach from a security and functionality standpoint. Given the above, I felt the only option available to me was to restore the missing API functionality. Here’s where this analysis descends into a curmudgeonly rant.

Ruby on Rails: WTF. Seriously. I’ve avoided Ruby stuff for years, because every time I’ve attempted to look at a RoR codebase I’ve come away with a sick feeling that I’m incapable of understanding it. For those of you who don’t know, the basic philosophy of RoR is to make extensive use of templating and convention, so that you only need to write a *very* small amount of code in order to do things. The idea works: if your application is set up properly, you can change an entire database call just by changing a class name, or pull different data back from the database with a very small variable change. The downside is that it’s damn near impossible to figure out what is actually going on under the covers. The whole thing takes programming abstraction to a near-terminal degree, making it very powerful but impenetrable to mere mortals unless you drink (main-line?) the Kool-Aid.
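To make that concrete for non-Rails readers, here is a minimal, hypothetical ActiveRecord sketch (my own illustration, not GitLab’s actual models) of how little code stands between a class name and a database query:

    require 'active_record'

    # The class name alone binds a model to a table ("Runner" -> "runners",
    # "Project" -> "projects"), so these two declarations are complete, queryable models:
    class Runner  < ActiveRecord::Base; end
    class Project < ActiveRecord::Base; end

    # Swapping the class swaps the entire underlying query; the call shape never changes:
    Runner.where(active: true)      # SELECT "runners".*  FROM "runners"  WHERE "runners"."active" = TRUE
    Project.where(archived: false)  # SELECT "projects".* FROM "projects" WHERE "projects"."archived" = FALSE

Convenient when everything lines up; maddening when you need to know which table a call actually hits.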

Regardless, I dug into the code. I started by putting back the missing code file that described the absent part of the API. I then made an API call to that location and followed the stack trace to see which dependent files were missing. Through this iterative approach I was eventually able to get the API calls to stop generating server errors. It was at this point that the *real* work began. As the data is now held in different locations, I needed to grok how the templating works in order to modify the database calls (sketched below) to:
1) point to the new table, and
2) reference the different column names where the relevant pieces of data are held.
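For the curious, the kind of change I’m describing looks roughly like this; the class, table, and column names below are hypothetical stand-ins, not GitLab’s real schema:

    # Re-point the old model at the table where the data now lives, and map
    # the column names the old API code still asks for onto the new ones:
    class LegacyRunner < ActiveRecord::Base
      self.table_name = 'ci_runners'          # 'ci_runners' is a made-up example of the new table
      alias_attribute :token, :runner_token   # old callers asking for :token now read runner_token
    end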

This took quite a long time, as I kept running into class-name conflicts. Some of the supporting files were defining and using a class that pointed to now-empty database tables. Eventually I figured out how to change which class was being used for the particular database calls I was trying to make, and I was finally able to retrieve the relevant data. Then came the hard part.
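Roughly what was happening, sketched with hypothetical names (not the real GitLab classes):

    # A leftover support file still defined a model with the same name, bound to
    # the old table, which the upgrade had left empty:
    class Runner < ActiveRecord::Base
      self.table_name = 'runners'   # old table, now empty
    end

    # Fully qualifying the constant in the API code forced the lookup onto the
    # model that points at the new table instead:
    runner = ::Ci::Runner.find_by(token: params[:token])   # inside the endpoint, where params exists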

The hard part was modifying the handler for a POST request to dump data into the new database tables. Using some of the knowledge I’d gained in the first step, I was able to get it to reference the new table; then it was a (sort of simple) matter of updating some column names and suddenly…it worked.
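For a sense of what that handler change boiled down to, here is a hedged sketch written against Grape (the DSL the GitLab API is built on); the route, parameters, and model names are illustrative, not the actual GitLab code:

    require 'grape'

    class RunnerAPI < Grape::API
      format :json

      post '/runners/register' do
        runner = ::Ci::Runner.new(
          token:       params[:token],        # column names updated to match the new schema
          description: params[:description]
        )
        if runner.save
          status 201
          present runner
        else
          error!('could not register runner', 400)
        end
      end
    end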

The end result is that I’ve replaced the three API calls that one of my faculty members relies on in his scripts. He uses them to manage hundreds of course repositories and their associated Continuous Integration build runners; doing it all manually was just not an option. The issue with all this is that it’s a Hacky McHackerson solution, with broken pieces of code (even some security validation routines) commented out. At 4 pm on New Year’s Eve, as my faculty colleague and I were hacking away on this and testing it, I said, “This is the worst thing I’ve ever had to do in my professional career.” I finally got it finished about a quarter after midnight on January 1. Upon further reflection, I stand by that statement. Not only did it take a large amount of time to fix something that should never have been broken, but the actual code I hacked in to run it is kind of an embarrassment, mostly due to my unfamiliarity with RoR paradigms.

Lessons Learned:
1) Have a better testing methodology in place.
2) Be sure to get sign-off from your power users before pushing an update to production.
3) Having gotten a little experience with RoR, I understand its power more fully, and hate it even more.

About Jason

Jason has worked in IT for over 10 years, starting as a student manning the University Helpdesk and moving to his current role as an Enterprise Systems Engineer specializing in Web Application Infrastructure and Delivery.