How To Properly Preserve Legacy URLs In Your Express App

We've all been there: you click a link in an article that's a few years old, only to find that the link is broken because the destination site has undergone an upgrade. Maybe the page you're looking for is there under a different name, or maybe the content you're after has been removed entirely; there is no way to know. You use the site's built-in search bar to try to find it, but to no avail. Finally, when you've given up on that particular piece of content, you return to Google in the hopes of finding another. After trying a variation on your original search query, you click a search result only to find that it's the page you've been looking for all along.

Why do so many modern applications seem to disregard old URLs every time they undergo a remodel? Perhaps sometimes it is a deliberate decision, but I believe it stems largely from simple laziness. After all, building a legacy redirect system can be somewhat complex and isn't very exciting; why spend time on a boring feature like that when you could be building the next responsive single page app?

What most developers don't realize is that it doesn't have to be hard to build a legacy routing system. When most people start thinking about legacy routing they start imagining the grueling task of tracking down every possible endpoint used on the old site and then entering them into some sort of data store. How you build your legacy routing system will surely depend on the application in question, but in most cases it doesn't have to be quite so bad.

I recently finished an open source project for a non-profit organization called Paw Project. I enjoy volunteering my time for good causes, but if there is one thing about web development that is universally true, it's that the client must be made happy. You might be doing the work for free, but that doesn't mean your client wants to receive 50 emails from website visitors all complaining about things that no longer work or don't work as expected. You have to be just as diligent in your volunteer work as you are in any other project. At the very minimum your free contribution to their cause must help them in some way; if the client views your work as a step backward then you may be viewed as a burden rather than a helpful asset.

The PawProject.org site was a purely informational site that was starting to look fairly antiquated. The old site was not responsive and so would not load in any mobile format, and the application lacked structure or cohesiveness. I decided to build them a new application using the MEAN stack (MongoDB, ExpressJS, AngularJS, NodeJS). I sat down with the client and really tried to come up with a vision for what kind of web presence they would love to have. We decided to go in a direction that would allow us to first build a fancier version of the purely informational site they already had, then begin adding in community-driven features that would allow the site to come alive as an interactive destination for visitors.

I built a responsive app that flowed smoothly and was easy to read, and I even spiced it up with some subtle loading animations for elements as you scroll through pages. I just knew they were going to love it now that their application stood out more and was easier to read through. I went forward with launching the project only to receive a downpour of emails from them informing me that users were complaining about broken links all over the web. I hadn't stopped for one second to consider just how many pages they had on the old site; they had pages for all sorts of things, from legal information to t-shirt design contests. They had been a lot busier on the old site than I realized, and asking them "which pages from the old site are used the most" was a naive way for me to ascertain such information. Of course they did as I asked and sent me a list of their most important pages, but I failed to realize the obvious fact that they themselves probably didn't even know all of the pages their users used regularly.

It was time to sit down and think about the problem from a higher level. After peeking inside the old server for a minute it was easy to see that there were simply too many files to go through by hand. We could have written a script to extract the content, but that wouldn't weed out content that shouldn't be ported over, and the ported content would be ugly and filled with superfluous markup. Ideally each piece of content would be manually ported to a section of the new application, taking care to format it using Markdown in order to store as little HTML markup in the database as possible. To do that we needed to know what to port over, because porting all of it just isn't practical or necessary.

I came up with the idea to leave the old server running under a special "backup" DNS A-Record. With backup.pawproject.org in place we could get at the old server and the old application at any time. I then went to the new app and augmented the default Express 404 error handler with the following:

// Catch unmatched requests (would-be 404s) and try a legacy redirect first.
app.use(function(req, res, next) {
  // Trim any trailing slashes from the request URL.
  var requestUrl = req.url.replace(/\/+$/g, "");
  db.LegacyRedirect.findOne({ url: requestUrl }, function (err, lr) {
    if (err) return next(err);
    
    // If a legacy redirect does not yet exist for the request URL, create one.
    if (!lr) {
      lr = new db.LegacyRedirect({
        url: requestUrl,
        should404: false,
        redirectUrl: 'http://www1.pawproject.org' + requestUrl,
        count: 0,
        triggered: []
      });
    }
    
    // Check if legacy redirect is supposed to send a 404.
    // If so, send a 404 response and don't bother updating the legacy redirect.
    if (lr.should404) {
      // Use a distinct name so the outer callback's err is not redeclared.
      var notFound = new Error(lr.url + ' Could Not Be Found');
      notFound.status = 404;
      return next(notFound);
    }
    // Update legacy redirect count and add information about the current request.
    else {
      lr.count++;
      lr.triggered.push({
        date: Date.now(),
        ip: req.ip,
        referrer: req.get('referrer')
      });
      // Save legacy redirect.
      lr.save(function (err) {
        // Return here so we never redirect after handing off an error.
        if (err) return next(err);
        res.redirect(lr.redirectUrl);
      });
    }
  });
});

If the code above runs then we know none of our routes or static files matched the incoming request. Normally we simply generate a 404 error here, but this code first checks if the requested URL is stored in the legacy redirect MongoDB collection. If the redirect does not exist then it creates one, with the redirectUrl pointing to the same URL but on the old site. After it creates the redirect or loads it from the database it increments the number of times the redirect has been used and adds information about the current request to an array.
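To make that flow easier to follow in isolation, here is a minimal sketch of the same decision logic with the database swapped out for an in-memory Map. The names normalizeLegacyUrl and resolveLegacyRequest, the old.example.org host, and the returned { status, location } shape are all assumptions made for illustration; they are not part of the app above.

```javascript
// Hypothetical helper: strip trailing slashes, as the handler does.
function normalizeLegacyUrl(url) {
  return url.replace(/\/+$/g, '');
}

// Hypothetical stand-in for the handler's logic. `store` is a Map that
// plays the role of the legacy redirect collection.
function resolveLegacyRequest(store, oldSiteBase, request) {
  var url = normalizeLegacyUrl(request.url);
  var lr = store.get(url);

  // No redirect recorded yet: create one pointing at the old site.
  if (!lr) {
    lr = {
      url: url,
      should404: false,
      redirectUrl: oldSiteBase + url,
      count: 0,
      triggered: []
    };
    store.set(url, lr);
  }

  // Flagged URLs get a 404 and are not counted.
  if (lr.should404) {
    return { status: 404 };
  }

  // Otherwise record the hit and redirect.
  lr.count++;
  lr.triggered.push({
    date: request.date,
    ip: request.ip,
    referrer: request.referrer
  });
  return { status: 302, location: lr.redirectUrl };
}

// Usage: two spellings of the same URL share one redirect entry.
var demoStore = new Map();
resolveLegacyRequest(demoStore, 'http://old.example.org', { url: '/contests/', ip: '10.0.0.1', date: 0 });
resolveLegacyRequest(demoStore, 'http://old.example.org', { url: '/contests', ip: '10.0.0.2', date: 1 });
console.log(demoStore.get('/contests').count);
```

Because trailing slashes are trimmed before the lookup, /contests/ and /contests are counted together rather than as two separate redirects.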

Down the road I'll be able to pull information out of the database and see which redirects are being used the most. I'll even be able to see if there are pages on the web referring to those old URLs. It will provide a road map for what content still needs to be ported from the old site to the new one. There is also a should404 field that can manually be set to true if a certain URL really shouldn't redirect; for example, robots.txt was being forwarded to the file stored on the old site, but I didn't want the new site to have one, so I set that field to true to force a 404.
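As a rough idea of what that report could look like, the sketch below ranks redirect documents by hit count and lists the distinct referrers for each. It assumes the documents have already been loaded into a plain array with the same fields the handler stores; the function name summarizeRedirects and the sample data are hypothetical.

```javascript
// Hypothetical report: the most-used redirects, busiest first, with the
// distinct referring pages seen for each one.
function summarizeRedirects(redirects, limit) {
  return redirects
    // URLs forced to 404 are not candidates for porting.
    .filter(function (lr) { return !lr.should404; })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, limit)
    .map(function (lr) {
      var referrers = [];
      lr.triggered.forEach(function (hit) {
        // Skip hits without a referrer and ones already listed.
        if (hit.referrer && referrers.indexOf(hit.referrer) === -1) {
          referrers.push(hit.referrer);
        }
      });
      return { url: lr.url, count: lr.count, referrers: referrers };
    });
}

// Usage with made-up sample documents.
var sample = [
  { url: '/legal', should404: false, count: 3,
    triggered: [{ referrer: 'http://a.example/post' }, { referrer: undefined }] },
  { url: '/robots.txt', should404: true, count: 0, triggered: [] },
  { url: '/contests', should404: false, count: 7,
    triggered: [{ referrer: 'http://b.example/' }] }
];
console.log(summarizeRedirects(sample, 2));
```

With a large collection the same ranking would be better done in a database query than in application memory; the array version is only meant to show the shape of the result.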

As I port content from the old site to the new one I can manually change the redirectUrl to point to the new page if the URL is different from the old one. If the URL is identical to the old one then I can simply add the content to the new site and delete the redirect entry from the database.
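That bookkeeping rule is small enough to sketch as a function. This is only an illustration against an in-memory Map standing in for the collection; the name markPorted and its return values are assumptions, and in the real app the same two outcomes would be an update or a delete on the redirect entry.

```javascript
// Hypothetical helper: record that a legacy URL's content has been ported.
// `store` is a Map keyed by legacy URL, standing in for the collection.
function markPorted(store, legacyUrl, newUrl) {
  if (!store.has(legacyUrl)) {
    return 'missing';
  }
  // Same URL on the new site: a real route now matches first, so the
  // redirect entry is no longer needed.
  if (newUrl === legacyUrl) {
    store.delete(legacyUrl);
    return 'deleted';
  }
  // Different URL: keep the entry but point it at the new page.
  store.get(legacyUrl).redirectUrl = newUrl;
  return 'updated';
}

// Usage with made-up entries.
var ported = new Map();
ported.set('/legal.html', { url: '/legal.html', redirectUrl: 'http://old.example.org/legal.html' });
ported.set('/about', { url: '/about', redirectUrl: 'http://old.example.org/about' });
console.log(markPorted(ported, '/legal.html', '/legal')); // updated
console.log(markPorted(ported, '/about', '/about'));      // deleted
```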

This automation not only fixes all the 404 errors users would get but it also allows me to see the popularity of content and adjust my priorities for which content should be brought over.