Reducing Database Load using Redis and Batched Insert/Update SQL Queries

I’ve been running my “T-Comm” Bus Locator website for over seven years now. As a quick recap, on the backend, it’s a procedural PHP/MySQL website, kind of a typical thing one would see of PHP development circa 2000s.

The high level overview is that every minute or two, the system polls TransLink’s real time information APIs for the locations of the all the buses in the system. The system then calculates some information based on TransLink’s GTFS data, which has the schedule information for the entire system. Then it saves the data into a MySQL database, and some other outputs for use on the web interface.

So naturally as more buses operate during the daytime (up to around a thousand buses), the CPU load increases during the day, then drops off at night (down to several on the Night Bus routes). It could take about two minutes to calculate all the necessary information to update all the buses during the daytime

CPU load over the last week

Last night, I set off to find a quick way to optimize this, without having to rewrite the entire system (which arguably it sort of needs, but that’s for another day). From a cursory at the system resources using tools such as htop, it was pretty obvious that the bottleneck was on the MySQL database and its reading and writing to the relatively slow hard disks.

Continue Reading

Checking for a Null value in SQL

Yesterday I fell into a very simple but dangerous SQL trap.  Can you identify it?

SELECT first_column, another_column
FROM my_table
WHERE first_column = 1
AND (another_column >= 100 OR
another_column = null)

Found it?  It’s the comparison to null.  My intention was to get rows where another_column was null, however the query did not do so.  It didn’t give me any results where another_column was null, but didn’t give me any error either.  Why?

Null in SQL is not a real value.  Null means “lack of a value” or “unknown”. Anything compared to null leads to an “unknown” result. Therefore, in SQL there are actually three values conditions can evaluate to: true, false, and unknown/null. This is called three-valued logic. SQL WHERE clauses only return rows that evaluate to true, not rows that evaluate to false or “unknown”.

The proper way of checking for null is by using the IS NULL or IS NOT NULL predicates. The correct SQL query would have been as follows:

SELECT first_column, another_column
FROM my_table
WHERE first_column = 1
AND (another_column >= 100 OR
another_column IS null)

So if you’re used to languages where you check for null with equal signs, remember that in SQL, you must use the special IS NULL or IS NOT NULL predicates instead of the equals sign or other comparison operators. Don’t get tripped up like I did.

If you want to read more about, Wikipedia has a detailed article on null in SQL.

Database query optimization is a useful skill

Recently I read an article titled “Quick tip: Improve developer habits by showing time cost of DB queries” by Dan Birken.  He claims that one common reason why websites load slowly are because of unoptimized database queries:

Websites can be slow for many reasons, but an overwhelmingly common one is time spent performing database queries. Now database queries are likely a very important part of rendering the webpage – the information to populate a page has to come from somewhere. But sometimes rendering a page is slowed down by either unnecessary queries or improperly optimized queries.

I completely agree with his view.  For most web developers, querying information from a database is quite a common action.  Therefore, I believe learning how to write efficient and optimized database queries is a good skill to have.

I was fortunate to learn about this topic when I was developing MyBB way back years ago.  At the bottom of each MyBB page, administrators are shown debugging info with a summary of loading times and other statistics.

Debug info
Debugging information presented to the administrator on each MyBB page

Clicking on the “advanced details” link yields a much more in-depth analysis of the page generation, including an explanation of all the database queries used to create the page.

Performance debugging details
Advanced performance debugging details presented to MyBB administrators.

This is very useful to determine where the bottlenecks occur when generating the page.  Some common mistakes include repeating similar queries instead of batching them together, selecting more data than is needed, and doing full table scans rather than indexed queries.

However, not all software and web frameworks has built-in performance debugging pages like what MyBB has.  If you’re programming something from scratch, it might take some extra effort to create a useful debugging interface.  What other tools can we use then?

Using the EXPLAIN query (works at least for MySQL, PostgreSQL, SQLite) will give you the details about how a specific query will be executed by the database.  All you need to do is to tack “EXPLAIN” on at the beginning of your SELECT query, and run it through your favourite database management program.  SitePoint has a thorough article on how to understand the output of an EXPLAIN query.

Of course, it’s also beneficial to have a concrete understanding of database indexes and how they work.  One of my favourite sites on this topic is Use the Index, Luke!  They also have a fun 3-minute quiz to test your knowledge, if you think you already know everything 😉  (I got 4/5 on my first try, how about you? 🙂 )

Do you know of any other resources to help with optimizing database queries?  Let me know in the comments.

Side Project 1: Real Time Bus Map in Vancouver

Since TransLink released their new mobile Next Bus site with real-time GPS updates of bus locations, I’ve been trying to find ways to get the data and rehash it into something that  Metro Vancouver transit enthusiasts (more specifically, enthusiasts who chase buses and monitor the transit system’s operation) will find useful.

There were two main shortcomings of TransLink’s site from the viewpoint of a transit enthusiast:

  1. Can’t search for a specific bus.  Often times transit enthusiasts “chase” a particular bus, usually a new bus, a fresh bus after a midlife refurbishment, or a bus with a new advertisement wrap.
  2. Can’t see the entire system as a whole.  This one’s pretty self explanatory.  It’s just fun to be able to see where all the buses are.

So I created a system which gleaned information from the TransLink site and aggregated it into a useful interface which I called “T-Comm”.

This is named after CMBC’s Transit Communications centre which has an interface similar to what I created.  Using the information I was also able to add additional functionality like grabbing the bus’s schedule for the current trip and even the entire day.

Since most transit enthusiasts would be using this on the go, I knew I had to make this site mobile-friendly. The enthusiasts I knew used a myriad of mobile devices including Blackberries, iPhones, and Androids, so it would not have made sense for me to create a native app for each of the platforms; it would have killed me in terms of time and energy.  I chose to use jQuery Mobile and Google Maps API as the basis of the frontend.  The backend is powered by PHP/MySQL. I was amazed at the ease I was able to make something mobile-friendly using jQuery Mobile.  It was actually fun too.  More importantly testers reported positively on their mobile devices.

I’m hoping that TransLink will release the GPS data officially for developers.  They’ve said after April 2012 on Twitter, so I’m crossing my fingers.

Contact me if you would like access to the site.  Since the current site hammers TransLink’s servers I’m trying to tread lightly.  Once official GPS data is used I will open it to the public, but I don’t really see this being used by the public as it is quite enthusiast-oriented.

Update: The site has been updated to use TransLink’s official data feed for real time bus data, so here’s the link: http://tcomm.bustrainferry.com.  More info on this site can be found on my T-Comm project info page.

MyBB Security

These are just my thoughts about MyBB security updates.  I’m not a security expert of any sort, but I just offer my opinion based on the knowledge I have.
Over the last few weeks there have been two releases to MyBB to patch potential security vulnerabilities that have been discovered by various parties. I have seen some people who have found these seemingly miniscule updates too trivial to apply to their own boards, despite the fact that I and various other members of the MyBB staff have recommended these updates.

These people seem to believe that just because no harm has been done by people attempting to exploit the vulnerability, or just because no harm has been done when they try the exploit script by themselves, that the upgrade is not required. Personally I find this absurd.

First of all, I’d like to point out that not all proof-of-concept scripts are harmful; as their name suggests, these scripts prove the concept, but may not actually compromise the system. Wikipedia says: “In both computer security and encryption, proof of concept refers to a demonstration that in principle shows how a system may be protected or compromized, without the necessity of building a complete working vehicle for that purpose.”

Just because a board administrator cannot find a way to exploit the vulnerability, doesn’t mean that another malacious user won’t find a way. Just because nothing has been “done” to the board when an attempt has been made, doesn’t mean that eventually someone else won’t find a way to compromise the board. For example, the 1.1.3 release patched a serious security vulnerability where a malacious user could execute arbitrary PHP code at their own heart’s content (with a malaciously-formed username). As an administrator, you may not even detect any problems on the surface if you tried the proof-of-concept script, or seen usernames that have registered on your board, but nothing harmful has happened. In fact, much more serious and critical information may have been available to the hands of malacious users, if they indeed have compromised the board in this manner, and the patch released was not applied.

As well, once the security vulnerability has been patched, anyone with a malacious intent would be able to figure out how to exploit it, and may be able to compromise boards which have not patched the vulnerability.

Okay, so I may not be a security expert, however, I do use my common sense (and I do hope that you use yours). When a security vulnerability has been found, and has been identified to affect the particular version of MyBB (or any other software), we do not just release these patches to annoy our users with little upgrades every few weeks. No, we actually do want to improve our software by patching these holes and keeping our users safe. If a vulnerability has been reported, it is most likely that something harmful can be done to your board, and if a board administrator wishes to take that risk and not upgrade, it is his or her decision, and I cannot force anyone to apply the patch.
Obviously it is possible that sometimes the malacious users will compromise boards before we can find the vulnerability and release the patch, but I assure you that security is at the highest priority with the MyBB Group, and we strive to keep our customers safe from these exploits in as a timely manner as possible.

However, once we have released a patch, it is up to each and every individual board administrator to update their board to keep them and their board safe from the exploit. Each security patch, no matter how small, should be considered as significant. I hope that you all take this into mind the next time you ponder whether or not to update your board.

After writing all this about security, I hope I won’t get hit on my behind by something that I have just fervently preached. 🙂