A Story
Recently, a client of ours wanted to institute a “point” system for an existing body of users. The idea was that certain actions of the user would generate points for that user, which the client could then track as part of an incentive program.
But What are “Points”?
At the time, we had a simple “users” table in our database which stored all our user-related data. Now we were asked, essentially, to add a new “points” attribute to the “user” entity. However, we could not simply add a “points” column to the “user” table, because the client needed to track individual point-granting actions separately, with descriptions and such.
But this was also not a one-to-many relationship with an abstract “point-event” entity either, since some points were inferred from information which was properly normalized into other parts of the database. For example, referring another user (information we know at user registration time) was worth a certain number of points, but to copy a “referred user” event to a “point-event” entity would mean denormalizing the database. If a user-referral were added or changed later, we would have to make sure to do the same thing to a corresponding point-event.
Thus a user’s “points” are an attribute of the user, but the value of this attribute is derived from potentially many different entities or attributes. Guess what? It’s a derived attribute (scroll to the bottom).
So, how are we going to deal with this?
Implementation
Derived Columns
Some “real” databases have native support for derived attributes (e.g., SQL Server) but as far as I know they all require that the value of the derived attribute be defined as an expression, not the result of an arbitrary query. We could get around this using a stored function which calculates the points for us, but this particular database was MySQL (which does not support derived attributes), version 4.1 (which does not support stored functions).
In any case, this is a bad solution for us because any changes to the point calculation algorithm would require modification of the database, yet we had been accustomed to putting this kind of logic into the application. Additionally, a lazy SELECT *
(many of which were unfortunately sprinkled throughout our application) would suddenly become much more expensive, requiring an additional function call per row.
Application Code
The other solution, of course, is that we simply put all the point-calculation code into the application. The problem with this is that it would take multiple queries to the database for every user that interested us, and we could potentially get the wrong point value if a change were made to the database in between our queries (since MySQL MyISAM does not have transactions). Plus, if we want to sort by points (or something more complicated), we would have to do the sorting ourselves, in the application.
UNION
Clearly, we wanted to handle point calculation by a single query. The solution we finally hit upon was to use a temporary table (not a view, since MySQL 4.1 doesn’t support them) filled by a UNION
. This is quite possibly the only good use for a UNION
. Each subquery of the UNION
would calculate points based on a particular attribute or entity, and all the subqueries would SELECT
to common column names.
DROP TEMPORARY TABLE IF EXISTS tmp_all_points; CREATE TEMPORARY TABLE tmp_all_points -- Get referrer-derived points (SELECT user.id AS user_id, COUNT(*)*5 AS points FROM user ... INNER JOIN ... GROUP BY ...) UNION -- Get pointevents-derived points (SELECT user_id AS user_id, SUM(points) AS points FROM pointevents GROUP BY user_id HAVING points != 0);
This will give us a temporary table with 0, 1, or 2 rows per user. If we want to limit this to particular users, we can add the relevant WHERE
conditions to the individual subqueries before we send them to the database.
Now if we want to do any queries which involve points, we can just treat tmp_all_points
as a “points” entity with a many-to-one relationship with the “users” entity.
Want the top five point-holders?
SELECT users.name, SUM(tmp_all_points.points) AS points FROM users INNER JOIN tmp_all_points ON users.id = tmp_all_points.users_id GROUP BY users.id ORDER BY points DESC LIMIT 5
Happy Ending?
By using a UNION
, we were able to neatly model the derived attribute as a table, using a single query that maps easily to the logic of the derived attribute and is easy to extend to account for any additional criteria that the client may dream up. And we didn’t have to denormalize our database or introduce complex application code.
There is a caveat, however. Tables defined by a query have no index, and probably we are going to want to join on this table, which means we’ll be doing a join without an index. For this reason, it is pretty important to keep the result set of your UNION query as small as possible using additional WHERE
conditions.
If your result set will always be large, split off the temporary table creation into a definition with keys and use a INSERT INTO tmp_table SELECT ... UNION SELECT ...
. Don’t use CREATE INDEX
after filling your table, since creating an index on a full table is much slower than building it incrementally (except for FULLTEXT
indexes, where the opposite is true).
Don’t Try This With Views
If you are using MySQL 5.0 or above, you won’t be able to mitigate this problem by using a VIEW
. MySQL is not very good at optimizing views. If there is not a one-to-one relationship between the rows of your view and the rows of the underlying tables, MySQL will use ALGORITHM = TEMPTABLE
for your view. So any view with a UNION in it will be created as a temporary table anyway.
Thus I would not wrap a UNION
in a view for this technique, since you can’t control the result set size for a view and you will be generating a new temporary table every time you use the view, instead of once per connection.