Posts Tagged ‘mysql’

Derived Attributes with UNION

A Story

Recently, a client of ours wanted to institute a “point” system for an existing body of users. The idea was that certain actions of the user would generate points for that user, which the client could then track as part of an incentive program.

But What are “Points”?

At the time, we had a simple “users” table in our database which stored all our user-related data. Now we were asked, essentially, to add a new “points” attribute to the “user” entity. However, we could not simply add a “points” column to the “user” table, because the client needed to track individual point-granting actions separately, with descriptions and such.

But this was also not a one-to-many relationship with an abstract “point-event” entity either, since some points were inferred from information which was properly normalized into other parts of the database. For example, referring another user (information we know at user registration time) was worth a certain number of points, but to copy a “referred user” event to a “point-event” entity would mean denormalizing the database. If a user-referral were added or changed later, we would have to make sure to do the same thing to a corresponding point-event.

Thus a user’s “points” are an attribute of the user, but the value of this attribute is derived from potentially many different entities or attributes. Guess what? It’s a derived attribute (scroll to the bottom).

So, how are we going to deal with this?

Implementation

Derived Columns

Some “real” databases have native support for derived attributes (e.g., SQL Server) but as far as I know they all require that the value of the derived attribute be defined as an expression, not the result of an arbitrary query. We could get around this using a stored function which calculates the points for us, but this particular database was MySQL (which does not support derived attributes), version 4.1 (which does not support stored functions).

In any case, this is a bad solution for us because any changes to the point calculation algorithm would require modification of the database, yet we had been accustomed to putting this kind of logic into the application. Additionally, a lazy SELECT * (many of which were unfortunately sprinkled throughout our application) would suddenly become much more expensive, requiring an additional function call per row.

Application Code

The other solution, of course, is that we simply put all the point-calculation code into the application. The problem with this is that it would take multiple queries to the database for every user that interested us, and we could potentially get the wrong point value if a change were made to the database in between our queries (since MySQL MyISAM does not have transactions). Plus, if we want to sort by points (or something more complicated), we would have to do the sorting ourselves, in the application.

UNION

Clearly, we wanted to handle point calculation by a single query. The solution we finally hit upon was to use a temporary table (not a view, since MySQL 4.1 doesn’t support them) filled by a UNION. This is quite possibly the only good use for a UNION. Each subquery of the UNION would calculate points based on a particular attribute or entity, and all the subqueries would SELECT to common column names.

DROP TEMPORARY TABLE IF EXISTS tmp_all_points;
CREATE TEMPORARY TABLE tmp_all_points
-- Get referrer-derived points
(SELECT user.id AS user_id, COUNT(*)*5 AS points
FROM user ... INNER JOIN ... GROUP BY ...)
UNION
-- Get pointevents-derived points
(SELECT user_id AS user_id, SUM(points) AS points
FROM pointevents GROUP BY user_id HAVING points != 0);

This will give us a temporary table with 0, 1, or 2 rows per user. If we want to limit this to particular users, we can add the relevant WHERE conditions to the individual subqueries before we send them to the database.

Now if we want to do any queries which involve points, we can just treat tmp_all_points as a “points” entity with a many-to-one relationship with the “users” entity.

Want the top five point-holders?

SELECT users.name, SUM(tmp_all_points.points) AS points
FROM users
INNER JOIN tmp_all_points ON users.id = tmp_all_points.users_id
GROUP BY users.id
ORDER BY points DESC
LIMIT 5

Happy Ending?

By using a UNION, we were able to neatly model the derived attribute as a table, using a single query that maps easily to the logic of the derived attribute and is easy to extend to account for any additional criteria that the client may dream up. And we didn’t have to denormalize our database or introduce complex application code.

There is a caveat, however. Tables defined by a query have no index, and probably we are going to want to join on this table, which means we’ll be doing a join without an index. For this reason, it is pretty important to keep the result set of your UNION query as small as possible using additional WHERE conditions.

If your result set will always be large, split off the temporary table creation into a definition with keys and use a INSERT INTO tmp_table SELECT ... UNION SELECT .... Don’t use CREATE INDEX after filling your table, since creating an index on a full table is much slower than building it incrementally (except for FULLTEXT indexes, where the opposite is true).

Don’t Try This With Views

If you are using MySQL 5.0 or above, you won’t be able to mitigate this problem by using a VIEW. MySQL is not very good at optimizing views. If there is not a one-to-one relationship between the rows of your view and the rows of the underlying tables, MySQL will use ALGORITHM = TEMPTABLE for your view. So any view with a UNION in it will be created as a temporary table anyway.

Thus I would not wrap a UNION in a view for this technique, since you can’t control the result set size for a view and you will be generating a new temporary table every time you use the view, instead of once per connection.

Broadband for the People

Technology author and activist Drew Clark turned to Dancing Mammoth when he wanted to make his idea for Broadbandcensus.com into a reality. He envisioned a site capable of providing the most accurate and up-to-date information on broadband technologies to consumers in the United States.

Dancing Mammoth implemented blogs, wikis, speed tests, comments, real time graphs and carrier data into Broadbandcensus.com and designed the clearinghouse Clark imagined.

The first step in the creation of the site involved gathering data for the “What are your broadband internet options?” function. Dancing Mammoth collected data from the FCC and maps from the U.S. postal service. Data was also gathered from individual carriers websites, this data is usually buried deep in the sites, or worse yet, involved some programming knowledge to scrape the data from the sites. We did the scraping and we did the hours of manipulating data to create a tool where users could search their market by zip code.

The website also continues to learn about broadband markets by surveying its users about location, carrier, promised speeds, and an individual’s rating of his service through a census. The survey data, in combination with the search function previously mentioned, a user can automatically correlate carriers to specific zip codes, along with promised speeds and any comments about that location and carrier.

The second part of the census involves a speed test. Broadbandcensus.com has worked closely with Internet2 and Virginia Tech to implement a modified Java-based NDT (Network Diagnostics Tool) client.

Based on the location provided by the user in the census, the site calculates the closest online NDT server accepting connections. The speed test takes approximately 30 seconds and roughly 50 data points are collected during this time, which measure everything from total speed to where bottlenecks in the network are occurring. Once this data is collected it allows the site to display real time percentages of user ratings and percentage of users getting their promised speeds. This is crucial when trying to find the right (only) carrier in your market and makes it a great research tool for consumers.

Broadbandcensus.com is now a publicly available resource that provides real data to consumers about broadband in the U.S. and facilitates consumer research and competition in the broadband carrier sector.

Technologies:

  • Custom ORM Framework written in PHP/MySQL
  • Java
  • Javascript
  • Wordpress
  • custom wiki software

Data:

  • 37,000 zipcodes
  • 30,000 Federal datapoints
  • 95,000 relationships
  • 110,000 objects
  • 1500 providers (and growing)

THE BALLERINAS

  • PJ Doland

    Born in a cross-fire hurricane and he howled at his ma in the driving rain.

    Brian Kieffer

    Solving website problems before you even knew they existed.

    Matt Fetissoff

    Making sure all our websites have at least 15 pieces of flair.

    Erin Doland

    100 percent all-natural high-quality content machine.

    Francis Avila

    Ambidextrously juggling clients and code without breaking a sweat.