Removing UTF8 Gremlins

If you work with documents from many different sources, you’ve probably seen this before:

Thatâ€™s good.

“Oh no”, you think, “A utf-8 encoding problem.” That three-letter combo should be a single close-quote, like this:

That’s good.

Sometimes the problem is that your application is reading the file as win-1252 (or cp1252, or the kinda-sorta iso-8859-1 used on the web). In this case the solution is easy: instruct your application to reopen the file as utf8.
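Here is a minimal Python 3 sketch of that easy case, assuming the bytes on disk are correct utf-8 and only the reader's choice of encoding is wrong (the variable names are illustrative):

```python
# The bytes on disk are correct UTF-8 for "That’s good."
raw = "That\u2019s good.".encode("utf-8")

# Decoding them with the wrong codec produces the three-letter gremlin...
wrong = raw.decode("cp1252")

# ...while decoding the very same bytes as UTF-8 gives the intended text.
right = raw.decode("utf-8")

# ascii() keeps the output readable on any terminal.
print(ascii(wrong))
print(ascii(right))
```

Nothing in the file needs to change here; only the decoding step does.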

But sometimes, your file really does say “â€™”, even when decoded as utf-8. This happens when someone takes some utf8 text, pastes it into a win1252 document, and then saves the document as utf8. So now the bytes in your document are:

That[c3][a2][e2][82][ac][e2][84][a2]s good.

instead of

That[e2][80][99]s good.
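To see where those bytes come from, here is a small Python 3 sketch that replays the accident step by step. The show() helper is a made-up name; it just renders bytes in the same bracketed style as the two listings above.

```python
original = "That\u2019s good."

# Step 1: proper UTF-8 bytes.
good_bytes = original.encode("utf-8")

# Step 2: those bytes are misread as cp1252, giving three characters.
misread = good_bytes.decode("cp1252")

# Step 3: the misread text is saved as UTF-8, baking the gremlin in.
gremlin_bytes = misread.encode("utf-8")

def show(data):
    """Render bytes with ASCII as-is and everything else as [hex]."""
    return "".join(chr(b) if b < 0x80 else "[%02x]" % b for b in data)

print(show(good_bytes))     # That[e2][80][99]s good.
print(show(gremlin_bytes))  # That[c3][a2][e2][82][ac][e2][84][a2]s good.
```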

So how do you fix it?

I wrote a tool.

The Python code below (written for Python 2) uses Python’s codec interface to register a simple stateless decoder that turns these utf8 gremlin bytes back into pure utf8 bytes. You can use it from the command line like removeUTF8Gremlins.py -o outfile.txt infile.txt (options must come before the input file, because getopt stops parsing at the first non-option argument), or you can use it as a library by importing it and then using the CP1252asUTF8gremlins pseudo-codec anywhere you can use a stateless codec.

#!/usr/bin/env python
# encoding: utf-8

# BSD LICENSE
# Copyright (c) 2010, Dancing Mammoth Inc
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# Redistributions in binary form must reproduce the above copyright notice, this
# list of conditions and the following disclaimer in the documentation and/or
# other materials provided with the distribution.
#
# Neither the name of Dancing Mammoth nor the names of its contributors may
# be used to endorse or promote products derived from this software without
# specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

"""
removeUTF8Gremlins.py

Will recode file with utf8 gremlins to a proper utf8 file.

When used as a library, will register the codec 'CP1252asUTF8gremlins', which
provides a stateless decoder which will convert bytes with gremlins into pure
utf8 bytes.

We call a 'utf8 gremlin' the byte sequence that results when utf8 bytes are
decoded as cp1252 into unicode chars, and then written out again as utf8.

The tell-tale sign of it is bytes that look like this in a file read as utf8.

Original: That’s good.
Bytes as utf8: That[e2][80][99]s good.
When read as CP1252: Thatâ€™s good. (lowercase a with circumflex, euro symbol, trademark symbol)
Bytes as utf8 gremlins: That[c3][a2][e2][82][ac][e2][84][a2]s good.

This utility turns "Bytes as utf8 gremlins" back into "Bytes as utf8"

Created by Francis Avila on 2010-10-27.
Copyright (c) 2010 Dancing Mammoth, Inc. All rights reserved.
"""

import sys
import getopt
import codecs
import re

help_message = '''
Fix a conversion error where a utf8 file got interpreted as a win1252 file
and then saved as utf8, producing three-character multibyte gremlins.
'''

def win1252_to_utf8_gremlin_table(mapping={}):
	if mapping:
		return mapping
	def makemapping(mapping):
		for i in range(256):
			byte = ('%02x' % i).decode('hex_codec')
			try:
				cp1252uni = byte.decode('cp1252')
			except UnicodeDecodeError:
				cp1252uni = byte.decode('iso-8859-1')

			if cp1252uni:
				realutf8 = cp1252uni.encode('utf-8')
				try:
					asuni = realutf8.decode('cp1252')
				except UnicodeDecodeError:
					asuni = realutf8.decode('iso-8859-1')
				if asuni:
					utf8gremlin = asuni.encode('utf8')
					mapping[utf8gremlin] = realutf8
	makemapping(mapping)
	return mapping

def win1252_to_utf8_gremlin_re():
	mapping = win1252_to_utf8_gremlin_table()
	rechars = []
	for k,v in mapping.items():
		if k != v:
			rechars.append(k.encode('string_escape'))
	regex = '(?:%s)' % '|'.join(rechars)
	return re.compile(regex)

def reverse_win1252_to_utf8_gremlins(bytes, errors='strict'):
	regex = win1252_to_utf8_gremlin_re()
	mapping = win1252_to_utf8_gremlin_table()
	def replace(mo):
		try:
			newchar = mapping[mo.group(0)]
		except KeyError:
			if errors=='strict':
				raise ValueError('Encountered bytes with no pure utf8 equivalent.')
			else:
				if errors=='ignore':
					newchar = ''
				elif errors=='replace':
					newchar = '?'
		return newchar
	newbytes = re.sub(regex, replace, bytes)
	return (newbytes, len(bytes))

def register_win1252_to_utf8_gremlins(encoding):
	ci = None
	if encoding == 'cp1252asutf8gremlins':
		ci = codecs.CodecInfo(None, reverse_win1252_to_utf8_gremlins, name='CP1252asUTF8gremlins')
	return ci

codecs.register(register_win1252_to_utf8_gremlins)

class Usage(Exception):
	def __init__(self, msg):
		self.msg = msg

def main(argv=None):
	if argv is None:
		argv = sys.argv
	options = {}
	try:
		try:
			opts, args = getopt.getopt(argv[1:], "ho:v", ["help", "output="])
		except getopt.error, msg:
			raise Usage(msg)

		# option processing
		for option, value in opts:
			if option == "-v":
				options['verbose'] = True
			if option in ("-h", "--help"):
				raise Usage(help_message)
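			if option in ("-o", "--output"):
				options['outputfile'] = value

		# the input file is required; fail with a usage message instead of an IndexError
		if not args:
			raise Usage("no input file given")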

	except Usage, err:
		print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg)
		print >> sys.stderr, "\t for help use --help"
		return 2

	bytes = file(args[0], 'rb').read()
	outfp = file(options['outputfile'], 'wb') if 'outputfile' in options else sys.stdout
	bytes = bytes.decode('CP1252asUTF8gremlins')
	outfp.write(bytes)
	outfp.close()

if __name__ == "__main__":
	sys.exit(main())
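The listing above targets Python 2 and will not run unchanged on Python 3. What follows is a minimal Python 3 sketch of the same table-driven idea, not a port of the tool. The names misread_as_cp1252, build_gremlin_table and remove_gremlins are made up for illustration, and there is no codec registration. One deliberate difference: the cp1252 fallback is applied per byte rather than per sequence, on the assumption that the misreading application maps the five undefined cp1252 bytes to the matching C1 control characters.

```python
import re

def misread_as_cp1252(data):
    """Decode bytes one at a time as cp1252, using latin-1 for undefined bytes."""
    chars = []
    for b in data:
        one = bytes([b])
        try:
            chars.append(one.decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: undefined bytes (0x81, 0x8d, ...) become C1 controls.
            chars.append(one.decode("latin-1"))
    return "".join(chars)

def build_gremlin_table():
    """Map each gremlin byte sequence to the UTF-8 bytes it should have been."""
    table = {}
    for i in range(0x80, 0x100):  # ASCII bytes are never affected
        real_utf8 = misread_as_cp1252(bytes([i])).encode("utf-8")
        gremlin = misread_as_cp1252(real_utf8).encode("utf-8")
        table[gremlin] = real_utf8
    return table

GREMLINS = build_gremlin_table()
GREMLIN_RE = re.compile(b"|".join(re.escape(k) for k in GREMLINS))

def remove_gremlins(data):
    """Replace every known gremlin sequence in a bytes object."""
    return GREMLIN_RE.sub(lambda m: GREMLINS[m.group(0)], data)

fixed = remove_gremlins(b"That\xc3\xa2\xe2\x82\xac\xe2\x84\xa2s good.")
print(fixed.decode("utf-8") == "That\u2019s good.")
```

Like the original, this only repairs characters that exist in cp1252, and it cannot tell a gremlin from text that legitimately contains the same character sequence.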

PHP Stream Filters: Unchunking HTTP Streams

Slinging PHP code to and fro one day, I found myself needing to process a potentially large result from a URL, a result too large to fit within PHP’s memory limit. However, I could process this result a line at a time, so I could avoid buffering the entire thing in memory. I couldn’t use cURL, since it buffers everything, but I could use PHP’s handy file-like stream interface: fetch the URL with fopen('http://my-url.n.e.t/', 'r') and then use fgets() to keep only a line in memory at a time.

It was a great plan, but I noticed that I occasionally got garbage lines or bogus input. Using http cli tools like wget and curl revealed nothing out of the ordinary, until I realized that those garbage lines were the uninterpreted length markers for Transfer-Encoding: chunked. PHP’s http stream handler does not decode chunked transfers.

There is a pecl function http_chunked_decode(), but it operates on strings, not streams, so I would still have to buffer the entire input first.
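For reference, here is what a string-based decoder of that kind has to do, as a rough Python 3 sketch. decode_chunked is a hypothetical name, not the pecl function, and the sample body is invented. Each chunk is a hex length line, a CRLF, that many bytes of data, and another CRLF; a zero-length chunk ends the body. Those hex length lines are the "garbage" I was seeing.

```python
def decode_chunked(body):
    """Decode a complete chunked body held in memory (bytes in, bytes out)."""
    out = []
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # Anything after ';' on the size line is a chunk extension; drop it.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; any trailers are ignored
        out.append(body[pos:pos + size])
        pos += size + 2  # skip the data and its trailing CRLF
    return b"".join(out)

sample = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
print(decode_chunked(sample))  # b'Wikipedia'
```

It works, but only once the whole body is in memory, which is exactly what I was trying to avoid.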

PHP’s streams allow you to attach a chain of stream filters to a stream to process input and output (conceptually similar to what ob_gzhandler() does for output buffers). My plan was to create a stream filter to transparently unchunk the stream. Unfortunately, the documentation on writing your own stream filter is pretty sparse, and the examples I could find on the web were all very trivial.

After a few false starts, however, I was able to create an http stream unchunker:

/**
* A stream filter for removing the 'chunking' of a 'Transfer-Encoding: chunked'
* http response
*
* The http stream wrapper on php does not support chunked transfer
* encoding, making this filter necessary.
*
* Add to a file resource with <code>stream_filter_append($fp, 'http_unchunk_filter',
* STREAM_FILTER_READ);</code>
*
* If the wrapper metadata for $fp does not contain a <code>transfer-encoding:
* chunked</code> header, this filter passes data through unchanged.
*
* @license BSD
* @author Francis Avila
*/
// Stream filters must subclass php_user_filter
class http_unchunk_filter extends php_user_filter {
	protected $chunkremaining = 0; //bytes remaining in the current chunk
	protected $ischunked = null; //whether the stream is chunk-encoded. null=not sure yet

	// this is the meat of the filter.
	// The class must have a function with this name and prototype
	// It must return a status--one of the PSFS_* constants;
	function filter($in, $out, &$consumed, $closing) {
		if ($this->ischunked===null) {
			$this->ischunked = self::ischunked($this->stream);
		}
		// $in and $out are opaque "bucket brigade" objects which consist of a
		// sequence of opaque "buckets", which contain the actual stream data.
		// The only way to use these objects is the stream_bucket_* functions.
		// Unfortunately, there doesn't seem to be any way to access a bucket
		// without turning it into a string using stream_bucket_make_writeable(),
		// even if you want to pass the bucket along unmodified.

		// Each call to this pops a bucket from the bucket brigade and
		// converts it into an object with two properties: datalen and data.
		// This same object interface is accepted by stream_bucket_append().
		while ($bucket = stream_bucket_make_writeable($in)) {
			if (!$this->ischunked) {
				$consumed += $bucket->datalen;
				stream_bucket_append($out, $bucket);
				continue;
			}
			$outbuffer = '';
			$offset = 0;
			// Loop through the string.  For efficiency, we don't advance a character
			// at a time but try to zoom ahead to where we think the next chunk
			// boundary should be.

			// Since the stream filter divides the data into buckets arbitrarily,
			// we have to maintain state ($this->chunkremaining) across filter() calls.
			while ($offset < $bucket->datalen) {
				if ($this->chunkremaining===0) { // start of new chunk, or the start of the transfer
					$firstline = strpos($bucket->data, "\r\n", $offset);
					$chunkline = substr($bucket->data, $offset, $firstline-$offset);
					$chunklen = current(explode(';', $chunkline, 2)); // ignore MIME-like extensions
					$chunklen = trim($chunklen);
					if (!ctype_xdigit($chunklen)) {
					// There should have been a chunk length specifier here, but since
					// there are non-hex digits something must have gone wrong.
						return PSFS_ERR_FATAL;
					}
					$this->chunkremaining = hexdec($chunklen);
					// $firstline already includes $offset in it
					$offset = $firstline+2; // +2 is CRLF
					if ($this->chunkremaining===0) { //end of the transfer
						break;  // ignore possible trailing headers
					}
				}
				// get as much data as available in a single go...
				$nibble = substr($bucket->data, $offset, $this->chunkremaining);
				$nibblesize = strlen($nibble);
				$offset += $nibblesize; // ...but recognize we may not have got all of it
				if ($nibblesize === $this->chunkremaining) {
					$offset += 2; // skip over trailing CRLF
				}
				$this->chunkremaining -= $nibblesize;
				$outbuffer .= $nibble;
			}
			$consumed += $bucket->datalen;
			$bucket->data = $outbuffer;
			stream_bucket_append($out, $bucket);
		}
		return PSFS_PASS_ON;
	}

	protected static function ischunked($stream) {
		$metadata = stream_get_meta_data($stream);
		$headers = $metadata['wrapper_data'];
		return (bool) preg_grep('/^Transfer-Encoding:\s+chunked\s*$/i', $headers);
	}

	function onCreate() {
		if (isset($this->stream)) { // This is usually not defined until the first filter() call.
			$this->ischunked = self::ischunked($this->stream);
		}
	}
}

stream_filter_register('http_unchunk_filter', 'http_unchunk_filter');

What you are left with is a stream filter you can then use like so:

$fp = fopen('http://my.url', 'r');
stream_filter_append($fp, 'http_unchunk_filter', STREAM_FILTER_READ);

If the http stream has a chunked transfer encoding, the filter will automatically unchunk it. However, it ignores chunk extensions (anything after the hex-encoded chunk length on the size line) and trailers (trailing headers after the last chunk), both of which are in the HTTP specification but hardly ever used.
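The core of the filter is a small state machine that has to survive the stream being cut into arbitrary pieces. Here is the same idea as a standalone Python 3 sketch, not a translation of the PHP. The Unchunker class and its attribute names are invented for illustration. The PHP filter above appears to assume that a chunk-size line and the CRLF after a chunk are never split across buckets; the sketch instead holds incomplete input back until the next call.

```python
class Unchunker:
    """Incrementally strip chunked transfer encoding from a byte stream."""

    def __init__(self):
        self._pending = b""      # undecoded input carried between feed() calls
        self._remaining = 0      # payload bytes left in the current chunk
        self._need_crlf = False  # True while the CRLF after a chunk is still due
        self.finished = False    # True once the zero-length chunk was seen

    def feed(self, data):
        """Accept one arbitrary slice of the stream and return decoded payload."""
        buf = self._pending + data
        out = []
        pos = 0
        while pos < len(buf) and not self.finished:
            if self._remaining:
                piece = buf[pos:pos + self._remaining]
                out.append(piece)
                pos += len(piece)
                self._remaining -= len(piece)
                self._need_crlf = self._remaining == 0
            elif self._need_crlf:
                if len(buf) - pos < 2:
                    break  # CRLF split across slices; wait for more input
                pos += 2
                self._need_crlf = False
            else:
                line_end = buf.find(b"\r\n", pos)
                if line_end == -1:
                    break  # size line split across slices; wait for more input
                size_field = buf[pos:line_end].split(b";", 1)[0].strip()
                self._remaining = int(size_field, 16)  # ValueError if not hex
                pos = line_end + 2
                if self._remaining == 0:
                    self.finished = True  # trailers, if any, are ignored
        self._pending = b"" if self.finished else buf[pos:]
        return b"".join(out)

sample = b"4\r\nWiki\r\n5;ext=1\r\npedia\r\n0\r\n\r\n"
decoder = Unchunker()
# Feed the stream one byte at a time, the worst possible split.
decoded = b"".join(decoder.feed(sample[i:i + 1]) for i in range(len(sample)))
print(decoded)  # b'Wikipedia'
```

Like the filter, it drops chunk extensions and trailers; unlike the filter, a malformed size line raises an exception rather than returning an error status.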

Derived Attributes with UNION

A Story

Recently, a client of ours wanted to institute a “point” system for an existing body of users. The idea was that certain actions of the user would generate points for that user, which the client could then track as part of an incentive program.

But What are “Points”?

At the time, we had a simple “users” table in our database which stored all our user-related data. Now we were asked, essentially, to add a new “points” attribute to the “user” entity. However, we could not simply add a “points” column to the “users” table, because the client needed to track individual point-granting actions separately, with descriptions and such.

But this was also not a one-to-many relationship with an abstract “point-event” entity either, since some points were inferred from information which was properly normalized into other parts of the database. For example, referring another user (information we know at user registration time) was worth a certain number of points, but to copy a “referred user” event to a “point-event” entity would mean denormalizing the database. If a user-referral were added or changed later, we would have to make sure to do the same thing to a corresponding point-event.

Thus a user’s “points” are an attribute of the user, but the value of this attribute is derived from potentially many different entities or attributes. Guess what? It’s a derived attribute.

So, how are we going to deal with this?

Implementation

Derived Columns

Some “real” databases have native support for derived attributes (e.g., SQL Server) but as far as I know they all require that the value of the derived attribute be defined as an expression, not the result of an arbitrary query. We could get around this using a stored function which calculates the points for us, but this particular database was MySQL (which does not support derived attributes), version 4.1 (which does not support stored functions).

In any case, this is a bad solution for us because any changes to the point calculation algorithm would require modification of the database, yet we had been accustomed to putting this kind of logic into the application. Additionally, a lazy SELECT * (many of which were unfortunately sprinkled throughout our application) would suddenly become much more expensive, requiring an additional function call per row.

Application Code

The other solution, of course, is that we simply put all the point-calculation code into the application. The problem with this is that it would take multiple queries to the database for every user that interested us, and we could potentially get the wrong point value if a change were made to the database in between our queries (since MySQL MyISAM does not have transactions). Plus, if we want to sort by points (or something more complicated), we would have to do the sorting ourselves, in the application.

UNION

Clearly, we wanted to handle point calculation by a single query. The solution we finally hit upon was to use a temporary table (not a view, since MySQL 4.1 doesn’t support them) filled by a UNION. This is quite possibly the only good use for a UNION. Each subquery of the UNION would calculate points based on a particular attribute or entity, and all the subqueries would SELECT to common column names.

DROP TEMPORARY TABLE IF EXISTS tmp_all_points;
CREATE TEMPORARY TABLE tmp_all_points
-- Get referrer-derived points
(SELECT users.id AS user_id, COUNT(*)*5 AS points
FROM users ... INNER JOIN ... GROUP BY ...)
-- UNION ALL, not UNION: a plain UNION would merge two identical
-- (user_id, points) rows into one and silently lose points
UNION ALL
-- Get pointevents-derived points
(SELECT user_id AS user_id, SUM(points) AS points
FROM pointevents GROUP BY user_id HAVING points != 0);

This will give us a temporary table with 0, 1, or 2 rows per user. If we want to limit this to particular users, we can add the relevant WHERE conditions to the individual subqueries before we send them to the database.

Now if we want to do any queries which involve points, we can just treat tmp_all_points as a “points” entity with a many-to-one relationship with the “users” entity.

Want the top five point-holders?

SELECT users.name, SUM(tmp_all_points.points) AS points
FROM users
INNER JOIN tmp_all_points ON users.id = tmp_all_points.user_id
GROUP BY users.id
ORDER BY points DESC
LIMIT 5
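If you want to try the whole round trip without a MySQL server, here is a rough sketch using Python 3 and its bundled sqlite3 module. The schema is invented for the example (a referrer_id column on users, a description on pointevents), and SQLite’s CREATE TEMPORARY TABLE ... AS syntax differs slightly from the MySQL form above. It uses UNION ALL so that two identical (user_id, points) rows from different subqueries are both kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, referrer_id INTEGER);
    CREATE TABLE pointevents (user_id INTEGER, points INTEGER, description TEXT);

    INSERT INTO users VALUES (1, 'alice', NULL), (2, 'bob', 1), (3, 'carol', NULL);
    INSERT INTO pointevents VALUES (1, 5, 'survey'), (3, 20, 'contest');

    CREATE TEMPORARY TABLE tmp_all_points AS
        -- referrer-derived points: 5 per referred user
        SELECT referrer_id AS user_id, COUNT(*) * 5 AS points
        FROM users WHERE referrer_id IS NOT NULL GROUP BY referrer_id
        UNION ALL
        -- pointevents-derived points
        SELECT user_id, SUM(points) AS points
        FROM pointevents GROUP BY user_id HAVING SUM(points) != 0;
""")

# The top-five query from above, run against the temporary table.
top = conn.execute("""
    SELECT users.name, SUM(tmp_all_points.points) AS points
    FROM users
    INNER JOIN tmp_all_points ON users.id = tmp_all_points.user_id
    GROUP BY users.id
    ORDER BY points DESC
    LIMIT 5
""").fetchall()
print(top)  # [('carol', 20), ('alice', 10)]
```

Alice gets 5 points from each source, so her two rows in tmp_all_points are identical; with a plain UNION they would collapse into one and she would end up with 5 instead of 10.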

Happy Ending?

By using a UNION, we were able to neatly model the derived attribute as a table, using a single query that maps easily to the logic of the derived attribute and is easy to extend to account for any additional criteria that the client may dream up. And we didn’t have to denormalize our database or introduce complex application code.

There is a caveat, however. Tables defined by a query have no index, and we will probably want to join on this table, which means doing a join without an index. For this reason, it is pretty important to keep the result set of your UNION query as small as possible using additional WHERE conditions.

If your result set will always be large, split off the temporary table creation into a definition with keys and fill it with an INSERT INTO tmp_table SELECT ... UNION ALL SELECT .... Don’t use CREATE INDEX after filling your table, since creating an index on a full table is much slower than building it incrementally (except for FULLTEXT indexes, where the opposite is true).
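The define-first, fill-second shape looks roughly like this, again sketched with Python 3 and sqlite3 rather than MySQL. The referrals table and the index name are assumptions made for the example. In MySQL the key would normally go inside the CREATE TEMPORARY TABLE definition itself; SQLite needs a separate CREATE INDEX, issued here while the table is still empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE referrals (referrer_id INTEGER, referred_id INTEGER);
    CREATE TABLE pointevents (user_id INTEGER, points INTEGER);
    INSERT INTO referrals VALUES (1, 2), (1, 3);
    INSERT INTO pointevents VALUES (1, 10), (2, 7);

    -- Define the temporary table first, so the key exists before any rows do.
    CREATE TEMPORARY TABLE tmp_all_points (user_id INTEGER, points INTEGER);
    CREATE INDEX tmp_all_points_user ON tmp_all_points (user_id);

    -- Then fill it from the UNION ALL of the per-source subqueries.
    INSERT INTO tmp_all_points (user_id, points)
        SELECT referrer_id, COUNT(*) * 5 FROM referrals GROUP BY referrer_id
        UNION ALL
        SELECT user_id, SUM(points) FROM pointevents GROUP BY user_id;
""")

totals = conn.execute(
    "SELECT user_id, SUM(points) FROM tmp_all_points "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)  # [(1, 20), (2, 7)]
```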

Don’t Try This With Views

If you are using MySQL 5.0 or above, you won’t be able to mitigate this problem by using a VIEW. MySQL is not very good at optimizing views. If there is not a one-to-one relationship between the rows of your view and the rows of the underlying tables, MySQL will use ALGORITHM = TEMPTABLE for your view. So any view with a UNION in it will be created as a temporary table anyway.

Thus I would not wrap a UNION in a view for this technique, since you can’t control the result set size for a view and you will be generating a new temporary table every time you use the view, instead of once per connection.

Use Your iMac as a Display

I have an Intel iMac (the white kind). It’s my personal machine. I like it. It’s nice. What I especially like about it is that it has a big screen (1680×1050).

I also have an Intel MacBook. It’s my work machine. I like it. It’s nice. But what I don’t like about it is that the screen is a bit smaller than my iMac’s (1280×800). Using the smaller keyboard and mouse isn’t so nice either.

What to do?

Well, there’s VNC. OS X even has a VNC server built in. So I could turn that on and then use a VNC client on my iMac. But that only gives me the keyboard and mouse and a 1280×800 window mirroring the MacBook screen. Not cool.

The same guy who makes the excellent VNC client JollysFastVNC also makes ScreenRecycler. ScreenRecycler turns your VNC client into an attached display. The monitor of the computer your VNC client runs on looks to OS X like just another monitor, plugged in through the mini-DVI port. So now I can work on my MacBook and have a 1680×1050 screen in addition. Joy!

But ScreenRecycler ignores input from the VNC client, so I can’t use my iMac’s keyboard and mouse to control my MacBook. No joy.

But some other guy on the internet makes Teleport. Teleport lets you control other Macs using your keyboard and mouse. Joy has returned!

So, the plan:

  1. Install and run ScreenRecycler and Teleport on the MacBook.
  2. Install and run JollysFastVNC and Teleport on the iMac.
  3. The VNC client finds ScreenRecycler via Bonjour. No sweat.
  4. On the MacBook, tell Teleport to “Share this Mac.”

All done! Now I can use my iMac as a second display to my MacBook and control my MacBook with my iMac. (I can even make the iMac the MacBook’s main display!) Using the power of Spaces, I can even have multiple workspaces, and keep (for example) Mail and iChat permanently displayed in the MacBook screen, no matter what workspace I’m in.

A caveat: Teleport doesn’t seem to recognize the ScreenRecycler display, at least when one machine is Tiger (iMac) and the other Leopard (MacBook). You have to arrange your virtual screens in Teleport in such a way that they don’t share the same borders. Otherwise your pointer will get stuck on the MacBook.