PHP Stream Filters: Unchunking HTTP Streams

Slinging PHP code to and fro one day, I found myself needing to process a potentially large result from a URL, one too large to fit within PHP’s memory limit. However, I could process this result a line at a time, so there was no need to buffer the entire thing in memory. The usual cURL pattern (curl_exec() with CURLOPT_RETURNTRANSFER) hands back the whole response as one string, so instead I reached for PHP’s handy file-like stream interface: fetch the URL with fopen('http://my-url.n.e.t/', 'r') and then call fgets() to keep only a line in memory at a time.
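Here is a minimal sketch of that line-at-a-time loop. So that it runs without a network, a php://memory stream stands in for the remote URL, and the helper name count_lines() is made up for illustration.

```php
<?php
// Hypothetical helper: walks a stream one line at a time and returns the
// number of lines seen. Only the current line is held in memory.
function count_lines($fp) {
	$count = 0;
	while (($line = fgets($fp)) !== false) {
		// ...process $line here...
		$count++;
	}
	return $count;
}

// Stand-in for fopen('http://...', 'r'): an in-memory stream with three lines.
$fp = fopen('php://memory', 'r+');
fwrite($fp, "first\nsecond\nthird\n");
rewind($fp);

$lines = count_lines($fp);
fclose($fp);
echo $lines, "\n";
```

The loop tests against false explicitly so that a line containing only "0" does not end it early.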

It was a great plan, but I noticed that I occasionally got garbage lines or bogus input. HTTP command-line tools like wget and curl revealed nothing out of the ordinary, until I realized that those garbage lines were the uninterpreted length markers of Transfer-Encoding: chunked. PHP’s http stream handler was not decoding chunked transfers. (PHP 5.3 and later include a built-in dechunk filter that the http wrapper applies automatically; older versions do not.)
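To make the "garbage lines" concrete, here is a small illustration of what fgets() sees when a chunked body is not decoded. The payload is invented for the example, and a php://memory stream stands in for the HTTP response body.

```php
<?php
// A chunked body as it appears on the wire: each chunk is a hex length,
// CRLF, that many bytes of data, CRLF. A zero-length chunk ends the body.
$wire = "6\r\nhello\n\r\n6\r\nworld\n\r\n0\r\n\r\n";

$fp = fopen('php://memory', 'r+');
fwrite($fp, $wire);
rewind($fp);

$seen = array();
while (($line = fgets($fp)) !== false) {
	$seen[] = rtrim($line, "\r\n"); // strip line endings for display
}
fclose($fp);
print_r($seen);
```

Only two of the eight lines collected in $seen are real data ("hello" and "world"); the rest are chunk lengths and the blank lines left by the CRLF framing.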

There is a PECL function, http_chunked_decode(), but it operates on strings, not streams, so I would still have to buffer the entire input first.
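For comparison, a string-based decoder can be sketched in a few lines. This is not the PECL implementation, just a hypothetical unchunk_string() helper written for illustration. It shows why the approach needs the complete body in memory: it indexes freely into one big string.

```php
<?php
// Hypothetical helper, not the PECL function: decodes a complete chunked
// body held in a single string. Chunk extensions and trailers are ignored.
function unchunk_string($body) {
	$out = '';
	$offset = 0;
	$length = strlen($body);
	while ($offset < $length && ($eol = strpos($body, "\r\n", $offset)) !== false) {
		$sizefield = current(explode(';', substr($body, $offset, $eol - $offset), 2));
		$size = hexdec(trim($sizefield));
		if ($size === 0) {
			break; // the zero-length chunk marks the end of the body
		}
		$out .= substr($body, $eol + 2, $size);
		$offset = $eol + 2 + $size + 2; // skip the data and its trailing CRLF
	}
	return $out;
}

$decoded = unchunk_string("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n");
echo $decoded, "\n";
```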

PHP’s streams allow you to attach a chain of stream filters to a stream to process input and output (built-in filters such as string.rot13 and zlib.deflate plug in the same way). My plan was to create a stream filter to transparently unchunk the stream. Unfortunately, the documentation on writing your own stream filter is pretty sparse, and the examples I could find on the web were all very trivial.
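For orientation, the bare skeleton of a user-defined filter looks roughly like the sketch below. The class name upcase_filter and the registered name demo.upcase are invented for this example, and the filter merely upper-cases what passes through. Again, a php://memory stream stands in for a real source.

```php
<?php
// Hypothetical example filter: upper-cases whatever passes through it.
class upcase_filter extends php_user_filter {
	#[\ReturnTypeWillChange]
	function filter($in, $out, &$consumed, $closing) {
		while ($bucket = stream_bucket_make_writeable($in)) {
			$consumed += $bucket->datalen;       // report how much input was used
			$bucket->data = strtoupper($bucket->data);
			stream_bucket_append($out, $bucket); // hand the bucket downstream
		}
		return PSFS_PASS_ON;
	}
}
stream_filter_register('demo.upcase', 'upcase_filter');

$fp = fopen('php://memory', 'r+');
fwrite($fp, "chunky\n");
rewind($fp);
stream_filter_append($fp, 'demo.upcase', STREAM_FILTER_READ);
$result = fgets($fp);
fclose($fp);
echo $result;
```

A filter like this has it easy because every bucket can be transformed on its own. An unchunker cannot do that: chunk boundaries and bucket boundaries do not line up, so it has to carry state from one filter() call to the next.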

After a few false starts, however, I was able to create an http stream unchunker:

/**
* A stream filter for removing the 'chunking' of a 'Transfer-Encoding: chunked'
* http response
*
* PHP's http stream wrapper did not decode chunked transfer encoding before
* PHP 5.3, making this filter necessary on older versions.
*
* Add to a file resource with <code>stream_filter_append($fp, 'http_unchunk_filter',
* STREAM_FILTER_READ);</code>
*
* If the wrapper metadata for $fp does not contain a <code>transfer-encoding:
* chunked</code> header, this filter passes data through unchanged.
*
* @license BSD
* @author Francis Avila
*/
// Stream filters must subclass php_user_filter
class http_unchunk_filter extends php_user_filter {
	protected $chunkremaining = 0; // bytes remaining in the current chunk
	protected $ischunked = null; // whether the stream is chunk-encoded. null=not sure yet
	protected $pending = ''; // unparsed bytes held over until the next bucket arrives
	protected $needcrlf = false; // true while the CRLF that closes a chunk is still unread
	protected $finished = false; // true once the zero-length last chunk has been seen

	// This is the meat of the filter.
	// The class must have a method with this name and prototype.
	// It must return a status: one of the PSFS_* constants.
	// (The attribute below silences a PHP 8.1+ deprecation notice; older
	// versions read the line as a comment.)
	#[\ReturnTypeWillChange]
	function filter($in, $out, &$consumed, $closing) {
		if ($this->ischunked===null) {
			$this->ischunked = self::ischunked($this->stream);
		}
		$produced = false; // whether any bucket was handed to $out in this call
		// $in and $out are opaque "bucket brigade" objects which consist of a
		// sequence of opaque "buckets", which contain the actual stream data.
		// The only way to use these objects is the stream_bucket_* functions.
		// Unfortunately, there doesn't seem to be any way to access a bucket
		// without turning it into a string using stream_bucket_make_writeable(),
		// even if you want to pass the bucket along unmodified.

		// Each call to this pops a bucket from the bucket brigade and
		// converts it into an object with the properties datalen and data.
		// This same object interface is accepted by stream_bucket_append().
		while ($bucket = stream_bucket_make_writeable($in)) {
			$consumed += $bucket->datalen;
			if (!$this->ischunked) {
				stream_bucket_append($out, $bucket);
				$produced = true;
				continue;
			}
			// The stream divides the data into buckets arbitrarily, so a
			// chunk-size line or a CRLF can be split across two buckets. We keep
			// state ($this->chunkremaining, $this->needcrlf) across filter() calls
			// and prepend whatever could not be parsed last time.
			$data = $this->pending . $bucket->data;
			$datalen = strlen($data);
			$this->pending = '';
			$outbuffer = '';
			$offset = 0;
			// Loop through the string. For efficiency, we don't advance a character
			// at a time but jump ahead to where the next chunk boundary should be.
			while ($offset < $datalen && !$this->finished) {
				if ($this->needcrlf) { // skip the CRLF that follows a chunk's data
					if ($datalen - $offset < 2) {
						break; // only half of the CRLF has arrived so far
					}
					$offset += 2;
					$this->needcrlf = false;
					continue;
				}
				if ($this->chunkremaining===0) { // start of new chunk, or the start of the transfer
					$firstline = strpos($data, "\r\n", $offset);
					if ($firstline===false) {
						break; // chunk-size line is incomplete; wait for the next bucket
					}
					$chunkline = substr($data, $offset, $firstline-$offset);
					$chunklen = current(explode(';', $chunkline, 2)); // ignore chunk extensions
					$chunklen = trim($chunklen);
					if (!ctype_xdigit($chunklen)) {
						// There should have been a chunk length specifier here, but since
						// there are non-hex digits something must have gone wrong.
						return PSFS_ERR_FATAL;
					}
					$this->chunkremaining = (int) hexdec($chunklen);
					// $firstline is an absolute position, so it already accounts for $offset
					$offset = $firstline+2; // +2 is CRLF
					if ($this->chunkremaining===0) { // end of the transfer
						$this->finished = true; // ignore possible trailing headers
						break;
					}
				}
				// get as much data as available in a single go...
				$nibble = (string) substr($data, $offset, $this->chunkremaining);
				$nibblesize = strlen($nibble);
				$offset += $nibblesize; // ...but recognize we may not have got all of it
				$this->chunkremaining -= $nibblesize;
				$this->needcrlf = ($this->chunkremaining===0);
				$outbuffer .= $nibble;
			}
			if (!$this->finished) {
				$this->pending = (string) substr($data, $offset);
			}
			if ($outbuffer!=='') {
				$bucket->data = $outbuffer;
				stream_bucket_append($out, $bucket);
				$produced = true;
			}
		}
		// PSFS_FEED_ME tells the stream we consumed input but have no output yet.
		return $produced ? PSFS_PASS_ON : PSFS_FEED_ME;
	}

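	protected static function ischunked($stream) {
		$metadata = stream_get_meta_data($stream);
		// wrapper_data holds the response headers for http streams; other
		// stream types may not provide it, or may provide something else.
		if (!isset($metadata['wrapper_data']) || !is_array($metadata['wrapper_data'])) {
			return false;
		}
		return (bool) preg_grep('/^Transfer-Encoding:\s*chunked\s*$/i', $metadata['wrapper_data']);
	}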

	#[\ReturnTypeWillChange]
	function onCreate() {
		if (isset($this->stream)) { // This is usually not defined until the first filter() call.
			$this->ischunked = self::ischunked($this->stream);
		}
		return true; // returning false would make stream_filter_append() fail
	}
}

stream_filter_register('http_unchunk_filter', 'http_unchunk_filter');

What you are left with is a stream filter you can then use like so:

$fp = fopen('http://my.url', 'r');
stream_filter_append($fp, 'http_unchunk_filter', STREAM_FILTER_READ);
while (($line = fgets($fp)) !== false) { /* process $line */ }
fclose($fp);

If the HTTP stream has a chunked transfer encoding, the filter will automatically unchunk it. However, it ignores chunk extensions (anything after the hex-encoded chunk length on the size line) and trailing headers, both of which are in the HTTP specification but hardly ever used.
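As a small illustration of what ignoring a chunk extension means, the sketch below mirrors the size-line handling in the filter above. The helper name chunk_size_from_line() and the sample lines are made up for this example.

```php
<?php
// Hypothetical helper mirroring the filter's size-line handling: drop
// anything after ';' (a chunk extension) and read what is left as hex.
// Returns false when the remaining field is not a hex number.
function chunk_size_from_line($line) {
	$field = trim(current(explode(';', $line, 2)));
	return ctype_xdigit($field) ? hexdec($field) : false;
}

$plain = chunk_size_from_line("1a");
$extended = chunk_size_from_line("1A;name=value");
$bad = chunk_size_from_line("oops");
var_dump($plain, $extended, $bad);
```

Both size lines yield 26 bytes; the name=value extension is simply thrown away, and the malformed line is rejected.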
