PHP Stream Filters: Unchunking HTTP Streams

Slinging php code to and fro one day, I found myself needing to process a potentially large result from a url–a result too large to fit within PHP’s memory limit.  However, I could process this result a line at a time, so I could avoid buffering the entire thing in memory.  I couldn’t use cURL, since it buffers everything, but I could use PHP’s handy file-like stream interface, fetch the url with an fopen('http://my-url.n.e.t/', 'r'); and then use fgets() to keep only a line in memory at a time.

It was a great plan, but I noticed that I occasionally got garbage lines or bogus input. Using http cli tools like wget and curl revealed nothing out of the ordinary, until I realized that those garbage lines were the uninterpreted length markers for Transfer-Encoding: chunked. PHP’s http stream handler does not decode chunked transfers.

There is a pecl function http_chunked_decode(), but it operates on strings, not streams, so I would still have to buffer the entire input first.

PHP’s streams allow you to attach a chain of stream filters to a stream to process input and output (it’s the same mechanism ob_gzhandler() uses). My plan was to create a stream filter to transparently unchunk the stream. Unfortunately, the documentation on writing your own stream filter is pretty sparse, and the examples I could find on the web were all very trivial.

After a few false starts, however, I was able to create an http stream unchunker:

/**
* A stream filter for removing the 'chunking' of a 'Transfer-Encoding: chunked'
* http response
*
* The http stream wrapper on php does not support chunked transfer
* encoding, making this filter necessary.
*
* Add to a file resource with <code>stream_filter_append($fp, 'http_unchunk_filter',
* STREAM_FILTER_READ);</code>
*
* If the wrapper metadata for $fp does not contain a <code>transfer-encoding:
* chunked</code> header, this filter passes data through unchanged.
*
* @license BSD
* @author Francis Avila
*/
// Stream filters must subclass php_user_filter
class http_unchunk_filter extends php_user_filter {
	protected $chunkremaining = 0; //bytes remaining in the current chunk
	protected $ischunked = null; //whether the stream is chunk-encoded. null=not sure yet

	// this is the meat of the filter.
	// The class must have a function with this name and prototype
	// It must return a status--one of the PSFS_* constants;
	function filter($in, $out, &$consumed, $closing) {
		if ($this->ischunked===null) {
			$this->ischunked = self::ischunked($this->stream);
		}
		// $in and $out are opaque "bucket brigade" objects which consist of a
		// sequence of opaque "buckets", which contain the actual stream data.
		// The only way to use these objects is the stream_bucket_* functions.
		// Unfortunately, there doesn't seem to be any way to access a bucket
		// without turning it into a string using stream_bucket_make_writeable(),
		// even if you want to pass the bucket along unmodified.

		// Each call to this pops a bucket from the bucket brigade and
		// converts it into an object with two properties: datalen and data.
		// This same object interface is accepted by stream_bucket_append().
		while ($bucket = stream_bucket_make_writeable($in)) {
			if (!$this->ischunked) {
				$consumed += $bucket->datalen;
				stream_bucket_append($out, $bucket);
				continue;
			}
			$outbuffer = '';
			$offset = 0;
			// Loop through the string.  For efficiency, we don't advance a character
			// at a time but try to zoom ahead to where we think the next chunk
			// boundary should be.

			// Since the stream filter divides the data into buckets arbitrarily,
			// we have to maintain state ($this->chunkremaining) across filter() calls.
			while ($offset < $bucket->datalen) {
				if ($this->chunkremaining===0) { // start of new chunk, or the start of the transfer
					$firstline = strpos($bucket->data, "\r\n", $offset);
					$chunkline = substr($bucket->data, $offset, $firstline-$offset);
					$chunklen = current(explode(';', $chunkline, 2)); // ignore MIME-like extensions
					$chunklen = trim($chunklen);
					if (!ctype_xdigit($chunklen)) {
					// There should have been a chunk length specifier here, but since
					// there are non-hex digits something must have gone wrong.
						return PSFS_ERR_FATAL;
					}
					$this->chunkremaining = hexdec($chunklen);
					// $firstline already includes $offset in it
					$offset = $firstline+2; // +2 is CRLF
					if ($this->chunkremaining===0) { //end of the transfer
						break;  // ignore possible trailing headers
					}
				}
				// get as much data as available in a single go...
				$nibble = substr($bucket->data, $offset, $this->chunkremaining);
				$nibblesize = strlen($nibble);
				$offset += $nibblesize; // ...but recognize we may not have got all of it
				if ($nibblesize === $this->chunkremaining) {
					$offset += 2; // skip over trailing CRLF
				}
				$this->chunkremaining -= $nibblesize;
				$outbuffer .= $nibble;
			}
			$consumed += $bucket->datalen;
			$bucket->data = $outbuffer;
			stream_bucket_append($out, $bucket);
		}
		return PSFS_PASS_ON;
	}

	protected static function ischunked($stream) {
		$metadata = stream_get_meta_data($stream);
		$headers = $metadata['wrapper_data'];
		return (bool) preg_grep('/^Transfer-Encoding:\s+chunked\s*$/i', $headers);
	}

	function onCreate() {
		if (isset($this->stream)) { // This is usually not defined until the first filter() call.
			$this->ischunked = self::ischunked($this->stream);
		}
	}
}

stream_filter_register('http_unchunk_filter', 'http_unchunk_filter');

What you are left with is a stream filter you can then use like so:

$fp = fopen('http://my.url', 'r');
stream_filter_append($fp, 'http_unchunk_filter', STREAM_FILTER_READ);

If the http stream has a chunked transfer encoding, the filter will automatically unchunk it. However, it ignores extended data (anything after the hex-encoded chunk-length) and trailing headers, both of which are in the http specification but hardly ever used.

  • tags:
  • 2 Comments

Comments

Thanks for this great filter example. I think it’s worth of a note that with PHP 5.3 there were some improvements done with chunked transfer:

Quote: 5.3.0 — The protocol_version supports chunked transfer decoding when set to 1.1.
From: HTTP context options

Ref: [PHP-DEV] Support for “Transfer-Encoding: chunked” in http stream wrapper

p.s. check the tab order on your blog, when you tab out of NAME you jump to your client’s access PASSWORD field on top of the page.

mot August 8, 2010

There is now a ‘dechunk’ filter in PHP 5.3.

Jaka Jancar September 21, 2010

Leave a mark

 
 
 
 
 

THE BALLERINAS

  • PJ Doland

    Born in a cross-fire hurricane and he howled at his ma in the driving rain.

    Matt

    Making sure all our websites have at least 15 pieces of flair.

    Erin Doland

    100 percent all-natural high-quality content machine.

    Francis Avila

    Ambidextrously juggling clients and code without breaking a sweat.

    Rachelle Ondiege

    Far too much energy for her own good.