Slinging PHP code to and fro one day, I found myself needing to process a potentially large result from a URL, one too large to fit within PHP's memory limit. However, I could process the result a line at a time, so I could avoid buffering the entire thing in memory. I couldn't use cURL, since it buffers everything, but I could use PHP's handy file-like stream interface: fetch the URL with an fopen('http://my-url.n.e.t/', 'r') call and then use fgets() to keep only one line in memory at a time.
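The plan, sketched with a placeholder URL and a hypothetical process_line() helper:

// A minimal sketch of the plan: stream the response line by line instead of
// buffering the whole body. The URL and process_line() are placeholders.
$fp = fopen('http://example.com/big-result', 'r');
if ($fp === false) {
    die('could not open stream');
}
while (($line = fgets($fp)) !== false) {
    process_line($line); // hypothetical per-line handler
}
fclose($fp);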
It was a great plan, but I noticed that I occasionally got garbage lines of bogus input. Using HTTP CLI tools like wget and curl revealed nothing out of the ordinary, until I realized that those garbage lines were the uninterpreted length markers of a Transfer-Encoding: chunked response. PHP's http stream handler does not decode chunked transfers.
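To make this concrete, here is a made-up fragment of a chunked response body. Each line ends in CRLF; every run of chunk data is preceded by its length in hexadecimal, and a zero-length chunk marks the end of the transfer:

4
Wiki
5
pedia
0

Decoded, that body is just "Wikipedia". The stray "4", "5", and "0" lines are exactly the sort of garbage I was seeing.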
There is a PECL function http_chunked_decode(), but it operates on strings, not streams, so I would still have to buffer the entire input first.
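For contrast, the buffered route would look something like this sketch (http_chunked_decode() comes from the pecl_http extension; the URL is a placeholder):

// Buffered approach: the whole raw body has to be read into a string before
// http_chunked_decode() can run -- which defeats the point when the body is
// larger than the memory limit.
$raw = file_get_contents('http://example.com/big-result');
$body = http_chunked_decode($raw);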
PHP's streams allow you to attach a chain of stream filters to a stream to process input and output (the same idea behind ob_gzhandler(), which transparently transforms output). My plan was to create a stream filter to transparently unchunk the stream. Unfortunately, the documentation on writing your own stream filter is pretty sparse, and the examples I could find on the web were all very trivial.
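To illustrate the mechanism with one of PHP's built-in filters, string.toupper can be chained onto a stream the same way a custom filter will be:

// The filter chain mechanism with a built-in filter: everything read from
// $fp passes through string.toupper before reaching fgets().
$fp = fopen('php://temp', 'r+');
fwrite($fp, "hello, streams\n");
rewind($fp);
stream_filter_append($fp, 'string.toupper', STREAM_FILTER_READ);
echo fgets($fp); // HELLO, STREAMS
fclose($fp);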
After a few false starts, however, I was able to create an HTTP stream unchunker:
/**
 * A stream filter for removing the 'chunking' of a 'Transfer-Encoding: chunked'
 * http response
 *
 * The http stream wrapper on php does not support chunked transfer
 * encoding, making this filter necessary.
 *
 * Add to a file resource with <code>stream_filter_append($fp, 'http_unchunk_filter',
 * STREAM_FILTER_READ);</code>
 *
 * If the wrapper metadata for $fp does not contain a <code>transfer-encoding:
 * chunked</code> header, this filter passes data through unchanged.
 *
 * @license BSD
 * @author Francis Avila
 */

// Stream filters must subclass php_user_filter
class http_unchunk_filter extends php_user_filter
{
    protected $chunkremaining = 0; // bytes remaining in the current chunk
    protected $ischunked = null;   // whether the stream is chunk-encoded. null=not sure yet

    // This is the meat of the filter.
    // The class must have a function with this name and prototype.
    // It must return a status--one of the PSFS_* constants.
    function filter($in, $out, &$consumed, $closing)
    {
        if ($this->ischunked === null) {
            $this->ischunked = self::ischunked($this->stream);
        }

        // $in and $out are opaque "bucket brigade" objects which consist of a
        // sequence of opaque "buckets", which contain the actual stream data.
        // The only way to use these objects is the stream_bucket_* functions.
        // Unfortunately, there doesn't seem to be any way to access a bucket
        // without turning it into a string using stream_bucket_make_writeable(),
        // even if you want to pass the bucket along unmodified.

        // Each call to this pops a bucket from the bucket brigade and
        // converts it into an object with two properties: datalen and data.
        // This same object interface is accepted by stream_bucket_append().
        while ($bucket = stream_bucket_make_writeable($in)) {
            if (!$this->ischunked) {
                $consumed += $bucket->datalen;
                stream_bucket_append($out, $bucket);
                continue;
            }

            $outbuffer = '';
            $offset = 0;

            // Loop through the string. For efficiency, we don't advance a character
            // at a time but try to zoom ahead to where we think the next chunk
            // boundary should be.
            // Since the stream filter divides the data into buckets arbitrarily,
            // we have to maintain state ($this->chunkremaining) across filter() calls.
            while ($offset < $bucket->datalen) {
                if ($this->chunkremaining === 0) {
                    // start of new chunk, or the start of the transfer
                    $firstline = strpos($bucket->data, "\r\n", $offset);
                    if ($firstline === false) {
                        // The chunk-size line is split across buckets; this simple
                        // filter does not handle that case.
                        return PSFS_ERR_FATAL;
                    }
                    $chunkline = substr($bucket->data, $offset, $firstline - $offset);
                    $chunklen = current(explode(';', $chunkline, 2)); // ignore MIME-like extensions
                    $chunklen = trim($chunklen);
                    if (!ctype_xdigit($chunklen)) {
                        // There should have been a chunk length specifier here, but since
                        // there are non-hex digits something must have gone wrong.
                        return PSFS_ERR_FATAL;
                    }
                    $this->chunkremaining = hexdec($chunklen);
                    // $firstline already includes $offset in it
                    $offset = $firstline + 2; // +2 is CRLF
                    if ($this->chunkremaining === 0) { // end of the transfer
                        break; // ignore possible trailing headers
                    }
                }
                // get as much data as available in a single go...
                $nibble = substr($bucket->data, $offset, $this->chunkremaining);
                $nibblesize = strlen($nibble);
                $offset += $nibblesize;
                // ...but recognize we may not have got all of it
                if ($nibblesize === $this->chunkremaining) {
                    $offset += 2; // skip over trailing CRLF
                }
                $this->chunkremaining -= $nibblesize;
                $outbuffer .= $nibble;
            }
            $consumed += $bucket->datalen;
            $bucket->data = $outbuffer;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }

    protected static function ischunked($stream)
    {
        $metadata = stream_get_meta_data($stream);
        $headers = $metadata['wrapper_data'];
        return (bool) preg_grep('/^Transfer-Encoding:\s+chunked\s*$/i', $headers);
    }

    function onCreate()
    {
        if (isset($this->stream)) {
            // This is usually not defined until the first filter() call.
            $this->ischunked = self::ischunked($this->stream);
        }
        return true;
    }
}
stream_filter_register('http_unchunk_filter', 'http_unchunk_filter');
What you are left with is a stream filter you can then use like so:

$fp = fopen('http://my.url', 'r');
stream_filter_append($fp, 'http_unchunk_filter', STREAM_FILTER_READ);
If the HTTP stream has a chunked transfer encoding, the filter will automatically unchunk it. However, it ignores chunk extensions (anything after the hex-encoded chunk length) and trailing headers, both of which are in the HTTP specification but hardly ever used.
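Putting it back together with the original plan (again with a placeholder URL and a hypothetical process_line() helper):

// Assumes the filter class above has been defined or included.
$fp = fopen('http://example.com/big-result', 'r');
stream_filter_append($fp, 'http_unchunk_filter', STREAM_FILTER_READ);
while (($line = fgets($fp)) !== false) {
    process_line($line); // hypothetical per-line handler
}
fclose($fp);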
It seems that there is a predefined filter, “dechunk”, that does what you did.
See here: http://technosophos.com/2012/02/28/php-stream-filters-compress-transform-and-transcode-fly.html
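For reference, the built-in dechunk filter (added in PHP 5.3) attaches the same way as the custom filter above:

$fp = fopen('http://my.url', 'r');
stream_filter_append($fp, 'dechunk', STREAM_FILTER_READ);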