December 17th, 2008

Quickly stripping non-ASCII in PHP

Written by Dave Barker

Topics: Code, PHP

Working on this contract required me to strip non-ASCII characters out of their content before sending it to Sphinx to be indexed. “No problem,” I thought, “I’ll just use that function I wrote earlier”…

Whoops, it took about 3 minutes to filter some test data! So I ran some further tests; here are my findings:

  • strip_utf - 3 mins 20.546 seconds
  • str_replace($crap, '', $string) - 3 mins 6 seconds
  • preg_replace('/[^(\x20-\x7F)]*/', '', $content) - 2 mins 10 seconds
  • No filtering - 8 seconds!

So it still wasn’t fast enough. Finally I wrote a small chunk of C to strip the crap. Piping the data through that gave:

  • 12 seconds

Not bad! The code is pretty simple too. In case you want it, here you go:

Wordpress has mangled the code so badly, I have given up trying to get it to display properly. View it here

Edit: WordPress is gone, so here you go:

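#include <stdio.h>

/* Copy stdin to stdout, keeping only bytes 32-127 plus CR (13) and LF (10). */
int main(void) {
  /* int rather than unsigned int: getchar() returns an int so that EOF
     (a negative value) can be told apart from every real byte. */
  int chr;

  while ((chr = getchar()) != EOF) {
    if ((chr > 31 && chr < 128) || chr == 13 || chr == 10)
      putchar(chr);
  }
  return 0;
}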

Next up: making a module that uses some C to strip these characters from within PHP.
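
One way that module idea could start is by pulling the filter out into a function that works on a buffer in memory instead of on stdin. This is only a rough sketch under my own assumptions, not the module itself: strip_non_ascii is a made-up name, and the glue that would expose it to PHP is left out entirely.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): filter `len` bytes of
 * `buf` in place, keeping the same bytes as the stdin filter above
 * (32-127, CR and LF), and return the new length. */
size_t strip_non_ascii(char *buf, size_t len) {
  size_t in, out = 0;

  for (in = 0; in < len; in++) {
    /* Go through unsigned char so bytes >= 128 are never seen as negative. */
    unsigned char c = (unsigned char)buf[in];

    if ((c > 31 && c < 128) || c == 13 || c == 10)
      buf[out++] = (char)c; /* out <= in, so this never overwrites unread data */
  }
  return out;
}
```

The function does not add a terminating NUL, so a caller that wants a C string would write buf[returned_length] = '\0' itself.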

(Also next: WordPress is going to be replaced.)

- Dave.

(Thanks to erisco in ##php for the suggestions and help.)

Edit: By the way, the 3 minutes mattered because it meant that filtering the entire database would take 5 or 6 hours. With the speed increase I managed to get, it’s down to 20 minutes :D
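
Those numbers hang together if you assume run time grows linearly with the amount of data. A small sketch of the arithmetic follows; estimate_full_minutes is a made-up helper, and the linear scaling is my assumption rather than something that was measured.

```c
/* Hypothetical helper: estimate the full-database time (in minutes) for a
 * filter, given its time on the test data and the slow filter's times on
 * both the test data and the full database. Assumes linear scaling. */
double estimate_full_minutes(double sample_s, double slow_sample_s,
                             double slow_full_hours) {
  /* How many times bigger the full database is than the test data. */
  double scale = slow_full_hours * 3600.0 / slow_sample_s;

  return sample_s * scale / 60.0;
}
```

Taking 5.5 hours as the midpoint of "5 or 6 hours" (again an assumption), estimate_full_minutes(12.0, 200.546, 5.5) comes out at roughly 20 minutes, in line with the figure above.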