December 17th, 2008
Quickly stripping non-ASCII in PHP
Topics: Code, Php
Working on this contract required me to strip non-ASCII characters out of their content before sending it to Sphinx to be indexed. “No problem,” I thought, “I’ll just use that function I wrote earlier”…
Whoops, filtering some test data took like 3 minutes! So I did some further tests; here were my findings:
- strip_utf - 3 mins and 20.546s
- str_replace($crap, '', $string) - 3 mins 6 seconds
- preg_replace('/[^(\x20-\x7F)]*/', '', $content) - 2 mins 10 seconds
- No filtering - 8 seconds!
So it still wasn’t fast enough. Finally I wrote a small chunk of C to strip the crap; piping the data through that gave:
- 12 seconds
Not bad! The code is pretty simple too; in case you want it, here you go:
Wordpress has mangled the code so badly, I have given up trying to get it to display properly. View it here
Edit: Wordpress is gone, so here you go:
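#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* int rather than unsigned int: getchar() returns an int so it can signal EOF */
    int Chr;

    while ((Chr = getchar()) != EOF) {
        /* keep ASCII 32-127, plus carriage return (13) and line feed (10) */
        if ((Chr > 31 && Chr < 128) || (Chr == 13) || (Chr == 10))
            putchar(Chr);
    }
    return 0;
}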
Next up, making a module to use some C to strip these characters from within PHP..
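For what it's worth, the core of such a module could be little more than the same test applied to a buffer in memory instead of stdin. This is only a rough sketch, not the actual module: the name strip_non_ascii and its signature are made up for illustration, and the PHP extension glue around it is left out entirely.

```c
#include <stddef.h>

/*
 * Hypothetical helper (name and signature are assumptions, not from any
 * existing module). Copies the bytes of in[0..len) that are ASCII 32-127,
 * carriage return or line feed into out, and returns how many were kept.
 * out needs room for len bytes; passing the same buffer as in and out
 * also works, because the write position never overtakes the read position.
 */
size_t strip_non_ascii(const char *in, size_t len, char *out)
{
    size_t kept = 0;

    for (size_t i = 0; i < len; i++) {
        /* go through unsigned char so bytes >= 128 do not turn negative */
        unsigned char c = (unsigned char)in[i];

        /* same rule as the stdin filter above */
        if ((c > 31 && c < 128) || c == '\r' || c == '\n')
            out[kept++] = (char)c;
    }
    return kept;
}
```

The output is not NUL-terminated; the caller gets the new length back and decides what to do with it, which suits binary-safe strings like PHP's.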
(Also next Wordpress is going to be replaced)
- Dave.
(Thanks to erisco in ##php for the suggestions and help.)
Edit: By the way, the 3 minutes mattered because at that rate the entire database would have taken like 5 or 6 hours. With the speed increase I managed to get, it’s down to 20 minutes :D