Detecting the ‘crazy’ letters

It’s always a pain to work with multibyte characters; most fonts only support the ISO-8859-1 Latin letters. I was working with ImageMagick, but some of the strings contained Arabic, Chinese, Russian, even some Korean letters, and I did not want to end up with ???’s all over the place instead of the unsupported characters.

The solution is easy if you only want to work with the ISO-8859-1 set:

if (strlen($string) == mb_strlen($string, 'utf-8')) {
    // The byte count (strlen) equals the character count (mb_strlen),
    // so there are no multibyte characters here.
}
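As a minimal sketch, the comparison above can be wrapped in a small helper (the function name is my own, not from any library):

```php
// True when every character in $string fits in a single byte,
// i.e. the UTF-8 byte length equals the character length.
function isSingleByte($string) {
    return strlen($string) === mb_strlen($string, 'UTF-8');
}

var_dump(isSingleByte('hello')); // bool(true)
var_dump(isSingleByte('héllo')); // bool(false): 'é' takes two bytes in UTF-8
```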

Or, if you want to know whether the string consists only of foreign letters:

if ( ! preg_match("/\p{Latin}+/u", $string) ) {
    // The string doesn't contain any Latin characters; all crazy letters here.
}
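A quick sanity check of that pattern: `\p{Latin}` matches any character in the Latin script, including extended letters, not just the ASCII range (the sample strings below are my own):

```php
// Arabic-only text contains no Latin characters.
var_dump((bool) preg_match("/\p{Latin}+/u", 'مرحبا')); // bool(false)

// Turkish letters like 'ş' and 'ğ' are still Latin script.
var_dump((bool) preg_match("/\p{Latin}+/u", 'şöğüt')); // bool(true)
```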

But I wanted to work with extended Latin characters such as Nordic, Turkish, and German letters, and when the string is mixed it gets a little tricky. What I ended up doing is to check first for any Latin character occurrence; there’s no point in checking further if there’s no Latin character at all. Then I go on to check for other common “crazy” alphabets that might come into play, and if the string contains any of them I just skip it.

if ( ! preg_match("/\p{Latin}+/u", $string) ) {
    echo "No Latin characters here";
    return;
}
else {
    $crazies = array("Han", "Hangul", "Hebrew", "Arabic", "Cyrillic", "Greek", "Khmer");
    foreach ($crazies as $crazie) {
        if ( preg_match("/\p{" . $crazie . "}+/u", $string) ) {
            echo "Returned because the string has crazy " . $crazie . " letters.";
            return;
        }
    }
}
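Putting both checks together, the whole routine can be sketched as one reusable function (the function name and sample strings are mine, not part of any library):

```php
// True when $string contains Latin characters and none of the
// scripts the fonts can't render; false otherwise.
function isRenderable($string) {
    // No Latin characters at all: nothing worth rendering.
    if (!preg_match("/\p{Latin}+/u", $string)) {
        return false;
    }
    // Mixed string: reject if any unsupported script appears.
    $crazies = array("Han", "Hangul", "Hebrew", "Arabic", "Cyrillic", "Greek", "Khmer");
    foreach ($crazies as $crazie) {
        if (preg_match("/\p{" . $crazie . "}+/u", $string)) {
            return false;
        }
    }
    return true;
}

var_dump(isRenderable('Grüße from Ängelholm')); // bool(true): extended Latin only
var_dump(isRenderable('hello мир'));            // bool(false): contains Cyrillic
```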

Still not sure that’s the most efficient way to do it, but it works for me.
