What you see may not be what you get

Say you have a string like "💖 Love", and you want to extract the emoji. Your first instinct might be to try something like:

string = "💖 Love"
emoji = string[0] # => "💖"

That works here because 💖 is a single code point, but it’s unreliable. Try the same thing with a more complex emoji and you get only part of it:

string = "👨‍👩‍👧 Family"
emoji = string[0] # => "👨"

Here’s the catch: that emoji is composed of multiple Unicode code points. It is three separate emoji (👨, 👩, and 👧) joined by invisible zero-width joiners (U+200D). Even though we see one thing, string[0] returns only the first code point under the hood.
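To see what is going on under the hood, here is a small sketch that lists the code points of the family emoji. The emoji is written with escape sequences so the invisible joiners are explicit; the variable names are only illustrative.

```ruby
# The family emoji, spelled out code point by code point.
family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}" # 👨‍👩‍👧

# String#codepoints returns one integer per code point; format them as U+XXXX.
code_points = family.codepoints.map { |cp| format("U+%04X", cp) }
# => ["U+1F468", "U+200D", "U+1F469", "U+200D", "U+1F467"]

# String#length and String#[] both count code points, not what the reader sees.
char_count = family.length # => 5
first_char = family[0]     # => "👨" (U+1F468 only)
```

Two of the five code points are zero-width joiners, which is why indexing by position hands back a lone 👨.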

Because not every user-perceived character corresponds to a single code point, we need to use a different “unit” of measurement to accurately walk through the string.

That’s where Grapheme Clusters come in. They provide a more accurate way to represent user-perceived characters.

Back to the initial example, let’s extract the emoji using Grapheme Clusters. Luckily, Ruby has built-in support for them through String#grapheme_clusters (available since Ruby 2.5):

string = "💖 Love"
emoji = string.grapheme_clusters[0] # => "💖"

string = "👨‍👩‍👧 Family"
emoji = string.grapheme_clusters[0] # => "👨‍👩‍👧"
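The same distinction shows up when counting. As a rough sketch, assuming a Ruby version recent enough to recognise emoji joiner sequences, length counts code points while grapheme_clusters counts what the reader sees. If you only need the first cluster, each_grapheme_cluster without a block returns an enumerator, so you can avoid building the whole array.

```ruby
string = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467} Family" # "👨‍👩‍👧 Family"

# 5 code points for the emoji + 1 space + 6 letters.
code_point_count = string.length # => 12

# 1 cluster for the emoji + 1 space + 6 letters.
grapheme_count = string.grapheme_clusters.length # => 8

# Take only the first cluster from the enumerator.
first_cluster = string.each_grapheme_cluster.first # => "👨‍👩‍👧"
```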

And… done!

Why should you care?

Even if you’re not dealing with emojis, understanding Grapheme Clusters is useful for handling any text with “complex” characters.

If you’re working with languages that have accents, ligatures, or other composite characters (things like g̈ or กำ), being aware of Grapheme Clusters can help you avoid subtle bugs and ensure your text processing is accurate. This can be especially sensitive when dealing with user data like names, addresses, and other personal information.
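As an illustration of the kind of subtle bug this avoids, here is a minimal sketch of truncating and reversing a string that contains g̈ (a "g" followed by a combining diaeresis, U+0308). The truncate_graphemes helper and the sample string are hypothetical, made up for this example.

```ruby
# Hypothetical helper: keep at most `max` user-perceived characters.
def truncate_graphemes(text, max)
  text.each_grapheme_cluster.first(max).join
end

name = "g\u0308nter" # "g̈nter": "g" + U+0308 COMBINING DIAERESIS + "nter"

# Slicing by code point cuts the cluster in half and drops the diaeresis.
by_code_point = name[0, 1] # => "g"

# Slicing by grapheme cluster keeps the letter and its mark together.
by_grapheme = truncate_graphemes(name, 1) # => "g̈"

# A plain reverse moves the diaeresis onto the wrong letter ("n").
reversed_naive = name.reverse

# Reversing cluster by cluster keeps the mark attached to "g".
reversed_grapheme = name.grapheme_clusters.reverse.join # => "retng̈"
```

The same approach applies to any operation that splits text by position, such as building initials from a name or enforcing a display-length limit.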

Now you know about Grapheme Clusters. Use that power well.
