What you see may not be what you get

Matheus Richard in Brazil

Say you have a string like "💖 Love", and you want to extract the emoji. Your first instinct might be to try something like:

string = "💖 Love"
emoji = string[0] # => "💖"

While that works in simple cases, it’s unreliable. If you try the same thing with a more complex emoji, you’ll get only a piece of it:

string = "👨‍👩‍👧 Family"
emoji = string[0] # => "👨"

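Here’s the catch: that emoji is composed of multiple Unicode code points. Even though we see one thing, under the hood it’s three separate emojis (man, woman, and girl) glued together by invisible zero-width joiners (U+200D), and string[0] only returns the first of them.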
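If you want to see those code points for yourself, here is a small sketch. The variable names are only for illustration, and the emoji is written with explicit escapes so its structure stays visible even if your editor renders it as a single glyph:

```ruby
# The family emoji spelled out code point by code point:
# man, zero-width joiner, woman, zero-width joiner, girl.
family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"

# String#length counts code points, not what the user sees.
family.length # => 5

# Format each code point in the usual U+XXXX notation.
labels = family.codepoints.map { |cp| format("U+%04X", cp) }
# => ["U+1F468", "U+200D", "U+1F469", "U+200D", "U+1F467"]

# Indexing works on code points too, so we only get the man.
family[0] == "\u{1F468}" # => true
```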

Because not every user-perceived character corresponds to a single code point, we need to use a different “unit” of measurement to accurately walk through the string.

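That’s where Grapheme Clusters come in. A Grapheme Cluster is a sequence of one or more code points that should be treated as a single user-perceived character, so walking a string cluster by cluster matches what people actually see.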

Back to the initial example, let’s extract the emoji using Grapheme Clusters. Luckily, Ruby has built-in support for them:

string = "💖 Love"
emoji = string.grapheme_clusters[0] # => "💖"

string = "👨‍👩‍👧 Family"
emoji = string.grapheme_clusters[0] # => "👨‍👩‍👧"

And… done!

Why should you care?

Even if you’re not dealing with emojis, understanding Grapheme Clusters is useful for handling any text with “complex” characters.

If you’re working with languages that have accents, ligatures, or other composite characters (things like g̈ or กำ), being aware of Grapheme Clusters can help you avoid subtle bugs and ensure your text processing is accurate. This can be especially sensitive when dealing with user data like names, addresses, and other personal information.
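To make that concrete, here is a minimal sketch of how a name can get mangled. It assumes the accent is stored as a separate combining code point (which depends on how the text was typed or normalized), and truncate_graphemes is a hypothetical helper written for this example, not something Ruby ships with:

```ruby
# Hypothetical helper: keep at most `limit` user-perceived characters.
def truncate_graphemes(text, limit)
  text.grapheme_clusters.first(limit).join
end

# "Zoë" with the diaeresis stored as a combining code point (U+0308)
# that follows the plain "e".
name = "Zoe\u0308"

name.length                   # => 4 (code points)
name.grapheme_clusters.length # => 3 (what the user sees)

# Slicing by code points silently drops the accent...
naive = name[0, 3]            # => "Zoe"

# ...while slicing by Grapheme Clusters keeps the name intact.
safe = truncate_graphemes(name, 3)
safe == name                  # => true
```

The same idea applies to anything that counts, slices, or truncates user-facing text, like character limits on a form field.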

Now you know about Grapheme Clusters. Use that power well.
