About half a dozen times over the years, I’ve noticed something perplexing
in various software projects: a person’s gender is encoded as a boolean
`is_male` field. The main, surface-level problem with this is one about which
I and many of my colleagues have had various opportunities to educate others:
gender is not a binary (humans and our culture are much more beautifully messy
than that). However, there’s something else that’s always bothered me about it
that I’ve struggled to put into words: even if gender were a binary, encoding
it as a boolean would be a poor choice. I’d like to take some time to explore
why this is, because it’s applicable as a more general lesson in data modeling.
I believe some developers choose this approach because they’re aware that a
boolean uses less space on disk in their systems, which is a good intention.
I’d argue that it comes with much more impactful tradeoffs, though: it
sacrifices readability, and it’s difficult to extend in the future.
The addition and use of an `is_male` field conveys (unintentionally) an
erroneous view of what gender is; it models femininity as the absence of
masculinity. It probably feels obvious to the author that a `false` value
indicates that the person is female, but that’s an assumption that other
developers (including one’s future self) must intuit; they have to mentally
translate the question “is this person male?” into “what is this person’s
gender?”
In order to better illustrate why that feels lacking in readability, let’s compare it to a couple of other hypothetical systems that use booleans to represent some trait of a person.
Imagine we’re modeling the members of a bicameral legislature, such as the
USA’s Congress or the UK’s Parliament. Unlike gender, these legislative bodies
are truly divided into two alternative groups: the upper house (the USA’s
Senate, the UK’s House of Lords) and the lower house (the USA’s House of
Representatives, the UK’s House of Commons). Encoding a representative as
`is_senator = false` requires one to understand that someone who is “not a
senator” in this system must be a member of some specific other group, and that
that other group is “representatives”. One has to be implicitly aware that
this model is specifically not designed to store data about any other type of
person. And if at some point in the future one wanted to represent (no pun
intended) constituents, staffers, foreign dignitaries, etc., one would need to
cautiously refactor that field and its usages. In such a system, I’d instead
recommend encoding this information as a
`house: 'senate' | 'house_of_representatives'` enum and/or a `position` field
from the outset.
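
As a rough illustration of that recommendation, here is a minimal TypeScript sketch. The `Legislator` interface and the `chamberLabel` helper are hypothetical names invented for this example; only the `house` union comes from the text above.

```typescript
// Hypothetical model: every value of `house` names itself,
// so nobody has to guess what `false` stands for.
type House = 'senate' | 'house_of_representatives';

interface Legislator {
  name: string;
  house: House;
}

// Turns a stored value into a human-readable label.
function chamberLabel(house: House): string {
  switch (house) {
    case 'senate':
      return 'Senate';
    case 'house_of_representatives':
      return 'House of Representatives';
    default: {
      // If a new member is later added to `House` and not handled above,
      // this assignment stops compiling, which points at the code to update.
      const unreachable: never = house;
      throw new Error(`Unhandled house: ${String(unreachable)}`);
    }
  }
}

const example: Legislator = {
  name: 'Example Person',
  house: 'house_of_representatives',
};
console.log(chamberLabel(example.house)); // prints "House of Representatives"
```

If staffers or constituents ever need to be stored, the union (or a separate `position` field) can grow by one value, and the compiler flags each `switch` that has not caught up yet.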
Or imagine that you’re modeling a person’s handedness. You could argue that
`is_left_handed` is an accurate way to represent that, but your suggestion would
probably garner some odd looks. Someone remarks, “What about ambidextrous
people?” The quick-and-dirty workaround is to represent them as
`is_left_handed = NULL`, but that would be very confusing for someone querying
the database: does `NULL` mean an ambidextrous person, a person whose
handedness is unknown, or a person without hands, for whom handedness does
not apply? Perhaps a
`dominant_hand: 'left' | 'right' | 'both' | 'not_applicable' | 'unknown'` enum
is more aligned with what you really want to encode (and more extensible).
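
To make the ambiguity concrete, here is a small TypeScript sketch of migrating from the nullable boolean to the enum. The `LegacyPerson` shape and the `migrateHandedness` helper are hypothetical. Mapping `false` to `'right'` and `null` to `'unknown'` is an assumption of this sketch, chosen because the old column cannot say which meaning of `NULL` was intended.

```typescript
type DominantHand = 'left' | 'right' | 'both' | 'not_applicable' | 'unknown';

// Hypothetical legacy row: null is overloaded with several possible meanings.
interface LegacyPerson {
  is_left_handed: boolean | null;
}

// Converts a legacy value to the explicit enum.
// Assumption: null becomes 'unknown', because the original intent
// (ambidextrous? not applicable? never asked?) cannot be recovered.
function migrateHandedness(legacy: LegacyPerson): DominantHand {
  if (legacy.is_left_handed === null) {
    return 'unknown';
  }
  return legacy.is_left_handed ? 'left' : 'right';
}

console.log(migrateHandedness({ is_left_handed: null })); // prints "unknown"
console.log(migrateHandedness({ is_left_handed: true })); // prints "left"
```

Note that no legacy value can ever produce `'both'` or `'not_applicable'`: that information was never stored, which is the cost of the workaround.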
Does this mean that we should never use booleans in data modeling? I don’t think so. Booleans are appropriate when representing the presence or absence of a trait (e.g. certain HTML attributes, whether a task is billable, whether a node in a graph has been visited) or a state of truth (e.g. caching the outcome of an expensive calculation, whether a person is eligible for a benefit, whether a prerequisite has been satisfied). Gender does not fit into either of those categories though; instead, it’s a choice between multiple alternative options.
So how should we encode gender? Consider the following approaches:
- Don’t store gender. Reevaluate whether this is even necessary or helpful.
- Failing that, permit self-description. Show a text field in which a user can describe their gender in their own words. Voluntary “pronouns” fields on social-media platforms are great examples of this.
- If you really, really must restrict it to finite, predetermined options,
offer several options besides just “female” and “male”. In government and
healthcare systems, I’ve often seen “nonbinary”, “other”, “unspecified”, and
“unknown”. It’s even more welcoming to add specific identities such as
two-spirit, agender, demiboy/demigirl, gender-nonconforming, genderfluid, etc.
  - If possible, survey or interview your userbase to understand which options represent them best.
  - Avoid “transgender man” and “transgender woman” options, unless you also offer “cisgender man” and “cisgender woman”. Being cisgender or transgender is a secondary descriptor of someone’s gender identity, not a gender identity itself.
  - Let users choose multiple options simultaneously.
  - Augment it with a self-description text field, and display that field’s value alongside their chosen predetermined option(s).
  - Allow users to opt out of answering.
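
If you do end up with predetermined options, the suggestions above could be combined roughly as in the following TypeScript sketch. Every name here (`GenderOption`, `GenderResponse`, `displayGender`) and the particular option list are assumptions made for illustration; a real list should come from asking your own users.

```typescript
// Hypothetical, deliberately incomplete option list.
type GenderOption =
  | 'female'
  | 'male'
  | 'nonbinary'
  | 'two_spirit'
  | 'agender'
  | 'genderfluid'
  | 'other';

interface GenderResponse {
  selected: GenderOption[];  // users may choose several options at once
  selfDescription?: string;  // free text in the user's own words
  declined: boolean;         // true when the user opted out of answering
}

// Builds the text shown on a profile: chosen options first,
// followed by the self-description in quotes, if any.
function displayGender(response: GenderResponse): string {
  if (response.declined) {
    return 'Prefer not to say';
  }
  const parts: string[] = [...response.selected];
  const text = response.selfDescription?.trim();
  if (text) {
    parts.push(`"${text}"`);
  }
  return parts.length > 0 ? parts.join(', ') : 'Not provided';
}

console.log(
  displayGender({
    selected: ['nonbinary', 'genderfluid'],
    selfDescription: 'genderqueer',
    declined: false,
  }),
); // prints: nonbinary, genderfluid, "genderqueer"
```

The `declined` flag is a boolean on purpose: opting out is the presence or absence of a single fact, which is the kind of thing booleans model well.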