is_male, or how NOT to encode gender

About half a dozen times over the years, I’ve noticed something perplexing in various software projects: a person’s gender is encoded as a boolean is_male field. The main, surface-level problem with this is one that my colleagues and I have had various opportunities to educate people about: gender is not a binary (humans and our culture are much more beautifully messy than that). However, there’s something else that’s always bothered me about it that I’ve struggled to put into words: even if gender were a binary, encoding it as a boolean would be a poor choice. I’d like to take some time to explore why this is, because it’s applicable as a more general lesson in data modeling.

I believe that some developers choose this approach because they’re aware that a boolean uses less space on disk in their systems, which is a good intention. I’d argue that it comes with much more impactful tradeoffs, though: it sacrifices readability, and it’s difficult to extend in the future. The addition and use of an is_male field (unintentionally) conveys an errant view of what gender is; it models femininity as the absence of masculinity. It probably feels obvious to the author that a false value indicates that the person is female, but that’s an assumption that other developers (including one’s future self) must intuit; they have to mentally translate the question “is this person male?” into “what is this person’s gender?”.

In order to better illustrate why that feels lacking in readability, let’s compare it to a couple of other hypothetical systems that use booleans to represent some trait of a person.

Imagine we’re modeling the members of a bicameral legislature, such as the USA’s Congress or the UK’s Parliament. Unlike gender, these legislative bodies are truly divided into two alternative groups: the upper house (the USA’s Senate, the UK’s House of Lords) and the lower house (the USA’s House of Representatives, the UK’s House of Commons). Encoding a representative as is_senator = false requires readers to understand that someone who is “not a senator” in this system must be a member of some specific other group, and that that other group is “representatives”. They also have to be implicitly aware that this model is specifically not designed to store data about any other type of person. And if at some point in the future they wanted to represent (no pun intended) constituents, staffers, foreign dignitaries, etc., they would need to cautiously refactor that field and its usages. In such a system, I’d instead recommend encoding this information as a house: 'senate' | 'house_of_representatives' enum and/or position field from the outset.
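As a rough sketch of that recommendation in TypeScript (the Legislator type and describeChamber function are hypothetical names chosen purely for illustration), a string-literal union names both alternatives explicitly. If a third variant is added later, the exhaustiveness check should make the compiler point at every place that needs updating:

```typescript
// Hypothetical model: these names are illustrative, not from any real system.
type House = 'senate' | 'house_of_representatives';

interface Legislator {
  name: string;
  house: House;
}

function describeChamber(legislator: Legislator): string {
  switch (legislator.house) {
    case 'senate':
      return 'upper house';
    case 'house_of_representatives':
      return 'lower house';
    default: {
      // Exhaustiveness check: if a new variant is added to House and is not
      // handled above, this assignment no longer type-checks.
      const unhandled: never = legislator.house;
      throw new Error(`Unhandled house: ${String(unhandled)}`);
    }
  }
}
```

Compared with is_senator = false, the value 'house_of_representatives' says what the person is rather than what they are not.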

Or imagine that you’re modeling a person’s handedness. You could argue that is_left_handed is an accurate way to represent that, but your suggestion would probably garner some odd looks. Someone remarks, “What about ambidextrous people?” The quick-and-dirty workaround is to represent them as is_left_handed = NULL, but that would be very confusing for someone querying the database: is NULL an ambidextrous person; are they a person whose handedness is unknown; are they a person without hands for whom handedness does not apply? Perhaps a dominant_hand: 'left' | 'right' | 'both' | 'not_applicable' | 'unknown' enum is more aligned with what you really want to encode (and more extensible).
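Here is a minimal TypeScript sketch of that enum. The names DominantHand, HAND_LABELS, and fromLegacyFlag are assumptions made up for this example. The migration helper also shows why the NULL workaround is lossy: a nullable boolean cannot say which of the three meanings was intended, so the most honest mapping is probably 'unknown'.

```typescript
type DominantHand = 'left' | 'right' | 'both' | 'not_applicable' | 'unknown';

// Record<DominantHand, string> forces a label for every variant; adding a
// new variant without a label is a compile-time error.
const HAND_LABELS: Record<DominantHand, string> = {
  left: 'Left-handed',
  right: 'Right-handed',
  both: 'Ambidextrous',
  not_applicable: 'Not applicable',
  unknown: 'Unknown',
};

// Hypothetical migration from a legacy is_left_handed column.
// NULL might have meant ambidextrous, unknown, or not applicable; that
// distinction was never stored, so it is mapped to 'unknown' here.
function fromLegacyFlag(isLeftHanded: boolean | null): DominantHand {
  if (isLeftHanded === null) return 'unknown';
  return isLeftHanded ? 'left' : 'right';
}
```

Each case now has its own name, so someone querying the data no longer has to guess what a missing value means.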

Does this mean that we should never use booleans in data modeling? I don’t think so. Booleans are appropriate when representing the presence or absence of a trait (e.g. certain HTML attributes, whether a task is billable, whether a node in a graph has been visited) or a state of truth (e.g. caching the outcome of an expensive calculation, whether a person is eligible for a benefit, whether a prerequisite has been satisfied). Gender does not fit into either of those categories though; instead, it’s a choice between multiple alternative options.


So how should we encode gender? Consider the following approaches:

  • Don’t store gender. Reevaluate whether this is even necessary or helpful.
  • Failing that, permit self-description. Show a text field in which a user can describe their gender in their own words. Voluntary “pronouns” fields on social-media platforms are great examples of this.
  • If you really, really must restrict it to finite, predetermined options, offer several options besides just “female” and “male”. In government and healthcare systems, I’ve often seen “nonbinary”, “other”, “unspecified”, and “unknown”. It’s even more welcoming to add specific identities such as two-spirit, agender, demiboy/demigirl, gender-nonconforming, genderfluid, etc.
    • If possible, survey or interview your userbase to understand which options represent them best.
    • Avoid “transgender man” and “transgender woman” options, unless you also offer “cisgender man” and “cisgender woman”. Being cisgender or transgender is a secondary descriptor of someone’s gender identity, not a gender identity itself.
    • Let users choose multiple options simultaneously.
    • Augment it with a self-description text field, and display that field’s value alongside their chosen predetermined option(s).
    • Allow users to opt out of answering.
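For the third approach, here is one possible shape in TypeScript. It is only a sketch: the option list, the GenderResponse type, and the displayGender function are all hypothetical, and a real option list should come from talking to your users. It combines multiple selections, an optional self-description, and an explicit opt-out.

```typescript
// Hypothetical option list, for illustration only.
const GENDER_OPTIONS = [
  'female',
  'male',
  'nonbinary',
  'two_spirit',
  'agender',
  'genderfluid',
] as const;

type GenderOption = (typeof GENDER_OPTIONS)[number];

// Opting out is its own state rather than an empty answer or a NULL.
type GenderResponse =
  | { status: 'declined' }
  | { status: 'answered'; selected: GenderOption[]; selfDescription?: string };

// Shows the chosen options alongside the user's own words. Raw option
// identifiers are used here; a real UI would map them to display labels.
function displayGender(response: GenderResponse): string {
  if (response.status === 'declined') return 'Prefer not to say';
  const parts: string[] = [...response.selected];
  const text = response.selfDescription?.trim();
  if (text) parts.push(text);
  return parts.length > 0 ? parts.join(', ') : 'Not specified';
}
```

Modeling "declined" as a separate variant avoids the ambiguity discussed earlier, where a missing value could mean several different things.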