Shaping Values with Types

Josh Clayton

On a client project recently, I was putting fake data together as a first pass for seeding a UI in Elm.

The domain model looked fairly straightforward:

module Data.Employee
    exposing
        ( Employee
        , Name(..)
        , EmployeeId(..)
        )

type alias Employee =
    { id : EmployeeId
    , fullName : Name
    }

type EmployeeId
    = EmployeeId String

type Name
    = Name String

In my module to generate fake data, I built an employee function:

employee : Employee
employee =
    { id = EmployeeId "A-1234-jane-doe"
    , fullName = Name "Jane Doe"
    }

With this (and other data) wrapped up, I submitted a pull request to gather feedback. Another developer commented that the value of the EmployeeId didn’t reflect reality (employee IDs are four- or five-digit codes, which may include leading zeroes).

Types without Reality

While wrapping the underlying String in an EmployeeId prevents argument-order bugs, it doesn’t reflect real-world usage or the values that are actually possible.
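To make the argument-order protection concrete, here is a minimal sketch. The `badge` helper and the module name are hypothetical and not part of the project; the wrapper types are repeated so the example stands alone.

```elm
module ArgumentOrder exposing (..)

-- Same wrappers as in Data.Employee, repeated so the sketch is self-contained.


type EmployeeId
    = EmployeeId String


type Name
    = Name String



-- Hypothetical helper: builds a badge label from an ID and a name.


badge : EmployeeId -> Name -> String
badge (EmployeeId id) (Name fullName) =
    id ++ ": " ++ fullName


label : String
label =
    badge (EmployeeId "01234") (Name "Jane Doe")



-- Swapping the arguments does not compile, because a Name is not an EmployeeId:
-- badge (Name "Jane Doe") (EmployeeId "01234")
```

Had both arguments been plain Strings, the swapped call would have compiled and produced a wrong label.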

While there are only 110,000 valid four- and five-digit employee IDs, our data model for an employee ID uses the underlying type of String, which can represent an infinite number of values. Our data model does not reflect reality. By reducing the number of possible values captured in a type, it’s less likely that an incorrect value sneaks in.
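The 110,000 figure follows from counting digit positions, since leading zeroes are allowed. A quick sketch of the arithmetic (the names here are illustrative only):

```elm
module IdCount exposing (..)

-- Four-digit IDs: 10 choices for each of 4 positions.


fourDigitIds : Int
fourDigitIds =
    10 ^ 4



-- Five-digit IDs: 10 choices for each of 5 positions.


fiveDigitIds : Int
fiveDigitIds =
    10 ^ 5



-- 10,000 + 100,000 = 110,000


validIds : Int
validIds =
    fourDigitIds + fiveDigitIds
```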

Dissecting Types and Value Surface Area


String

The String type (independent of any memory or storage limitations) can hold any number of characters, so it has infinitely many possible values. Values like the one I submitted in my pull request (A-1234-jane-doe) have a type of String, but the type is too permissive.

For example, "12357" is valid, but "made-up-id", "", and "-----" are not. All have the type String.
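As a rough illustration, every one of these strings can be wrapped in the String-backed EmployeeId without complaint from the compiler. The `candidates` list and module name are made up for this sketch:

```elm
module Permissive exposing (..)


type EmployeeId
    = EmployeeId String



-- Every entry compiles, although only the first follows the real ID format.


candidates : List EmployeeId
candidates =
    [ EmployeeId "12357"
    , EmployeeId "made-up-id"
    , EmployeeId ""
    , EmployeeId "-----"
    ]
```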

List Int

A List Int type better describes that we expect to have a list of numbers, but this list can also be infinitely long.

For example, [1, 2, 3, 4] is correct, but [1, 2, 3, 4, 5, 6, 7, 8, 9] is not. Both have the type List Int.

(Int, Int, Int, Int) and (Int, Int, Int, Int, Int)

This is closer to what we’d actually expect; the tuples put an explicit limit on the number of digits. However, nothing constrains the digits themselves: valid Ints include negative numbers and numbers greater than 9.

For example, (1, 2, 3, 4) is correct, but (-100, 15, 2, 295001) is not. Both have the type (Int, Int, Int, Int).

Constructor Validation

Let’s take a quick tangent and discuss ways to guarantee correct values even with less-than-ideal types.

With the type

type EmployeeId
    = EmployeeId String

Instead of exposing the EmployeeId data constructor (the function of type String -> EmployeeId), we can define a function to build an employee ID that might fail:

parseEmployeeId : String -> Result String EmployeeId
parseEmployeeId value =
    case intResults value of
        [ Ok d1, Ok d2, Ok d3, Ok d4 ] ->
            Ok <| buildEmployeeIdFromSafeInts [ d1, d2, d3, d4 ]

        [ Ok d1, Ok d2, Ok d3, Ok d4, Ok d5 ] ->
            Ok <| buildEmployeeIdFromSafeInts [ d1, d2, d3, d4, d5 ]

        _ ->
            Err "Employee ID is not in the correct format"

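-- Split the string into one-character pieces and try to read each as an Int.
intResults : String -> List (Result String Int)
intResults =
    List.map String.toInt << String.split ""

-- Turn the already-validated digits back into the wrapped String.
buildEmployeeIdFromSafeInts : List Int -> EmployeeId
buildEmployeeIdFromSafeInts =
    EmployeeId << String.concat << List.map toString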

With some property testing, we could achieve a high level of confidence that this function protects the system from invalid data; coupled with the fact that we don’t expose EmployeeId : String -> EmployeeId, we’re all but guaranteed that the system won’t be fed bad data.
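One way to hide the constructor is to expose the type without `(..)`. The sketch below assumes a hypothetical module name and, to stay short, swaps in a simpler check (a length test plus `String.all Char.isDigit`) for the digit-by-digit parsing above; `employeeIdToString` is also an addition for illustration.

```elm
module OpaqueEmployeeId exposing (EmployeeId, employeeIdToString, parseEmployeeId)

import Char



-- The type is exposed without (..), so other modules cannot call the
-- EmployeeId constructor and have to go through parseEmployeeId.


type EmployeeId
    = EmployeeId String



-- Simplified validation (an assumption for this sketch): four or five
-- characters, all of them digits.


parseEmployeeId : String -> Result String EmployeeId
parseEmployeeId value =
    let
        len =
            String.length value
    in
    if (len == 4 || len == 5) && String.all Char.isDigit value then
        Ok (EmployeeId value)

    else
        Err "Employee ID is not in the correct format"



-- Read-only access to the wrapped value for callers outside the module.


employeeIdToString : EmployeeId -> String
employeeIdToString (EmployeeId value) =
    value
```

Code outside this module can still pass EmployeeId values around and pattern match on the Result, but the only way to obtain one is a successful parse.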

This safety is provided at runtime instead of compile-time, however; the underlying data (the value of type String) will fulfill the business requirements but doesn’t help clarify what those requirements are. From a communication perspective, readers of our code can’t understand the business requirements behind an EmployeeId only by reading the type because it’s still wrapping a String.

A Long-Winded (and Theoretically “Correct”) Type

How can we model EmployeeId to reflect reality?

type Digit
    = D0
    | D1
    | D2
    | D3
    | D4
    | D5
    | D6
    | D7
    | D8
    | D9

type EmployeeId
    = FourDigitEmployeeId Digit Digit Digit Digit
    | FiveDigitEmployeeId Digit Digit Digit Digit Digit

This greatly reduces the number of values that can represent an employee ID: there are now exactly 110,000, where previous types like String and List Int both allowed infinitely many. More importantly, the type enforces that the value represented is valid.
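To give a feel for working with this type, here is a sketch of turning an ID back into a String and of writing a fake ID by hand. `digitToChar`, `employeeIdToString`, and `sampleId` are hypothetical helpers, not code from the project; the types are repeated so the block stands alone.

```elm
module StrictEmployeeId exposing (..)


type Digit
    = D0
    | D1
    | D2
    | D3
    | D4
    | D5
    | D6
    | D7
    | D8
    | D9


type EmployeeId
    = FourDigitEmployeeId Digit Digit Digit Digit
    | FiveDigitEmployeeId Digit Digit Digit Digit Digit



-- The compiler checks that all ten digits are handled; no fallback branch needed.


digitToChar : Digit -> Char
digitToChar digit =
    case digit of
        D0 -> '0'
        D1 -> '1'
        D2 -> '2'
        D3 -> '3'
        D4 -> '4'
        D5 -> '5'
        D6 -> '6'
        D7 -> '7'
        D8 -> '8'
        D9 -> '9'


employeeIdToString : EmployeeId -> String
employeeIdToString employeeId =
    case employeeId of
        FourDigitEmployeeId a b c d ->
            String.fromList (List.map digitToChar [ a, b, c, d ])

        FiveDigitEmployeeId a b c d e ->
            String.fromList (List.map digitToChar [ a, b, c, d, e ])



-- Fake data written by hand can only ever be four or five digits.


sampleId : EmployeeId
sampleId =
    FourDigitEmployeeId D0 D0 D4 D2
```

With this shape, a value like A-1234-jane-doe has no way of being written down as an EmployeeId in the first place.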

With a couple of boilerplate functions:

digitFromChar : Char -> Result String Digit
digitFromChar char =
    case char of
        '0' ->
            Ok D0

        '1' ->
            Ok D1

        '2' ->
            Ok D2

        '3' ->
            Ok D3

        '4' ->
            Ok D4

        '5' ->
            Ok D5

        '6' ->
            Ok D6

        '7' ->
            Ok D7

        '8' ->
            Ok D8

        '9' ->
            Ok D9

        v ->
            Err <| String.fromChar v

-- Convert each character separately, so one bad character yields one Err.
parseDigitsFromString : String -> List (Result String Digit)
parseDigitsFromString =
    List.map digitFromChar << String.toList

We can now build out our same constructor function to parse values and generate correct EmployeeIds:

parseEmployeeId : String -> Result String EmployeeId
parseEmployeeId value =
    case parseDigitsFromString value of
        [ Ok d1, Ok d2, Ok d3, Ok d4 ] ->
            Ok <| FourDigitEmployeeId d1 d2 d3 d4

        [ Ok d1, Ok d2, Ok d3, Ok d4, Ok d5 ] ->
            Ok <| FiveDigitEmployeeId d1 d2 d3 d4 d5

        _ ->
            Err "Employee ID is not in the correct format"

This safety is now provided at compile-time. In the first example, the data is correct because of parseEmployeeId, while in this example, we need parseEmployeeId because the data is correct. The relationship is flipped: the need to parse is the cause of correctness in the first example, while in the second, the need to parse is caused by correctness.

Practical Application

Is this stricter approach viable? Useful? Flexible? It depends on the application, the likelihood of the domain being “correct”, and the risks of introducing values where the types are correct but the data isn’t.

I’d avoid this approach in cases where the domain is evolving rapidly or when there are less rigid data structure requirements, instead relying on the “newtype” technique of wrapping primitives (e.g. type Example = Example String).

The benefits of this approach are two-fold: types introduce improved safety when working with data and we’re able to communicate business rules. Improved safety results in a more accurate system, assuming types properly encode the structures. Communicating business rules means other developers understand possible values and states the information can exist in, allowing for improved reasoning across the codebase.