On a client project recently, I was putting fake data together as a first pass for seeding a UI in Elm.
The domain model looked fairly straightforward:
```elm
module Data.Employee
    exposing
        ( Employee
        , Name(..)
        , EmployeeId(..)
        )


type alias Employee =
    { id : EmployeeId
    , fullName : Name
    }


type EmployeeId
    = EmployeeId String


type Name
    = Name String
```
In my module to generate fake data, I built an employee function:

```elm
employee : Employee
employee =
    { id = EmployeeId "A-1234-jane-doe"
    , fullName = Name "Jane Doe"
    }
```
With this (and other data) wrapped up, I submitted a pull request to gather
feedback. Another developer commented that the value of the EmployeeId
didn’t reflect reality (employee IDs are four- or five-digit codes, which may
include leading zeroes).
Types without Reality
While wrapping the underlying String in an EmployeeId prevents argument-order bugs, it doesn't reflect real-world usage and possible values.
While there are only 110,000 valid four- and five-digit employee IDs, our data model for an employee ID uses the underlying type of String, which can represent an infinite number of values. Our data model does not reflect reality. By reducing the number of possible values captured in a type, we make it less likely that an incorrect value sneaks in.
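As a quick illustration of the protection the wrapper does provide, consider a hypothetical constructor function (makeEmployee is my name, not from the project):

```elm
-- With bare Strings, both arguments would have the same type, so swapping
-- them still compiles and the bug slips through. With the wrappers, the
-- same swap is a compile error:

makeEmployee : EmployeeId -> Name -> Employee
makeEmployee id fullName =
    { id = id, fullName = fullName }

-- makeEmployee (Name "Jane Doe") (EmployeeId "A-1234-jane-doe")
--     does not compile: the arguments are in the wrong order
```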
Dissecting Types and Value Surface Area
String
The String type (independent of any memory or storage limitations) can represent infinitely many values. Values like the one I submitted in my pull request (A-1234-jane-doe) have a type of String, but the type is too permissive.
For example, "12357" is valid, but "made-up-id", "", and "-----" are not. All have the type String.
List Int
A List Int type better describes that we expect a list of numbers, but such a list can be any length, and each element can be any Int.
For example, [1, 2, 3, 4] is correct, but [1, 2, 3, 4, 5, 6, 7, 8, 9] is not. Both have the type List Int.
(Int, Int, Int, Int) and (Int, Int, Int, Int, Int)
This is closer to what we'd actually expect; there are explicit, arbitrary limits on the number of digits. However, valid Ints include negative numbers and numbers greater than 9.
For example, (1, 2, 3, 4) is correct, but (-100, 15, 2, 295001) is not. Both have the type (Int, Int, Int, Int).
Constructor Validation
Let’s take a quick tangent and discuss ways to guarantee correct values even with less-than-ideal types.
Given the type

```elm
type EmployeeId
    = EmployeeId String
```

instead of exposing the EmployeeId data constructor (the function of type String -> EmployeeId), we can define a function to build an employee ID that might fail:
```elm
parseEmployeeId : String -> Result String EmployeeId
parseEmployeeId value =
    case intResults value of
        [ Just d1, Just d2, Just d3, Just d4 ] ->
            Ok <| buildEmployeeIdFromSafeInts [ d1, d2, d3, d4 ]

        [ Just d1, Just d2, Just d3, Just d4, Just d5 ] ->
            Ok <| buildEmployeeIdFromSafeInts [ d1, d2, d3, d4, d5 ]

        _ ->
            Err "Employee ID is not in the correct format"


-- Split the string into one-character strings and try to parse each as an
-- Int (String.toInt returns a Maybe Int in Elm 0.19).
intResults : String -> List (Maybe Int)
intResults =
    List.map String.toInt << String.split ""


-- Safe because every digit was validated before this is called.
buildEmployeeIdFromSafeInts : List Int -> EmployeeId
buildEmployeeIdFromSafeInts =
    EmployeeId << String.concat << List.map String.fromInt
```
With some property testing, we could achieve a high level of confidence that this function protects the system from invalid data; coupled with the fact that we don't expose EmployeeId : String -> EmployeeId, we're all but guaranteed that the system won't be fed bad data.
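A quick sketch of how calling code sees this function (the example values are mine):

```elm
-- A valid five-digit ID; the leading zero is preserved:
validId : Result String EmployeeId
validId =
    parseEmployeeId "01234"
    -- Ok (EmployeeId "01234")


-- The value from my original pull request is rejected:
invalidId : Result String EmployeeId
invalidId =
    parseEmployeeId "A-1234-jane-doe"
    -- Err "Employee ID is not in the correct format"
```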
This safety is provided at runtime instead of compile-time, however; the underlying data (the value of type String) will fulfill the business requirements but doesn't help clarify what those requirements are. From a communication perspective, readers of our code can't understand the business requirements behind an EmployeeId only by reading the type, because it's still wrapping a String.
A Long-Winded (and Theoretically “Correct”) Type
How can we model EmployeeId to reflect reality?
```elm
type Digit
    = D0
    | D1
    | D2
    | D3
    | D4
    | D5
    | D6
    | D7
    | D8
    | D9


type EmployeeId
    = FourDigitEmployeeId Digit Digit Digit Digit
    | FiveDigitEmployeeId Digit Digit Digit Digit Digit
```
This greatly reduces the number of values that can represent employee IDs: now exactly 110,000 (10^4 four-digit IDs plus 10^5 five-digit IDs), where previous types like String and List Int could represent infinitely many values. More importantly, the type enforces that the value represented is valid.
With a couple of boilerplate functions:
```elm
digitFromChar : Char -> Result String Digit
digitFromChar char =
    case char of
        '0' ->
            Ok D0
        '1' ->
            Ok D1
        '2' ->
            Ok D2
        '3' ->
            Ok D3
        '4' ->
            Ok D4
        '5' ->
            Ok D5
        '6' ->
            Ok D6
        '7' ->
            Ok D7
        '8' ->
            Ok D8
        '9' ->
            Ok D9
        v ->
            Err <| String.fromChar v


parseDigitsFromString : String -> List (Result String Digit)
parseDigitsFromString =
    List.map digitFromChar << String.toList
```
We can now build out our same constructor function to parse values and generate correct EmployeeIds:
```elm
parseEmployeeId : String -> Result String EmployeeId
parseEmployeeId value =
    case parseDigitsFromString value of
        [ Ok d1, Ok d2, Ok d3, Ok d4 ] ->
            Ok <| FourDigitEmployeeId d1 d2 d3 d4

        [ Ok d1, Ok d2, Ok d3, Ok d4, Ok d5 ] ->
            Ok <| FiveDigitEmployeeId d1 d2 d3 d4 d5

        _ ->
            Err "Employee ID is not in the correct format"
```
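To get back to a displayable value, we also need the inverse. Here's a sketch; digitToChar and employeeIdToString are my names, not from the original project:

```elm
digitToChar : Digit -> Char
digitToChar digit =
    case digit of
        D0 -> '0'
        D1 -> '1'
        D2 -> '2'
        D3 -> '3'
        D4 -> '4'
        D5 -> '5'
        D6 -> '6'
        D7 -> '7'
        D8 -> '8'
        D9 -> '9'


employeeIdToString : EmployeeId -> String
employeeIdToString employeeId =
    case employeeId of
        FourDigitEmployeeId d1 d2 d3 d4 ->
            String.fromList <| List.map digitToChar [ d1, d2, d3, d4 ]

        FiveDigitEmployeeId d1 d2 d3 d4 d5 ->
            String.fromList <| List.map digitToChar [ d1, d2, d3, d4, d5 ]
```

Round-tripping gives a cheap sanity check: Result.map employeeIdToString (parseEmployeeId "0042") should yield Ok "0042".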
This safety is now provided at compile-time. In the first example, the data is correct because everything goes through parseEmployeeId; in this example, we need parseEmployeeId because the type can only hold correct data. The relationship is flipped: in the first example, the need to parse is the cause of correctness, while in the second, the need to parse is a consequence of correctness.
Practical Application
Is this stricter approach viable? Useful? Flexible? It depends on the application, the likelihood of the domain staying "correct", and the risks of introducing values where the types are correct but the data isn't.
I'd avoid this approach in cases where the domain is evolving rapidly or where the data structure requirements are less rigid, instead relying on the "newtype" technique of wrapping primitives (e.g. type Example = Example String).
The benefits of this approach are twofold: types introduce improved safety when working with data, and we're able to communicate business rules. Improved safety results in a more accurate system, assuming the types properly encode the structures. Communicating business rules means other developers understand the possible values and states the information can exist in, allowing for improved reasoning across the codebase.