---
title: A Glossary for Data Engineering
teaser: 'We''ve been doing more data engineering for clients lately. Every time I
  say we''re doing that, I have to explain what it is and what we''re doing! A brief
  glossary to describe data engineering and some common practices and technologies
  in the data engineering ecosystem.

  '
tags: data
author: Joe Ferris
published_on: 2019-01-29
---

We've been doing more data engineering for clients lately. Every time I say
we're doing that, I have to explain what it is and what we're doing!

Here's a brief glossary to describe data engineering and some common practices
and technologies in the data engineering ecosystem.

*Data Engineering*: the task of building infrastructure and process for
ingesting, processing, and aggregating data so that it can be displayed to users
or made available to data scientists.

*Data Science*: the practice of using statistics, machine learning, and other
tools to analyze data to discover trends and truths that can be used to provide
business intelligence.

*Batch Processing*: processing a complete set of data in a single run, rather
than as it arrives. This works well for smaller amounts of data and can be
simpler in terms of engineering and deployment. Batch processes are also useful
for "recomputing the world" when you want to analyze existing data in a new way.
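
As a rough sketch of the idea in Python (the event fields and the `batch_count`
helper are made up for illustration), a batch job receives the whole dataset and
processes it in one go. Running it again with a different key is a small-scale
version of "recomputing the world":

```python
from collections import Counter

def batch_count(events, key):
    """Count events by the given field in one pass over the full dataset."""
    return Counter(event[key] for event in events)

# Hypothetical events, collected in full before the job runs.
events = [
    {"type": "click", "user": "a"},
    {"type": "purchase", "user": "a"},
    {"type": "click", "user": "b"},
]

by_type = batch_count(events, "type")
# Same data, analyzed in a new way by re-running the batch.
by_user = batch_count(events, "user")

print(by_type["click"], by_type["purchase"])
print(by_user["a"], by_user["b"])
```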

*Data Streaming*: processing data in small chunks, one at a time, rather than
processing all data at once. Streaming is necessary for processing infinite
event streams. It's also useful for processing large amounts of data, because it
avoids holding the whole dataset in memory and makes it easier to process data
in a distributed or real-time manner.
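
Here's a minimal sketch of streaming in Python, assuming a made-up
`event_stream` source that never ends. The consumer looks at one event at a
time and keeps only a running count, so memory use stays flat no matter how many
events go by:

```python
from itertools import count, islice

def event_stream():
    """Hypothetical endless source; every fourth event is a purchase."""
    for i in count():
        yield {"type": "purchase" if i % 4 == 0 else "click"}

def count_purchases(stream):
    """Consume events one at a time, keeping only a running total."""
    total = 0
    for event in stream:
        if event["type"] == "purchase":
            total += 1
    return total

# The stream is infinite, so take a finite window to get an answer.
first_thousand = islice(event_stream(), 1000)
purchases = count_purchases(first_thousand)
print(purchases)
```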

*Real-time*: analyzing data and delivering results as events arrive, so that
stream output is always visible. For example, real-time analytics means that the
system is constantly processing events (clicks, purchases, etc.) and displaying
the latest results in a user interface.
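
One way to picture this, as a sketch only: instead of returning a single answer
at the end, the processor publishes an updated snapshot after every event. The
`running_totals` name and event shape are invented for this example, and a real
system would push each snapshot to a dashboard rather than collect them in a
list.

```python
from collections import Counter

def running_totals(stream):
    """Yield the latest totals after each event is processed."""
    totals = Counter()
    for event in stream:
        totals[event["type"]] += 1
        # Copy the totals so each snapshot is a stable view for display.
        yield dict(totals)

# Hypothetical incoming events.
incoming = iter([
    {"type": "click"},
    {"type": "click"},
    {"type": "purchase"},
])

snapshots = list(running_totals(incoming))
for snapshot in snapshots:
    print(snapshot)
```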

*Distributed Data Processing*: breaking up data into partitions so that large
amounts of data can be processed by many machines simultaneously.
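
The sketch below shows the partition, process, and merge steps in Python. It is
only an illustration: a thread pool stands in for the separate machines, and
the `partition`, `count_by_user`, and `distributed_count` helpers are names
made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

def partition(events, n):
    """Split events into n partitions, keeping each user in one partition."""
    parts = [[] for _ in range(n)]
    for event in events:
        index = crc32(event["user"].encode()) % n
        parts[index].append(event)
    return parts

def count_by_user(part):
    """Work done independently on a single partition."""
    return Counter(event["user"] for event in part)

def distributed_count(events, workers=3):
    """Process partitions in parallel, then merge the partial results."""
    parts = partition(events, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_by_user, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical events keyed by user.
events = [{"user": u} for u in ["a", "b", "a", "c", "a", "b"]]
totals = distributed_count(events)
print(totals["a"], totals["b"], totals["c"])
```

Partitioning by user means each worker can count its users without talking to
the others, which is what makes the final merge a simple sum.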

*Cluster*: several computers (or virtual machines) grouped together to perform a
single task.

*Scala*: a programming language (like Ruby, Python, or JavaScript) which is fast
and has become popular for data-focused tasks. Scala runs on the Java Virtual
Machine, which is a high-performance engine for running languages like Scala
that compile into bytecode.

*Type Safety*: languages that provide type safety (such as Scala) check the
program for possible errors when compiling, which allows developers to catch
many types of bugs before the program is deployed.

*Spark*: a distributed computing engine for big data and data streams. Spark is
written mainly in Scala and is used as a framework for both data engineering and
data science.

*Kafka*: a distributed commit log for data streams. Many of the large data
systems deployed today use Kafka.
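
To give a feel for what "commit log" means, here's a toy in-memory version in
Python. This is not Kafka's API; `CommitLog` and `Consumer` are invented for
illustration. The point is that records are only ever appended, each one gets an
offset, and every consumer tracks its own position in the log.

```python
class CommitLog:
    """Toy append-only log; records are never changed once written."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record to the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return every record from the given offset onward."""
        return self._records[offset:]

class Consumer:
    """Reads from a log and remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        records = self.log.read(self.offset)
        self.offset += len(records)
        return records

log = CommitLog()
log.append("click")
log.append("purchase")

analytics = Consumer(log)
billing = Consumer(log)

first = analytics.poll()    # analytics reads everything so far
log.append("refund")
second = analytics.poll()   # analytics only sees the new record
catch_up = billing.poll()   # billing starts from the beginning
print(first, second, catch_up)
```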

Check out the [case study] on our work with Teikametrics to learn more about
what thoughtbot does for clients with data needs!

[case study]: https://thoughtbot.com/work/teikametrics