Reference

Data glossary

The words that come up in every data job ad and every meeting. One line each, plain English, grouped by what they're actually for. Bookmark this page — you'll come back to it.

01 Category

Moving & loading data

The words that show up around getting data from A to B — the plumbing layer of every data team.

Data Source

Anywhere the raw data originates — an app database, a CSV export, a SaaS API (Stripe, HubSpot, Salesforce), a sensor feed. Every pipeline names its sources explicitly so analysts know what they're looking at.

Data Migration

Moving data from one system, format or location to another (e.g. an old Oracle database into a new cloud warehouse). Typically a one-off project with a clear end, not an ongoing process.

Data Ingestion

Collecting and importing data from various sources into a storage system: the first step of any pipeline. Unlike a migration, it runs repeatedly (on a schedule or continuously), not once.

Ingress

Data coming IN to a system or cloud account — uploads to a warehouse, API calls landing on your service, files dropped into storage. Cloud providers usually don't charge for ingress, so getting data in is the cheap part.

Egress

Data going OUT of a system or cloud account — downloads, query results sent to a dashboard, files copied to another region or provider. Cloud providers DO charge for egress, and it's often the biggest surprise line item on a data team's bill.

ETL

Extract, Transform, Load. The classic pattern: pull data out of source systems, reshape it, write it into a warehouse. Transformation happens on the way in.

ELT

Extract, Load, Transform. The modern cloud variant: dump the raw data straight into the warehouse, then transform it there using SQL. Often cheaper and more flexible than ETL, because the raw data stays available to re-transform when requirements change.
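
To make the ETL/ELT difference concrete, here is a minimal sketch. It assumes an in-memory SQLite database standing in for the warehouse, and the table names and sample rows are made up for illustration. The only thing that changes between the two halves is where the clean-up happens.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
raw_orders = [("A1", " 19.50 "), ("A2", "5.00"), ("A3", " 12.25")]

warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse

# ETL: transform in application code first, then load the cleaned rows.
cleaned = [(order_id, float(amount.strip())) for order_id, amount in raw_orders]
warehouse.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
warehouse.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)

# ELT: load the raw rows untouched, then transform inside the warehouse with SQL.
warehouse.execute("CREATE TABLE orders_raw (order_id TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO orders_raw VALUES (?, ?)", raw_orders)
warehouse.execute(
    "CREATE TABLE orders_elt AS "
    "SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount FROM orders_raw"
)

etl_rows = warehouse.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = warehouse.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
```

Both routes end with the same cleaned table. In the ELT route the untouched `orders_raw` table is still there if the transformation needs to change later.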

Data Pipeline

A series of automated steps that move and reshape data from source to destination. Most production pipelines run on a schedule (e.g. nightly) or stream events in real time.

Batch vs Streaming

Batch = run periodically over a chunk of accumulated data (the nightly job). Streaming = process each event the moment it lands (fraud detection, live dashboards).
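
A small sketch of the same sum done both ways, assuming a hypothetical list of payment amounts. Real streaming systems read from a queue or event log; a Python generator is used here only to show the shape of the idea.

```python
from typing import Iterable, Iterator


def batch_total(events: list[float]) -> float:
    """Batch: wait for the whole chunk, then process it in one go."""
    return sum(events)


def streaming_totals(events: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as each event arrives."""
    running = 0.0
    for amount in events:
        running += amount
        yield running


payments = [10.0, 2.5, 7.5]  # hypothetical payment amounts
nightly_total = batch_total(payments)
live_totals = list(streaming_totals(payments))
```

The batch job produces one answer at the end; the streaming version has an up-to-date answer after every event.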

02 Category

Shape & structure

How data is organised once it's in a system — tables, columns, keys, schemas, and the stores that hold them.

Database

A persistent, organised store of data that supports queries, updates and concurrent access. Usually means a relational database (SQL Server, PostgreSQL, MySQL) — the workhorse behind most apps.

Data Warehouse

A central database optimised for analytics, not the live app. Holds cleaned, modelled data from many sources so the business can ask big questions fast. Examples: Snowflake, BigQuery, Azure Synapse.

Data Lake

A cheap, schema-light store for raw files of any shape — CSV, JSON, Parquet, images. Dump first, structure later. Often the landing zone before data is shaped and moved into the warehouse.

Medallion Architecture

A Databricks-popularised layering pattern: Bronze (raw landing), Silver (cleaned & joined), Gold (business-ready aggregates). Each layer is a quality upgrade on the last.

Schema

The blueprint of a database or table: what columns exist, what type they hold (number, text, date), and how they relate to each other.

Table

A grid of rows (records) and columns (fields). The smallest unit you'll query with SQL. A database is mostly a collection of related tables.

Row

A single record in a table — one customer, one order, one transaction. All the values across one row describe the same thing. Sometimes called a tuple or record.

Column

A vertical strip in a table holding one piece of information across every row — email, price, country. Each column has a single data type (number, text, date) the database enforces.

View

A saved SQL query that behaves like a table you can query. The data lives in the underlying tables; the view is a window onto it — handy for hiding complexity, locking down columns, or giving stakeholders a consistent named source.
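
As an illustration, the sketch below creates a view that hides an email column. It assumes an in-memory SQLite database, and the table, view and sample names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT, country TEXT)"
)
db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Asha", "asha@example.com", "UK"), (2, "Ben", "ben@example.com", "FR")],
)

# The view stores the query, not the data, and leaves out the email column.
db.execute("CREATE VIEW customers_public AS SELECT id, name, country FROM customers")
view_rows = db.execute("SELECT * FROM customers_public ORDER BY id").fetchall()

# A new row in the underlying table shows up in the view automatically.
db.execute("INSERT INTO customers VALUES (3, 'Chloe', 'chloe@example.com', 'UK')")
view_count = db.execute("SELECT COUNT(*) FROM customers_public").fetchone()[0]
```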

Primary Key

The column (or combination of columns) that uniquely identifies each row in a table. Every customer has one customer_id; no two rows share it.

Foreign Key

A column in one table that points at the primary key of another, linking the two. orders.customer_id pointing at customers.id is the standard example.
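
Here is a rough sketch of both kinds of key doing their job, assuming an in-memory SQLite database and made-up rows. The `try_insert` helper is hypothetical; it just reports whether the database accepted a statement.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL
    )
    """
)
db.execute("INSERT INTO customers VALUES (1, 'Asha')")
db.execute("INSERT INTO orders VALUES (100, 1, 25.0)")  # points at a real customer


def try_insert(sql: str) -> str:
    """Run one statement and report whether the database accepted it."""
    try:
        db.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


duplicate_pk = try_insert("INSERT INTO customers VALUES (1, 'Ben')")     # id 1 is taken
orphan_order = try_insert("INSERT INTO orders VALUES (101, 999, 10.0)")  # no customer 999
```

Both inserts are rejected: the first breaks the primary key's uniqueness, the second points at a customer that does not exist.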

Structured / Semi-structured / Unstructured

Structured = neat rows and columns (SQL tables). Semi-structured = JSON, XML, log files. Unstructured = images, audio, free-text emails. Different tools handle each.

Data Model

The overall design of how your tables relate to each other — which entity owns which, which links which. A good model makes future analysis cheap; a bad one makes everything hurt.

Relational Model

The mainstream way of organising data since the 1970s: rows and columns in tables, linked by keys. SQL is the query language built for it. PostgreSQL, SQL Server, MySQL and Oracle are all relational. The alternative shapes (document, key-value, graph) are grouped under the label "NoSQL".

Dimensional Modelling

The warehouse-design pattern (Kimball-style) that splits data into fact tables (the events) and dimension tables (the context). Optimised for analytics — fast aggregations across the events, easy slicing by the context. The opposite of the highly-normalised model an app database uses.

Fact Table

The events table in a dimensional model — one row per measurable thing that happened: an order line, a transaction, an appointment. Holds the numbers you sum and the keys that point at the dimensions (when, what, who, where).

Dimension Table

The context table in a dimensional model — one row per entity you slice the facts by: a customer, a product, a date, a department. Holds the descriptive attributes (name, category, region) that filters and group-bys read.

Surrogate Key

A meaningless integer (or generated ID) used as the primary key of a dimension row, instead of the real-world business key (email, NHS number, SKU). Insulates the warehouse from upstream renames and lets the same conceptual entity be tracked across history even if the business key changes.
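
The fact table, dimension table and surrogate key fit together as in the sketch below. It assumes an in-memory SQLite database; the `dim_product` and `fact_sales` names and all the rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,  -- surrogate key: meaningless integer
        sku TEXT,                         -- business key from the source system
        category TEXT                     -- descriptive attribute to slice by
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,                 -- the numbers you sum
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'SKU-001', 'Books'), (2, 'SKU-002', 'Games');
    INSERT INTO fact_sales VALUES (1, 2, 20.0), (1, 1, 10.0), (2, 1, 45.0);
    """
)

# Sum the facts, slice by a dimension attribute.
revenue_by_category = db.execute(
    """
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
```

Because the fact rows carry only `product_key`, a renamed SKU changes one row in the dimension and nothing in the facts.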

03 Category

Transforming & analysing

What happens once the data is in place — reshaping, summarising, asking it questions.

Data Transformation

Converting, structuring or reshaping data so it fits the format the next step (analysis, dashboard) needs. Renaming columns, splitting dates, joining lookups — all transformation.

Aggregation

Rolling up many rows into a summary: SUM, COUNT or AVG (average) by some grouping. "Total revenue by region" is an aggregation.
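
"Total revenue by region" might look like this in practice. The sketch assumes an in-memory SQLite database and a hypothetical `sales` table with three made-up rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("South", 40.0), ("North", 60.0)],
)

# Three detail rows roll up into one summary row per region.
revenue_by_region = db.execute(
    """
    SELECT region, SUM(revenue), COUNT(*), AVG(revenue)
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()
```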

Join

Combining two tables based on a matching column. INNER JOIN keeps only matches; LEFT JOIN keeps every row from the left table even if the right has none.
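
The difference between the two join kinds is easiest to see when one customer has no orders. A minimal sketch, assuming an in-memory SQLite database and hypothetical `customers` and `orders` tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (100, 1, 25.0);  -- Ben has no orders
    """
)

inner_rows = db.execute(
    """
    SELECT c.name, o.total
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()

left_rows = db.execute(
    """
    SELECT c.name, o.total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()
```

The inner join drops Ben entirely; the left join keeps him with an empty (NULL, shown as `None` in Python) order total.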

Merge

Combining records from two sources by matching key — and deciding row by row what to do: INSERT a new row, UPDATE the existing one, or DELETE if it's gone. In SQL it's the MERGE statement; in Power Query it's the merge step (with join kinds Left/Inner/Anti). Common for syncing a source into a warehouse.
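
The row-by-row decision can be sketched in plain Python. This is not the SQL MERGE statement itself, only an illustration of the logic it applies; the `plan_merge` function and the sample rows are hypothetical.

```python
def plan_merge(source: dict, target: dict) -> dict:
    """Compare two key -> row mappings and decide what a merge would do."""
    return {
        # In the source but not the target: new row.
        "insert": sorted(key for key in source if key not in target),
        # In both, but the values differ: existing row needs updating.
        "update": sorted(
            key for key in source if key in target and source[key] != target[key]
        ),
        # In the target but gone from the source: row to remove.
        "delete": sorted(key for key in target if key not in source),
    }


source_rows = {1: "asha@example.com", 2: "ben@new.example.com", 4: "dee@example.com"}
target_rows = {1: "asha@example.com", 2: "ben@example.com", 3: "cy@example.com"}
merge_plan = plan_merge(source_rows, target_rows)
```

Row 1 is identical on both sides, so it appears in none of the three lists.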

Upsert

Short for "update OR insert". Write a row by its key: if it's there, update it; if it's not, insert a new one. The atomic alternative to a SELECT-then-INSERT/UPDATE in application code, and a common loading pattern for warehouse dimensions that overwrite changed attributes in place (Type 1 slowly changing dimensions).
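
One way to write an upsert is SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` clause, sketched below. It assumes SQLite 3.24 or newer (bundled with recent Python releases); other databases spell it differently, and the `upsert_customer` helper and sample data are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")


def upsert_customer(customer_id: int, email: str) -> None:
    """Insert the row, or update the email if the key already exists."""
    db.execute(
        """
        INSERT INTO customers (customer_id, email) VALUES (?, ?)
        ON CONFLICT(customer_id) DO UPDATE SET email = excluded.email
        """,
        (customer_id, email),
    )


upsert_customer(1, "old@example.com")  # key is new: insert
upsert_customer(1, "new@example.com")  # key exists: update
upsert_customer(2, "ben@example.com")  # key is new: insert

customers = db.execute(
    "SELECT customer_id, email FROM customers ORDER BY customer_id"
).fetchall()
```

Three calls leave two rows, because the second call updated customer 1 instead of adding a duplicate.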

Normalisation

Breaking data into multiple linked tables to avoid storing the same fact twice. Cleaner storage; more joins at query time.

Denormalisation

The opposite: flattening data into fewer, wider tables. Quicker reads for dashboards; harder to keep consistent on writes.

OLTP vs OLAP

OLTP databases (the app database) optimise for fast individual writes — "add this order". OLAP databases (the warehouse) optimise for big read queries — "sum revenue across 5 years".

Data Visualisation

Turning numbers into charts, maps and dashboards so people can spot trends at a glance. Power BI, Tableau and Looker are common industry tools; matplotlib and ggplot2 cover code-first work.

Data Insights

The "so what" you extract from the data — patterns, anomalies, opportunities — communicated in a way a decision-maker can act on. Analysis without insight is just numbers in a slide.

Parallel Processing

Splitting a job into chunks that run simultaneously across many machines (or cores) so it finishes faster. The reason Spark/Databricks can crunch a year of click data in minutes, not hours.

Multi-threading

Doing several things at once within a single program by running multiple threads of execution. Pipelines use it to download from many sources concurrently or to keep the CPU busy while waiting on I/O.
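
A minimal sketch of the "download from many sources concurrently" case, using Python's standard thread pool. The `fetch` function only sleeps to simulate a network wait, and the source names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(source: str) -> str:
    """Pretend to download from a source; the sleep stands in for network I/O."""
    time.sleep(0.1)
    return f"{source}: done"


sources = ["crm", "billing", "web_logs"]  # hypothetical source names

# Each fetch runs on its own thread, so the waits overlap instead of adding up.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, sources))
```

`pool.map` returns results in the same order as the inputs, even though the threads run side by side.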

04 Category

Quality & trust

Whether anyone can trust what comes out of the pipeline — and how teams prove they can.

Data Validation

Checking accuracy, completeness and consistency before processing — bad rows out, good rows through. A moment-in-time gate.

Data Quality

The ongoing measure of how reliable, consistent and accurate your data is. Validation is a moment; quality is a programme.

Data Cleansing

Fixing the broken rows — deduplicating customers, standardising postcodes, filling missing fields, repairing dates. Usually the bulk of an engineer's day.

Data Lineage

The map of where a number came from: which source, which transformations, which pipeline. Critical when a director asks "is this figure right?"

Data Observability

Continuously monitoring pipelines for freshness, volume, schema drift and anomalies — so you know the dashboard is broken before the CEO does.

Data Integrity

The guarantee that data stays accurate, consistent and complete as it moves and is updated — no duplicates, no orphaned rows, no silent corruption. Enforced by constraints, transactions and checks.

Data Contract

A formal agreement between a data producer (e.g. the app team) and its consumers (analytics, ML) defining the schema, freshness, ownership and breaking-change rules. Stops silent upstream changes from breaking downstream dashboards.

Quarantine

Rows that fail validation get parked in a separate "quarantine" table or folder rather than blocking the pipeline. A human (or rule) reviews them later — bad data isolated, good data flows through.
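
Validation and quarantine usually work as a pair, roughly like the sketch below. The two rules and the sample rows are assumptions for illustration; real pipelines would have many more checks and write the quarantined rows to a table or folder.

```python
def validate(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    if not row.get("customer_id"):
        problems.append("missing customer_id")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("amount must be a non-negative number")
    return problems


incoming = [
    {"customer_id": "C1", "amount": 20.0},
    {"customer_id": "", "amount": 5.0},
    {"customer_id": "C3", "amount": -1},
]

clean, quarantine = [], []
for row in incoming:
    problems = validate(row)
    if problems:
        # Park the bad row with the reason, so a reviewer knows what to fix.
        quarantine.append({"row": row, "problems": problems})
    else:
        clean.append(row)
```

The good row flows through, and the two bad rows are kept (with reasons) rather than silently dropped or allowed to stop the run.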

05 Category

Governance, security & access

The rules around who can see what, who owns what, and how the company stays out of legal trouble.

Data Governance

The policies, processes and accountability that decide who owns what data, who can change it, and who can see it. The grown-up rulebook.

Data Security

Protecting data from unauthorised access, corruption or breaches — encryption, access controls, audit trails.

PII (Personally Identifiable Information)

Anything that identifies a real person: name, email, address, NI number, IP address. Heavily regulated: the GDPR (EU) and UK GDPR set the rules.

Anonymisation / Pseudonymisation

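Replacing or removing identifying fields so analysts can use the data safely. Anonymisation is one-way: nobody can get back to the person. Pseudonymisation (e.g. tokenisation or keyed hashing) swaps identifiers for stand-ins that can still be linked back to the person by whoever holds the key or lookup table.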
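
A rough sketch of the two approaches on a single row. Keyed hashing (HMAC) is used here as one possible pseudonymisation technique, and the key, field names and sample row are all hypothetical. Note that simply dropping fields is not always enough for true anonymisation, since combinations of the remaining fields can sometimes re-identify a person.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-keep-this-out-of-source-control"  # hypothetical key


def pseudonymise(value: str) -> str:
    """Swap a value for a stable token; the same input always gives the same token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymise(row: dict) -> dict:
    """Drop the identifying fields altogether."""
    return {k: v for k, v in row.items() if k not in {"name", "email"}}


row = {"name": "Asha", "email": "asha@example.com", "country": "UK", "spend": 120}

pseudonymised = {
    **row,
    "name": pseudonymise(row["name"]),
    "email": pseudonymise(row["email"]),
}
anonymised = anonymise(row)
```

Because the tokens are stable, analysts can still count or join on "the same customer" without seeing who that customer is. Whoever holds the key can recompute the token for a known email and link the rows back.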

Role-Based Access Control (RBAC)

Permissions granted by job role rather than individual: "finance can see revenue, support can see one customer at a time". How real companies stop the wrong people seeing the wrong rows.
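
In code, the idea boils down to two lookups: user to role, then role to permissions. The sketch below is a toy version with hypothetical users, roles and permission names; real systems enforce this in the database or an identity provider rather than a Python dict.

```python
# Permissions hang off roles, never off individual people.
ROLE_PERMISSIONS = {
    "finance": {"revenue:read"},
    "support": {"customer:read_one"},
    "admin": {"revenue:read", "customer:read_one", "customer:read_all"},
}

# Each user is assigned a role (hypothetical users).
USER_ROLES = {"priya": "finance", "sam": "support"}


def can(user: str, permission: str) -> bool:
    """Check a permission via the user's role; unknown users get nothing."""
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())


finance_sees_revenue = can("priya", "revenue:read")
support_sees_revenue = can("sam", "revenue:read")
stranger_sees_revenue = can("nobody", "revenue:read")
```

When someone changes job, only their entry in `USER_ROLES` changes; the permission rules stay untouched.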

06 Category

Discovery & metadata

How people actually find the data that exists inside a large company — and understand it once they do.

Data Cataloguing

Indexing and organising your data assets so people can actually find them. The library card system of the data world.

Metadata

Data about the data: what this table holds, who owns it, when it last refreshed, what the columns mean. The thing that makes a catalogue useful.

Data Dictionary

A document (or page in the catalogue) that defines every column in plain English. "total_value_gbp = order total in pounds, including VAT, excluding refunds."

Data Steward

The named human accountable for a slice of the data — usually a domain expert who answers "is this number right?" Not necessarily on the data team.

07 Category

Roles you'll see in job ads

What the job titles actually mean. Boundaries blur in practice, especially at smaller companies.

Data Analyst

Turns existing data into answers — writes SQL, builds dashboards, presents findings. Closest to the business; furthest from the plumbing.

Data Engineer

Builds and maintains the pipelines that move data between systems and into the warehouse. Lots of SQL, Python, cloud services. Owns the plumbing.

Analytics Engineer

Newer role — bridges analyst and engineer. Owns the transformation layer (often in dbt) that turns raw warehouse data into clean models analysts query.

BI Developer

Specialises in dashboards and reporting tools (Power BI, Tableau, Looker). Knows the data well enough to model it for fast queries by non-technical users.

Data Scientist

Builds statistical and machine-learning models on top of the data — forecasts, recommendations, classifications. Closer to research than engineering.
