A technical deep dive into how PGDUMP works

A technical deep dive into how PGDUMP works

Introduction

PGDUMP (or pg_dump) is a Postgres utility (source) that comes bundled with every PostgreSQL installation. It is a command-line tool that connects to a PostgreSQL server and constructs queries to extract database metadata and/or table data, including all of the constraints. This exported file can then be used to recreate the database on the same or another PostgreSQL server. This is really useful for backups and data migrations.

Let's take a step-by-step look at exactly how PGDUMP works under the hood and then at the end go through an example.

Step 1 - Connect and Query

When you run the pg_dump command, the first thing it does is establish a connection to the PostgreSQL server using the provided connection string. It uses the same libpq library as the psql command-line client. This is nice because it ensures compatibility across versions.

To ensure a consistent snapshot of the database, pg_dump leverages PostgreSQL's MVCC (Multi-Version Concurrency Control). This means it can safely back up the database without locking out concurrent users. This is usually fine for most operations but large transactions might be impacted.

Once connected, pg_dump queries the PostgreSQL system catalogs to retrieve metadata about the target database. This includes information on schemas, tables, data types, indexes, constraints, views, functions, and more. It fetches all the database object definitions and their properties.

Let's go through these in more detail.

  1. Database Properties: It queries the pg_database system catalog to get general properties of the database like encoding, tablespace, etc.
  2. Extensions: Queries pg_extension to get a list of installed extensions in the database.
  3. Schemas: Queries pg_namespace to get a list of all schemas in the database.
  4. Types: For each schema, it queries pg_type to get a list of data types defined in that schema.
  5. Tables & Columns: For each schema, it queries pg_class and pg_attribute to get a list of tables and their column definitions.
  6. Constraints & Indexes: For each table, it queries pg_constraint and pg_index to retrieve constraints (primary keys, foreign keys, etc.) and index definitions.
  7. Inheritance Hierarchy: It determines table inheritance hierarchies by examining the pg_inherits system catalog.
  8. Views & Sequences: Queries pg_class again to identify views and sequences in each schema.
  9. Functions & Procedures: Queries pg_proc to get functions, procedures, triggers, etc. defined in each schema.
  10. Operators & Operator Classes: Queries pg_operator and pg_opclass for user-defined operators and operator classes.
  11. Comments: Queries pg_description to retrieve comments on database objects.

It's important to note that pg_dump doesn't query the tables directly. Instead, it queries the system catalogues which contain all of the metadata representing the database. The order also matters. For example, data types are queried before tables since tables may use custom types. It then buffers this metadata in memory as it gets ready for step 2.

Step 2: Dump the Data

Now that we have the metadata, pg_dump then starts the process of extracting the raw data for each table in the database. There are configurations for pg_dump where you only get the schema. If so, then this step is effectively skipped. If we want the data as well, then here's a breakdown of what happens:

  1. It constructs a query like SELECT \* FROM schema.table for each table.
  2. This query is executed, and pg_dump fetches the result rows in batches to minimize memory overhead.
  3. For each row fetched, pg_dump formats an INSERT statement with the row data using the COPY command. For example, COPY schema.table (col1, col2) FROM stdin; string1 string2 \.
  4. These INSERT statements are buffered in memory in the order the rows were retrieved.

Tables with foreign keys or inheritance hierarchies are dumped in an order that preserves referential integrity. Data is dumped sorted by OID (object identifier) to maintain consistency across runs. Lastly, pg_dump has special handling for large objects. It typically uses the lo_export function to manage these objects.

At this point, pg_dump has, in memory, the contents of the pg_dump output file. Just in time for step 3!

Step 3: Assembling the Dump File

As pg_dump fetches metadata definitions and table data, it incrementally constructs the dump file by writing out sections in this order:

  1. Comments: Any comments on database objects specified.
  2. Server Settings: Session settings required to recreate the environment.
  3. Extensions: Any installed PostgreSQL extensions used in the database.
  4. Type Definitions: Custom data types used in the database.
  5. Table Schemas: The SQL statements to create tables, indexes, constraints, etc.
  6. Table Data: The INSERT statements populated with each table's data.
  7. Procedural Code: Definitions for views, functions, triggers, and other routines.

The output is written as plain text with SQL statements intermixed with PGDUMP directives that indicate which database objects a section applies to.

Step 4: Output

By default, pg_dump outputs the complete SQL script to standard output (stdout). You can optionally specify an output file to write to instead.

Additionally, pg_dump can also compress the output SQL file using the specified compression format (gzip, bzip2, etc). The compressed file can then be transferred and decompressed on the destination system.

Let's look at an example

Let's go through an example of using pg_dump to back up the "neosync_test" sample database:

$ pg_dump -U postgres neosync_test > neosync_test

This command connects to the PostgreSQL server as the "postgres" user and dumps the complete "neosync_test" database as plain text SQL into the file "neosync_test".

Here's a high-level look at what the dump will look like:

--
-- PostgreSQL database dump
--
 
-- Dumped from database version 16.0 (Debian 16.0-1.pgdg120+1)
-- Dumped by pg_dump version 16.2 (Homebrew)
 
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;
 
--
-- Name: neosync_test; Type: SCHEMA; Schema: -; Owner: postgres
--
 
CREATE SCHEMA neosync_test;
ALTER SCHEMA neosync_test OWNER TO postgres;
 
--
-- Name: uuid-ossp; Type: EXTENSION; Schema: -; Owner: -
--
 
CREATE EXTENSION IF NOT EXISTS "uuid-ossp" WITH SCHEMA public;
 
--
-- Name: EXTENSION "uuid-ossp"; Type: COMMENT; Schema: -; Owner:
--
 
COMMENT ON EXTENSION "uuid-ossp" IS 'generate universally unique identifiers (UUIDs)';
 
SET default_tablespace = '';
 
SET default_table_access_method = heap;
 
--
-- Name: users; Type: TABLE; Schema: neosync_test; Owner: postgres
--
 
CREATE TABLE neosync_test.users (
    email character varying(60) NOT NULL,
    id uuid DEFAULT public.uuid_generate_v4() NOT NULL,
    first_name character varying(20) DEFAULT 'evis'::character varying NOT NULL,
    last_name character varying(255) DEFAULT 'drenova'::character varying NOT NULL,
    weight integer DEFAULT 30 NOT NULL,
    middle_name character varying DEFAULT 'bj'::character varying NOT NULL
);
 
ALTER TABLE neosync_test.users OWNER TO postgres;
 
--
-- Data for Name: users; Type: TABLE DATA; Schema: neosync_test; Owner: postgres
--
 
COPY neosync_test.users (email, id, first_name, last_name, weight, middle_name) FROM stdin;
evis@gmail.com a293f9fe-96dc-4e3d-84bd-1b966356e54b evis hello 30 bj
johndoe@gmail.com 152983c0-824f-4fc7-896e-5d56c8560e6c 1234567890 hello 30 bj
evistest@hotmrgergail.com 54f01b8b-c423-4c3b-b1eb-d2cd15271d13 evis joijfio32j3JIO 30 bj
evispgdump@hotmrgergail.com 54f01b8b-c423-4c3b-b1eb-d2cd15271d19 evis joijfio32j3JIO 30 bj
\.
 
--
-- Name: users users_email_key; Type: CONSTRAINT; Schema: neosync_test; Owner: postgres
--
 
ALTER TABLE ONLY neosync_test.users
    ADD CONSTRAINT users_email_key UNIQUE (email);
 
--
-- Name: users users_pkey; Type: CONSTRAINT; Schema: neosync_test; Owner: postgres
--
 
ALTER TABLE ONLY neosync_test.users
    ADD CONSTRAINT users_pkey PRIMARY KEY (id);
 
--
-- PostgreSQL database dump complete
--

The dump contains the DDL to create all schemas, tables, data types, and other objects first. Then it has the raw INSERT statements populated with data for each table. Finally, it includes the definitions for constraints, indexes, views, functions, and other database objects.

When this dump file is run through psql on another PostgreSQL server, it will accurately reconstruct the complete "neosync_test" database from scratch.

Wrapping up

Pg_dump is used by thousands of companies every day to back up their databases. Conceptually, it's pretty simple but the implementation is quite complex. The source code is about 18k lines long! It's fun to take a deep dive under the covers and see exactly what's going on. I hope this technical deep dive should give you a solid understanding of the detailed process pg_dump goes through.


Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows

Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows

Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows

December 13th, 2024

View Article