Postgres Multi Version Concurrency Control

In 1986 database researcher named Michael Stonebraker was working on a problem that has plauged databases since their inception. how do you let many people read and write to the data simultaneously without ever grinding to a hault. The traditional locks solution was like having one bathroom for an entire building. Sure it works but line gets long very quickly.

Stonebraker had a radical idea: what if we kept multiple versions of each row? What if, instead of locking data, we just let everyone see the version that existed when they started their work? This became MVCC, and it's the reason PostgreSQL can do things that seem almost magical.

The fundamental problem

Imagine you're building a banking application. Two tellers are working simultaneously, both looking at the same account. The account has 1000 dollars in it. Teller A starts a transaction to withdraw 100 dollars. At the exact same moment, Teller B starts a transaction to deposit 50 dollars. What should happen?

This is called the "reader blocks writer" problem, and it's a disaster for high-concurrency systems. Teller B is just sitting there, unable to do anything, because Teller A happened to read the account balance first.

Transaction IDs

Every transaction in PostgreSQL that modifies data gets a unique identifier called a transaction ID, or XID. This isn't some abstract concept—it's a 32-bit unsigned integer that gets stamped onto every tuple you insert or update.

CREATE TABLE mvcc_demo (
    id INT PRIMARY KEY,
    account_name TEXT,
    balance NUMERIC
);

-- Start a transaction
BEGIN;

-- Check: do we have an XID yet?
SELECT txid_current_if_assigned();

This returns NULL. Why? Because PostgreSQL is lazy about assigning transaction IDs. A read-only transaction never needs one. Only when you do something that modifies data does PostgreSQL say, "Okay, you need a number."

-- Now force an XID assignment
SELECT txid_current();

INSERT INTO mvcc_demo VALUES (1, 'Alice', 1000);

-- Now let's look at what happened physically
SELECT t_xmin, t_xmax, t_ctid, * 
FROM heap_page_items(get_raw_page('mvcc_demo', 0));

You should see something like this:

 t_xmin | t_xmax | t_ctid | id | account_name | balance 
--------+--------+--------+----+--------------+---------
   1847 |      0 | (0,1)  |  1 | Alice        | 1000

Look at that t_xmin field. It's 1847—the transaction ID we just saw. This tuple was created by transaction 1847. The t_xmax is 0, meaning no transaction has deleted it yet.

Now let's do an update and see what happens:

-- In the same transaction
UPDATE mvcc_demo SET balance = 900 WHERE id = 1;

-- Look at the page again
SELECT t_xmin, t_xmax, t_ctid, id, account_name, balance 
FROM heap_page_items(get_raw_page('mvcc_demo', 0));

 t_xmin | t_xmax | t_ctid | id | account_name | balance 
--------+--------+--------+----+--------------+---------
   1847 |   1847 | (0,2)  |  1 | Alice        | 1000
   1847 |      0 | (0,2)  |  1 | Alice        |  900

Two tuples now! The old version has t_xmax=1847 (my transaction deleted it) and t_ctid=(0,2) pointing to the new version. The new version has t_xmin=1847 (my transaction created it). Both versions exist on disk simultaneously.

COMMIT;

-- Check the snapshot from outside this transaction
SELECT txid_current_snapshot();

Let me decode this for you. The snapshot format is xmin:xmax:xip_list. Here, xmin=1847 (the oldest transaction that was active when this snapshot was taken), xmax=1848 (the next XID to be assigned), and the xip_list is empty (no transactions are currently in progress). This snapshot is the key to everything. It's how PostgreSQL knows which tuple versions you're allowed to see.

Snapshots

A snapshot is a point-in-time view of which transactions are visible to you. Think of it as a photograph of the transaction ID space at the moment your query (or transaction) begins.

Let me demonstrate this with two concurrent sessions. Open two terminal windows and follow along.

Session A

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current();

Let's say this returns 2000.

SELECT txid_current_snapshot();

Output: 2000:2001:

This means: "I am transaction 2000. The next transaction will be 2001. No other transactions are running right now". Now, while keeping Session A open, go to Session B:

Session B

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current();

Returns: 2001

SELECT txid_current_snapshot();

Output: 2000:2002:2000

Read this as: "I am transaction 2001. The next transaction will be 2002. Transaction 2000 is currently in progress." Now, in Session B, let's insert some data:

Session B

INSERT INTO mvcc_demo VALUES (2, 'Bob', 500);
SELECT * FROM mvcc_demo;

You'll see both Alice (from earlier) and Bob. Session B can see its own insert immediately.

COMMIT;

Session B commits. Now let's go back to Session A:

Session A

SELECT * FROM mvcc_demo;

You'll only see Alice! Bob doesn't appear. Why? Because Session A's snapshot was taken before transaction 2001 existed. Even though 2001 has committed, Session A captured a snapshot at the beginning of its transaction that said, "I can't see anything from transaction 2001 or higher."

This is snapshot isolation in action. Session A sees a frozen view of the database as it existed when the transaction started.

Now let's see what happens if we change the isolation level to READ COMMITTED:

Session C (NEW WINDOW)

BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT txid_current_snapshot();

Output might be: 2000:2003:2000

SELECT * FROM mvcc_demo;

You see both Alice and Bob! Why? Because in READ COMMITTED mode, PostgreSQL takes a new snapshot at the start of each SQL statement, not at the start of the transaction. The commit from Session B happened before this SELECT started, so it's visible.

This is the difference between REPEATABLE READ and READ COMMITTED:

REPEATABLE READ: One snapshot for the entire transaction
READ COMMITTED: New snapshot for each statement

Let me show you the snapshot structure more precisely. When PostgreSQL creates a snapshot, it builds a small data structure in memory:

typedef struct SnapshotData {
    TransactionId xmin;    // Oldest XID still active
    TransactionId xmax;    // Next XID to be assigned
    uint32 xcnt;           // Number of XIDs in xip[]
    TransactionId *xip;    // Array of in-progress XIDs
} SnapshotData;

When transaction 2000 takes its snapshot while 2001 and 2005 are running:

xmin: 2000
xmax: 2010  (next to be assigned)
xcnt: 2
xip: [2000, 2005]

Now when you look at a tuple with t_xmin=2003, PostgreSQL asks:

Is 2003 < xmin (2000)? No.
Is 2003 >= xmax (2010)? No.
Is 2003 in the xip array? No.
Therefore, check CLOG to see if 2003 committed.

If CLOG says "committed," the tuple is visible. If "aborted," it's not. If "in progress," it's not visible either (unless it's your own transaction). This is how visibility works: every tuple read triggers this check against your snapshot.

Postgres Multi Version Concurrency Control - MVCC

The fundamental problem

Transaction IDs

Snapshots

Comments

More from this blog

Understanding HTTP MIME Types

Centralized Cache Key Management In Redis

PostgreSQL TOAST Storage Models

Tuple Layout

Command Palette

The fundamental problem

Transaction IDs

Snapshots

Comments

More from this blog