Tuple Layout

When you insert a new row into a table, PostgreSQL creates a tuple: a contiguous run of bytes that contains both system metadata and your actual data.

A tuple starts with a 23-byte header. Every tuple in every table has this same header structure, regardless of how many columns you have or which data types you use. These 23 bytes are pure overhead, which is why storing lots of tiny rows (say, a two-column table with a smallint and a boolean) is inefficient: you're spending more space on metadata than on data.
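To put a number on that overhead, here is a rough back-of-the-envelope sketch. It assumes a typical 64-bit build where the header and the whole tuple are padded to 8-byte boundaries, and it counts the 4-byte line pointer each tuple needs in the page. The function names are made up for illustration.

```python
HEADER_BYTES = 23        # fixed tuple header size
MAXALIGN = 8             # assumption: 8-byte alignment (typical 64-bit build)
LINE_POINTER_BYTES = 4   # each tuple also needs a line pointer in the page


def align_up(n, boundary):
    """Round n up to the next multiple of boundary."""
    return (n + boundary - 1) // boundary * boundary


def tuple_footprint(data_bytes):
    """Approximate on-page bytes consumed by one tuple with no NULLs."""
    header = align_up(HEADER_BYTES, MAXALIGN)        # 23 -> 24
    tuple_len = align_up(header + data_bytes, MAXALIGN)
    return tuple_len + LINE_POINTER_BYTES


data = 2 + 1  # smallint (2 bytes) + boolean (1 byte)
total = tuple_footprint(data)
print(total)                   # 36
print(f"{data / total:.0%}")   # 8%
```

Under these assumptions, only about 8% of the space used by each row is your data.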

Tuple Headers

t_xmin

The first 4 bytes are t_xmin, a transaction ID. This is the ID of the transaction that created this tuple. When you run INSERT INTO users VALUES (1, 'Alice') inside transaction 1000, the resulting tuple gets t_xmin=1000.

This field never changes. Ever. Even if the tuple is later updated or deleted, t_xmin remains the ID of the transaction that originally created it. This is the foundation of MVCC. To determine if a tuple is visible to your transaction, PostgreSQL looks at t_xmin and asks: "Was this transaction committed before my snapshot was taken? Am I allowed to see tuples it created?"
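The question PostgreSQL asks about t_xmin can be sketched roughly like this. It is a heavily simplified model, not the real visibility code: the Snapshot class and the committed set are hypothetical stand-ins, and the sketch ignores t_xmax, subtransactions, transaction ID wraparound, and tuples created by your own transaction.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    # Hypothetical, simplified snapshot: every XID >= xmax started after
    # the snapshot was taken; in_progress lists XIDs still running then.
    xmax: int
    in_progress: set = field(default_factory=set)


def xmin_visible(t_xmin, snapshot, committed):
    """Can this snapshot see a tuple created by transaction t_xmin?"""
    if t_xmin not in committed:
        return False            # creator aborted or is still running
    if t_xmin >= snapshot.xmax:
        return False            # creator started after the snapshot
    if t_xmin in snapshot.in_progress:
        return False            # creator committed, but only after the snapshot
    return True


snap = Snapshot(xmax=1005, in_progress={1002})
committed = {1000, 1002, 1006}
print(xmin_visible(1000, snap, committed))  # True
print(xmin_visible(1002, snap, committed))  # False
print(xmin_visible(1006, snap, committed))  # False
```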

t_xmax

The next 4 bytes are t_xmax, which is more complicated. In the simple case, if the tuple has never been deleted or locked, t_xmax is zero. But if a transaction deletes this tuple, t_xmax gets set to that transaction's ID.

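Here's where it gets subtle: t_xmax can also mean "this tuple is locked" (as in SELECT ... FOR UPDATE) rather than deleted. How do you tell the difference? You have to look at the t_infomask flags. If HEAP_XMAX_LOCK_ONLY is set, then t_xmax is a lock, not a deletion. Otherwise the transaction in t_xmax deleted or updated the tuple, and you tell those two apart with the t_ctid field described below: after an UPDATE the old version's t_ctid points to the newer version, while after a plain DELETE it normally still points to itself. (HEAP_UPDATED is easy to misread here: it is set on the new version produced by an UPDATE, not on the old one.)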

And there's yet another case: if multiple transactions lock the same tuple concurrently (for example, multiple SELECT ... FOR SHARE statements), t_xmax doesn't hold a transaction ID at all. Instead, it holds a "MultiXactId," which is an ID into a separate structure (pg_multixact) that stores a list of transaction IDs. The HEAP_XMAX_IS_MULTI flag in t_infomask tells you this has happened.

This overloading of t_xmax is a clever space optimization, but it makes the visibility logic quite complex.
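As a rough illustration of that overloading, the sketch below interprets t_xmax from the flag bits. The flag values match PostgreSQL's htup_details.h, but describe_xmax is a hypothetical helper, and it glosses over details such as a MultiXactId that itself contains an updater.

```python
HEAP_XMAX_LOCK_ONLY = 0x0080
HEAP_XMAX_INVALID = 0x0800   # hint bit: t_xmax is unset or aborted
HEAP_XMAX_IS_MULTI = 0x1000


def describe_xmax(t_xmax, t_infomask):
    """Say what the value stored in t_xmax means for this tuple."""
    if t_xmax == 0 or t_infomask & HEAP_XMAX_INVALID:
        return "no deleter or locker"
    if t_infomask & HEAP_XMAX_IS_MULTI:
        # Not a transaction ID: look the members up in pg_multixact.
        return f"MultiXactId {t_xmax}"
    if t_infomask & HEAP_XMAX_LOCK_ONLY:
        return f"locked by transaction {t_xmax}"
    return f"deleted or updated by transaction {t_xmax}"


print(describe_xmax(0, HEAP_XMAX_INVALID))       # no deleter or locker
print(describe_xmax(1001, HEAP_XMAX_LOCK_ONLY))  # locked by transaction 1001
print(describe_xmax(7, HEAP_XMAX_IS_MULTI))      # MultiXactId 7
print(describe_xmax(1002, 0))                    # deleted or updated by transaction 1002
```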

t_ctid

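After t_xmax comes a 4-byte field, t_cid (which shares its storage with t_xvac), holding the command ID within the inserting or deleting transaction. The 6 bytes after that are t_ctid, a tuple identifier consisting of a page number (4 bytes) and a line pointer number (2 bytes). This field serves a dual purpose.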

First, it's the tuple's own physical address, often called the TID (tuple identifier). If you run:

SELECT ctid, * FROM users;

You'll see values like (0,1), meaning page 0, line pointer 1. This is how indexes refer to tuples—they store the TID.

Second, t_ctid is used for update chains. When a tuple is the current version (hasn't been updated), its t_ctid points to itself: (0,1) points to (0,1). But when a tuple is updated, the old version's t_ctid gets changed to point to the new version. This creates a chain:

Old version at (0,1): t_ctid = (0,2)
New version at (0,2): t_ctid = (0,2)  [self-pointer]

If you update again:

Old v1 at (0,1): t_ctid = (0,2)
Old v2 at (0,2): t_ctid = (0,3)
Current at (0,3): t_ctid = (0,3)

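This chain allows PostgreSQL to follow updates. Suppose an index entry points to (0,1) and the later versions were created by HOT updates, so they have no index entries of their own. When you look up that TID, you find an old version with t_ctid=(0,2), so you follow the chain to (0,2), then to (0,3), where you find the current version. (A non-HOT update instead writes a new index entry that points directly at the new version.)

Long update chains are a performance problem. If a tuple has been HOT-updated 100 times without any cleanup, a lookup may have to walk past as many as 100 dead versions to reach the current one. This is one reason why HOT (Heap-Only Tuple) updates are so valuable: the chain always stays on a single page, and dead versions in it can be pruned during normal page access, without waiting for VACUUM.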
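The chain walk can be modeled in a few lines. This is only a toy: the heap here is a plain dictionary mapping each TID to its t_ctid, and follow_chain is a made-up name. The real code also checks that each version's t_xmin matches the previous version's t_xmax before trusting the link.

```python
def follow_chain(heap, tid):
    """Follow t_ctid links from tid until a tuple points to itself.

    heap maps a TID (page, line pointer) to that tuple's t_ctid.
    Returns every TID visited, ending with the current version.
    """
    visited = [tid]
    while heap[tid] != tid:      # a self-pointer marks the latest version
        tid = heap[tid]
        visited.append(tid)
    return visited


heap = {
    (0, 1): (0, 2),   # old v1 points to v2
    (0, 2): (0, 3),   # old v2 points to current
    (0, 3): (0, 3),   # current version points to itself
}
print(follow_chain(heap, (0, 1)))  # [(0, 1), (0, 2), (0, 3)]
```

Each extra update adds one more hop to the list, which is exactly the cost the paragraph above describes.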

t_infomask

Now we come to the most complex field in the tuple header: t_infomask, a 2-byte bitmap containing 16 Boolean flags. This is where PostgreSQL packs a huge amount of state information.

Some flags describe the tuple's data layout. HEAP_HASNULL (bit 0) means at least one column is NULL, so there's a null bitmap after the tuple header. HEAP_HASVARWIDTH (bit 1) means there are variable-length columns. HEAP_HASEXTERNAL (bit 2) means at least one column is stored out-of-line in a TOAST table.

Other flags describe the tuple's MVCC state. HEAP_XMAX_LOCK_ONLY (bit 7) means t_xmax is a lock, not a delete. HEAP_UPDATED (bit 13) means this tuple is the new version produced by an UPDATE, as opposed to a freshly inserted row. HEAP_XMAX_IS_MULTI (bit 12) means t_xmax is a MultiXactId.

But the most important flags are the hint bits: HEAP_XMIN_COMMITTED (bit 8), HEAP_XMIN_INVALID (bit 9), HEAP_XMAX_COMMITTED (bit 10), and HEAP_XMAX_INVALID (bit 11). Understanding hint bits is essential.

Here's the problem they solve: To determine if a tuple is visible, we need to know whether t_xmin and t_xmax are committed or aborted. This information lives in the CLOG (commit log), also called pg_xact. The CLOG is on disk (or maybe cached in memory), and checking it requires I/O. If we had to check the CLOG for every tuple we examine during a query, performance would be terrible.

Hint bits cache this information directly in the tuple. The first time someone checks whether transaction 1000 is committed, they look it up in the CLOG. If it's committed, they set the HEAP_XMIN_COMMITTED bit in the tuple's t_infomask and mark the page dirty. From that point on, anyone who looks at this tuple sees the hint bit and knows immediately that transaction 1000 is committed, without having to touch the CLOG.
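Here is a toy model of that caching behavior. The flag values are the real ones, but everything else (the HeapTuple and Page classes, the dictionary standing in for the CLOG, the stats counter) is invented for illustration.

```python
from dataclasses import dataclass, field

HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID = 0x0200


@dataclass
class HeapTuple:
    t_xmin: int
    t_infomask: int = 0


@dataclass
class Page:
    tuples: list = field(default_factory=list)
    dirty: bool = False


def xmin_committed(page, tup, clog, stats):
    """Check whether t_xmin committed, consulting the CLOG only if needed."""
    if tup.t_infomask & HEAP_XMIN_COMMITTED:
        return True                      # hint bit answers immediately
    if tup.t_infomask & HEAP_XMIN_INVALID:
        return False
    stats["clog_lookups"] += 1           # the expensive path
    status = clog[tup.t_xmin]
    if status == "committed":
        tup.t_infomask |= HEAP_XMIN_COMMITTED
        page.dirty = True                # a read just modified the page
        return True
    if status == "aborted":
        tup.t_infomask |= HEAP_XMIN_INVALID
        page.dirty = True
    return False                         # in progress: no hint can be set yet


clog = {1000: "committed"}
stats = {"clog_lookups": 0}
page = Page([HeapTuple(t_xmin=1000)])
first = xmin_committed(page, page.tuples[0], clog, stats)
second = xmin_committed(page, page.tuples[0], clog, stats)
print(first, second, stats["clog_lookups"], page.dirty)  # True True 1 True
```

The second check never touches the CLOG, and the page ends up dirty even though nothing but reads happened.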

This has a fascinating consequence: a SELECT query can cause writes. If you run a big INSERT, creating millions of new tuples, and then immediately run a SELECT that scans the table, that SELECT will be the first to check the visibility of each tuple. For every tuple, it will look up the transaction in CLOG (probably finding it committed), set the hint bit, and mark the page dirty. Eventually, those dirty pages get written to disk. Your SELECT just triggered a write of the entire table.

A common remedy is to run VACUUM right after a bulk insert. VACUUM sets the hint bits as it scans the table, so subsequent queries don't have to pay that cost.

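-- heap_page_items and get_raw_page come from the pageinspect extension
CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE hint_demo (id INT);
INSERT INTO hint_demo VALUES (1);

SELECT t_infomask FROM heap_page_items(get_raw_page('hint_demo', 0));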

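For this INT-only table, right after the insert you will typically see t_infomask = 2048 (0x0800): only HEAP_XMAX_INVALID is set, because nothing has checked t_xmin yet. On an older tuple from a table with a variable-length column, you might instead see t_infomask = 2818, which is 0x0B02 in hex. Let's decode that:

Binary: 0000 1011 0000 0010
Bit 1: HEAP_HASVARWIDTH (set)
Bit 8: HEAP_XMIN_COMMITTED (set)
Bit 9: HEAP_XMIN_INVALID (set)
Bit 11: HEAP_XMAX_INVALID (set)

Wait, both HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID are set? That seems contradictory. Since PostgreSQL 9.4, this otherwise impossible combination has a meaning of its own: HEAP_XMIN_FROZEN. The tuple has been frozen, so its t_xmin is treated as older than every snapshot and the tuple is visible to all transactions (unless t_xmax says otherwise). By contrast, HEAP_XMIN_INVALID on its own means the inserting transaction aborted, so the tuple is garbage and will never be visible to anyone.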
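Decoding these masks by hand gets tedious, so here is a small helper sketch. The bit values are the ones from PostgreSQL's htup_details.h for the flags named in this article. The remaining bits are left out, and decode_infomask is a made-up name.

```python
T_INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
}


def decode_infomask(mask):
    """Return the names of the known flags set in a t_infomask value."""
    return [name for bit, name in T_INFOMASK_FLAGS.items() if mask & bit]


print(f"{2818:#06x}")         # 0x0b02
print(decode_infomask(2818))
# ['HEAP_HASVARWIDTH', 'HEAP_XMIN_COMMITTED', 'HEAP_XMIN_INVALID', 'HEAP_XMAX_INVALID']
```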

t_infomask2

There's a second infomask field, t_infomask2, which is also 2 bytes. The lower 11 bits store the number of attributes (columns) in this tuple. Eleven bits could count up to 2047, although PostgreSQL itself caps a table at 1600 columns. The upper bits are flags, two of which relate to HOT updates: HEAP_HOT_UPDATED (bit 14) and HEAP_ONLY_TUPLE (bit 15).
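Splitting t_infomask2 into its parts is a matter of masking. The sketch below pulls out the attribute count and the HEAP_HOT_UPDATED flag. The constants match htup_details.h, decode_infomask2 is a hypothetical helper, and the same pattern extends to the other flag bits.

```python
HEAP_NATTS_MASK = 0x07FF    # lower 11 bits: number of attributes
HEAP_HOT_UPDATED = 0x4000   # bit 14


def decode_infomask2(mask):
    """Split a t_infomask2 value into attribute count and the HOT-updated flag."""
    return {
        "natts": mask & HEAP_NATTS_MASK,
        "hot_updated": bool(mask & HEAP_HOT_UPDATED),
    }


print(decode_infomask2(2))           # {'natts': 2, 'hot_updated': False}
print(decode_infomask2(0x4000 | 3))  # {'natts': 3, 'hot_updated': True}
```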
