<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Tyranny Blogs]]></title><description><![CDATA[Tyranny Blogs]]></description><link>https://rabindranath-tiwari.com.np</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1771409405703/fa3209ce-cd23-48d1-b825-b1f90a47fd47.png</url><title>Tyranny Blogs</title><link>https://rabindranath-tiwari.com.np</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 18:39:16 GMT</lastBuildDate><atom:link href="https://rabindranath-tiwari.com.np/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding HTTP MIME Types]]></title><description><![CDATA[The internet hosts many kinds of resources, so each resource needs a label to tell clients how to handle it. MIME (Multipurpose Internet Mail Extensions) was originally developed to solve problems mov]]></description><link>https://rabindranath-tiwari.com.np/understanding-http-mime-types</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/understanding-http-mime-types</guid><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Sun, 22 Mar 2026 11:14:43 GMT</pubDate><content:encoded><![CDATA[<p>The internet hosts many kinds of resources, so each resource needs a label to tell clients how to handle it. MIME (Multipurpose Internet Mail Extensions) was originally developed to solve problems moving data between different mail systems, and the same concept is used by HTTP to indicate the media type of a resource.</p>
<p>MIME types are written as <code>type/subtype</code>.</p>
<h3>Type</h3>
<ul>
<li>The type indicates the broad category of the data, for example: <code>text</code>, <code>image</code>, <code>video</code>, <code>audio</code>, <code>application</code> (binary or structured data), and <code>multipart</code> (multiple parts in a single message).</li>
<li>These categories are formally defined in standards such as RFC 2046.</li>
</ul>
<h3>Subtype</h3>
<ul>
<li>The subtype specifies the exact format inside the main type, for example: <code>html</code>, <code>css</code>, <code>png</code>, <code>json</code>.</li>
<li>Together the <code>type/subtype</code> pair precisely identifies the resource format (for example, <code>text/html</code>).</li>
</ul>
<h3>Common MIME types</h3>
<ol>
<li>text/*</li>
</ol>
<ul>
<li><code>text/plain</code> — plain text</li>
<li><code>text/html</code> — HTML documents</li>
<li><code>text/css</code> — CSS stylesheets</li>
<li><code>text/javascript</code> — JavaScript (<code>application/javascript</code> is still widely seen, but RFC 9239 makes <code>text/javascript</code> the standard type)</li>
<li><code>text/event-stream</code> — Server-Sent Events (SSE)</li>
</ul>
<ol start="2">
<li>image/*</li>
</ol>
<ul>
<li><code>image/jpeg</code> — JPEG images</li>
<li><code>image/png</code> — PNG images</li>
<li><code>image/gif</code> — GIF images (supports animation)</li>
<li><code>image/webp</code> — WebP images</li>
</ul>
<ol start="3">
<li>audio/*</li>
</ol>
<ul>
<li><code>audio/mpeg</code> — MP3 audio</li>
<li><code>audio/wav</code> — WAV audio</li>
<li><code>audio/ogg</code> — Ogg audio</li>
<li><code>audio/webm</code> — WebM audio</li>
<li><code>audio/aac</code> — AAC audio</li>
<li><code>audio/flac</code> — FLAC (lossless) audio</li>
</ul>
<ol start="4">
<li>video/*</li>
</ol>
<ul>
<li><code>video/mp4</code> — MP4 video</li>
<li><code>video/webm</code> — WebM video</li>
<li><code>video/ogg</code> — Ogg video</li>
</ul>
<ol start="5">
<li>application/*</li>
</ol>
<ul>
<li><code>application/json</code> — JSON data</li>
<li><code>application/xml</code> — XML data</li>
<li><code>application/pdf</code> — PDF documents</li>
<li><code>application/octet-stream</code> — arbitrary binary data (default for unknown binaries)</li>
<li><code>application/zip</code> — ZIP archives</li>
</ul>
<ol start="6">
<li>multipart/*</li>
</ol>
<ul>
<li><code>multipart/form-data</code> — used for form submissions that include files</li>
<li><code>multipart/byteranges</code> — multiple byte ranges in a single response</li>
</ul>
<h3>HTTP usage</h3>
<ul>
<li>Servers send the MIME type in the <code>Content-Type</code> response header so browsers and other clients know how to process the payload. Example:<ul>
<li><code>Content-Type: text/html; charset=utf-8</code></li>
</ul>
</li>
</ul>
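<p>To pick the right <code>Content-Type</code>, a server typically maps the file extension to a MIME type. A minimal sketch of such a lookup (the table and function names here are illustrative, not a standard API):</p>

```typescript
// Small illustrative subset of a MIME table, not an exhaustive registry.
const mimeTypes: Record<string, string> = {
  ".html": "text/html",
  ".css": "text/css",
  ".js": "text/javascript",
  ".json": "application/json",
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".pdf": "application/pdf",
};

// Fall back to application/octet-stream for unknown binaries,
// matching the convention described above.
function contentTypeFor(filename: string): string {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : "";
  return mimeTypes[ext] ?? "application/octet-stream";
}
```

<p>A real server would set the result on the response, e.g. <code>Content-Type: text/html; charset=utf-8</code>, rather than return it from a function.</p>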
]]></content:encoded></item><item><title><![CDATA[Centralized Cache Key  Management In Redis]]></title><description><![CDATA[In modern web applications, efficient data access is essential for performance and user experience. Redis, a blazing-fast in-memory store used for caching, messaging, and short-lived persistence, depe]]></description><link>https://rabindranath-tiwari.com.np/centralized-cache-key-management-in-redis</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/centralized-cache-key-management-in-redis</guid><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Fri, 20 Mar 2026 13:36:01 GMT</pubDate><content:encoded><![CDATA[<p>In modern web applications, efficient data access is essential for performance and user experience. Redis, a blazing-fast in-memory store used for caching, messaging, and short-lived persistence, depends heavily on how you name and organize keys. In this article we explore why centralized, modular Redis key management matters and how to design an approach for a Node.js-based school management system that reduces bugs, improves maintainability, and scales cleanly.</p>
<h2>Why Redis Key Management Matters</h2>
<h3>The stakes are high</h3>
<p>Redis keys are the foundation of caching, session handling, queues, and many transient-data workflows. Poor key management causes:</p>
<ul>
<li>Bugs: A typo in a key name leads to cache misses and inconsistent behavior.</li>
<li>Inconsistency: Different services or modules using divergent naming conventions create confusion and integration bugs.</li>
<li>Scalability problems: When keys are scattered, refactoring or changing schemas becomes risky and expensive.</li>
<li>Debugging nightmares: Tracking where a certain key is created, read, or invalidated is difficult.</li>
<li>Naming conflicts: Accidental collisions can overwrite unrelated data.</li>
<li>Type unsafety: No guarantees that the right parameter types or formats are used when building keys.</li>
</ul>
<p>Properly managed keys reduce these risks and make system behavior predictable, testable, and easier to evolve.</p>
<h2>The problem with ad-hoc key management</h2>
<h3>Bad practice: Scattered keys</h3>
<p>Common anti-patterns include:</p>
<ul>
<li>Inconsistent separators and format (<code>user:123</code> vs <code>user_123</code> vs <code>user/123</code>).</li>
<li>Duplicate key construction logic scattered across modules/services.</li>
<li>Hard-coded strings littered through code, leading to silent failures on rename.</li>
<li>No parameter validation (e.g., using raw objects or arrays in parts of the key).</li>
<li>No centralized documentation or discoverability for which keys exist.</li>
</ul>
<p>These patterns make it hard to refactor, test, or enforce cross-cutting rules like TTLs, versioning, and prefixes.</p>
<h2>Good practice: Centralized, modular key management</h2>
<p>Centralizing Redis key creation and lifecycle rules brings clarity, reduces bugs, and speeds development. Key ideas:</p>
<ul>
<li>Single source of truth: central module (or small set of modules) that defines key templates and helper functions.</li>
<li>Consistent naming convention: choose separators and order of namespaces and enforce them.</li>
<li>Parameter validation and type safety: validate or type the parameters used to construct keys (TypeScript helps).</li>
<li>Versioning: include a version segment or use a prefix to make migrations safe.</li>
<li>Modularization by domain: group keys by bounded context (e.g., students, classes, attendance).</li>
<li>TTL strategy and defaults: centralize TTLs per key or key group so expirations are consistent.</li>
<li>Instrumentation &amp; discovery: log or expose which keys are created, and document the registry for teams.</li>
<li>Migration plans: support safe migration paths via key versioning or prefixing.</li>
</ul>
<p>Below are concrete recommendations and examples tailored for a school management system.</p>
<h2>Naming conventions (recommendations)</h2>
<ul>
<li>Use a clear separator, such as colon (<code>:</code>). Example: <code>school:123:student:456:profile</code>.</li>
<li>Order segments from broad to specific: <code>{domain}:{orgId}:{resource}:{resourceId}:{subresource}</code>.</li>
<li>Keep keys short but descriptive. Avoid embedding large JSON structures in keys.</li>
<li>Add an optional version segment or prefix: <code>v1:school:...</code> to allow rolling migrations.</li>
<li>Use prefixes for environment when sharing Redis (e.g., <code>prod:</code>, <code>staging:</code>) or use distinct Redis instances.</li>
</ul>
<h2>Domain-based key examples (school management)</h2>
<p>Suggested structure:</p>
<ul>
<li>School-level cache: <code>school:{schoolId}:meta</code></li>
<li>Student profile: <code>school:{schoolId}:student:{studentId}:profile</code></li>
<li>Student attendance for date: <code>school:{schoolId}:student:{studentId}:attendance:{YYYY-MM-DD}</code></li>
<li>Class roster: <code>school:{schoolId}:class:{classId}:roster</code></li>
<li>Teacher sessions: <code>school:{schoolId}:teacher:{teacherId}:session:{sessionId}</code></li>
</ul>
<p>Example keys:</p>
<ul>
<li><code>v1:school:42:student:1001:profile</code></li>
<li><code>v1:school:42:class:7:roster</code></li>
<li><code>v1:school:42:student:1001:attendance:2026-03-20</code></li>
</ul>
<h2>Centralized key factory (pattern)</h2>
<p>Create a single module that exports functions to build keys and optionally parse or validate them. Benefits:</p>
<ul>
<li>Single place to enforce naming, version, TTL defaults.</li>
<li>Easier to change structure globally (e.g., add <code>v2:</code>).</li>
<li>Improves code discoverability and reuse.</li>
</ul>
<p>Example (JavaScript / TypeScript style pseudocode):</p>
<pre><code class="language-ts">// redisKeys.ts
const PREFIX = 'v1';
const SEP = ':';

export const keys = {
  schoolMeta: (schoolId: number | string) =&gt;
    [PREFIX, 'school', schoolId, 'meta'].join(SEP),

  studentProfile: (schoolId: number | string, studentId: number | string) =&gt;
    [PREFIX, 'school', schoolId, 'student', studentId, 'profile'].join(SEP),

  studentAttendance: (schoolId: number | string, studentId: number | string, date: string) =&gt;
    [PREFIX, 'school', schoolId, 'student', studentId, 'attendance', date].join(SEP),

  classRoster: (schoolId: number | string, classId: number | string) =&gt;
    [PREFIX, 'school', schoolId, 'class', classId, 'roster'].join(SEP),
};
</code></pre>
<p>Use these helpers everywhere instead of inline strings. If you later need to change <code>PREFIX</code> to <code>v2</code> or add an environment prefix, you change it in one place.</p>
<h2>Type safety and validation</h2>
<ul>
<li>In TypeScript, type the function inputs (schoolId: string | number). Add runtime checks for format when necessary.</li>
<li>Validate date formats (ISO-8601 or YYYY-MM-DD) for keys that embed dates.</li>
<li>Consider small helper functions that sanitize IDs (e.g., disallow colons in IDs).</li>
</ul>
<p>Example runtime guard:</p>
<pre><code class="language-ts">function assertId(id: unknown, name = 'id') {
  if (typeof id !== 'string' &amp;&amp; typeof id !== 'number') {
    throw new Error(`${name} must be a string or number`);
  }
}
</code></pre>
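<p>A sanitizer along the lines suggested above might simply reject IDs containing the separator, so a malformed ID cannot inject extra key segments. This helper is hypothetical, shown only for illustration:</p>

```typescript
// Reject IDs that contain the key separator (':'), which would otherwise
// let one ID masquerade as several key segments.
function sanitizeId(id: string | number, name = 'id'): string {
  const s = String(id);
  if (s.length === 0 || s.includes(':')) {
    throw new Error(`${name} must be non-empty and must not contain ':'`);
  }
  return s;
}
```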
<h2>TTL and expiration strategy</h2>
<ul>
<li>Define default TTLs in the key module or in a separate TTL registry.</li>
<li>Use TTLs for ephemeral caches and avoid TTLs for data you treat as persistent (or document exceptions).</li>
<li>Central TTL registry example:</li>
</ul>
<pre><code class="language-ts">export const ttl = {
  studentProfile: 60 * 60 * 24, // 24 hours
  classRoster: 60 * 10,         // 10 minutes
};
</code></pre>
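<p>The key builders and the TTL registry can be tied together in one write path, so callers never pass a raw key string or an ad-hoc expiration. A sketch, assuming an ioredis-style client exposing <code>set(key, value, 'EX', seconds)</code>; the wrapper and the inline stand-ins for the key factory and TTL registry are illustrative:</p>

```typescript
// Minimal stand-ins for the key factory and TTL registry shown earlier.
const studentProfileKey = (schoolId: number, studentId: number) =>
  ['v1', 'school', schoolId, 'student', studentId, 'profile'].join(':');
const STUDENT_PROFILE_TTL = 60 * 60 * 24; // 24 hours

// Client interface kept minimal so the wrapper is testable without a live
// Redis; ioredis exposes a compatible set(key, value, 'EX', seconds).
interface RedisLike {
  set(key: string, value: string, mode: 'EX', seconds: number): Promise<unknown>;
}

// Write-path helper: the key and the TTL both come from the central module.
async function cacheStudentProfile(
  client: RedisLike,
  schoolId: number,
  studentId: number,
  profileJson: string,
): Promise<string> {
  const key = studentProfileKey(schoolId, studentId);
  await client.set(key, profileJson, 'EX', STUDENT_PROFILE_TTL);
  return key;
}
```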
<h2>Key versioning and migrations</h2>
<ul>
<li>Prefix keys with a version (<code>v1:</code>). To migrate, write new keys with <code>v2:</code> and keep <code>v1:</code> readers until migration completes.</li>
<li>Alternatively, perform background jobs to re-key or repopulate caches under the new format.</li>
</ul>
<h2>Documentation, discovery, and monitoring</h2>
<ul>
<li>Keep a living registry (the key module doubles as documentation).</li>
<li>Document patterns in README or internal docs accessible by teams.</li>
<li>Log key creation and invalidation events for debugging.</li>
<li>Use Redis keyspace notifications sparingly (they can be noisy) or maintain application-level audit logs for critical keys.</li>
</ul>
<h2>Operational considerations</h2>
<ul>
<li>Namespace separation: consider separate Redis DBs or clusters per environment to avoid accidental collisions.</li>
<li>Key scanning: avoid heavy use of KEYS in production. Prefer known patterns or use SCAN with care for maintenance scripts.</li>
<li>Use Redis memory monitoring and eviction policy tailored for caches (e.g., LRU).</li>
<li>Instrument cache hit/miss metrics per key group. That lets you tune TTLs or caching boundaries.</li>
</ul>
<h2>Migration &amp; refactor checklist</h2>
<ul>
<li>Add versioned keys while keeping old readers active.</li>
<li>Populate new keys on writes (write-through) and read-through fallback to old keys until warm.</li>
<li>Run background rekeying for large datasets when possible.</li>
<li>Monitor for orphaned v1 keys and plan for cleanup after confidence.</li>
</ul>
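<p>The read-through fallback from the checklist can be sketched as follows. A Map stands in for Redis, values are plain strings, and the sketch assumes v2 keeps the same key layout as v1 (in a real migration the copy step might also translate the key):</p>

```typescript
// A Map stands in for Redis; in production these would be GET/SET calls.
type Cache = Map<string, string>;

// Read-through with fallback: prefer the v2 key; if only the v1 key exists,
// copy its value forward under v2 so the cache warms as it is read.
function readWithFallback(cache: Cache, suffix: string): string | undefined {
  const v2Key = `v2:${suffix}`;
  const v1Key = `v1:${suffix}`;
  const v2Val = cache.get(v2Key);
  if (v2Val !== undefined) return v2Val;
  const v1Val = cache.get(v1Key);
  if (v1Val !== undefined) {
    cache.set(v2Key, v1Val); // lazy re-key under the new version
  }
  return v1Val;
}
```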
<h2>Summary</h2>
<p>Centralized Redis key management brings immediate benefits:</p>
<ul>
<li>Fewer bugs from typos and inconsistent naming.</li>
<li>Predictable refactor paths via versioning.</li>
<li>Easier enforcement of TTLs and caching policies.</li>
<li>Better documentation and discoverability across teams.</li>
</ul>
<p>For a Node.js school management system, adopt a small, well-documented key factory module that:</p>
<ul>
<li>Exposes domain-specific key builders,</li>
<li>Holds TTLs and versioning info,</li>
<li>Validates inputs,</li>
<li>And serves as the canonical registry for all Redis key usage.</li>
</ul>
<p>Starting with a centralized approach keeps your cache predictable, debuggable, and ready to scale as your application and team grow.</p>
]]></content:encoded></item><item><title><![CDATA[Postgres Multi Version Concurrency Control - MVCC]]></title><description><![CDATA[In 1986 database researcher named Michael Stonebraker was working on a problem that has plauged databases since their inception. how do you let many people read and write to the data simultaneously wi]]></description><link>https://rabindranath-tiwari.com.np/postgres-multi-version-concurrency-control-mvcc</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/postgres-multi-version-concurrency-control-mvcc</guid><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Thu, 19 Feb 2026 17:10:29 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/6994419ff4a777784e6fc082/cb80f497-7fbe-4026-ae89-040a05911706.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 1986, a database researcher named Michael Stonebraker was working on a problem that had plagued databases since their inception: how do you let many people read and write data simultaneously without everything grinding to a halt? The traditional locking solution was like having one bathroom for an entire building. Sure, it works, but the line gets long very quickly.</p>
<p>Stonebraker had a radical idea: what if we kept multiple versions of each row? What if, instead of locking data, we just let everyone see the version that existed when they started their work? This became MVCC, and it's the reason PostgreSQL can do things that seem almost magical.</p>
<h3>The fundamental problem</h3>
<p>Imagine you're building a banking application. Two tellers are working simultaneously, both looking at the same account. The account has 1000 dollars in it. Teller A starts a transaction to withdraw 100 dollars. At the exact same moment, Teller B starts a transaction to deposit 50 dollars. What should happen?</p>
<p>In a traditional lock-based system, Teller A's read takes a lock on the account row, and Teller B's write must wait until that lock is released. This is called the "reader blocks writer" problem, and it's a disaster for high-concurrency systems. Teller B is just sitting there, unable to do anything, because Teller A happened to read the account balance first.</p>
<h3>Transaction IDs</h3>
<p>Every transaction in PostgreSQL that modifies data gets a unique identifier called a transaction ID, or XID. This isn't some abstract concept—it's a 32-bit unsigned integer that gets stamped onto every tuple you insert or update.</p>
<pre><code class="language-sql">CREATE TABLE mvcc_demo (
    id INT PRIMARY KEY,
    account_name TEXT,
    balance NUMERIC
);

-- Start a transaction
BEGIN;

-- Check: do we have an XID yet?
SELECT txid_current_if_assigned();
</code></pre>
<p>This returns NULL. Why? Because PostgreSQL is lazy about assigning transaction IDs. A read-only transaction never needs one. Only when you do something that modifies data does PostgreSQL say, "Okay, you need a number."</p>
<pre><code class="language-sql">-- Now force an XID assignment
SELECT txid_current();
</code></pre>
<pre><code class="language-sql">INSERT INTO mvcc_demo VALUES (1, 'Alice', 1000);

-- Now let's look at what happened physically
SELECT t_xmin, t_xmax, t_ctid, * 
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
</code></pre>
<p>You should see something like this:</p>
<pre><code class="language-txt"> t_xmin | t_xmax | t_ctid | id | account_name | balance 
--------+--------+--------+----+--------------+---------
   1847 |      0 | (0,1)  |  1 | Alice        | 1000
</code></pre>
<p>Look at that t_xmin field. It's 1847—the transaction ID we just saw. This tuple was created by transaction 1847. The t_xmax is 0, meaning no transaction has deleted it yet.</p>
<p>Now let's do an update and see what happens:</p>
<pre><code class="language-sql">-- In the same transaction
UPDATE mvcc_demo SET balance = 900 WHERE id = 1;

-- Look at the page again
SELECT t_xmin, t_xmax, t_ctid, id, account_name, balance 
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
</code></pre>
<pre><code class="language-txt"> t_xmin | t_xmax | t_ctid | id | account_name | balance 
--------+--------+--------+----+--------------+---------
   1847 |   1847 | (0,2)  |  1 | Alice        | 1000
   1847 |      0 | (0,2)  |  1 | Alice        |  900
</code></pre>
<p>Two tuples now! The old version has t_xmax=1847 (my transaction deleted it) and t_ctid=(0,2) pointing to the new version. The new version has t_xmin=1847 (my transaction created it). Both versions exist on disk simultaneously.</p>
<pre><code class="language-sql">COMMIT;

-- Check the snapshot from outside this transaction
SELECT txid_current_snapshot();
</code></pre>
<p>Suppose this returns <code>1847:1848:</code>. Let me decode it for you. The snapshot format is <code>xmin:xmax:xip_list</code>. Here, xmin=1847 (the oldest transaction that was active when this snapshot was taken), xmax=1848 (the next XID to be assigned), and the xip_list is empty (no transactions are currently in progress). This snapshot is the key to everything. It's how PostgreSQL knows which tuple versions you're allowed to see.</p>
<h3>Snapshots</h3>
<p>A snapshot is a point-in-time view of which transactions are visible to you. Think of it as a photograph of the transaction ID space at the moment your query (or transaction) begins.</p>
<p>Let me demonstrate this with two concurrent sessions. Open two terminal windows and follow along.</p>
<p><strong>Session A</strong></p>
<pre><code class="language-sql">BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current();
</code></pre>
<p>Let's say this returns 2000.</p>
<pre><code class="language-sql">SELECT txid_current_snapshot();
</code></pre>
<p>Output: <code>2000:2001:</code></p>
<p>This means: "I am transaction 2000. The next transaction will be 2001. No other transactions are running right now". Now, while keeping Session A open, go to Session B:</p>
<p><strong>Session B</strong></p>
<pre><code class="language-sql">BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current();
</code></pre>
<p>Returns: 2001</p>
<pre><code class="language-sql">SELECT txid_current_snapshot();
</code></pre>
<p>Output: <code>2000:2002:2000</code></p>
<p>Read this as: "I am transaction 2001. The next transaction will be 2002. Transaction 2000 is currently in progress." Now, in Session B, let's insert some data:</p>
<p><strong>Session B</strong></p>
<pre><code class="language-sql">INSERT INTO mvcc_demo VALUES (2, 'Bob', 500);
SELECT * FROM mvcc_demo;
</code></pre>
<p>You'll see both Alice (from earlier) and Bob. Session B can see its own insert immediately.</p>
<pre><code class="language-plsql">COMMIT;
</code></pre>
<p>Session B commits. Now let's go back to Session A:</p>
<p><strong>Session A</strong></p>
<pre><code class="language-sql">SELECT * FROM mvcc_demo;
</code></pre>
<p>You'll only see Alice! Bob doesn't appear. Why? Because Session A's snapshot was taken before transaction 2001 existed. Even though 2001 has committed, Session A captured a snapshot at the beginning of its transaction that said, "I can't see anything from transaction 2001 or higher."</p>
<p>This is snapshot isolation in action. Session A sees a frozen view of the database as it existed when the transaction started.</p>
<p>Now let's see what happens if we change the isolation level to READ COMMITTED:</p>
<p><strong>Session C (NEW WINDOW)</strong></p>
<pre><code class="language-sql">BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT txid_current_snapshot();
</code></pre>
<p>Output might be: <code>2000:2003:2000</code></p>
<pre><code class="language-sql">SELECT * FROM mvcc_demo;
</code></pre>
<p>You see both Alice and Bob! Why? Because in READ COMMITTED mode, PostgreSQL takes a new snapshot at the start of each SQL statement, not at the start of the transaction. The commit from Session B happened before this SELECT started, so it's visible.</p>
<p>This is the difference between REPEATABLE READ and READ COMMITTED:</p>
<ul>
<li><p><strong>REPEATABLE READ</strong>: One snapshot for the entire transaction</p>
</li>
<li><p><strong>READ COMMITTED</strong>: New snapshot for each statement</p>
</li>
</ul>
<p>Let me show you the snapshot structure more precisely. When PostgreSQL creates a snapshot, it builds a small data structure in memory:</p>
<pre><code class="language-c">typedef struct SnapshotData {
    TransactionId xmin;    // Oldest XID still active
    TransactionId xmax;    // Next XID to be assigned
    uint32 xcnt;           // Number of XIDs in xip[]
    TransactionId *xip;    // Array of in-progress XIDs
} SnapshotData;
</code></pre>
<p>When transaction 2000 takes its snapshot while 2001 and 2005 are running (note that, as in the <code>txid_current_snapshot()</code> outputs above, your own XID is not listed in the in-progress array):</p>
<pre><code class="language-txt">xmin: 2000
xmax: 2010  (next to be assigned)
xcnt: 2
xip: [2001, 2005]
</code></pre>
<p>Now when you look at a tuple with t_xmin=2003, PostgreSQL asks:</p>
<ol>
<li><p>Is 2003 &lt; xmin (2000)? No.</p>
</li>
<li><p>Is 2003 &gt;= xmax (2010)? No.</p>
</li>
<li><p>Is 2003 in the xip array? No.</p>
</li>
<li><p>Therefore, check CLOG to see if 2003 committed.</p>
</li>
</ol>
<p>If CLOG says "committed," the tuple is visible. If "aborted," it's not. If "in progress," it's not visible either (unless it's your own transaction). This is how visibility works: every tuple read triggers this check against your snapshot.</p>
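<p>The four-step check above can be sketched in code. This is a simplified model of the real visibility logic (<code>HeapTupleSatisfiesMVCC</code> in PostgreSQL's source, which also uses hint bits to skip CLOG lookups); the types and the CLOG callback here are illustrative:</p>

```typescript
type Xid = number;

interface Snapshot {
  xmin: Xid;   // oldest XID still active when the snapshot was taken
  xmax: Xid;   // next XID to be assigned at snapshot time
  xip: Xid[];  // other XIDs in progress at snapshot time
}

type ClogStatus = 'committed' | 'aborted' | 'in_progress';

// Would the work of transaction `xid` be visible under `snap`?
// `clog` stands in for PostgreSQL's commit log lookup.
function xidVisible(xid: Xid, snap: Snapshot, clog: (x: Xid) => ClogStatus): boolean {
  if (xid < snap.xmin) {
    // Finished before any active transaction; CLOG says whether it committed.
    return clog(xid) === 'committed';
  }
  if (xid >= snap.xmax) return false;        // started after the snapshot was taken
  if (snap.xip.includes(xid)) return false;  // in progress at snapshot time
  return clog(xid) === 'committed';          // otherwise, ask the CLOG
}
```

<p>With the snapshot above (xmin=2000, xmax=2010, xip=[2001, 2005]), a tuple with t_xmin=2003 is visible only if CLOG says 2003 committed, while 2005 (in progress) and anything at or beyond 2010 are invisible.</p>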
]]></content:encoded></item><item><title><![CDATA[PostgreSQL TOAST Storage Models]]></title><description><![CDATA[The main problem TOAST solves is fundamental : postgres pages are of 8kb, and tuple must fit within that page. so what happens if you try to insert 1MB of text field ?
Without toast you would get an error: "Row too big". With toast PostgreSQL handles...]]></description><link>https://rabindranath-tiwari.com.np/postgresql-toast-storage-models</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/postgresql-toast-storage-models</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[toast]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Thu, 19 Feb 2026 03:36:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771472166050/8a04235c-9b5c-4c37-9b62-668639c63287.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The main problem TOAST solves is fundamental: PostgreSQL pages are 8 KB, and a tuple must fit within a single page. So what happens if you try to insert a 1 MB text field?</p>
<p>Without TOAST you would get a "row too big" error. With TOAST, PostgreSQL handles the large value by compressing it and, if it is still too large, breaking it into chunks stored in a separate TOAST table.</p>
<p>Here is how it works. Every table with at least one potentially large column gets an associated TOAST table. You don't see these tables in your normal schema; they live in the special <code>pg_toast</code> schema and have auto-generated names like <code>pg_toast_16385</code>. Each TOAST table has the same structure:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> pg_toast.pg_toast_16385 (
    chunk_id <span class="hljs-keyword">OID</span>,
    chunk_seq <span class="hljs-built_in">INT</span>,
    chunk_data BYTEA
);
</code></pre>
<p>When you insert a row with a large column, PostgreSQL first tries to fit the entire row in the page. If it doesn't fit, it tries compressing the large columns using the pglz algorithm (or lz4 in newer versions). If a 10KB text field compresses down to 1KB, great—it now fits inline, and no TOAST table is needed.</p>
<p>But if after compression it's still too large, PostgreSQL takes the compressed data and chunks it into ~2KB pieces, inserts those pieces into the TOAST table, and replaces the large column in the main tuple with an 18-byte TOAST pointer. That pointer contains the <code>chunk_id</code> and enough metadata to reassemble and decompress the original value.</p>
<p>This happens completely transparently. When you query the column, PostgreSQL sees the TOAST pointer, reads the chunks from the TOAST table, reassembles them, decompresses, and returns the value. You don't have to do anything special.</p>
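<p>The chunk-and-reassemble mechanics can be modeled in a few lines. The real chunk size is <code>TOAST_MAX_CHUNK_SIZE</code> (just under 2 KB, commonly 1996 bytes with the default 8 KB page size); the shapes below mirror the TOAST table's <code>(chunk_seq, chunk_data)</code> pair and are purely illustrative:</p>

```typescript
const CHUNK_SIZE = 1996; // just under 2 KB, as with the default 8 KB page size

// Split a large value into fixed-size chunks tagged with a sequence number,
// the way TOAST rows carry (chunk_id, chunk_seq, chunk_data).
function toChunks(data: string): { seq: number; chunk: string }[] {
  const chunks: { seq: number; chunk: string }[] = [];
  for (let i = 0; i * CHUNK_SIZE < data.length; i++) {
    chunks.push({ seq: i, chunk: data.slice(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE) });
  }
  return chunks;
}

// Reassembly: order by sequence number and concatenate.
function fromChunks(chunks: { seq: number; chunk: string }[]): string {
  return chunks
    .slice()
    .sort((a, b) => a.seq - b.seq)
    .map((c) => c.chunk)
    .join('');
}
```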
<p>But there are performance implications. Fetching a TOASTed column is much more expensive than fetching an inline column. If the main table is cached in memory but the TOAST table isn't, you'll incur I/O. If the value is split into 100 chunks, you're doing 100 additional tuple fetches.</p>
<p>This is why the golden rule of PostgreSQL performance is: <strong>only SELECT the columns you need</strong>. If you do <code>SELECT *</code> on a table with a large TEXT column, you'll fetch and decompress that column even if you don't use it. If you do <code>SELECT id, name</code> instead, the TOAST column is never touched.</p>
<p>You can control TOAST behavior per column with the <code>SET STORAGE</code> option:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> mytable <span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">COLUMN</span> mycolumn <span class="hljs-keyword">SET</span> <span class="hljs-keyword">STORAGE</span> <span class="hljs-keyword">EXTERNAL</span>;
</code></pre>
<h3 id="heading-toast-storage-models">TOAST Storage Models</h3>
<hr />
<p><strong>PLAIN</strong> means no TOAST at all. This is for fixed-size types that can't be TOASTed, e.g. <code>INT</code>.</p>
<p><strong>EXTENDED</strong> is the default. It tries compression first, and if the value is still too large, moves it out of line to the TOAST table.</p>
<p><strong>EXTERNAL</strong> means skip compression but move out of line if needed. This is useful when the data is already compressed on the client side, such as a JPEG image; compressing it again at the database level would waste CPU cycles.</p>
<p><strong>MAIN</strong> means prefer to keep the value inline. Try compression, and only move out-of-line as a last resort. This is useful for frequently accessed columns where you want to avoid TOAST overhead.</p>
<h3 id="heading-toast-in-action">TOAST in Action</h3>
<hr />
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> toast_test(<span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>, small_data <span class="hljs-built_in">TEXT</span>, large_data <span class="hljs-built_in">TEXT</span>);
<span class="hljs-comment">-- small value stored inline</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> toast_test <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'hello'</span>, <span class="hljs-string">'hello'</span>);
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">oid</span>, reltoastrelid::regclass <span class="hljs-keyword">FROM</span> pg_class <span class="hljs-keyword">WHERE</span> relname = <span class="hljs-string">'toast_test'</span>;
</code></pre>
<p>You'll see something like <code>toast_test</code> and <code>pg_toast.pg_toast_16385</code>. Now:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Check TOAST table (should be empty)</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">FROM</span> pg_toast.pg_toast_16385;
</code></pre>
<p>Zero rows, because we haven't inserted anything large. Now:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Large value: triggers TOAST</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> toast_test <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">2</span>, <span class="hljs-string">'small'</span>, <span class="hljs-keyword">repeat</span>(<span class="hljs-string">'x'</span>, <span class="hljs-number">10000</span>));

<span class="hljs-comment">-- Check TOAST table again</span>
<span class="hljs-keyword">SELECT</span> chunk_id, chunk_seq, <span class="hljs-keyword">length</span>(chunk_data) 
<span class="hljs-keyword">FROM</span> pg_toast.pg_toast_16385 
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> chunk_id, chunk_seq;
</code></pre>
<p>Now you'll see several rows, each with about 2000 bytes of data, representing the chunks of your 10KB value.</p>
<p>TOAST has an interesting interaction with VACUUM. When you update or delete a row that has TOASTed columns, the old version's TOAST chunks remain in the TOAST table until VACUUM runs. If you do a lot of updates, your TOAST table can become bloated with unreferenced chunks. This is why monitoring TOAST table size is important for tables with large columns.</p>
]]></content:encoded></item><item><title><![CDATA[Tuple Layout]]></title><description><![CDATA[When you insert a new row into a table, postgreSQL creates a tuple, a contiguous chunk of memory bytes that contains both system metadata and your actual data.
A tuple starts with a 23 byte header. Every single tuple in every table has this same head...]]></description><link>https://rabindranath-tiwari.com.np/tuple-layout</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/tuple-layout</guid><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Wed, 18 Feb 2026 17:19:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771435105070/ffb33977-c32f-47e6-bba4-2668a5768cd2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you insert a new row into a table, PostgreSQL creates a tuple: a contiguous chunk of bytes that contains both system metadata and your actual data.</p>
<p>A tuple starts with a 23-byte header. Every single tuple in every table has this same header structure, regardless of how many columns you have or which data types you use. These 23 bytes are pure overhead, which is why storing lots of tiny rows (say, a two-column table with a smallint and a boolean) is inefficient—you're spending more space on metadata than on data.</p>
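<p>That overhead claim is easy to quantify. A sketch, assuming a 64-bit build (MAXALIGN of 8) and counting only per-tuple costs, not the page header:</p>
<pre><code class="lang-python">MAXALIGN = 8       # assumption: 64-bit build
HEADER = 23        # fixed tuple header size
LINE_POINTER = 4   # per-tuple entry in the page's line pointer array

def align(n, a=MAXALIGN):
    return -(-n // a) * a        # round up to the next multiple of a

data = 2 + 1                                # smallint + boolean payload
tuple_size = align(align(HEADER) + data)    # header padded to 24, whole tuple to 32
per_row = tuple_size + LINE_POINTER
print(per_row)                              # 36 bytes on disk for 3 bytes of user data
</code></pre>
<p>Three bytes of payload cost 36 bytes on disk: more than 90 percent overhead.</p>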
<h3 id="heading-tuple-headers">Tuple Headers</h3>
<p><strong>t_xmin</strong></p>
<p>The first 4 bytes are <code>t_xmin</code>, a transaction ID. This is the ID of the transaction that created this tuple. When you run <code>INSERT INTO users VALUES (1, 'Alice')</code> inside transaction 1000, the resulting tuple gets <code>t_xmin=1000</code>.</p>
<p>This field never changes. Ever. Even if the tuple is later updated or deleted, <code>t_xmin</code> remains the ID of the transaction that originally created it. This is the foundation of MVCC. To determine if a tuple is visible to your transaction, PostgreSQL looks at <code>t_xmin</code> and asks: "Was this transaction committed before my snapshot was taken? Am I allowed to see tuples it created?"</p>
<p><strong>t_xmax</strong></p>
<p>The next 4 bytes are <code>t_xmax</code>, which is more complicated. In the simple case, if the tuple has never been deleted or locked, <code>t_xmax</code> is zero. But if a transaction deletes this tuple, <code>t_xmax</code> gets set to that transaction's ID.</p>
<p>Here's where it gets subtle: <code>t_xmax</code> can also mean "this tuple is locked" (as in <code>SELECT ... FOR UPDATE</code>) rather than deleted. How do you tell the difference? You have to look at the <code>t_infomask</code> flags. If <code>HEAP_XMAX_LOCK_ONLY</code> is set, then <code>t_xmax</code> is a lock, not a deletion. If <code>HEAP_UPDATED</code> is set, this was an UPDATE (so there's a newer version somewhere). If neither is set, it's a plain DELETE.</p>
<p>And there's yet another case: if multiple transactions lock the same tuple concurrently (for example, multiple <code>SELECT ... FOR SHARE</code> statements), <code>t_xmax</code> doesn't hold a transaction ID at all. Instead, it holds a "MultiXactId," which is an ID into a separate structure (<code>pg_multixact</code>) that stores a list of transaction IDs. The <code>HEAP_XMAX_IS_MULTI</code> flag in <code>t_infomask</code> tells you this has happened.</p>
<p>This overloading of <code>t_xmax</code> is a clever space optimization, but it makes the visibility logic quite complex.</p>
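<p>To make the roles of <code>t_xmin</code> and <code>t_xmax</code> concrete, here is a deliberately oversimplified visibility check. This is a sketch, not PostgreSQL's actual algorithm: it ignores hint bits, in-progress transactions, locks, MultiXacts, and the real snapshot structure, and a plain set stands in for a CLOG lookup:</p>
<pre><code class="lang-python">def tuple_is_visible(t_xmin, t_xmax, snapshot_xid, committed):
    # 'committed' is a set of transaction IDs, standing in for the CLOG.
    # The creating transaction must be committed and older than our snapshot...
    if t_xmin not in committed or t_xmin >= snapshot_xid:
        return False
    # ...and any deleting transaction must NOT be visible to us.
    if t_xmax != 0 and t_xmax in committed and snapshot_xid > t_xmax:
        return False
    return True

committed = {1000, 1005}
print(tuple_is_visible(1000, 0, 1010, committed))     # True: created, never deleted
print(tuple_is_visible(1000, 1005, 1010, committed))  # False: deletion is visible to us
</code></pre>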
<p><strong>t_ctid</strong></p>
<p>After a 4-byte command ID field (<code>t_cid</code>) come 6 bytes for <code>t_ctid</code>, a tuple identifier consisting of a page number (4 bytes) and a line pointer number (2 bytes). This field serves a dual purpose.</p>
<p>First, it's the tuple's own physical address, often called the TID (tuple identifier). If you run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> ctid, * <span class="hljs-keyword">FROM</span> <span class="hljs-keyword">users</span>;
</code></pre>
<p>You'll see values like <code>(0,1)</code>, meaning page 0, line pointer 1. This is how indexes refer to tuples—they store the TID.</p>
<p>Second, <code>t_ctid</code> is used for update chains. When a tuple is the current version (hasn't been updated), its <code>t_ctid</code> points to itself: <code>(0,1)</code> points to <code>(0,1)</code>. But when a tuple is updated, the old version's <code>t_ctid</code> gets changed to point to the new version. This creates a chain:</p>
<pre><code class="lang-plaintext">Old version at (0,1): t_ctid = (0,2)
New version at (0,2): t_ctid = (0,2)  [self-pointer]
</code></pre>
<p>If you update again:</p>
<pre><code class="lang-plaintext">Old v1 at (0,1): t_ctid = (0,2)
Old v2 at (0,2): t_ctid = (0,3)
Current at (0,3): t_ctid = (0,3)
</code></pre>
<p>This chain allows PostgreSQL to follow updates. An index points to <code>(0,1)</code>. When you look up that TID, you find an old version with <code>t_ctid=(0,2)</code>, so you follow the chain to <code>(0,2)</code>, then to <code>(0,3)</code>, where you find the current version.</p>
<p>Long update chains are a performance problem. If a tuple has been updated 100 times, you have to follow 100 hops to reach the current version. This is one reason why HOT (Heap-Only Tuple) updates are so valuable—they keep the chain short and on the same page.</p>
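<p>The chain walk above can be sketched in a few lines. The toy heap below maps each TID to its <code>t_ctid</code> and payload; a self-pointing <code>t_ctid</code> marks the current version, exactly as in the diagrams:</p>
<pre><code class="lang-python"># toy heap: TID -> (t_ctid, payload); a self-pointer marks the current version
heap = {
    (0, 1): ((0, 2), "v1"),
    (0, 2): ((0, 3), "v2"),
    (0, 3): ((0, 3), "v3"),
}

def follow_chain(tid):
    hops = 0
    while heap[tid][0] != tid:   # keep walking until we hit the self-pointer
        tid = heap[tid][0]
        hops += 1
    return heap[tid][1], hops

print(follow_chain((0, 1)))      # ('v3', 2): two hops from the stale index TID
</code></pre>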
<p><strong>t_infomask</strong></p>
<p>Now we come to the most complex field in the tuple header: <code>t_infomask</code>, a 2-byte bitmap containing 16 Boolean flags. This is where PostgreSQL packs a huge amount of state information.</p>
<p>Some flags describe the tuple's data layout. <code>HEAP_HASNULL</code> (bit 0) means at least one column is NULL, so there's a null bitmap after the tuple header. <code>HEAP_HASVARWIDTH</code> (bit 1) means there are variable-length columns. <code>HEAP_HASEXTERNAL</code> (bit 2) means at least one column is stored out-of-line in a TOAST table.</p>
<p>Other flags describe the tuple's MVCC state. <code>HEAP_XMAX_LOCK_ONLY</code> (bit 7) means <code>t_xmax</code> is a lock, not a delete. <code>HEAP_UPDATED</code> (bit 13) means this tuple was updated, so there's a newer version. <code>HEAP_XMAX_IS_MULTI</code> (bit 12) means <code>t_xmax</code> is a MultiXactId.</p>
<p>But the most important flags are the hint bits: <code>HEAP_XMIN_COMMITTED</code> (bit 8), <code>HEAP_XMIN_INVALID</code> (bit 9), <code>HEAP_XMAX_COMMITTED</code> (bit 10), and <code>HEAP_XMAX_INVALID</code> (bit 11). Understanding hint bits is essential.</p>
<p>Here's the problem they solve: To determine if a tuple is visible, we need to know whether <code>t_xmin</code> and <code>t_xmax</code> are committed or aborted. This information lives in the CLOG (commit log), also called <code>pg_xact</code>. The CLOG is on disk (or maybe cached in memory), and checking it requires I/O. If we had to check the CLOG for every tuple we examine during a query, performance would be terrible.</p>
<p>Hint bits cache this information directly in the tuple. The first time someone checks whether transaction 1000 is committed, they look it up in the CLOG. If it's committed, they set the <code>HEAP_XMIN_COMMITTED</code> bit in the tuple's <code>t_infomask</code> and mark the page dirty. From that point on, anyone who looks at this tuple sees the hint bit and knows immediately that transaction 1000 is committed, without having to touch the CLOG.</p>
<p>This has a fascinating consequence: a SELECT query can cause writes. If you run a big INSERT, creating millions of new tuples, and then immediately run a SELECT that scans the table, that SELECT will be the first to check the visibility of each tuple. For every tuple, it will look up the transaction in CLOG (probably finding it committed), set the hint bit, and mark the page dirty. Eventually, those dirty pages get written to disk. Your SELECT just triggered a write of the entire table.</p>
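<p>A back-of-envelope estimate of this write amplification; the rows-per-page figure is a made-up but typical value for a narrow OLTP row:</p>
<pre><code class="lang-python">rows = 10_000_000
rows_per_page = 100              # hypothetical: a narrow row, ~80 bytes each
PAGE_SIZE = 8192

pages = -(-rows // rows_per_page)            # pages the SELECT must visit
mb_dirtied = pages * PAGE_SIZE / 1024**2     # all of them dirtied by hint-bit updates
print(pages, mb_dirtied)                     # 100000 pages, 781.25 MB
</code></pre>
<p>Under these assumptions, the first scan after the bulk load rewrites roughly 780 MB just to record hint bits.</p>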
<p>The solution is to run VACUUM immediately after a bulk insert. VACUUM will proactively set all the hint bits, so subsequent queries won't have to.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> hint_demo (<span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> hint_demo <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>);

<span class="hljs-keyword">SELECT</span> t_infomask <span class="hljs-keyword">FROM</span> heap_page_items(get_raw_page(<span class="hljs-string">'hint_demo'</span>, <span class="hljs-number">0</span>));
</code></pre>
<p>You might see <code>t_infomask = 2818</code>, which is <code>0x0B02</code> in hex. Let's decode that:</p>
<pre><code class="lang-txt">Binary: 0000 1011 0000 0010
Bit 1: HEAP_HASVARWIDTH (set)
Bit 8: HEAP_XMIN_COMMITTED (set)
Bit 9: HEAP_XMIN_INVALID (set)
Bit 11: HEAP_XMAX_INVALID (set)
</code></pre>
<p>Wait, both <code>HEAP_XMIN_COMMITTED</code> and <code>HEAP_XMIN_INVALID</code> are set? That seems contradictory. It isn't: since PostgreSQL 9.4, this particular combination is defined as <code>HEAP_XMIN_FROZEN</code>, meaning the tuple has been frozen—it is treated as committed and visible to every transaction, no matter how old its <code>t_xmin</code> is. <code>HEAP_XMIN_INVALID</code> on its own, by contrast, means the creating transaction aborted, so the tuple is garbage and will never be visible to anyone.</p>
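<p>You can reproduce this decoding yourself. The mask values below come from PostgreSQL's <code>htup_details.h</code> (only the flags discussed above are listed):</p>
<pre><code class="lang-python">FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
}

def decode_infomask(mask):
    return [name for bit, name in FLAGS.items() if mask & bit]

print(decode_infomask(2818))
# ['HEAP_HASVARWIDTH', 'HEAP_XMIN_COMMITTED', 'HEAP_XMIN_INVALID', 'HEAP_XMAX_INVALID']
</code></pre>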
<p><strong>t_infomask2</strong></p>
<p>There's a second infomask field, <code>t_infomask2</code>, which is also 2 bytes. The lower 11 bits store the number of attributes (columns) in this tuple, allowing up to 2047 columns per table. The upper bits are flags related to HOT updates: <code>HEAP_HOT_UPDATED</code> (bit 14) and <code>HEAP_ONLY_TUPLE</code> (bit 15).</p>
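<p>Unpacking <code>t_infomask2</code> is the same kind of bit arithmetic, using the masks from <code>htup_details.h</code>; the value decoded here is hypothetical:</p>
<pre><code class="lang-python">HEAP_NATTS_MASK  = 0x07FF   # low 11 bits: number of attributes
HEAP_HOT_UPDATED = 0x4000   # the next version of this row is a HOT tuple
HEAP_ONLY_TUPLE  = 0x8000   # this tuple has no direct index entry

def decode_infomask2(mask):
    return (mask & HEAP_NATTS_MASK,
            bool(mask & HEAP_HOT_UPDATED),
            bool(mask & HEAP_ONLY_TUPLE))

# hypothetical value: a 3-column heap-only tuple
print(decode_infomask2(0x8003))   # (3, False, True)
</code></pre>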
]]></content:encoded></item><item><title><![CDATA[PostgreSQL Page Structure - (Slotted Pages)]]></title><description><![CDATA[From the previous blog we knew that tables are made up of 8KB pages, lets crack open a page and see what's really inside it. This is where the things get interesting because jargons like MVCC, tuple storage, visiblity rules, freespace is dependent on...]]></description><link>https://rabindranath-tiwari.com.np/postgresql-page-structure-slotted-pages</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/postgresql-page-structure-slotted-pages</guid><category><![CDATA[slotted pages]]></category><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Wed, 18 Feb 2026 15:45:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771429478968/c6aa0249-fc0b-48db-bb61-f20d425be385.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>From the previous blog we know that tables are made up of 8KB pages, so let's crack open a page and see what's really inside it. This is where things get interesting, because concepts like MVCC, tuple storage, visibility rules, and free space all depend on how pages work.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771429358182/483b3533-f5eb-4cdc-a368-270a70e0c86b.png" alt /></p>
<p>This two-ended growth pattern is elegant. Line pointers grow down from the top, tuples grow up from the bottom, and free space is always the contiguous region in the middle. Two pointers in the page header—<code>pd_lower</code> and <code>pd_upper</code>—track the boundaries. Free space is simply <code>pd_upper - pd_lower</code>. Want to know if a new tuple fits? It's an O(1) subtraction.</p>
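<p>The free-space check really is a single subtraction. A sketch with hypothetical <code>pd_lower</code>/<code>pd_upper</code> values (24 bytes is the fixed page header size; alignment padding is ignored here):</p>
<pre><code class="lang-python">PAGE_HEADER = 24                 # fixed PageHeaderData size
LINE_POINTER = 4

def free_space(pd_lower, pd_upper):
    return pd_upper - pd_lower   # O(1): no scanning required

def tuple_fits(pd_lower, pd_upper, tuple_len):
    # a new tuple needs its own bytes plus one new line pointer
    return free_space(pd_lower, pd_upper) >= tuple_len + LINE_POINTER

# hypothetical page: 5 line pointers so far, tuples filling the tail
pd_lower = PAGE_HEADER + 5 * LINE_POINTER   # 44
pd_upper = 7800
print(free_space(pd_lower, pd_upper))        # 7756 bytes free
print(tuple_fits(pd_lower, pd_upper, 8000))  # False: too big for this page
</code></pre>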
<h3 id="heading-description-of-important-headers">Description Of Important Headers</h3>
<hr />
<p><strong>Log Sequence Number (LSN)</strong></p>
<p>The Log Sequence Number comes first, occupying 8 bytes. Every time the page is modified, PostgreSQL writes a record to the WAL (write-ahead log) and assigns it a unique identifier, the LSN. That LSN is copied into the page header. The main reason for this is crash recovery. When PostgreSQL crashes and restarts, it replays the WAL starting from the last checkpoint. For each WAL record, it reads the corresponding page from disk and checks: is the page's LSN less than this WAL record's LSN? If yes, apply the change. If no, skip it—the page already has this change.</p>
<p><strong>Checksum</strong></p>
<p>The checksum header occupies 2 bytes. If you initialize your cluster with <code>initdb --data-checksums</code>, PostgreSQL calculates a checksum over the entire page every time it writes the page to disk. When reading the page back, it recalculates and compares; if they don't match, PostgreSQL raises a data-corruption error instead of silently returning bad data. The main downside is roughly 2-5 percent of CPU overhead for the checksum calculation and comparison.</p>
<p><strong>Flag</strong></p>
<p>The flags header also occupies 2 bytes. It is a bitmap of boolean properties of the page. The most important flag is <code>PD_ALL_VISIBLE</code>, which mirrors the visibility map and tells us that every tuple in this page is visible to all transactions.</p>
<p><strong>pd_lower and pd_upper</strong></p>
<p>Each of these fields takes 2 bytes. <code>pd_lower</code> points to the end of the line pointer array, which is the start of the free space; <code>pd_upper</code> points to the start of the tuple area, which is the end of the free space. When a new tuple is inserted, <code>pd_upper</code> moves toward the start of the page, while <code>pd_lower</code> moves toward the end (higher addresses) as line pointers are added.</p>
<p><strong>pd_special field</strong></p>
<p>The pd_special (2 bytes) points to a special area at the end of the page. For heap pages, this is unused and just points to byte 8192 (the end of the page). But for index pages, it points to index-specific metadata. B-tree pages, for example, store pointers to left and right sibling pages in the special area, plus the tree level. This allows the page header format to be generic while still supporting specialized needs.</p>
<p><strong>pd_pagesize_version</strong></p>
<p>The pd_pagesize_version field (2 bytes) encodes both the page size and the page layout version. This is important for pg_upgrade and for debugging. If PostgreSQL loads a page and sees an unexpected version number, it knows the page might have been written by an older or newer version with a different tuple layout. This prevents silent corruption when mixing versions.</p>
<p><strong>pd_prune_xid</strong></p>
<p>Finally, pd_prune_xid (4 bytes) stores the oldest transaction ID that deleted or updated a tuple on this page. This is used for HOT (Heap-Only Tuple) pruning, which we'll cover in depth in the MVCC module. For now, just know that it's an optimization hint that helps PostgreSQL decide whether it's worth trying to clean up dead tuples on this page.</p>
<h2 id="heading-line-pointers-the-indirection-layer">Line Pointers : The Indirection Layer</h2>
<hr />
<p>Each line pointer is exactly 4 bytes and points to some tuple within the page. This level of indirection is absolutely fundamental to how PostgreSQL works.</p>
<p>When I first learned this concept in university, I was overwhelmed by the slotted array (the line pointer array) and everything around it. The intuition finally clicked after I took the database course by Andy Pavlo (CMU). Okay, now to the question: why indirection? Consider what happens without it. An index entry for a B-tree index contains a pointer to a heap tuple. If that pointer is a direct byte offset into the page—say, "the tuple is at byte 1000"—then what happens if we need to move the tuple within the page? Maybe we're doing a HOT update, or maybe we're compacting the page to defragment free space. If we move the tuple, we have to update every index entry that points to it. That's expensive and complex.</p>
<p>Line pointers solve this. An index entry doesn't point to byte 1000. It points to "page 5, line pointer 3." The line pointer itself points to byte 1000. If we need to move the tuple, we just update the line pointer. All the indexes remain valid without any changes. This is why line pointers are sometimes called "item pointers" or "item IDs"—they're stable identifiers for tuples.</p>
<h3 id="heading-each-line-pointer-packs-three-pieces-of-information">Each Line Pointer Packs Three Pieces of Information</h3>
<pre><code class="lang-c"><span class="hljs-keyword">typedef</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">ItemIdData</span> {</span>
    <span class="hljs-keyword">unsigned</span> lp_off:<span class="hljs-number">15</span>;     <span class="hljs-comment">/* Offset to tuple (0-32767) */</span>
    <span class="hljs-keyword">unsigned</span> lp_flags:<span class="hljs-number">2</span>;    <span class="hljs-comment">/* Status flags (4 states) */</span>
    <span class="hljs-keyword">unsigned</span> lp_len:<span class="hljs-number">15</span>;     <span class="hljs-comment">/* Tuple length (0-32767) */</span>
} ItemIdData;
</code></pre>
<p><code>lp_off</code> is an offset that stores the starting byte of the tuple within the page.</p>
<p><code>lp_len</code> stores the length of the tuple in bytes.</p>
<p><code>lp_flags</code> is where things get interesting. With 2 bits we can represent 4 different states.</p>
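<p>You can mimic the bitfield packing in Python to see how all three fields share a single 4-byte word. The exact bit order in the C struct is compiler-dependent; this sketch simply puts <code>lp_off</code> in the low 15 bits:</p>
<pre><code class="lang-python"># assumed layout: lp_off in bits 0-14, lp_flags in bits 15-16, lp_len in bits 17-31
OFF_BITS = 2**15
FLAG_BITS = 2**17

def pack(lp_off, lp_flags, lp_len):
    return lp_off + (lp_flags % 4) * OFF_BITS + (lp_len % 2**15) * FLAG_BITS

def unpack(item):
    return item % OFF_BITS, (item // OFF_BITS) % 4, item // FLAG_BITS

LP_NORMAL = 1
item = pack(7800, LP_NORMAL, 36)   # a 36-byte tuple stored at byte offset 7800
print(unpack(item))                # (7800, 1, 36)
</code></pre>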
<h3 id="heading-lpflags-internals">lp_flags internals</h3>
<p><code>LP_UNUSED</code> (0) means this line pointer has never been used, or it was used and then reclaimed by VACUUM. It is available for reuse.</p>
<p><code>LP_NORMAL</code> (1) means this line pointer points to an actual tuple. The tuple might be a live version or a dead one waiting for <code>VACUUM</code>; we cannot tell from the line pointer alone—we have to look at the header of that particular tuple.</p>
<p><code>lp_redirect</code> (2) is special. It means this line pointer doesn't point to tuple at all. Instead, it points to another line pointer. This happens during HOT updates. When a tuple is updated in a way that doesn't affect any indexed columns, PostgreSQL can create the new version on the same page and set up a redirect: "The tuple you're looking for is now at line pointer 5." Indexes still point to the original line pointer, and they automatically follow the redirect. This saves having to update every index.</p>
<p><code>lp_dead</code> (3) means the tuple this line pointer used to point to is definitely dead—no transaction can see it anymore—but VACUUM hasn't reclaimed the space yet. This is useful during index scans. If an index points to a dead line pointer, we can immediately skip it without having to fetch and examine the tuple.</p>
<p>Understanding line pointer states is essential for understanding tuple lifecycle and MVCC. A common mistake is thinking that when you delete a row, it's immediately removed. It's not. The tuple stays right where it is. The line pointer might eventually be marked LP_DEAD, but even then, the tuple data is still there, occupying space, until VACUUM runs.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> EXTENSION pageinspect;
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> lp_demo (<span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>, <span class="hljs-keyword">data</span> <span class="hljs-built_in">TEXT</span>);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> lp_demo <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'version1'</span>);

<span class="hljs-keyword">SELECT</span> lp, lp_off, lp_flags, lp_len 
<span class="hljs-keyword">FROM</span> heap_page_items(get_raw_page(<span class="hljs-string">'lp_demo'</span>, <span class="hljs-number">0</span>));
</code></pre>
<h3 id="heading-again">Again</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">UPDATE</span> lp_demo <span class="hljs-keyword">SET</span> <span class="hljs-keyword">data</span> = <span class="hljs-string">'version2'</span> <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">id</span> = <span class="hljs-number">1</span>;

<span class="hljs-keyword">SELECT</span> lp, lp_off, lp_flags, lp_len 
<span class="hljs-keyword">FROM</span> heap_page_items(get_raw_page(<span class="hljs-string">'lp_demo'</span>, <span class="hljs-number">0</span>));
</code></pre>
<p>Now observe the differences between the two outputs. In the next blog we will get our hands dirty again, this time with the tuple layout.</p>
]]></content:encoded></item><item><title><![CDATA[Postgres OID VS Relfilenode]]></title><description><![CDATA[When you create a table, PostgreSQL assigns it an OID (Object Identifier). This is just a logical identifier for the table. it is stored in the system catalog pg_class and remains constant for the lifetime of the table.
But under the hood in physical...]]></description><link>https://rabindranath-tiwari.com.np/postgres-oid-vs-relfilenode</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/postgres-oid-vs-relfilenode</guid><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Wed, 18 Feb 2026 12:58:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771419475567/0c5de8f0-feff-48b8-af87-152f1a362892.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you create a table, PostgreSQL assigns it an OID (Object Identifier). This is just a logical identifier for the table. it is stored in the system catalog <code>pg_class</code> and remains constant for the lifetime of the table.</p>
<p>But under the hood, the physical file on disk is identified by the <code>relfilenode</code>, which is also stored in the system catalog <code>pg_class</code>. The good news is that most of the time the OID and the relfilenode are the same. But they can diverge, and understanding when and why they diverge is the crucial part.</p>
<h3 id="heading-try-this-experiment">Try This Experiment</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> demo(
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>,
    <span class="hljs-keyword">data</span> <span class="hljs-built_in">TEXT</span>
);

<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">oid</span>, relfilenode <span class="hljs-keyword">FROM</span> pg_class <span class="hljs-keyword">WHERE</span> relname = <span class="hljs-string">'demo'</span>;
</code></pre>
<h3 id="heading-again">Again</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">TRUNCATE</span> demo;

<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">oid</span>, relfilenode <span class="hljs-keyword">FROM</span> pg_class <span class="hljs-keyword">WHERE</span> relname = <span class="hljs-string">'demo'</span>;
</code></pre>
<p>Now compare with the earlier output: the OID stayed the same, but the relfilenode changed. Why?</p>
<p>When you truncate a table, PostgreSQL doesn't actually delete all the rows from the existing file. Instead, it creates a brand new file with a new relfilenode and updates the catalog to point to it. The old file is deleted. This is much faster than scanning through the file and marking every tuple as deleted, and it's safer from a crash-recovery perspective—the old file exists until the transaction commits.</p>
<p>The same thing happens with <code>VACUUM FULL</code>, <code>CLUSTER</code>, and <code>REINDEX</code> (for indexes). These operations rewrite the entire table or index, giving it a new physical file. But the OID never changes. This separation between logical identity (OID) and physical storage (relfilenode) allows PostgreSQL to reorganize data without breaking foreign key constraints, views, or permissions, all of which reference the OID.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Heap File Storage]]></title><description><![CDATA[When you execute INSERT INTO users VALUES (1, 'Alice') in PostgreSQL, what actually happens on disk? Where does that data go? How is it organized? Why does a simple SELECTsometimes cause disk writes? These aren't just academic questions—they're the f...]]></description><link>https://rabindranath-tiwari.com.np/understanding-heap-file-storage</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/understanding-heap-file-storage</guid><category><![CDATA[postgres optimization]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[heap]]></category><category><![CDATA[Postgresql-performance ]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Wed, 18 Feb 2026 12:01:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771416013194/b20d04bc-4131-41e9-b86a-a747a6fc3deb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you execute <code>INSERT INTO users VALUES (1, 'Alice')</code> in PostgreSQL, what actually happens on disk? Where does that data go? How is it organized? Why does a simple <code>SELECT</code> sometimes cause disk writes? These aren't just academic questions—they're the foundation for understanding everything from why VACUUM exists to how indexes work to why your queries might be slower than expected.</p>
<p>This module is about pulling back the curtain on PostgreSQL's storage engine. We're going to look at the raw bytes on disk, understand how pages are structured, and see exactly how tuples are laid out in memory. By the end, you'll be able to inspect pages with surgical precision and understand the physical reality behind every SQL operation.</p>
<p>Let's start at the very beginning: the heap.</p>
<h2 id="heading-what-the-heck-is-heap-anyway">What the heck is a heap anyway?</h2>
<p>The term "heap" in database terminology doesn't mean anything fancy—it literally means "pile." When PostgreSQL stores your table data, it piles it up in no particular order. This is fundamentally different from how some other databases work, and understanding this difference is crucial.</p>
<p>Imagine you have a table called <code>users</code> with a primary key on <code>id</code>. In MySQL's InnoDB engine, the actual table data is physically sorted by that primary key. Insert a row with <code>id=100</code>, then <code>id=50</code>, then <code>id=200</code>, and InnoDB will rearrange them on disk to be in order: 50, 100, 200. The table itself is structured as a B-tree sorted by the primary key.</p>
<p>PostgreSQL doesn't do this. When you insert those same three rows, they go onto disk in exactly the order you inserted them: 100, 50, 200. The primary key index is separate—it's just another index that happens to enforce uniqueness. The table itself? Just a heap. A pile of tuples with no inherent order.</p>
<p>This design choice has profound implications. It makes inserts fast because there's no need to find the "right" place to put new data—just throw it wherever there's space. It makes updates more flexible because moving a row doesn't require reshuffling an entire tree structure. But it also means that scanning a table by primary key order requires random I/O, since the rows aren't physically sorted.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771415452587/924afd5e-ffc8-4589-8fd6-8d8df7fed70f.png" alt /></p>
<h2 id="heading-where-it-is-stored-then">Where is it stored then?</h2>
<p>Postgres stores all of your data inside the directory pointed to by the <code>$PGDATA</code> environment variable. Use the following command to see where it is in your case:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-variable">$PGDATA</span>
</code></pre>
<p>Each table's main data file then lives at a path of the form:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$PGDATA</span>/base/{database_oid}/{relfilenode}
</code></pre>
<h2 id="heading-experiment-to-find-your-table-physical-location">Experiment to find your table physical location</h2>
<hr />
<pre><code class="lang-sql"><span class="hljs-comment">-- create demo table</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">test</span>(
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>,
    <span class="hljs-keyword">data</span> <span class="hljs-built_in">TEXT</span>
);

<span class="hljs-comment">-- find the filepath</span>
<span class="hljs-keyword">SELECT</span> pg_relation_filepath(<span class="hljs-string">'test'</span>) <span class="hljs-keyword">as</span> physical_location,
       pg_relation_size(<span class="hljs-string">'test'</span>) <span class="hljs-keyword">as</span> size_in_bytes;
</code></pre>
<h2 id="heading-output-interpretation">Output Interpretation</h2>
<p>You might see something like <code>base/16384/24601</code>. This means your database has OID 16384, and this particular table has been assigned relfilenode 24601. If you navigate to <code>$PGDATA/base/16384/</code>, you'll find a file named <code>24601</code>. That's your table. That file contains all the rows you've inserted, organized into 8KB chunks called pages.</p>
<p>Why 8KB? This is a configurable compile-time option, but 8192 bytes is the default, and there are good reasons for it. It's large enough to hold a reasonable number of rows (typically 50-200 for OLTP workloads) but small enough that reading a page from disk is a single, efficient I/O operation. It aligns well with operating system page sizes, which reduces translation lookaside buffer (TLB) misses and makes memory management more efficient. It's been the default for decades because it represents a good compromise for mixed workloads.</p>
<p>Each table file can grow up to 1GB. Once it exceeds that size, PostgreSQL creates a new segment file named <code>24601.1</code>, then <code>24601.2</code>, and so on. This segmentation has historical roots—old filesystems had strict file size limits—but it's still useful today for operational reasons. Smaller files are easier to copy, back up, and manage. They also allow for some parallelism in I/O operations.</p>
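<p>Locating which segment file holds a given page is simple arithmetic. A sketch, assuming the default 8KB page size and 1GB segment size (both are configurable at build time), and reusing the relfilenode from the example above:</p>
<pre><code class="lang-python">PAGE_SIZE = 8192
SEGMENT_SIZE = 1024**3                          # 1 GB default
PAGES_PER_SEGMENT = SEGMENT_SIZE // PAGE_SIZE   # 131072 pages per segment

def segment_file(relfilenode, page_no):
    seg = page_no // PAGES_PER_SEGMENT
    return str(relfilenode) if seg == 0 else f"{relfilenode}.{seg}"

print(segment_file(24601, 0))        # '24601'
print(segment_file(24601, 131072))   # '24601.1': first page of the second gigabyte
print(segment_file(24601, 400000))   # '24601.3'
</code></pre>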
<h2 id="heading-file-organization-on-disk">File Organization on Disk</h2>
<p><a target="_blank" href="https://www.google.com/url?sa=t&amp;source=web&amp;rct=j&amp;url=https%3A%2F%2Flink.springer.com%2Fchapter%2F10.1007%2F979-8-8688-1507-2_3&amp;ved=0CBYQjRxqFwoTCJDuxPD04pIDFQAAAAAdAAAAABBY&amp;opi=89978449"><img src="https://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F979-8-8688-1507-2_3/MediaObjects/635661_1_En_3_Fig2_HTML.jpg" alt="PostgreSQL Physical Structures | Springer Nature Link" /></a></p>
<p>In the next blog we will dig into those files and the tuple structure in real depth.</p>
]]></content:encoded></item></channel></rss>