<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Tyranny Blogs]]></title><description><![CDATA[Tyranny Blogs]]></description><link>https://rabindranath-tiwari.com.np</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1771409405703/fa3209ce-cd23-48d1-b825-b1f90a47fd47.png</url><title>Tyranny Blogs</title><link>https://rabindranath-tiwari.com.np</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 18:39:16 GMT</lastBuildDate><atom:link href="https://rabindranath-tiwari.com.np/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding HTTP MIME Types]]></title><description><![CDATA[The internet hosts many kinds of resources, so each resource needs a label to tell clients how to handle it. MIME (Multipurpose Internet Mail Extensions) was originally developed to solve problems mov]]></description><link>https://rabindranath-tiwari.com.np/understanding-http-mime-types</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/understanding-http-mime-types</guid><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Sun, 22 Mar 2026 11:14:43 GMT</pubDate><content:encoded><![CDATA[<p>The internet hosts many kinds of resources, so each resource needs a label to tell clients how to handle it. MIME (Multipurpose Internet Mail Extensions) was originally developed to solve problems moving data between different mail systems, and the same concept is used by HTTP to indicate the media type of a resource.</p>
<p>MIME types are written as <code>type/subtype</code>.</p>
<h3>Type</h3>
<ul>
<li>The type indicates the broad category of the data, for example: <code>text</code>, <code>image</code>, <code>video</code>, <code>audio</code>, <code>application</code> (binary or structured data), and <code>multipart</code> (multiple parts in a single message).</li>
<li>These categories are formally defined in standards such as RFC 2046.</li>
</ul>
<h3>Subtype</h3>
<ul>
<li>The subtype specifies the exact format inside the main type, for example: <code>html</code>, <code>css</code>, <code>png</code>, <code>json</code>.</li>
<li>Together the <code>type/subtype</code> pair precisely identifies the resource format (for example, <code>text/html</code>).</li>
</ul>
<h3>Common MIME types</h3>
<ol>
<li>text/*</li>
</ol>
<ul>
<li><code>text/plain</code> — plain text</li>
<li><code>text/html</code> — HTML documents</li>
<li><code>text/css</code> — CSS stylesheets</li>
<li><code>text/javascript</code> — JavaScript (<code>application/javascript</code> is still widely seen, but RFC 9239 makes <code>text/javascript</code> the standard type)</li>
<li><code>text/event-stream</code> — Server-Sent Events (SSE)</li>
</ul>
<ol start="2">
<li>image/*</li>
</ol>
<ul>
<li><code>image/jpeg</code> — JPEG images</li>
<li><code>image/png</code> — PNG images</li>
<li><code>image/gif</code> — GIF images (supports animation)</li>
<li><code>image/webp</code> — WebP images</li>
</ul>
<ol start="3">
<li>audio/*</li>
</ol>
<ul>
<li><code>audio/mpeg</code> — MP3 audio</li>
<li><code>audio/wav</code> — WAV audio</li>
<li><code>audio/ogg</code> — Ogg audio</li>
<li><code>audio/webm</code> — WebM audio</li>
<li><code>audio/aac</code> — AAC audio</li>
<li><code>audio/flac</code> — FLAC (lossless) audio</li>
</ul>
<ol start="4">
<li>video/*</li>
</ol>
<ul>
<li><code>video/mp4</code> — MP4 video</li>
<li><code>video/webm</code> — WebM video</li>
<li><code>video/ogg</code> — Ogg video</li>
</ul>
<ol start="5">
<li>application/*</li>
</ol>
<ul>
<li><code>application/json</code> — JSON data</li>
<li><code>application/xml</code> — XML data</li>
<li><code>application/pdf</code> — PDF documents</li>
<li><code>application/octet-stream</code> — arbitrary binary data (default for unknown binaries)</li>
<li><code>application/zip</code> — ZIP archives</li>
</ul>
<ol start="6">
<li>multipart/*</li>
</ol>
<ul>
<li><code>multipart/form-data</code> — used for form submissions that include files</li>
<li><code>multipart/byteranges</code> — multiple byte ranges in a single response</li>
</ul>
<h3>HTTP usage</h3>
<ul>
<li>Servers send the MIME type in the <code>Content-Type</code> response header so browsers and other clients know how to process the payload. Example:<ul>
<li><code>Content-Type: text/html; charset=utf-8</code></li>
</ul>
</li>
</ul>
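<p>To pick the right <code>Content-Type</code>, a server typically maps the file extension to a MIME type. A minimal sketch of such a lookup (the table and function names here are illustrative, not a standard API):</p>

```typescript
// Small illustrative subset of a MIME table, not an exhaustive registry.
const mimeTypes: Record<string, string> = {
  ".html": "text/html",
  ".css": "text/css",
  ".js": "text/javascript",
  ".json": "application/json",
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".pdf": "application/pdf",
};

// Fall back to application/octet-stream for unknown binaries,
// matching the convention described above.
function contentTypeFor(filename: string): string {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : "";
  return mimeTypes[ext] ?? "application/octet-stream";
}
```

<p>A real server would set the result on the response, e.g. <code>Content-Type: text/html; charset=utf-8</code>, rather than return it from a function.</p>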
]]></content:encoded></item><item><title><![CDATA[Centralized Cache Key  Management In Redis]]></title><description><![CDATA[In modern web applications, efficient data access is essential for performance and user experience. Redis, a blazing-fast in-memory store used for caching, messaging, and short-lived persistence, depe]]></description><link>https://rabindranath-tiwari.com.np/centralized-cache-key-management-in-redis</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/centralized-cache-key-management-in-redis</guid><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Fri, 20 Mar 2026 13:36:01 GMT</pubDate><content:encoded><![CDATA[<p>In modern web applications, efficient data access is essential for performance and user experience. Redis, a blazing-fast in-memory store used for caching, messaging, and short-lived persistence, depends heavily on how you name and organize keys. In this article we explore why centralized, modular Redis key management matters and how to design an approach for a Node.js-based school management system that reduces bugs, improves maintainability, and scales cleanly.</p>
<h2>Why Redis Key Management Matters</h2>
<h3>The stakes are high</h3>
<p>Redis keys are the foundation of caching, session handling, queues, and many transient-data workflows. Poor key management causes:</p>
<ul>
<li>Bugs: A typo in a key name leads to cache misses and inconsistent behavior.</li>
<li>Inconsistency: Different services or modules using divergent naming conventions create confusion and integration bugs.</li>
<li>Scalability problems: When keys are scattered, refactoring or changing schemas becomes risky and expensive.</li>
<li>Debugging nightmares: Tracking where a certain key is created, read, or invalidated is difficult.</li>
<li>Naming conflicts: Accidental collisions can overwrite unrelated data.</li>
<li>Type unsafety: No guarantees that the right parameter types or formats are used when building keys.</li>
</ul>
<p>Properly managed keys reduce these risks and make system behavior predictable, testable, and easier to evolve.</p>
<h2>The problem with ad-hoc key management</h2>
<h3>Bad practice: Scattered keys</h3>
<p>Common anti-patterns include:</p>
<ul>
<li>Inconsistent separators and format (<code>user:123</code> vs <code>user_123</code> vs <code>user/123</code>).</li>
<li>Duplicate key construction logic scattered across modules/services.</li>
<li>Hard-coded strings littered through code, leading to silent failures on rename.</li>
<li>No parameter validation (e.g., using raw objects or arrays in parts of the key).</li>
<li>No centralized documentation or discoverability for which keys exist.</li>
</ul>
<p>These patterns make it hard to refactor, test, or enforce cross-cutting rules like TTLs, versioning, and prefixes.</p>
<h2>Good practice: Centralized, modular key management</h2>
<p>Centralizing Redis key creation and lifecycle rules brings clarity, reduces bugs, and speeds development. Key ideas:</p>
<ul>
<li>Single source of truth: central module (or small set of modules) that defines key templates and helper functions.</li>
<li>Consistent naming convention: choose separators and order of namespaces and enforce them.</li>
<li>Parameter validation and type safety: validate or type the parameters used to construct keys (TypeScript helps).</li>
<li>Versioning: include a version segment or use a prefix to make migrations safe.</li>
<li>Modularization by domain: group keys by bounded context (e.g., students, classes, attendance).</li>
<li>TTL strategy and defaults: centralize TTLs per key or key group so expirations are consistent.</li>
<li>Instrumentation &amp; discovery: log or expose which keys are created, and document the registry for teams.</li>
<li>Migration plans: support safe migration paths via key versioning or prefixing.</li>
</ul>
<p>Below are concrete recommendations and examples tailored for a school management system.</p>
<h2>Naming conventions (recommendations)</h2>
<ul>
<li>Use a clear separator, such as colon (<code>:</code>). Example: <code>school:123:student:456:profile</code>.</li>
<li>Order segments from broad to specific: <code>{domain}:{orgId}:{resource}:{resourceId}:{subresource}</code>.</li>
<li>Keep keys short but descriptive. Avoid embedding large JSON structures in keys.</li>
<li>Add an optional version segment or prefix: <code>v1:school:...</code> to allow rolling migrations.</li>
<li>Use prefixes for environment when sharing Redis (e.g., <code>prod:</code>, <code>staging:</code>) or use distinct Redis instances.</li>
</ul>
<h2>Domain-based key examples (school management)</h2>
<p>Suggested structure:</p>
<ul>
<li>School-level cache: <code>school:{schoolId}:meta</code></li>
<li>Student profile: <code>school:{schoolId}:student:{studentId}:profile</code></li>
<li>Student attendance for date: <code>school:{schoolId}:student:{studentId}:attendance:{YYYY-MM-DD}</code></li>
<li>Class roster: <code>school:{schoolId}:class:{classId}:roster</code></li>
<li>Teacher sessions: <code>school:{schoolId}:teacher:{teacherId}:session:{sessionId}</code></li>
</ul>
<p>Example keys:</p>
<ul>
<li><code>v1:school:42:student:1001:profile</code></li>
<li><code>v1:school:42:class:7:roster</code></li>
<li><code>v1:school:42:student:1001:attendance:2026-03-20</code></li>
</ul>
<h2>Centralized key factory (pattern)</h2>
<p>Create a single module that exports functions to build keys and optionally parse or validate them. Benefits:</p>
<ul>
<li>Single place to enforce naming, version, TTL defaults.</li>
<li>Easier to change structure globally (e.g., add <code>v2:</code>).</li>
<li>Improves code discoverability and reuse.</li>
</ul>
<p>Example (JavaScript / TypeScript style pseudocode):</p>
<pre><code class="language-ts">// redisKeys.ts
const PREFIX = 'v1';
const SEP = ':';

export const keys = {
  schoolMeta: (schoolId: number | string) =&gt;
    [PREFIX, 'school', schoolId, 'meta'].join(SEP),

  studentProfile: (schoolId: number | string, studentId: number | string) =&gt;
    [PREFIX, 'school', schoolId, 'student', studentId, 'profile'].join(SEP),

  studentAttendance: (schoolId: number | string, studentId: number | string, date: string) =&gt;
    [PREFIX, 'school', schoolId, 'student', studentId, 'attendance', date].join(SEP),

  classRoster: (schoolId: number | string, classId: number | string) =&gt;
    [PREFIX, 'school', schoolId, 'class', classId, 'roster'].join(SEP),
};
</code></pre>
<p>Use these helpers everywhere instead of inline strings. If you later need to change <code>PREFIX</code> to <code>v2</code> or add an environment prefix, you change it in one place.</p>
<h2>Type safety and validation</h2>
<ul>
<li>In TypeScript, type the function inputs (schoolId: string | number). Add runtime checks for format when necessary.</li>
<li>Validate date formats (ISO-8601 or YYYY-MM-DD) for keys that embed dates.</li>
<li>Consider small helper functions that sanitize IDs (e.g., disallow colons in IDs).</li>
</ul>
<p>Example runtime guard:</p>
<pre><code class="language-ts">function assertId(id: unknown, name = 'id') {
  if (typeof id !== 'string' &amp;&amp; typeof id !== 'number') {
    throw new Error(`${name} must be a string or number`);
  }
}
</code></pre>
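<p>A sanitizer along the lines suggested above might simply reject IDs containing the separator, so a malformed ID cannot inject extra key segments. This helper is hypothetical, shown only for illustration:</p>

```typescript
// Reject IDs that contain the key separator (':'), which would otherwise
// let one ID masquerade as several key segments.
function sanitizeId(id: string | number, name = 'id'): string {
  const s = String(id);
  if (s.length === 0 || s.includes(':')) {
    throw new Error(`${name} must be non-empty and must not contain ':'`);
  }
  return s;
}
```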
<h2>TTL and expiration strategy</h2>
<ul>
<li>Define default TTLs in the key module or in a separate TTL registry.</li>
<li>Use TTLs for ephemeral caches and avoid TTLs for data you treat as persistent (or document exceptions).</li>
<li>Central TTL registry example:</li>
</ul>
<pre><code class="language-ts">export const ttl = {
  studentProfile: 60 * 60 * 24, // 24 hours
  classRoster: 60 * 10,         // 10 minutes
};
</code></pre>
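<p>The key builders and the TTL registry can be tied together in one write path, so callers never pass a raw key string or an ad-hoc expiration. A sketch, assuming an ioredis-style client exposing <code>set(key, value, 'EX', seconds)</code>; the wrapper and the inline stand-ins for the key factory and TTL registry are illustrative:</p>

```typescript
// Minimal stand-ins for the key factory and TTL registry shown earlier.
const studentProfileKey = (schoolId: number, studentId: number) =>
  ['v1', 'school', schoolId, 'student', studentId, 'profile'].join(':');
const STUDENT_PROFILE_TTL = 60 * 60 * 24; // 24 hours

// Client interface kept minimal so the wrapper is testable without a live
// Redis; ioredis exposes a compatible set(key, value, 'EX', seconds).
interface RedisLike {
  set(key: string, value: string, mode: 'EX', seconds: number): Promise<unknown>;
}

// Write-path helper: the key and the TTL both come from the central module.
async function cacheStudentProfile(
  client: RedisLike,
  schoolId: number,
  studentId: number,
  profileJson: string,
): Promise<string> {
  const key = studentProfileKey(schoolId, studentId);
  await client.set(key, profileJson, 'EX', STUDENT_PROFILE_TTL);
  return key;
}
```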
<h2>Key versioning and migrations</h2>
<ul>
<li>Prefix keys with a version (<code>v1:</code>). To migrate, write new keys with <code>v2:</code> and keep <code>v1:</code> readers until migration completes.</li>
<li>Alternatively, perform background jobs to re-key or repopulate caches under the new format.</li>
</ul>
<h2>Documentation, discovery, and monitoring</h2>
<ul>
<li>Keep a living registry (the key module doubles as documentation).</li>
<li>Document patterns in README or internal docs accessible by teams.</li>
<li>Log key creation and invalidation events for debugging.</li>
<li>Use Redis keyspace notifications sparingly (they can be noisy) or maintain application-level audit logs for critical keys.</li>
</ul>
<h2>Operational considerations</h2>
<ul>
<li>Namespace separation: consider separate Redis DBs or clusters per environment to avoid accidental collisions.</li>
<li>Key scanning: avoid heavy use of KEYS in production. Prefer known patterns or use SCAN with care for maintenance scripts.</li>
<li>Use Redis memory monitoring and eviction policy tailored for caches (e.g., LRU).</li>
<li>Instrument cache hit/miss metrics per key group. That lets you tune TTLs or caching boundaries.</li>
</ul>
<h2>Migration &amp; refactor checklist</h2>
<ul>
<li>Add versioned keys while keeping old readers active.</li>
<li>Populate new keys on writes (write-through) and read-through fallback to old keys until warm.</li>
<li>Run background rekeying for large datasets when possible.</li>
<li>Monitor for orphaned v1 keys and plan for cleanup after confidence.</li>
</ul>
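<p>The read-through fallback from the checklist can be sketched as follows. A Map stands in for Redis, values are plain strings, and the sketch assumes v2 keeps the same key layout as v1 (in a real migration the copy step might also translate the key):</p>

```typescript
// A Map stands in for Redis; in production these would be GET/SET calls.
type Cache = Map<string, string>;

// Read-through with fallback: prefer the v2 key; if only the v1 key exists,
// copy its value forward under v2 so the cache warms as it is read.
function readWithFallback(cache: Cache, suffix: string): string | undefined {
  const v2Key = `v2:${suffix}`;
  const v1Key = `v1:${suffix}`;
  const v2Val = cache.get(v2Key);
  if (v2Val !== undefined) return v2Val;
  const v1Val = cache.get(v1Key);
  if (v1Val !== undefined) {
    cache.set(v2Key, v1Val); // lazy re-key under the new version
  }
  return v1Val;
}
```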
<h2>Summary</h2>
<p>Centralized Redis key management brings immediate benefits:</p>
<ul>
<li>Fewer bugs from typos and inconsistent naming.</li>
<li>Predictable refactor paths via versioning.</li>
<li>Easier enforcement of TTLs and caching policies.</li>
<li>Better documentation and discoverability across teams.</li>
</ul>
<p>For a Node.js school management system, adopt a small, well-documented key factory module that:</p>
<ul>
<li>Exposes domain-specific key builders,</li>
<li>Holds TTLs and versioning info,</li>
<li>Validates inputs,</li>
<li>And serves as the canonical registry for all Redis key usage.</li>
</ul>
<p>Starting with a centralized approach keeps your cache predictable, debuggable, and ready to scale as your application and team grow.</p>
]]></content:encoded></item><item><title><![CDATA[Postgres Multi Version Concurrency Control - MVCC]]></title><description><![CDATA[In 1986 database researcher named Michael Stonebraker was working on a problem that has plauged databases since their inception. how do you let many people read and write to the data simultaneously wi]]></description><link>https://rabindranath-tiwari.com.np/postgres-multi-version-concurrency-control-mvcc</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/postgres-multi-version-concurrency-control-mvcc</guid><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Thu, 19 Feb 2026 17:10:29 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/6994419ff4a777784e6fc082/cb80f497-7fbe-4026-ae89-040a05911706.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 1986, a database researcher named Michael Stonebraker was working on a problem that had plagued databases since their inception: how do you let many people read and write data simultaneously without everything grinding to a halt? The traditional locking solution was like having one bathroom for an entire building. Sure, it works, but the line gets long very quickly.</p>
<p>Stonebraker had a radical idea: what if we kept multiple versions of each row? What if, instead of locking data, we just let everyone see the version that existed when they started their work? This became MVCC, and it's the reason PostgreSQL can do things that seem almost magical.</p>
<h3>The fundamental problem</h3>
<p>Imagine you're building a banking application. Two tellers are working simultaneously, both looking at the same account. The account has 1000 dollars in it. Teller A starts a transaction to withdraw 100 dollars. At the exact same moment, Teller B starts a transaction to deposit 50 dollars. What should happen?</p>
<p>In a traditional lock-based system, Teller A's read takes a lock on the account row, and Teller B's write must wait until that lock is released. This is called the "reader blocks writer" problem, and it's a disaster for high-concurrency systems. Teller B is just sitting there, unable to do anything, because Teller A happened to read the account balance first.</p>
<h3>Transaction IDs</h3>
<p>Every transaction in PostgreSQL that modifies data gets a unique identifier called a transaction ID, or XID. This isn't some abstract concept—it's a 32-bit unsigned integer that gets stamped onto every tuple you insert or update.</p>
<pre><code class="language-sql">CREATE TABLE mvcc_demo (
    id INT PRIMARY KEY,
    account_name TEXT,
    balance NUMERIC
);

-- Start a transaction
BEGIN;

-- Check: do we have an XID yet?
SELECT txid_current_if_assigned();
</code></pre>
<p>This returns NULL. Why? Because PostgreSQL is lazy about assigning transaction IDs. A read-only transaction never needs one. Only when you do something that modifies data does PostgreSQL say, "Okay, you need a number."</p>
<pre><code class="language-sql">-- Now force an XID assignment
SELECT txid_current();
</code></pre>
<pre><code class="language-sql">INSERT INTO mvcc_demo VALUES (1, 'Alice', 1000);

-- Now let's look at what happened physically
SELECT t_xmin, t_xmax, t_ctid, * 
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
</code></pre>
<p>You should see something like this:</p>
<pre><code class="language-txt"> t_xmin | t_xmax | t_ctid | id | account_name | balance 
--------+--------+--------+----+--------------+---------
   1847 |      0 | (0,1)  |  1 | Alice        | 1000
</code></pre>
<p>Look at that t_xmin field. It's 1847—the transaction ID we just saw. This tuple was created by transaction 1847. The t_xmax is 0, meaning no transaction has deleted it yet.</p>
<p>Now let's do an update and see what happens:</p>
<pre><code class="language-sql">-- In the same transaction
UPDATE mvcc_demo SET balance = 900 WHERE id = 1;

-- Look at the page again
SELECT t_xmin, t_xmax, t_ctid, id, account_name, balance 
FROM heap_page_items(get_raw_page('mvcc_demo', 0));
</code></pre>
<pre><code class="language-txt"> t_xmin | t_xmax | t_ctid | id | account_name | balance 
--------+--------+--------+----+--------------+---------
   1847 |   1847 | (0,2)  |  1 | Alice        | 1000
   1847 |      0 | (0,2)  |  1 | Alice        |  900
</code></pre>
<p>Two tuples now! The old version has t_xmax=1847 (my transaction deleted it) and t_ctid=(0,2) pointing to the new version. The new version has t_xmin=1847 (my transaction created it). Both versions exist on disk simultaneously.</p>
<pre><code class="language-sql">COMMIT;

-- Check the snapshot from outside this transaction
SELECT txid_current_snapshot();
</code></pre>
<p>Suppose this returns <code>1847:1848:</code>. Let me decode it for you. The snapshot format is <code>xmin:xmax:xip_list</code>. Here, xmin=1847 (the oldest transaction that was active when this snapshot was taken), xmax=1848 (the next XID to be assigned), and the xip_list is empty (no transactions are currently in progress). This snapshot is the key to everything. It's how PostgreSQL knows which tuple versions you're allowed to see.</p>
<h3>Snapshots</h3>
<p>A snapshot is a point-in-time view of which transactions are visible to you. Think of it as a photograph of the transaction ID space at the moment your query (or transaction) begins.</p>
<p>Let me demonstrate this with two concurrent sessions. Open two terminal windows and follow along.</p>
<p><strong>Session A</strong></p>
<pre><code class="language-sql">BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current();
</code></pre>
<p>Let's say this returns 2000.</p>
<pre><code class="language-sql">SELECT txid_current_snapshot();
</code></pre>
<p>Output: <code>2000:2001:</code></p>
<p>This means: "I am transaction 2000. The next transaction will be 2001. No other transactions are running right now". Now, while keeping Session A open, go to Session B:</p>
<p><strong>Session B</strong></p>
<pre><code class="language-sql">BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current();
</code></pre>
<p>Returns: 2001</p>
<pre><code class="language-sql">SELECT txid_current_snapshot();
</code></pre>
<p>Output: <code>2000:2002:2000</code></p>
<p>Read this as: "I am transaction 2001. The next transaction will be 2002. Transaction 2000 is currently in progress." Now, in Session B, let's insert some data:</p>
<p><strong>Session B</strong></p>
<pre><code class="language-sql">INSERT INTO mvcc_demo VALUES (2, 'Bob', 500);
SELECT * FROM mvcc_demo;
</code></pre>
<p>You'll see both Alice (from earlier) and Bob. Session B can see its own insert immediately.</p>
<pre><code class="language-plsql">COMMIT;
</code></pre>
<p>Session B commits. Now let's go back to Session A:</p>
<p><strong>Session A</strong></p>
<pre><code class="language-sql">SELECT * FROM mvcc_demo;
</code></pre>
<p>You'll only see Alice! Bob doesn't appear. Why? Because Session A's snapshot was taken before transaction 2001 existed. Even though 2001 has committed, Session A captured a snapshot at the beginning of its transaction that said, "I can't see anything from transaction 2001 or higher."</p>
<p>This is snapshot isolation in action. Session A sees a frozen view of the database as it existed when the transaction started.</p>
<p>Now let's see what happens if we change the isolation level to READ COMMITTED:</p>
<p><strong>Session C (NEW WINDOW)</strong></p>
<pre><code class="language-sql">BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT txid_current_snapshot();
</code></pre>
<p>Output might be: <code>2000:2003:2000</code></p>
<pre><code class="language-sql">SELECT * FROM mvcc_demo;
</code></pre>
<p>You see both Alice and Bob! Why? Because in READ COMMITTED mode, PostgreSQL takes a new snapshot at the start of each SQL statement, not at the start of the transaction. The commit from Session B happened before this SELECT started, so it's visible.</p>
<p>This is the difference between REPEATABLE READ and READ COMMITTED:</p>
<ul>
<li><p><strong>REPEATABLE READ</strong>: One snapshot for the entire transaction</p>
</li>
<li><p><strong>READ COMMITTED</strong>: New snapshot for each statement</p>
</li>
</ul>
<p>Let me show you the snapshot structure more precisely. When PostgreSQL creates a snapshot, it builds a small data structure in memory:</p>
<pre><code class="language-c">typedef struct SnapshotData {
    TransactionId xmin;    // Oldest XID still active
    TransactionId xmax;    // Next XID to be assigned
    uint32 xcnt;           // Number of XIDs in xip[]
    TransactionId *xip;    // Array of in-progress XIDs
} SnapshotData;
</code></pre>
<p>When transaction 2000 takes its snapshot while 2001 and 2005 are running (note that, as in the <code>txid_current_snapshot()</code> outputs above, your own XID is not listed in the in-progress array):</p>
<pre><code class="language-txt">xmin: 2000
xmax: 2010  (next to be assigned)
xcnt: 2
xip: [2001, 2005]
</code></pre>
<p>Now when you look at a tuple with t_xmin=2003, PostgreSQL asks:</p>
<ol>
<li><p>Is 2003 &lt; xmin (2000)? No.</p>
</li>
<li><p>Is 2003 &gt;= xmax (2010)? No.</p>
</li>
<li><p>Is 2003 in the xip array? No.</p>
</li>
<li><p>Therefore, check CLOG to see if 2003 committed.</p>
</li>
</ol>
<p>If CLOG says "committed," the tuple is visible. If "aborted," it's not. If "in progress," it's not visible either (unless it's your own transaction). This is how visibility works: every tuple read triggers this check against your snapshot.</p>
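<p>The four-step check above can be sketched in code. This is a simplified model of the real visibility logic (<code>HeapTupleSatisfiesMVCC</code> in PostgreSQL's source, which also uses hint bits to skip CLOG lookups); the types and the CLOG callback here are illustrative:</p>

```typescript
type Xid = number;

interface Snapshot {
  xmin: Xid;   // oldest XID still active when the snapshot was taken
  xmax: Xid;   // next XID to be assigned at snapshot time
  xip: Xid[];  // other XIDs in progress at snapshot time
}

type ClogStatus = 'committed' | 'aborted' | 'in_progress';

// Would the work of transaction `xid` be visible under `snap`?
// `clog` stands in for PostgreSQL's commit log lookup.
function xidVisible(xid: Xid, snap: Snapshot, clog: (x: Xid) => ClogStatus): boolean {
  if (xid < snap.xmin) {
    // Finished before any active transaction; CLOG says whether it committed.
    return clog(xid) === 'committed';
  }
  if (xid >= snap.xmax) return false;        // started after the snapshot was taken
  if (snap.xip.includes(xid)) return false;  // in progress at snapshot time
  return clog(xid) === 'committed';          // otherwise, ask the CLOG
}
```

<p>With the snapshot above (xmin=2000, xmax=2010, xip=[2001, 2005]), a tuple with t_xmin=2003 is visible only if CLOG says 2003 committed, while 2005 (in progress) and anything at or beyond 2010 are invisible.</p>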
]]></content:encoded></item><item><title><![CDATA[PostgreSQL TOAST Storage Models]]></title><description><![CDATA[The main problem TOAST solves is fundamental : postgres pages are of 8kb, and tuple must fit within that page. so what happens if you try to insert 1MB of text field ?
Without toast you would get an error: "Row too big". With toast PostgreSQL handles...]]></description><link>https://rabindranath-tiwari.com.np/postgresql-toast-storage-models</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/postgresql-toast-storage-models</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[toast]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Thu, 19 Feb 2026 03:36:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771472166050/8a04235c-9b5c-4c37-9b62-668639c63287.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The main problem TOAST solves is fundamental: PostgreSQL pages are 8 KB, and a tuple must fit within a single page. So what happens if you try to insert a 1 MB text field?</p>
<p>Without TOAST you would get a "row too big" error. With TOAST, PostgreSQL handles the large value by compressing it and, if it is still too large, breaking it into chunks stored in a separate TOAST table.</p>
<p>Here is how it works. Every table with at least one potentially large column gets an associated TOAST table. You don't see these tables in your normal schema; they live in the special <code>pg_toast</code> schema and have auto-generated names like <code>pg_toast_16385</code>. Each TOAST table has the same structure:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> pg_toast.pg_toast_16385 (
    chunk_id <span class="hljs-keyword">OID</span>,
    chunk_seq <span class="hljs-built_in">INT</span>,
    chunk_data BYTEA
);
</code></pre>
<p>When you insert a row with a large column, PostgreSQL first tries to fit the entire row in the page. If it doesn't fit, it tries compressing the large columns using the pglz algorithm (or lz4 in newer versions). If a 10KB text field compresses down to 1KB, great—it now fits inline, and no TOAST table is needed.</p>
<p>But if after compression it's still too large, PostgreSQL takes the compressed data and chunks it into ~2KB pieces, inserts those pieces into the TOAST table, and replaces the large column in the main tuple with an 18-byte TOAST pointer. That pointer contains the <code>chunk_id</code> and enough metadata to reassemble and decompress the original value.</p>
<p>This happens completely transparently. When you query the column, PostgreSQL sees the TOAST pointer, reads the chunks from the TOAST table, reassembles them, decompresses, and returns the value. You don't have to do anything special.</p>
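<p>The chunk-and-reassemble mechanics can be modeled in a few lines. The real chunk size is <code>TOAST_MAX_CHUNK_SIZE</code> (just under 2 KB, commonly 1996 bytes with the default 8 KB page size); the shapes below mirror the TOAST table's <code>(chunk_seq, chunk_data)</code> pair and are purely illustrative:</p>

```typescript
const CHUNK_SIZE = 1996; // just under 2 KB, as with the default 8 KB page size

// Split a large value into fixed-size chunks tagged with a sequence number,
// the way TOAST rows carry (chunk_id, chunk_seq, chunk_data).
function toChunks(data: string): { seq: number; chunk: string }[] {
  const chunks: { seq: number; chunk: string }[] = [];
  for (let i = 0; i * CHUNK_SIZE < data.length; i++) {
    chunks.push({ seq: i, chunk: data.slice(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE) });
  }
  return chunks;
}

// Reassembly: order by sequence number and concatenate.
function fromChunks(chunks: { seq: number; chunk: string }[]): string {
  return chunks
    .slice()
    .sort((a, b) => a.seq - b.seq)
    .map((c) => c.chunk)
    .join('');
}
```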
<p>But there are performance implications. Fetching a TOASTed column is much more expensive than fetching an inline column. If the main table is cached in memory but the TOAST table isn't, you'll incur I/O. If the value is split into 100 chunks, you're doing 100 additional tuple fetches.</p>
<p>This is why the golden rule of PostgreSQL performance is: <strong>only SELECT the columns you need</strong>. If you do <code>SELECT *</code> on a table with a large TEXT column, you'll fetch and decompress that column even if you don't use it. If you do <code>SELECT id, name</code> instead, the TOAST column is never touched.</p>
<p>You can control TOAST behavior per column with the <code>SET STORAGE</code> option:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> mytable <span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">COLUMN</span> mycolumn <span class="hljs-keyword">SET</span> <span class="hljs-keyword">STORAGE</span> <span class="hljs-keyword">EXTERNAL</span>;
</code></pre>
<h3 id="heading-toast-storage-models">TOAST Storage Models</h3>
<hr />
<p><strong>PLAIN</strong> means no TOAST at all. This is for fixed-size types that can't be TOASTed, e.g. <code>INT</code>.</p>
<p><strong>EXTENDED</strong> is the default. It tries compression first, and if the value is still too large, moves it out of line to the TOAST table.</p>
<p><strong>EXTERNAL</strong> means skip compression but move out of line if needed. This is useful when the data is already compressed on the client side, such as a JPEG image; compressing it again at the database level would waste CPU cycles.</p>
<p><strong>MAIN</strong> means prefer to keep the value inline. Try compression, and only move out-of-line as a last resort. This is useful for frequently accessed columns where you want to avoid TOAST overhead.</p>
<h3 id="heading-toast-in-action">TOAST in Action</h3>
<hr />
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> toast_test(<span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>, small_data <span class="hljs-built_in">TEXT</span>, large_data <span class="hljs-built_in">TEXT</span>);
<span class="hljs-comment">-- small value stored inline</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> toast_test <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'hello'</span>, <span class="hljs-string">'hello'</span>);
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">oid</span>, reltoastrelid::regclass <span class="hljs-keyword">FROM</span> pg_class <span class="hljs-keyword">WHERE</span> relname = <span class="hljs-string">'toast_test'</span>;
</code></pre>
<p>You'll see something like <code>toast_test</code> and <code>pg_toast.pg_toast_16385</code>. Now:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Check TOAST table (should be empty)</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">FROM</span> pg_toast.pg_toast_16385;
</code></pre>
<p>Zero rows, because we haven't inserted anything large. Now:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Large value: triggers TOAST</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> toast_test <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">2</span>, <span class="hljs-string">'small'</span>, <span class="hljs-keyword">repeat</span>(<span class="hljs-string">'x'</span>, <span class="hljs-number">10000</span>));

<span class="hljs-comment">-- Check TOAST table again</span>
<span class="hljs-keyword">SELECT</span> chunk_id, chunk_seq, <span class="hljs-keyword">length</span>(chunk_data) 
<span class="hljs-keyword">FROM</span> pg_toast.pg_toast_16385 
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> chunk_id, chunk_seq;
</code></pre>
<p>Now you'll see several rows, each with about 2000 bytes of data, representing the chunks of your 10KB value.</p>
<p>TOAST has an interesting interaction with VACUUM. When you update or delete a row that has TOASTed columns, the old version's TOAST chunks remain in the TOAST table until VACUUM runs. If you do a lot of updates, your TOAST table can become bloated with unreferenced chunks. This is why monitoring TOAST table size is important for tables with large columns.</p>
]]></content:encoded></item><item><title><![CDATA[Tuple Layout]]></title><description><![CDATA[When you insert a new row into a table, postgreSQL creates a tuple, a contiguous chunk of memory bytes that contains both system metadata and your actual data.
A tuple starts with a 23 byte header. Every single tuple in every table has this same head...]]></description><link>https://rabindranath-tiwari.com.np/tuple-layout</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/tuple-layout</guid><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Wed, 18 Feb 2026 17:19:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771435105070/ffb33977-c32f-47e6-bba4-2668a5768cd2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you insert a new row into a table, PostgreSQL creates a tuple: a contiguous chunk of bytes that contains both system metadata and your actual data.</p>
<p>A tuple starts with a 23-byte header. Every single tuple in every table has this same header structure, regardless of how many columns you have or which data types you use. These 23 bytes are pure overhead, which is why storing lots of tiny rows (say, a two-column table with a smallint and a boolean) is inefficient—you're spending more space on metadata than on data.</p>
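<p>That overhead claim is easy to quantify. A sketch, assuming a 64-bit build (MAXALIGN of 8) and counting only per-tuple costs, not the page header:</p>
<pre><code class="lang-python">MAXALIGN = 8       # assumption: 64-bit build
HEADER = 23        # fixed tuple header size
LINE_POINTER = 4   # per-tuple entry in the page's line pointer array

def align(n, a=MAXALIGN):
    return -(-n // a) * a        # round up to the next multiple of a

data = 2 + 1                                # smallint + boolean payload
tuple_size = align(align(HEADER) + data)    # header padded to 24, whole tuple to 32
per_row = tuple_size + LINE_POINTER
print(per_row)                              # 36 bytes on disk for 3 bytes of user data
</code></pre>
<p>Three bytes of payload cost 36 bytes on disk: more than 90 percent overhead.</p>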
<h3 id="heading-tuple-headers">Tuple Headers</h3>
<p><strong>t_xmin</strong></p>
<p>The first 4 bytes are <code>t_xmin</code>, a transaction ID. This is the ID of the transaction that created this tuple. When you run <code>INSERT INTO users VALUES (1, 'Alice')</code> inside transaction 1000, the resulting tuple gets <code>t_xmin=1000</code>.</p>
<p>This field never changes. Ever. Even if the tuple is later updated or deleted, <code>t_xmin</code> remains the ID of the transaction that originally created it. This is the foundation of MVCC. To determine if a tuple is visible to your transaction, PostgreSQL looks at <code>t_xmin</code> and asks: "Was this transaction committed before my snapshot was taken? Am I allowed to see tuples it created?"</p>
<p><strong>t_xmax</strong></p>
<p>The next 4 bytes are <code>t_xmax</code>, which is more complicated. In the simple case, if the tuple has never been deleted or locked, <code>t_xmax</code> is zero. But if a transaction deletes this tuple, <code>t_xmax</code> gets set to that transaction's ID.</p>
<p>Here's where it gets subtle: <code>t_xmax</code> can also mean "this tuple is locked" (as in <code>SELECT ... FOR UPDATE</code>) rather than deleted. How do you tell the difference? You have to look at the <code>t_infomask</code> flags. If <code>HEAP_XMAX_LOCK_ONLY</code> is set, then <code>t_xmax</code> is a lock, not a deletion. If <code>HEAP_UPDATED</code> is set, this was an UPDATE (so there's a newer version somewhere). If neither is set, it's a plain DELETE.</p>
<p>And there's yet another case: if multiple transactions lock the same tuple concurrently (for example, multiple <code>SELECT ... FOR SHARE</code> statements), <code>t_xmax</code> doesn't hold a transaction ID at all. Instead, it holds a "MultiXactId," which is an ID into a separate structure (<code>pg_multixact</code>) that stores a list of transaction IDs. The <code>HEAP_XMAX_IS_MULTI</code> flag in <code>t_infomask</code> tells you this has happened.</p>
<p>This overloading of <code>t_xmax</code> is a clever space optimization, but it makes the visibility logic quite complex.</p>
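<p>To make the roles of <code>t_xmin</code> and <code>t_xmax</code> concrete, here is a deliberately oversimplified visibility check. This is a sketch, not PostgreSQL's actual algorithm: it ignores hint bits, in-progress transactions, locks, MultiXacts, and the real snapshot structure, and a plain set stands in for a CLOG lookup:</p>
<pre><code class="lang-python">def tuple_is_visible(t_xmin, t_xmax, snapshot_xid, committed):
    # 'committed' is a set of transaction IDs, standing in for the CLOG.
    # The creating transaction must be committed and older than our snapshot...
    if t_xmin not in committed or t_xmin >= snapshot_xid:
        return False
    # ...and any deleting transaction must NOT be visible to us.
    if t_xmax != 0 and t_xmax in committed and snapshot_xid > t_xmax:
        return False
    return True

committed = {1000, 1005}
print(tuple_is_visible(1000, 0, 1010, committed))     # True: created, never deleted
print(tuple_is_visible(1000, 1005, 1010, committed))  # False: deletion is visible to us
</code></pre>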
<p><strong>t_ctid</strong></p>
<p>After a 4-byte command ID field (<code>t_cid</code>) come 6 bytes for <code>t_ctid</code>, a tuple identifier consisting of a page number (4 bytes) and a line pointer number (2 bytes). This field serves a dual purpose.</p>
<p>First, it's the tuple's own physical address, often called the TID (tuple identifier). If you run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> ctid, * <span class="hljs-keyword">FROM</span> <span class="hljs-keyword">users</span>;
</code></pre>
<p>You'll see values like <code>(0,1)</code>, meaning page 0, line pointer 1. This is how indexes refer to tuples—they store the TID.</p>
<p>Second, <code>t_ctid</code> is used for update chains. When a tuple is the current version (hasn't been updated), its <code>t_ctid</code> points to itself: <code>(0,1)</code> points to <code>(0,1)</code>. But when a tuple is updated, the old version's <code>t_ctid</code> gets changed to point to the new version. This creates a chain:</p>
<pre><code class="lang-plaintext">Old version at (0,1): t_ctid = (0,2)
New version at (0,2): t_ctid = (0,2)  [self-pointer]
</code></pre>
<p>If you update again:</p>
<pre><code class="lang-plaintext">Old v1 at (0,1): t_ctid = (0,2)
Old v2 at (0,2): t_ctid = (0,3)
Current at (0,3): t_ctid = (0,3)
</code></pre>
<p>This chain allows PostgreSQL to follow updates. An index points to <code>(0,1)</code>. When you look up that TID, you find an old version with <code>t_ctid=(0,2)</code>, so you follow the chain to <code>(0,2)</code>, then to <code>(0,3)</code>, where you find the current version.</p>
<p>Long update chains are a performance problem. If a tuple has been updated 100 times, you have to follow 100 hops to reach the current version. This is one reason why HOT (Heap-Only Tuple) updates are so valuable—they keep the chain short and on the same page.</p>
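<p>The chain walk above can be sketched in a few lines. The toy heap below maps each TID to its <code>t_ctid</code> and payload; a self-pointing <code>t_ctid</code> marks the current version, exactly as in the diagrams:</p>
<pre><code class="lang-python"># toy heap: TID -> (t_ctid, payload); a self-pointer marks the current version
heap = {
    (0, 1): ((0, 2), "v1"),
    (0, 2): ((0, 3), "v2"),
    (0, 3): ((0, 3), "v3"),
}

def follow_chain(tid):
    hops = 0
    while heap[tid][0] != tid:   # keep walking until we hit the self-pointer
        tid = heap[tid][0]
        hops += 1
    return heap[tid][1], hops

print(follow_chain((0, 1)))      # ('v3', 2): two hops from the stale index TID
</code></pre>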
<p><strong>t_infomask</strong></p>
<p>Now we come to the most complex field in the tuple header: <code>t_infomask</code>, a 2-byte bitmap containing 16 Boolean flags. This is where PostgreSQL packs a huge amount of state information.</p>
<p>Some flags describe the tuple's data layout. <code>HEAP_HASNULL</code> (bit 0) means at least one column is NULL, so there's a null bitmap after the tuple header. <code>HEAP_HASVARWIDTH</code> (bit 1) means there are variable-length columns. <code>HEAP_HASEXTERNAL</code> (bit 2) means at least one column is stored out-of-line in a TOAST table.</p>
<p>Other flags describe the tuple's MVCC state. <code>HEAP_XMAX_LOCK_ONLY</code> (bit 7) means <code>t_xmax</code> is a lock, not a delete. <code>HEAP_UPDATED</code> (bit 13) means this tuple was updated, so there's a newer version. <code>HEAP_XMAX_IS_MULTI</code> (bit 12) means <code>t_xmax</code> is a MultiXactId.</p>
<p>But the most important flags are the hint bits: <code>HEAP_XMIN_COMMITTED</code> (bit 8), <code>HEAP_XMIN_INVALID</code> (bit 9), <code>HEAP_XMAX_COMMITTED</code> (bit 10), and <code>HEAP_XMAX_INVALID</code> (bit 11). Understanding hint bits is essential.</p>
<p>Here's the problem they solve: To determine if a tuple is visible, we need to know whether <code>t_xmin</code> and <code>t_xmax</code> are committed or aborted. This information lives in the CLOG (commit log), also called <code>pg_xact</code>. The CLOG is on disk (or maybe cached in memory), and checking it requires I/O. If we had to check the CLOG for every tuple we examine during a query, performance would be terrible.</p>
<p>Hint bits cache this information directly in the tuple. The first time someone checks whether transaction 1000 is committed, they look it up in the CLOG. If it's committed, they set the <code>HEAP_XMIN_COMMITTED</code> bit in the tuple's <code>t_infomask</code> and mark the page dirty. From that point on, anyone who looks at this tuple sees the hint bit and knows immediately that transaction 1000 is committed, without having to touch the CLOG.</p>
<p>This has a fascinating consequence: a SELECT query can cause writes. If you run a big INSERT, creating millions of new tuples, and then immediately run a SELECT that scans the table, that SELECT will be the first to check the visibility of each tuple. For every tuple, it will look up the transaction in CLOG (probably finding it committed), set the hint bit, and mark the page dirty. Eventually, those dirty pages get written to disk. Your SELECT just triggered a write of the entire table.</p>
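<p>A back-of-envelope estimate of this write amplification; the rows-per-page figure is a made-up but typical value for a narrow OLTP row:</p>
<pre><code class="lang-python">rows = 10_000_000
rows_per_page = 100              # hypothetical: a narrow row, ~80 bytes each
PAGE_SIZE = 8192

pages = -(-rows // rows_per_page)            # pages the SELECT must visit
mb_dirtied = pages * PAGE_SIZE / 1024**2     # all of them dirtied by hint-bit updates
print(pages, mb_dirtied)                     # 100000 pages, 781.25 MB
</code></pre>
<p>Under these assumptions, the first scan after the bulk load rewrites roughly 780 MB just to record hint bits.</p>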
<p>The solution is to run VACUUM immediately after a bulk insert. VACUUM will proactively set all the hint bits, so subsequent queries won't have to.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> hint_demo (<span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> hint_demo <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>);

<span class="hljs-keyword">SELECT</span> t_infomask <span class="hljs-keyword">FROM</span> heap_page_items(get_raw_page(<span class="hljs-string">'hint_demo'</span>, <span class="hljs-number">0</span>));
</code></pre>
<p>You might see <code>t_infomask = 2818</code>, which is <code>0x0B02</code> in hex. Let's decode that:</p>
<pre><code class="lang-txt">Binary: 0000 1011 0000 0010
Bit 1: HEAP_HASVARWIDTH (set)
Bit 8: HEAP_XMIN_COMMITTED (set)
Bit 9: HEAP_XMIN_INVALID (set)
Bit 11: HEAP_XMAX_INVALID (set)
</code></pre>
<p>Wait, both <code>HEAP_XMIN_COMMITTED</code> and <code>HEAP_XMIN_INVALID</code> are set? That seems contradictory. It isn't: since PostgreSQL 9.4, this particular combination is defined as <code>HEAP_XMIN_FROZEN</code>, meaning the tuple has been frozen—it is treated as committed and visible to every transaction, no matter how old its <code>t_xmin</code> is. <code>HEAP_XMIN_INVALID</code> on its own, by contrast, means the creating transaction aborted, so the tuple is garbage and will never be visible to anyone.</p>
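<p>You can reproduce this decoding yourself. The mask values below come from PostgreSQL's <code>htup_details.h</code> (only the flags discussed above are listed):</p>
<pre><code class="lang-python">FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
}

def decode_infomask(mask):
    return [name for bit, name in FLAGS.items() if mask & bit]

print(decode_infomask(2818))
# ['HEAP_HASVARWIDTH', 'HEAP_XMIN_COMMITTED', 'HEAP_XMIN_INVALID', 'HEAP_XMAX_INVALID']
</code></pre>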
<p><strong>t_infomask2</strong></p>
<p>There's a second infomask field, <code>t_infomask2</code>, which is also 2 bytes. The lower 11 bits store the number of attributes (columns) in this tuple, allowing up to 2047 columns per table. The upper bits are flags related to HOT updates: <code>HEAP_HOT_UPDATED</code> (bit 14) and <code>HEAP_ONLY_TUPLE</code> (bit 15).</p>
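<p>Unpacking <code>t_infomask2</code> is the same kind of bit arithmetic, using the masks from <code>htup_details.h</code>; the value decoded here is hypothetical:</p>
<pre><code class="lang-python">HEAP_NATTS_MASK  = 0x07FF   # low 11 bits: number of attributes
HEAP_HOT_UPDATED = 0x4000   # the next version of this row is a HOT tuple
HEAP_ONLY_TUPLE  = 0x8000   # this tuple has no direct index entry

def decode_infomask2(mask):
    return (mask & HEAP_NATTS_MASK,
            bool(mask & HEAP_HOT_UPDATED),
            bool(mask & HEAP_ONLY_TUPLE))

# hypothetical value: a 3-column heap-only tuple
print(decode_infomask2(0x8003))   # (3, False, True)
</code></pre>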
]]></content:encoded></item><item><title><![CDATA[PostgreSQL Page Structure - (Slotted Pages)]]></title><description><![CDATA[From the previous blog we knew that tables are made up of 8KB pages, lets crack open a page and see what's really inside it. This is where the things get interesting because jargons like MVCC, tuple storage, visiblity rules, freespace is dependent on...]]></description><link>https://rabindranath-tiwari.com.np/postgresql-page-structure-slotted-pages</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/postgresql-page-structure-slotted-pages</guid><category><![CDATA[slotted pages]]></category><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Wed, 18 Feb 2026 15:45:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771429478968/c6aa0249-fc0b-48db-bb61-f20d425be385.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>From the previous blog we know that tables are made up of 8KB pages, so let's crack open a page and see what's really inside it. This is where things get interesting, because concepts like MVCC, tuple storage, visibility rules, and free space all depend on how pages work.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771429358182/483b3533-f5eb-4cdc-a368-270a70e0c86b.png" alt /></p>
<p>This two-ended growth pattern is elegant. Line pointers grow down from the top, tuples grow up from the bottom, and free space is always the contiguous region in the middle. Two pointers in the page header—<code>pd_lower</code> and <code>pd_upper</code>—track the boundaries. Free space is simply <code>pd_upper - pd_lower</code>. Want to know if a new tuple fits? It's an O(1) subtraction.</p>
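<p>The free-space check really is a single subtraction. A sketch with hypothetical <code>pd_lower</code>/<code>pd_upper</code> values (24 bytes is the fixed page header size; alignment padding is ignored here):</p>
<pre><code class="lang-python">PAGE_HEADER = 24                 # fixed PageHeaderData size
LINE_POINTER = 4

def free_space(pd_lower, pd_upper):
    return pd_upper - pd_lower   # O(1): no scanning required

def tuple_fits(pd_lower, pd_upper, tuple_len):
    # a new tuple needs its own bytes plus one new line pointer
    return free_space(pd_lower, pd_upper) >= tuple_len + LINE_POINTER

# hypothetical page: 5 line pointers so far, tuples filling the tail
pd_lower = PAGE_HEADER + 5 * LINE_POINTER   # 44
pd_upper = 7800
print(free_space(pd_lower, pd_upper))        # 7756 bytes free
print(tuple_fits(pd_lower, pd_upper, 8000))  # False: too big for this page
</code></pre>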
<h3 id="heading-description-of-important-headers">Description Of Important Headers</h3>
<hr />
<p><strong>Log Sequence Number (LSN)</strong></p>
<p>The Log Sequence Number comes first, occupying 8 bytes. Every time the page is modified, PostgreSQL writes a record to the WAL (write-ahead log) and assigns it a unique identifier, the LSN. That LSN is copied into the page header. The main reason for this is crash recovery. When PostgreSQL crashes and restarts, it replays the WAL starting from the last checkpoint. For each WAL record, it reads the corresponding page from disk and checks: is the page's LSN less than this WAL record's LSN? If yes, apply the change. If no, skip it—the page already has this change.</p>
<p><strong>Checksum</strong></p>
<p>The checksum header occupies 2 bytes. If you initialize your cluster with <code>initdb --data-checksums</code>, PostgreSQL calculates a checksum over the entire page every time it writes the page to disk. When reading the page back, it recalculates and compares; if they don't match, PostgreSQL raises a data-corruption error instead of silently returning bad data. The main downside is roughly 2-5 percent of CPU overhead for the checksum calculation and comparison.</p>
<p><strong>Flag</strong></p>
<p>The flags header also occupies 2 bytes. It is a bitmap of boolean properties of the page. The most important flag is <code>PD_ALL_VISIBLE</code>, which mirrors the visibility map and tells us that every tuple in this page is visible to all transactions.</p>
<p><strong>pd_lower and pd_upper</strong></p>
<p>Each of these fields takes 2 bytes. <code>pd_lower</code> points to the end of the line pointer array, which is the start of the free space; <code>pd_upper</code> points to the start of the tuple area, which is the end of the free space. When a new tuple is inserted, <code>pd_upper</code> moves toward the start of the page, while <code>pd_lower</code> moves toward the end (higher addresses) as line pointers are added.</p>
<p><strong>pd_special field</strong></p>
<p>The pd_special (2 bytes) points to a special area at the end of the page. For heap pages, this is unused and just points to byte 8192 (the end of the page). But for index pages, it points to index-specific metadata. B-tree pages, for example, store pointers to left and right sibling pages in the special area, plus the tree level. This allows the page header format to be generic while still supporting specialized needs.</p>
<p><strong>pd_pagesize_version</strong></p>
<p>The pd_pagesize_version field (2 bytes) encodes both the page size and the page layout version. This is important for pg_upgrade and for debugging. If PostgreSQL loads a page and sees an unexpected version number, it knows the page might have been written by an older or newer version with a different tuple layout. This prevents silent corruption when mixing versions.</p>
<p><strong>pd_prune_xid</strong></p>
<p>Finally, pd_prune_xid (4 bytes) stores the oldest transaction ID that deleted or updated a tuple on this page. This is used for HOT (Heap-Only Tuple) pruning, which we'll cover in depth in the MVCC module. For now, just know that it's an optimization hint that helps PostgreSQL decide whether it's worth trying to clean up dead tuples on this page.</p>
<h2 id="heading-line-pointers-the-indirection-layer">Line Pointers : The Indirection Layer</h2>
<hr />
<p>Each line pointer is exactly 4 bytes and points to some tuple within the page. This level of indirection is absolutely fundamental to how PostgreSQL works.</p>
<p>When I first learned this concept in university, I was overwhelmed by the slotted array (the line pointer array) and everything around it. The intuition finally clicked after I took the database course by Andy Pavlo (CMU). Okay, now to the question: why indirection? Consider what happens without it. An index entry for a B-tree index contains a pointer to a heap tuple. If that pointer is a direct byte offset into the page—say, "the tuple is at byte 1000"—then what happens if we need to move the tuple within the page? Maybe we're doing a HOT update, or maybe we're compacting the page to defragment free space. If we move the tuple, we have to update every index entry that points to it. That's expensive and complex.</p>
<p>Line pointers solve this. An index entry doesn't point to byte 1000. It points to "page 5, line pointer 3." The line pointer itself points to byte 1000. If we need to move the tuple, we just update the line pointer. All the indexes remain valid without any changes. This is why line pointers are sometimes called "item pointers" or "item IDs"—they're stable identifiers for tuples.</p>
<h3 id="heading-each-line-pointer-packs-three-pieces-of-information">Each Line Pointer Packs Three Pieces of Information</h3>
<pre><code class="lang-c"><span class="hljs-keyword">typedef</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">ItemIdData</span> {</span>
    <span class="hljs-keyword">unsigned</span> lp_off:<span class="hljs-number">15</span>;     <span class="hljs-comment">/* Offset to tuple (0-32767) */</span>
    <span class="hljs-keyword">unsigned</span> lp_flags:<span class="hljs-number">2</span>;    <span class="hljs-comment">/* Status flags (4 states) */</span>
    <span class="hljs-keyword">unsigned</span> lp_len:<span class="hljs-number">15</span>;     <span class="hljs-comment">/* Tuple length (0-32767) */</span>
} ItemIdData;
</code></pre>
<p><code>lp_off</code> is an offset that stores the starting byte of the tuple within the page.</p>
<p><code>lp_len</code> stores the length of the tuple in bytes.</p>
<p><code>lp_flags</code> is where things get interesting. With 2 bits we can represent 4 different states.</p>
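<p>You can mimic the bitfield packing in Python to see how all three fields share a single 4-byte word. The exact bit order in the C struct is compiler-dependent; this sketch simply puts <code>lp_off</code> in the low 15 bits:</p>
<pre><code class="lang-python"># assumed layout: lp_off in bits 0-14, lp_flags in bits 15-16, lp_len in bits 17-31
OFF_BITS = 2**15
FLAG_BITS = 2**17

def pack(lp_off, lp_flags, lp_len):
    return lp_off + (lp_flags % 4) * OFF_BITS + (lp_len % 2**15) * FLAG_BITS

def unpack(item):
    return item % OFF_BITS, (item // OFF_BITS) % 4, item // FLAG_BITS

LP_NORMAL = 1
item = pack(7800, LP_NORMAL, 36)   # a 36-byte tuple stored at byte offset 7800
print(unpack(item))                # (7800, 1, 36)
</code></pre>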
<h3 id="heading-lpflags-internals">lp_flags internals</h3>
<p><code>LP_UNUSED</code> (0) means this line pointer has never been used, or it was used and then reclaimed by VACUUM. It is available for reuse.</p>
<p><code>LP_NORMAL</code> (1) means this line pointer points to an actual tuple. The tuple might be a live version or a dead one waiting for <code>VACUUM</code>; we cannot tell from the line pointer alone—we have to look at the header of that particular tuple.</p>
<p><code>lp_redirect</code> (2) is special. It means this line pointer doesn't point to tuple at all. Instead, it points to another line pointer. This happens during HOT updates. When a tuple is updated in a way that doesn't affect any indexed columns, PostgreSQL can create the new version on the same page and set up a redirect: "The tuple you're looking for is now at line pointer 5." Indexes still point to the original line pointer, and they automatically follow the redirect. This saves having to update every index.</p>
<p><code>lp_dead</code> (3) means the tuple this line pointer used to point to is definitely dead—no transaction can see it anymore—but VACUUM hasn't reclaimed the space yet. This is useful during index scans. If an index points to a dead line pointer, we can immediately skip it without having to fetch and examine the tuple.</p>
<p>Understanding line pointer states is essential for understanding tuple lifecycle and MVCC. A common mistake is thinking that when you delete a row, it's immediately removed. It's not. The tuple stays right where it is. The line pointer might eventually be marked LP_DEAD, but even then, the tuple data is still there, occupying space, until VACUUM runs.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> EXTENSION pageinspect;
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> lp_demo (<span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>, <span class="hljs-keyword">data</span> <span class="hljs-built_in">TEXT</span>);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> lp_demo <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'version1'</span>);

<span class="hljs-keyword">SELECT</span> lp, lp_off, lp_flags, lp_len 
<span class="hljs-keyword">FROM</span> heap_page_items(get_raw_page(<span class="hljs-string">'lp_demo'</span>, <span class="hljs-number">0</span>));
</code></pre>
<h3 id="heading-again">Again</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">UPDATE</span> lp_demo <span class="hljs-keyword">SET</span> <span class="hljs-keyword">data</span> = <span class="hljs-string">'version2'</span> <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">id</span> = <span class="hljs-number">1</span>;

<span class="hljs-keyword">SELECT</span> lp, lp_off, lp_flags, lp_len 
<span class="hljs-keyword">FROM</span> heap_page_items(get_raw_page(<span class="hljs-string">'lp_demo'</span>, <span class="hljs-number">0</span>));
</code></pre>
<p>Now observe the differences between the two outputs. In the next blog we will get our hands dirty again, this time with the tuple layout.</p>
]]></content:encoded></item><item><title><![CDATA[Postgres OID VS Relfilenode]]></title><description><![CDATA[When you create a table, PostgreSQL assigns it an OID (Object Identifier). This is just a logical identifier for the table. it is stored in the system catalog pg_class and remains constant for the lifetime of the table.
But under the hood in physical...]]></description><link>https://rabindranath-tiwari.com.np/postgres-oid-vs-relfilenode</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/postgres-oid-vs-relfilenode</guid><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Wed, 18 Feb 2026 12:58:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771419475567/0c5de8f0-feff-48b8-af87-152f1a362892.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you create a table, PostgreSQL assigns it an OID (Object Identifier). This is just a logical identifier for the table. it is stored in the system catalog <code>pg_class</code> and remains constant for the lifetime of the table.</p>
<p>But under the hood, the physical file on disk is identified by the <code>relfilenode</code>, which is also stored in the system catalog <code>pg_class</code>. The good news is that most of the time the OID and the relfilenode are the same. But they can diverge, and understanding when and why they diverge is the crucial part.</p>
<h3 id="heading-try-this-experiment">Try This Experiment</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> demo(
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>,
    <span class="hljs-keyword">data</span> <span class="hljs-built_in">TEXT</span>
);

<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">oid</span>, relfilenode <span class="hljs-keyword">FROM</span> pg_class <span class="hljs-keyword">WHERE</span> relname = <span class="hljs-string">'demo'</span>;
</code></pre>
<h3 id="heading-again">Again</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">TRUNCATE</span> demo;

<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">oid</span>, relfilenode <span class="hljs-keyword">FROM</span> pg_class <span class="hljs-keyword">WHERE</span> relname = <span class="hljs-string">'demo'</span>;
</code></pre>
<p>Now compare with the earlier output: the OID stayed the same, but the relfilenode changed. Why?</p>
<p>When you truncate a table, PostgreSQL doesn't actually delete all the rows from the existing file. Instead, it creates a brand new file with a new relfilenode and updates the catalog to point to it. The old file is deleted. This is much faster than scanning through the file and marking every tuple as deleted, and it's safer from a crash-recovery perspective—the old file exists until the transaction commits.</p>
<p>The same thing happens with <code>VACUUM FULL</code>, <code>CLUSTER</code>, and <code>REINDEX</code> (for indexes). These operations rewrite the entire table or index, giving it a new physical file. But the OID never changes. This separation between logical identity (OID) and physical storage (relfilenode) allows PostgreSQL to reorganize data without breaking foreign key constraints, views, or permissions, all of which reference the OID.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Heap File Storage]]></title><description><![CDATA[When you execute INSERT INTO users VALUES (1, 'Alice') in PostgreSQL, what actually happens on disk? Where does that data go? How is it organized? Why does a simple SELECTsometimes cause disk writes? These aren't just academic questions—they're the f...]]></description><link>https://rabindranath-tiwari.com.np/understanding-heap-file-storage</link><guid isPermaLink="true">https://rabindranath-tiwari.com.np/understanding-heap-file-storage</guid><category><![CDATA[postgres optimization]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[heap]]></category><category><![CDATA[Postgresql-performance ]]></category><dc:creator><![CDATA[Rohan Tiwari]]></dc:creator><pubDate>Wed, 18 Feb 2026 12:01:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771416013194/b20d04bc-4131-41e9-b86a-a747a6fc3deb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you execute <code>INSERT INTO users VALUES (1, 'Alice')</code> in PostgreSQL, what actually happens on disk? Where does that data go? How is it organized? Why does a simple <code>SELECT</code> sometimes cause disk writes? These aren't just academic questions—they're the foundation for understanding everything from why VACUUM exists to how indexes work to why your queries might be slower than expected.</p>
<p>This module is about pulling back the curtain on PostgreSQL's storage engine. We're going to look at the raw bytes on disk, understand how pages are structured, and see exactly how tuples are laid out in memory. By the end, you'll be able to inspect pages with surgical precision and understand the physical reality behind every SQL operation.</p>
<p>Let's start at the very beginning: the heap.</p>
<h2 id="heading-what-the-heck-is-heap-anyway">What the heck is a heap anyway?</h2>
<p>The term "heap" in database terminology doesn't mean anything fancy—it literally means "pile." When PostgreSQL stores your table data, it piles it up in no particular order. This is fundamentally different from how some other databases work, and understanding this difference is crucial.</p>
<p>Imagine you have a table called <code>users</code> with a primary key on <code>id</code>. In MySQL's InnoDB engine, the actual table data is physically sorted by that primary key. Insert a row with <code>id=100</code>, then <code>id=50</code>, then <code>id=200</code>, and InnoDB will rearrange them on disk to be in order: 50, 100, 200. The table itself is structured as a B-tree sorted by the primary key.</p>
<p>PostgreSQL doesn't do this. When you insert those same three rows, they go onto disk in exactly the order you inserted them: 100, 50, 200. The primary key index is separate—it's just another index that happens to enforce uniqueness. The table itself? Just a heap. A pile of tuples with no inherent order.</p>
<p>This design choice has profound implications. It makes inserts fast because there's no need to find the "right" place to put new data—just throw it wherever there's space. It makes updates more flexible because moving a row doesn't require reshuffling an entire tree structure. But it also means that scanning a table by primary key order requires random I/O, since the rows aren't physically sorted.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771415452587/924afd5e-ffc8-4589-8fd6-8d8df7fed70f.png" alt /></p>
<h2 id="heading-where-it-is-stored-then">Where is it stored then?</h2>
<p>Postgres stores all of your data inside the directory pointed to by the <code>$PGDATA</code> environment variable. Use the following command to see where it is in your case:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-variable">$PGDATA</span>
</code></pre>
<p>Each table's main data file then lives at a path of the form:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$PGDATA</span>/base/{database_oid}/{relfilenode}
</code></pre>
<h2 id="heading-experiment-to-find-your-table-physical-location">Experiment to find your table physical location</h2>
<hr />
<pre><code class="lang-sql"><span class="hljs-comment">-- create demo table</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">test</span>(
    <span class="hljs-keyword">id</span> <span class="hljs-built_in">INT</span>,
    <span class="hljs-keyword">data</span> <span class="hljs-built_in">TEXT</span>
);

<span class="hljs-comment">-- find the filepath</span>
<span class="hljs-keyword">SELECT</span> pg_relation_filepath(<span class="hljs-string">'test'</span>) <span class="hljs-keyword">as</span> physical_location,
       pg_relation_size(<span class="hljs-string">'test'</span>) <span class="hljs-keyword">as</span> size_in_bytes;
</code></pre>
<h2 id="heading-output-interpretation">Output Interpretation</h2>
<p>You might see something like <code>base/16384/24601</code>. This means your database has OID 16384, and this particular table has been assigned relfilenode 24601. If you navigate to <code>$PGDATA/base/16384/</code>, you'll find a file named <code>24601</code>. That's your table. That file contains all the rows you've inserted, organized into 8KB chunks called pages.</p>
<p>Why 8KB? This is a configurable compile-time option, but 8192 bytes is the default, and there are good reasons for it. It's large enough to hold a reasonable number of rows (typically 50-200 for OLTP workloads) but small enough that reading a page from disk is a single, efficient I/O operation. It aligns well with operating system page sizes, which reduces translation lookaside buffer (TLB) misses and makes memory management more efficient. It's been the default for decades because it represents a good compromise for mixed workloads.</p>
<p>Each table file can grow up to 1GB. Once it exceeds that size, PostgreSQL creates a new segment file named <code>24601.1</code>, then <code>24601.2</code>, and so on. This segmentation has historical roots—old filesystems had strict file size limits—but it's still useful today for operational reasons. Smaller files are easier to copy, back up, and manage. They also allow for some parallelism in I/O operations.</p>
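<p>Locating which segment file holds a given page is simple arithmetic. A sketch, assuming the default 8KB page size and 1GB segment size (both are configurable at build time), and reusing the relfilenode from the example above:</p>
<pre><code class="lang-python">PAGE_SIZE = 8192
SEGMENT_SIZE = 1024**3                          # 1 GB default
PAGES_PER_SEGMENT = SEGMENT_SIZE // PAGE_SIZE   # 131072 pages per segment

def segment_file(relfilenode, page_no):
    seg = page_no // PAGES_PER_SEGMENT
    return str(relfilenode) if seg == 0 else f"{relfilenode}.{seg}"

print(segment_file(24601, 0))        # '24601'
print(segment_file(24601, 131072))   # '24601.1': first page of the second gigabyte
print(segment_file(24601, 400000))   # '24601.3'
</code></pre>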
<h2 id="heading-file-organization-on-disk">File Organization on Disk</h2>
<p><a target="_blank" href="https://www.google.com/url?sa=t&amp;source=web&amp;rct=j&amp;url=https%3A%2F%2Flink.springer.com%2Fchapter%2F10.1007%2F979-8-8688-1507-2_3&amp;ved=0CBYQjRxqFwoTCJDuxPD04pIDFQAAAAAdAAAAABBY&amp;opi=89978449"><img src="https://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F979-8-8688-1507-2_3/MediaObjects/635661_1_En_3_Fig2_HTML.jpg" alt="PostgreSQL Physical Structures | Springer Nature Link" /></a></p>
<p>In the next blog we will dig into those files and the tuple structure in real depth.</p>
]]></content:encoded></item></channel></rss>