What Is a Lakehouse Archive: The Modern Data Archiving Architecture

Andrew Marsh
•
June 11, 2026

Key Points

Database-centric archiving was built for relational data and single-source retrieval, creating sprawl when your estate is now 60% unstructured and multi-application.
Lakehouse-native archives address four architectural failures: they handle all data formats, enable source system retirement, scale on object storage economics, and serve AI directly.
WORM immutability only becomes defensible when enforced at ingestion, not applied after the fact via permissions that create vulnerability windows.
Retrieval independence is the hard requirement that separates archived data from hostaged data. Your archive must be able to outlive the source system.
Consolidating structured and unstructured archives into one Lakehouse foundation cuts governance complexity and enables policy-driven retention across the entire estate.

Most of what the enterprise calls “archiving” is a database doing a job it was never designed to do.

For thirty years, the answer to “where does old data go?” was the same: move the rows out of the production database and into another database.

A cheaper one, a colder one, a compressed one, but a relational database all the same. The schema came along for the ride. The vendor lock came along for the ride. And the assumption underneath it all (that enterprise data is rows and columns, that retrieval means SQL, that the source application will be around to make sense of the bytes) came along for the ride too.

That assumption is now wrong in at least four expensive ways. And if you are the architect, CTO, or CDO who signs off on the next decade of data infrastructure, the gap between “database archiving” and what your estate actually needs is about to become your problem, not your predecessor’s.

This guide will walk through why the Lakehouse, a pattern most people still file under “analytics,” turns out to be the only architecture that handles the modern archive’s full job description: structured and unstructured data, queryable by both humans and machines, immutable by design, and scalable without a re-architecture every time the estate changes shape.

Let’s start by being precise about what archiving was originally built to do, because the design constraints from that era explain almost everything that’s broken about it now.

Archiving Was Designed for a World That No Longer Exists

Database-centric archiving was a brilliant solution to a specific 1990s and 2000s problem: production OLTP databases were getting fat, expensive, and slow, and most of the rows weren’t being touched.

The classic 80/20 held — a large fraction of stored rows were never queried after the first 90 days, yet they sat on the same expensive, high-performance tier as live transactional data, dragging down backup windows, recovery times, and licensing costs.

The fix was elegant for its time. Identify the inactive rows by business rules — closed orders, terminated employees, completed claims. Extract them. Move them to a secondary relational store on cheaper infrastructure. Leave a pointer or a retrieval mechanism behind so that, on the rare occasion someone needed a 2003 invoice, you could fetch it back.

Tools like SAP’s Archive Development Kit and ILM, Informatica’s ILM, OpenText’s structured archiving, and the various database-tiering products were all variations on this theme.

Note the design assumptions baked into that approach, because every one of them has since failed:

The data is relational. Archiving meant preserving table structures, foreign keys, and the relational model. Anything that wasn’t rows and columns (documents, images, scanned correspondence, log streams, sensor data, email) was somebody else’s problem, parked in a separate content management system with its own retention regime and its own retrieval pain.
Retrieval means SQL against the original schema. To read the archive, you needed to understand the source schema. Which meant you needed the source application’s data dictionary, or in the worst cases the source application itself, alive and licensed, to make the archived data legible.
Scale means more database. Need to hold more? Provision more relational storage, more compute, more licenses. The cost curve was linear at best and frequently worse, because relational engines were never cheap per terabyte.
The archive is a destination, not an asset. Archived data was something you stored against the possibility of audit or litigation. It was a liability you managed, not a corpus you used.

Every one of those assumptions is now a constraint you’re paying for. The estate stopped being relational. Retrieval stopped being a SQL-only problem. Scale stopped being affordable when measured in relational terabytes.

And (this is the part most archiving vendors are still pretending isn’t happening) old data stopped being inert. AI made it an asset, and the relational archive can’t hand it over in any form an AI engine can consume.

The Four Ways Database Archiving Breaks Now

1. The estate isn’t relational anymore and the archive can’t see most of it

Walk the data estate of any large enterprise and count the formats. There’s the structured core, yes — SAP, Oracle EBS, the HRIS, the claims system.

But there’s at least as much value, and far more volume, in everything that isn’t a tidy table: contracts and signed PDFs, clinical documents, email and chat that has to be retained for compliance, machine and application logs, IoT telemetry, images, recordings, and the semi-structured exhaust of every SaaS application you’ve adopted in the last decade.

A database archive, by construction, handles the first category and shrugs at the rest. So enterprises end up with a portfolio of archives — a relational tool for the ERP, a content archive for documents, a separate journaling product for email, a log platform for the machine data — each with its own retention engine, its own legal-hold mechanism, its own search interface, its own audit trail, and its own renewal invoice. The “archive” is not one system you can reason about. It’s a sprawl you can’t.

The architectural failure here is not that any single tool is bad. It’s that the relational model was the wrong organizing principle for the archive.

The moment your retention obligations spanned structured and unstructured data (which is to say, the moment you had a single regulation that touched both a transaction record and the document supporting it), the database-centric archive stopped being able to represent the thing you actually needed to govern.

2. Retrieval depends on a source system you’re trying to retire

Here is the quiet absurdity at the centre of most ERP and application decommissioning programmes. The entire point of decommissioning a legacy system — say, retiring ECC after the move to S/4HANA, or sunsetting a payroll platform after a Workday cutover — is to stop paying to run it. But if your archive of that system’s data is a relational extract that can only be interpreted through the source application’s schema, then you haven’t actually decommissioned anything. You’ve kept the corpse on life support, so you can still read the will.

This is the difference between application retirement and application archiving that enables retirement. A genuine enterprise archive must be independently retrievable: the data, its structure, its business context, and its access controls must live in the archive itself, not be reconstituted by reaching back into a system you intended to switch off.

Database-centric approaches routinely fail this test, because they preserve rows but not the self-describing context that makes those rows legible without the original engine.

If your archiving strategy still requires the source system to be queryable to make sense of the archive, you do not have an archive. You have a hostage situation.
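
To make “independently retrievable” concrete, here is a minimal sketch, assuming a hypothetical JSON bundle layout. The field names (`schema`, `business_context`, `access`) and the invoice example are illustrative only and are not taken from any product. The point is that the batch carries its own column definitions, meaning, and access rules, so a reader needs nothing from the source system.

```python
import json

def build_bundle(table, columns, rows, context, allowed_roles):
    """Package rows together with everything needed to interpret them later."""
    return json.dumps({
        "table": table,
        "schema": columns,              # name, type, and human-readable meaning
        "business_context": context,    # what the table represented in the source app
        "access": {"allowed_roles": allowed_roles},
        "rows": rows,
    })

def read_bundle(bundle_text):
    """Interpret a bundle using only what the bundle itself contains."""
    bundle = json.loads(bundle_text)
    names = [col["name"] for col in bundle["schema"]]
    return [dict(zip(names, row)) for row in bundle["rows"]]

# Hypothetical invoice extract; table and column names are made up for illustration.
bundle = build_bundle(
    table="INVOICE_HEADER",
    columns=[
        {"name": "invoice_id", "type": "string", "meaning": "Invoice number"},
        {"name": "amount", "type": "decimal", "meaning": "Gross amount"},
        {"name": "currency", "type": "string", "meaning": "ISO 4217 currency code"},
    ],
    rows=[["INV-0001", "1250.00", "EUR"]],
    context="Header record for customer invoices in the retired billing system",
    allowed_roles=["finance-audit"],
)
records = read_bundle(bundle)
print(records[0]["amount"])
```

Nothing in `read_bundle` reaches outside the bundle, which is the property the switch-off test is checking for.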

3. Scale is priced in the wrong currency

Relational storage is expensive per terabyte, and archives are, by definition, the largest data sets you own — they only ever grow. Pricing the long tail of your data in relational-database terms is like renting climate-controlled vault space to store the contents of your loft. The economics were always strained; at petabyte scale they break entirely.

There’s a second, subtler scaling failure. In the database-centric model, storage and compute are coupled. You provision capacity for an archive you query a handful of times a year as if you might query it constantly. You pay for the engine to be ready even when it’s idle.

The architecture has no concept of “store cheaply forever, spin up compute only on the rare occasion someone asks a question.” That decoupling, with storage and compute as separate, independently priced layers, is exactly what the relational era couldn’t offer and what the modern estate desperately needs.
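
The shape of that difference can be sketched as simple arithmetic. Every figure below is a placeholder chosen only to illustrate the calculation, not a quoted price or a benchmark, and the function names are hypothetical.

```python
def coupled_annual_cost(terabytes, engine_cost_per_tb_year):
    """Coupled model: the engine is provisioned for all data, all year."""
    return terabytes * engine_cost_per_tb_year

def decoupled_annual_cost(terabytes, storage_cost_per_tb_year,
                          queries_per_year, compute_cost_per_query):
    """Decoupled model: storage is always paid, compute only per query."""
    storage = terabytes * storage_cost_per_tb_year
    compute = queries_per_year * compute_cost_per_query
    return storage + compute

# Placeholder inputs (assumptions): a 500 TB archive queried 40 times a year.
TB = 500
coupled = coupled_annual_cost(TB, engine_cost_per_tb_year=2000)
decoupled = decoupled_annual_cost(TB, storage_cost_per_tb_year=250,
                                  queries_per_year=40, compute_cost_per_query=50)
print(coupled, decoupled)
```

In the decoupled model the compute term scales with how often questions are asked rather than with how much data is held, so it stays small for an archive that is rarely queried.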

4. AI can’t read a relational archive and that’s where the value moved

This is the one that should keep architects up at night, because it inverts the entire premise of archiving. For thirty years, archived data was dead weight: stored against risk, retrieved under duress. Then large language models and retrieval-augmented generation turned the historical corpus into the single most valuable training and grounding asset the enterprise owns.

Decades of contracts, claims, correspondence, decisions, and transactions (the institutional memory) became something you’d want to query semantically, ground a model on, and mine for patterns. A relational archive cannot participate in that.

To make archived relational data available to an AI engine, you have to extract it, transform it, re-format it, and load it somewhere the model can reach, at which point you’ve built a second pipeline on top of the archive you already built, and you’ve duplicated the data you were trying to consolidate. Unstructured archives in proprietary content formats are even worse: the AI can’t see them at all without bespoke extraction.

The architecture that wins, then, is the one where the archive is already in an open, analytics-ready, AI-readable form the moment data lands in it, not a destination you have to re-export from to do anything useful. Which brings us, finally, to the Lakehouse.

What a Lakehouse Actually Is

The term “Lakehouse” has been marketed half to death, so let’s strip it back to the architecture, because the architecture is the entire argument.

A Lakehouse is what you get when you take the cheap, open, infinitely-scalable storage of a data lake and add the transactional integrity, schema management, and governance of a data warehouse — without copying the data into a separate proprietary warehouse engine to get those guarantees. It rests on three technical pillars that matter enormously for archiving:

Open columnar storage on object stores. Data lives as open file formats (Parquet, ORC) on commodity object storage (S3, ADLS, GCS, or on-prem equivalents). This is the cheapest durable storage tier available; it scales to exabytes without re-architecture, and crucially, the format is open: any engine that speaks Parquet can read it. No vendor owns your bytes.
An open table format providing ACID transactions and metadata. On top of those files sits an open table layer (Delta Lake, Apache Iceberg, or Apache Hudi) that does the thing data lakes historically couldn’t: it provides ACID transactions, schema enforcement and evolution, and a transaction log over data sitting in object storage. This is what turns an ungoverned “data swamp” into something with warehouse-grade integrity. It also provides time travel: the ability to query the state of a table as of a previous point in time, which is a property that should make any records manager sit up straight.
Decoupled, elastic compute. Storage and compute are separate. The data sits at rest, costing almost nothing, in open format. Compute — SQL engines, Spark, vector search, an AI model — is brought to the data on demand and scaled independently. You store everything forever and pay for processing only when a question is actually asked.
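
The time-travel idea in the second pillar can be illustrated with a toy model. This is a sketch of the concept only, not how Delta Lake, Iceberg, or Hudi are implemented. The class and file names are hypothetical, and real table formats also track removals, schemas, and statistics.

```python
class VersionedTable:
    """Toy table whose state is defined by an append-only commit log."""

    def __init__(self):
        self._log = []  # each entry: (version, list of data file names added)

    def commit(self, added_files):
        """Record a new table version that adds data files."""
        version = len(self._log) + 1
        self._log.append((version, list(added_files)))
        return version

    def files_as_of(self, version):
        """Time travel: the data files visible at a given version."""
        visible = []
        for committed_version, added in self._log:
            if committed_version <= version:
                visible.extend(added)
        return visible

table = VersionedTable()
v1 = table.commit(["part-0001.parquet"])
v2 = table.commit(["part-0002.parquet", "part-0003.parquet"])
print(table.files_as_of(v1))
print(table.files_as_of(v2))
```

Because earlier log entries are never rewritten, asking for the table as of `v1` after `v2` has been committed still returns exactly what was visible at `v1`.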

That third pillar quietly solves the scaling failure of the relational archive. The first two pillars solve the independence and the AI-readability problems: open formats mean any engine, including a future one you haven’t bought yet, can read the archive, and the self-describing table metadata means the data is legible without the source application.

But (and this is the distinction most “we do Lakehouse” vendors gloss over) a Lakehouse is not an archive. A Lakehouse is an analytics platform. It is built to be written, updated, deleted, and re-shaped constantly.

An archive needs almost the opposite properties. So the interesting architectural question isn’t “is a Lakehouse good?” It’s “what do you have to add to a Lakehouse to make it a defensible archive?” That’s where the design gets genuinely interesting, and where most of the market simply isn’t.

A Lakehouse Is Not an Archive — Until You Add the Things an Archive Needs

Take the open, scalable, AI-readable Lakehouse foundation. Now ask what an archive — in the legal, regulatory, evidentiary sense — actually requires that an analytics Lakehouse does not provide out of the box:

Immutability at ingestion, not after. An archive’s defensibility rests on the data being unalterable from the moment it lands. Not “we set permissions, so people probably won’t change it.” Write-Once-Read-Many (WORM) must be enforced at the storage layer and applied at the point of data ingestion, so the record is immutable before anyone touches it. A standard Lakehouse table is mutable by design; you can update and delete rows. An archive must lock that down at the bottom of the stack.
Policy-driven retention and disposition. The archive must know, per record class, how long data must be kept, when it must be destroyed, and must execute defensible disposition automatically — including holding records past their schedule when a legal hold demands it. This is retention orchestration, not a TTL setting.
Legal hold that overrides retention. When litigation or investigation is reasonably anticipated, the relevant records must be frozen — exempt from disposition — regardless of their retention schedule, with the hold itself auditable. An analytics platform has no concept of this.
Chain of custody and evidentiary integrity. Append-only audit logs of every access and every policy action. Cryptographic hashing of records so tampering is detectable. Trusted timestamps establishing when a record existed in a given state. Optional notarization or ledger anchoring so the integrity claim doesn’t rest on the vendor’s word. These are the properties that align an archive with evidentiary principles, the kind of integrity standards seen in eIDAS, QTSP, and ETSI contexts, rather than merely “we kept a copy.”
Cross-application, content-agnostic search. Because the archive holds structured and unstructured data from across the estate, search has to span all of it to find every record relating to a person, a matter, or a transaction regardless of which retired system it originated in, without knowing that system’s schema.
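
The interaction between retention, disposition, and legal hold described in the list above can be sketched as a small decision function. This is an illustrative model under stated assumptions: the retention schedule, record classes, and action labels are hypothetical, and a real archive would drive them from governed policy rather than a dictionary.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Record:
    record_id: str
    record_class: str
    created: date

# Hypothetical retention schedule in years, per record class.
RETENTION_YEARS = {"invoice": 10, "hr_file": 7}

def add_years(d, years):
    """Add whole years; 29 February rolls forward to 1 March if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return date(d.year + years, 3, 1)

def decide(record, today, held_ids, audit_log):
    """Return the action for a record; a legal hold always wins over the schedule."""
    expiry = add_years(record.created, RETENTION_YEARS[record.record_class])
    if record.record_id in held_ids:
        action = "retain (legal hold)"
    elif today >= expiry:
        action = "dispose"
    else:
        action = "retain (within retention)"
    audit_log.append((today.isoformat(), record.record_id, action))
    return action

audit_log = []
today = date(2026, 6, 11)
held = {"INV-1"}  # records frozen for a hypothetical matter
actions = [
    decide(Record("INV-1", "invoice", date(2010, 5, 1)), today, held, audit_log),
    decide(Record("INV-2", "invoice", date(2010, 5, 1)), today, held, audit_log),
    decide(Record("HR-1", "hr_file", date(2022, 1, 15)), today, held, audit_log),
]
print(actions)
```

The hold check comes before the schedule check on purpose: two invoices of the same age get different outcomes solely because one is under hold, and every decision is appended to the audit log whether or not anything is deleted.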

A Lakehouse gives you the foundation: cheap, open, scalable, AI-ready storage with transactional integrity and time travel. A Lakehouse Archive is what you get when you build the governance, immutability, retention, and evidentiary layer into that foundation rather than bolting a relational archive onto the side of it. That is the architecture. Everything else is a database wearing an archive’s badge.

Archon Data Store: The Lakehouse-Native Reference Architecture

This is the architecture Archon Data Store (ADS) is built on and the reason it sits in a different category from the database-centric and content-archiving incumbents.

ADS is a Lakehouse-native archive. Data is ingested into open, columnar table formats on object storage, which means the three Lakehouse properties — open formats, decoupled elastic compute, transactional integrity with time travel — are foundational rather than retrofitted.

On top of that foundation, ADS adds the archive layer that an analytics Lakehouse lacks: WORM and immutability applied at ingestion, policy-driven retention and legal-hold orchestration, append-only logs, cryptographic hashing, trusted timestamps, and notarization/ledger anchoring for evidentiary integrity.

The result is a single archive that is simultaneously cheap to hold at scale, legible without the source system, defensible under audit, and directly queryable by analytics and AI engines.

A few of the concrete capabilities that matter when you’re evaluating this as a reference architecture rather than a pitch:

250+ source connectors. The archive only consolidates the estate if it can ingest from the estate. ADS ships with 250+ connectors spanning ERP, HRIS, CRM, databases, content systems, email, and file sources — structured and unstructured alike — so the long tail of legacy applications can actually be retired into one place rather than archived into a dozen.
1,000+ built-in transformations. Ingesting raw data isn’t enough; it has to be normalized, classified, masked where regulation demands, and rendered self-describing. 1,000+ transformations handle that at ingestion, so what lands in the archive is governed and legible, not a raw dump you’ll have to re-engineer to read in five years.
Cross-application search. Because everything lands in one open, governed corpus, search spans the whole archive — every record about a person, a contract, a claim, a transaction — across every retired system, without needing the schema or the source application of any of them. This is the capability that turns “we kept the data” into “we can answer the regulator in an afternoon.”
WORM / immutability at ingestion. Records are written immutably at the moment they land. The defensibility property isn’t a configuration you hope holds; it’s enforced at the storage layer from ingestion onward, which is the only point at which immutability is actually evidentially meaningful.
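
As a rough illustration of what write-once behaviour at ingestion means, the sketch below models a store that fingerprints a record the moment it lands and refuses any later overwrite. It is an in-memory stand-in with hypothetical names, not a description of how ADS or any object store implements WORM. In practice the guarantee comes from the storage service itself.

```python
import hashlib

class WriteOnceStore:
    """In-memory stand-in for a storage layer with write-once semantics."""

    def __init__(self):
        self._objects = {}  # key -> (payload, sha256 hex digest taken at ingestion)

    def ingest(self, key, payload):
        """Write a record once; any second write to the same key is rejected."""
        if key in self._objects:
            raise PermissionError(f"{key} is already written and cannot be replaced")
        digest = hashlib.sha256(payload).hexdigest()
        self._objects[key] = (payload, digest)
        return digest

    def read(self, key):
        """Return the payload after re-checking it against its ingestion digest."""
        payload, digest = self._objects[key]
        if hashlib.sha256(payload).hexdigest() != digest:
            raise ValueError(f"{key} failed its integrity check")
        return payload

store = WriteOnceStore()
digest = store.ingest("claims/2003/0001", b"claim closed 2003-04-17")

try:
    store.ingest("claims/2003/0001", b"claim reopened")
    overwrite_blocked = False
except PermissionError:
    overwrite_blocked = True

print(overwrite_blocked, store.read("claims/2003/0001"))
```

The digest is computed inside `ingest`, in the same step as the write, so there is no interval in which the record exists without a fingerprint.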

The placement of ADS in this argument is deliberate: it’s the reference implementation of the pattern, not the reason for the pattern. The pattern would be correct even if ADS didn’t exist. ADS happens to be built the way the architecture says an archive should be built.

Database-Centric vs Lakehouse-Native: The Side-By-Side

Strip away the messaging and the difference reduces to a handful of architectural properties. Here’s the honest comparison an architect should run before signing anything.

Property	Database-centric archive	Lakehouse-native archive
Data scope	Structured / relational only; unstructured handled by separate systems	Structured and unstructured in one governed corpus
Storage format	Proprietary or relational; vendor-bound	Open columnar (Parquet/ORC) on object storage; engine-agnostic
Storage / compute	Coupled — pay for the engine even when idle	Decoupled — store cheaply forever, pay compute only on query
Cost at scale	Linear-to-worse; priced in relational terabytes	Object-storage economics; scales to exabytes without re-architecture
Retrieval independence	Often requires source schema or source app to interpret	Self-describing; legible without the source system
AI / analytics access	Requires a second extract/transform pipeline	Directly queryable by SQL, vector search, and AI engines
Immutability	Permissions-based; frequently applied after the fact	WORM enforced at the storage layer, at ingestion
Scaling the estate	Re-architect / add tooling per new data type or source	Add a connector; the architecture doesn’t change
Evidentiary integrity	Audit logs, sometimes	Append-only logs, hashing, trusted timestamps, ledger anchoring

The pattern in that table is the whole thesis: the database-centric column is a set of constraints inherited from an era when data was relational, storage was expensive, and nobody expected to point a machine-learning model at a fifteen-year-old contract. The Lakehouse-native column is what you’d design if you started from today’s estate and today’s obligations.

“But Our Platform Already Has Retention” — the Native-Retention Objection

Every major SaaS and ERP platform now ships some flavor of retention, immutability, or “archive” tier. The reasonable objection follows: if Microsoft Purview, or our ERP’s built-in retention, or the platform’s cold-storage tier already does this, why do we need a separate archive at all?

Because native retention and enterprise archiving are different things that happen to share vocabulary.

Native platform retention is built to manage that platform’s data, inside that platform’s lifecycle, for as long as you keep paying for that platform. It is excellent at keeping a SharePoint document or a Dynamics record under a retention label while it lives in SharePoint or Dynamics. It is structurally incapable of three things an enterprise archive must do:

Span the estate. Native retention governs one platform. Your retention obligations and your litigation don’t respect platform boundaries — they follow people, matters, and records across every system, including the ones you’ve retired.
Outlive the source. The entire economic case for archiving is being able to switch the source system off. Native retention dies with the platform; if the data has to outlive the application, native retention is the wrong tool by definition.
Provide independent evidentiary integrity. “We trust the platform that holds the data to also certify that the data wasn’t altered” is a circular integrity claim. Defensible archiving requires the immutability and the integrity proofs to be independent of the system that could, in principle, alter the record.
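
One way to picture an integrity proof that does not rest on the platform’s word is a hash-chained log that anyone holding a copy can re-verify. The sketch below is a simplified illustration with hypothetical event fields. It is not a specification of any product’s audit trail.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry

def entry_digest(event, prev_hash):
    """Hash an event together with the hash of the entry before it."""
    body = json.dumps({"event": event, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_entry(chain, event):
    """Append an event, linking it to the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev_hash,
                  "entry_hash": entry_digest(event, prev_hash)})

def verify_chain(chain):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry_digest(entry["event"], prev_hash) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True

log = []
append_entry(log, {"actor": "auditor-1", "action": "read", "record": "INV-0001"})
append_entry(log, {"actor": "policy-engine", "action": "hold-applied", "record": "INV-0001"})
intact = verify_chain(log)

log[0]["event"]["action"] = "delete"  # simulate after-the-fact tampering
tamper_detected = not verify_chain(log)
print(intact, tamper_detected)
```

A chain like this only detects edits made by someone who cannot also recompute every later hash. Publishing the most recent `entry_hash` somewhere the platform does not control is what makes the check independent.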

Native retention is a feature of an application. An archive is an architecture that the applications feed into and then get switched off behind. Confusing the two is how enterprises end up unable to retire anything, because the data is still hostage to the platform that “retains” it.

What This Means for the Next Decade of Your Estate

If you’re the architect, CTO, or CDO setting data infrastructure direction, the practical implications are concrete:

Stop pricing the archive as a database problem. The long tail of your data should be priced in object-storage economics with decoupled compute. If your archive’s cost scales with relational terabytes, you’re financing the wrong architecture.
Treat retrieval independence as a hard requirement. Any archive that needs the source system alive to be legible has not enabled decommissioning. The test is simple: could you switch the source application off tomorrow and still answer a regulator from the archive alone?
Assume the archive is an AI asset, not dead weight. Whatever you build will be asked to serve historical data to models within its lifetime. Open, analytics-ready formats at rest are no longer a nice-to-have; they’re the difference between a corpus you can use and a liability you re-export from.
Demand immutability at ingestion and integrity proofs independent of the platform. Permissions are not immutability, and a vendor vouching for its own data isn’t evidentiary integrity. Hold the line on WORM-at-ingestion, append-only logs, hashing, and trusted timestamps.
Consolidate, don’t proliferate. Every additional point archive is another retention engine, another legal-hold mechanism, another audit surface, and another renewal. One open, governed corpus across structured and unstructured data is the architectural simplification the estate has been missing.

The Lakehouse wasn’t designed for archiving. That’s precisely why it’s the right foundation for it: it solves, as side effects of being an open and decoupled analytics architecture, the exact problems the relational archive was structurally unable to solve — scale, openness, retrieval independence, and machine-readability. Add immutability, retention orchestration, and evidentiary integrity on top, and you have an archive built for the estate you actually have, rather than the one your archiving vendor designed for in 2004.

The database-centric archive isn’t wrong because it’s old. It’s wrong because the assumptions it was correct about have all quietly stopped being true.

Archon Data Store is the Lakehouse-native reference architecture with 250+ connectors, 1,000+ transformations, cross-application search, and WORM immutability enforced at ingestion. If your current archive still needs the source system switched on to read it, that’s the conversation worth having.


Frequently Asked Questions

What is the difference between a Lakehouse and a Lakehouse Archive?

A Lakehouse is an analytics platform built for frequent querying, updates, and schema changes. An archive is a records-management system built for immutability, retention policy enforcement, and long-term retrieval independence. A Lakehouse Archive combines the Lakehouse’s open, scalable foundation with archive-specific governance: WORM immutability at ingestion, policy-driven retention, legal hold orchestration, and evidentiary integrity. Without these layers, a Lakehouse is not an archive.

Why does immutability have to be enforced at ingestion?

Immutability at ingestion eliminates a vulnerability window. When a record arrives and is immediately locked at the storage layer, there is no period during which someone could alter it. Post-hoc permissions applied hours or days later cannot prove the record wasn’t altered before the lock was set.

Can we retire the source ERP if the archive depends on its schema?

No. If your archive depends on the source ERP’s schema to be legible, the source system must remain licensed and queryable. You cannot truly retire the ERP. This is the core failure of database-centric archiving: it preserves rows but not the self-describing context that makes those rows independent. A Lakehouse-native archive solves this by including metadata, structural information, and business context at ingestion, so the archive is legible without the source application.

How does a Lakehouse archive handle unstructured data?

Relational archives are designed for tables and rows. Documents, images, email, and semi-structured data require a separate content archive with its own retention engine, audit trail, and search interface. A Lakehouse-native archive handles both in one system: data arrives in open columnar formats (Parquet, ORC), metadata is applied at ingestion via 1,000+ transformations, and a single governance policy spans all formats. Cross-application search retrieves records regardless of format or source system.

How does legal hold work in a Lakehouse archive?

Legal hold places a freeze on records subject to litigation or investigation, exempting them from normal disposition schedules indefinitely. In a Lakehouse archive, legal hold is orchestrated at the governance layer: records flagged with a hold are automatically excluded from deletion workflows, the hold itself is logged in an append-only audit trail, and holds can be released only via a documented process.
