Case Study · Financial Services · Storage Reclamation · Linux · SAN · fstrim

Before You Buy More SAN: A 40TB Reclamation Case Study

How WUC Technologies helped a financial services organization recover 40TB of enterprise storage capacity — and defer a multi-hundred-thousand-dollar emergency procurement — using nothing more than properly coordinated fstrim operations.

Engagement at a glance
Client industry: Financial services (regulated workload, mid-six-figure annual storage spend)
Capacity reclaimed: 40+ TB of thin-provisioned SAN storage
Capital expenditure deferred: Estimated $400K+ in flash capacity and expedited freight (based on list-price quotes for the same array family)
Engagement timeline: 9 business days from discovery to first production reclamation · 4 weeks for full execution
Scale touched: 47 production Linux hosts · 184 LUNs · 312 mount points evaluated · 67 mount points actively trimmed · 3 hosts explicitly declined after risk review
Production impact: Zero unplanned downtime · zero application disruption
Methodology: Vendor-agnostic (validated on NetApp ONTAP, Pure Storage, and Dell PowerStore-class arrays)

The Problem

A large financial services organization approached us with a familiar problem: their enterprise storage arrays were approaching critical utilization thresholds, and procurement timelines for additional capacity were measured in months — not days.

The environment was heavily virtualized, Linux-based, and backed by thin-provisioned SAN storage. On paper, application teams had already deleted significant amounts of data. In reality, the storage arrays showed almost no reduction in consumed capacity.

Pre-engagement state when we arrived:

  • Production aggregates sustained at 94% utilization, with three trending past 97% within a 6-week forecast based on observed growth velocity
  • Escalating risk of write failures, performance degradation, and snapshot-reserve collapse
  • Active pressure from procurement and architecture teams to accelerate a costly storage expansion — including expedited freight surcharges
  • Limited maintenance windows in a highly regulated environment governed by formal change-control gates

After a full storage and host-level analysis, we identified the issue:

Deleted data inside the Linux filesystems was never being returned to the storage array.

Using coordinated fstrim operations across targeted Linux systems, we reclaimed more than 40TB of thin-provisioned capacity without downtime, eliminating the immediate need for an emergency storage purchase.

The Root Cause

In thin-provisioned environments, deleting files at the filesystem layer does not automatically return blocks to the storage array.

From the storage array perspective:

  • Blocks remain allocated to the LUN
  • LUN consumption metrics remain high
  • Thin provisioning efficiency degrades over time
  • Array-side dedupe and compression ratios are computed against allocated blocks that no longer contain meaningful data

This pattern is especially common in VMware vSphere environments, Oracle databases, log-heavy applications, backup staging systems, and large transient workloads (ETL, data-warehouse refresh windows, scientific compute scratch).

THE MENTAL-MODEL GAP

Customer’s mental model: “We deleted the data, so the storage should be free.”

The actual mechanic: Thin provisioning only works efficiently when unused blocks are explicitly unmapped or discarded back to the array, via SCSI UNMAP or WRITE SAME commands with the UNMAP bit set.

This is the single most common — and most expensive — misunderstanding in thin-provisioned SAN environments.

Environment Overview

Component: Details
OS: RHEL / Oracle Linux (mix of 7.x and 8.x)
Storage: Enterprise all-flash SAN arrays (vendor-agnostic methodology validated against NetApp ONTAP, Pure Storage, and Dell PowerStore-class systems)
Connectivity: Fibre Channel (32 Gb dual-fabric)
Hypervisor: VMware vSphere
Provisioning: Thin-provisioned LUNs at both array and VMware datastore layer
Filesystems: XFS and EXT4
Multipathing: DM-Multipath

The storage arrays fully supported SCSI UNMAP, thin reclamation, and block discard propagation through VMFS. The missing piece was host-side reclamation — the discard chain wasn’t completing end-to-end.

What We Discovered

Several Linux systems contained large amounts of stale data: decommissioned application datasets that had been removed months earlier, rotated log files past retention windows, temporary processing files from ETL jobs that had completed, and snapshot-derived clones that were detached but not freed.

The discrepancy was stark:

$ df -h /u02
Filesystem      Size  Used Avail Use% Mounted on
/dev/mapper/u02  20T  7.2T   13T  36% /u02

The host reported 64% free space. The array reported 94% allocated. That delta — roughly 12 TB on this single LUN — was unreclaimed thin-provisioned space.
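The arithmetic behind that delta can be sketched as below. The inputs are the figures quoted above (20 TB LUN, 94% allocated on the array, 7.2 TB used according to df); the function name unreclaimed_tb is hypothetical and exists only for illustration.

```shell
# Hypothetical helper: estimate unreclaimed space on a thin-provisioned LUN.
# Args: LUN size (TB), array-reported allocation (%), filesystem-reported usage (TB)
unreclaimed_tb() {
    awk -v size="$1" -v pct="$2" -v used="$3" \
        'BEGIN { printf "%.1f\n", size * pct / 100 - used }'
}

# /u02 example: 20 TB LUN, 94% allocated on the array, 7.2 TB used per df
unreclaimed_tb 20 94 7.2   # prints 11.6
```

The result, 11.6 TB, is the "roughly 12 TB" gap between what the array has allocated and what the filesystem actually holds.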

Summed across the mount points evaluated in the engagement, the total stale-block footprint approached 50 TB. Final reclamation landed at 40+ TB across the 67 mount points actively trimmed, after accounting for hosts where we deferred or declined trim operations.

The Solution: Coordinated fstrim

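Linux provides the fstrim utility to tell the underlying storage which filesystem blocks are no longer in use. The -a flag trims every mounted filesystem that supports discard, and -v reports how many bytes were trimmed per mount point:

$ sudo fstrim -av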

This sends DISCARD / TRIM requests through:

Filesystem → Device Mapper → Multipath → Fibre Channel → Array

When the entire chain supports discard propagation, the array can reclaim those blocks immediately. The catch is that every link in that chain has to be validated before you run fstrim at production scale. A broken link anywhere can cause the trim to fail silently, hang, or, in the worst case, induce IO latency spikes that propagate to other workloads on the same fabric.

Validation Checklist (Before Any Execution)

1. Thin provisioning support at the array

Confirmed that the storage arrays advertised SCSI UNMAP support in their inquiry data, that thin reclaim was enabled at the LUN level, and that space return was measurable via array-side analytics.

2. Discard propagation through DM-Multipath

$ lsblk --discard
NAME              DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
mpathb                   0        4K       4G         0
└─mpathb1                0        4K       4G         0

Key fields: DISC-GRAN (smallest unit the device can discard) and DISC-MAX (largest single discard operation). If DISC-GRAN is 0, the multipath device isn’t propagating discard. We caught this on two hosts where the multipath configuration explicitly disabled discard hints.
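For checking many hosts, the same test can be scripted. The sketch below is one possible approach, not the tooling used in the engagement: it reads the kernel's sysfs attributes discard_granularity and discard_max_bytes, which carry the values that lsblk --discard displays. The function name devices_without_discard is hypothetical, and the optional directory argument exists only so the check can be exercised against a test tree.

```shell
# Hypothetical helper: list block devices that advertise no discard support.
# Multipath maps appear here under their dm-N kernel names.
# Arg: sysfs block directory (defaults to /sys/block)
devices_without_discard() {
    root="${1:-/sys/block}"
    for dev in "$root"/*; do
        # Skip entries that do not expose the discard attributes at all
        [ -r "$dev/queue/discard_granularity" ] || continue
        gran=$(cat "$dev/queue/discard_granularity")
        max=$(cat "$dev/queue/discard_max_bytes" 2>/dev/null || echo 0)
        # A zero granularity or zero maximum means discards are not passed down
        if [ "$gran" -eq 0 ] || [ "$max" -eq 0 ]; then
            basename "$dev"
        fi
    done
}
```

Running devices_without_discard with no arguments prints one device name per line. Empty output means every device advertises discard, which is necessary but not sufficient: the array and hypervisor layers still need their own validation.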

3. Filesystem compatibility

Validated that all mounted filesystems were XFS or EXT4 with discard support enabled in the kernel module.

4. VMware compatibility

Confirmed virtual disks were configured with disk.EnableUUID set, VM hardware versions were 13 or later (required for guest-to-VMFS UNMAP propagation), VMFS datastores had UNMAP enabled at the datastore level, and guest OS discard operations weren’t being blocked by ESXi-level snapshot consolidation.

Risk Calls We Made

A field-honest case study should include what we didn’t do, not just what we did. Of the 47 hosts evaluated, we explicitly declined fstrim operations on 3 hosts:

HOST 1 · IN-FLIGHT ORACLE REDO LOG WRITES

The host was inside an active Oracle write-heavy window. fstrim on filesystems hosting redo logs can compete with the redo write path and induce log-buffer pressure. We deferred until the next scheduled change window when the workload was idle.

HOST 2 · SNAPSHOT RETENTION POLICY CONFLICT

The host had array-side snapshots that were referenced by a compliance retention policy. Discarding blocks that the snapshots still referenced would have invalidated the retention chain — not a recovery issue, but a compliance audit trail issue. We deferred until the snapshot rotation completed.

HOST 3 · UNSUPPORTED HBA FIRMWARE LEVEL

The host was running an HBA firmware revision with a known issue around DISCARD command serialization on the fabric. We deferred the trim and flagged the host for firmware update in the next planned maintenance window before executing reclamation.

Three additional failure modes we screened for but did not encounter — worth flagging for any team running this themselves:

  • VMs with active snapshots block fstrim from reaching the array. Consolidate first.
  • Oracle ASM disks don’t filesystem-trim. They require srvctl + ASM-level rebalance to reclaim — a different procedure outside this engagement’s scope.
  • If multipath -ll shows any path in failed/faulty state, fstrim -av may hang silently. Verify multipath -ll is fully clean before starting on any host.
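The multipath check in the last item can be turned into a simple gate. This is a minimal sketch that assumes a degraded path shows the word "failed" or "faulty" in multipath -ll output; the function name multipath_clean is hypothetical, and other degraded states (for example offline paths) would need additional patterns.

```shell
# Hypothetical helper: succeed only if multipath -ll style output on stdin
# contains no path reported as failed or faulty.
multipath_clean() {
    ! grep -Eqw 'failed|faulty'
}

# Usage: only start the trim when every path is healthy
#   sudo multipath -ll | multipath_clean && sudo fstrim -av
```

Reading from stdin keeps the check testable against captured output and makes it easy to reuse inside a polling loop during the trim itself.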

Execution Strategy — Phased

Because the environment was production-critical and under formal change control, we used a phased approach.

Phase 1 — Low-Risk Systems (Week 1)

We first targeted non-production environments, reporting and analytics systems, and archive workloads. This validated end-to-end reclaim behavior across the full discard chain, array response curves under sustained UNMAP load, and performance impact on neighboring LUNs sharing the same SAN ports.

Per-host trim time observed: 12–45 minutes per TB of reclaimable space, depending on fabric load and array-side reclaim queue depth.
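That range translates into a change-window estimate with simple multiplication. The sketch below assumes the 12 to 45 minutes-per-TB range observed in this environment carries over to the host in question, which will not hold for every fabric or array; the function name trim_window_minutes is hypothetical.

```shell
# Hypothetical helper: bound the expected trim duration for one host.
# Arg: reclaimable space in TB. Uses the 12-45 minutes-per-TB range from Phase 1.
trim_window_minutes() {
    awk -v tb="$1" 'BEGIN { printf "%d-%d\n", tb * 12, tb * 45 }'
}

trim_window_minutes 12   # prints 144-540
```

For a host with about 12 TB of reclaimable space, that is roughly 2.5 to 9 hours, which is why batch sizing against the available change windows mattered.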

Phase 2 — Controlled Production Rollout (Weeks 2–4)

We expanded to production systems during approved change windows. Execution batches were sized to keep total array-side reclaim activity under 15% of available IOPS budget at any moment.

$ sudo fstrim -av
/        : 1.2 TiB (1,326,389,452,800 bytes) trimmed on /dev/mapper/root
/u02     : 4.8 TiB (5,277,655,531,520 bytes) trimmed on /dev/mapper/u02
/archive : 7.1 TiB (7,807,825,182,720 bytes) trimmed on /dev/mapper/archive

Monitoring During Reclamation

Host metrics: CPU utilization (sustained < 15% incremental load from fstrim), IO latency on trimmed filesystems and on neighbors, multipath stability (multipath -ll polling every 30 seconds during active trim), queue depth at the HBA layer.

SAN metrics: port utilization on both fabrics, backend latency at the array, array CPU and reclaim-engine queue depth.

Storage efficiency: thin allocation reduction (the headline metric), aggregate / LUN consumption drop, snapshot-reserve recovery, dedup / compression ratio stabilization.

Results

Total capacity recovered: 40+ TB

Immediate business impact, in order of dollar value:

  • An estimated $400K+ in deferred capital expenditure (based on list-price quotes for the same array family + expedited freight surcharges that would have applied to the original emergency timeline)
  • Avoided expedited shipping costs (which alone would have added a low-six-figure premium)
  • Eliminated unplanned budget escalation for the current fiscal cycle
  • Removed the immediate risk of write-failure events on three production aggregates
  • Restored ~24 months of organic growth headroom before the next legitimate capacity expansion conversation

Operational impact: zero unplanned downtime, zero application disruption, full execution completed within an existing operational budget envelope — no overtime, no out-of-cycle change windows. 47 hosts evaluated, 67 mount points actively trimmed, 3 hosts deferred for documented reasons.

Operational Lessons Learned

1. Deleted Files ≠ Reclaimed Storage

This misunderstanding is extraordinarily common in enterprise environments. Filesystem free space does not automatically translate into reclaimed SAN capacity. The fix is operational discipline, not technology. Most environments already have the technology.

2. Thin Provisioning Requires Lifecycle Management

Thin provisioning is not “set and forget.” Organizations running thin-provisioned storage should implement scheduled fstrim operations (typically weekly or monthly per filesystem class), reclamation monitoring as part of routine capacity review, periodic storage efficiency audits across the full discard chain, and a documented runbook for fstrim execution under change control.

3. End-to-End Validation Matters

Not every stack handles discard operations correctly. Validate every link in the chain — array, fabric, multipath, hypervisor, filesystem — before scaling. Skipping validation is how organizations end up either with fstrim operations that silently do nothing (the cheap failure mode) or with fstrim operations that induce latency spikes during production hours (the expensive failure mode).

Recommended Best Practices

Enable scheduled fstrim on modern Linux distributions

$ sudo systemctl enable fstrim.timer
$ sudo systemctl start fstrim.timer
$ systemctl status fstrim.timer

The default unit triggers weekly. Adjust the cadence to match your workload churn and change-control posture.
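One way to change the cadence is a systemd drop-in that overrides the timer's OnCalendar setting. The sketch below assumes the standard drop-in location /etc/systemd/system/fstrim.timer.d and a monthly schedule chosen purely as an example; the function name write_fstrim_cadence is hypothetical.

```shell
# Hypothetical helper: write a drop-in that replaces the fstrim.timer schedule.
# Args: drop-in directory, systemd calendar expression (for example: monthly)
write_fstrim_cadence() {
    mkdir -p "$1" || return 1
    # The empty OnCalendar= line clears the packaged schedule before the new one is set
    printf '[Timer]\nOnCalendar=\nOnCalendar=%s\n' "$2" > "$1/override.conf"
}

# Usage (as root), followed by a reload so systemd picks up the change:
#   write_fstrim_cadence /etc/systemd/system/fstrim.timer.d monthly
#   systemctl daemon-reload && systemctl restart fstrim.timer
```

Afterwards, systemctl list-timers fstrim.timer shows the next scheduled run, which is worth recording in the change ticket.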

Avoid continuous mount-time discard in most SAN environments

While the discard mount option is technically functional, continuous discard can saturate the array’s reclaim engine during high-write windows. Periodic scheduled trimming is typically preferred for SAN-attached workloads. Continuous discard makes more sense for local NVMe / SSD volumes where the discard path is short and cheap.
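To find filesystems that are already mounted with continuous discard, a small audit along these lines may help. It is a sketch that assumes the option appears as the literal word "discard" in /proc/mounts for XFS and EXT4; the function name mounts_with_discard is hypothetical.

```shell
# Hypothetical helper: print mount points that use the continuous "discard" option.
# Reads /proc/mounts-format lines on stdin: device mountpoint fstype options dump pass
mounts_with_discard() {
    awk '$3 == "xfs" || $3 == "ext4" {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard") { print $2; break }
    }'
}

# Usage:
#   mounts_with_discard < /proc/mounts
```

Any mount point it prints is a candidate for removing the option from /etc/fstab and relying on the scheduled timer instead.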

Monitor thin provisioning ratios

Track logical vs physical allocation per LUN, snapshot growth rate, space efficiency trends across all storage tiers, and oversubscription risk at the array level. These belong on a capacity dashboard, reviewed monthly.

Key Takeaway

Many organizations are sitting on massive amounts of reclaimable storage capacity without realizing it. Before purchasing additional SAN capacity, validate whether unused blocks are actually being returned to the array.

In this engagement, a relatively small operational initiative recovered over 40TB of usable capacity and deferred an estimated $400K+ capital expenditure.

Not through compression.

Not through deduplication.

Not through migration.

Simply by reclaiming storage that was already logically free.

NEXT STEP

Run our 30-minute Storage Reclamation Assessment

If your aggregates are trending toward capacity exhaustion and procurement timelines are starting to feel uncomfortable, we’d like to help you find out whether you have meaningful reclamation headroom before you sign a purchase order.

The assessment is 30 minutes, no fee, no commitment:

  • We review your current array utilization (one screen-share, you drive)
  • We sample a few host-side lsblk --discard outputs from representative systems
  • We tell you, on the call, whether you have a credible reclamation opportunity and roughly what scale to expect
  • If you have headroom, we scope the work; if you don’t, we say so

You leave the call with a clear, data-grounded answer — not a sales pitch.

About WUC Technologies · WUC Technologies specializes in enterprise storage infrastructure, SAN architecture, Linux storage optimization, and operational efficiency initiatives for mission-critical environments. Multi-OEM coverage across NetApp, EMC, Pure Storage, Dell, HPE, and Cisco platforms. Authorized Dell and Cisco partner. SOC 2 audit-ready operations. Based in Boston, MA.