A long-form audit of the five conversions that quietly delete the work: PDF to Word, HEIC to JPG, CSV to Excel, MP3 to AAC, DOCX to Markdown. What is actually lost at the byte and pixel level — and the defensive workflow each one demands before you click Save As.
A folio of five conversions, one ledger of losses, and the smaller methods we use to keep our files honest.
We did not set out to write a field manual. We set out to print one memo — a 22-page Q1 client letter for Sterling-Kelman Advisory — and we lost two letterforms in the wash. Specifically: the kerning pair between A and V in the body face, and the small-caps replacement in the running head. Both vanished during a routine PDF-to-Word round-trip our compliance team requested, and we did not notice until 600 copies were already drying.
The compliance reason was reasonable. Sterling-Kelman's auditors flag PDFs as opaque artefacts; they wanted a redlineable Word file alongside the print master. So our designer ran the PDF through Word's native importer (Microsoft 365, build 16.84, on macOS Sonoma 14.5). The Word file opened beautifully. It looked, on screen, like the source. It was not the source. The kerning was gone.
This is the conversion problem in miniature. Most file-format conversions are not lossless, but they appear to be — because the loss happens in tables we don't render, in color spaces we don't compare, in calendar coercions we don't audit. The conversion smiles at you. The bytes shrug. By the time damage is visible, the source file has been replaced, the cache has rolled, the print run is on a truck.
This is a field manual for the five conversions our editorial staff and our readers run more than any others: PDF → Word, HEIC → JPG, CSV → Excel, MP3 → AAC, and DOCX → Markdown. For each, we cataloged a piece of damage we observed in production between March and June of 2026, measured the loss in bytes or pixels or musical content, and built a defensive workflow with a junior staffer in mind — someone two months into the job, working under deadline pressure, told to "just export it."
The folio is not exhaustive. It does not cover PSD-to-PNG flattening, MOV-to-MP4 transcodes, EPUB-to-MOBI conversions, or the catastrophic iconv errors that come from feeding GBK as UTF-8. Those are forthcoming. This is the slice that we had documentation for, before the calendar quarter ended.
Each section begins with one piece of damage, named and dated. Then a measurement table. Then a small workflow you can paste into a runbook. We have tried to avoid the trap of telling you to "use better tools." The better tool is usually the file you already had, kept unconverted, with the right reader open.
— H. Marsh, 12 June 2026, Brooklyn
The PDF specification, as of ISO 32000-2:2020, permits font embedding in three flavors: full, subset, and reference-by-name. Most PDFs in the wild (anything coming out of InDesign with default settings) subset. That means only the glyphs you used are embedded, plus a stub of the cmap table, and crucially, a partial copy of the OpenType layout features (kern and the rest of GPOS for positioning, liga and the rest of GSUB for substitution).
When Word imports a subsetted PDF, the importer rebuilds the font from the glyph outlines it found, then tries to match those outlines to an installed font on your system. If it cannot, it falls back: Calibri, Cambria, sometimes the dread Times New Roman. Even when it does match — and Sentinel did match for us, because we had it installed — Word does not re-attach the original OpenType positioning. It uses the metrics of the installed font. The kerning pairs are silently re-derived from a default table.
The visible result is what you saw on the masthead: AVA set with +18 units of slack between the A and V where the source had −42. Multiply that by every cap pair on a 22-page memo and the page color shifts. The block paragraphs look thinner. Readers do not say "the kerning is gone." They say "this one feels wrong."
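The arithmetic behind that gap is small enough to write down. The sketch below is illustrative only: the 1,000 units-per-em value and the 10 pt setting are assumptions, since the memo's actual font metrics and body size are not stated here.

```python
# Assumed values, not taken from the memo: 1,000 font units per em, 10 pt type.
UNITS_PER_EM = 1000
POINT_SIZE = 10.0


def kern_shift_points(source_units, output_units,
                      units_per_em=UNITS_PER_EM, point_size=POINT_SIZE):
    """Return how far (in points) a glyph pair moved between source and output."""
    delta_units = output_units - source_units
    return delta_units * point_size / units_per_em


# The source PDF kerned A-V at -42 units; the Word import set it at +18.
delta = 18 - (-42)
shift = kern_shift_points(-42, 18)
print(f"{delta} units wider, about {shift:.2f} pt per pair at the assumed size")
```

A fraction of a point per pair is invisible in isolation; it becomes page color only when it repeats across every kerned pair on the page.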
We pulled the embedded font tables from both files. The source PDF carried a kern table of 4,896 bytes and a GPOS table of 17,420 bytes. The Word import discarded both and substituted a 0-byte placeholder. The hinting program (the byte-coded instructions that tell the rasterizer how to snap stems to pixel boundaries at small sizes) survived intact in four of the nine samples and partially in two more, but in neither masthead sample.
| Source file | Face & weight | Role | kern bytes (src) | kern bytes (out) | hinting | Visible damage |
|---|---|---|---|---|---|---|
| Sterling-Kelman_Q1_Memo.pdf | Sentinel Light | masthead | 4,896 | 0 | dropped | "AVA" loose by +60 units; "Tiv" fell back to Times. Noticed by Rolland's print buyer on press. |
| NYU_AnnualReport_2025.pdf | Tiempos Text Regular | body | 12,288 | 0 | kept | Paragraph color shifted; 3 of 84 pages reflowed. |
| Coda_Restaurant_Menu.pdf | Lyon Display | display | 2,140 | 0 | partial | Section dividers' swash f rendered as plain f. |
| Hempel_Brochure_v3.pdf | Söhne Buch | body | 8,492 | 0 | kept | Numerals shifted from oldstyle to lining. |
| Bowery_Magazine_03.pdf | Founders Grotesk | masthead | 3,810 | 0 | dropped | Title bar grew 4mm; cover overflowed bleed. |
| Park_Slope_PTA_Flyer.pdf | Helvetica Now | flyer | 5,212 | 0 | kept | Mostly intact; small numerals reflowed. |
| OAS_Hearing_Transcript.pdf | Cambria (system) | legal body | 1,180 | 1,180 | kept | None — system font survived round-trip. |
| Inwood_Cycle_Club_Zine.pdf | Söhne Mono | mono body | 2,016 | 0 | partial | Tabular spacing broke; columns shifted 1.2ch. |
| Marsh_Studio_Invoice_0214.pdf | JetBrains Mono | invoice | 3,304 | 0 | dropped | Totals column lost monospace; alignment broke. |
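One way to reproduce the byte counts in a table like this is to read the table directory at the start of a font file. The sketch below assumes the embedded font program has already been extracted from the PDF or DOCX and saved as an sfnt (TrueType or OpenType) file; the extraction step is not shown, and the function names are hypothetical.

```python
import struct


def sfnt_table_sizes(font_bytes):
    """Map each table tag in an sfnt font to its length in bytes."""
    # Header: uint32 version, then uint16 numTables, searchRange,
    # entrySelector, rangeShift (12 bytes in total).
    _version, num_tables = struct.unpack(">IH", font_bytes[:6])
    sizes = {}
    for i in range(num_tables):
        start = 12 + 16 * i
        # Each record: 4-byte tag, uint32 checksum, uint32 offset, uint32 length.
        tag, _checksum, _offset, length = struct.unpack(
            ">4sIII", font_bytes[start:start + 16])
        sizes[tag.decode("latin-1")] = length
    return sizes


def layout_tables_lost(source_sizes, output_sizes,
                       tags=("kern", "GPOS", "GSUB")):
    """Return the layout tables present in the source but empty or absent in the output."""
    return [t for t in tags
            if source_sizes.get(t, 0) > 0 and output_sizes.get(t, 0) == 0]
```

Comparing the two dictionaries for the source and the round-tripped file gives the "kern bytes (src)" and "kern bytes (out)" columns directly, and `layout_tables_lost` names what went missing.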
"A subsetted font is a recipe, not the meal. PDF-to-Word reheats the recipe and serves it to you, but it has already thrown out the spice cabinet."
The HEIC container is, in 2026, the default capture format on every iPhone from the XS forward. It wraps an HEVC-encoded image (10-bit per channel, in our pro models) and tags it with the Display P3 color profile. P3 is a wider gamut than sRGB; it can describe roughly 25% more colors, with the gain concentrated in saturated reds, oranges, and greens.
When the iPhone exports a HEIC to JPG — and it does this often, automatically, whenever you AirDrop a photo to a non-Apple device, attach a photo to a Gmail draft, or upload to many web forms — it does three things in sequence. First, it decodes the HEVC bitstream to a 10-bit YUV buffer. Second, it converts that buffer to 8-bit RGB in the destination color space (almost always sRGB, by default). Third, it strips the ICC profile from the output, on the assumption that "JPG means sRGB."
The result is the sunset I shot from the 79th Street sea-wall on 22 May. In HEIC, opened in Preview, the sky between the upper sun edge and the cumulus shoulder is a smooth gradient through warm orange — peak chroma at roughly Lab(75, 38, 62). In the JPG, that same gradient is banded, peak chroma clipped at Lab(75, 24, 48), and the brightest 7% of the sun's edge has gone to pure 255-128-64. The sun is, in effect, a sticker.
None of this is a bug. It is the JPEG-on-iOS export pipeline doing exactly what Apple says it does. The bug is that we look at the JPG on the same iPhone, which interprets unprofiled JPGs as sRGB and renders them on a P3 display, mostly correctly — so you do not see the loss until the file is shown on a calibrated print proof or a non-Apple monitor.
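The second step of that export, the move from wide-gamut values to 8-bit sRGB, can be sketched in a few lines. This is a simplified illustration, not Apple's pipeline: the matrix is an approximate, rounded linear Display P3 to linear sRGB conversion, the transfer function (gamma) is left out, and the sample orange is an invented value.

```python
# Approximate linear Display P3 -> linear sRGB matrix (both D65), rounded to 4 places.
P3_TO_SRGB = (
    (1.2249, -0.2249, 0.0000),
    (-0.0421, 1.0421, 0.0000),
    (-0.0196, -0.0787, 1.0983),
)


def p3_to_srgb_linear(rgb):
    """Re-express a linear P3 color in linear sRGB coordinates (may leave 0..1)."""
    return tuple(sum(m * c for m, c in zip(row, rgb)) for row in P3_TO_SRGB)


def is_out_of_gamut(rgb, eps=1e-6):
    """True if any channel falls outside the range sRGB can encode."""
    return any(c < -eps or c > 1.0 + eps for c in rgb)


def clip_and_quantize(rgb, bits=8):
    """Clamp each channel to 0..1, then round to an integer code value."""
    top = (1 << bits) - 1
    return tuple(round(min(1.0, max(0.0, c)) * top) for c in rgb)


sunset_orange_p3 = (1.0, 0.3, 0.0)            # hypothetical saturated orange
as_srgb = p3_to_srgb_linear(sunset_orange_p3)
print(as_srgb, is_out_of_gamut(as_srgb), clip_and_quantize(as_srgb))
```

The red channel comes out above 1.0 and the blue channel below 0.0, so both are clamped. Every P3 color that lands outside the sRGB cube is pushed onto its surface, which is where the flat, sticker-like highlights come from.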
If the file is going to print, you want Adobe RGB or the printer's CMYK. If it is going to a web CDN, sRGB. If it is staying on Apple devices, keep HEIC.
The single biggest source of loss in our 14 paired samples was exporting "to share" without knowing where the file would live. A JPG bound for a Heidelberg Speedmaster needs different settings than a JPG bound for an Instagram crop.
An 8-bit JPG has 256 values per channel. A 16-bit TIFF has 65,536. The difference is invisible until you push exposure or recover highlights.
For our sunset, an 8-bit export clipped the brightest 7% of the sun edge to a single value; the 16-bit TIFF preserved 142 distinct values across the same region, enough to recover detail in print.
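To put the bit-depth difference in numbers, the sketch below counts how many code values each format has available in the brightest 7% of a channel's range. It measures headroom only; the 142 distinct values quoted above came from the image itself and cannot be derived this way.

```python
import math


def codes_in_top_fraction(bits, fraction):
    """Count the integer code values that sit in the top `fraction` of the range."""
    top = (1 << bits) - 1
    first_code = math.ceil((1.0 - fraction) * top)
    return top - first_code + 1


eight_bit = codes_in_top_fraction(8, 0.07)
sixteen_bit = codes_in_top_fraction(16, 0.07)
print(eight_bit, sixteen_bit)
```

An 8-bit channel has 18 steps to describe that region; a 16-bit channel has 4,588. Once an export has clipped or merged those steps, no later edit can separate them again.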
A JPG without an ICC profile is a Schrödinger file. Most viewers assume sRGB; some assume monitor-native. The image is correct on your screen, wrong on someone else's.
The fix is one checkbox in Export dialogs across Photoshop, Affinity, and Capture One. The cost is roughly 3 KB of file size. The savings are an unmeasurable number of arguments with print buyers.
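Whether that checkbox was ticked can be verified without opening an image editor. Embedded ICC profiles in a JPEG live in APP2 segments that begin with the identifier `ICC_PROFILE`, so a minimal check only has to walk the segment headers. The function below is a sketch with a hypothetical name; it stops at the first start-of-scan marker and does not handle every unusual marker layout.

```python
import struct


def jpeg_has_icc_profile(data):
    """Return True if a JPEG byte string carries an APP2 'ICC_PROFILE' segment."""
    if data[:2] != b"\xff\xd8":                 # SOI marker must open the stream
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break                               # lost sync; stop rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:                      # SOS: compressed image data follows
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE2 and payload.startswith(b"ICC_PROFILE\x00"):
            return True
        pos += 2 + length
    return False
```

Run against an export folder (reading each file with `open(path, "rb").read()`), this flags the unprofiled files before a print buyer does.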
sips -s formatOptions normal --setProperty hasMakerNote 0
The HEIC is your raw. Treat it the way a film photographer treats a negative: you can always reprint, but you cannot un-clip.
We back HEICs to a separate disk, organized by month, before we let any client touch the JPG derivatives. The cost is roughly 380 GB for a year of shooting; the alternative is reshooting a sunset, which the Hudson does not provide on demand.
Preview on macOS will happily render an sRGB JPG as if it were P3, on a P3 display, and lie to your face about how it will look elsewhere.
We use a soft-proof view in Capture One or the print simulation mode in ColorSync Utility before we sign off on any export. Five minutes. Saved 14 reprints in the quarter.
The CSV is not a format. It is a 50-year-old gentleman's agreement about how to put commas between things, and Excel has never honored it. The damage Excel does to CSVs on import is so common that the genomics community had to rename twenty-seven human genes in 2020 because Excel kept turning SEPT2 into 2-Sep and MARCH1 into 1-Mar.
Our damage was smaller, but expensive. A junior analyst opened ny_nj_subscribers_2026Q1.csv in Excel 16.84 by double-clicking it. Excel auto-detected types per column. ZIP codes got Number. Dates of birth got Date (en-US). SKU codes starting with = got Formula. The file looked fine on screen. The analyst saved as .xlsx, handed it to the mailing house, and 8,114 of our New Jersey subscribers received mail addressed to a four-digit ZIP that does not exist.
The mailing house caught the ZIP issue on a sample. They did not catch the dates of birth. We mailed birthday cards to 2,403 people on the wrong day. We mailed an apology to 41 people whose SKU codes had been corrupted into #NAME?. The cost was approximately $4,180 in reprints and remailings, plus one very tired Tuesday for the customer-service team.
The "auto-detect column type" behavior was designed when Excel was a 1985 spreadsheet on a Mac with 1 MB of RAM, intended for a financial analyst who wanted to type numbers into cells. It was not designed to read CSVs from production databases. But because Excel registers itself as the default handler for .csv on every Windows and macOS install, that is what it does, several billion times a year, and it does it badly.
The fix is to not double-click a CSV. Use the import wizard at Data → Get Data → From Text/CSV, which lets you specify the type of each column. Or use a separate tool entirely. Or, and this is what we now do, convert the CSV to .xlsx with pandas first, with explicit dtypes, and never let Excel see the unconverted file.
| Column | Sample value (source) | What Excel did | Sample value (result) | Failure mode | Rows affected |
|---|---|---|---|---|---|
| postal_code | 07030 | parsed as integer, stripped leading 0 | 7030 | silent | 8,114 |
| dob_iso | 03/04/1988 | parsed as en-US (Mar 4) where source was DD/MM | 1988-03-04 | locale | 2,403 |
| sku | =H4B-220 | evaluated as formula, errored | #NAME? | formula | 41 |
| account_id | 0001423 | parsed as integer | 1423 | silent | 12,981 |
| phone | +1-201-555-0148 | parsed as text, kept | +1-201-555-0148 | survived | 0 |
| scientific_id | 1E2 | parsed as scientific notation | 100 | silent | 18 |
| signed_on | 2026-03-08 | parsed as date, OK | 2026-03-08 | survived | 0 |
| amount_usd | 1,200.00 | locale-comma parsed as thousands | 1200 | numeric | 184,221 |
| street | March 4 Avenue | kept as text, but flagged by Excel | March 4 Avenue | survived | 0 |
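Several of the failures in this table can be predicted before Excel ever opens the file, by reading the CSV strictly as text and flagging the values that auto-detection tends to coerce. The sketch below uses only the standard library; the three rules are a partial, assumed list drawn from the rows above, not a model of Excel's parser.

```python
import csv
import io
import re

LEADING_ZERO = re.compile(r"^0\d+$")
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?[eE][+-]?\d+$")


def excel_risks(value):
    """Name the coercions a raw text value is exposed to on auto-detect."""
    risks = []
    if LEADING_ZERO.match(value):
        risks.append("leading zero")
    if value.startswith("="):
        risks.append("formula")
    if SCI_NOTATION.match(value):
        risks.append("scientific notation")
    return risks


def audit_csv(text):
    """Return {column: {risk: count}}, with every cell read as a string."""
    counts = {}
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            for risk in excel_risks(value or ""):
                per_column = counts.setdefault(column, {})
                per_column[risk] = per_column.get(risk, 0) + 1
    return counts


SAMPLE = (
    "postal_code,sku,scientific_id,phone\n"
    "07030,=H4B-220,1E2,+1-201-555-0148\n"
)
report = audit_csv(SAMPLE)
print(report)
```

Any column that appears in the report is one to force to text on import; the phone column, which survived in the table, is correctly left out.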
"The format is the contract. Conversion is breach."
Lossy audio codecs work by throwing away frequencies the human ear is least likely to notice. MP3 uses a psycho-acoustic model from 1993, AAC a refinement from 1997. Both make educated guesses about masking: when a loud cymbal is playing, you cannot hear the soft hum behind it, so the codec discards the hum. The trick works once. It does not work twice.
When you transcode an MP3 to AAC — say, to upload to a service that prefers AAC, or because your podcast tool only accepts .m4a — the AAC encoder gets a signal that already has the maskable frequencies removed. It runs its own masking model on the result and removes more frequencies, including some that were preserved by MP3 but happen to fall in AAC's discard zone. Each hop is a small theft. Five hops is a robbery.
For our WNYC interview, the source WAV measured a ViSQOL score of 4.94 against itself (1.0 to 5.0 scale, 4.5+ considered transparent). After one hop to MP3 320, 4.86. After AAC 192, 4.43. After MP3 192, 4.01. After AAC 128, 3.71. The journalist on the interview did not hear the loss; her editor did, in the s-sibilants and the room tone. The room sounded "smaller" by the final hop, which is exactly what psycho-acoustic codecs do at low bitrates: they prune the reverb tail.
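Laid out hop by hop, the scores above show where the chain crossed the transparency line. The snippet only rearranges the figures already quoted; the 4.5 threshold is the one given in the parenthesis.

```python
# ViSQOL scores as reported above, in the order the hops were applied.
CHAIN = [
    ("WAV master", 4.94),
    ("MP3 320", 4.86),
    ("AAC 192", 4.43),
    ("MP3 192", 4.01),
    ("AAC 128", 3.71),
]
TRANSPARENT = 4.5


def per_hop_loss(chain):
    """Score lost at each hop, relative to the file that fed it."""
    return [(name, round(prev - score, 2))
            for (_, prev), (name, score) in zip(chain, chain[1:])]


def first_audible_hop(chain, threshold=TRANSPARENT):
    """Name of the first stage whose score falls below the threshold."""
    for name, score in chain:
        if score < threshold:
            return name
    return None


losses = per_hop_loss(CHAIN)
total = round(CHAIN[0][1] - CHAIN[-1][1], 2)
print(losses, total, first_audible_hop(CHAIN))
```

The first hop costs 0.08; the second costs 0.43 and is the one that drops the file below 4.5. The later hops never recover anything, they only subtract.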
This particular damage is invisible to a meter. The bitrates look right, the file lengths look right, the waveforms look identical at default zoom. You have to look at a spectrogram, or use a reference model, to see what is missing. Most people do not.
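A spectrogram is the visual version of a simple question: is there still energy at a given frequency? The sketch below asks that question directly with the Goertzel algorithm, using synthetic tones in place of real audio. The 18 kHz probe, the 48 kHz sample rate, and the power floor are assumptions chosen for the demonstration; decoding real files to samples is not shown.

```python
import math


def goertzel_power(samples, sample_rate, freq_hz):
    """Signal power at the DFT bin nearest `freq_hz` (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * freq_hz / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def tone(freq_hz, sample_rate, n):
    """A unit-amplitude sine wave, n samples long."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


RATE, N = 48_000, 4_800
low = tone(1_000, RATE, N)
high = tone(18_000, RATE, N)
master = [a + b for a, b in zip(low, high)]   # stands in for the source WAV
transcode = low                                # stands in for a file whose highs were cut


def highs_survived(samples, probe_hz=18_000, floor=1.0):
    """True if the probe frequency still carries measurable energy."""
    return goertzel_power(samples, RATE, probe_hz) > floor


print(highs_survived(master), highs_survived(transcode))
```

Both signals have the same length and the same 1 kHz content, so a level meter would treat them as equals. Only the probe at the top of the band tells them apart.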
The master is WAV or FLAC. Every other format is a derivative, made from the master, and discarded after delivery. We keep WAVs on a 14 TB local RAID; storage is cheap.
Open both the source WAV and the final transcode in Spek. Look for the cliff at the top of the frequency range — AAC LC truncates at 16 kHz at 128 kbps, AAC HE at 11 kHz. If your transcript or music has highs there, you lost them.
No "MP3 archive that we will convert later." If we need both MP3 and AAC, we make each from the WAV master. The pipeline is a star, never a chain.
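The star-shaped pipeline can be enforced in code rather than by memory. The sketch below builds ffmpeg command lines for each derivative and refuses any source that is not a lossless master. The bitrates and the `libmp3lame` encoder are illustrative assumptions (the encoder must be present in your ffmpeg build), and the commands are only constructed here, not executed.

```python
from pathlib import Path

LOSSLESS = {".wav", ".flac"}

# Encoder settings are illustrative assumptions; adjust per destination.
TARGETS = {
    "mp3": ["-c:a", "libmp3lame", "-b:a", "320k"],
    "m4a": ["-c:a", "aac", "-b:a", "192k"],
}


def derivative_commands(master, targets=TARGETS):
    """Build one ffmpeg command per target, each reading the master directly."""
    src = Path(master)
    if src.suffix.lower() not in LOSSLESS:
        raise ValueError(
            f"{src.name} is not a lossless master; refusing to chain transcodes")
    return [
        ["ffmpeg", "-i", str(src), *flags, str(src.with_suffix("." + ext))]
        for ext, flags in targets.items()
    ]


commands = derivative_commands("interview.wav")
for command in commands:
    print(" ".join(command))
```

Each list can be handed to `subprocess.run` as-is. Passing an `.mp3` as the master raises `ValueError`, which is the point: the chain cannot be built by accident.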
Markdown is a beautifully simple format. It was designed by John Gruber in 2004 to be a writer's plaintext format with a clean ASCII look that converts to clean HTML. It is, by deliberate design, incapable of representing certain things Word documents commonly contain: merged cells, nested tables, drawing canvases, comments threads, tracked changes, footnotes with multi-paragraph bodies, and equation editor formulas.
When you run a DOCX through pandoc (the standard tool for this conversion, and one that static site generators such as Hugo can call as an external helper), pandoc converts what it can and silently coerces what it cannot. A merged 2×2 cell in your DOCX becomes a single Markdown table cell with the four pieces of text concatenated. A nested sub-table inside a cell becomes a literal blob of text with pipe characters in it, which then breaks the parent table for any downstream Markdown renderer.
Our community-radio cooperative's manual had a 7-row × 5-column table on page 14 describing the studio's audio routing matrix. After pandoc, it had 19 Markdown rows, three of which were empty pipes, and the channel labels had been duplicated into adjacent cells in a pattern that suggested rowspan flattening. The next operations meeting started with the volunteer engineer holding up the Markdown printout and asking, simply, "what is this."
We measured the loss across 14 tables in the manual. Two converted cleanly. Five lost their headers (rowspan in the top row collapsed). Five lost their structure entirely (became single-row tables with concatenated content). Two preserved structure but lost typesetting — bold cells became plain, multi-paragraph cells became single paragraphs with line breaks deleted. Of 41 footnotes, 28 survived; 13 had paragraph breaks deleted; 7 lost embedded links because the link text had a comma in it that broke pandoc's footnote parser.
Total information loss, measured in characters between source and output: 38.1% for table content, 9.2% for footnote content, 0.4% for body prose. Body prose is fine. Tables are not.
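Since tables carry nearly all of the loss, it helps to know which ones are at risk before converting. A DOCX is a ZIP archive whose main part, `word/document.xml`, marks horizontal merges with `w:gridSpan`, vertical merges with `w:vMerge`, and nested tables as a `w:tbl` inside a `w:tc`. The sketch below scans for those three markers; the function name is hypothetical and the scan is a heuristic, not a prediction of pandoc's output.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def risky_tables(docx_bytes):
    """List (table_index, reasons) for tables that use merges or nesting.

    Tables are indexed in document order, nested tables included, so a
    nested table with merged cells flags both itself and its parent.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    findings = []
    for index, table in enumerate(root.iter(W + "tbl")):
        reasons = []
        if table.find(f".//{W}gridSpan") is not None:
            reasons.append("horizontal merge")
        if table.find(f".//{W}vMerge") is not None:
            reasons.append("vertical merge")
        if any(cell.find(f".//{W}tbl") is not None
               for cell in table.iter(W + "tc")):
            reasons.append("nested table")
        if reasons:
            findings.append((index, reasons))
    return findings
```

Tables that come back flagged are the ones worth exporting as images or rebuilding by hand in HTML, rather than trusting to the converter.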
Across five conversions and 37 paired files, the defensive workflow that emerged is simpler than any of the individual remedies we drafted. It is, in three sentences: keep the source unconverted, in a labeled archive, with a checksum. Generate every derivative from the source, never from another derivative. Audit the derivative against a known property of the source before you ship it.
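The checksum half of the first sentence fits in a short script. This is a minimal sketch, assuming SHA-256 and an in-memory manifest; the function names are hypothetical, and where the manifest is stored is left to the reader.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large masters do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir):
    """Map each file's path (relative to the archive root) to its SHA-256."""
    root = Path(archive_dir)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(archive_dir, manifest):
    """Names from the manifest whose file is now missing or has different bytes."""
    current = build_manifest(archive_dir)
    return sorted(name for name in manifest
                  if current.get(name) != manifest[name])
```

Build the manifest when a source enters the archive and run `changed_files` before generating any derivative; an empty list means the source is still the source.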
The audit is the part most workflows skip. For a PDF-to-Word: open both, look at the masthead in the Word file, see if it still kerns. For a HEIC-to-JPG: open both, soft-proof the JPG in the destination color space, look for clipping. For a CSV-to-Excel: check the column dtypes against a manifest, count rows, sample 10 random ZIP codes. For an MP3-to-AAC: spectrogram diff. For a DOCX-to-Markdown: render the Markdown back to HTML and visually diff against the DOCX. None of these takes more than five minutes. All of them would have saved us at least one print run this quarter.
The deeper principle is that file formats are contracts, and conversion is the only operation that changes which contract you have signed without renegotiating with the other party — the printer, the auditor, the mailing house, the listener. Treat conversions like contract amendments. Read what you are signing.
HEIF on a non-Apple system, or to TIFF with the P3 ICC profile attached. JPG is fine for the web; PNG is fine for screenshots; neither is fine for a print pipeline that originates in HEIC.
= to Text. Even better, pre-process with pandas: pd.read_csv('x.csv', dtype={'zip': str, 'sku': str}) and write out as .xlsx. Excel will then respect the types.