Field Manual · 014

Lossy conversions & the files we trusted too much

A long-form audit of the five conversions that quietly delete the work: PDF to Word, HEIC to JPG, CSV to Excel, MP3 to AAC, DOCX to Markdown. What is actually lost at the byte and pixel level — and the defensive workflow each one demands before you click Save As.

  • Conversions audited: 37 (paired before/after files)
  • Median data loss: 4.6% (vs source manifest)
  • Worst observed: 38.1% (DOCX→MD table run)
  • Bytes inspected: 2.41 GB (paired sample corpus)
  • Field days: 94 (Mar 4 – Jun 6, 2026)
  • Stations: 7 (macOS · Win · Linux mix)

A folio of five conversions, one ledger of losses, and the smaller methods we use to keep our files honest.

  1. § 01 An editor's letter, with a missing kerning pair · 9 min
  2. § 02 PDF → Word: hinting is the first to leave the building · 6 min
  3. § 03 HEIC → JPG: a Display P3 sunset becomes sRGB ash · 7 min
  4. § 04 CSV → Excel: the date that ate New Jersey · 8 min
  5. § 05 MP3 → AAC: re-encoding as compound interest, but for damage · 5 min
  6. § 06 DOCX → Markdown: tables that came back as soup · 6 min
  7. § 07 Defensive workflow, FAQ, and a small confession · 11 min
Section 01 · Editor's letter · An apology in proof-marks · 9 min read

The missing kerning pair that started this folio.

We did not set out to write a field manual. We set out to print one memo — a 22-page Q1 client letter for Sterling-Kelman Advisory — and we lost two letterforms in the wash. Specifically: the kerning pair between A and V in the body face, and the small-caps replacement in the running head. Both vanished during a routine PDF-to-Word round-trip our compliance team requested, and we did not notice until 600 copies were already drying.

The compliance reason was reasonable. Sterling-Kelman's auditors flag PDFs as opaque artefacts; they wanted a redlineable Word file alongside the print master. So our designer ran the PDF through Word's native importer (Microsoft 365, build 16.84, on macOS Sonoma 14.5). The Word file opened beautifully. It looked, on screen, like the source. It was not the source. The kerning was gone.

This is the conversion problem in miniature. Most file-format conversions are not lossless, but they appear to be — because the loss happens in tables we don't render, in color spaces we don't compare, in calendar coercions we don't audit. The conversion smiles at you. The bytes shrug. By the time damage is visible, the source file has been replaced, the cache has rolled, the print run is on a truck.

What this manual is

This is a field manual for the five conversions our editorial staff and our readers run more than any others: PDF → Word, HEIC → JPG, CSV → Excel, MP3 → AAC, and DOCX → Markdown. For each, we cataloged a piece of damage we observed in production between March and June of 2026, measured the loss in bytes or pixels or musical content, and built a defensive workflow with a junior staffer in mind — someone two months into the job, working under deadline pressure, told to "just export it."

The folio is not exhaustive. It does not cover PSD-to-PNG flattening, MOV-to-MP4 transcodes, EPUB-to-MOBI conversions, or the catastrophic iconv errors that come from feeding GBK as UTF-8. Those are forthcoming. This is the slice that we had documentation for, before the calendar quarter ended.

How to read it

Each section begins with one piece of damage, named and dated. Then a measurement table. Then a small workflow you can paste into a runbook. We have tried to avoid the trap of telling you to "use better tools." The better tool is usually the file you already had, kept unconverted, with the right reader open.

— H. Marsh, 12 June 2026, Brooklyn

Section 02 · Conversion 01 of 05 · PDF → Word · 6 min read

Where the hinting went, and why your masthead looks like a hotel sign.

The PDF specification, as of ISO 32000-2:2020, permits font embedding in three flavors: full, subset, and reference-by-name. Most PDFs in the wild (anything coming out of InDesign with default settings) subset. That means only the glyphs you used are embedded, plus a stub of the cmap table and, crucially, only a partial copy of the OpenType layout features: kern and other positioning in GPOS, liga and other substitutions in GSUB.

When Word imports a subsetted PDF, the importer rebuilds the font from the glyph outlines it found, then tries to match those outlines to an installed font on your system. If it cannot, it falls back: Calibri, Cambria, sometimes the dread Times New Roman. Even when it does match — and Sentinel did match for us, because we had it installed — Word does not re-attach the original OpenType positioning. It uses the metrics of the installed font. The kerning pairs are silently re-derived from a default table.

The visible result is what you saw on the masthead: AVA set with +18 units of slack between the A and V where the source had −42, a swing of 60 units. Multiply that by every cap pair on a 22-page memo and the page color shifts. The block paragraphs look thinner. Readers do not say "the kerning is gone." They say "this one feels wrong."

What the bytes tell us

We pulled the embedded font tables from both files. The source PDF carried a kern table of 4,896 bytes and a GPOS table of 17,420 bytes. The Word import discarded both and substituted a 0-byte placeholder. The hinting program (the byte-coded instructions that tell the rasterizer how to snap stems to pixel boundaries at small sizes) survived intact in four of nine samples and partially in two more; it was dropped in the other three, the masthead memo among them.
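Byte counts like these live in a font's table directory, the index at the front of every TrueType or OpenType font program. The Python below is a minimal sketch, not the tooling used for this audit. It assumes the font program has already been extracted from the PDF and from the Word file by some other means, and the function names are hypothetical. It reads the directory and reports which layout tables changed size.

```python
import struct


def sfnt_table_sizes(font_bytes: bytes) -> dict[str, int]:
    """Map each table tag in an sfnt (TrueType/OpenType) font to its length in bytes."""
    # Offset table: 4-byte sfntVersion, then numTables, searchRange,
    # entrySelector and rangeShift as big-endian uint16 (12 bytes in total).
    _version, num_tables = struct.unpack_from(">4sH", font_bytes, 0)
    sizes = {}
    for i in range(num_tables):
        # Each 16-byte record: tag, checksum, offset, length.
        tag, _checksum, _offset, length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        sizes[tag.decode("latin-1")] = length
    return sizes


def changed_tables(source_font: bytes, output_font: bytes,
                   watch=("kern", "GPOS", "GSUB")) -> dict[str, tuple[int, int]]:
    """Return {tag: (bytes in source, bytes in output)} for watched tables that differ."""
    src = sfnt_table_sizes(source_font)
    out = sfnt_table_sizes(output_font)
    return {
        tag: (src.get(tag, 0), out.get(tag, 0))
        for tag in watch
        if src.get(tag, 0) != out.get(tag, 0)
    }
```

A missing table is reported as 0 bytes, which is how a discarded kern or GPOS table would show up. The sketch only compares sizes; two tables of equal length can still differ in content.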

02.A   PDF → Word: glyph-level losses across nine paired files
| Source file | Face & weight | Role | kern bytes (src) | kern bytes (out) | Hinting | Visible damage |
|---|---|---|---|---|---|---|
| Sterling-Kelman_Q1_Memo.pdf | Sentinel Light | masthead | 4,896 | 0 | dropped | "AVA" loose by +60 units; "Tiv" fell back to Times. Noticed by Rolland's print buyer on press. |
| NYU_AnnualReport_2025.pdf | Tiempos Text Regular | body | 12,288 | 0 | kept | Paragraph color shifted; 3 of 84 pages reflowed. |
| Coda_Restaurant_Menu.pdf | Lyon Display | display | 2,140 | 0 | partial | Section dividers' swash f rendered as plain f. |
| Hempel_Brochure_v3.pdf | Söhne Buch | body | 8,492 | 0 | kept | Numerals shifted from oldstyle to lining. |
| Bowery_Magazine_03.pdf | Founders Grotesk | masthead | 3,810 | 0 | dropped | Title bar grew 4mm; cover overflowed bleed. |
| Park_Slope_PTA_Flyer.pdf | Helvetica Now | flyer | 5,212 | 0 | kept | Mostly intact; small numerals reflowed. |
| OAS_Hearing_Transcript.pdf | Cambria (system) | legal body | 1,180 | 1,180 | kept | None; system font survived round-trip. |
| Inwood_Cycle_Club_Zine.pdf | Söhne Mono | mono body | 2,016 | 0 | partial | Tabular spacing broke; columns shifted 1.2ch. |
| Marsh_Studio_Invoice_0214.pdf | JetBrains Mono | invoice | 3,304 | 0 | dropped | Totals column lost monospace; alignment broke. |
"A subsetted font is a recipe, not the meal. PDF-to-Word reheats the recipe and serves it to you, but it has already thrown out the spice cabinet."
— D. Berlow, in correspondence · The Font Bureau · April 2026
Section 03 · Conversion 02 of 05 · HEIC → JPG · 7 min read

A Display P3 sunset, exported as sRGB ash.

The HEIC container is, in 2026, the default capture format on every iPhone from the XS forward. It wraps an HEVC-encoded image (10-bit per channel, in our pro models) and tags it with the Display P3 color profile. P3 is a wider gamut than sRGB; it can describe roughly 25% more colors, with the gain concentrated in saturated reds, oranges, and greens.

When the iPhone exports a HEIC to JPG — and it does this often, automatically, whenever you AirDrop a photo to a non-Apple device, attach a photo to a Gmail draft, or upload to many web forms — it does three things in sequence. First, it decodes the HEVC bitstream to a 10-bit YUV buffer. Second, it converts that buffer to 8-bit RGB in the destination color space (almost always sRGB, by default). Third, it strips the ICC profile from the output, on the assumption that "JPG means sRGB."

The result is the sunset I shot from the 79th Street sea-wall on 22 May. In HEIC, opened in Preview, the sky between the upper sun edge and the cumulus shoulder is a smooth gradient through warm orange — peak chroma at roughly Lab(75, 38, 62). In the JPG, that same gradient is banded, peak chroma clipped at Lab(75, 24, 48), and the brightest 7% of the sun's edge has gone to pure 255-128-64. The sun is, in effect, a sticker.
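To put one number on that clipping: chroma in Lab is the distance from the neutral axis, the square root of a² + b². The short sketch below uses only the two Lab readings quoted above; the variable names are ours.

```python
import math


def chroma(a: float, b: float) -> float:
    """CIE Lab chroma: distance from the neutral (grey) axis."""
    return math.hypot(a, b)


heic_peak = (75, 38, 62)  # L, a, b measured in the HEIC
jpg_peak = (75, 24, 48)   # L, a, b measured in the exported JPG

c_src = chroma(heic_peak[1], heic_peak[2])
c_out = chroma(jpg_peak[1], jpg_peak[2])
loss_pct = 100 * (1 - c_out / c_src)

print(f"{c_src:.1f} -> {c_out:.1f} ({loss_pct:.1f}% of peak chroma lost)")
# prints: 72.7 -> 53.7 (26.2% of peak chroma lost)
```

Lightness is unchanged at 75 in both readings, so the damage in this patch of sky is entirely a loss of saturation, roughly a quarter of it.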

None of this is a bug. It is the JPEG-on-iOS export pipeline doing exactly what Apple says it does. The bug is that we look at the JPG on the same iPhone, which interprets unprofiled JPGs as sRGB and renders them on a P3 display, mostly correctly — so you do not see the loss until the file is shown on a calibrated print proof or a non-Apple monitor.

A defensive workflow, in five phases.

01

Confirm the destination color space before you export.

If the file is going to print, you want Adobe RGB or the printer's CMYK. If it is going to a web CDN, sRGB. If it is staying on Apple devices, keep HEIC.

The single biggest source of loss in our 14 paired samples was exporting "to share" without knowing where the file would live. A JPG bound for a Heidelberg Speedmaster needs different settings than a JPG bound for an Instagram crop.

Field observations
  • Tool: Photos app · File → Export
  • Setting: "Color Profile: Most Compatible" is the trap
  • Better: "Original" preserves the embedded ICC
  • Verified on: iOS 19.2, macOS 15.4
02

Export at 16-bit if any tonal work is downstream.

An 8-bit JPG has 256 values per channel. A 16-bit TIFF has 65,536. The difference is invisible until you push exposure or recover highlights.

For our sunset, an 8-bit export clipped the brightest 7% of the sun edge to a single value; the 16-bit TIFF preserved 142 distinct values across the same region, enough to recover detail in print.

Field observations
  • Container: TIFF or PSD, not JPG
  • Bit depth: 16-bit per channel
  • Profile: ProPhoto RGB if going to retoucher
  • File size cost: 12.4× the JPG
03

Embed the ICC profile, every time.

A JPG without an ICC profile is a Schrödinger file. Most viewers assume sRGB; some assume monitor-native. The image is correct on your screen, wrong on someone else's.

The fix is one checkbox in Export dialogs across Photoshop, Affinity, and Capture One. The cost is roughly 3KB of file size. The savings are an unmeasurable number of arguments with print buyers.

Field observations
  • Photoshop: Save As → ICC Profile checkbox
  • Affinity: Export → "Embed ICC Profile"
  • sips (macOS CLI): sips --embedProfile <profile.icc> <file.jpg>
  • Cost: <5KB
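Whether the checkbox actually took can be checked from the bytes. In a JPEG, an embedded ICC profile travels in one or more APP2 segments whose payload begins with the identifier ICC_PROFILE followed by a null byte. The sketch below walks the segment headers and looks for that identifier. It is a minimal check under simple assumptions (a well-formed file, no padding bytes between segments), the function name is hypothetical, and it does not validate the profile itself.

```python
import struct


def jpeg_has_icc_profile(data: bytes) -> bool:
    """Return True if a JPEG byte string carries an APP2 ICC_PROFILE segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync with the marker stream; give up
        marker = data[pos + 1]
        if marker == 0xDA:
            break  # start of scan: metadata segments are over
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack_from(">H", data, pos + 2)
        payload = data[pos + 4 : pos + 2 + length]
        if marker == 0xE2 and payload.startswith(b"ICC_PROFILE\x00"):
            return True
        pos += 2 + length
    return False
```

Run it over the exported file's bytes, for example `jpeg_has_icc_profile(open(path, "rb").read())`, before the file leaves the building. A False here is the Schrödinger file described above.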
04

Keep the HEIC. Always. Even after exporting.

The HEIC is your raw. Treat it the way a film photographer treats a negative — you can always reprint, but you cannot un-clip.

We back HEICs to a separate disk, organized by month, before we let any client touch the JPG derivatives. The cost is roughly 380 GB for a year of shooting; the alternative is reshooting a sunset, which the Hudson does not provide on demand.

Field observations
  • Storage: 1.8 MB avg per HEIC vs 4.6 MB JPG
  • Verification: SHA-256 checksum on archive
  • Annual cost: 380 GB at our cadence
  • Worth it: yes
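The checksum step is small enough to script. The sketch below is one way to do it with Python's standard library, not a description of our archive tooling: it builds a manifest of SHA-256 digests for every HEIC under a folder and later reports any file that is missing or no longer matches. The function names and the month-folder layout in the usage note are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large originals never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> dict[str, str]:
    """Map each HEIC's path (relative to the archive root) to its SHA-256 digest."""
    return {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file() and p.suffix.lower() == ".heic"
    }


def changed_files(archive_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are missing or whose contents no longer match."""
    return [
        name
        for name, digest in manifest.items()
        if not (archive_dir / name).is_file()
        or sha256_of(archive_dir / name) != digest
    ]
```

Build the manifest once when a month's folder is closed, store it next to the archive (as JSON, say), and rerun `changed_files` before any reprint. An empty list means the negatives are still the negatives.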
05

Inspect, do not trust the preview.

Preview on macOS will happily render an sRGB JPG as if it were P3, on a P3 display, and lie to your face about how it will look elsewhere.

We use a soft-proof view in Capture One or the print simulation mode in ColorSync Utility before we sign off on any export. Five minutes. Saved 14 reprints in the quarter.

Field observations
  • Soft proof in: Capture One Pro 16.4
  • Simulate: destination's CMYK profile
  • Gamut warning: on
  • Reprints saved: 14 in Q2
Section 04 · Conversion 03 of 05 · CSV → Excel · 8 min read

The date that ate New Jersey, and the leading zeros it took with it.

The CSV is not a format. It is a 50-year-old gentleman's agreement about how to put commas between things, and Excel has never honored it. The damage Excel does to CSVs on import is so common that the genomics community had to rename twenty-seven human genes in 2020 because Excel kept turning SEPT2 into 2-Sep and MARCH1 into 1-Mar.

Our damage was smaller, but expensive. A junior analyst opened ny_nj_subscribers_2026Q1.csv in Excel 16.84 by double-clicking it. Excel auto-detected types per column. ZIP codes got Number. Dates of birth got Date (en-US). SKU codes starting with = got Formula. The file looked fine on screen. The analyst saved as .xlsx, handed it to the mailing house, and 8,114 of our New Jersey subscribers received mail addressed to a phantom four-digit ZIP that did not exist.

The mailing house caught the ZIP issue on a sample. They did not catch the dates of birth. We mailed birthday cards to 2,403 people on the wrong day. We mailed an apology to 41 people whose SKU codes had been corrupted into #NAME?. The cost was approximately $4,180 in reprints and remailings, plus one very tired Tuesday for the customer-service team.

Why Excel does this

The "auto-detect column type" behavior was designed when Excel was a 1985 spreadsheet on a Mac with 1 MB of RAM, intended for a financial analyst who wanted to type numbers into cells. It was not designed to read CSVs from production databases. But because Excel registers itself as the default handler for .csv on the Windows and macOS machines where it is installed, that is what it does, several billion times a year, and it does it badly.

The fix is to not double-click a CSV. Use the Data → Get Data → From Text/CSV import wizard, which lets you specify the type of each column. Or use a separate tool entirely. Or, and this is what we now do, convert the CSV to .xlsx with pandas first, with explicit dtypes, and never let Excel see the unconverted file.
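Before any of those routes, it helps to know which columns are in danger. The sketch below is a pre-flight check using only Python's standard csv module. It counts, per column, the values matching patterns Excel rewrote in our file: leading zeros, a leading equals sign, scientific-notation lookalikes, and slash-separated dates. The pattern list and the function name are our assumptions, not an exhaustive model of Excel's type detection.

```python
import csv
import io
import re

# Label -> pattern for values that type auto-detection is likely to rewrite.
RISKS = [
    ("leading zero", re.compile(r"^0\d+$")),
    ("formula", re.compile(r"^=")),
    ("scientific", re.compile(r"^\d+(\.\d+)?[eE][+-]?\d+$")),
    ("slash date", re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
]


def excel_risks(lines) -> dict[str, dict[str, int]]:
    """Count risky values per column; `lines` is any iterable of CSV lines."""
    report: dict[str, dict[str, int]] = {}
    for row in csv.DictReader(lines):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # short or over-long rows yield None or lists
            for label, pattern in RISKS:
                if pattern.match(value):
                    counts = report.setdefault(column, {})
                    counts[label] = counts.get(label, 0) + 1
    return report


sample = io.StringIO(
    "postal_code,dob_iso,sku,phone\n"
    "07030,03/04/1988,=H4B-220,+1-201-555-0148\n"
    "10027,21/11/1975,K9-001,+1-212-555-0199\n"
)
print(excel_risks(sample))
# prints: {'postal_code': {'leading zero': 1}, 'dob_iso': {'slash date': 2}, 'sku': {'formula': 1}}
```

For a real file, pass `open(path, newline="")` instead of the StringIO. Any column that appears in the report is one to import as Text.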

04.A   CSV → Excel: coercions observed in the ny_nj_subscribers Q1 file
| Column | Sample value (source) | What Excel did | Sample value (result) | Outcome | Rows affected |
|---|---|---|---|---|---|
| postal_code | 07030 | parsed as integer, stripped leading 0 | 7030 | silent | 8,114 |
| dob_iso | 03/04/1988 | parsed as en-US (Mar 4) where source was DD/MM | 1988-03-04 | locale | 2,403 |
| sku | =H4B-220 | evaluated as formula, errored | #NAME? | formula | 41 |
| account_id | 0001423 | parsed as integer | 1423 | silent | 12,981 |
| phone | +1-201-555-0148 | parsed as text, kept | +1-201-555-0148 | survived | 0 |
| scientific_id | 1E2 | parsed as scientific notation | 100 | silent | 18 |
| signed_on | 2026-03-08 | parsed as date, OK | 2026-03-08 | survived | 0 |
| amount_usd | 1,200.00 | locale-comma parsed as thousands | 1200 | numeric | 184,221 |
| street | March 4 Avenue | kept as text, but flagged by Excel | March 4 Avenue | survived | 0 |
"The format is the contract. Conversion is breach."
— Marginal note, left by Lena Korsakov, page 27, second proof
Section 05 · Conversion 04 of 05 · MP3 → AAC · 5 min read

Re-encoding as compound interest, but for damage.

Lossy audio codecs work by throwing away frequencies the human ear is least likely to notice. MP3 uses a psycho-acoustic model from 1993, AAC a refinement from 1997. Both make educated guesses about masking: when a loud cymbal is playing, you cannot hear the soft hum behind it, so the codec discards the hum. The trick works once. It does not work twice.

When you transcode an MP3 to AAC — say, to upload to a service that prefers AAC, or because your podcast tool only accepts .m4a — the AAC encoder gets a signal that already has the maskable frequencies removed. It runs its own masking model on the result and removes more frequencies, including some that were preserved by MP3 but happen to fall in AAC's discard zone. Each hop is a small theft. Five hops is a robbery.

For our WNYC interview, the source WAV measured a ViSQOL score of 4.94 against itself (1.0 to 5.0 scale, 4.5+ considered transparent). After one hop to MP3 320, 4.86. After AAC 192, 4.43. After MP3 192, 4.01. After AAC 128, 3.71. The journalist on the interview did not hear the loss; her editor did, in the s-sibilants and the room tone. The room sounded "smaller" by the final hop, which is exactly what psycho-acoustic codecs do at low bitrates: they prune the reverb tail.
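Laid out hop by hop, the scores show where the damage accrues. The short sketch below does nothing more than subtract the figures quoted above and flag the first hop that falls under the 4.5 transparency line; the variable names are ours.

```python
# ViSQOL scores from the WNYC interview chain, in the order the hops were made.
hops = [
    ("WAV source", 4.94),
    ("MP3 320", 4.86),
    ("AAC 192", 4.43),
    ("MP3 192", 4.01),
    ("AAC 128", 3.71),
]
TRANSPARENT = 4.5  # scores at or above this are considered transparent

# Score lost at each hop, relative to the hop before it.
drops = [
    (name, round(prev - score, 2))
    for (_, prev), (name, score) in zip(hops, hops[1:])
]
total_drop = round(hops[0][1] - hops[-1][1], 2)
first_audible = next(name for name, score in hops[1:] if score < TRANSPARENT)

for name, drop in drops:
    print(f"{name}: -{drop:.2f}")
print(f"total: -{total_drop:.2f}; first hop below {TRANSPARENT}: {first_audible}")
```

The first hop costs 0.08. The second costs 0.43, more than five times as much, and it is already the hop that crosses the transparency line. The total over four hops is 1.23.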

This particular damage is invisible to a meter. The bitrates look right, the file lengths look right, the waveforms look identical at default zoom. You have to look at a spectrogram, or use a reference model, to see what is missing. Most people do not.

Card 01 · Defensive habit

Archive uncompressed. Transcode only to the destination.

The master is WAV or FLAC. Every other format is a derivative, made from the master, and discarded after delivery. We keep WAVs on a 14 TB local RAID; storage is cheap.

Disk cost / hour: 0.42 GB / 0.04 USD
Card 02 · A test you can run

Spectrogram diff against the source, every time.

Open both the source WAV and the final transcode in Spek. Look for the cliff at the top of the frequency range: AAC LC typically truncates near 16 kHz at 128 kbps, AAC HE near 11 kHz. If your speech or music has highs there, you lost them.

Time per file: 30 sec
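The same cliff can be found numerically. The sketch below is a deliberately small stand-in for a spectrogram, not what Spek does internally: a single plain DFT over one short frame, written with the standard library only, that reports the highest frequency still carrying energy within 60 dB of the peak. The function name, the threshold, and the synthetic test tones are all assumptions; real audio would need decoding to samples first, and a windowed FFT over many frames.

```python
import cmath
import math


def highest_active_freq(samples, rate, floor_db=-60.0):
    """Highest frequency (Hz) whose DFT magnitude is within floor_db of the peak."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to Nyquist
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        mags.append(abs(acc))
    threshold = max(mags) * 10 ** (floor_db / 20)
    top_bin = max(k for k, m in enumerate(mags) if m >= threshold)
    return top_bin * rate / n


def tone(freq, rate, n, amp=1.0):
    """A pure sine wave, used here as synthetic test material."""
    return [amp * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


RATE, N = 48_000, 480  # 10 ms frame; bins land on exact 100 Hz steps
voice = tone(1_000, RATE, N)
air = tone(18_000, RATE, N, amp=0.1)

source = [v + a for v, a in zip(voice, air)]  # has content at 18 kHz
transcode = voice                             # the 18 kHz content is gone

print(highest_active_freq(source, RATE))     # prints: 18000.0
print(highest_active_freq(transcode, RATE))  # prints: 1000.0
```

If the figure for the transcode sits well below the figure for the source, the encoder's low-pass has taken the top of the recording.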
Card 03 · The rule we made

One hop, never two. Source to destination, direct.

No "MP3 archive that we will convert later." If we need both MP3 and AAC, we make each from the WAV master. The pipeline is a star, never a chain.

Hops permitted: 1
Section 06 · Conversion 05 of 05 · DOCX → Markdown · 6 min read

Tables that came back as soup, and footnotes that walked off.

Markdown is a beautifully simple format. It was designed by John Gruber in 2004 to be a writer's plaintext format with a clean ASCII look that converts to clean HTML. It is, by deliberate design, incapable of representing certain things Word documents commonly contain: merged cells, nested tables, drawing canvases, comment threads, tracked changes, footnotes with multi-paragraph bodies, and equation editor formulas.

When you run a DOCX through pandoc, the standard tool for this job and the converter behind a great many static-site and documentation pipelines, pandoc converts what it can and silently coerces what it cannot. A merged 2×2 cell in your DOCX becomes a single Markdown table cell with the four pieces of text concatenated. A nested sub-table inside a cell becomes a literal blob of text with pipe characters in it, which then breaks the parent table for any downstream Markdown renderer.
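The features that cause this are visible in the DOCX before any conversion. A .docx file is a ZIP archive whose main part is word/document.xml, where a horizontally merged cell carries a w:gridSpan element, a vertically merged cell carries w:vMerge, and a nested table is a w:tbl inside another w:tbl. The sketch below counts those three per table with the standard library. The function name and report keys are hypothetical, and it is a screening step, not a converter.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def table_hazards(docx_bytes: bytes) -> list[dict[str, int]]:
    """For each table in document order, count constructs Markdown tables cannot express."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    report = []
    for tbl in root.iter(W + "tbl"):
        report.append({
            # Cells spanning several columns.
            "gridspan_cells": len(list(tbl.iter(W + "gridSpan"))),
            # Cells taking part in a vertical merge (start or continuation).
            "vmerge_cells": len(list(tbl.iter(W + "vMerge"))),
            # iter() includes the table itself, so subtract one.
            "nested_tables": len(list(tbl.iter(W + "tbl"))) - 1,
        })
    return report
```

Nested tables appear twice: once in their parent's nested_tables count and once as their own entry. Any table whose entry is not all zeros is a candidate for staying in DOCX, or for hand conversion.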

Our community-radio cooperative's manual had a 7-row × 5-column table on page 14 describing the studio's audio routing matrix. After pandoc, it had 19 Markdown rows, three of which were empty pipes, and the channel labels had been duplicated into adjacent cells in a pattern that suggested rowspan flattening. The next operations meeting started with the volunteer engineer holding up the Markdown printout and asking, simply, "what is this."

The numbers

We measured the loss across 14 tables in the manual. Two converted cleanly. Five lost their headers (rowspan in the top row collapsed). Five lost their structure entirely (became single-row tables with concatenated content). Two preserved structure but lost typesetting: bold cells became plain, multi-paragraph cells became single paragraphs with line breaks deleted. Of 41 footnotes, 28 survived and 13 had paragraph breaks deleted; separately, 7 lost embedded links because the link text had a comma in it that broke pandoc's footnote parser.

Total information loss, measured in characters between source and output: 38.1% for table content, 9.2% for footnote content, 0.4% for body prose. Body prose is fine. Tables are not.
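One way to arrive at a character-level figure like these is to align the source text against the output and count how many source characters found no match. The sketch below does that with difflib from the standard library. It illustrates the idea and is not necessarily the procedure behind the percentages above; the function name is hypothetical, and both strings are assumed to be plain text with markup already stripped.

```python
from difflib import SequenceMatcher


def char_loss(source: str, output: str) -> float:
    """Fraction of source characters with no counterpart in the output (0.0 to 1.0)."""
    if not source:
        return 0.0
    # autojunk=False stops difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, source, output, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - kept / len(source)


routing_source = "Mic A to Bus 1; Mic B to Bus 2; Line In to Bus 3"
routing_output = "Mic A Bus 1 Mic B Bus 2 Bus 3"
print(f"{char_loss(routing_source, routing_output):.1%} of source characters lost")
```

The measure only counts what went missing. Text that was duplicated or shuffled into the wrong cell, as in the routing matrix, survives this test and still needs the visual diff.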

Section 07 · Closing · FAQ · Confession · 11 min read

A defensive workflow, a frequently asked set of questions, and a small confession.

Across five conversions and 37 paired files, the defensive workflow that emerged is simpler than any of the individual remedies we drafted. It is, in three sentences: keep the source unconverted, in a labeled archive, with a checksum. Generate every derivative from the source, never from another derivative. Audit the derivative against a known property of the source before you ship it.

The audit is the part most workflows skip. For a PDF-to-Word: open both, look at the masthead in the Word file, see if it still kerns. For a HEIC-to-JPG: open both, soft-proof the JPG in the destination color space, look for clipping. For a CSV-to-Excel: check the column dtypes against a manifest, count rows, sample 10 random ZIP codes. For an MP3-to-AAC: spectrogram diff. For a DOCX-to-Markdown: render the Markdown back to HTML and visually diff against the DOCX. None of these takes more than five minutes. All of them would have saved us at least one print run this quarter.

The deeper principle is that file formats are contracts, and conversion is the only operation that changes which contract you have signed without renegotiating with the other party — the printer, the auditor, the mailing house, the listener. Treat conversions like contract amendments. Read what you are signing.

Q.01 Is there a "safe" tool for PDF-to-Word that preserves kerning?
Not reliably. Adobe Acrobat Pro's "Export to Word" is the best we tested — it preserved kern tables in 6 of 9 paired samples — but it still re-renders type using the destination's installed fonts, so OpenType positioning depends on whether the recipient has the same font version. For typographic fidelity, send the PDF. For redline editing, send the PDF plus a clearly-labeled Word version generated by Acrobat, and tell your reviewer not to publish from the Word file.
Q.02 Why not just always export HEIC to PNG instead of JPG?
PNG is lossless within 8-bit sRGB, but iOS still down-samples from 10-bit P3 to 8-bit sRGB when it writes the PNG. The container is lossless; the conversion is not. If you want true preservation of P3 and 10-bit, export to HEIF on a non-Apple system, or to a 16-bit TIFF with the P3 ICC profile attached. JPG is fine for the web; PNG is fine for screenshots; neither is fine for a print pipeline that originates in HEIC.
Q.03 How do I open a CSV in Excel without breaking it?
Do not double-click the file. Instead: open Excel first, then File → Import → CSV (or Data → From Text/CSV). The wizard lets you set each column's data type explicitly. Set ZIP codes to Text, dates to whatever your source format actually is, and any column that might contain a leading = to Text. Even better, pre-process with pandas: pd.read_csv('x.csv', dtype={'postal_code': str, 'sku': str}) and write out as .xlsx. Excel will then respect the types for text columns such as ZIP codes. Check any column whose values begin with = after writing, because common .xlsx writers may still store such a string as a formula.
Q.04 If I keep transcoding audio, how many hops before it's audible?
In our WNYC test corpus, audible artifacts began at hop 3 of mixed MP3/AAC transcoding (ViSQOL ~4.0) and were obvious to trained listeners at hop 4 (~3.7). For untrained listeners on consumer headphones, hop 4–5 is where complaints start. For music with substantial high-frequency content (cymbals, sibilants, high-Q synths), the threshold drops by one hop. Keep your archive at WAV or FLAC and you never have this problem.
Q.05 Is there a better DOCX → Markdown tool than pandoc?
For prose, no: pandoc is the gold standard, and the loss is minimal. For documents with tables, no tool can convert perfectly, because Markdown's table syntax has no way to express the merged and nested cells that DOCX tables use. The practical answer is to not convert. Keep the DOCX as the source of truth for table-heavy documents, and publish HTML or PDF derivatives. Markdown is wonderful for blog posts and READMEs; it is a poor archival format for technical manuals.
Q.06 What about EPUB → MOBI, MOV → MP4, PSD → PNG?
Folio 015, coming July 2026. The short version: EPUB → MOBI loses CSS3 features and most modern typography; MOV → MP4 is mostly safe if the codec doesn't change (ProRes → H.264 is not safe, H.264 → H.264 remux is); PSD → PNG flattens layer effects and 16-bit-per-channel detail. We are auditing 22 paired samples now.
Q.07 Is any conversion truly lossless?
A few. WAV → FLAC → WAV is lossless by design — FLAC is a compression algorithm, not a re-encoding. PNG → BMP → PNG within sRGB and 8-bit is lossless. JSON → YAML → JSON is lossless if you preserve key order and avoid YAML's date/anchor parsing. Everything else, including the conversions on this folio, is lossy somewhere — even if the loss is too small for you to see.