Data Formats

I've often wished a data format connoisseur (they exist!) would write a guide to what makes a good data format. I've encountered critiques of and comments on specific formats, but never the principles behind designing a good format or the ways that formats go bad.

The topic subsumes both configuration files and data interchange formats, though it's debatable whether the two should overlap at all. And yet we have JSON...

TOML seems solid for configuration files, but I've seen an order of magnitude less discussion of it than JSON, YAML or XML.

Parsing JSON is a Minefield

JSON has quite a few edge cases, and parsers disagree about how to handle many of them.
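To make a few of these edge cases concrete, here is a small sketch using Python's standard json module. The inputs are my own picks rather than examples taken from the linked article, and other parsers may well behave differently on each of them, which is rather the point.

```python
import json
import math

# Duplicate keys: the JSON grammar does not forbid them, and Python's
# json module silently keeps the last value.
dup = json.loads('{"a": 1, "a": 2}')

# NaN is not part of the JSON grammar, yet Python's parser accepts it
# by default.
nan = json.loads('NaN')

# Big integers survive in Python, but a parser that maps every number
# to a 64-bit float cannot represent this one exactly.
big = json.loads('12345678901234567890')
lossy = int(float(big))

print(dup)               # {'a': 2}
print(math.isnan(nan))   # True
print(lossy == big)      # False
```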

Fixing JSON

Tim Bray, who authored some of the JSON specs, identifies three pain points: commas, timestamps and schemas. See also the TOML discussion of how to handle times (which still seems to be unresolved).
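Two of those pain points are easy to reproduce. The sketch below uses Python's standard json module; the ISO 8601 string at the end is a common convention, not something JSON itself defines, and the sample date is arbitrary.

```python
import json
from datetime import datetime, timezone

# Commas: a trailing comma is a syntax error in strict JSON.
try:
    json.loads('[1, 2, 3,]')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False

# Timestamps: JSON has no date type, so the encoder refuses datetimes.
stamp = datetime(2017, 5, 18, 12, 0, tzinfo=timezone.utc)
try:
    json.dumps({"when": stamp})
    datetime_ok = True
except TypeError:
    datetime_ok = False

# Workaround by convention only: encode the timestamp as a string and
# hope the reader knows to parse it back.
encoded = json.dumps({"when": stamp.isoformat()})

print(trailing_comma_ok, datetime_ok)  # False False
print(encoded)  # {"when": "2017-05-18T12:00:00+00:00"}
```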

TOML: Comparison With Other Formats

TOML prioritizes human editability more than JSON does (comments, friendlier syntax), and simplicity more than YAML does. It has some similarities to .ini, but is better specified.

Dhall

Dhall is a programmable configuration language that is not Turing-complete.

You can think of Dhall as: JSON + functions + types

Choosing Powerful Primitives For A Simplified Computing System

So, to a great extent, you can forget about the space-efficiency of your file formats and wire formats if you run them through a generic compression algorithm as a last step, and optimize them entirely for readability, extensibility, and simplicity.
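A quick way to check this claim for yourself is to compress a verbose encoding and a terse encoding of the same data and compare. The sketch below does that with zlib; the records and field names are invented, and the exact sizes will vary with the data, so treat it as an experiment rather than a proof.

```python
import json
import zlib

# Hypothetical records with long, readable, repetitive field names.
records = [
    {"customer_identifier": i, "account_status": "active", "balance_in_cents": i * 100}
    for i in range(1000)
]

# Readable: indented, with field names repeated in every record.
readable = json.dumps(records, indent=2).encode("utf-8")

# Terse: bare value arrays, no whitespace, field names dropped.
terse = json.dumps(
    [list(r.values()) for r in records], separators=(",", ":")
).encode("utf-8")

readable_z = zlib.compress(readable, 9)
terse_z = zlib.compress(terse, 9)

raw_ratio = len(readable) / len(terse)
compressed_ratio = len(readable_z) / len(terse_z)

print("raw bytes:       ", len(readable), "vs", len(terse))
print("compressed bytes:", len(readable_z), "vs", len(terse_z))
print("size penalty for readability shrinks:", compressed_ratio < raw_ratio)
```

The readable form still costs something after compression, but far less than the raw sizes suggest, since the repeated field names are exactly what a generic compressor is good at removing.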

Deserialization Vulnerabilities

There are at least three types of deserialization vulnerabilities: buffer overflows in languages that aren't memory safe, denial of service attacks, and allowing the deserialization of arbitrary classes (which typically means remote code execution).
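The third type is the easiest to demonstrate. Below is a minimal sketch using Python's pickle, chosen only because it is close at hand; the Payload class is hypothetical and deliberately harmless. The second half restricts the unpickler by overriding find_class, a hook the pickle documentation describes; the blanket refusal here is my own simplification.

```python
import io
import pickle


class Payload:
    """Stand-in for attacker-controlled data; the expression is harmless."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") while loading. A real attack
        # would put something far less friendly here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)
print(result)  # 42, computed by code that ran during deserialization


class SafeUnpickler(pickle.Unpickler):
    """Refuses every global lookup, so only plain built-in data loads."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def safe_loads(data: bytes):
    return SafeUnpickler(io.BytesIO(data)).load()


plain = safe_loads(pickle.dumps({"a": [1, 2, 3]}))
print(plain)  # {'a': [1, 2, 3]}

try:
    safe_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print(blocked)  # True
```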

The Java Deserialization Bug

Java serialization has produced a long series of security issues.

YAML f7u12

Describes the security vulnerabilities in YAML deserialization that hit Rails in 2013. Nicely points out that even restrictive whitelists can enable attacks. Maybe YAML is just too expressive.

Graydon Hoare's Criterion

(IMO a good criterion for data format robustness is "how much work does a conforming processor have to do to skip a quoted payload")
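Here is my reading of that criterion as code, comparing two ways of quoting a payload. Both framings and both function names are my own illustration, not anything from the tweet: a length prefix lets a processor jump over the payload without looking at it, while a JSON-style quoted string forces it to examine every byte for escapes.

```python
def skip_length_prefixed(buf: bytes, pos: int) -> int:
    """Skip a payload framed as b'<decimal length>:<bytes>'.

    Returns the index just past the payload. The payload bytes are never
    inspected, so the work does not depend on what they contain.
    """
    colon = buf.index(b":", pos)
    length = int(buf[pos:colon])
    return colon + 1 + length


def skip_json_string(buf: bytes, pos: int) -> int:
    """Skip a JSON-style string that starts at the opening quote.

    Returns the index just past the closing quote. Every byte has to be
    examined, because a backslash changes the meaning of the next byte
    and an unescaped quote ends the string.
    """
    if buf[pos:pos + 1] != b'"':
        raise ValueError("expected opening quote")
    i = pos + 1
    while i < len(buf):
        if buf[i] == 0x5C:      # backslash: skip the escaped byte too
            i += 2
        elif buf[i] == 0x22:    # unescaped double quote: end of string
            return i + 1
        else:
            i += 1
    raise ValueError("unterminated string")


# The same 13-byte payload (say "hi" \ ok) in both framings.
framed = b'13:say "hi" \\ ok' + b"|next"
quoted = b'"say \\"hi\\" \\\\ ok"' + b"|next"

print(framed[skip_length_prefixed(framed, 0):])  # b'|next'
print(quoted[skip_json_string(quoted, 0):])      # b'|next'
```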

— Graydon Hoare (@graydon_pub) May 18, 2017

Defused XML Fixes For XML

Security and DoS vulnerabilities in XML, written from a Python perspective.
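One of the better-known DoS issues is nested entity expansion. The sketch below is a deliberately tiny version using the standard library's xml.etree.ElementTree, with three levels of ten instead of the many levels a real attack would use. The document is made up, and I'm assuming the underlying Expat build expands small internal entities like these, which is the usual default.

```python
import xml.etree.ElementTree as ET

# Each entity expands to ten copies of the one before it.
doc = """<!DOCTYPE r [
  <!ENTITY a "xxxxxxxxxx">
  <!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
  <!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
]>
<r>&c;</r>"""

root = ET.fromstring(doc)

# 10 * 10 * 10 characters of text from a single entity reference.
expanded = len(root.text)
print(expanded)  # 1000
```

Every extra level multiplies the output by ten again, so a document of a few hundred bytes can ask the parser for gigabytes of text.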