Deletion is never guaranteed: How your computer lies to you

You cannot prove the absence of data; you can only prove its presence. You also cannot prove whether a particular piece of data was copied, or whether a specific digital object is an original or a copy.

This limitation follows from how digital information works, and it is most visible in computer science. In the non-digital world, by contrast, you can destroy physical paper documents to a defined standard, and, depending on the effort put into the destruction, recovery may be difficult or even impossible (for example, if you chemically dissolve the documents or set them on fire).

The closest digital equivalent to destroying paper documents is data erasure. Techniques such as the “Gutmann method” specify an algorithm for securely erasing the contents of an entire hard disk, but they assume that no other copies of the data exist. In practice, copies are perfect and nearly impossible to prevent.

Perfection (in a mathematical context)

Perfection in this context refers to digital environments. In analogue environments, perfect copies don’t exist.

There are concepts such as “an orange”, but no two oranges are ever identical. Or apples. Or snowflakes. Each specimen will have its own shape, weight, colour variations, volume, and such.

If you take a picture of an orange, print it, and then scan it, and print the scan again, the copy is not identical to the original print. Every time information is materialised into the analogue world, a portion of it is lost. This is natural. But in a digital environment, you can copy the JPEG file of the orange 10,000 times, and the 10,000th copy will be identical to the first. We can say that analogue copies are “lossy”, while digital copies are “lossless”.
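The difference can be sketched in a few lines of Python. This is only an illustration: the byte string stands in for the JPEG file, and `analogue_copy` is a made-up, crude model of the noise a print-and-scan cycle adds, not a simulation of any real device.

```python
import hashlib
import random

def digital_copy(data: bytes) -> bytes:
    """Copy every byte exactly, as a file copy would."""
    return bytes(data)

def analogue_copy(data: bytes, rng: random.Random) -> bytes:
    """Hypothetical print-and-scan model: each value drifts by at most one step."""
    return bytes(min(255, max(0, b + rng.choice((-1, 0, 1)))) for b in data)

original = bytes(range(256)) * 4  # stand-in for the JPEG of the orange

# 10,000 generations of digital copies
copy = original
for _ in range(10_000):
    copy = digital_copy(copy)

# only 10 generations of "analogue" copies
rng = random.Random(0)
scan = original
for _ in range(10):
    scan = analogue_copy(scan, rng)

digital_identical = hashlib.sha256(copy).digest() == hashlib.sha256(original).digest()
analogue_identical = scan == original
print("digital copy identical:", digital_identical)
print("analogue copy identical:", analogue_identical)
```

The digital chain ends with the same hash it started with, so nothing about the final copy reveals that it is a copy at all.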

Copy and compute

At their core, computers really can only do a very limited range of operations: they can copy data, and they can perform calculations.

“Surely there are more things a computer can do?” you may ask. Fundamentally, no. Let’s explore some examples:

Displaying a picture on a monitor: that is a copy. The image is stored as a file on a storage medium, such as a hard disk, and then copied from the hard disk into the screen buffer. Actually, it is copied several times: first to the RAM, then to the VRAM; the GPU then copies the information from its VRAM onto the output pins of the cable; and finally, the screen copies the data from the input pins of the cable into its own memory. Additional calculations are performed when the picture is stored in a compressed format like JPEG or PNG and has to be decoded.
Sending a chat message: this is a copy. As you type, the key presses are copied into memory, then displayed on the screen in the order you type them. Once you click “send”, the message is copied from memory onto the network (an Ethernet cable, for instance) and sent off to its destination. If we’re using encrypted messaging, additional calculations are performed before sending the message.
Taking a picture with a camera: this is a copy. The camera is equipped with sensors that are able to capture photons coming in. When you press the shutter button, the camera will, very quickly, take the information registered in the sensor, and copy it to memory, to make a digital file. Here, we’re interacting with the analogue world as input. Additional calculations are performed at this step for compressing the file before saving it to a permanent storage medium.
Playing music: would you be surprised if I said this is a copy? The digital file represents a sound wave as a sequence of discrete positions, but sound waves are actually continuous; they’re just vibrations, and depending on how quickly air molecules vibrate, we hear different tones. On playback, the positions are copied from the file to the audio hardware, which drives a complex system of speakers and membranes of different sizes that make the air vibrate at various frequencies, and we hear those vibrations. Additional calculations are performed in the case of compressed files; you know the drill.
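To make the “sequence of discrete positions” from the music example concrete, here is a minimal Python sketch. The sample rate, the tone, and the 16-bit range are values assumed for illustration, and `sample_wave` is a hypothetical helper, not part of any audio library.

```python
import math

SAMPLE_RATE = 8_000   # samples per second (assumed for illustration)
FREQUENCY = 440.0     # tone in Hz (assumed for illustration)

def sample_wave(duration_s: float) -> list[int]:
    """Measure a continuous sine wave at discrete instants, as 16-bit integers."""
    count = round(SAMPLE_RATE * duration_s)
    return [
        round(32767 * math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE))
        for n in range(count)
    ]

samples = sample_wave(0.01)        # 10 ms of sound becomes 80 numbers
playback_buffer = list(samples)    # "playing" it starts with yet another copy
print(len(samples), samples[:4])
```

Computing the samples is the calculation; everything that happens to them afterwards, from file to memory to audio hardware, is a copy.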

I could go on, but I hope you get the idea by now. At a fundamental level, most computer actions boil down to moving and transforming bits. Logically irreversible erasure is possible, but in an open system it is not possible to prove whether other copies exist.

Deleting an individual instance of data is therefore, strictly speaking, possible. What is impossible is proving that a concept, such as a specific picture or music file, has actually been deleted, because that concept continues to exist for as long as copies remain somewhere, and those copies, as explained before, are perfect and impossible to detect.

So, how does file deletion work?

Would you be surprised to learn that right-clicking a file and then selecting “delete” doesn’t actually delete the file physically? No, I’m not talking about sending it to the recycle bin.

When you execute the “delete” action on a file, the filesystem only marks it as deleted. Internally, this means the filesystem finds the physical space the file occupies on the surface of the hard disk (or in the cells of an SSD) and marks that space as available. New data can be allocated in this space at any time, but nothing defines how long it will remain unmodified. Until the moment the space is actually reused, the storage medium keeps the old data around.

Once the file is marked as deleted, recovery may be easy, difficult, or impossible, depending on what has happened to the underlying space since. Typically, deleted data can be recovered as long as the storage has seen little use since the deletion. For example, if you delete a file, shut down immediately, and attempt recovery, you have a much better chance than if you keep using the computer for an hour or a year after the deletion.
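The behaviour described above can be modelled with a toy filesystem in Python. Everything here is hypothetical and heavily simplified: `ToyFilesystem`, its one-block-per-file layout, and its lowest-free-block allocation policy are assumptions made for the sketch, and real filesystems are far more involved.

```python
class ToyFilesystem:
    """Hypothetical, heavily simplified filesystem: a directory plus raw blocks."""

    def __init__(self, block_count: int) -> None:
        self.blocks = [b""] * block_count      # the "physical" medium
        self.directory = {}                    # file name -> block index
        self.free = set(range(block_count))    # blocks marked as available

    def write_file(self, name: str, data: bytes) -> int:
        index = min(self.free)                 # assumed policy: lowest free block
        self.free.remove(index)
        self.blocks[index] = data
        self.directory[name] = index
        return index

    def delete_file(self, name: str) -> int:
        index = self.directory.pop(name)       # forget the name...
        self.free.add(index)                   # ...and mark the space as available
        return index                           # self.blocks is never touched

fs = ToyFilesystem(block_count=4)
fs.write_file("secret.txt", b"my private key")
old_block = fs.delete_file("secret.txt")

# Raw read of the medium, bypassing the directory, right after deletion
recovered_immediately = fs.blocks[old_block]

# Continued use of the computer eventually reuses the freed block
fs.write_file("holiday.jpg", b"new photo data")
recovered_later = fs.blocks[old_block]

print(recovered_immediately, recovered_later)
```

In this model the deleted file is fully recoverable until another write happens to land on the same block, which is the reason to shut down immediately before attempting recovery.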

The two most common methods for destruction of digital data are overwriting empty areas with random garbage and physically destroying the drive. With full-disk encryption, we can further optimise deletion by overwriting the encryption header and discarding the encryption key. Still, the principle is the same as overwriting the entire disk’s surface. This works because data encrypted with well-designed encryption algorithms appear to be completely random to anyone who doesn’t have the key to decrypt them.
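The key-discarding approach can be sketched in Python as follows. `toy_cipher` is a deliberately simple stand-in written for this illustration (an XOR against a keystream derived from the key); it is an assumption of the sketch, not how real full-disk encryption is implemented, and it should not be used to protect anything.

```python
import hashlib
import secrets

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a key-derived keystream. Toy cipher, for illustration only."""
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))

key = secrets.token_bytes(32)
plaintext = b"contents of the whole disk"

disk = toy_cipher(key, plaintext)   # what is actually stored on the medium
readable = toy_cipher(key, disk)    # with the key, the same XOR decrypts it

key = None                          # "deletion": discard the key, leave the disk as-is
print(readable, disk.hex())
```

Nothing on the disk changed in the last step. The stored bytes are still there; they have only become indistinguishable from random garbage to anyone without the key, and only as long as no copy of the key survives elsewhere.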

“Aha! So data deletion does exist! Just overwrite empty areas with zeros!” Close, but not quite.

What you are doing when overwriting data is merely copying data from some source to the storage medium. Whether you copy all zeros or random bits from a cryptographically secure source doesn’t matter; the principle at work is copy, not deletion.
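A minimal Python sketch of such an overwrite makes the point visible: the only operations involved are a read from a random source and a write to the file. `overwrite_in_place` is a hypothetical helper written for this illustration, and the temporary file exists only for the demonstration.

```python
import os
import tempfile

def overwrite_in_place(path: str, chunk_size: int = 64 * 1024) -> int:
    """Copy random bytes over a file's current contents; returns bytes written."""
    remaining = os.path.getsize(path)
    written = 0
    with open(path, "r+b") as handle:            # r+b: update without truncating
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))  # source of the copy
            handle.write(chunk)                  # the copy itself
            written += len(chunk)
            remaining -= len(chunk)
        handle.flush()
        os.fsync(handle.fileno())                # ask the OS to push data to the device
    return written

# Demonstration on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"sensitive" * 100)
    demo_path = tmp.name

bytes_written = overwrite_in_place(demo_path)
with open(demo_path, "rb") as handle:
    after = handle.read()
os.remove(demo_path)
print(bytes_written, b"sensitive" in after)
```

The program can confirm that reading the file back now returns the random bytes. It has no way to confirm where the device physically put them, or what happened to the bytes that were there before.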

But let’s go one level deeper. If you think about it, the data to delete exists in the storage medium, right? And how do you interface with this storage medium? Well, assuming this is a SATA SSD: for ordinary file manipulation, SATA doesn’t even have a “delete” command. It has a “read” command, which, if you recall from above, merely copies the data from the storage cell onto the physical SATA pins, so the operating system can access the requested information. It also has a “write” command that operates in the opposite direction: reading from the pins and copying that to the storage cell. (The ATA command set does include TRIM and secure-erase commands, but these are requests that the drive’s firmware is trusted to carry out; the operating system cannot check what actually happened to the cells.)

The mechanism for SATA hard disks (mechanical, or spinning) is similar. Instead of “cells”, hard disks use sectors, but those are physical implementation details that don’t change the nature of deletion.

So the “delete” command only exists at the filesystem level. The filesystem is responsible for mapping out the physical layout of the disk and presenting it to the OS as a logical structure that can store files. When a file is deleted, the filesystem just “forgets” about it, and marks the place it used to reside in as unoccupied.

Because of the above, it is plausible for a SATA device to be built so that it transparently keeps a copy of every single block, even when the operating system orders it to overwrite that block with new data. The disk’s storage controller would present to the operating system exactly the state it expects, while in the background keeping a copy of every write: instead of overwriting old blocks with the new data as instructed, it would simply allocate new blocks.
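Such a device is easy to model. The Python sketch below is purely hypothetical: `HoardingDisk` and its block numbering are invented for illustration and do not describe any real drive or firmware.

```python
class HoardingDisk:
    """Hypothetical drive firmware that never really overwrites anything."""

    def __init__(self) -> None:
        self._current = {}   # block number -> data, the state the host sees
        self._history = []   # every (block number, data) pair ever written

    def write(self, block: int, data: bytes) -> None:
        self._history.append((block, data))   # secretly retained
        self._current[block] = data

    def read(self, block: int) -> bytes:
        return self._current.get(block, b"")

disk = HoardingDisk()
disk.write(7, b"signing key")
disk.write(7, b"\x00" * 11)      # the host "securely overwrites" block 7

visible = disk.read(7)           # all the operating system can ever observe
retained = [data for block, data in disk._history if block == 7]
print(visible, retained)
```

Through `read` and `write`, the only interface the host has, this disk is indistinguishable from an honest one, yet the original contents of block 7 are still inside it.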

Conclusion

You cannot trust a digital system to:

Keep your information safe and secure, preventing copies.
Permanently delete any given piece of data.

That is why, among other reasons, it is recommended to keep sensitive pieces of data like recovery keys, cryptocurrency private keys, or signing keys offline, ideally written down with pen and paper and stored in a safe deposit box, or to have them handled by dedicated hardware encryption devices that cost thousands of dollars.