Why MD5 is broken but still useful

Walk into any developer chat and ask "is MD5 still safe?" and you will get one of two answers: a stern "no, it has been broken for decades" or a casual "yeah, for what I'm doing it's fine." Both can be correct, depending on what you are doing. The useful answer lives in the gap between them.

This article walks through what cryptographers mean when they say MD5 is broken, what kinds of work MD5 is still genuinely good at, and where the line is.

What "broken" means cryptographically

A cryptographic hash function is supposed to have three properties:

One-way. Given a hash, you cannot find an input that produces it (preimage resistance).
Second-preimage resistant. Given a specific input, you cannot find a different input that hashes to the same value.
Collision-resistant. You cannot find any two different inputs that hash to the same value.

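Collision resistance is the hardest of the three to achieve, because a generic birthday search against a 128-bit digest needs only about 2^64 hash evaluations rather than about 2^128. It is also the one MD5 lost first.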

The first practical MD5 collision was published by Wang and Yu in 2004, fitting on a single PowerPoint slide: two 128-byte messages with identical MD5 hashes. By 2008, a research team used MD5 collisions to forge a real, browser-trusted SSL certificate (the MD5 Considered Harmful Today attack). By the mid-2010s, anyone with a laptop could compute MD5 collisions in seconds using freely available tools.

The other two properties — preimage resistance and second-preimage resistance — have not been broken in the same way. Given a single MD5 hash, finding any input that produces it is still computationally infeasible in 2026. So MD5 is broken as a cryptographic hash, but it is not broken as a deterministic fingerprint.

The distinction matters. It is what makes some uses of MD5 still safe.

Where MD5 fails: when an adversary picks the input

If anyone other than you can pick what gets hashed, MD5 is dangerous. The attacker can craft two inputs with identical hashes and use that to swap one for the other after the hash is recorded.

Concrete examples of where this matters:

Digital signatures. Signing an MD5 hash of a document is meaningless once the attacker can produce a second, different document with the same hash — and then claim that document is what was signed. This is exactly the SSL certificate attack from 2008.

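File integrity for downloads where the source is untrusted. If you publish a download with an MD5 checksum on the same page, an attacker who can replace the binary can also replace the checksum. (This is true of any hash, but MD5 makes it worse: someone who can influence the file before it is published, such as a malicious contributor or a compromised build step, can prepare a clean binary and a backdoored one with the same MD5 and swap them later, no checksum tampering needed. Matching the MD5 of a binary the attacker had no hand in creating would be a second-preimage attack, and that remains infeasible.)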

Any kind of commit-and-reveal scheme. Bidder commits to a sealed bid by publishing its MD5 hash, then later reveals the bid. With MD5 collisions, the bidder can prepare two bids in advance — both hashing to the same value — and reveal whichever is more favourable.

Git object integrity (historical). Git uses SHA-1 (also broken, but more on that below) for object addressing. The fact that an attacker who controls a contributor's machine can craft two different files with the same hash is the reason Git has been migrating to SHA-256 since 2018.

The pattern in all of these: the security property being relied on is collision resistance, and that is the one MD5 lost.
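To make the commit-and-reveal case concrete, here is a minimal sketch of the same scheme built on SHA-256 with a random nonce instead of MD5. The function names `commit` and `verify_reveal` and the byte encoding of the bid are assumptions made for illustration; they do not come from any particular auction protocol.

```python
import hashlib
import hmac
import secrets


def commit(bid: bytes) -> tuple[bytes, bytes]:
    """Return (commitment, nonce). Publish the commitment, keep the nonce private."""
    # A fixed-length random nonce stops observers from guessing low-entropy bids.
    nonce = secrets.token_bytes(32)
    commitment = hashlib.sha256(nonce + bid).digest()
    return commitment, nonce


def verify_reveal(commitment: bytes, nonce: bytes, bid: bytes) -> bool:
    """Check that the revealed (nonce, bid) pair matches the earlier commitment."""
    expected = hashlib.sha256(nonce + bid).digest()
    return hmac.compare_digest(expected, commitment)


if __name__ == "__main__":
    commitment, nonce = commit(b"bid:1000")
    print(verify_reveal(commitment, nonce, b"bid:1000"))
    print(verify_reveal(commitment, nonce, b"bid:9999"))
```

The scheme only binds the bidder to one value if finding two inputs with the same digest is infeasible, which is exactly the property MD5 no longer offers.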

Where MD5 still works: when nobody adversarial picks the input

If you control the input, or if it comes from a non-adversarial source, MD5 remains a fast, deterministic, well-distributed fingerprint. The collision attacks are constructive: an attacker engineers two inputs to collide. If nobody is engineering inputs, the chance that any two given inputs collide is 2^-128, and you would need on the order of 2^64 (about 1.8 × 10^19) hashed items before a random collision becomes likely. That is far beyond any practical workload.
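The odds of an accidental collision can be estimated with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n items and a b-bit digest. The sketch below assumes MD5 output behaves like a uniformly random 128-bit value on non-adversarial input; the function name is made up for this example.

```python
import math


def birthday_collision_probability(num_items: int, digest_bits: int = 128) -> float:
    """Approximate P(at least one collision) among num_items random digests."""
    pairs = num_items * (num_items - 1) // 2  # number of distinct pairs of items
    # 1 - exp(-pairs / 2^bits); expm1 keeps precision when the result is tiny.
    return -math.expm1(-pairs / 2 ** digest_bits)


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**11, 2**64):
        print(f"{n:>22,d} items -> {birthday_collision_probability(n):.3e}")
```

Under that assumption, 100 billion items give a probability on the order of 10^-17, and the estimate only approaches even odds near 2^64 items.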

Concrete uses where MD5 is still appropriate in 2026:

Cache keys. When you cache the result of a computation, the cache key is the canonical representation of the input. MD5 of the canonical input is a perfectly good key: it is fast, the digest is short, and a random collision is not going to happen. (If your cache holds 100 billion entries, the birthday-bound probability of even a single MD5 collision is around 1.5 × 10^-17. Fewer entries, much smaller probability.)

Deduplication. Content-addressed storage that deduplicates by hash works fine with MD5 if the inputs are not adversarially chosen. Backup and sync tools (rsync has long used MD5 for its whole-file checksum), large-file deduplication, and warehouse loaders all do this.

File transfer integrity, when the source is trusted. If you copy a file from one of your own servers to another and use MD5 to verify the bytes arrived intact, that is a check against accidental corruption, not an adversary. For that job MD5 is as good as SHA-256 in practice: the chance that random corruption slips through unnoticed is roughly 2^-128 for MD5 and 2^-256 for SHA-256, and both are negligible. Some legacy file-transfer protocols and database backup formats still ship MD5 for exactly this reason.

Identifying images in a non-security pipeline. A photo-management app that hashes thumbnails to detect "this image is the same as that one" is doing fingerprinting, not crypto. MD5 is fine.

Test fixtures and snapshot testing. Comparing the MD5 of a generated artifact to a stored expected value catches regressions. The artifact's content is not adversarially controlled in your test suite, so MD5 works.

The thread running through all of these: the consequences of a collision are bounded and not security-critical. A cache that occasionally has a false hit is a performance bug, not a security incident. A backup that thinks two files are the same when they are not is a bug, but not a breach.
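Two of the uses above, cache keys and trusted file transfer, can be sketched with Python's standard `hashlib`. The helper names `cache_key` and `file_md5` and the choice of JSON as the canonical encoding are assumptions for this example. The `usedforsecurity=False` flag (Python 3.9 and later) states the non-security intent and lets the call work on builds that otherwise restrict MD5.

```python
import hashlib
import json


def cache_key(params: dict) -> str:
    """MD5 of a canonical JSON encoding, used only as a lookup key."""
    # sort_keys makes the encoding independent of dict insertion order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large files need not fit in memory."""
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(cache_key({"user": 42, "page": 3}))
    print(cache_key({"page": 3, "user": 42}))  # same key, different order
```

For a transfer check, one would compute `file_md5` on both ends and compare the two hex strings. A mismatch means the bytes changed in transit; it says nothing about who changed them.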

SHA-1 has the same problem

It is worth saying explicitly: most of what is true of MD5 is also true of SHA-1.

SHA-1's first practical collision was demonstrated by Google and CWI Amsterdam in 2017 (the SHAttered attack), costing roughly $110,000 in cloud compute at the time. By 2020, the SHA-1-is-a-Shambles team improved the attack into a chosen-prefix collision, the more dangerous form, for around $45,000. Browsers have rejected SHA-1 in TLS certificates since 2017. Git is moving to SHA-256.

SHA-1 is broken for the same reason MD5 is broken — collision resistance fell — and it is still appropriate for the same uses MD5 is appropriate for, by the same logic. If you cannot use MD5 for security, you cannot use SHA-1 for security either; if you can use SHA-1 for cache keys, you can also use MD5.

So what should I use for what?

Here is the practical choice in 2026:

Use case	Recommended	Why
Cryptographic signature	SHA-256	Standard, no known practical attacks
TLS / HTTPS certificate digest	SHA-256	Browsers reject SHA-1 since 2017
Content addressing (Git, IPFS)	SHA-256	Git's newer object format
File integrity for download (adversarial source)	SHA-256	Collision attacks are too cheap on weaker hashes
File integrity for transfer (trusted source)	MD5 or SHA-1	Detects accidental corruption fine
Cache keys	MD5 or xxHash	Speed matters, randomness is good enough
Deduplication of trusted content	MD5	Same; collisions are not adversarial
Password storage	Argon2id (or bcrypt)	Plain SHA hashes are too fast
Message authentication	HMAC-SHA-256	Use HMAC, not raw hash
Network frame / archive checksum	CRC32	Designed for accidental bit errors

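The first four rows are security-sensitive. The next three are not. The line between them is whether anyone picking the input has a reason to attack you, and that reason almost always involves either money or impersonation. Of the last three rows, password storage and message authentication are security-sensitive too, but they call for purpose-built constructions rather than a bare hash, and CRC32 only guards against accidental errors.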
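The "use HMAC, not raw hash" row is easy to show with Python's standard `hmac` module. This is a minimal sketch: the helper names `sign` and `verify` are invented for the example, and key generation and storage are left out.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Return the HMAC-SHA-256 tag of message under key, as hex."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


if __name__ == "__main__":
    key = b"example-shared-secret"  # placeholder; use a random key in practice
    tag = sign(key, b"amount=100&to=alice")
    print(verify(key, b"amount=100&to=alice", tag))
    print(verify(key, b"amount=900&to=mallory", tag))
```

The key is what separates this from a plain digest: anyone can compute SHA-256 of a message, but only a holder of the key can produce a tag that verifies.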

Why MD5 is still everywhere

Despite being broken for collision resistance for over twenty years, MD5 still appears in production code in 2026. Some of that is technical debt, but a lot of it is appropriate. MD5 itself was specified in RFC 1321 in 1992, and the protocols and formats that ship it (rsync, ETags in HTTP, database row checksums, plenty of backup formats) were designed in or before the early 2000s for non-adversarial integrity-checking purposes. Rewriting them to use SHA-256 would not improve their security in any meaningful way, because they were never relying on collision resistance.

There is a real cost to switching: SHA-256 is typically slower than MD5 in pure software (CPUs with dedicated SHA instructions can narrow or even reverse the gap), the digests are twice as long, and a lot of legacy tooling works against MD5 hashes natively. If the use is not security-critical, that cost is worth paying only when there is a benefit, and for integrity checks against accidental corruption there is not one.
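The speed and digest-length claims are easy to measure on your own hardware. The following is a rough sketch rather than a rigorous benchmark: it hashes one in-memory buffer once per algorithm, and the function name `throughput_mb_per_s` is an assumption for this example.

```python
import hashlib
import time


def throughput_mb_per_s(algorithm: str, size_mb: int = 64) -> float:
    """Hash size_mb MiB of zero bytes once and return the measured MiB/s."""
    data = bytes(size_mb * 1024 * 1024)
    start = time.perf_counter()
    hashlib.new(algorithm, data, usedforsecurity=False).digest()
    elapsed = time.perf_counter() - start
    return size_mb / max(elapsed, 1e-9)  # guard against a zero reading


if __name__ == "__main__":
    for name in ("md5", "sha1", "sha256"):
        size = hashlib.new(name, usedforsecurity=False).digest_size
        rate = throughput_mb_per_s(name)
        print(f"{name:>7}: {size}-byte digest, {rate:8.0f} MiB/s")
```

Results depend heavily on the CPU and on how the underlying crypto library was built, so treat the numbers as a local data point rather than a general ranking.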

When in doubt, use SHA-256

The simple rule: if you have any doubt about whether a use is adversarial, use SHA-256. The cost is small. The harm of getting it wrong on a security-sensitive use is potentially large.

But "use SHA-256 for everything" is folk wisdom that gets the security analysis backwards. The right question is "what guarantees am I relying on?" and the right tool follows from the answer. MD5 fails the collision-resistance guarantee; for many real workloads, that guarantee is not what you need.

The hash generator on this site computes MD5, SHA-1, all the SHA-2 variants, all the SHA-3 variants, and CRC32 over your input — useful both for the security-critical work where you need SHA-256 and the non-security work where MD5 is fine. As with all the tools, the input is hashed in your browser; nothing about it is sent to a server.

The summary

MD5 has been broken for collision resistance since 2004; in 2026, identical-prefix collisions take seconds on a laptop and chosen-prefix collisions are within reach of a single machine.
The other two cryptographic guarantees (preimage and second-preimage resistance) have not been broken — given a hash, you still cannot find an input that produces it.
For any work where an attacker picks the input, do not use MD5 (or SHA-1). Use SHA-256.
For cache keys, deduplication, integrity of trusted file transfer, snapshot testing, and similar non-adversarial fingerprinting, MD5 remains a fine, fast, well-distributed hash function.
For password storage, neither MD5 nor any plain SHA is appropriate — use Argon2id or bcrypt.
The decision rule is "what guarantee am I relying on?", not "is this hash strong?".

The "MD5 is broken, never use it" advice is correct in spirit and wrong in detail. The detail is what tells you which jobs to take it off — and which jobs it is still the right tool for.