Comments on: How Did DeepSeek Train Its AI Model On A Lot Less – And Crippled – Hardware?
https://www.nextplatform.com/2025/01/27/how-did-deepseek-train-its-ai-model-on-a-lot-less-and-crippled-hardware/

By: itellu3times
Tue, 28 Jan 2025 00:50:57 +0000
https://www.nextplatform.com/2025/01/27/how-did-deepseek-train-its-ai-model-on-a-lot-less-and-crippled-hardware/#comment-246851

So it’s all mechanics, and really not a touch of actual theory. But then, what is the theory behind LLMs? Doh!

By: Carl Schumacher
Tue, 28 Jan 2025 00:03:51 +0000
https://www.nextplatform.com/2025/01/27/how-did-deepseek-train-its-ai-model-on-a-lot-less-and-crippled-hardware/#comment-246845

If this is indeed “DeepFake”, then it’s one of the best-engineered shorts (in both an energy and an IT capital sense) in decades.

By: Tapa Ghosh
Mon, 27 Jan 2025 22:33:03 +0000
https://www.nextplatform.com/2025/01/27/how-did-deepseek-train-its-ai-model-on-a-lot-less-and-crippled-hardware/#comment-246833

“And here is another side effect: The V3 model uses pipeline parallelism and data parallelism, but because the memory is managed so tightly, and overlaps forward and backward propagations as the model is being built, V3 does not have to use tensor parallelism at all. Weird, right?”

This is mostly because of the small number of GPUs used: at that scale they can use expert parallelism as well to eliminate the need for tensor parallelism. If you used more GPUs, you would need TP.
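A rough sketch of the arithmetic behind that point may help. Everything below is illustrative, not DeepSeek’s published configuration: the parallel_plan helper is hypothetical, and the 2,048-GPU count, 16-way pipeline, and 64-way expert-parallel factors are assumed numbers. The sketch only shows how a fixed GPU count factors into tensor-, pipeline-, and data-parallel dimensions, with expert parallelism tiling the data-parallel group in the way many MoE training setups do.

```python
# Hypothetical sketch (not DeepSeek's code): factoring a GPU count into
# parallelism dimensions. All specific numbers are assumptions for illustration.

def parallel_plan(world_size: int, pp: int, ep: int, tp: int = 1) -> dict:
    """Factor world_size GPUs into tensor- (tp), pipeline- (pp), and
    data-parallel (dp) groups. In many MoE setups, expert parallelism (ep)
    shards experts across ranks of the data-parallel group rather than
    consuming a separate slice of world_size."""
    assert world_size % (pp * tp) == 0, "pp * tp must divide the GPU count"
    dp = world_size // (pp * tp)
    assert dp % ep == 0, "expert-parallel groups must tile the DP group"
    return {"tp": tp, "pp": pp, "dp": dp, "ep": ep}

# Modest cluster, no tensor parallelism: experts are spread across
# data-parallel ranks instead of splitting individual layers.
print(parallel_plan(world_size=2048, pp=16, ep=64))
# -> {'tp': 1, 'pp': 16, 'dp': 128, 'ep': 64}

# Much larger cluster: a tensor-parallel factor can be folded in as well.
# The only invariant the sketch shows is total GPUs = tp * pp * dp,
# with ep tiling dp.
print(parallel_plan(world_size=16384, pp=16, ep=64, tp=8))
# -> {'tp': 8, 'pp': 16, 'dp': 128, 'ep': 64}
```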
