You can’t just write off capital expenditure though. The hardware, even for “efficient” MoE inference, is still very expensive to buy, house, run, and cool. Even assuming open-weight model serving at $0 R&D for the models themselves, high-prefill workloads don’t batch well with decode-heavy concurrency (or with other prefill-heavy jobs). The moment you do anything nontrivial you run into very complicated architectural problems that are hard to solve efficiently at scale.
Hardware with a useful life of 5-10 years at most, plus development and support for the inference workflows, doesn’t leave a lot of margin on the table.
My gut, along with basically everything I read, suggests that most shops (even pure-inference ones) are not profitable and are still floating on loans or VC money.
At a 10-year lifetime, it sounds like the hardware costs about as much to buy as it does to run, not factoring in the time value of money…
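For anyone who wants to poke at that buy-versus-run comparison, here is a rough sketch. Every number in it is made up purely for illustration (the purchase price, the yearly running cost, and the 8% discount rate are all assumptions, not figures from any real deployment), and `lifetime_costs` is just a hypothetical helper name.

```python
def lifetime_costs(capex, annual_opex, years, discount_rate=0.0):
    """Return (capex, present value of running costs over `years`).

    Running costs are assumed to be paid at the end of each year;
    capex is assumed to be paid up front, so it is not discounted.
    """
    pv_opex = sum(
        annual_opex / (1 + discount_rate) ** year
        for year in range(1, years + 1)
    )
    return capex, pv_opex


# Hypothetical inputs, chosen only so that buy == run at 10 years undiscounted.
capex = 300_000.0        # assumed purchase price
annual_opex = 30_000.0   # assumed yearly power, cooling, and hosting

buy, run = lifetime_costs(capex, annual_opex, years=10)
buy_d, run_d = lifetime_costs(capex, annual_opex, years=10, discount_rate=0.08)

print(f"undiscounted: buy={buy:,.0f} run={run:,.0f}")
print(f"discounted:   buy={buy_d:,.0f} run={run_d:,.0f}")
```

With these assumed inputs, buying and running come out equal when nothing is discounted. Once a discount rate is applied, the future running costs shrink in present-value terms while the up-front purchase does not, so the purchase price weighs more heavily in the comparison.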
If you assume they are unprofitable, the only question becomes whether they are more or less unprofitable by serving the older models for longer.