[–] RedClouds@lemmygrad.ml 12 points 6 months ago (7 children)

This isn't a super surprising result. Even American companies have been talking about how China is quickly catching up in the AI space, and if Americans are admitting it, you know it's true. Also, anybody who's been watching the open-source scene has understood that the Chinese models are very competitive. There are many, many leaderboards comparing models, but Qwen, built by Alibaba Cloud, is constantly at the top of the list. In fact, on one list I'm watching, Qwen-based models make up the entire top 20.

Then, of course, they have their own closed-source language models, which are a little harder to test against, but by most accounts they are right behind ChatGPT and Claude.

DeepSeek V3 is an exceptionally large model, so it's a little hard to do exact direct comparisons, but it's blowing things out of the water, and that's pretty crazy.

[–] yogthos@lemmygrad.ml 5 points 6 months ago (4 children)

What's remarkable about DeepSeek V3 is its mixture-of-experts approach. While it has 671 billion parameters overall, it only activates 37 billion at a time, making it very efficient. For comparison, Meta's Llama 3.1 uses all 405 billion of its parameters at once. It also has a 128K token context window, meaning it can process and understand very long documents, and it processes text at 60 tokens per second, twice as fast as GPT-4o.
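
For anyone curious what "only activates 37 billion at a time" means mechanically, here's a toy sketch of top-k expert routing in PyTorch. The sizes and names (`num_experts`, `top_k`, etc.) are made up for illustration; this is the general MoE idea, not DeepSeek's actual code.

```python
# Toy sketch of top-k mixture-of-experts routing (illustrative only;
# not DeepSeek's implementation, and the sizes are tiny toy values).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)      # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)               # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out                                         # only top_k of num_experts ran per token

x = torch.randn(4, 512)
print(MoELayer()(x).shape)  # torch.Size([4, 512])
```

The point is that the router only sends each token through a couple of experts, so most of the parameters sit idle on any given forward pass even though they all exist in the model.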

[–] RedClouds@lemmygrad.ml 6 points 6 months ago (1 children)

That's an important distinction, yes: it's effectively a lot of smaller expert networks added together. I haven't been able to test it yet, as I'm working with downstream tools and the raw stuff just isn't something I've set up (plus, I have like 90 gigs of RAM, not..... well). I read in one place you need 500 GB+ of RAM to run it, so I think all 600+ billion params need to be in memory at once, and you need a quantized model to get it to fit in even that space, which kinda sucks. However, that's how it is for Mistral's mixture-of-experts models too, so no difference there. MoEs are pretty promising.
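
On the RAM question, some napkin math (my own rough numbers, ignoring activations, KV cache, and framework overhead) lines up with that 500 GB+ figure:

```python
# Back-of-envelope RAM needed just to hold all 671B parameters in memory.
# Rough estimate only; real usage adds activations, KV cache, and overhead.
params = 671e9

for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:10s} ~{gb:.0f} GB")

# prints roughly: fp16/bf16 ~1342 GB, int8 ~671 GB, 4-bit ~336 GB
```

So even quantized down to 4-bit you're looking at well over 300 GB just for the weights, which is why 90 gigs doesn't get you anywhere near it.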

[–] yogthos@lemmygrad.ml 3 points 6 months ago

Exactly, it's the approach itself that's really valuable. Now that we know the benefits, they'll translate to all the other models too.
