this post was submitted on 28 Jan 2025
875 points (94.4% liked)

memes

Office space meme:

"If y'all could stop calling an LLM "open source" just because they published the weights... that would be great."

[–] KillingTimeItself@lemmy.dbzer0.com 30 points 2 days ago (4 children)

I mean, if it's not directly factually inaccurate, then it is open source. It's just that the specific block of data they used and operate on isn't published or released, which is pretty common even among open source projects.

AI just happens to be in a fairly unique spot where that data is actually pretty important. Though nothing stops other groups from creating an openly accessible dataset through something like distributed computing, which seems to be having a new-kid-on-the-block moment in AI right now.

[–] fushuan@lemm.ee 14 points 2 days ago* (last edited 2 days ago)

The running engine and the training engine are open source. The service that uses the model trained with the open source engine and runs it with the open source runner is not, because a big part of what makes AI work is the trained model, and a big part of the source of a trained model is the training data.
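
To make that distinction concrete, here is a minimal sketch using the Hugging Face transformers library and a hypothetical model id: published weights are enough to run the model, but not to rebuild or verify it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical weights-only release: the inference engine (transformers)
# and the weight artifact are both downloadable, so running it is easy.
model_id = "some-org/open-weight-model"  # placeholder, not a real repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))

# What you cannot do from this release: re-run the training and check that
# these weights actually come out of it, because the training corpus (the
# real "source" of the artifact) was never published.
```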

When they say open source, 99.99% of people will take that to mean everything is verifiable, and it just is not. That is misleading.

As others have stated, a big part of open source development is providing everything needed so that other users can get the exact same results. This has always been the case in open source ML development: people provide links to their training data for reproducibility. That has held for most of the natural language processing papers (the broader field that LLMs fall under) I have read; both the code and the training data are provided.
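
As a rough illustration of what that reproducibility looks like (using the Hugging Face datasets library and a stand-in public dataset), the point is that the data reference in the published code is something anyone can download:

```python
import random

import numpy as np
import torch
from datasets import load_dataset

# Pin everything a third party needs to retrace the result: the seed,
# the code, and a dataset that is itself openly downloadable.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# "imdb" is just a stand-in for whatever public corpus a paper links to.
train_split = load_dataset("imdb", split="train")
print(len(train_split), "training examples")

# From here, the (omitted) training loop can be re-run by anyone and the
# resulting model compared against the published one. Ship only the final
# weights and that check becomes impossible.
```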

An example from the computer vision world, darknet and YOLO: https://github.com/AlexeyAB/darknet

This is the repo with the code to train and run darknet models, and they also provide pretrained models, called YOLO, plus links to the original datasets the YOLO models were trained on. THIS is open source.
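
A small sketch of what that full release buys you, using OpenCV's darknet importer (file names are assumed; the cfg and weights come from the repo's download links):

```python
import cv2

# Load the published network definition and pretrained YOLO weights.
# Both files, plus the training code and dataset links, are public.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")

image = cv2.imread("dog.jpg")  # any local test image
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each detection row is [cx, cy, w, h, objectness, per-class scores...].
for out in outputs:
    for det in out:
        if float(det[4]) > 0.5:
            print("object detected, confidence", round(float(det[4]), 2))
```

Because the dataset links are in the same repo, you could in principle retrain from scratch and compare against these weights, which is exactly the property a weights-only LLM release lacks.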

[–] FooBarrington@lemmy.world 10 points 2 days ago* (last edited 2 days ago)

But it is factually inaccurate. We don't call binaries open-source, and we don't even call visible-source (source-available) code open-source. An AI model is an artifact just like a binary is.

An "open-source" project that doesn't publish everything needed to rebuild isn't open-source.

[–] Miaou@jlai.lu 2 points 2 days ago

Is it common? Many fields have standard, open datasets. That's not the case here, and this data is the most important part of training an LLM.

[–] Treczoks@lemmy.world 2 points 2 days ago

That "specific block of data" is more than 99% of such a project. Hardly insignificant.