LocalLLaMA

3 readers
1 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 1 year ago
MODERATORS
1
2
 
 

hi folks,

simple question really - what model (finetuned or otherwise) have you found that can extract data from a bunch of text.

I'm happy to finetune, so if there are any successes there, would really appreciate some pointers in the right direction.

Really looking for a starting point here. I'm aware of the DETR class of models and how Microsoft trained table-transformers on DETR. Wondering if that can be done on llama2,etc models ?

P.S. cannot use GPT because of sensitive PII data.

3
 
 

Hi,

Red teaming is one of the crucial steps for safe guarding llms.

I want to know how to get started with red teaming, what process should I follow.

4
 
 

Currently running them on-CPU:

  • Ryzen 9 3950x

  • 64gb DDR4 3200mhz

  • 6700xt 12gb (does not fit much more than 13b models, so not relevant here)

While running on-CPU with GPT4All, I'm getting 1.5-2 tokens/sec. It finishes, but man is there a lot of waiting.

What's the most affordable way to get a faster experience? The two models I play with the most are Wizard-Vicuna 30b, and WizardCoder and CodeLlama 34b

5
 
 

I’m working on a project to generate text from a 1.2B parameter full precision LLM (5gb)

Unfortunately I’m limited in the infrastructure I can use to deploy this model. There is no batch inference supported. The infrastructure I have allows me to deploy a copy of the model on a single A100, 1 per process with up to 9 processes supported (these are called “replicas”). I understand that this makes little sense given my model is memory bound, and each process will fight for memory bandwidth to read in the same weights, but I can’t change that for now.

My average input and output tokens are roughly 1000 each. I estimate the kv cache per token is roughly 400kB using full precision.

I have benchmarks of the latency of the model using various “replicas” as described above. I wanted to compare this to the theoretical performance of the A100. For my use case time to first token is negligible (<200ms), and generation is memory bound.

I find that with 5 or more replicas, the math works out and my model is roughly as fast as I expect. For example, with 1000 output tokens, 6 replicas, it’s like I’m generating using a batch of 6 requests from a 30gb model + 5gb for the kv cache. At a memory bandwidth around 1-1.3tbps that translates to ~30s per request, which is not far from what I see. The same goes for other replica numbers, 5, 7, 8 and 9.

However, when I run with a single replica, I expect generation to hover around the 5-6s mark on average. Instead, I see > 20s. I need to add 4 more replicas before the number starts to make sense. It almost seems like the model takes up too little memory to be allocated the entire memory bandwidth.

Does anyone know where this extra latency could be coming from? Do models have to reach a certain amount of used memory for A100 memory bandwidth to hit their available memory bandwidth?

6
 
 

Great news! Beijing Academy of Artificial Intelligence(BAAI) published a new dataset Chinese Corpus Internet (CCI v1.0.0), a large-scale dataset for Chinese language model pretraining and collected with leading institues in China. This open-source dataset is designed to offer an important data foundation for the AI Large-Language Model in Chinese. It includes contents from >1000 most important websites in Chinese, from Jan. 2001 to Nov. 2023. It has been filtered for high quality, content safety, deduplication, and content correction with lots of manual checking. This dataset is 104GB in total, filtered from a much larger one (original size is >800GB). I would encourage you to include this dataset for training an LLM supporting Chinese as one of its languages.

URL for downloading:

https://huggingface.co/datasets/BAAI/CCI-Data

https://data.baai.ac.cn/details/BAAI-CCI

7
8
 
 

Currently have a msi x670 carbon motherboard with 4090/3090 combo on it that works well enough, but when tinkering with ai pain to have to close it at times when friends messing with bots or stable diff, want to load up a game so was thinking since I mostly just play stuff like rimworld or dota 2 lately and have a 7950x3d, i could get the thinnest 4060ti 16gb could get more vram for the larger models and could find and give up gaming on 4090 so could fit that on the bottom pci slot (its by far my biggest card)

https://rog.asus.com/uk/motherboards/rog-strix/rog-strix-x670e-e-gaming-wifi-model/ looking at this one thinking might be enough room for a middle slot card. Rest of pc is a 7950x3d/96 gig ram. I managed to get a small 3090 ( 2ish slot evga) that would fit on top slot. (The 4090 is like 4 slots in size. I also have most of bits to build a second pc but thinking for cost of new cpu/ram/mb i could try this option too since could sell the old mb for part of cost but does anyone know of any other mb options for 3 gpu.

(https://i.imgur.com/SWxUm5i.jpeg) Looks tight but i have fair bit of space below that 4090 that maybe could get enough space to fit another card between them, have gpus running at 60% power so never really get into high temp ranges.

4090 is 4 slot so has to go on bottom to fit in case. 3090 2 slot 4060 (or anything for gaming can go anywhere)

Thanks.

9
 
 

Can you make any suggestions for a model that is good for general chat, and is not hyper-woke?

I've just had one of the base Llama-2 models tell me it's offensive to use the word "boys" because it reinforces gender stereotypes. The conversation at the time didn't even have anything to do with gender or related topics. Any attempt to get it to explain why it thought this resulted in the exact same screen full of boilerplate about how all of society is specifically designed to oppress women and girls. This is one of the more extreme examples, but I've had similar responses from a few other models. It's as if they tried to force their views on gender and related matters into conversations, no matter what they were about. I find it difficult to believe this would be so common if the training had been on a very broad range of texts, and so I suspect a deliberate decision was made to imbue the models with these sorts of ideas.

I'm looking for something that isn't politically or socially extreme in any direction, and is willing to converse with someone taking a variety of views on such topics.

10
 
 

Optimum Intel int4 on iGPU UHD 770

I'd like to share the result of inference using Optimum Intel library with Starling-LM-7B Chat model quantized to int4 (NNCF) on iGPU Intel UHD Graphics 770 (i5 12600) with OpenVINO library.

I think it's quite good 16 tk/s with CPU load 25-30%. Same performance with int8 (NNCF) quantization.

This is inside a Proxmox VM with SR-IOV virtualized GPU 16GB RAM and 6 cores. I also found that the ballooning device might cause crash of the VM so I disabled it while the swap is on a zram device.

free -h output while inferencing:

total used free shared buff/cache available

Mem: 15Gi 6.2Gi 573Mi 4.7Gi 13Gi 9.3Gi

Swap: 31Gi 256Ki 31Gi

Code adapted from https://github.com/OpenVINO-dev-contest/llama2.openvino

What's your thoughts on this?

11
 
 

Where can I find charts about top performing 13b parameters LLM models?

I am trying to download a model and run it locally which fit my PC specs

Appreciate your feedback in advanced boys

12
 
 

Armen Aghajanyan, a research scientist at Meta AI, tweeted a few hours ago that they hit a big breakthrough last night. Unknown if it's related to LLMs or if it will even be open-sourced, but just thought I'd share here to huff some hopium with y'all.

13
 
 

It's working great so far. Just wanted to share and spread awareness that running multiple instances of webui (oobabooga) is basically a matter of having enough ram. I just finished running three models simultaneously (taking turns of course). Only offloaded one layer to gpu per model, used 5 threads per model, and all contexts were set to 4K. (the computer has 6 core cpu, 6GB vram, 64GB ram)

The models used were:

dolphin-2.2.1-ashhlimarp-mistral-7b.Q8_0.gguf

causallm_7b.Q5_K_M.gguf

mythomax-l2-13b.Q8_0.gguf (i meant to load a 7B on this one though)

I like it because it's similar to the group chat on character.ai but without the censorship and I can edit any of the responses. Downsides are having to copy/paste between all the instances of the webui, and it seems that one of the models was focusing on one character instead of both. Also, I'm not sure what the actual context limit would be before the gpu would go out of memory.

https://preview.redd.it/8i6wwjjtt54c1.png?width=648&amp%3Bformat=png&amp%3Bauto=webp&amp%3Bs=26adca2a850f62165301390cdd4ba11548447c0d

https://preview.redd.it/3c9z5ee9u54c1.png?width=1154&amp%3Bformat=png&amp%3Bauto=webp&amp%3Bs=210d7c67bcf0efafeb3f328e76199f13159dae64

https://preview.redd.it/lt8aizhbu54c1.png?width=1154&amp%3Bformat=png&amp%3Bauto=webp&amp%3Bs=d24f8b2bf899084bbdb11d73e34b5564b629e0be

https://preview.redd.it/8lbl4nzeu54c1.png?width=1154&amp%3Bformat=png&amp%3Bauto=webp&amp%3Bs=a81b8f1d8630e3d17ad37885915f8c7e3077584c

14
 
 

Here is an amazing interactive tool I found on X/Twitter made by Brendan Bycroft that helps you understand how GPT LLMs work.

Web UI

With this, you can see the whole thing at once. You can see where the computation takes place, its complexity, and relative sizes of the tensors & weights.

LLM Visualization

A visualization and walkthrough of the LLM algorithm that backs OpenAI's ChatGPT. Explore the algorithm down to every add & multiply, seeing the whole process in action.

LLM Visualization Github

This project displays a 3D model of a working implementation of a GPT-style network. That is, the network topology that's used in OpenAI's GPT-2, GPT-3, (and maybe GPT-4).

The first network displayed with working weights is a tiny such network, which sorts a small list of the letters A, B, and C. This is the demo example model from Andrej Karpathy's minGPT implementation.

The renderer also supports visualizing arbitrary sized networks, and works with the smaller gpt2 size, although the weights aren't downloaded (it's 100's of MBs).

15
 
 

Right now it seems we are once again on the cusp of another round of LLM size upgrades. It appears to me that having 24gb VRAM gets you access to a lot of really great models, but 48gb VRAM really opens the door towards the impressive 70B models and allows you to nicely run the 30B models. However, im seeing more and more 100B+ models being created that push the 48 gb VRAM specs down into lower quants if they are able to run the model at all.

this is in my opinion is big, because 48gb is currently the magic number for in my opinion consumer level cards, 2x 3090's or 2x 4090s. adding an extra 24gb to a build via consumer GPUs turns into a monumental task due to either space in the tower or capabilities of the hardware AND it would put you at 72gb VRAM putting you at the very edge of the recommended VRAM for the 120GB 4KM models.

I genuinely don't know what i am talking about and i am just rambling, because i am trying to wrap my head around HOW to upgrade my vram to load the larger models without buying a massively overpriced workstation card. should i stuff 4 3090's into a large tower? settle up 3 4090's in a rig?

how can the average hobbyist make the jump from 48gb to 72gb+?

is taking the wait and see approach towards nvidia dropping new scalper priced high VRAM cards feasible? Hope and pray for some kind of technical magic that drops the required VRAM while simultaneously keeping quality?

the reason i am stressing about this and asking for advice is because the quality difference between smaller models and 70B models is astronomical. and the difference between the 70B models and the 100+B models is a HUGE jump too. from my testing it seems that the 100B+ models really turn the "humanization" of the LLM up to the next level, leaving the 70B models to sound like...well.. AI.

I am very curious to see where this gets to by the end of 2024, but for sure.... i won't be seeing it on a 48gb VRAM set up.

16
 
 

I seen a lot of posts here talking about this or that model being great for storytelling/writing but when I try them out the prose is…well…flat, boring and plain unfunny. I’m not interested in models to write NSFW (nothing against it just not my thing). I’m looking for models that can actually output stuff that sounds literary - for example, if I ask the model to write something in the style of x (where x is an author with a very unique style) that the output has some of the author’s style in it. Or if I ask it to write a poem, that the result isn’t like something out of some kid’s book written by Dr. Seuss.

With the exception of one model I tried [Storywriter 13b] (which sort of produced something literary after a little coaxing and leading). All the others produced results which sounded like entries from an encyclopedia or dictionary (lifeless, droning, emotionless, etc.). And the leaderboard hasn’t been much help in identifying anything that’s close to what I’m looking for - the top rated models I have tried are the worst when it comes to prose of the kind I’m looking for, in my limited experience.

Does anyone know of any models, that I can run on my local computer, that can produce “literary” prose (I.e. moving, detailed descriptions plus creative story writing)? Not looking for perfect just better… I’m hoping one of you might have come across a model I haven’t seen/tried so any and all suggestions will be appreciated.

17
 
 

What is everyone's experiences so far with DPO trained versions of their favorite models? Been messing around with different models and my two new favorite models are actually just the DPO versions of my previous favorite models (causalLM 14b and openhermes 2.5 7b). Links below for the models in question.

CausalLM 14B-DPO-alpha - GGUF: https://huggingface.co/tastypear/CausalLM-14B-DPO-alpha-GGUF

NeuralHermes 2.5 Mistral 7B - GGUF: https://huggingface.co/TheBloke/NeuralHermes-2.5-Mistral-7B-GGUF

The former runs at 30 t/s for me with koboldcpp-rocm on a 6900 XT, and the latter at 15 t/s, both at Q6K. I don't have a favorite between these two models, they seem to be better at different things and trade blows in all the logic + creative writing tasks I've tested them in, despite causalLM being a larger model. I'm looking forward to seeing what nousresearch/teknium and CausalLM are bringing next.

18
19
 
 

I feel there’s go to be away without having a mega fast computer. There’s a couple on Google Colab but I have privacy issues

20
 
 

I am planing to use retrieval augmented generation (RAG) based chatbot to look up information from documents (Q&A).

I did try with GPT3.5 and It works pretty well. Now I want to try with Llama (or its variation) on local machine. The idea is to only need to use smaller model (7B or 13B), and provide good enough context information from documents to generate the answer for it.

Have anyone done it before, any comments?

Thanks!

21
 
 

Hello,

I know it's not local nor a LLM but I don't know better suited community for what I'm looking for.

I'd like to demonstrate the habilities of AI in a context of basic office work; for this, I suggested that we record our meetings and use Whisper to transcribe the conversations and use Claude to summarize everything.

Hopefully, we'll buy a computer strong enough for this kind of things next year but for now, we basically own fancy typewriters with screens.

Can someone recommand a good website for this ? I know that Fal.ai give an access to Whisper, but maybe something better exists somewhere ?

Maybe even an easy to use implementation of WhisperX (for diarizations of different speakers) ? I watched a tuto in a collab but our meetings are usually more than 1h long so I don't know if it's suitable ?

I'd be glad to read about your ideas (and feel free to roast my english, I'm just a basic frenchy sacrebleu).

22
 
 

I've got MacBook Pro M1 16GB. In order to run deepseek-coder with 6.7b parameters, I need to reduce context, as it haven't got much ram. So, how can it affect this model performance? How far I can go reducing context?

EDIT: I may have used the wrong word. Instead of performance, I meant accuracy. Sorry for my bad English

23
 
 

so just like how screenshot 2 code/design works but instead it will be for videos.
A good example could be someone uploading a video of a simple 2d game like angry bird and the model will give back a working game/demo. is this possible now? if not can some please explain why

24
 
 

6 bit for now - having strange llama memory issues with 8 bit even though plenty of GPU mem. It is also somewhat throttled response

It is uncensored so use at your own risk !!!

You can put prompt hints in parentheses brackets stars etc https://projectatlantis.eu.ngrok.io/chat.html

Curious how ppl think it compares to smaller models

25
 
 

I'm thinking of upgrading to 64GB of RAM so I can lot larger models on my rtx 3090.

If I want to run tigerbot-70b-chat-v2.Q5_K_M.gguf which has max RAM usage of 51.61GB, assuming I load 23GB worth of layers into VRAM that leaves 51.61-23=28.61 left to load in RAM. My operating system already uses up to 9.2GB of RAM which means I need 37.81GB of RAM (hence 64GB).

How many tokens/s can I expect from 23GB out of 51.61GB being loaded in VRAM, and 28.61GB being loaded in RAM on an rtx 3090? I'm mostly curious about Q5_K_M quant, but I'm still interested in other quants.

view more: next ›