this post was submitted on 01 Dec 2023

LocalLLaMA


Community for discussing Llama, the family of large language models created by Meta AI.

I'm thinking of upgrading to 64GB of RAM so I can load larger models on my RTX 3090.

If I want to run tigerbot-70b-chat-v2.Q5_K_M.gguf, which has a max RAM usage of 51.61GB, and I load 23GB worth of layers into VRAM, that leaves 51.61 - 23 = 28.61GB to keep in system RAM. My operating system already uses up to 9.2GB of RAM, which means I need 37.81GB of RAM (hence 64GB).
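
In case anyone wants to check my arithmetic, here's the same split as a quick sketch. The 23GB VRAM budget is just my guess for a 24GB card after display/driver overhead, and the 9.2GB OS figure is what I'm seeing on my own machine:

```python
# Memory-split estimate for partial GPU offload of a GGUF model.
# All figures are the ones from the post; only the VRAM budget is assumed.
model_ram_gb = 51.61    # reported max RAM usage of tigerbot-70b-chat-v2.Q5_K_M.gguf
vram_budget_gb = 23.0   # layers offloaded to the RTX 3090 (assumed headroom on a 24GB card)
os_overhead_gb = 9.2    # what the OS already uses

cpu_side_gb = model_ram_gb - vram_budget_gb          # model data left in system RAM
total_ram_needed_gb = cpu_side_gb + os_overhead_gb   # RAM needed overall

print(f"CPU-side model data: {cpu_side_gb:.2f}GB")            # 28.61GB
print(f"Total system RAM needed: {total_ram_needed_gb:.2f}GB") # 37.81GB -> a 64GB kit fits
```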

How many tokens/s can I expect with 23GB of the 51.61GB loaded in VRAM and the remaining 28.61GB in system RAM on an RTX 3090? I'm mostly curious about the Q5_K_M quant, but I'm still interested in other quants.
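
For reference, this is the bandwidth-bound back-of-envelope I've been using to guess at it. It assumes generation speed is limited by reading the weights once per token, and both bandwidth numbers below are assumptions (not measurements), so I'd love real-world figures to compare against:

```python
# Back-of-envelope tokens/s estimate for a partially offloaded model,
# assuming memory-bandwidth-bound generation (weights read once per token).
vram_gb = 23.0        # model data held in VRAM (split from the post)
cpu_ram_gb = 28.61    # model data held in system RAM
gpu_bw_gbps = 900.0   # RTX 3090 memory bandwidth, roughly (assumption)
ram_bw_gbps = 50.0    # dual-channel DDR4-ish system RAM bandwidth (assumption)

t_gpu = vram_gb / gpu_bw_gbps     # seconds per token spent on GPU-resident layers
t_cpu = cpu_ram_gb / ram_bw_gbps  # seconds per token spent on CPU-resident layers
tokens_per_s = 1.0 / (t_gpu + t_cpu)

print(f"Rough upper bound: {tokens_per_s:.1f} tokens/s")  # ~1.7 tokens/s with these numbers
```

If that model is anywhere near right, the CPU-side layers dominate, which is why I'm asking what people actually see in practice.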
