this post was submitted on 01 Dec 2023

LocalLLaMA


Community for discussing Llama, the family of large language models created by Meta AI.

I'm thinking of upgrading to 64GB of RAM so I can load larger models on my RTX 3090.

If I want to run tigerbot-70b-chat-v2.Q5_K_M.gguf, which has a max RAM usage of 51.61GB, and I load 23GB worth of layers into VRAM, that leaves 51.61 - 23 = 28.61GB to keep in system RAM. My operating system already uses up to 9.2GB of RAM, which means I need 37.81GB of RAM (hence 64GB).
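
In case anyone wants to check my arithmetic, here's the same split as a quick sketch. The 23GB VRAM budget is just my guess for a 24GB card after display/driver overhead, and the 9.2GB OS figure is what I'm seeing on my own machine:

```python
# Memory-split estimate for partial GPU offload of a GGUF model.
# All figures are the ones from the post; only the VRAM budget is assumed.
model_ram_gb = 51.61    # reported max RAM usage of tigerbot-70b-chat-v2.Q5_K_M.gguf
vram_budget_gb = 23.0   # layers offloaded to the RTX 3090 (assumed headroom on a 24GB card)
os_overhead_gb = 9.2    # what the OS already uses

cpu_side_gb = model_ram_gb - vram_budget_gb          # model data left in system RAM
total_ram_needed_gb = cpu_side_gb + os_overhead_gb   # RAM needed overall

print(f"CPU-side model data: {cpu_side_gb:.2f}GB")            # 28.61GB
print(f"Total system RAM needed: {total_ram_needed_gb:.2f}GB") # 37.81GB -> a 64GB kit fits
```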

How many tokens/s can I expect with 23GB of the 51.61GB loaded in VRAM and the remaining 28.61GB in system RAM on an RTX 3090? I'm mostly curious about the Q5_K_M quant, but I'm still interested in other quants.
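
For reference, this is the bandwidth-bound back-of-envelope I've been using to guess at it. It assumes generation speed is limited by reading the weights once per token, and both bandwidth numbers below are assumptions (not measurements), so I'd love real-world figures to compare against:

```python
# Back-of-envelope tokens/s estimate for a partially offloaded model,
# assuming memory-bandwidth-bound generation (weights read once per token).
vram_gb = 23.0        # model data held in VRAM (split from the post)
cpu_ram_gb = 28.61    # model data held in system RAM
gpu_bw_gbps = 900.0   # RTX 3090 memory bandwidth, roughly (assumption)
ram_bw_gbps = 50.0    # dual-channel DDR4-ish system RAM bandwidth (assumption)

t_gpu = vram_gb / gpu_bw_gbps     # seconds per token spent on GPU-resident layers
t_cpu = cpu_ram_gb / ram_bw_gbps  # seconds per token spent on CPU-resident layers
tokens_per_s = 1.0 / (t_gpu + t_cpu)

print(f"Rough upper bound: {tokens_per_s:.1f} tokens/s")  # ~1.7 tokens/s with these numbers
```

If that model is anywhere near right, the CPU-side layers dominate, which is why I'm asking what people actually see in practice.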
