Apple is addressing the challenge of running Large Language Models (LLMs) whose size exceeds the Dynamic Random-Access Memory (DRAM) capacity available on a device.
Apple recently released a paper titled ‘LLM in a flash: Efficient Large Language Model Inference with Limited Memory,’ introducing a groundbreaking method for running Large Language Models (LLMs) on devices whose available DRAM cannot hold the full model. The innovation involves storing the model parameters on flash memory and transferring them into DRAM only when they are required. Around this, the authors construct an inference cost model aligned with flash memory behavior, which guides optimization in two critical areas: minimizing the volume of data transferred from flash memory and reading data in larger, more contiguous chunks.
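As a rough illustration of the on-demand loading idea (not Apple's implementation), the Python sketch below keeps weights in a memory-mapped file on flash and copies individual weight blocks into a small DRAM cache only when a layer needs them. The class name, the flat-array file layout, the DRAM budget, and the FIFO eviction policy are all hypothetical assumptions made for the example.

```python
# Hypothetical sketch: pull weights from flash into DRAM on demand rather
# than holding the whole model in DRAM. Names and layout are illustrative.
import numpy as np


class FlashWeightStore:
    def __init__(self, path, layout, dram_budget_bytes):
        # np.load with mmap_mode keeps the array on flash; bytes are read
        # into memory only when sliced, mimicking a flash-to-DRAM transfer.
        # Assumes `path` points to a .npy file holding one flat 1-D array.
        self._flash = np.load(path, mmap_mode="r")
        self._layout = layout              # {name: (offset, shape)}
        self._cache = {}                   # name -> ndarray resident in DRAM
        self._budget = dram_budget_bytes

    def get(self, name):
        if name not in self._cache:
            offset, shape = self._layout[name]
            n = int(np.prod(shape))
            # Reading a contiguous slice copies the data from flash to DRAM.
            block = np.array(self._flash[offset:offset + n]).reshape(shape)
            self._evict_if_needed(block.nbytes)
            self._cache[name] = block
        return self._cache[name]

    def _evict_if_needed(self, incoming_bytes):
        used = sum(a.nbytes for a in self._cache.values())
        while self._cache and used + incoming_bytes > self._budget:
            # Simple FIFO eviction; the paper's management is more refined.
            oldest = next(iter(self._cache))
            used -= self._cache.pop(oldest).nbytes
```

In this toy setup, an inference loop would call `store.get("layer3.ffn.up")` right before the corresponding matrix multiply, paying a flash read only for blocks not already cached.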
Apple’s approach within this flash-memory-informed framework rests on two key techniques. The first, “windowing,” reduces data transfer by reusing neurons activated for recent tokens, so only parameters not already resident in DRAM need to be loaded for each new token. The second, “row-column bundling,” plays to flash memory’s strength in sequential access by storing related rows and columns together and reading them as larger contiguous chunks. Together, these techniques enable LLMs to run efficiently on devices with limited DRAM capacity.
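A minimal Python sketch of both ideas follows, assuming a feed-forward layer whose up-projection row and down-projection column for each neuron are stored back to back, and a sliding window over recently activated neurons. All identifiers (`bundle`, `ActiveNeuronWindow`, the window size) are hypothetical; the real system is considerably more elaborate.

```python
# Hypothetical sketch of windowing and row-column bundling; illustrative only.
import numpy as np


# --- Row-column bundling ---------------------------------------------------
# For a feed-forward block with up-projection W_up (d_ff x d_model) and
# down-projection W_down (d_model x d_ff), neuron i needs row i of W_up and
# column i of W_down. Storing them adjacently turns two scattered flash
# reads into one larger, contiguous read per neuron.
def bundle(w_up, w_down):
    # Result shape: (d_ff, 2 * d_model); row i = [W_up[i, :], W_down[:, i]]
    return np.concatenate([w_up, w_down.T], axis=1)


# --- Windowing ---------------------------------------------------------------
class ActiveNeuronWindow:
    """Keep neurons activated in the last `window` tokens resident in DRAM."""

    def __init__(self, bundled_flash, window=5):
        self.flash = bundled_flash     # e.g. a memory-mapped bundled matrix
        self.window = window
        self.history = []              # per-token sets of active neuron ids
        self.resident = {}             # neuron id -> bundled row in DRAM

    def step(self, active_ids):
        self.history.append(set(active_ids))
        if len(self.history) > self.window:
            self.history.pop(0)
        needed = set().union(*self.history)
        # Load only neurons that are newly needed (incremental transfer).
        for i in needed - self.resident.keys():
            self.resident[i] = np.array(self.flash[i])
        # Free neurons that fell out of the sliding window.
        for i in list(self.resident.keys() - needed):
            del self.resident[i]
        return {i: self.resident[i] for i in active_ids}
```

The design intuition is that consecutive tokens activate largely overlapping sets of neurons, so most of the window is already in DRAM and only a small delta must be fetched from flash at each step.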