The exponential growth in Large Language Model
(LLM) deployment has intensified the need for efficient model
compression techniques to reduce computational costs and memory requirements. While pruning and quantization have shown
promising results, their combined potential remains largely
unexplored. In this paper, we examine joint compression, investigating how
a strategic combination of pruning and quantization can
achieve superior compression-to-performance ratios compared
to either technique applied alone. Recognizing the challenges
in accurately assessing LLM performance, we address key
limitations of previous evaluation frameworks and introduce the
Semantic Retention Compression Rate (SrCr), a novel metric that
quantifies the trade-off between model compression and semantic
preservation, facilitating optimization of pruning-quantization
configurations. Experiments demonstrate that our recommended
combination achieves, on average, a 20% performance increase
over an equivalent quantization-only model at the same
theoretical compression ratio.