Mistral 7B - Mistral AI

Model Overview

  • 7.3B parameter model
  • Outperforms Llama 2 13B on all benchmarks
  • Approaches CodeLlama 7B performance on code tasks
  • Uses grouped-query attention (GQA) for faster inference
  • Uses Sliding Window Attention (SWA) to handle longer sequences at lower cost
  • Released under Apache 2.0 license
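
To make the grouped-query attention bullet above concrete, here is a minimal sketch of the idea: several query heads share one key/value head, so fewer keys and values need to be cached during inference. The function names are hypothetical, and the head counts, layer count, and head dimension used in the usage lines are illustrative assumptions that do not come from this summary.

```python
def kv_head_for_query_head(query_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head that a given query head shares."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads per shared KV head
    return query_head // group_size


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for one sequence (keys and values, all layers)."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative configuration: 32 query heads sharing 8 KV heads.
N_HEADS, N_KV_HEADS, N_LAYERS, HEAD_DIM = 32, 8, 32, 128

full_mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, seq_len=8192)
grouped = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, seq_len=8192)
print(f"KV cache shrinks by a factor of {full_mha // grouped} under these assumptions")
```

Under these assumed numbers the cache shrinks by the ratio of query heads to KV heads, which is one reason GQA can speed up inference.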

Performance Highlights

  • Surpasses Llama 2 13B on all metrics
  • Comparable to Llama 1 34B on many benchmarks
  • Demonstrates superior capabilities in code, reasoning, and English tasks
  • Provides a model fine-tuned for chat, outperforming Llama 2 13B chat

Equivalent Model Sizes

  • In reasoning, comprehension, and STEM reasoning (MMLU), Mistral 7B performs on par with a Llama 2 model more than three times its size
  • The smaller size translates into memory savings and higher throughput
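
As a rough illustration of the memory savings mentioned above, the sketch below estimates weight memory only, assuming 16-bit (2-byte) parameters and ignoring activations and the attention cache. The helper name and the 3x comparison model are hypothetical, chosen only to mirror the "three times its size" comparison.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to hold the weights, in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


MISTRAL_7B_PARAMS = 7.3e9                  # parameter count stated in the overview
LARGER_PARAMS = 3 * MISTRAL_7B_PARAMS      # hypothetical model three times the size

small = weight_memory_gb(MISTRAL_7B_PARAMS)
large = weight_memory_gb(LARGER_PARAMS)
print(f"7.3B model: ~{small:.1f} GB, 3x model: ~{large:.1f} GB at 16-bit precision")
```

The estimate scales linearly with parameter count, so a model that matches a 3x larger one on a benchmark needs roughly a third of the weight memory at the same precision.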

Attention Mechanisms

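  • With Sliding Window Attention (SWA), each layer attends only to a fixed window of preceding positions, which improves speed
  • Compute cost is linear in sequence length, O(sliding_window × seq_len), rather than quadratic as in full attention
  • The fixed attention span allows the cache to be limited to sliding_window tokens, improving memory efficiency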
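
The sliding window and the bounded cache described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the model's implementation: it assumes each position attends to itself and the window - 1 positions before it, and the names sliding_window_mask, attended_pairs, and RollingKVCache are hypothetical.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(seq_len: int, window: int) -> int:
    """Count query/key pairs that are actually computed; at most window * seq_len."""
    return sum(min(i + 1, window) for i in range(seq_len))


class RollingKVCache:
    """Fixed-size cache that overwrites its oldest entry once the window is full."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window
        self.count = 0  # total number of items ever appended

    def append(self, item) -> None:
        # Position count maps to slot count % window, so old entries are overwritten.
        self.slots[self.count % self.window] = item
        self.count += 1

    def contents(self) -> list:
        """Return the cached items, oldest first."""
        if self.count <= self.window:
            return self.slots[:self.count]
        start = self.count % self.window
        return self.slots[start:] + self.slots[:start]


mask = sliding_window_mask(seq_len=5, window=3)
cache = RollingKVCache(window=3)
for position in range(5):
    cache.append(position)
print(mask[4], cache.contents())
```

With a window of 3, the last of five positions sees only the three most recent positions, and the cache never holds more than three entries regardless of sequence length.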

Fine-Tuning for Chat

  • Fine-tuned on instruction datasets publicly available on Hugging Face
  • Mistral 7B Instruct model outperforms all 7B models on MT-Bench and is comparable to 13B chat models
  • No tricks or proprietary data used in fine-tuning

