DeepSeek is not new and is not alone
China's broader technological momentum in artificial intelligence should neither be underestimated nor blown out of proportion. DeepSeek is one of many impressive innovators in China.
This is a copy of an internal research note about DeepSeek, the first version of which dates back to the summer of 2024, right after their paper on DeepSeek V2 had been made public. The most recent update of the note was in December 2024, right before V3 was announced, if memory serves. This version of the note does not cover V3 or R1.
We are sharing it simply because people need to realise:
DeepSeek V3 and DeepSeek R1 did not come out of the blue
DeepSeek is only one of many Chinese AI efforts. It is the most impressive based on what we know, and perhaps less impressive given what some people suspect about how it was actually trained and about what it can do, both good and bad (cybersecurity, censorship, foreign political influence, industrial intelligence…)
References for the papers we reviewed about DeepSeek in 2024 are listed at the bottom of this post.
BQ Research Note
DeepSeek has demonstrated strong performance across various applications, particularly in areas such as language modeling, coding, and mathematical reasoning, while also showing potential in multimodal applications. Here's a breakdown of DeepSeek's performance in different areas:
Language Modeling
Strong Performance: DeepSeek-V2, an open-source Mixture-of-Experts (MoE) language model, achieves top-tier performance among open-source models, even with only 21B activated parameters. It has a total of 236B parameters and supports a context length of 128K tokens.
Efficiency: DeepSeek-V2 is characterized by economical training and efficient inference. Compared with DeepSeek 67B, it reduces the KV cache by 93.3% and boosts maximum generation throughput to 5.76 times.
Multi-Lingual Capabilities: DeepSeek-V2 is trained on a bilingual corpus and demonstrates strong performance in both English and Chinese. DeepSeek-V2 Chat (RL) outperforms all open-source models in Chinese, even surpassing most closed-source models.
Multi-head Latent Attention (MLA): DeepSeek-V2 uses MLA, which compresses the KV cache and boosts inference efficiency while achieving better performance than standard Multi-Head Attention (MHA).
DeepSeekMoE: DeepSeek-V2 utilizes the DeepSeekMoE architecture, which enables training strong models at an economical cost through sparse computation; minimal sketches of both MLA and the sparse routing idea follow below.
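To make the two architectural ideas above a bit more concrete, here is a deliberately simplified PyTorch sketch. The first block shows sparse top-k expert routing with an always-on shared expert, in the spirit of DeepSeekMoE; all dimensions, layer names, and routing details are illustrative assumptions, not the published DeepSeek-V2 implementation.

```python
# Illustrative sketch of DeepSeekMoE-style sparse routing: each token is
# processed by a shared expert plus only its top-k routed experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)         # routing probabilities
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(x)
        for slot in range(self.top_k):                     # only k experts fire per token
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    routed[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return self.shared_expert(x) + routed              # shared expert sees every token
```

The second block illustrates the MLA idea of storing a compressed latent in the KV cache and expanding it into keys and values only when attention is computed (query compression and the decoupled rotary-position key used in the real model are omitted):

```python
# Illustrative sketch of an MLA-style latent KV cache: only a small latent
# vector is stored per decoded token, not full per-head keys and values.
class LatentKVCache(nn.Module):
    def __init__(self, d_model=512, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # compress hidden state
        self.up_k = nn.Linear(d_latent, d_model, bias=False)   # expand to keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)   # expand to values
        self.cache = []                                        # latents, not full K/V

    def append(self, h):                                       # h: (batch, d_model)
        self.cache.append(self.down(h))                        # cache only d_latent floats

    def keys_values(self):
        c = torch.stack(self.cache, dim=1)                     # (batch, seq, d_latent)
        return self.up_k(c), self.up_v(c)                      # rebuilt on the fly
```

During decoding, only the small per-token latent (plus, in the real model, a decoupled positional key) has to be kept around, which is where the reported KV-cache reduction comes from; likewise, each token only runs through its routed top-k experts plus the shared ones, which is why only about 21B of DeepSeek-V2's 236B parameters are activated per token.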
Coding
DeepSeek-Coder: DeepSeek has developed a series of open-source code models called DeepSeek-Coder, ranging from 1.3B to 33B parameters, trained on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and use a fill-in-the-blank task to enhance code generation and infilling capabilities.
State-of-the-art Performance: DeepSeek-Coder models achieve state-of-the-art performance among open-source code models across multiple benchmarks. The DeepSeek-Coder-Instruct 33B surpasses OpenAI's GPT-3.5 Turbo in the majority of evaluation benchmarks.
Cross-file Code Completion: DeepSeek-Coder demonstrates superior performance in cross-file code completion tasks compared to other models, due to its repository-level pre-training.
Fill-In-the-Middle (FIM): DeepSeek-Coder models are trained with a 0.5 FIM rate, which enables them to generate code proficiently by filling in blanks based on the surrounding context (a sketch of this data format follows this list).
Multi-turn Dialogue: DeepSeek-Coder-Instruct models can provide complete solutions in multi-turn dialogue settings.
Long Context: DeepSeek-Coder has an extended context window of 16K, enabling it to handle more complex coding tasks. It can theoretically process up to 64K tokens.
Library Usage: DeepSeek-Coder models can use libraries accurately in real data science workflows.
Continued Pre-Training: DeepSeek-Coder-v1.5 maintains its high coding performance while exhibiting enhanced natural language comprehension and improved mathematical reasoning abilities.
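As a rough illustration of the fill-in-the-middle objective mentioned in the 0.5 FIM rate item above, the sketch below rewrites a training document into a prefix/suffix/middle sequence using sentinel tokens, applying the transformation to roughly half the samples. The sentinel token names, the random split, and the character-level cut points are assumptions for illustration, not DeepSeek-Coder's exact preprocessing pipeline.

```python
# Illustrative sketch of fill-in-the-middle (FIM) data construction.
# Sentinel names and the split strategy are placeholders, not the exact
# DeepSeek-Coder preprocessing.
import random

FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"


def to_fim_sample(code: str, fim_rate: float = 0.5, rng=random) -> str:
    """With probability fim_rate, turn `code` into a prefix/suffix/middle sample."""
    if rng.random() >= fim_rate or len(code) < 3:
        return code                                    # keep as plain next-token data
    i, j = sorted(rng.sample(range(1, len(code)), 2))  # prefix | middle | suffix cuts
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # PSM layout: the model sees the prefix and suffix, then learns to emit the middle.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"


if __name__ == "__main__":
    doc = "def add(a, b):\n    return a + b\n"
    print(to_fim_sample(doc, fim_rate=1.0))            # always transformed for the demo
```

Training on such sequences is what lets the model later fill a hole in the middle of a file given the code before and after it, which is the infilling behaviour the note refers to.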
Mathematical Reasoning
Strong Performance: DeepSeek models demonstrate a strong capacity for mathematical reasoning. DeepSeek-V2 achieves top-ranking performance on the MMLU benchmark, which includes math problems.
Program-Aided Math Reasoning: DeepSeek-Coder models excel at program-based math reasoning, outperforming other models on various benchmarks (see the sketch after this list).
DeepSeekMath: DeepSeek has a dedicated model for math called DeepSeekMath.
Offline Reinforcement Learning: DeepSeek models can improve their math reasoning performance using offline reinforcement learning.
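To illustrate the program-aided setting referred to above: instead of producing a final number directly, the model is prompted to write a short program whose execution yields the answer. The sketch below shows that loop under the assumption of a hypothetical generate(prompt) callable standing in for any code model; it is not DeepSeek's benchmark harness.

```python
# Illustrative sketch of program-aided (PAL-style) math reasoning.
# `generate` is a hypothetical stand-in for any code LLM completion call.
import contextlib
import io

PROMPT_TEMPLATE = (
    "# Solve the problem by writing Python that prints the answer.\n"
    "# Problem: {question}\n"
)


def solve_with_program(question: str, generate) -> str:
    """Ask the model for a program, run it, and return whatever it prints."""
    program = generate(PROMPT_TEMPLATE.format(question=question))
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})   # no sandboxing here; never run untrusted model output like this
    return buffer.getvalue().strip()


if __name__ == "__main__":
    # A fake "model" returning a hard-coded program, just to exercise the loop.
    fake_generate = lambda prompt: "print((12 * 7) + 5)"
    print(solve_with_program("What is 12 times 7 plus 5?", fake_generate))  # -> 89
```

Offloading the arithmetic to an interpreter in this way is a large part of why strong code models tend to do well on program-based math benchmarks.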
Multimodal Applications
Safety Awareness: While DeepSeek models show potential in multimodal applications, they may lack genuine safety awareness in unsafe situations.
Situational Safety: DeepSeek models may struggle with recognizing unsafe situations, especially in embodied scenarios where they need to understand visual cues.
Multi-Agent Framework: DeepSeek's performance in multimodal tasks can be improved by using a multi-agent framework, which enhances safety awareness and situational judgment (a rough sketch follows below).
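The multi-agent idea above can be pictured as wrapping the model's draft answer in a second "judge" pass that checks it against the visual context before it is returned. The sketch below is a generic, hypothetical illustration of that pattern (the chat callable stands in for any VLM/LLM), not the framework used in the work the note refers to.

```python
# Illustrative, hypothetical two-pass safety wrapper around a model call.
def answer_with_safety_review(question: str, scene: str, chat) -> str:
    """Draft an answer, then let a second 'judge' pass veto unsafe responses."""
    draft = chat(
        f"Scene: {scene}\nUser request: {question}\nPropose a helpful response."
    )
    verdict = chat(
        f"Scene: {scene}\nProposed response: {draft}\n"
        "Answer SAFE or UNSAFE: would following this response cause harm here?"
    )
    if verdict.strip().upper().startswith("UNSAFE"):
        return "I can't help with that in this situation; it looks unsafe."
    return draft


if __name__ == "__main__":
    # Dummy stand-in model, just to exercise the wrapper end to end.
    def fake_chat(prompt: str) -> str:
        if "Proposed response" in prompt:
            return "UNSAFE"                      # the judge pass vetoes the draft
        return "Sure, hand it over right away."

    print(answer_with_safety_review(
        "Please pass me the knife.",
        "A toddler is reaching toward the kitchen counter.",
        fake_chat,
    ))
```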
Other Applications
Domain-Specific Applications: DeepSeek models can be adapted for domain-specific applications, as seen in the development of medical- and telecom-focused LLMs.
Chat Applications: DeepSeek has developed chat versions of its language models through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which enhance their performance in conversational settings (a minimal SFT sketch follows below).
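To illustrate what the SFT stage involves in practice: the model is trained on prompt/response pairs with the loss computed only on the response tokens. Below is a minimal single-step sketch using Hugging Face transformers; the checkpoint name, prompt format, and hyperparameters are placeholders rather than DeepSeek's recipe, and the RL stage is omitted entirely.

```python
# Minimal, illustrative SFT step: compute the loss only on the assistant's
# response tokens. Checkpoint name and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-llm-7b-base"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "User: Explain KV caching in one sentence.\nAssistant: "
response = "It stores past attention keys and values so decoding can reuse them."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100           # ignore prompt tokens in the loss
                                                  # (tokenizer boundary effects ignored)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()
optimizer.step()
```

In the real pipeline this runs over large curated dialogue datasets, and the RL stage mentioned above is then applied on top of the SFT checkpoint.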
Overall Performance
Low-Cost Performance: DeepSeek is known for developing high-performing models at a lower cost than its US counterparts.
Efficient Training: DeepSeek models are trained efficiently, using fewer computing resources.
Open-Source Commitment: DeepSeek releases many of its models as open-source, facilitating further research and development.
DeepSeek demonstrates impressive capabilities in language, coding, and math applications, achieving state-of-the-art performance among open-source models while also focusing on efficiency and cost-effectiveness. Its models may have some limitations in safety awareness and visual understanding in multimodal tasks, but they show promising results in other applications.
References
DeepSeek-AI. (2024). DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954. https://arxiv.org/abs/2401.02954
Dai, D., et al. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066. https://arxiv.org/abs/2401.06066
Guo, D., et al. (2024). DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence. arXiv preprint arXiv:2401.14196. https://arxiv.org/abs/2401.14196
Shao, Z., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300
Lu, H., et al. (2024). DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525. https://arxiv.org/abs/2403.05525
DeepSeek-AI. (2024). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434. https://arxiv.org/abs/2405.04434
Zou, H., et al. (2024). TelecomGPT: A framework to build telecom-specific large language models. arXiv preprint arXiv:2407.09424. https://arxiv.org/abs/2407.09424
Chen, J., et al. (2024). HuatuoGPT-o1: Towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925. https://arxiv.org/abs/2412.18925