Understanding DeepSeek R1
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-efficient, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag, before answering with a final summary.
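If you consume R1's raw output programmatically, the reasoning and the final summary are easy to separate. Here is a minimal sketch, assuming the <think>...</think> convention described above; the example prompt and helper name are mine, not part of the original article.

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, final_answer)."""
    match = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    if not match:
        return "", output.strip()          # no reasoning block emitted
    return match.group(1).strip(), output[match.end():].strip()

reasoning, answer = split_reasoning(
    "<think>'strawberry' has r's at positions 3, 8 and 9, so three.</think>"
    "There are 3 r's in 'strawberry'."
)
print(answer)  # -> There are 3 r's in 'strawberry'.
```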
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain ideas better, which leads the model to choose the most expressive language for the task.
Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual one:

The usual training approach: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages

Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to begin RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general abilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model (see the sketch after this list). They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general abilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
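To make the rejection-sampling step concrete, here is a minimal sketch of the idea, assuming a `generate` function that samples from the RL checkpoint and an `is_acceptable` verifier; both are placeholders of mine, not DeepSeek's code.

```python
def rejection_sample_sft_data(prompts, generate, is_acceptable, n_candidates=16):
    """Sample several completions per prompt and keep only those that pass
    the verifier, yielding (prompt, completion) pairs for the next SFT stage."""
    sft_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        accepted = [c for c in candidates if is_acceptable(prompt, c)]
        if accepted:
            # Keep one accepted sample per prompt to limit near-duplicates.
            sft_pairs.append((prompt, accepted[0]))
    return sft_pairs
```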
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student. The teacher is typically a larger model than the student.
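The data side of this is straightforward; a minimal sketch, where `teacher_generate` stands in for sampling from the teacher (R1) and is not a specific library API:

```python
def build_distillation_set(teacher_generate, prompts):
    """The teacher (R1) answers each prompt with a full reasoning trace;
    the (prompt, trace) pairs become the student's SFT dataset."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

# The student (e.g. a Qwen or Llama base model) is then trained with plain
# supervised fine-tuning on these pairs; no RL is applied to the distilled models.
```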
Group Relative Policy Optimization (GRPO)

The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected <think>/<answer> format, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
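As an illustration of how simple such rules can be, here is a toy reward function in the spirit of those criteria. The scores, the <think>/<answer> regex, and the crude language check are my assumptions, not DeepSeek's actual reward.

```python
import re

def ascii_heavy(text: str) -> bool:
    # Crude language proxy: is the text mostly ASCII (e.g. English) or not?
    return sum(ch.isascii() for ch in text) / max(len(text), 1) > 0.8

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: format + correctness + language consistency."""
    reward = 0.0
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return reward                      # wrong format, no reward
    reward += 0.5                          # followed the expected format
    answer = match.group(1).strip()
    if answer == reference_answer.strip():
        reward += 1.0                      # verifiably correct final answer
    if ascii_heavy(answer) == ascii_heavy(prompt):
        reward += 0.25                     # answer language matches the prompt
    return reward
```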
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior (see the sketch below).
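To make steps 3 and 4 concrete, here is a minimal PyTorch sketch of the group-relative advantage and the clipped, KL-penalized update. It is my own simplification: it omits per-token bookkeeping and the exact KL estimator from the paper, and the clip/KL coefficients are illustrative.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Step 3: rewards has shape (G,), one scalar per sampled response for a
    single prompt; each advantage says how much better a response is than its
    own group, so no separate critic/value model is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref,
              clip_eps=0.2, kl_coef=0.04):
    """Step 4: PPO-style clipped surrogate plus a KL penalty toward the
    reference policy, so each update only nudges the model slightly."""
    ratio = (logp_new - logp_old).exp()                      # likelihood ratio per response
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean() + kl_coef * kl_to_ref
```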
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the techniques they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's abilities.
671B via Llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
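For reference, a minimal sketch of a comparable partially offloaded run using the llama-cpp-python bindings; the GGUF path, prompt, and sampling settings are placeholders of mine, and the 4-bit KV-cache quantization mentioned above would be set through llama.cpp's cache-type options, which are omitted here.

```python
from llama_cpp import Llama

# Placeholder path to the first shard of Unsloth's UD-IQ1_S GGUF export.
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=29,   # partial offloading: 29 layers on the GPU, the rest on CPU
    n_ctx=8192,        # context window; trade off against memory
    verbose=False,
)

out = llm.create_completion(
    "How many r's are in 'strawberry'?",  # in practice, wrap this in the model's chat template
    max_tokens=2048,
    temperature=0.6,
)
print(out["choices"][0]["text"])          # includes the <think> block, then the answer
```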
Performance:

A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming rig.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these big models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama

70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
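For completeness, here is a minimal sketch of querying such an Ollama-served model over its local HTTP API; the `deepseek-r1:70b` tag and the prompt are my assumptions, not the author's exact setup, and the model must have been pulled beforehand.

```python
import requests

# Assumes `ollama pull deepseek-r1:70b` has completed and the Ollama server
# is running on its default port.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:70b",
        "prompt": "How many r's are in 'strawberry'?",
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])  # the <think> block followed by the final answer
```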
Resources

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
The Illustrated DeepSeek-R1 - by Jay Alammar.
Explainer: What's R1 & Everything Else? - Tim Kellogg.
DeepSeek R1 Explained to your grandma - YouTube
DeepSeek

- Try R1 at chat.deepseek.com.
GitHub - deepseek-ai/DeepSeek-R1.
deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and use a fill-in-the-blank task to enhance code generation and infilling.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.