SCNet (超算互联网) LLM Fine-Tuning with FSDP + LoRA: A Multi-GPU Distributed Fine-Tuning Example

张开发
2026/5/22 22:18:17 · 15-minute read
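Note: this article extends "SCNet 超算互联网 LLM Fine-Tuning LoRa 实例" and focuses on a worked example of multi-GPU distributed training. For the detailed SCNet registration and usage steps, see the original article:
https://blog.csdn.net/YucongCai/article/details/159696147?spm=1001.2014.3001.5502

The difference is that, to demonstrate multi-GPU training, step 7 now uses two 4090 accelerator cards:

7 (changed). Click "Notebook" - "创建Notebook" (Create Notebook), select "013组" (group 013), 华东一区【昆山】 (East China Zone 1, Kunshan), and set "加速卡数量" (number of accelerators) to 2. Use 4090 accelerator cards with 24 GB of memory each, so that beginners do not need extra debugging.

In addition, because the 7B model is larger, its generated text has more degrees of freedom (DoF), so the prompt needs a targeted change to make training more stable. Specifically, the instruction in step 19 of the earlier article becomes:

    instruction = "Analyze the sentiment of the following tweet. Output exactly one word, with no punctuation or extra text:"

Readers who have read the earlier article, or who already have experience, can go straight to steps 11, 12, 13, 14, 15, 16 and 17 of this article, and then to steps 26, 27, 28, 29 and 30 for the quick-development code. In short: after configuring two 4090 cards on SCNet, switch to the Aliyun mirror, install the libraries, place "twitter-airline-sentimentSentiment_Analysis.csv" in the same folder as the ".ipynb" file, and download the LLM deepseek-ai/deepseek-llm-7b-base into the local folder /root/private_data/DeepSeek7B.

----------------------------------------------------------------

PyTorch offers two main multi-GPU training techniques: DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel). DDP keeps a full copy of the model weights on every GPU; each GPU processes a different part of the data, which speeds up training. FSDP splits (shards) the model weights into groups and places each group on a different GPU, so that no single card has to hold the whole model. The multi-GPU fine-tuning in this article uses FSDP + LoRA.

The rest of the article shows how to use an SCNet node to fine-tune the model deepseek-ai/deepseek-llm-7b-base across multiple GPUs. The goal is to introduce SCNet and cloud deployment, and to provide the kind of quick-development code needed by first- and second-tier companies and small and medium-sized enterprises. The code is shared for public benefit; it was generated with help from DeepSeek and has been tested on a real run.

1. Register at SCNet: https://www.scnet.cn/ui/mall/
2. Click the red "控制台" (Console) button at the top right.
3. Click "服务导航" (Service Navigation) - "人工智能" (Artificial Intelligence, blue button). This opens the AI Notebook page: https://www.scnet.cn/ui/console/index.html#/notebook
4. Click "费用" (Billing) - "总览" (Overview) at the top right.
5. Click "充值" (Top up) - "支付宝" (Alipay). Choose the amount according to the services you need; the example below costs less than 30 yuan (about 20 yuan in my test).
6. After topping up, return to the AI Notebook page: https://www.scnet.cn/ui/console/index.html#/notebook
7. Click "Notebook" - "创建Notebook" (Create Notebook), select "013组" (group 013), 华东一区【昆山】 (East China Zone 1, Kunshan), and set "加速卡数量" (number of accelerators) to 2. This uses two 4090 accelerator cards with 24 GB of memory each. This article uses the bfloat16 format, which needs the memory of both cards.
8.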
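Click "开发镜像" (Development image) - "基础镜像" (Base image), then choose framework "PyTorch", framework version "2.6.0", Python version "py3.12-ubuntu22.04" and CUDA/DTK version "cuda12.4". Click the red "创建" (Create) button at the bottom right. The instance is created automatically and the page switches back to the Notebook list, where you can work in Jupyter Notebook directly or log in from VS Code through Remote SSH.
9. Click "快捷工具" (Quick tools) - "JupyterLab" to load the interface. The system normally allows the network connection automatically; if it does not, contact customer support.
10. Click "root" - "笔记本" (Notebook) - "Python3". It is also worth clicking "+" at the top middle of the page to add a new tab and open "其他" (Other) - "终端" (Terminal). Note that the default account inside the container is root.
11. In the terminal, enter the following (the code can be copied and pasted directly):

    pip install --upgrade pip

If that fails, switch to the Aliyun mirror:

    pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
    pip install --upgrade pip

12. Install the required libraries:

    pip install transformers accelerate peft bitsandbytes datasets trl scikit-learn pandas

13. Download the crowdflower/twitter-airline-sentiment dataset for this example (a free account is required):
https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
Click "Download" - "Download dataset as zip". The .csv file is about 3 MB and looks roughly like this:

    tweet_id   | sentiment | author     | content
    1956967341 | empty     | xoshayzers | tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part [
    1956967666 | sadness   | wannamama  | Layin n bed with a headache ughhhh...waitin on your call...

This is a very well-known data file; you can replace it with any dataset of your choice.
14. Unzip the .zip file and drag "twitter-airline-sentimentSentiment_Analysis.csv" into the left panel of Jupyter, into the same folder as the .ipynb file (/root/).
15.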
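Download the LLM deepseek-ai/deepseek-llm-7b-base to the local folder /root/private_data/DeepSeek7B. To speed up the download, switch the Hugging Face endpoint to the mirror "https://hf-mirror.com":

    import os
    # Must be set before transformers is imported
    os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    # Define local save directory for the 7B model
    local_model_dir = "/root/private_data/DeepSeek7B"

    # ✅ Correct model name (7 billion parameters)
    model_name = "deepseek-ai/deepseek-llm-7b-base"
    print(f"Loading model: {model_name} (Official size: 7B parameters)")

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,  # bfloat16 is safe and memory efficient
        device_map="auto",           # automatic device placement
    )

    # Save locally
    tokenizer.save_pretrained(local_model_dir)
    model.save_pretrained(local_model_dir)
    print(f"Model saved to {local_model_dir}")

With the mirror enabled, this takes roughly 5 to 10 minutes.

16. For reference, here is a note from the official SCNet documentation:

    # https://www.scnet.cn/help/docs/mainsite/ai/notebook/function-introduction/
    # I. Saving the environment at shutdown / saving an image
    # You can save the development environment at shutdown, or use the "保存镜像" (Save image)
    # function to back it up. This keeps the machine's environment and configuration consistent
    # for restarting, for setting up a team environment, or for reproducing the environment on
    # another platform. An image can be saved whether the container instance is running or stopped.
    # Note: for the image to run correctly, a single image layer must not exceed 15 GiB.
    # The system checks the image size; if it exceeds the limit, you need to move files
    # from the container environment into file storage manually.
    # The following command quickly locates large files (including folders) in the current environment:
    # cd /
    # find . -path ./proc -prune -o \
    #   -path /root/private_data/* -prune -o \   ## exclude personal files
    #   -path /root/public_data/* -prune -o \    ## exclude platform-shared files
    #   -path /root/group_data/* -prune -o \     ## exclude team-shared files
    #   -path /public/* -prune -o \              ## exclude shared-storage files
    #   -path /work/* -prune -o \                ## exclude shared-storage files
    #   -type f -exec du -h {} + | sort -hr | head -n 20   ## show the 20 largest files
    # Once a large file is identified, move it to file storage for permanent keeping:
    # mv /root/model_file /root/private_data/model_file
    # After the move the file may be unusable from file storage because it is owned by root.
    # Fix the ownership from the current environment with:
    # # replace user_name with your own compute user name
    # chown user_name:user_name /root/private_data/model_file

17.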
Test the downloaded LLM:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    local_model_dir = "/root/private_data/DeepSeek7B"
    tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        local_model_dir,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Example prompt
    input_text = "Explain quantum computing in simple terms"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note: the code in steps 18.b to 25.b below is for demonstration and explanation only; it is not the code that is actually run for training. Experienced researchers can skip it and go straight to step 26 for the working example.

18.b Load the first 5000 rows from the .csv file:

    import pandas as pd

    df = pd.read_csv("twitter-airline-sentimentSentiment_Analysis.csv")
    # The variable keeps the name first_50 even though it holds 5000 rows
    first_50 = df.head(5000)
    print(f"Loaded {len(first_50)} rows")
    print(first_50[["tweet_id", "sentiment", "author", "content"]].head())

For quick trial and error, reduce this to the first 50 rows.

19.b Prepare the data for the trainer:

    # Define instruction
    instruction = "Analyze the sentiment of the following tweet. Output exactly one word, with no punctuation or extra text:"

    # Create a list of formatted texts
    formatted_texts = []
    for idx, row in first_50.iterrows():
        text = f"Instruction: {instruction}\nInput: {row['content']}\nOutput: {row['sentiment']}"
        formatted_texts.append(text)

    # Convert to a Hugging Face Dataset
    from datasets import Dataset
    dataset = Dataset.from_dict({"text": formatted_texts})
    print(dataset[0]["text"])

Note that the instruction in this example differs from the one in the earlier article. Because the 7B model has more degrees of freedom (DoF) in what it generates, the prompt has been tuned specifically for it.
https://blog.csdn.net/YucongCai/article/details/159696147?spm=1001.2014.3001.5502

20.b Get the name of the transformer layer class. In this demonstration it is <class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    local_model_dir = "/root/private_data/DeepSeek7B"
    model = AutoModelForCausalLM.from_pretrained(local_model_dir, torch_dtype=torch.bfloat16)
    print(type(model.model.layers[0]))
    # should output something like <class 'transformers.models.llama.modeling_llama.LlamaDecoderLayer'>
    # Replace the class name in the code below accordingly.

CUDA does not work in child processes created with fork, which is the default multiprocessing start method on Linux, so the start method should be spawn. If needed, it can be set explicitly with the code below. The start method must be set before the other libraries are imported.

    import multiprocessing
    multiprocessing.set_start_method("spawn", force=True)
    import torch.multiprocessing as tmp
    tmp.set_start_method("spawn", force=True)

    # Now import the remaining libraries (after the start method is set)
    import torch
    import functools
    import pandas as pd
    from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
    from peft import LoraConfig, get_peft_model
    from datasets import Dataset
    from accelerate import Accelerator, notebook_launcher
    from accelerate.utils import FullyShardedDataParallelPlugin
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

    # Verify start method (optional)
    print(f"Start method: {multiprocessing.get_start_method()}")

21.b
Load the base model and set the LoRA and FSDP parameters. Note that LoRA must be applied before the model is wrapped with FSDP. For this demonstration the article uses the bfloat16 format. Take care with how the Accelerator is initialized.

    # Configuration
    local_model_dir = "/root/private_data/DeepSeek7B"
    output_dir = "/root/private_data/DeepSeek7B_finetuned"
    lora_r = 8
    lora_alpha = 32
    lora_dropout = 0.1
    target_modules = ["q_proj", "v_proj"]
    batch_size = 2
    grad_accum = 4
    learning_rate = 2e-4
    num_epochs = 3
    max_length = 512

    # ------------------------------
    # Load tokenizer
    # ------------------------------
    tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    # ------------------------------
    # Load base model (bfloat16)
    # ------------------------------
    model = AutoModelForCausalLM.from_pretrained(
        local_model_dir,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )

    # --- Apply LoRA ---
    lora_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=target_modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    # --- Cast everything to bfloat16 ---
    model = model.to(torch.bfloat16)

    # --- Detect transformer layer class for FSDP auto-wrap ---
    try:
        base_model = model.base_model.model
        layer_class = type(base_model.model.layers[0])
        print(f"Detected layer class: {layer_class}")
    except Exception:
        # Fall back to the class found in step 20.b
        from transformers.models.llama.modeling_llama import LlamaDecoderLayer
        layer_class = LlamaDecoderLayer

    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={layer_class},
    )
    fsdp_plugin = FullyShardedDataParallelPlugin(
        auto_wrap_policy=auto_wrap_policy,
        use_orig_params=True,
    )

    # --- Accelerator with FSDP ---
    accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
    model = accelerator.prepare(model)

22.b The following code checks whether FSDP distributed training has really been enabled:

    # After accelerator.prepare(model)
    print("=== FSDP Status ===")
    print(f"Number of processes: {accelerator.num_processes}")
    print(f"Process index: {accelerator.process_index}")
    print(f"Device: {accelerator.device}")

    # Check if the model is wrapped with FSDP (it should be)
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    if isinstance(model, FSDP):
        print("✅ Model is wrapped with FSDP.")
        # Print some info about the FSDP wrapped modules
        for name, module in model.named_modules():
            if isinstance(module, FSDP):
                print(f"FSDP module: {name} with params: {sum(p.numel() for p in module.parameters())}")
    else:
        print("⚠️ Model is not an FSDP instance - maybe it's wrapped inside a PEFT model?")
        # PeftModel might wrap the base model, so we need to check base_model
        if hasattr(model, "base_model") and isinstance(model.base_model, FSDP):
            print("✅ Base model (inside PeftModel) is wrapped with FSDP.")
        else:
            print("❌ Model not wrapped with FSDP. Check your FSDP plugin.")

    # Optional: print device of first few parameters to see sharding
    print("\n=== Parameter device assignment (first 5 named parameters) ===")
    for i, (name, param) in enumerate(model.named_parameters()):
        if param.device != accelerator.device:  # might be different for sharded params
            print(f"{name}: device={param.device}")
        if i >= 4:
            break

23.b Define the tokenizer function and prepare the training samples:

    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            padding="max_length",
            max_length=512,  # adjust as needed
            return_tensors=None,
        )

    tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
    tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1)  # optional split
    train_dataset = tokenized_dataset["train"]
    eval_dataset = tokenized_dataset["test"]

24.b Define the training process and start fine-tuning:

    from transformers import TrainingArguments, Trainer

    training_args = TrainingArguments(
        output_dir="./deepseek-lora-fsdp",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=3,
        bf16=True,
        logging_steps=10,
        save_steps=500,
        save_total_limit=2,
        remove_unused_columns=False,
        dataloader_num_workers=4,
        ddp_find_unused_parameters=False,
        eval_strategy="steps",  # <-- changed from evaluation_strategy
        eval_steps=500,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=None,
    )
    trainer.train()

    # Launch on 2 GPUs (change num_processes to 3 if you have three)
    # notebook_launcher(train_function, num_processes=2, args=())

25.b Save the trained files:

    model.save_pretrained("./deepseek-lora-fsdp-final")
    tokenizer.save_pretrained("./deepseek-lora-fsdp-final")

26. In practice, steps 18.b to 25.b have to be combined into a single .py file, "train_deepseek7B_finetuned.py", so that they can actually run. "train_deepseek7B_finetuned.py" should sit in the same folder as the .ipynb file. Its actual code is as follows:

    import torch
    import functools
    import pandas as pd
    import os
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model
    from datasets import Dataset
    from accelerate import Accelerator
    from accelerate.utils import FullyShardedDataParallelPlugin
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from torch.utils.data import DataLoader
    from tqdm import tqdm

    # Disable tokenizer parallelism
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    # Configuration
    local_model_dir = "/root/private_data/DeepSeek7B"
    output_dir = "/root/private_data/DeepSeek7B_finetuned"
    lora_r = 8
    lora_alpha = 32
    lora_dropout = 0.1
    target_modules = ["q_proj", "v_proj"]
    batch_size = 2
    grad_accum = 4
    learning_rate = 2e-4
    num_epochs = 3
    max_length = 512


    def main():
        # --- Load dataset ---
        df = pd.read_csv("twitter-airline-sentimentSentiment_Analysis.csv")
        first_50 = df.head(5000)
        # instruction = "Analyze the sentiment of the following tweet:"
        instruction = "Analyze the sentiment of the following tweet. Output exactly one word, with no punctuation or extra text:"
        formatted_texts = []
        for _, row in first_50.iterrows():
            text = f"Instruction: {instruction}\nInput: {row['content']}\nOutput: {row['sentiment']}"
            formatted_texts.append(text)
        dataset = Dataset.from_dict({"text": formatted_texts})

        # --- Tokenizer ---
        tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)
        tokenizer.pad_token = tokenizer.eos_token

        # --- Load base model (dtype bfloat16) ---
        model = AutoModelForCausalLM.from_pretrained(
            local_model_dir,
            dtype=torch.bfloat16,
            trust_remote_code=True,
        )

        # --- Apply LoRA ---
        lora_config = LoraConfig(
            r=lora_r,
            lora_alpha=lora_alpha,
            target_modules=target_modules,
            lora_dropout=lora_dropout,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)

        # --- Cast everything to bfloat16 ---
        model = model.to(torch.bfloat16)

        # --- Detect transformer layer class for FSDP auto-wrap ---
        try:
            base_model = model.base_model.model
            layer_class = type(base_model.model.layers[0])
            print(f"Detected layer class: {layer_class}")
        except Exception:
            from transformers.models.llama.modeling_llama import LlamaDecoderLayer
            layer_class = LlamaDecoderLayer

        auto_wrap_policy = functools.partial(
            transformer_auto_wrap_policy,
            transformer_layer_cls={layer_class},
        )
        fsdp_plugin = FullyShardedDataParallelPlugin(
            auto_wrap_policy=auto_wrap_policy,
            use_orig_params=True,
        )

        # --- Accelerator with FSDP ---
        accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
        model = accelerator.prepare(model)

        # --- Print trainable parameters (only on main process) ---
        if accelerator.is_main_process:
            model.print_trainable_parameters()
            print("=== FSDP Status ===")
            print(f"Number of processes: {accelerator.num_processes}")
            # Check embedding weight shape before training.
            # We need the original embedding layer (the base model's embed_tokens).
            # Because the model is wrapped, we have to dig into the FSDP-wrapped structure.
            try:
                embed_layer = model.base_model.model.model.embed_tokens
                print(f"Embedding weight shape before training: {embed_layer.weight.shape}")
            except Exception as e:
                print(f"Could not inspect embedding weight: {e}")

        # --- Tokenize dataset ---
        def tokenize_function(examples):
            return tokenizer(
                examples["text"],
                truncation=True,
                padding="max_length",
                max_length=max_length,
            )

        tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
        tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1)
        train_dataset = tokenized_dataset["train"]
        eval_dataset = tokenized_dataset["test"]

        # --- Create data loaders (num_workers=0 to avoid forking) ---
        def collate_fn(batch):
            # batch is a list of dicts with input_ids, attention_mask
            input_ids = torch.stack([torch.tensor(item["input_ids"]) for item in batch])
            attention_mask = torch.stack([torch.tensor(item["attention_mask"]) for item in batch])
            labels = input_ids.clone()
            # Set padding tokens to -100 so they are ignored in the loss
            labels[labels == tokenizer.pad_token_id] = -100
            return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

        train_loader = DataLoader(
            train_dataset,
            batch_size=batch_size,
            shuffle=True,
            collate_fn=collate_fn,
            num_workers=0,  # crucial: no extra processes
        )
        eval_loader = DataLoader(
            eval_dataset,
            batch_size=batch_size,
            shuffle=False,
            collate_fn=collate_fn,
            num_workers=0,
        )

        # --- Prepare optimizer ---
        optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

        # Use accelerator to prepare data loaders and optimizer (model already prepared)
        train_loader, eval_loader, optimizer = accelerator.prepare(train_loader, eval_loader, optimizer)

        # --- Training loop ---
        model.train()
        for epoch in range(num_epochs):
            total_loss = 0
            progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}", disable=not accelerator.is_main_process)
            for step, batch in enumerate(progress_bar):
                # Forward pass
                outputs = model(**batch)
                loss = outputs.loss
                loss = loss / grad_accum
                accelerator.backward(loss)
                # Step the optimizer every grad_accum micro-batches and on the last batch
                if (step + 1) % grad_accum == 0 or step == len(train_loader) - 1:
                    optimizer.step()
                    optimizer.zero_grad()
                total_loss += loss.item() * grad_accum
                if accelerator.is_main_process:
                    progress_bar.set_postfix({"loss": total_loss / (step + 1)})

            # Evaluation (optional)
            model.eval()
            eval_loss = 0
            with torch.no_grad():
                for batch in eval_loader:
                    outputs = model(**batch)
                    eval_loss += outputs.loss.item()
            eval_loss /= len(eval_loader)
            if accelerator.is_main_process:
                print(f"Epoch {epoch+1} - Train loss: {total_loss/len(train_loader):.4f}, Eval loss: {eval_loss:.4f}")
            model.train()

        # --- Save final model ---
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        unwrapped_model.save_pretrained("/root/private_data/DeepSeek7B_finetuned")
        tokenizer.save_pretrained("/root/private_data/DeepSeek7B_finetuned")
        print("Training completed and model saved.")


    if __name__ == "__main__":
        main()

27. Restart the kernel of the ".ipynb" file and run the following in a new cell:

    !pkill -f torchrun  # kills all torchrun processes

Once the GPU runtime has been cleared, run the following command in another new cell to start training:

    # Run the command to train the network
    !torchrun --nproc_per_node=2 --master_addr=127.0.0.1 --master_port=29501 train_deepseek7B_finetuned.py

Here --nproc_per_node=2 is the number of GPUs. With the default 5000 text samples and 3 epochs, training takes about 1.5 hours; for quick trial and error, reduce the sample count to speed things up. To monitor GPU usage, enter the following in the terminal:

    watch -n 1 nvidia-smi

This should show both GPUs in use, with memory usage of about 23 GB on one GPU and about 21 GB on the other. The fine-tuned model is saved in the folder /root/private_data/DeepSeek7B_finetuned.

28.
Run the fine-tuned model against the dataset used for fine-tuning as a demonstration:

    from peft import PeftModel
    import random
    import torch
    import pandas as pd
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Paths
    base_model_dir = "/root/private_data/DeepSeek7B"
    lora_adapter_dir = "/root/private_data/DeepSeek7B_finetuned"

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_dir, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    # Load base model (use same dtype as during training)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_dir,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # automatically distributes layers across available GPUs/CPU
    )

    # Load LoRA adapter
    model = PeftModel.from_pretrained(base_model, lora_adapter_dir)
    model.eval()

    # Load dataset
    df = pd.read_csv("twitter-airline-sentimentSentiment_Analysis.csv")  # adjust path if needed

    # Randomly select a few examples
    test_indices = random.sample(range(len(df)), 5)
    for idx in test_indices:
        tweet = df.iloc[idx]["content"]
        true_sentiment = df.iloc[idx]["sentiment"]
        prompt = f"Instruction: Analyze the sentiment of the following tweet. Output exactly one word, with no punctuation or extra text:\nInput: {tweet}\nOutput:"

        # Tokenize and move to the model's device
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # Generate with no gradients (faster, less memory)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=20,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Extract the generated sentiment (text after "Output:")
        predicted = generated.split("Output:")[-1].strip().split("\n")[0]
        print(f"Tweet: {tweet[:80]}...")
        print(f"True sentiment: {true_sentiment}")
        print(f"Predicted: {predicted}")
        print("-" * 50)

29. Return to the console, go to "Notebook" - "操作" (Actions), and shut down the container to avoid further charges.
30.
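The fine-tuned model in /root/private_data/DeepSeek7B_finetuned can be downloaded from "人工智能" (Artificial Intelligence) - "文件管理" (File management) in the left sidebar.

That completes a 7-billion-parameter fine-tuned model, trained across multiple accelerator cards with FSDP and LoRA. A model fine-tuned this way can be deployed locally and efficiently, which saves a great deal of the time cost (latency) and expense of API calls. Compared with models of around one billion parameters, models with billions to tens of billions of parameters are larger and have more degrees of freedom. They can serve as the model backbone for the various agents in Agentic AI or for specialized, targeted tools, and they also suit professional environments that need a locally deployed large model but have limited hardware.

The same method carries over directly to models of tens of billions of parameters that fit within 600 GB, using 8 A800 cards with 80 GB of memory each. With more cards and a 4-bit format it can cover models from tens of billions to hundreds of billions of parameters, which meets the needs of all first- and second-tier companies and small and medium-sized enterprises, apart from a few leading companies.

I am looking for a job. For HR or project collaboration, please contact yucongcai_business@outlook.com; for research-related matters, please contact yucongcai_research@outlook.com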
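As a rough check on the sizing claims in this article (bfloat16 weights needing two 24 GB cards in step 7, and the scaling to 8 cards with 80 GB each), the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not a measurement: the parameter count of 7e9 is nominal, and the hidden size of 4096 and 30 decoder layers are assumed values for illustration that should be checked against the model's config.json. The helper names are hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int, num_shards: int = 1) -> float:
    """Memory for the weights alone, in GB (1e9 bytes), per shard."""
    return num_params * bytes_per_param / num_shards / 1e9


def lora_param_count(r: int, d_in: int, d_out: int, num_modules: int) -> int:
    """Each adapted linear layer adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out) * num_modules


def effective_batch_size(per_device_batch: int, grad_accum: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return per_device_batch * grad_accum * num_gpus


# Assumed nominal sizes (not read from the checkpoint)
NUM_PARAMS = 7e9      # "7B" nominal parameter count
HIDDEN = 4096         # assumed hidden size
NUM_LAYERS = 30       # assumed number of decoder layers

full_copy = weight_memory_gb(NUM_PARAMS, bytes_per_param=2)               # bfloat16 = 2 bytes
per_gpu = weight_memory_gb(NUM_PARAMS, bytes_per_param=2, num_shards=2)   # sharded over 2 GPUs
# q_proj and v_proj in every layer -> 2 adapted modules per layer
trainable = lora_param_count(r=8, d_in=HIDDEN, d_out=HIDDEN, num_modules=2 * NUM_LAYERS)

print(f"bf16 weights, one full copy: {full_copy:.1f} GB")
print(f"bf16 weights per GPU when sharded over 2 GPUs: {per_gpu:.1f} GB")
print(f"LoRA trainable parameters (r=8, q_proj + v_proj): {trainable:,}")
print(f"Effective batch size: {effective_batch_size(2, 4, 2)}")
```

These figures cover the weights only. Activations for the padded 512-token batches, gradients and optimizer state for the LoRA parameters, and framework overhead come on top, so the usage reported by nvidia-smi will be higher than the per-GPU weight figure.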

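Step 28 prints a handful of predictions next to the true labels. To turn such a comparison into a single number, a small scoring helper can be used. The sketch below is an assumption about how one might score the output, not part of the tested script: it keeps only the first word of each prediction, strips punctuation and case, and reports the share of exact matches. The function names and the sample lists are hypothetical.

```python
import string


def normalize_label(text: str) -> str:
    """Keep the first whitespace-separated token, strip punctuation, lower-case it."""
    tokens = text.strip().split()
    if not tokens:
        return ""
    return tokens[0].strip(string.punctuation).lower()


def accuracy(predictions, references) -> float:
    """Share of predictions whose normalized label equals the normalized reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize_label(p) == normalize_label(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Made-up sample values, for illustration only
preds = ["sadness", "Happiness.", "neutral and more", ""]
refs = ["sadness", "happiness", "worry", "empty"]
print(f"Accuracy: {accuracy(preds, refs):.2f}")
```

In the loop from step 28, the `predicted` and `true_sentiment` values could be appended to two lists and passed to `accuracy` once the loop finishes. Since the examples there are drawn from the same file used for training, the result describes fit to the training data and is not a held-out score.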