返回分析流程中心

Pipeline Detail

BSA-seq基因组变异与定位

BSA-seq 突变位点定位分析

面向 Bulked Segregant Analysis 的变异检测和候选区间定位流程,覆盖数据下载、质控、比对、GATK SNP calling、过滤、SNP-index 与 ED5 下游分析。

创建时间
2026/6/3
分析难度
入门
推荐场景
遗传定位
预计耗时
0.5-1 天

Metadata

流程元数据

先看应用场景、输入输出和工具依赖,再进入正文命令细节。

Difficulty

入门

Scenario

遗传定位

Estimated Time

0.5-1 天

Inputs

FASTQBAMVCF

Workflow DAG

流程图

用步骤节点快速理解这个分析从原始数据到结果报告的流转关系。

STEP 1

测序数据下载

STEP 2

FastQC/Trimmomatic 质控

STEP 3

BWA-MEM 比对 TAIR10

STEP 4

GATK MarkDuplicates

STEP 5

HaplotypeCaller GVCF

STEP 6

CombineGVCFs/GenotypeGVCFs

STEP 7

SNP/INDEL 硬过滤

STEP 8

SNP-index/ED5 下游定位

Protocol

流程文档

正文保留 Markdown 排版、代码语言标识和表格样式,适合边学边复现。

BSA-seq 突变位点定位分析

适用场景

适用于 Bulked Segregant Analysis(BSA)或 QTL-seq 类型项目:将极端表型个体混池测序,与亲本或对照样本比较等位基因频率差异,从而定位候选突变区间。

本流程来自拟南芥 BSA 分析记录,参考基因组使用 TAIR10_chr_all.fas,核心样本包括 1-3727-3-127NNegtop1α-10。流程前半段使用 BWA/GATK 产出高质量 VCF,后半段在 R 中计算 SNP-index、ΔSNP-index 和 ED5,并绘制候选区间图。

样本与分组

SampleRoleNotes
1-37bulk / segregant用于 37 组合分析
27-3-1bulk / segregant用于 27 组合分析
27Nnormal/control27 组合对照
Negwild type/control37 组合对照
top1α-10parent/reference mutant亲本或参考突变材料

1. 下载原始数据

./lnd login -u X101SC250910114-Z01-J003 -p ggeah41n

./lnd cp -d   oss://CP2023071800097/H101SC250910114/RSPD00204/X101SC250910114-Z01/X101SC250910114-Z01-J003/01.RawData   ../../../01_raw_data/

2. FastQC 质控与 Trimmomatic 裁剪

module load FastQC/0.11.9

bsub -J fastqc1-37 -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc 1-37_*"
bsub -J fastqc27-3 -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc 27-3*"
bsub -J fastqc27N -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc 27N*"
bsub -J fastqcNeg -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc Neg*"
bsub -J fastqctop1 -n 1 -R "span[hosts=1]" -o %J.out -e %J.err -q normal "fastqc top1α*"
for sample in 1-37 27-3-1 27N Neg top1α-10
do
  trimmomatic.sh PE -threads 10 -phred33     01_raw_data/${sample}_1.fq.gz     01_raw_data/${sample}_2.fq.gz     02_clean_data/${sample}_R1_paired.fq.gz     02_clean_data/${sample}_R1_unpaired.fq.gz     02_clean_data/${sample}_R2_paired.fq.gz     02_clean_data/${sample}_R2_unpaired.fq.gz     HEADCROP:2
done

3. BWA-MEM 比对到 TAIR10

REF="./ref_genomic/TAIR10_chr_all.fas"

for sample in 1-37 27-3-1 27N Neg top1α-10
do
  bwa mem -M -t 8     -R "@RG	ID:${sample}	SM:${sample}"     ${REF}     02_clean_data/${sample}_R1_paired.fq.gz     02_clean_data/${sample}_R2_paired.fq.gz |     samtools sort -@ 8 -o 03_alignment/${sample}.sorted.bam -

  samtools flagstat 03_alignment/${sample}.sorted.bam > 03_alignment/${sample}_flagstat.log
  samtools index 03_alignment/${sample}.sorted.bam
done

4. GATK MarkDuplicates 去重

for sample in 1-37 27-3-1 27N Neg top1α-10
do
  gatk MarkDuplicates     -I 03_alignment/${sample}.sorted.bam     -M 03_alignment/${sample}.markup_metrics.txt     -O 03_alignment/${sample}.sorted.markup.bam

  samtools index 03_alignment/${sample}.sorted.markup.bam
done

5. HaplotypeCaller 生成 GVCF

REF="./ref_genomic/TAIR10_chr_all.fas"

for sample in 1-37 27-3-1 27N Neg top1α-10
do
  gatk --java-options "-Xmx10G" HaplotypeCaller     -R ${REF}     -I 03_alignment/${sample}.sorted.markup.bam     -O 04_snp_calling/${sample}.g.vcf.gz     -ERC GVCF     -stand-call-conf 10
done

6. 合并 GVCF 与联合分型

REF="./ref_genomic/TAIR10_chr_all.fas"

gatk CombineGVCFs   -R ${REF}   -V 04_snp_calling/1-37.g.vcf.gz   -V 04_snp_calling/Neg.g.vcf.gz   -V 04_snp_calling/top1α-10.g.vcf.gz   -O 04_snp_calling/BSA-37.gvcf.gz

gatk CombineGVCFs   -R ${REF}   -V 04_snp_calling/27-3-1.g.vcf.gz   -V 04_snp_calling/27N.g.vcf.gz   -V 04_snp_calling/top1α-10.g.vcf.gz   -O 04_snp_calling/BSA-27.gvcf.gz

gatk GenotypeGVCFs   -R ${REF}   -V 04_snp_calling/BSA-27.gvcf.gz   -O 05_snp/BSA_no_filter-27.HC.vcf

7. SNP/INDEL 硬过滤

gatk SelectVariants   -R ./ref_genomic/TAIR10_chr_all.fas   -V 05_snp/BSA_no_filter-27.HC.vcf   -select-type SNP   -O 05_snp/SNPs-27.vcf

gatk VariantFiltration   -V 05_snp/SNPs-27.vcf   --filter-expression "MQ < 30.0"   --filter-name "MQ_filter_SNP"   --filter-expression "QD < 2.0"   --filter-name "QD_filter_SNP"   -O 05_snp/SNPs_filter-27.vcf

grep -E '^#|PASS' 05_snp/SNPs_filter-27.vcf > 05_snp/BSA_SNPs_filterPASSED-27.vcf

INDEL 可继续使用 SelectVariants -select-type INDEL 后按 MQSORQDFS 进行硬过滤,并与 SNP 结果合并为最终 VCF。

8. R 下游:SNP-index 与 ED5 定位

library(QTLseqr)
library(vcfR)
library(ggplot2)
library(dplyr)

vcf <- read.vcfR("BSA_SNPs_filterPASSED-27.vcf")
chrom <- getCHROM(vcf)
pos <- getPOS(vcf)
ref <- getREF(vcf)
alt <- getALT(vcf)
ad <- extract.gt(vcf, "AD")
gt <- extract.gt(vcf, "GT")

ref_split <- masplit(ad, record = 1, sort = 0)
alt_split <- masplit(ad, record = 2, sort = 0)

df <- data.frame(
  CHROM = chrom,
  POS = pos,
  REF = ref,
  ALT = alt,
  AD_REF.1_37 = ref_split[, 1],
  AD_ALT.1_37 = alt_split[, 1],
  AD_REF.Neg = ref_split[, 2],
  AD_ALT.Neg = alt_split[, 2],
  AD_REF.top1 = ref_split[, 3],
  AD_ALT.top1 = alt_split[, 3],
  P_Top1 = gt[, "top1α-10"],
  son_1_37 = gt[, "1-37"],
  son_Neg = gt[, "Neg"]
)

df <- df %>%
  filter(
    nchar(as.character(REF)) == 1,
    nchar(as.character(ALT)) == 1,
    !is.na(P_Top1),
    (AD_REF.1_37 + AD_ALT.1_37) >= 10,
    (AD_REF.Neg + AD_ALT.Neg) >= 10,
    (AD_REF.top1 + AD_ALT.top1) >= 10
  )
df_result <- df %>%
  mutate(
    SNPindex.L = AD_ALT.1_37 / (AD_REF.1_37 + AD_ALT.1_37),
    SNPindex.S = AD_ALT.Neg / (AD_REF.Neg + AD_ALT.Neg),
    deltaSNP = SNPindex.L - SNPindex.S
  )

delta_extreme_points <- df_result %>%
  filter(deltaSNP > 0.5 | deltaSNP < -0.5)

ggplot(df_result, aes(x = POS / 1e6, y = deltaSNP)) +
  geom_point(color = "#000000", size = 0.5, alpha = 0.6) +
  geom_point(
    data = delta_extreme_points,
    aes(color = ifelse(deltaSNP > 0, "Positive", "Negative")),
    size = 1.2,
    alpha = 0.8
  ) +
  geom_hline(yintercept = c(-0.5, 0.5), linetype = "dashed") +
  facet_wrap(~ CHROM, scales = "free_x", nrow = 3) +
  labs(x = "Genomic Position (Mb)", y = "Delta SNP-index")
valid_rows <- complete.cases(df) &
  df$AD_REF.1_37 + df$AD_ALT.1_37 > 0 &
  df$AD_REF.Neg + df$AD_ALT.Neg > 0

df_filtered <- df[valid_rows, ]

ED_values <- apply(df_filtered, 1, function(row) {
  ref_mutant <- as.numeric(row["AD_REF.1_37"])
  alt_mutant <- as.numeric(row["AD_ALT.1_37"])
  ref_wild <- as.numeric(row["AD_REF.Neg"])
  alt_wild <- as.numeric(row["AD_ALT.Neg"])

  depth_mutant <- ref_mutant + alt_mutant
  depth_wild <- ref_wild + alt_wild

  freq_ref_mutant <- ref_mutant / depth_mutant
  freq_alt_mutant <- alt_mutant / depth_mutant
  freq_ref_wild <- ref_wild / depth_wild
  freq_alt_wild <- alt_wild / depth_wild

  freq_diff <- sqrt(
    (freq_ref_wild - freq_ref_mutant)^2 +
      (freq_alt_wild - freq_alt_mutant)^2
  )

  freq_diff^5
})

result_df <- data.frame(
  CHROM = df_filtered$CHROM,
  POS = df_filtered$POS,
  ED = ED_values
)

关键质控与解释

  • FASTQ 阶段关注碱基质量、接头残留和头部异常碱基,当前流程使用 HEADCROP:2
  • 比对阶段保留 read group,方便 GATK 正确识别样本。
  • GVCF 阶段采用每个样本单独 HaplotypeCaller,再按组合合并和联合分型。
  • SNP 过滤至少包含 MQ >= 30QD >= 2;INDEL 过滤可加入 SORFS
  • 下游定位同时看 Delta SNP-indexED5,比只看单一指标更稳。

主要输出

  • FastQC 质控报告
  • Trimmomatic clean reads
  • 排序和去重后的 BAM
  • 样本级 GVCF
  • 合并分型 VCF
  • 过滤后的 SNP/INDEL VCF
  • SNP-index / Delta SNP-index 图
  • ED5 全基因组定位图