Pipeline Detail

Long-read Transcriptomics: Long Reads, Translatomics, and Non-model Species

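Long-read Transcriptomics with Iso-Seq/Nanopore

A long-read transcriptomics pipeline for PacBio Iso-Seq and Oxford Nanopore cDNA/direct RNA data. It covers isoform discovery with FLAIR, TALON, and StringTie2, alternative promoter and polyA analysis, and expression quantification.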

Created

2026/6/3

Difficulty

Advanced

Recommended scenario

Long-read transcriptomics

Estimated time

3-5 days

Metadata

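Pipeline metadata

Review the use case, inputs and outputs, and tool dependencies before moving on to the command details in the main text.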

Tools

STAR, StringTie, Iso-Seq, Nanopore, FLAIR, TALON

Inputs

FASTQ, BAM, GTF, TPM

Outputs

isoform annotation, report

Workflow DAG

Flowchart

The step nodes give a quick view of how the analysis flows from raw data to the final report.

STEP 1: Set up the long-read project
STEP 2: Read QC
STEP 3: minimap2 alignment
STEP 4: Isoform collapse
STEP 5: FLAIR/TALON/StringTie2
STEP 6: Isoform quantification
STEP 7: Promoter/polyA analysis
STEP 8: Novel isoform annotation
STEP 9: Isoform report

Protocol

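Pipeline documentation

The main text keeps Markdown formatting, code language tags, and table styles, so it is easy to follow along and reproduce.

Long-read Transcriptomics with Iso-Seq/Nanopore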

1. Project directory

mkdir -p longread_tx_project/{00_metadata,01_reads,02_qc,03_alignment,04_isoforms,05_quant,06_annotation,07_polyA,report}

2. Example data

00_metadata/sample_info.csv:

sample_id,platform,condition,reads
Sample_1,PacBio_IsoSeq,Ctrl,01_reads/Sample_1.hifi.fastq.gz
Sample_2,ONT_cDNA,Treat,01_reads/Sample_2.fastq.gz

Example isoform table:

isoform_id,gene_id,category,length
PB.1.1,GENE1,known,2400
PB.1.2,GENE1,novel_exon_combination,3100
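
A sample sheet like the one above can be checked before any alignment starts. The following is a minimal sketch, assuming the column layout shown in `sample_info.csv`. The function name `validate_samples` and the set of accepted platform labels are illustrative assumptions, not part of the pipeline.

```python
import csv
import io

# Platform labels taken from the example sheet; extend as needed (assumption).
KNOWN_PLATFORMS = {"PacBio_IsoSeq", "ONT_cDNA"}
REQUIRED_COLUMNS = ["sample_id", "platform", "condition", "reads"]
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def validate_samples(handle):
    """Return (rows, problems) for a sample sheet opened as a text handle."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], [f"missing columns: {', '.join(missing)}"]

    rows, problems, seen = [], [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = row["sample_id"].strip()
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate sample_id {sample_id}")
        seen.add(sample_id)
        if row["platform"] not in KNOWN_PLATFORMS:
            problems.append(f"line {line_no}: unknown platform {row['platform']}")
        if not row["reads"].endswith(FASTQ_SUFFIXES):
            problems.append(f"line {line_no}: reads path is not FASTQ")
        rows.append(row)
    return rows, problems


# The two example rows from sample_info.csv.
example = io.StringIO(
    "sample_id,platform,condition,reads\n"
    "Sample_1,PacBio_IsoSeq,Ctrl,01_reads/Sample_1.hifi.fastq.gz\n"
    "Sample_2,ONT_cDNA,Treat,01_reads/Sample_2.fastq.gz\n"
)
rows, problems = validate_samples(example)
print(len(rows), "samples;", len(problems), "problems")
```

In practice the handle would come from `open("00_metadata/sample_info.csv", newline="")`, and a non-empty `problems` list would stop the run.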

3. Overall flowchart

flowchart TD
    A[PacBio HiFi / ONT reads] --> B[read QC]
    B --> C[minimap2 splice alignment]
    C --> D[FLAIR correct/collapse or TALON]
    D --> E[isoform annotation]
    D --> F[isoform quantification]
    E --> G[novel isoform discovery]
    E --> H[alternative promoter/polyA]
    F --> I[differential isoform usage]
    G --> J[isoform report]
    H --> J
    I --> J

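4. minimap2 alignment

PacBio Iso-Seq (HiFi):

# splice:hq is the minimap2 preset for high-quality PacBio cDNA reads.
# -uf assumes reads are in transcript orientation (e.g. FLNC reads); drop it otherwise.
minimap2 -ax splice:hq -uf \
  ref/genome.fa \
  01_reads/Sample_1.hifi.fastq.gz \
  | samtools sort -@ 8 -o 03_alignment/Sample_1.sorted.bam

samtools index 03_alignment/Sample_1.sorted.bam

ONT cDNA:

minimap2 -ax splice \
  ref/genome.fa \
  01_reads/Sample_2.fastq.gz \
  | samtools sort -@ 8 -o 03_alignment/Sample_2.sorted.bam

samtools index 03_alignment/Sample_2.sorted.bam

ONT direct RNA reads use the noisy-read settings instead: minimap2 -ax splice -uf -k14.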
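
Because the minimap2 preset depends on the sequencing platform, it can help to derive the command from the `platform` column of the sample sheet. The sketch below follows the usage examples in the minimap2 README. The function name `alignment_command` and the `ONT_dRNA` label are assumptions for illustration; only `PacBio_IsoSeq` and `ONT_cDNA` appear in the example data.

```python
import shlex

# Preset flags per platform, following the minimap2 README usage examples.
PRESETS = {
    "PacBio_IsoSeq": ["-ax", "splice:hq", "-uf"],
    "ONT_cDNA": ["-ax", "splice"],
    "ONT_dRNA": ["-ax", "splice", "-uf", "-k14"],  # hypothetical label for direct RNA
}


def alignment_command(sample_id, platform, reads, genome="ref/genome.fa", threads=8):
    """Build a 'minimap2 | samtools sort && samtools index' shell string."""
    if platform not in PRESETS:
        raise ValueError(f"no minimap2 preset configured for {platform!r}")
    bam = f"03_alignment/{sample_id}.sorted.bam"
    align = ["minimap2", *PRESETS[platform], "-t", str(threads), genome, reads]
    sort = ["samtools", "sort", "-@", str(threads), "-o", bam]
    index = ["samtools", "index", bam]
    # shlex.join quotes each argument so odd file names stay shell-safe.
    return f"{shlex.join(align)} | {shlex.join(sort)} && {shlex.join(index)}"


cmd = alignment_command("Sample_2", "ONT_cDNA", "01_reads/Sample_2.fastq.gz")
print(cmd)
```

The returned string is meant to be reviewed or written into a job script rather than executed blindly.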

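5. FLAIR workflow

# flair correct takes BED12 input, so convert the BAM first.
# bam2Bed12 ships with FLAIR; its exact name and location vary by version.
bam2Bed12 -i 03_alignment/Sample_1.sorted.bam > 04_isoforms/Sample_1.bed

# Correct splice junctions against the genome and the reference annotation.
flair correct \
  -q 04_isoforms/Sample_1.bed \
  -g ref/genome.fa \
  -f ref/genes.gtf \
  -o 04_isoforms/Sample_1

# Collapse corrected reads into isoforms.
flair collapse \
  -g ref/genome.fa \
  -r 01_reads/Sample_1.hifi.fastq.gz \
  -q 04_isoforms/Sample_1_all_corrected.bed \
  -f ref/genes.gtf \
  -o 04_isoforms/Sample_1_flair

# Quantify isoforms; reads_manifest.tsv is a tab-separated file listing
# sample ID, condition, batch, and reads path.
flair quantify \
  -r reads_manifest.tsv \
  -i 04_isoforms/Sample_1_flair.isoforms.fa \
  -o 05_quant/flair_quant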
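
The `reads_manifest.tsv` passed to `flair quantify` can be generated from the sample sheet. The sketch below assumes the four-column, tab-separated manifest layout described in the FLAIR documentation (sample ID, condition, batch, reads path) and its advice to avoid underscores in the first three fields. The function name `flair_manifest` and the single batch label `batch1` are assumptions for illustration.

```python
def flair_manifest(sample_rows, batch="batch1"):
    """Return manifest text: sample ID, condition, batch, reads path per line."""
    lines = []
    for row in sample_rows:
        # Underscores are stripped from the ID and condition fields (assumption
        # based on the FLAIR documentation); the reads path is left untouched.
        sample = row["sample_id"].replace("_", "")
        condition = row["condition"].replace("_", "")
        lines.append("\t".join([sample, condition, batch, row["reads"]]))
    return "\n".join(lines) + "\n"


# Rows mirror 00_metadata/sample_info.csv.
samples = [
    {"sample_id": "Sample_1", "condition": "Ctrl",
     "reads": "01_reads/Sample_1.hifi.fastq.gz"},
    {"sample_id": "Sample_2", "condition": "Treat",
     "reads": "01_reads/Sample_2.fastq.gz"},
]
manifest_text = flair_manifest(samples)
print(manifest_text, end="")
```

Writing `manifest_text` to `reads_manifest.tsv` gives one line per sample. Whether samples from different platforms should share one quantification run is a design decision this sketch does not address.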

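6. TALON approach

TALON annotates long-read transcripts and classifies them against a reference annotation.

# Build the TALON database from the reference GTF.
talon_initialize_database \
  --f ref/genes.gtf \
  --g hg38 \
  --a gencode \
  --o 06_annotation/talon

# --f takes a comma-separated config file, not an alignment file. Each row lists
# dataset name, sample description, platform, and path to the alignment file.
# talon_config.csv is a placeholder name. TALON needs the MD tag, so run
# minimap2 with --MD for alignments used here.
talon \
  --f 06_annotation/talon_config.csv \
  --db 06_annotation/talon.db \
  --build hg38 \
  --o 06_annotation/Sample_1_talon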
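
The `talon` command reads its inputs from a comma-separated config file passed via `--f`, with one row per dataset: dataset name, sample description, platform, and alignment file path. A minimal sketch of writing such a file from the sample sheet follows. The function name `talon_config`, the directory `06_annotation/labeled`, and the use of `condition` as the sample description are assumptions for illustration.

```python
import csv
import io


def talon_config(sample_rows, sam_dir="06_annotation/labeled"):
    """Return TALON config text: name, description, platform, SAM path per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in sample_rows:
        # Hypothetical location of the per-sample SAM file.
        sam_path = f"{sam_dir}/{row['sample_id']}.sam"
        writer.writerow([row["sample_id"], row["condition"], row["platform"], sam_path])
    return buffer.getvalue()


# Rows mirror 00_metadata/sample_info.csv.
talon_samples = [
    {"sample_id": "Sample_1", "platform": "PacBio_IsoSeq", "condition": "Ctrl"},
    {"sample_id": "Sample_2", "platform": "ONT_cDNA", "condition": "Treat"},
]
config_text = talon_config(talon_samples)
print(config_text, end="")
```

The config has no header row, and dataset names need to be unique within one TALON database.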

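7. StringTie2 long-read assembly

# -L enables long-read mode; -G supplies the reference annotation as a guide.
stringtie \
  -L \
  -G ref/genes.gtf \
  -o 04_isoforms/Sample_1_stringtie2.gtf \
  03_alignment/Sample_1.sorted.bam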
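
A quick sanity check on an assembled GTF is to count transcripts per gene. The sketch below parses standard GTF lines with the standard library only. The function name `transcripts_per_gene` and the three example lines are made up for illustration and are not real StringTie output.

```python
import re
from collections import Counter

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcripts_per_gene(gtf_lines):
    """Count 'transcript' features per gene_id in an iterable of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # exon rows and malformed rows are ignored
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        counts[attributes.get("gene_id", "NA")] += 1
    return counts


# Toy GTF lines (illustrative values only).
toy_gtf = [
    "# toy example\n",
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'chr1\tStringTie\texon\t100\t300\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'chr1\tStringTie\ttranscript\t100\t1200\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.2";\n',
    'chr2\tStringTie\ttranscript\t50\t700\t1000\t-\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";\n',
]
gene_counts = transcripts_per_gene(toy_gtf)
print(dict(gene_counts))
```

For a real file, pass an open handle such as `open("04_isoforms/Sample_1_stringtie2.gtf")` instead of the toy list.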

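8. Alternative promoter/polyA

Because a long read can cover a transcript from end to end, transcription start sites (TSS) and polyA sites can be observed directly.

import pandas as pd

# Isoform annotation table with one row per isoform.
isoforms = pd.read_csv("06_annotation/isoform_annotation.tsv", sep="\t")

# Count distinct polyA sites per gene.
apa = isoforms.groupby("gene_id")["polyA_site"].nunique().reset_index()
apa = apa.rename(columns={"polyA_site": "polyA_site_count"})

# Keep genes that use at least two polyA sites.
apa = apa[apa["polyA_site_count"] >= 2]

apa.to_csv("07_polyA/genes_with_multiple_polyA.tsv", sep="\t", index=False)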

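9. Category legend

Category	Meaning
full-splice match	Splice structure matches a known transcript exactly
novel in catalog	New combination of known splice junctions
novel not in catalog	Contains at least one novel splice junction
alternative promoter	The same gene uses different TSSs
alternative polyA	The same gene uses different polyA sites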
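
The first three categories follow from comparing an isoform's splice junctions with the reference annotation. The sketch below is a simplified illustration of that logic only: dedicated classifiers distinguish more categories, and the function name `classify_isoform` and the toy coordinates are assumptions.

```python
def classify_isoform(junctions, known_chains, known_junctions):
    """Classify an isoform from its ordered (donor, acceptor) junctions."""
    chain = tuple(junctions)
    if not chain:
        return "mono-exonic"  # no junctions to compare; outside the three categories
    if chain in known_chains:
        return "full-splice match"
    if all(junction in known_junctions for junction in chain):
        return "novel in catalog"
    return "novel not in catalog"


# Toy annotation: junction chains of two known transcripts of one gene.
known_chains = {
    ((100, 200), (300, 400), (500, 600)),
    ((100, 400), (500, 800)),
}
known_junctions = {junction for chain in known_chains for junction in chain}

labels = [
    classify_isoform([(100, 200), (300, 400), (500, 600)], known_chains, known_junctions),
    classify_isoform([(100, 200), (300, 400), (500, 800)], known_chains, known_junctions),
    classify_isoform([(100, 200), (300, 450), (500, 600)], known_chains, known_junctions),
]
print(labels)
```

The second isoform reuses only annotated junctions in an unannotated order, while the third introduces the junction (300, 450), which is absent from the annotation.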

10. Deliverables

Long-read QC report
Sorted BAM files and indexes
Isoform GTF/FASTA
Isoform annotation table
Novel isoform list
Isoform count/TPM matrix
Alternative promoter/polyA table
Differential isoform usage results
Isoform browser plots for key genes
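
Differential isoform usage starts from per-gene isoform fractions computed from the count matrix. The sketch below shows only that descriptive step, with made-up counts for the two isoforms in the example table. The function name `isoform_fractions` is an assumption, and a real analysis needs replicates and a statistical test.

```python
from collections import defaultdict


def isoform_fractions(counts, isoform_to_gene):
    """Return each isoform's share of its gene's total count."""
    gene_totals = defaultdict(float)
    for isoform, count in counts.items():
        gene_totals[isoform_to_gene[isoform]] += count
    fractions = {}
    for isoform, count in counts.items():
        total = gene_totals[isoform_to_gene[isoform]]
        # Genes with zero reads get a fraction of 0.0 instead of a division error.
        fractions[isoform] = count / total if total > 0 else 0.0
    return fractions


isoform_to_gene = {"PB.1.1": "GENE1", "PB.1.2": "GENE1"}

# Toy counts for one control and one treated sample (illustrative only).
ctrl = isoform_fractions({"PB.1.1": 80, "PB.1.2": 20}, isoform_to_gene)
treat = isoform_fractions({"PB.1.1": 30, "PB.1.2": 70}, isoform_to_gene)

# Change in usage between conditions for each isoform.
delta_usage = {isoform: treat[isoform] - ctrl[isoform] for isoform in ctrl}
for isoform, delta in delta_usage.items():
    print(f"{isoform}\t{delta:+.2f}")
```

Fractions within a gene sum to one, so a gain in one isoform's usage is always offset by losses in the others.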