TCGA更新了？最新数据处理脚本，轻松拿捏住！

2022/11/7 15:29:48　阅读：2443　发布者：

TCGA数据库更新啦！跟着之前TCGA系列教程进行官方数据下载和合并实操时遇到问题和报错，反馈主要有两点：

1. 原本GDC-client下载的转录组数据文件格式后缀为.gz,更新后后缀为.tsv；

2. 除了格式不同，文件内存放的数据及组织形式也发生了改变。

确实，TCGA悄悄在四月初更新了，但全新的组织方式对于我们来说是更佳便捷的！

一、以TCGA-KIRC为例，进行新版TCGA-KIRC数据下载与整理

第一步，下载临床和表达矩阵

1.下载gdc-client软件。进入https://gdc.cancer.gov/access-data/gdc-data-transfer-tool，选择自己电脑匹配的软件，下载到工作目录解压。

2. 下载表达数据。进入https://portal.gdc.cancer.gov/repository

然后进入Cart

点击Dowload下拉菜单中的Manifest，保存为gdc_manifest_expdata.txt文件到工作目录。这个文件用于下载每个样本表达矩阵。

3. 下载表达数据的Metadata文件，保存为metadata.cart.json文件到工作目录。这个文件用于TCGA_ID和文件名ID转化。

4. 下载临床数据。进入https://portal.gdc.cancer.gov/repository

点击Manifest，保存为gdc_manifest_clinical.txt文件到工作目录。这个文件用于下载临床信息。

接下来就是在Rstudio中下载文件了。

options(stringsAsFactors = F)

library(stringr)

project="TCGA-KIRC"

if(!dir.exists("clinical"))dir.create("clinical")

if(!dir.exists("expdata"))dir.create("expdata")

dir()

# [1] "clinical" "expdata" "gdc-client.exe" "gdc_manifest_clinical.txt"

# [5] "gdc_manifest_expdata.txt" "metadata.cart.json" "step01_prepare_data.Rmd" "TCGA_KIRC.Rproj"

command1 <- "./gdc-client download -m gdc_manifest_clinical.txt -d clinical"

command2 <- "./gdc-client download -m gdc_manifest_expdata.txt -d expdata"

system(command = command1) # download clinical data

system(command = command2) # download expression data

length(dir("./clinical/"))

#[1] 537

length(dir("./expdata/"))

#[1] 613

至此，537个临床文件，613个测序文件都下载完了。

第二步，准备临床信息文件

在这一步中，将会合并所用样本的临床信息。

library(XML)

xmls = dir("clinical/",pattern = "*.xml$",recursive = T)

cl = list()

for(i in 1:length(xmls)){

result = xmlParse(paste0("clinical/",xmls[[i]]))

rootnode = xmlRoot(result)

cl[[i]] = xmlToDataFrame(rootnode[2])

}

clinical = do.call(rbind,cl)

clinical[1:3,1:3]

# additional_studies tumor_tissue_site histological_type

# 1 Kidney Kidney Clear Cell Renal Carcinoma

# 2 Kidney Kidney Clear Cell Renal Carcinoma

# 3 Kidney Kidney Clear Cell Renal Carcinoma

第三步，准备表达矩阵

在这一步中，将会把所用样本的表达矩阵汇总，随机去除重复基因名称，并以gene_name作为合并后表达矩阵的行名。

count_files = dir("expdata/",pattern = "*.tsv$",recursive = T)

exp = list()

for(i in 1:length(count_files)){

exp[[i]] = read.table(paste0("expdata/",count_files[[i]]),header=T,sep="\t")

exp[[i]] = exp[[i]][-(1:4),] # The first 4 rows contain unneeded information

exp[[i]] = exp[[i]]$unstranded # the fourth column (unstranded) was the count value

}

exp = as.data.frame(do.call(cbind,exp))

dim(exp)

#[1] 60660 613

exp[1:4,1:4]

# V1 V2 V3 V4

# 1 2157 4455 10949 2375

# 2 26 13 33 13

# 3 978 1610 1621 1264

# 4 604 831 462 764

#TCGA ID and file name match

meta = jsonlite::fromJSON("metadata.cart.json")

ID = sapply(meta$associated_entities,

function(x){x$entity_submitter_id})

file2id = data.frame(file_name = meta$file_name,

ID = ID)

count_files2 = stringr::str_split(count_files,"/",simplify = T)[,2]

table(count_files2 %in% file2id$file_name)

# TRUE

# 613

file2id = file2id[match(count_files2,file2id$file_name),]

identical(file2id$file_name,count_files2)

#[1] TRUE

colnames(exp) = file2id$ID

gene_name = data.table::fread(paste0("expdata/",count_files[1]))$gene_name

gene_name = gene_name[-seq(1,4)] # The first 4 rows contain unneeded information

exp = cbind(gene_name=gene_name,exp)

dim(exp)

#[1] 60660 614

exp = exp[!duplicated(exp$gene_name),]

rownames(exp) = exp$gene_name

exp = exp[,-1]

dim(exp)

#[1] 59427 613

第四步，过滤基因

这里只保留在50%样本中都表达的基因。

#gene filter

exp = exp[apply(exp, 1, function(x) sum(x > 0) > 0.5*ncol(exp)), ] # only keep genes express in half of samples

dim(exp)

#[1] 32809 613

第五步，确定分组信息

TCGA-KIRC数据中包含了541个Tumor和72个Normal样本。

#group information

library(stringr)

table(str_sub(colnames(exp),14,15))

Group = ifelse(as.numeric(str_sub(colnames(exp),14,15)) < 10,'tumor','normal')

Group = factor(Group,levels = c("normal","tumor"))

table(Group)

# normal tumor

# 72 541

第六步，保存数据。

if(!dir.exists("data"))dir.create("data")

save(exp,clinical,Group,project,file = paste0("data/",project,"_gdc.Rdata")

二、如何使用TCGAbiolinks包下载最新的TCGA数据

第一步：重新安装R包

TCGAbiolinks及TCGAbiolinksGUI.data 包最近都有更新，但在Bioconductor上还没有更新，所以要从git.hub上安装。此外，一些依赖包也需要安装。安装前记得删除之前的TCGAbiolinks及TCGAbiolinksGUI.data 包。

if (!require("devtools"))

install.packages("devtools")

devtools::install_github("BioinformaticsFMRP/TCGAbiolinksGUI.data", host = "https://api.github.com")

BiocManager::install("AnnotationHub")

BiocManager::install("ExperimentHub")

BiocManager::install("SummarizedExperiment")

BiocManager::install("GenomeInfoDbData")

BiocManager::install("GenomicDataCommons")

devtools::install_github("BioinformaticsFMRP/TCGAbiolinks",

host = "https://api.github.com")

第二步：读取基因注释文件

Ginfo<-read.table("overlapTable27.txt", header=T,sep="\t",check.names=F,row.names = 1) #可点击下方下载

第三步：开始下载

此处以TCGA-PAAD为例。更新后的STAR-Counts数据，由于包含了不同类别的数据。所以特别大。

library(TCGAbiolinks)

projects <- c("TCGA-PAAD")#选择PAAD数据集

# RNA Transcriptome

query <- GDCquery(

project = projects,

data.category = "Transcriptome Profiling",

data.type = "Gene Expression Quantification",

workflow.type = "STAR - Counts",#选择STAR-Count数据格式

#barcode = ids

)

GDCdownload(query,

method = "api",

directory = "./data",

files.per.chunk = 10)

第四步：数据转换

rna <- GDCprepare(query, save = FALSE,

summarizedExperiment = T, directory = "./data",

remove.files.prepared = F)

expMatrix<-rna@assays@data@listData$tpm_unstrand

#上面选择了tpm_unstrand数据格式，可以更换为其他格式

rownames(expMatrix)<-rna@rowRanges$gene_id

colnames(expMatrix)<-rna@colData@listData$barcode

colnames(expMatrix) <- substr(colnames(expMatrix),

start = 1,stop = 15)

第五步：基因注释

rownames(Ginfo)<-Ginfo$gene_id

comgene <- intersect(rownames(expMatrix),rownames(Ginfo))

expr <- as.data.frame(expMatrix)[comgene,]

Ginfo <- Ginfo[comgene,]

expr$gene <- Ginfo$genename

expr <- expr[!duplicated(expr$gene),]

Ginfo <- Ginfo[rownames(expr),]

rownames(expr) <- expr$gene

expr <- expr[,-ncol(expr)]

第六步：获得数据的表达谱

第7步：获取临床数据

clinicaldata<-as.data.frame(rna@colData)

第8步：如何选择合适的数据类型来分析

在TCGA的数据中，每个基因的unstranded表达数值=stranded_first+stranded_second。以及官方转换好的FPKM/TMP数据均基于unstranded的数据，因此笔者认为，我们日常常规分析的时候，就仍然使用unstranded的数据就好，stranded的数据可能适用于更精细的研究。

转自：“科研Z库”微信公众号

如有侵权，请联系本站删除！

上一篇： 新冠导致欧美国家预期寿命70年来前所未有的下降
下一篇： 全球新冠疫情真的会在2023年结束吗?

投稿问答最小化 关闭

TCGA更新了？最新数据处理脚本，轻松拿捏住！

本文评论

暂无相应记录！

学界研圈热门文章

本站推荐

最近更新

投稿问答最小化 关闭

TCGA更新了？最新数据处理脚本，轻松拿捏住！

本文评论

暂无相应记录！

学界研圈热门文章

本站推荐

最近更新

投稿问答最小化关闭