从Scanpy的Anndata对象提取信息并转成Seurat对象（实用于空间组且涉及h5文

开发者 · 2024-9-15 03:51:26

关键字

Anndata对象转成Seurat对象
h5文件读写
空间组格式转换

已增补快速利用的函数整理版本，如果不想看细节可以直接看已整理好的版本。
实用配景

众所周知，单细胞数据分析有两大软件：基于R语言的Seurat和基于Python的Scanpy，在平常的分析中经常须要把Seurat对象转成Scanpy的Anndata对象，这已经有比力成熟的流程了。但是，如果反过来把Anndata对象转成Seurat对象，网上搜到的方案寥寥无几，而且在本人亲测之下均报错无法乐成实现。再加上我须要转的是空间组对象，结构比单细胞的更为复杂，只好自己想法从Anndata对象提取信息重新构建出一个Seurat对象了。
这个步调紧张分为2步：
步调一从Scanpy的Anndata对象中提取信息

1提取矩阵

import osimport sysimport scanpy as scimport anndata as adimport numpy as npimport pandas as pdimport h5pyob1=sc.read('tmp.h5ad')mat=pd.DataFrame(data=ob1.X.todense(),index=ob1.obs_names,columns=ob1.var_names)mat.to_hdf("mat.h5","mat")如果要在Python里重新读取h5格式的矩阵，可以运行下面代码：
mat=pd.read_hdf("mat.h5",key="mat")
上面的脚本是我测试过比力好的生存矩阵的方案，下面代码块则是最初的版本，但是在R内里读入之后会缺失行名和列名，所以还要额外生存行名与列名，之后加上，特殊贫困，所以接纳上面的代码块更为简便方便。
mat=ob1.X.todense().Twith h5py.File('mat.h5','w') as f1: f1.create_dataset("mat",data=mat)##在Python里读取h5格式矩阵的方法with h5py.File('mat.h5','r') as f1: mat=f1['mat'][:]为什么用h5生存矩阵？
理论上是可以转成数据框之后生存为csv或tsv文件，但如许生存很慢，读取数据也很慢，因此存为h5文件更加方便。但对于小文件，h5文件反而占的空间更大，但h5文件应该是压缩过的，就很希奇，不太懂其中原理，但如果是大文件则能有用举行压缩使得所占空间更小。
pd.DataFrame(data=ob1.X.todense().T, index=ob1.var_names,columns=ob1.obs_names).to_csv('raw_mat.csv',sep="\t",float_format='%.0f')

2 提取metadata信息

meta=pd.DataFrame(data=ob1.obs)meta.to_csv('metadata.tsv',sep="\t")

3 提取坐标信息（UMAP坐标或空间组坐标）

#生存UMAP坐标cord=pd.DataFrame(data=ob1.obsm['X_umap'],index=ob1.obs_names,columns=['x','y'])cord.to_csv('position_X_umap.tsv',sep="\t") #生存空间坐标cord_name='spatial'cord=pd.DataFrame(data=ob1.obsm[cord_name],index=ob1.obs_names,columns=['x','y'])cord.to_csv('position_'+cord_name+'.tsv',sep="\t") 提取完以上信息之后，就可以在R内里重修Seurat对象了。
步调二重修Seurat对象

1 读取上面提取的信息

library(rhdf5)library(Matrix)mydata <- h5read("mat.h5","mat")#如果在Python生存h5文件是用【with h5py.File】方式则直接运行上面一行代码即可得到矩阵的值，但之后要手动添加行名和列名，下面的代码不实用，但如果用的【to_hdf】则运行下面代码即可。#获取矩阵值mat <- mydata$block0_values#添加行名rownames(mat) <- mydata$axis0#添加列名colnames(mat) <- mydata$axis1#转成希奇矩阵mat <- Matrix(mat, sparse = TRUE)#读取metadata信息cord_name='X_umap'meta <- read.table('metadata.tsv',sep="\t",header=T,row.names=1)pos <- read.table(paste0('position_',cord_name,'.tsv'),sep="\t",header=T,row.names=1)

2 重修单细胞Seurat对象

obj <- CreateSeuratObject(mat,assay='Spatial',meta.data=meta)须要对seurat对象举行简单聚类后才能添加UMAP坐标信息，可略过直接举行通例分析
obj <- seob_cluster(obj)obj@reductions$umap@cell.embeddings[,1]<-pos[,1]obj@reductions$umap@cell.embeddings[,2]<-pos[,2]如果只是单细胞数据，以上步调已经充足了，但如果是重修一个Seurat空间组对象，实在也就是补充image的slice1槽，就须要运行以下脚本。

3 重修空间组Seurat对象

library(dplyr)library(data.table)library(Matrix)library(rjson)#以下脚本参考了部门华大生命科学研究院的余浩师兄和邹轩轩师姐写的脚本，在此表示感谢tissue_lowres_image <- matrix(1, max(pos$y), max(pos$x))tissue_positions_list <- data.frame(row.names = colnames(obj), tissue = 1, row = pos$y, col = pos$x, imagerow = pos$y, imagecol = pos$x)scalefactors_json <- toJSON(list(fiducial_diameter_fullres = 1, tissue_hires_scalef = 1, tissue_lowres_scalef = 1))mat <- obj@assays$Spatial@countsseurat_spatialObj <- CreateSeuratObject(mat, project = 'Spatial', assay = 'Spatial',min.cells=5, min.features=5)generate_spatialObj <- function(image, scale.factors, tissue.positions, filter.matrix = TRUE){ if (filter.matrix) { tissue.positions <- tissue.positions[which(tissue.positions$tissue == 1), , drop = FALSE] } unnormalized.radius <- scale.factors$fiducial_diameter_fullres * scale.factors$tissue_lowres_scalef spot.radius <- unnormalized.radius / max(dim(image)) return(new(Class = 'VisiumV1', image = image, scale.factors = scalefactors(spot = scale.factors$tissue_hires_scalef, fiducial = scale.factors$fiducial_diameter_fullres, hires = scale.factors$tissue_hires_scalef, lowres = scale.factors$tissue_lowres_scalef), coordinates = tissue.positions, spot.radius = spot.radius))}spatialObj <- generate_spatialObj(image = tissue_lowres_image, scale.factors = fromJSON(scalefactors_json), tissue.positions = tissue_positions_list)spatialObj <- spatialObj[Cells(seurat_spatialObj)]DefaultAssay(spatialObj) <- 'Spatial'seurat_spatialObj[['slice1']] <- spatialObj看一下新建的对象结构，分析已经是一个标准的Seurat空间组对象了：
> str(seurat_spatialObj@images$slice1)Formal class 'VisiumV1' [package "Seurat"] with 6 slots ..@ image : num [1:6, 1:9] 1 1 1 1 1 1 1 1 1 1 ... ..@ scale.factors

ist of 4  .. ..$ spot : num 1  .. ..$ fiducial: num 1  .. ..$ hires : num 1  .. ..$ lowres  : num 1  .. ..- attr(*, "class")= chr "scalefactors"  ..@ coordinates  :'data.frame':    71668 obs. of  5 variables:  .. ..$ tissue  : num [1:71668] 1 1 1 1 1 1 1 1 1 1 ...  .. ..$ row    : num [1:71668] -2.904 -2.379 -0.798 -2.162 -1.436 ...  .. ..$ col    : num [1:71668] 5.84 5.58 8.63 7.08 7.99 ...  .. ..$ imagerow: num [1:71668] -2.904 -2.379 -0.798 -2.162 -1.436 ...  .. ..$ imagecol: num [1:71668] 5.84 5.58 8.63 7.08 7.99 ...  ..@ spot.radius  : num 0.111  ..@ assay       : chr "Spatial"  ..@ key       : chr "slice1_"可以以背面举行分析了。
小结与增补

盼望Scanpy或Seurat官方能出一下相干函数吧，涉及到空间组数据时，通例的转换流程也容易报错，有很多bugs。
以下为艰苦的报错心路历程，可鉴戒可忽略……

error1
网上一查，根本都是以下教程，很简单嘛，然而……
> library(Seurat)library(SeuratDisk)Attaching SeuratObject> library(SeuratDisk)Registered S3 method overwritten by 'cli':  method    from  print.boxx spatstat.geomRegistered S3 method overwritten by 'SeuratDisk':  method          from  as.sparse.H5Group Seurat> Convert("cell_adata.h5ad",'h5SeuratWarning: Unknown file type: h5adError in H5File.open(filename, mode, file_create_pl, file_access_pl) :  HDF5-API Errors: error #000: H5F.c in H5Fopen(): line 793: unable to open file       class: HDF5       major: File accessibility       minor: Unable to open file error #001: H5VLcallback.c in H5VL_file_open(): line 3500: open failed       class: HDF5       major: Virtual Object Layer       minor: Can't open object error #002: H5VLcallback.c in H5VL__file_open(): line 3465: open failed       class: HDF5       major: Virtual Object Layer       minor: Can't open object error #003: H5VLnative_file.c in H5VL__native_file_open(): line 100: unable to open file       class: HDF5       major: File accessibility       minor: Unable to open file error #004: H5Fint.c in H5F_open(): line 1622: unable to lock the file       class: HDF5       major: File accessibility       minor: Unable to open file error #005: H5FD.c in H5FD_lock(): line 1675: driver lock request failed       class: HDF5       major: Virtual File Layererror2
修复上面的bug后（export HDF5_USE_FILE_LOCKING=FALSE），照旧报错
>  Convert("cell_adata.h5ad",'h5SeuraWarning: Unknown file type: h5adCreating h5Seurat file for version 3.1.5.9900Adding X as dataAdding X as countsAdding meta.features from varAdding bbox as cell embeddings for bboxAdding contour as cell embeddings for contourError: Not a matrix dataseterror3
根据某个教程，可以直接在Rstudio里利用Python，加载半天后直接报错，大概是因为我用的是集群……
> library(reticulate)> scanpy <- import("scanpy")*** Error in `/share/app/R/4.0.2/lib64/R/bin/exec/R': free(): invalid pointer: 0x000000000aa5f3b8 ***======= Backtrace: =========/usr/lib64/libc.so.6(+0x81329)[0x7fb79efc4329]error4
实验在Python导出矩阵后再用R读入，R直接崩溃退出……
> fmat <- npyLoad("fmat.npy")> fmat <- npyLoad("fmat.npy") *** caught segfault ***address 0x7f3bf5770000, cause 'invalid permissions'Traceback: 1: .External(list(name = "InternalFunction_invoke", address = <pointer: 0x421bb10>,    dll = list(name = "Rcpp", path = "/jdfssz1/software/R4.0/lib/Rcpp/libs/Rcpp.so",       dynamicLookup = TRUE, handle = <pointer: 0x957cc10>,       info = <pointer: 0x1d5ba90>), numParameters = -1L), <pointer: 0x140526f0>,    filename, type, dotranspose) 2: npyLoad("/jdfssz3/fmat.npy")Possible actions:1: abort (with core dump, if enabled)2: normal R exit3: exit R without saving workspace4: exit R saving workspaceSelection: 3rm: cannot remove ‘/hwfssz5/tmp/RtmpRSe1IF’: Directory not emptyerro n++ ……

从Scanpy的Anndata对象提取信息并转成Seurat对象（实用于空间组且涉及h5文

所属分类: 问答交流

新帖推荐: 30日

推荐作品