User Guide#

The user guide will talk about how the pipeline is implemented and how to modified pipeline or setup a custom pipeline.

Overview of MATTE#

Gene clustering process#

Broadly speaking, there are three steps:

preprocessing
clustering
Module analysis

The results that each of these parameters can lead to and produce will be described in detail in the following three procedures.

Pipeline design#

PipeFunc is a most basic class in MATTE python package. The class store a function and it’s arguments but not run it until calling it like calling a funtion. And a Pipefunc will show it’s arguments and name by print(PipeFunc) or str(PipeFunc). The string will be updated everytime parameters are added by add_params, but not when being called. When you call it, stored arguments can be ellipsised. But arguments you put in will also be considered.

In a AlignPipe class, there are several PipeFunc and transformer. Here, transformer means object that has fit and transform methods. When calling AlignPipe.calculate, PipeFunc and transformer will be performed one by one. All funcs are stored in AlignPipe.funcs attributes and AlignPipe.cluster_func attributes. One can use build-in function add_steps to add a preprocessing step into AlignPipe.funcs and set_cluster_method to add clustering step into AlignPipe.cluster_func. These two are list object that store PipeFunc and transformer. You can also add parameters by add_param, (in this step, a new function that return a dict contain param value and name will be generated) add transformer by add_transformer. When you create a AlignPipe object, by setting init=True, default pipeline will be created. All functions in the pipeline should return a dict object,(all these returns are saved in a dict temp_result) and by this way its return will be record and use in the following functions.

There are four methods to run pipeline, calculate and fit_transform is the same, but fit_transform has a parameters that allow save temp result. The saved temp dict object can be used in calculate_from_temp, where only clustering functions will be performed. And transform will not fit the data but use fitted transformers in the pipeline. If you want to use attributes of transformers, use get_attribute_from_transformer.

Some key temps are following: * When running the pipeline, before clustering there should generate a object named “before_clustering_df” to used in the cluster_fun. * And in the preprocessing steps, return object named “df_exp” to corver input. * When using CrossCluster object to perform clustering, cluster_label and cluster_properties will be passed to build_results * And build_results will return a object named Result.

For convenience, a decorator utils.kw_decorator is used in most of functions in pipeline. It makes a fucntion return a dict according to parameter kw(str or list), and the function will accept additional parameters but not raise error.

Preprocessing#

Key function in preprocessing is Kernel_transform, who perform mostly important relative differential expression. Some kernel transform is implemented: ‘meanrbf’,‘meanlocalrbf’,‘mean’,‘median’,‘medianrbf’ and ‘medianlocalrbf’. For normal, mean kernels is recommended, but with outliers median function may have a better performance. rbf related kernels performs a variant Gauss kernel transformation after RDE, and in the localrbf kernel, each genes’ gamma is depended on it’s nearest K neibors’ distance. A function is also accept to kernel_type parameter. And this function should be performed in two vector, (it should be acctepted by scipy.spatial.distance.cdist). But if the funtion is of great complexities, it can cost much time for this step will performed to each gene in the each phenotype. After kernel matrix calculation, a double_centering and sklearn.preprocessing.KernelCenterer will be performed by default. In the previous test, these two will make results get more score, but may change in your case.

Other functions’ detail can be seen in API page, exp_filter filter out genes with extremely low expression(custom function is allowed too), and RPKM2TPM is an important normalization preprocessing steps. Keeping default preprocessing steps is highly recommended.

Clustering#

CrossCluster is clustering function or clusterer warpper. There are three pre-implemented methods: kmeans (wrapper from Bio.Cluster.kclust), spectrum and spectral_bicluster (wrapper from sklearn implemention). So other keywords parameters should seen in the original implemention. In spectrum clustering, by default will calculate distance matrix first(parameter use_aff). If not using preset implemention, by build_from_func or build_from_model to set up clustering methods.

Gene Ranker#

GeneRanker accept more than two phenotypes, and will use pipeline to calculate pairs of phenotypes one by one, fincally sum/concatenate the results. Inputs of ModuleEmbedder is not the same as pipeline. Row of Expression data is sample, column is gene.

There are two types of ranking :

There are several types of GeneRanker: 1. ‘module’ and ‘gene’

will cluster genes according to their expression. And Use module SNR to rank genes. In gene mode, the SNR will be corrected by the correlation of gene expression and module eigen.

‘dist’,‘cross-dist’.

In dist mode, the distance of each genes will be calculated. And genes will be ranked according to the sum of distance to each other genes. In cross-dist mode, the distance of differential expression and differential co-expression will be merged.