Quick Start#
Description#
MATTE (Module Alignment of TranscripTomE) is a python package aiming to analysis transcriptome from samples with different phenotypes in a module view. Differiential expression (DE) is commonly used in analysing transcriptome data. But genes are not work alone, they collaborate. Network and module based differential methods are developed in recent years to obtain more information. New problems appears that how to make sure module or network structure is preserved in all of the phenotypes. To that end, we proposed MATTE to find the conserved module and diverged module by treating genes from different phenotypes as individual ones. By doing so, meaningful markers and modules can be found to better understand what’s really difference between phenotypes.
Advantages
In the first place, MATTE merges the data from phenotypes, seeing genes from different phenotypes as new analyzing unite. By doing so, benefits got as follows:
MATTE considering the information in phenotypes in the preprocessing stage, hoping to find more interesting conclusion.
MATTE is actually making transcriptome analysis includes the relationship between phenotypes, which is of significance in cancer or other complex phenotypes.
MATTE can deal with more noise thanks to calculation of relative different expression (RDE) and ignore some of batch effect.
In a module view, “Markers” can be easily transfer to other case but not over fits compare to in a gene view.
The result of MATTE can be easily analysed.
Install#
Install from pip is recommended.
pip install MATTE
Genes’ Clustering#
Preprocess
CLustering
Analysis
Pipeline#
import MATTE
print(MATTE.__version__)
## init with default settings
pipeline = MATTE.AlignPipe(stats_type='mean', target='cluster', preprocess=True)
## Showing the Pipe composition
pipeline
1.2.1
MATTE calculation pipeline
## STEP 0 <PipeFunc> inputs_check()
## STEP 1 <PipeFunc> RPKM2TPM()
## STEP 2 <PipeFunc> log2transform()
## STEP 3 <PipeFunc> expr_filter(gene_filter=None)
## STEP 4 <PipeFunc> LocKernel_Transform(kernel_type='mean',centering_kernel=True,outer_subtract_absolute=True,double_centering=True)
## STEP 5 PCA(n_components=16)
## CLUSTER STEP 0 <PipeFunc> CrossCluster(preset='kmeans',n_clusters=8,method='a',dist_type='a',n_iters=20)
## CLUSTER STEP 1 <PipeFunc> build_results()
## In MATTEPipe stores some functions (PipeFunc type)
pipeline.funcs,pipeline.cluster_func
([<PipeFunc> inputs_check(),
<PipeFunc> RPKM2TPM(),
<PipeFunc> log2transform(),
<PipeFunc> expr_filter(gene_filter=None),
<PipeFunc> LocKernel_Transform(kernel_type='mean',centering_kernel=True,outer_subtract_absolute=True,double_centering=True),
PCA(n_components=16)],
[<PipeFunc> CrossCluster(preset='kmeans',n_clusters=8,method='a',dist_type='a',n_iters=20),
<PipeFunc> build_results()])
## Running the test.(data is generated randomly)
R,data = MATTE.package_test(n_genes=1000,pipe=pipeline,verbose=False)
# basic usage
R = pipeline.calculate(df_exp=data['df_exp'],df_pheno=data['df_pheno'])
Mon May 30 15:25:52 2022 Running function <PipeFunc> inputs_check()
Mon May 30 15:25:52 2022 Running function <PipeFunc> RPKM2TPM()
Mon May 30 15:25:52 2022 Running function <PipeFunc> log2transform()
Mon May 30 15:25:52 2022 Running function <PipeFunc> expr_filter(gene_filter=None)
Mon May 30 15:25:52 2022 Running function <PipeFunc> LocKernel_Transform(kernel_type='mean',centering_kernel=True,outer_subtract_absolute=True,double_centering=True)
Mon May 30 15:25:52 2022 Calculating the kernel matrix using mean
Mon May 30 15:25:53 2022 Tranforming using model PCA(n_components=16)
Mon May 30 15:25:58 2022 Running function <PipeFunc> CrossCluster(preset='kmeans',n_clusters=8,method='a',dist_type='a',n_iters=20)
Mon May 30 15:25:58 2022 Running function <PipeFunc> build_results()
Mon May 30 15:25:58 2022 building cluster results
Inputs#
## Standard inputs
data['df_exp']
| sample0 | sample3 | sample4 | sample5 | sample8 | sample10 | sample13 | sample14 | sample16 | sample19 | ... | sample86 | sample87 | sample88 | sample90 | sample92 | sample93 | sample95 | sample96 | sample98 | sample99 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gene0 | 2068.782009 | 2074.743627 | 2358.613186 | 2214.779271 | 2615.754304 | 2416.816078 | 2324.006712 | 2568.534221 | 1790.074733 | 2156.944223 | ... | 699.020783 | 408.182918 | 13.719141 | 614.162325 | 242.881932 | 537.560430 | 640.396277 | 71.989106 | 15.671641 | 121.134253 |
| gene1 | 1736.262834 | 1102.800776 | 1202.438027 | 1846.884467 | 1004.449435 | 1161.452514 | 1267.909764 | 1432.889514 | 1176.173534 | 633.488180 | ... | 1426.345172 | 1447.027209 | 1606.243963 | 2253.905879 | 1643.103867 | 2278.306248 | 1456.288578 | 2015.417148 | 1947.948739 | 1425.494850 |
| gene2 | 2014.528625 | 2398.080280 | 1944.729892 | 2316.274409 | 2131.565037 | 2298.541242 | 2531.612209 | 2596.111747 | 2413.634703 | 2207.004282 | ... | 805.591423 | 937.059757 | 811.347534 | 819.525380 | 617.231009 | 660.709923 | 652.394533 | 823.183763 | 890.001682 | 982.703612 |
| gene3 | 659.427115 | 163.787569 | 561.642612 | 378.384480 | 519.343153 | 19.082749 | 847.503441 | 381.925232 | 707.469305 | 276.173993 | ... | 1487.512143 | 1086.595268 | 315.433694 | 1820.512500 | 1701.598813 | 1402.320642 | 1623.801592 | 1282.006193 | 1237.460095 | 862.684200 |
| gene4 | 557.430594 | 391.416889 | 842.972964 | 675.541378 | 850.962173 | 811.020469 | 986.334022 | 1345.391218 | 1264.336918 | 1136.040696 | ... | 492.540540 | 1170.198803 | 637.125151 | 83.639511 | 846.553239 | 718.903346 | 285.646841 | 68.010063 | 426.350989 | 523.634085 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| gene995 | 1079.554741 | 1256.576785 | 371.790347 | 1552.897702 | 837.588520 | 781.422702 | 1410.911788 | 280.789440 | 1074.169879 | 891.334274 | ... | 914.248736 | 1039.659511 | 1424.090367 | 1528.602309 | 1048.966685 | 1217.551321 | 1595.634636 | 892.179251 | 733.385461 | 1326.974023 |
| gene996 | 1466.756618 | 682.381925 | 655.547941 | 1217.328283 | 1027.033929 | 743.552669 | 1303.702866 | 156.088532 | 1100.372258 | 1653.174072 | ... | 171.781193 | 409.069384 | 1064.053578 | 409.015074 | 1108.110725 | 522.949709 | 1141.158675 | 807.635314 | 650.720516 | 935.940121 |
| gene997 | 2667.592315 | 2705.673085 | 2692.679566 | 2451.598273 | 2265.107811 | 1688.030061 | 3214.672455 | 2487.450931 | 3213.472788 | 1963.800244 | ... | 2788.161130 | 2177.646822 | 1659.035894 | 1952.969200 | 2790.787782 | 2053.803419 | 2259.536840 | 2437.241921 | 1967.708017 | 2296.309486 |
| gene998 | 201.558856 | 400.279793 | 812.383524 | 262.929812 | 671.040851 | 580.943332 | 343.901157 | 476.913661 | 667.557218 | 168.932862 | ... | 621.693365 | 832.883736 | 1035.085086 | 512.018102 | 722.357924 | 507.593183 | 608.552576 | 169.301006 | 612.163599 | 186.982519 |
| gene999 | 1407.004628 | 1603.523673 | 1292.689612 | 1675.310108 | 1112.094279 | 907.000656 | 741.737107 | 720.647700 | 1740.447591 | 844.582854 | ... | 1109.305059 | 1289.918539 | 1080.680714 | 1104.604265 | 224.328929 | 1545.090453 | 1048.014265 | 1194.242678 | 2064.968748 | 1023.087880 |
1000 rows × 100 columns
data['df_pheno']
sample0 P0
sample1 P1
sample2 P1
sample3 P0
sample4 P0
..
sample95 P1
sample96 P1
sample97 P0
sample98 P1
sample99 P1
Length: 100, dtype: object
Clustering Results#
R.cluster_properties
{'error': 0.019708484590104557,
'method': 'kmeans_a',
'dist_type': 'a',
'n_clusters': 8,
'npass': 20,
'score': 1201.8480840703078}
R.res
| P0 | P1 | matched | |
|---|---|---|---|
| gene0 | 6 | 1 | False |
| gene1 | 0 | 3 | False |
| gene2 | 6 | 2 | False |
| gene3 | 4 | 3 | False |
| gene4 | 2 | 4 | False |
| ... | ... | ... | ... |
| gene995 | 0 | 0 | True |
| gene996 | 7 | 7 | True |
| gene997 | 6 | 6 | True |
| gene998 | 5 | 2 | False |
| gene999 | 0 | 0 | True |
1000 rows × 3 columns
from MATTE.analysis import Fig_SampleFeature
sf = R.SampleFeature(corr=False)
f = Fig_SampleFeature(sf,R.pheno)
R.ModuleSNR(sf)[0:5]
M6.1_0 3.900981
M6.7_0 3.737174
M4.6_0 3.094177
M6.5_0 2.974524
M2.6_0 2.712792
dtype: float64
GeneRanker#
GeneRanker is a buildin class that select key genes or embed data by
module calculation.
In this step, multiple phenotypes can be received.
from MATTE import GeneRanker
ranker = GeneRanker(
view='dist', # or cross-dist or module or gene
pipeline=None)
gene_rank = ranker.gene_rank(X = data['df_exp'].T, y=data['df_pheno'],verbose=False)
gene_rank
gene0 49.673632
gene1 30.753564
gene2 52.062673
gene3 42.914437
gene4 29.995634
...
gene995 30.475068
gene996 29.614591
gene997 52.701640
gene998 34.376199
gene999 30.843078
Length: 1000, dtype: float64
Module Analysis#
from MATTE.analysis import Fig_SampleFeature
# Showing the Summary.
R.summary()
# two figures can be get by following:
if False:
f1 = R.Vis_Jmat() # genes' distribution
# Showing the samples' distribution
sf = R.SampleFeature()
f = Fig_SampleFeature(sf,labels=R.pheno,dpi=300,model=PCA())
# --- Number of genes:
Same Module Genes: 592
Different Module Genes: 408
# --- clustering score:
error 0.019385038222335626
method kmeans_a
dist_type a
n_clusters 8
npass 20
score 1347.1291565248384
# --- samples' distribution:
Function Analysis#
Read go annote files. File can be downloaded from https://ftp.ncbi.nih.gov/gene/DATA/
import pandas as pd
annote_file = pd.read_table("A:/Data/Annotation/gene2go")
annote_file = annote_file[annote_file["#tax_id"] == 9606]
def lst_change(lst,target,changed):
ret = []
for i in lst:
if i == target:
ret.append(changed)
else:
ret.append(i)
return ret
## Change columns name.
annote_file.columns = lst_change(annote_file.columns,"GO_term","Term")
annote_file.columns = lst_change(annote_file.columns,"GO_ID","Term_ID")
## randomly select some genes
import numpy as np
from random import sample
unique_genes = np.unique(annote_file['GeneID'].values)
selected_genes = sample(unique_genes.tolist(),100)
The format of input files are following:
gene_set iteral object, containing gene id.
annote_file with columns ["Term_ID","GeneID","Term","Category"],and each row is an entry.
annote_file
| #tax_id | GeneID | Term_ID | Evidence | Qualifier | Term | PubMed | Category | |
|---|---|---|---|---|---|---|---|---|
| 640889 | 9606 | 1 | GO:0003674 | ND | enables | molecular_function | - | Function |
| 640890 | 9606 | 1 | GO:0005576 | HDA | located_in | extracellular region | 27068509 | Component |
| 640891 | 9606 | 1 | GO:0005576 | IDA | located_in | extracellular region | 3458201 | Component |
| 640892 | 9606 | 1 | GO:0005576 | TAS | located_in | extracellular region | - | Component |
| 640893 | 9606 | 1 | GO:0005615 | HDA | located_in | extracellular space | 16502470 | Component |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 971204 | 9606 | 118568804 | GO:0004930 | IEA | enables | G protein-coupled receptor activity | - | Function |
| 971205 | 9606 | 118568804 | GO:0004984 | IEA | enables | olfactory receptor activity | - | Function |
| 971206 | 9606 | 118568804 | GO:0007186 | IEA | involved_in | G protein-coupled receptor signaling pathway | - | Process |
| 971207 | 9606 | 118568804 | GO:0016021 | IEA | located_in | integral component of membrane | - | Component |
| 971208 | 9606 | 118568804 | GO:0050911 | IEA | involved_in | detection of chemical stimulus involved in sen... | - | Process |
330320 rows × 8 columns
from MATTE.analysis import FunctionEnrich
all_items,term_genes = FunctionEnrich(annote_file,selected_genes)
100%|██████████| 18684/18684 [02:06<00:00, 147.40it/s]
The function FunctionEnrich return two object:
all_item Terms with p_value, fdr and other information
term_genes each term enriches what genes
## Filtering the enriched results
target = all_items.groupby(by="Category").apply(lambda x: x.sort_values(by="p_value").iloc[0:5,:])
target.index= [i[1] for i in target.index]
target
| Term | Category | n_enriched | n_backgroud | p_value | fdr | gene_ratio | |
|---|---|---|---|---|---|---|---|
| GO:0005685 | U1 snRNP | Component | 3 | 33 | 0.000538 | 0.94667 | 0.03 |
| GO:0034709 | methylosome | Component | 2 | 12 | 0.001479 | 1.0 | 0.02 |
| GO:0042627 | chylomicron | Component | 2 | 13 | 0.001743 | 1.0 | 0.02 |
| GO:0034361 | very-low-density lipoprotein particle | Component | 2 | 20 | 0.004153 | 1.0 | 0.02 |
| GO:0097453 | mesaxon | Component | 1 | 1 | 0.004834 | 1.0 | 0.01 |
| GO:0004729 | oxygen-dependent protoporphyrinogen oxidase ac... | Function | 1 | 1 | 0.004834 | 1.0 | 0.01 |
| GO:0061627 | S-methylmethionine-homocysteine S-methyltransf... | Function | 1 | 1 | 0.004834 | 1.0 | 0.01 |
| GO:0032029 | myosin tail binding | Function | 1 | 1 | 0.004834 | 1.0 | 0.01 |
| GO:0030742 | GTP-dependent protein binding | Function | 2 | 22 | 0.005018 | 1.0 | 0.02 |
| GO:0004364 | glutathione transferase activity | Function | 2 | 28 | 0.008057 | 1.0 | 0.02 |
| GO:0045652 | regulation of megakaryocyte differentiation | Process | 2 | 8 | 0.000636 | 1.0 | 0.02 |
| GO:0045665 | negative regulation of neuron differentiation | Process | 3 | 58 | 0.002789 | 1.0 | 0.03 |
| GO:0045653 | negative regulation of megakaryocyte different... | Process | 2 | 18 | 0.003365 | 1.0 | 0.02 |
| GO:1905608 | positive regulation of presynapse assembly | Process | 1 | 1 | 0.004834 | 1.0 | 0.01 |
| GO:1905095 | negative regulation of apolipoprotein A-I-medi... | Process | 1 | 1 | 0.004834 | 1.0 | 0.01 |
from MATTE.analysis import Fig_Fuction
f = Fig_Fuction(target,"p_value",dpi=300)