# Simulation of Synthetic Cells (AnnData Object)

### Building OneSC simulator 
After the inference of GRNs from [previous step](infer_grn_scanpy.md), we can perform simulations using the GRN as a backbone. You can download the inferred GRN from previous step [here](https://cnobjects.s3.amazonaws.com/OneSC/OneSC_network.csv). For all simulations, we need to define the initial state. In our case, we know that cells start in the CMP state. To determine the Boolean state of genes in the CMP state, we subset the adata to those cells and then apply thresholds on the mean expression and the percent of cells in which the gene is detected:
```
boolean_states_df = onesc.define_boolean_states_anndata(adata, cluster_col = 'cell_types')
xstates = boolean_states_df['CMP'] * 2
```
Now we construct a OneSC simulator object:
```
netname = 'CMPdiff'
netsim = onesc.network_structure()
netsim.fit_grn(iGRN)
sim = onesc.OneSC_simulator()
sim.add_network_compilation(netname, netsim)
```
### Simulate wildtype trajectory 
Finally we are ready to simulate expression state trajectories using our GRN. 

- OneSC_simulator: the OneSC simulator object to use. 
- initial_exp_dict (dict): dictionary of initial state values.
- network_name (str): the name of the fitted gene regulatory network structure that the user want to use for simulation. 
- perturb_dict (dict, optional): a dictionary with the key of gene name and value indicating the in silico perturbation. If the user want to perform overexpression, then set the value between 0 and 2. If the user want to perform knockdown, then set the value between 0 and -2. Defaults to dict().
- num_runs (int, optional): number of simulations to run. Defaults to 10.
- n_cores (int, optional): number of cores for parallel computing. Defaults to 2.
- num_sim (int, optional): number of simulation steps. Defaults to 1000.
- t_interval (float, optional): the size of the simulation step. Defaults to 0.01.
- noise_amp (int, optional): the amplitude of noise. Defaults to 0.1

```
simlist_wt = onesc.simulate_parallel_adata(sim, xstates, netname, n_cores = 8, num_sim = 1000, num_runs = 32, t_interval = 0.1, noise_amp = 0.5)
```
`simulate_parallel_adata` returns a list of annData objects. Each annData contains `num_sim` simulated cells starting at the `initial_exp_dict` state, and progressing subject to the regulatory constraints defined by the provided GRN modulated by randomly generated noise (if allowed). Each annData has .obs['sim_time']. 

### Setup PySCN for classification 
We will use PySCN to annotate the states of the simulated cells. Train the classifer now:
```
adata.var['highly_variable'] = True

tids, vids = pySCN.ut.split_adata_indices(adata, n_cells=50,groupby="cell_types")
adTrain = adata[tids].copy()
adHO = adata[vids].copy()

groupby="cell_types"
n_rand = 50
nTopGenes = 6
nTopGenePairs = 8
n_comps = 3

clf = pySCN.tl.train_classifier(adTrain, groupby, nRand = n_rand, nTopGenes = nTopGenes, nTopGenePairs = nTopGenePairs, n_comps = n_comps)
```
Classify held out data:
```
pySCN.tl.classify_anndata(adHO, clf)
pySCN.pl.heatmap_scores(adHO, groupby='SCN_class_argmax')
```
![pySCN heatmap](./_static/images/pyscn_heatmap.png)
Note: the function `train_rank_classifier()` ranks transforms training and query data instead of TSP. Be forewarned that it is likely to be slow if applied to adata objects with a lot of genes. 

### Classify Simulated Cells 
Let's look at the transcriptomic states over sim_time for one simulated trajectory. We will classify the cells first, and then visualize
```
ad_sim1 = simlist_wt[0].copy()
pySCN.tl.classify_anndata(ad_sim1, clf)

# a hack because sc.pl.heatmap requires a 'groupby', so groupby simulation time bin
tmp_obs = ad_sim1.obs.copy()
bins = np.linspace(-1, 999, 11)
labels = [f"{int(bins[i]) + 1}-{int(bins[i+1])}" for i in range(len(bins)-1)]

tmp_obs['sTime_bin'] = pd.cut(tmp_obs['sim_time'], bins=bins, labels=labels)
ad_sim1.obs = tmp_obs

pySCN.pl.heatmap_scores(ad_sim1, groupby = 'sTime_bin')
```
![pySCN heatmap one wt](./_static/images/scn_hm_one_wt_trajectory.png)
Instead of looking the trajectory of a single cell, we can sample the end stages of all simulations:
```
ad_wt = onesc.sample_and_compile_anndatas(simlist_wt, X=50, time_bin=(80, 100), sequential_order_column='sim_time')
pySCN.tl.classify_anndata(ad_wt, clf)
pySCN.pl.heatmap_scores(ad_wt, groupby = 'SCN_class_argmax')
```
![pySCN heatmap of final stages](./_static/images/scn_hm_wt.png)

### Simulate knockout trajectory 
Now, let's simulate trajectories of cells in which *Cebpa* is knocked out
```
perturb_dict = dict()
perturb_dict['Cebpa'] = -1 
simlist_cebpa_ko = onesc.simulate_parallel_adata(sim, xstates, 'CMPdiff', perturb_dict = perturb_dict, n_cores = 8, num_sim = 1000, t_interval = 0.1, noise_amp = 0.5)
```
Now fetch the simulated cells, sampling from the end stages, classify them, and visualize the results.
```
ad_cebpa_ko = onesc.sample_and_compile_anndatas(simlist_cebpa_ko, X=50, time_bin=(80, 100), sequential_order_column='sim_time')
pySCN.tl.classify_anndata(ad_cebpa_ko, clf)
pySCN.pl.heatmap_scores(ad_cebpa_ko, groupby = 'SCN_class_argmax')
```
![pySCN KO heatmap](./_static/images/scn_hm_cepba_ko.png)
We can also directly compare the proportions of cell types across simulations/perturbations as follows:
```
pySCN.pl.stackedbar_composition_list([adHO,ad_wt, ad_cebpa_ko], obs_column = "SCN_class_argmax", labels=["HeldOut","Wt", "Cebpa_KO"], color_dict = adHO.uns['SCN_class_argmax_colors'])
```
![proportion plot](./_static/images/sim_results.png)

If we want to perform multiple perturbations, we can add that into the perturb dict, and pass that into the simulation function. 
```
perturb_dict = dict()
perturb_dict['Cepbe'] = -1 
perturb_dict['Gata2] = 1
```