========================================================================================================
Tutorial: Using scExtract to generate a single-cell skin atlas from published data
========================================================================================================

In this tutorial, we will demonstrate how to use scExtract to generate a single-cell skin atlas from published data.

scExtract setup and configuration
----------------------------------

First, install scExtract and configure the API key as described in the :ref:`installation` and :ref:`usage` sections.

Metadata extraction
-------------------

Next, find published articles that contain single-cell RNA-seq data related to the topic of interest. Download the PDF files and save them in the :code:`pdfs` folder. Then run the following command to extract the metadata from the PDF files; adding :code:`-s` also creates the project structure:

.. code-block:: bash

    scExtract get_metadata -i pdfs/*.pdf -o metadata.csv -s

This writes the metadata to the :code:`metadata.csv` file, which contains information such as authors, title, abstract, cell count per dataset, dataset location, and other details. Meanwhile, directories named after each :code:`sample_id` are created in the current working directory to store the results of subsequent data processing.

.. image:: sample_metadata.png

Raw data download and processing
--------------------------------

Next, we need to manually download each dataset. Datasets that are difficult to access or that require mapping can be excluded, and it is recommended to focus on GSE datasets only.

Using :code:`sample0` as an example, after downloading, place the data in the :code:`sample0/raw_data` directory. Then read the dataset directly with :code:`sc.read_text()`.
It's important to note that we need to add a :code:`Batch` column to :code:`adata.obs` for subsequent batch effect correction.
It's also recommended to add a :code:`Disease` column for cell type annotation across different diseases. To ensure accuracy, this step requires manual input. We have prepared comprehensive notebooks to help you complete it; please refer to :doc:`notebooks`.

.. code-block:: python

    # Batch = barcode prefix before the first '_'
    data.obs['Batch'] = data.obs.index.str.split('_').str[0]
    # Disease = Batch label without its last character
    data.obs['Disease'] = data.obs['Batch'].str[:-1]

Then store it in the :code:`raw_data/sample0_raw.h5ad` file.

Integration
-----------

One-step integration
~~~~~~~~~~~~~~~~~~~~~~~

Next, we can run the :code:`auto_integrate.smk` script in :code:`scExtract/Snakemake` to process all datasets automatically in batch. Remember to install the optional dependencies for prior-aware integration as described in the :ref:`installation` section.

First, place the script and config file in the project root directory, then rename :code:`config_sample.yaml` to :code:`config.yaml`, and modify the project directory and other settings in the config file:

.. code-block:: yaml

    project_dir: /home/wu/projects/scExtract # project directory
    init_config_ini: config.ini
    output_suffix: claude3 # custom suffix
    debug: False
    applied_files: all

    # Step 1: Auto extract
    config_pkl: config.pkl
    log_file: auto_extract.log

    # Step 2: Add embedding
    AddEmbedding.user_dataset: # you can add your own dataset here. Note that the dataset should also contain `Batch` and `Disease` columns
                               # and the cell type annotation should be specified in the `cell_type` column

    # Step 3: Integration
    method: scExtract

Finally, run the following command:

.. code-block:: bash

    snakemake -s auto_integrate.smk

By default, individual datasets are downsampled in a hierarchical and density-based manner before being integrated. If you have enough memory to integrate the full datasets, you can remove the :code:`--downsample` option in the :code:`Integrate` rule. If you do not need prior-aware integration, you can set :code:`method` to :code:`cellhint` to use the original CellHint method.

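
Before launching the workflow, it can help to confirm that every sample has its raw :code:`.h5ad` file in place. The following is a minimal sketch, not part of scExtract: the helper :code:`find_missing_raw_files` is hypothetical, and it assumes the layout :code:`<project_dir>/<sample_id>/raw_data/<sample_id>_raw.h5ad` inferred from the :code:`sample0` example above.

.. code-block:: python

    from pathlib import Path


    def find_missing_raw_files(project_dir, sample_ids):
        """Return the sample IDs that have no raw .h5ad file in the assumed location."""
        project = Path(project_dir)
        missing = []
        for sample_id in sample_ids:
            # Assumed layout: <project_dir>/<sample_id>/raw_data/<sample_id>_raw.h5ad
            raw_file = project / sample_id / "raw_data" / f"{sample_id}_raw.h5ad"
            if not raw_file.is_file():
                missing.append(sample_id)
        return missing

Calling :code:`find_missing_raw_files(".", ["sample0", "sample1"])` from the project root returns the samples that still need to be prepared. The sample IDs here are placeholders; in practice they would come from the :code:`sample_id` values in :code:`metadata.csv`.
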

Step-wise integration
~~~~~~~~~~~~~~~~~~~~~~~~~

* Step 1: Auto extract

If you want to integrate datasets on High-Performance Computing (HPC) clusters, you can use the step-wise integration method. Use the rules :code:`AddEmbedding` and :code:`Integrate_Input` instead of the :code:`Integrate` rule in the :code:`auto_integrate.smk` script. This will process each dataset and generate the merged embedding dictionary in the output directory.

.. code-block:: bash

    ls integrated_input_claude3_5/

    # Output
    claude3_5_embedding_dict.pkl
    sample{i}_claude3_5_extracted.h5ad

* Step 2: Integration

After generating the merged embedding dictionary, upload the dictionary and the extracted data to the HPC cluster. Then run the following command:

.. code-block:: bash

    scExtract integrate -f *.h5ad -m cellhint_prior \
        --embedding_dict_path claude3_5_embedding_dict.pkl --output_path integrate_output_tmp.h5ad

This step corrects mis-annotations, so the cell type annotations change and their embeddings need to be regenerated, which again requires internet access:

.. code-block:: bash

    scExtract extract_celltype_embedding -f integrate_output_tmp.h5ad --cell_type_column cell_type \
        --output_embedding_pkl harmonized_embedding_dict.pkl

Finally, we can run the second round of integration to generate the final integrated dataset:

.. code-block:: bash

    scExtract integrate -f integrate_output_tmp.h5ad -m scanorama_prior \
        --embedding_dict_path harmonized_embedding_dict.pkl --output_path integrate_output.h5ad
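
The three commands of Step 2 depend on each other: the output of the first integration feeds both the embedding regeneration and the second integration. The sketch below makes that chaining explicit by building the commands as argument lists. It is only an illustration: :code:`build_stepwise_commands` is a hypothetical helper, it uses only the subcommands and flags shown above, and the input file names in the example call are placeholders.

.. code-block:: python

    import shlex


    def build_stepwise_commands(extracted_files, embedding_dict,
                                tmp_output="integrate_output_tmp.h5ad",
                                harmonized_dict="harmonized_embedding_dict.pkl",
                                final_output="integrate_output.h5ad"):
        """Return the three step-wise integration commands as argument lists."""
        # Round 1: integrate the extracted datasets with the merged embedding dictionary
        first_round = ["scExtract", "integrate", "-f", *extracted_files,
                       "-m", "cellhint_prior",
                       "--embedding_dict_path", embedding_dict,
                       "--output_path", tmp_output]
        # Annotations were corrected in round 1, so their embeddings are regenerated
        re_embed = ["scExtract", "extract_celltype_embedding", "-f", tmp_output,
                    "--cell_type_column", "cell_type",
                    "--output_embedding_pkl", harmonized_dict]
        # Round 2: integrate again using the regenerated embeddings
        second_round = ["scExtract", "integrate", "-f", tmp_output,
                        "-m", "scanorama_prior",
                        "--embedding_dict_path", harmonized_dict,
                        "--output_path", final_output]
        return [first_round, re_embed, second_round]


    # Placeholder input files; print the commands instead of running them
    example_files = ["sample0_claude3_5_extracted.h5ad", "sample1_claude3_5_extracted.h5ad"]
    for command in build_stepwise_commands(example_files, "claude3_5_embedding_dict.pkl"):
        print(shlex.join(command))

To execute the commands rather than print them, each list can be passed to :code:`subprocess.run(command, check=True)`, which stops the chain as soon as one step fails.
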