Usage

LLM API Configuration

First, initialize scExtract configuration by running:

scExtract init

This will generate a config file (default config.ini) with the following structure:

[API]
API_KEY = Fill in your API key here
API_BASE_URL = Fill in your API base URL here, e.g. https://api.openai.com/v1
# Supported API styles: openai, claude
TYPE = openai
MODEL = Fill in your API model here, e.g. gpt-4o-mini
TOOL_MODEL = Fill in your API model here, e.g. gpt-4o-mini
...

For annotation-to-embedding conversion:

# Whether to convert the embedding for later use
CONVERT_EMBEDDING = false
EMBEDDING_MODEL = text-embedding-3-large
API_STYLES = azure
# Note: If you are using the openai API and set the API_STYLES to same,
# there is no need to specify the EMBEDDING_API_KEY and EMBEDDING_ENDPOINT
EMBEDDING_API_KEY = Fill in your API key here
EMBEDDING_ENDPOINT = Fill in your API endpoint here, e.g. https://api.openai.com/v1

Set CONVERT_EMBEDDING = true if you want to:
- Benchmark auto-annotation using similarity of annotated text
- Later integrate your datasets in a prior-aware manner
If using GPTs for text extraction:
- Set API_STYLES = same to use same settings
If using Claude:
- Set API_STYLES = azure|openai since Claude lacks official text-to-embedding models
- Provide additional API key and endpoint to use OpenAI’s text-to-embedding model

Raw Data Preparation

scExtract accepts raw data in .h5ad format. Raw counts / Normalized counts / Log-transformed counts are all supported and will be automatically detected.
Batch column must be specified in adata.obs for possible batch effect correction.
For most single-cell data forms deposited in public repositories, we provide various notebooks for preprocessing, see Notebooks for Preprocessing GSE Data.

Data Processing

To process raw data, run:

scExtract auto_extract -i adata.h5ad -p paper.pdf -o processed.h5ad

This will automatically extract processing parameters from the paper and process the data accordingly.

Benchmark Annotation

To benchmark cell type annotation, run (suppose the true cell type is stored in cell_type column):

scExtract benchmark -i processed.h5ad -o benchmark.h5ad -r benchmark_results.csv \
    --true_group_key cell_type --predict_group_key scExtract --similarity_key scExtract_similarity