Usage
LLM API Configuration
First, initialize scExtract configuration by running:
scExtract init
This will generate a config file (default config.ini) with the following structure:
[API]
API_KEY = Fill in your API key here
API_BASE_URL = Fill in your API base URL here, e.g. https://api.openai.com/v1
# Supported API styles: openai, claude
TYPE = openai
MODEL = Fill in your API model here, e.g. gpt-4o-mini
TOOL_MODEL = Fill in your API model here, e.g. gpt-4o-mini
...
For annotation-to-embedding conversion:
# Whether to convert the embedding for later use
CONVERT_EMBEDDING = false
EMBEDDING_MODEL = text-embedding-3-large
API_STYLES = azure
# Note: If you are using the openai API and set the API_STYLES to same,
# there is no need to specify the EMBEDDING_API_KEY and EMBEDDING_ENDPOINT
EMBEDDING_API_KEY = Fill in your API key here
EMBEDDING_ENDPOINT = Fill in your API endpoint here, e.g. https://api.openai.com/v1
- Set CONVERT_EMBEDDING = true if you want to:
Benchmark auto-annotation using similarity of annotated text
Later integrate your datasets in a prior-aware manner
- If using GPTs for text extraction:
Set API_STYLES = same to use same settings
- If using Claude:
Set API_STYLES = azure|openai since Claude lacks official text-to-embedding models
Provide additional API key and endpoint to use OpenAI’s text-to-embedding model
Raw Data Preparation
scExtractaccepts raw data in .h5ad format. Raw counts / Normalized counts / Log-transformed counts are all supported and will be automatically detected.Batchcolumn must be specified inadata.obsfor possible batch effect correction.For most single-cell data forms deposited in public repositories, we provide various notebooks for preprocessing, see Notebooks for Preprocessing GSE Data.
Data Processing
To process raw data, run:
scExtract auto_extract -i adata.h5ad -p paper.pdf -o processed.h5ad
This will automatically extract processing parameters from the paper and process the data accordingly.
Benchmark Annotation
To benchmark cell type annotation, run (suppose the true cell type is stored in
cell_typecolumn):
scExtract benchmark -i processed.h5ad -o benchmark.h5ad -r benchmark_results.csv \
--true_group_key cell_type --predict_group_key scExtract --similarity_key scExtract_similarity