Metadata-Version: 2.4
Name: coderdata
Version: 2.0.2
Summary: A package to download, load, and process multiple benchmark multi-omic drug response datasets
Project-URL: Homepage, https://github.com/PNNL-CompBio/candleDataProcessing
Project-URL: Documentation, https://pnnl-compbio.github.io/coderdata/
Project-URL: Repository, https://github.com/PNNL-CompBio/coderdata.git
Project-URL: Issues, https://github.com/PNNL-CompBio/coderdata/issues
Author-email: Jeremy Jacobson <jeremy.jacobson@pnnl.gov>, Yannick Mahlich <yannick.mahlich@pnnl.gov>, Sara Gosline <sara.gosline@pnnl.gov>
License: 2-clause BSD
License-File: LICENSE
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: pyyaml
Requires-Dist: requests
Requires-Dist: scikit-learn
Description-Content-Type: text/markdown

## Cancer Omics Drug Experiment Response Dataset 

There has been a recent explosion of deep learning algorithms that tackle the computational problem of predicting drug treatment outcomes from baseline molecular measurements. To support this, we have built a benchmark dataset that harmonizes diverse datasets to better assess algorithm performance.

This package collects diverse sets of paired molecular datasets with corresponding drug sensitivity data. All data here are reprocessed and standardized so they can be easily used as a benchmark dataset for the broader community.
This repository leverages existing datasets to collect the data
required for deep learning model development. Since each deep learning model
requires distinct data capabilities, the goal of this repository is to
collect and format all data into a schema that can be leveraged for
existing models.

![Coderdata Motivation](coderdata_overview.jpg?raw=true "Motivation behind coderdata development")


The goal of this repository is two-fold. First, it aims to collate and
standardize the data for the broader community; this requires
running a series of scripts to build and append to a standardized data
model. Second, it provides a series of scripts that pull from the data
model to create model-specific data files that can be consumed by
existing modeling infrastructure.

## Data access
For access to the latest version of CoderData, please visit our
[documentation site](https://pnnl-compbio.github.io/coderdata/), which provides access to Figshare and
instructions for using the Python package to download the data.
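
In broad strokes, usage looks like the sketch below; the function and
dataset names (`download`, `load`, `beataml`) are illustrative
assumptions, so consult the documentation site for the current
interface.

```python
import coderdata as cd

# Hypothetical sketch -- names are assumptions; see the documentation
# site for the actual interface.
cd.download(name="beataml")         # fetch the dataset archive from Figshare
beataml = cd.load(name="beataml")   # load the downloaded tables
print(beataml.experiments.head())   # drug response measurements
```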

## Data format
All coderdata files are in text format - either comma delimited or tab
delimited (depending on data type). Each dataset can be evaluated
individually according to the CoderData schema, which is maintained in [LinkML](schema/coderdata.yaml)
and can be updated via a commit to the repository. For more details,
please see the [schema description](schema/README.md).
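
Because the files are plain comma- or tab-delimited text, they can be
read with standard tools, for example pandas (the file names below are
illustrative):

```python
import pandas as pd

# Illustrative file names; actual names follow the CoderData schema.
samples = pd.read_csv("beataml_samples.csv")                  # comma delimited
omics = pd.read_csv("beataml_transcriptomics.tsv", sep="\t")  # tab delimited
print(samples.shape, omics.shape)
```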

## Building a local version

The build process can be found in our [build
directory](build/README.md). Here you can follow the instructions to
build your own local copy of the data on your machine. 

## Adding a new dataset

We have standardized the build process so an additional dataset can be
built locally or as part of the next version of CoderData. Here are the
steps to follow:

1. First visit the [build
directory](build/README.md) and ensure you can build a local copy of
CoderData. 

2. Check out this repository and create a subdirectory of the
[build directory](build) containing your own build files.

3. Develop your scripts to build the data files according to our
[LinkML Schema](schema/coderdata.yaml). This will require collecting
the following metadata:
- Entrez gene identifiers (or you can map them using the `genes.csv` file from the original build; see the sketch after this list)
- sample information such as species and model system type
- drug names that can be searched on PubChem
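
For example, raw gene symbols can be mapped to Entrez identifiers with
the `genes.csv` file produced by the original build; a minimal sketch
(the column names are assumptions, so verify them against the schema):

```python
import pandas as pd

genes = pd.read_csv("genes.csv")  # produced by the original build
# Assumed columns "gene_symbol" and "entrez_id"; verify against the schema.
symbol_to_entrez = dict(zip(genes["gene_symbol"], genes["entrez_id"]))
print(symbol_to_entrez.get("TP53"))
```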

You can validate each file using the [LinkML
validator](https://linkml.io/linkml/data/validating-data) together
with our schema file.
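
Validation can also be scripted with the `linkml` Python package; a
rough sketch, assuming a `Sample` target class and an installed
`linkml` (check `schema/coderdata.yaml` for the actual class names):

```python
# Requires: pip install linkml
from linkml.validator import validate_file

# Assumed target class and file name; verify against schema/coderdata.yaml.
report = validate_file("my_dataset_samples.csv", "schema/coderdata.yaml", "Sample")
for result in report.results:
    print(result.severity, result.message)
```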

You can use the following scripts as part of your build process:
- [build/utils/fit_curve.py](build/utils/fit_curve.py): This script
  takes dose-response data and generates the dose-response statistics
  required by CoderData.
- [build/utils/pubchem_retrieval.py](build/utils/pubchem_retrieval.py):
  This script retrieves structure and drug synonym information
  required to populate the `Drug` table.

4. Wrap your scripts in standard shell scripts with the following names
and arguments:

| shell script       | arguments                 | description         |
|--------------------|---------------------------|---------------------|
| `build_samples.sh` | [latest_samples]          | Takes the latest version of the samples file generated by the coderdata build. |
| `build_omics.sh`   | [gene file] [samplefile]  | Takes the `genes.csv` generated in the original build as well as the sample file generated above. |
| `build_drugs.sh`   | [drugfile1,drugfile2,...] | Takes a comma-delimited list of all drug files generated by previous builds. |
| `build_exp.sh`     | [samplefile] [drugfile]   | Takes the sample file and drug file generated by the previous scripts. |
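
Each wrapper typically just forwards its arguments to a
dataset-specific script. Below is a minimal, hypothetical skeleton of a
Python entry point that a `build_samples.sh` wrapper might invoke with
`python build_my_samples.py "$1"`; all column names are assumptions, so
verify them against the schema.

```python
"""Hypothetical sample builder that a build_samples.sh wrapper would call:
    python build_my_samples.py "$1"
"""
import sys

import pandas as pd

def main(latest_samples: str) -> None:
    prev = pd.read_csv(latest_samples)
    # Continue numbering after the previous build so sample IDs stay unique.
    # "improve_sample_id" is an assumed column name; check the schema.
    next_id = int(prev["improve_sample_id"].max()) + 1
    new = pd.DataFrame({
        "other_id": ["my-sample-1"],
        "species": ["Homo sapiens"],
        "model_type": ["cell line"],
    })
    new["improve_sample_id"] = range(next_id, next_id + len(new))
    new.to_csv("my_dataset_samples.csv", index=False)

if __name__ == "__main__":
    main(sys.argv[1])
```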

5. Put the Docker container file inside the [Docker
directory](./build/docker) with the name
`Dockerfile.[datasetname]`. 

6. Run `build_all.py` from the root directory; it should now pick up
your Dockerfile and call the scripts in your Docker container to
build the files.


