Effortlessly Convert Childhood Cancer Data with MCI_JSON2TSV: A Comprehensive Guide
Transforming complex JSON data into usable formats can be a headache, especially when dealing with critical research data. The MCI_JSON2TSV tool streamlines the process of converting COG and IGM formatted Childhood Cancer Data Initiative (CCDI) clinical report JSON files into organized TSV files. This makes the data easier to access and speeds up research workflows. If you're working with CCDI data, this guide shows how to use MCI_JSON2TSV to simplify your data wrangling.
What is MCI_JSON2TSV and Why Should You Use It?
MCI_JSON2TSV is a Python script designed to convert COG (Children's Oncology Group) and IGM (Institute for Genomic Medicine) JSON files into tab-separated values (TSV) format. Here's why it's a game-changer:
- Aggregated Data: Consolidates fields from various form types (Demographics, Treatment, Follow-Up, etc.) into a single TSV file.
- Form-Specific Parsing: Option to parse COG JSONs into separate TSVs based on form type (Demography, Final Diagnosis, etc.).
- Variant Result Extraction: Extracts variant result information from IGM JSONs into individual TSVs (methylation results, somatic and germline variant results, etc.).
System Requirements for MCI_JSON2TSV
Make sure your system meets these requirements to run MCI_JSON2TSV smoothly:
- Python: Version 3.8 or higher.
- Pandas: Version 2.0 or higher.
- NumPy: Version 2.0 or higher.
- pytest and pytest-mock: Only needed if you intend to run the unit tests.
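The requirements above can be checked before running the tool. The following is a minimal sketch, not part of MCI_JSON2TSV itself: the helper names parse_version and check_environment are hypothetical, and the minimum versions are simply copied from the list above.

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the requirements list (assumed to be accurate).
MIN_PYTHON = (3, 8)
MINIMUMS = {"pandas": (2, 0), "numpy": (2, 0)}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to at least major.minor so that "2" compares equal to (2, 0).
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def check_environment():
    """Collect human-readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.8 or higher is required")
    for name, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed < minimum:
            problems.append(f"{name} {minimum[0]}.{minimum[1]} or higher is required")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["Environment looks fine"]:
        print(line)
```

Running the file directly prints one line per problem found, or a single confirmation line when nothing is missing.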
Installation: Get Started in Minutes
Install MCI_JSON2TSV with these simple steps:
- Clone the repository:
git clone https://github.com/CBIIT/ChildhoodCancerDataInitiative-MCI_JSON2TSV.git
- Navigate to the /src directory to find the Python scripts.
Usage: Converting Your JSON Files to TSV
The basic command-line structure is as follows:
python MCI_JSON2TSV.py -d <input DIR> -o <output DIR> (-f -r)
- -d/--directory: Specifies the path to the directory containing your JSON files.
- -o/--output_path: Defines the directory where the converted TSV files will be saved.
- -f/--form_parsing (Optional): Use this flag to generate separate TSV files for each COG form type.
- -r/--results_variants_section_parse (Optional): Use this flag to extract IGM variant results into separate TSV files.
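If you want to drive the conversion from another Python program, the command line above can be assembled programmatically. This is a minimal sketch that uses only the flags documented here; the helper names build_command and run_conversion are hypothetical, and the default script path assumes you are running from the /src directory.

```python
import subprocess
import sys


def build_command(input_dir, output_dir, form_parsing=False,
                  results_parsing=False, script="MCI_JSON2TSV.py"):
    """Build the argument list for one MCI_JSON2TSV run."""
    # Use the current interpreter so the same environment's pandas/numpy are used.
    cmd = [sys.executable, script, "-d", input_dir, "-o", output_dir]
    if form_parsing:
        cmd.append("-f")  # separate TSVs per COG form type
    if results_parsing:
        cmd.append("-r")  # separate TSVs for IGM variant results
    return cmd


def run_conversion(input_dir, output_dir, **options):
    """Run the script and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(input_dir, output_dir, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_command("json_in", "tsv_out", form_parsing=True)))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when directory names contain spaces.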
Diving Deep: Understanding the Functions
MCI_JSON2TSV comes packed with specialized functions. Here's a breakdown:
- MCI_JSON2TSV.refresh_date(): Returns the current date and time.
- MCI_JSON2TSV.distinguisher(f_path: str): Identifies the type of a JSON file (COG, IGM, or other).
- MCI_JSON2TSV.distinguish(dir_path: str): Categorizes all JSON files in a directory as COG, IGM, or other.
- cog_utils.read_cog_jsons(dir_path: str, cog_jsons: list): Reads COG JSON files and concatenates them into a Pandas DataFrame.
- cog_utils.custom_json_parser(pairs: dict): Handles duplicate keys in JSON files, preserving all data.
- cog_utils.expand_cog_df(df: DataFrame): Parses participant JSON data and outputs a TSV with updated field names (e.g., DEMOGRAPHY.DM_BRTHDAT). Each form instance is output as a row in the TSV.
- cog_utils.cog_to_tsv(dir_path: str, cog_jsons: list, cog_op: str, timestamp: str): Orchestrates the reading and transformation of COG JSON files.
- cog_utils.form_parser(df: pd.DataFrame, timestamp: str, cog_op: str): Splits transformed JSON data into separate TSVs for each form type.
- igm_utils.null_n_strip(df: DataFrame, get_time: str): Formats strings in IGM JSONs, handling null values.
- igm_utils.flatten_igm(json_obj: dict, parent_key='', flatten_dict=None, parse_type=None): Flattens nested IGM JSON into an unnested structure.
- igm_utils.full_form_convert(flatten_dict: dict): Converts flattened JSON to a Pandas DataFrame.
- igm_utils.igm_to_tsv(dir_path: str, igm_jsons: list, assay_type: str, igm_op: str, timestamp: str, results_parse: bool): Orchestrates the reading and conversion of IGM JSON files, handling different assay types (Archer Fusion, WXS, methylation).
- igm_utils.igm_results_variants_parsing(form: dict, form_name: str, assay_type: str, results_types: list): Parses results sections into long format.
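Two of the ideas in this list, preserving duplicate JSON keys and flattening nested JSON, are easy to illustrate in isolation. The sketch below is not the code from cog_utils or igm_utils: keep_duplicate_keys and flatten are hypothetical stand-ins, the dotted key separator is an assumption, and the sample JSON is invented for the example.

```python
import json


def keep_duplicate_keys(pairs):
    """object_pairs_hook that collects repeated keys into a list.

    By default json.loads keeps only the last value for a repeated key;
    this hook keeps every value, in order of appearance.
    """
    merged = {}
    repeated = set()  # keys already converted into a list of values
    for key, value in pairs:
        if key not in merged:
            merged[key] = value
        elif key in repeated:
            merged[key].append(value)
        else:
            merged[key] = [merged[key], value]
            repeated.add(key)
    return merged


def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts and lists into one dict with joined keys.

    List items use their index as the key part. Empty containers
    contribute no entries.
    """
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(index), value) for index, value in enumerate(obj))
    else:
        return {parent_key: obj}
    flat = {}
    for key, value in items:
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten(value, full_key, sep))
    return flat


# Invented sample with a repeated "form" key.
raw = '{"id": "P1", "form": {"a": 1}, "form": {"a": 2}}'
parsed = json.loads(raw, object_pairs_hook=keep_duplicate_keys)
flat = flatten(parsed)
print(flat)  # {'id': 'P1', 'form.0.a': 1, 'form.1.a': 2}
```

A flat dictionary like this maps directly onto one row of a table, which is why flattening is a natural step before building a DataFrame.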
Real-World Example: Streamlining Childhood Cancer Research
Imagine you have a directory filled with hundreds of COG and IGM JSON files related to childhood cancer patients. Using MCI_JSON2TSV, you can:
- Convert all JSON files into a single, aggregated TSV file for a comprehensive overview.
- Generate separate TSV files for each form type (Demography, Treatment, etc.) for targeted analysis.
- Extract specific variant information from IGM JSON files to accelerate genomic studies.
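Once an aggregated TSV exists, its columns can be grouped by form for targeted analysis. The sketch below assumes column names follow the FORM.FIELD pattern seen in DEMOGRAPHY.DM_BRTHDAT; the helper columns_by_form is hypothetical, and the sample row and the other column names are invented purely for illustration.

```python
import csv
import io
from collections import defaultdict


def columns_by_form(header):
    """Group column names by the form prefix before the first dot."""
    groups = defaultdict(list)
    for column in header:
        form, sep, _field = column.partition(".")
        # Columns without a dot are collected under the empty-string key.
        groups[form if sep else ""].append(column)
    return dict(groups)


# Invented sample; only DEMOGRAPHY.DM_BRTHDAT is a field name from the text above.
sample = (
    "participant_id\tDEMOGRAPHY.DM_BRTHDAT\tFOLLOW_UP.EXAMPLE_FIELD\n"
    "P1\t2010-01-01\tx\n"
)
reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)
groups = columns_by_form(header)
for form, columns in groups.items():
    print(form or "(no form prefix)", columns)
```

For a real output file, replace the in-memory sample with open(path, newline="") and keep delimiter="\t".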
Get Involved
For questions or to contribute, please reach out to TBD