Extract and Transform Childhood Cancer Data: A Guide to Using MCI_JSON2TSV for Enhanced Analysis
Unlocking insights from childhood cancer data requires efficient tools. This guide introduces MCI_JSON2TSV, a powerful script for transforming complex JSON files into accessible TSV format. Learn how to use it to aggregate, parse, and analyze critical research data.
Why Use MCI_JSON2TSV for Childhood Cancer Data Analysis?
- Consolidate data: Aggregate data from multiple JSON files into a single TSV file for easier analysis.
- Simplify complex data: Transform nested JSON structures into a flat, tabular format.
- Extract specific information: Parse JSON files by form type or variant results section.
- Automate data processing: Streamline your workflow with a script designed for large-scale data transformation.
- Focus on research: Spend less time wrangling data and more time on valuable analysis.
Getting Started with MCI_JSON2TSV
1. System Requirements
Before diving in, ensure your system meets these requirements:
- Python: Version 3.8 or higher.
- pandas: Version 2.0 or higher.
- numpy: Version 2.0 or higher.
- (Optional) pytest & pytest-mock: For developers running unit tests.
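The requirements above can be checked before running the script. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical and not part of MCI_JSON2TSV, and the version parsing is deliberately simple (it ignores pre-release suffixes).

```python
import sys
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '2.1.4' into (2, 1, 4), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums: dict) -> dict:
    """Report whether Python and each listed package meet the stated minimum."""
    report = {"python": sys.version_info[:2] >= (3, 8)}
    for package, minimum in minimums.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            # A package that is not installed cannot satisfy the requirement.
            report[package] = False
    return report


if __name__ == "__main__":
    print(check_environment({"pandas": "2.0", "numpy": "2.0"}))
```

Any entry reported as False points to a package that is missing or older than the listed minimum.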
2. Installation Guide
Follow these steps to install MCI_JSON2TSV:
- Clone the repository:
git clone https://github.com/CBIIT/ChildhoodCancerDataInitiative-MCI_JSON2TSV.git
- Navigate to the directory:
cd ChildhoodCancerDataInitiative-MCI_JSON2TSV
- Install dependencies:
pip install pandas numpy
(and pytest pytest-mock, if needed)
3. Basic Usage
The script is controlled through four command-line options:
- -d/--directory: Specifies the directory containing your JSON files.
- -o/--output_path: Defines the output directory for the generated TSV files.
- -f/--form_parsing (optional): Parses COG JSONs into separate TSV files based on form type (e.g., DEMOGRAPHY, FINAL_DIAGNOSIS).
- -r/--results_variants_section_parse (optional): Parses IGM JSONs into TSV files containing variant result information (methylation, somatic, and germline variants).
Example Scenario:
- To convert all JSON files in /input_jsons and save the aggregated TSV to /output_tsvs, pass /input_jsons to -d and /output_tsvs to -o.
- To also parse each COG form into its own TSV in /output_tsvs, add the -f flag to the same command.
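A minimal sketch of how these two invocations could be assembled from the documented options. It assumes the entry point is a script file named MCI_JSON2TSV.py and that -f and -r are simple on/off switches; neither detail is confirmed by this guide, so check the repository before relying on them. The `build_command` helper is hypothetical.

```python
import shlex

# Assumed entry-point filename; not confirmed by this guide.
SCRIPT = "MCI_JSON2TSV.py"


def build_command(directory: str, output_path: str,
                  form_parsing: bool = False,
                  results_variants: bool = False) -> list:
    """Build an argument list from the documented -d, -o, -f and -r options."""
    command = ["python", SCRIPT, "-d", directory, "-o", output_path]
    if form_parsing:
        command.append("-f")  # split COG data into one TSV per form
    if results_variants:
        command.append("-r")  # parse IGM variant results sections
    return command


if __name__ == "__main__":
    # Aggregate every JSON file in /input_jsons into TSVs under /output_tsvs.
    print(shlex.join(build_command("/input_jsons", "/output_tsvs")))
    # Same conversion, additionally writing one TSV per COG form.
    print(shlex.join(build_command("/input_jsons", "/output_tsvs", form_parsing=True)))
```

The printed strings can be pasted into a shell, or the lists can be handed to `subprocess.run` directly.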
Key Functions Decoded
The functions are grouped under three namespaces: MCI_JSON2TSV (date stamping and file-type detection), cog_utils (COG JSON processing), and igm_utils (IGM JSON processing).
- MCI_JSON2TSV.refresh_date(): Returns the current date and time, useful for timestamping output files.
- MCI_JSON2TSV.distinguisher(f_path: str): Identifies the type of JSON file (COG, IGM, or other) based on its structure.
- MCI_JSON2TSV.distinguish(dir_path: str): Categorizes all JSON files in a directory by file type (COG, IGM, or other), creating separate lists.
- cog_utils.read_cog_jsons(dir_path: str, cog_jsons: list): Reads COG JSON files and concatenates them into a pandas DataFrame.
- cog_utils.custom_json_parser(pairs: dict): Handles duplicate keys in JSON files, ensuring data integrity.
- cog_utils.expand_cog_df(df: DataFrame): Transforms COG JSON data into a DataFrame with updated field names, reflecting the form (e.g., DEMOGRAPHY.DM_BRTHDAT).
- cog_utils.cog_to_tsv(dir_path: str, cog_jsons: list, cog_op: str, timestamp: str): Orchestrates the reading, transformation, and aggregation of COG JSONs into a TSV file.
- cog_utils.form_parser(df: pd.DataFrame, timestamp: str, cog_op: str): Splits the transformed COG JSON data into separate TSV files for each form type, such as Demography or Follow-Up.
- igm_utils.null_n_strip(df: DataFrame, get_time: str): Formats strings within IGM JSON files by handling null values and removing surrounding whitespace.
- igm_utils.flatten_igm(json_obj: dict, parent_key='', flatten_dict=None, parse_type=None): Recursively flattens nested IGM JSON structures, simplifying complex data into a more usable format.
- igm_utils.full_form_convert(flatten_dict: dict): Converts the flattened IGM JSON data into a pandas DataFrame for further analysis.
- igm_utils.igm_to_tsv(dir_path: str, igm_jsons: list, assay_type: str, igm_op: str, timestamp: str, results_parse: bool): Manages the reading, transformation, and aggregation of IGM JSON files into a TSV format.
- igm_utils.igm_results_variants_parsing(form: dict, form_name: str, assay_type: str, results_types: list): Facilitates the parsing of specific results sections (in long format) from IGM JSON files.
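Two of the ideas in this list, duplicate-key handling and recursive flattening, can be illustrated with the standard library alone. This is a simplified sketch and not the project's code: how the real custom_json_parser resolves duplicates and how flatten_igm names its keys are not described here, so the numeric-suffix and dotted-key conventions below are assumptions, and the function names and sample JSON are made up for illustration.

```python
import json


def keep_duplicate_keys(pairs):
    """object_pairs_hook that keeps repeated keys by adding a numeric suffix."""
    result = {}
    counts = {}
    for key, value in pairs:
        if key in result:
            counts[key] = counts.get(key, 1) + 1
            result[f"{key}_{counts[key]}"] = value
        else:
            result[key] = value
    return result


def flatten(obj, parent_key="", sep="."):
    """Recursively flatten nested dicts and lists into one flat dict."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{parent_key}{sep}{key}" if parent_key else key
            flat.update(flatten(value, child, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            child = f"{parent_key}{sep}{index}" if parent_key else str(index)
            flat.update(flatten(value, child, sep))
    else:
        # Scalars (str, int, float, bool, None) end the recursion.
        flat[parent_key] = obj
    return flat


# Made-up input with a repeated "note" key inside a nested object.
raw = '{"form": {"id": "A", "note": "first", "note": "second"}, "tags": ["x", "y"]}'

# Plain json.loads silently keeps only the last duplicate; the hook keeps both.
parsed = json.loads(raw, object_pairs_hook=keep_duplicate_keys)
flat = flatten(parsed)
print(flat)
```

Without the hook, `json.loads(raw)` would retain only the last "note" value, which is the kind of silent data loss a custom parser guards against.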
Advanced Parsing for Deeper Insights using IGM and COG Utilities
Parsing COG Data by Form Type
To further refine your analysis, MCI_JSON2TSV allows you to parse COG data into separate TSV files based on form type.
- Use the -f or --form_parsing flag to enable this feature.
- Each form (e.g., DEMOGRAPHY, TREATMENT, FOLLOW_UP) will be extracted into its own TSV file.
This allows for targeted analysis of specific data subsets.
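The idea of splitting by form can be sketched as follows, assuming columns follow the FORM.FIELD naming shown earlier (e.g., DEMOGRAPHY.DM_BRTHDAT). This is an illustration with the standard library, not the project's form_parser; the `subject_id` column, the `FOLLOW_UP.EXAMPLE_FIELD` name, and all values are invented placeholders.

```python
import csv
import io


def split_by_form(rows, id_column):
    """Group FORM.FIELD columns into one list of records per form."""
    forms = {}
    for row in rows:
        per_form = {}
        for column, value in row.items():
            if "." not in column:
                continue  # not a form-prefixed column (e.g., the id itself)
            form, field = column.split(".", 1)
            record = per_form.setdefault(form, {id_column: row[id_column]})
            record[field] = value
        for form, record in per_form.items():
            forms.setdefault(form, []).append(record)
    return forms


def to_tsv(records):
    """Serialize a list of dicts as tab-separated text with a header row."""
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder rows; identifiers and values are invented for illustration.
rows = [
    {"subject_id": "P001", "DEMOGRAPHY.DM_BRTHDAT": "2010-01-01",
     "FOLLOW_UP.EXAMPLE_FIELD": "value"},
    {"subject_id": "P002", "DEMOGRAPHY.DM_BRTHDAT": "2012-05-05"},
]

forms = split_by_form(rows, "subject_id")
for form_name, records in forms.items():
    print(form_name)
    print(to_tsv(records))
```

Each form's table carries the identifier column, so the per-form TSVs can be joined back together later.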
Extracting Variant Results from IGM Data
To extract variant results (methylation, somatic, and germline) from IGM JSON files, use the -r or --results_variants_section_parse flag. This option parses the results sections and writes them to a long-format TSV.
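As an illustration of what a long-format result can look like, the sketch below emits one row per variant entry, tagged with the results section it came from. The input layout (`subject_id`, a `results` object keyed by section name, and the `gene` and `change` fields) is entirely hypothetical; the actual IGM JSON structure and the columns written by the script are not described in this guide.

```python
def variants_to_long(record, id_key, results_key, result_types):
    """Return one row per variant, labeled with its results section."""
    rows = []
    sections = record.get(results_key, {})
    for result_type in result_types:
        for variant in sections.get(result_type, []):
            row = {id_key: record.get(id_key), "result_type": result_type}
            row.update(variant)  # copy the variant's own fields into the row
            rows.append(row)
    return rows


# Hypothetical record; field names and values are placeholders.
record = {
    "subject_id": "P001",
    "results": {
        "somatic": [
            {"gene": "GENE_A", "change": "c.1A>T"},
            {"gene": "GENE_B", "change": "c.2C>G"},
        ],
        "germline": [{"gene": "GENE_C", "change": "c.3G>A"}],
        "methylation": [],
    },
}

long_rows = variants_to_long(
    record, "subject_id", "results", ["methylation", "somatic", "germline"]
)
for row in long_rows:
    print(row)
```

In this shape, adding more variants adds rows rather than columns, which keeps filtering and grouping by result type straightforward.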
Contributing and Getting Help
MCI_JSON2TSV is a valuable tool for accelerating childhood cancer research. Reach out to the developers if you have questions or would like to make contributions.