Metadata-Version: 2.1
Name: ctodd-python-lib-data-science
Version: 1.0.0
Summary: Python utilities used for practicing data science and engineering
Home-page: https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science
Author: Christopher H. Todd
Author-email: Christopher.Hayden.Todd@gmail.com
License: MIT
Description: # Christopher H. Todd's PROJECT_STRING_NAME
        
        The PROJECT_GIT_NAME project is responsible for ...
        
        The library ...
        
        ## Table of Contents
        
        - [Dependencies](#dependencies)
        - [Libraries](#libraries)
        - [Example Scripts](#example-scripts)
        - [Notes](#notes)
        - [TODO](#todo)
        
        ## Dependencies
        
        ### Python Packages
        
        - great-expectations>=0.4.5
        - pandas>=0.24.2
        - tensorflow>=1.13.1
        
        ## Libraries
        
        ### [data_engineering_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/data_engineering_helpers.py)
        
        Library for handling repetitive data engineering tasks. It includes functions for transforming dictionaries and pandas DataFrames.
        
        Functions:
        
        ```
        def remove_overly_null_columns(df, percentage_null=.25):
            """
                Purpose:
                    Remove columns where the share of null values
                    exceeds the passed-in percentage. This defaults
                    to 25%.
                Args:
                    df (Pandas DataFrame): DataFrame to remove columns
                        from
                    percentage_null (float): Percentage of null values
                        that will be the threshold for removing or
                        keeping columns. Defaults to .25 (25%)
                Return
                    df (Pandas DataFrame): DataFrame with columns removed
                        based on thresholds
            """
        ```
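        
        As a rough illustration of the thresholding described above, the following sketch uses plain pandas. It is not the library's implementation; the name `drop_null_heavy_columns` and the choice to drop only columns whose null share strictly exceeds the threshold are assumptions.
        
        ```python
        import pandas as pd
        
        
        def drop_null_heavy_columns(df, percentage_null=0.25):
            """Hypothetical stand-in: keep columns whose null share is within the threshold."""
            null_share = df.isnull().mean()  # fraction of null values per column
            keep = null_share[null_share <= percentage_null].index
            return df[keep]
        
        
        example = pd.DataFrame({
            "mostly_null": [None, None, None, 1.0],  # 75% null, above the 25% default
            "complete": [1, 2, 3, 4],                # no nulls
        })
        cleaned = drop_null_heavy_columns(example)
        ```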
        
        ```
        def remove_high_cardinality_numerical_columns(df, percentage_unique=1):
            """
                Purpose:
                    Remove numerical columns where the count of unique
                    values matches the count of rows. These are usually
                    unique identifiers (primary keys in a database)
                    that are not useful for modeling and can result
                    in poor model performance. percentage_unique
                    defaults to 100%, but a different value can be passed in
                Args:
                    df (Pandas DataFrame): DataFrame to remove columns
                        from
                    percentage_unique (float): Percentage of unique values
                        that will be the threshold for removing or
                        keeping columns. Defaults to 1 (100%)
                Return
                    df (Pandas DataFrame): DataFrame with columns removed
                        based on thresholds
            """
        ```
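        
        The idea of spotting identifier-like columns can be sketched with pandas alone. This is an assumption-laden sketch, not the library's code: `drop_identifier_like_columns` is a hypothetical name, and comparing the unique share with `>=` is one possible reading of the threshold.
        
        ```python
        import pandas as pd
        
        
        def drop_identifier_like_columns(df, percentage_unique=1):
            """Hypothetical stand-in: drop numeric columns whose unique share meets the threshold."""
            numeric_columns = df.select_dtypes(include="number").columns
            unique_share = df[numeric_columns].nunique() / len(df)
            to_drop = unique_share[unique_share >= percentage_unique].index
            return df.drop(columns=to_drop)
        
        
        example = pd.DataFrame({
            "row_id": [101, 102, 103, 104],  # every value is unique
            "score": [1, 1, 2, 2],           # only half of the values are unique
        })
        trimmed = drop_identifier_like_columns(example)
        ```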
        
        ```
        def remove_high_cardinality_categorical_columns(df, max_unique_values=20):
            """
                Purpose:
                    Remove categorical columns where the count of unique
                    values is over a specified threshold. Such columns
                    are difficult to transform into dummies, and would
                    not work for logistic/linear regression.
                Args:
                    df (Pandas DataFrame): DataFrame to remove columns
                        from
                    max_unique_values (int): Integer of unique values
                        that is the threshold to remove column
                Return
                    df (Pandas DataFrame): DataFrame with columns removed
                        based on thresholds
            """
        ```
        
        ```
        def remove_single_value_columns(df):
            """
                Purpose:
                    Remove columns with a single value
                Args:
                    df (Pandas DataFrame): DataFrame to remove columns
                        from
                Return
                    df (Pandas DataFrame): DataFrame with columns removed
            """
        ```
        
        ```
        def remove_quantile_equality_columns(df, low_quantile=.05, high_quantile=.95):
            """
                Purpose:
                    Remove columns where the low quantile matches the
                    high quantile (the data is not well spread out and
                    is heavily influenced by outliers)
                Args:
                    df (Pandas DataFrame): DataFrame to remove columns
                        from
                    low_quantile (float): Percentage quantile to compare
                    high_quantile (float): Percentage quantile to compare
                Return
                    df (Pandas DataFrame): DataFrame with columns removed
            """
        ```
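        
        A minimal sketch of the quantile comparison, assuming it applies to numeric columns only. The function name `drop_quantile_equal_columns` is hypothetical and the library may implement the check differently.
        
        ```python
        import pandas as pd
        
        
        def drop_quantile_equal_columns(df, low_quantile=0.05, high_quantile=0.95):
            """Hypothetical stand-in: drop numeric columns whose two quantiles coincide."""
            numeric = df.select_dtypes(include="number")
            low = numeric.quantile(low_quantile)    # one value per numeric column
            high = numeric.quantile(high_quantile)
            to_drop = low[low == high].index
            return df.drop(columns=to_drop)
        
        
        example = pd.DataFrame({
            "flat_with_outlier": [0] * 99 + [1000],  # both quantiles fall on 0
            "spread": list(range(100)),              # quantiles differ
        })
        spread_only = drop_quantile_equal_columns(example)
        ```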
        
        ```
        def mask_outliers_numerical_columns(df, low_quantile=.05, high_quantile=.95):
            """
                Purpose:
                    Update outliers to be equal to the low_quantile and
                    high_quantile values specified.
                Args:
                    df (Pandas DataFrame): DataFrame to update data
                    low_quantile (float): Percentage quantile to set values
                    high_quantile (float): Percentage quantile to set values
                Return
                    df (Pandas DataFrame): DataFrame with columns updated
            """
        ```
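        
        Masking outliers to quantile values can be approximated with `Series.clip`. The sketch below is one way to do it under that assumption; `clip_to_quantiles` is a hypothetical name and not the library's function.
        
        ```python
        import pandas as pd
        
        
        def clip_to_quantiles(df, low_quantile=0.05, high_quantile=0.95):
            """Hypothetical stand-in: cap numeric values at the given quantiles."""
            result = df.copy()  # leave the caller's DataFrame untouched
            for column in result.select_dtypes(include="number").columns:
                low = result[column].quantile(low_quantile)
                high = result[column].quantile(high_quantile)
                result[column] = result[column].clip(lower=low, upper=high)
            return result
        
        
        example = pd.DataFrame({"value": list(range(101))})
        clipped = clip_to_quantiles(example)
        ```
        
        Values between the two quantiles are unchanged; only the tails are pulled in to the quantile values.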
        
        ```
        def convert_categorical_columns_to_dummies(df, drop_first=True):
            """
                Purpose:
                    Convert categorical values into dummies. Will also
                    remove the initial column being converted. If
                    drop_first is true, will remove one of the
                    dummy variables to prevent multicollinearity
                Args:
                    df (Pandas DataFrame): DataFrame to convert columns
                    drop_first (bool): to remove or not remove a column
                        from dummies generated
                Return
                    df (Pandas DataFrame): DataFrame with columns converted
            """
        ```
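        
        The effect of `drop_first` can be seen with pandas' own `get_dummies`. Whether the library delegates to this function is an assumption; the example only shows the behavior the docstring describes.
        
        ```python
        import pandas as pd
        
        example = pd.DataFrame({
            "color": ["red", "green", "blue"],
            "size": [1, 2, 3],
        })
        
        # The original "color" column is replaced by dummy columns. With
        # drop_first=True the first category ("blue") gets no column of its own.
        encoded = pd.get_dummies(example, columns=["color"], drop_first=True)
        ```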
        
        ```
        def ensure_categorical_columns_all_string(df):
            """
                Purpose:
                    Ensure all values for Categorical Values are strings
                    and converts any non-string value into strings
                Args:
                    df (Pandas DataFrame): DataFrame to convert columns
                Return
                    df (Pandas DataFrame): DataFrame with columns converted
            """
        ```
        
        ```
        def encode_categorical_columns_as_integer(df):
            """
                Purpose:
                    Convert Categorical Values into single value
                    using sklearn LabelEncoder
                Args:
                    df (Pandas DataFrame): DataFrame to convert columns
                Return
                    df (Pandas DataFrame): DataFrame with columns converted
            """
        ```
        
        ```
        def replace_null_values_numeric_columns(df, replace_operation='median'):
            """
                Purpose:
                    Replace all null values in a dataframe with other
                    values. Options include 0, mean, and median; the
                    default operation converts numeric columns to
                    median
                Args:
                    df (Pandas DataFrame): DataFrame to replace null
                        values in
                    replace_operation (string/enum): operation to perform
                        in replacing null values in the dataframe
                Return
                    df (Pandas DataFrame): DataFrame with nulls replaced
            """
        ```
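        
        A sketch of the three replacement options, using pandas directly. The function name `fill_numeric_nulls` and the option string `"zero"` are assumptions; the docstring above does not state which string selects the 0 option.
        
        ```python
        import pandas as pd
        
        
        def fill_numeric_nulls(df, replace_operation="median"):
            """Hypothetical stand-in: fill nulls in numeric columns."""
            result = df.copy()
            for column in result.select_dtypes(include="number").columns:
                if replace_operation == "median":
                    fill_value = result[column].median()
                elif replace_operation == "mean":
                    fill_value = result[column].mean()
                elif replace_operation == "zero":  # assumed option name
                    fill_value = 0
                else:
                    raise ValueError(f"Unknown operation: {replace_operation}")
                result[column] = result[column].fillna(fill_value)
            return result
        
        
        example = pd.DataFrame({"x": [1.0, None, 3.0, 100.0]})
        filled = fill_numeric_nulls(example)
        ```
        
        The median is less sensitive to the outlier in this column than the mean would be, which is one reason it makes a reasonable default.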
        
        ```
        def replace_null_values_categorical_columns(df):
            """
                Purpose:
                    Replace all null values in a dataframe with "Unknown"
                Args:
                    df (Pandas DataFrame): DataFrame to replace null
                        values in
                Return
                    df (Pandas DataFrame): DataFrame with nulls replaced
            """
        ```
        
        ```
        def get_categorical_columns(df):
            """
                Purpose:
                    Returns the categorical columns in a
                    DataFrame
                Args:
                    df (Pandas DataFrame): DataFrame to describe
                Return
                    categorical_columns (list): List of string
                        names of categorical columns
            """
        ```
        
        
        ```
        def get_numeric_columns(df):
            """
                Purpose:
                    Returns the numeric columns in a
                    DataFrame
                Args:
                    df (Pandas DataFrame): DataFrame to describe
                Return
                    numeric_columns (list): List of string
                        names of numeric columns
            """
        ```
        
        
        ```
        def get_columns_with_null_values(df):
            """
                Purpose:
                    Get Columns with Null Values
                Args:
                    df (Pandas DataFrame): DataFrame to describe
                Return
                    columns_with_nulls (dict): Dictionary where
                        keys are columns with nulls and the value
                        is the number of nulls in the column
            """
        ```
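        
        The returned dictionary shape can be illustrated with a few lines of pandas. This is a sketch of the described result, not the library's code.
        
        ```python
        import pandas as pd
        
        example = pd.DataFrame({
            "a": [1.0, None, 3.0],
            "b": [1, 2, 3],
            "c": [None, None, "x"],
        })
        
        # Count nulls per column, then keep only the columns that have any.
        null_counts = example.isnull().sum()
        columns_with_nulls = {
            column: int(count) for column, count in null_counts.items() if count > 0
        }
        ```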
        
        ### [data_exploration_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/data_exploration_helpers.py)
        
        Library for aiding the understanding and investigation into the data provided for modeling. These helpers will help explain, graph, and explore the data
        
        Functions:
        
        ```
        def get_numerical_column_statistics(df):
            """
                Purpose:
                    Describe the numerical columns in a dataframe.
                    This will include, total_count, count_null, count_0,
                    mean, median, mode, sum, 5% quantile, and 95% quantile.
                Args:
                    df (Pandas DataFrame): DataFrame to describe
                Return
                    num_statistics (dictionary): Dictionary with key being
                    the column and the data being statistics for the
                    column
            """
        ```
        
        
        ```
        def get_column_correlation(df):
            """
                Purpose:
                    Determine the signed correlation between
                    all column pairs in a passed-in DataFrame.
                    This is useful if you are looking for both
                    the strength of the correlation
                    and the direction of the correlation
                Args:
                    df (Pandas DataFrame): DataFrame to determine correlation
                Return
                    unique_value_correlation (Pandas DataFrame): DataFrame
                    of correlations for each column set in the DataFrame
            """
        ```
        
        
        ```
        def get_column_absolute_correlation(df):
            """
                Purpose:
                    Determine the absolute correlation between
                    all column pairs in a passed in DataFrame.
                    Absolute converts all correlations to a
                    positive value; this is useful if you are
                    only looking for the existence of a correlation
                    and not the direction.
                Args:
                    df (Pandas DataFrame): DataFrame to determine correlation
                Return
                    unique_value_abs_correlation (Pandas DataFrame): DataFrame
                    of correlations for each column set in the DataFrame
            """
        ```
        
        
        ```
        def get_column_pairs_significant_correlation(df, pos_corr=.20, neg_corr=.20):
            """
                Purpose:
                    Determine Columns with highly positive or highly
                    negative correlation. Defaults for positive and
                    negative correlations are 20% and can be passed
                    in as parameters
                Args:
                    df (Pandas DataFrame): DataFrame to determine correlation
                    pos_corr (float): Float percentage to consider a positive
                    correlation as significant. Default 20%
                    neg_corr (float): Float percentage to consider a negative
                    correlation as significant. Default 20%
                Return
                    high_positive_correlation_pairs (List of Sets): List of column
                    pairs with a high positive correlation
                    high_negative_correlation_pairs (List of Sets): List of column
                    pairs with a high negative correlation
            """
        ```
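        
        One way to collect such pairs is to walk the correlation matrix once per unordered pair. The sketch below assumes `neg_corr` is given as a magnitude (so 0.20 means a correlation of -0.20 or lower) and returns tuples rather than sets for simplicity; `significant_pairs` is a hypothetical name.
        
        ```python
        import itertools
        
        import pandas as pd
        
        
        def significant_pairs(df, pos_corr=0.20, neg_corr=0.20):
            """Hypothetical stand-in: list column pairs with strong correlation."""
            correlation = df.corr()
            positive, negative = [], []
            for left, right in itertools.combinations(correlation.columns, 2):
                value = correlation.loc[left, right]
                if value >= pos_corr:
                    positive.append((left, right))
                elif value <= -neg_corr:
                    negative.append((left, right))
            return positive, negative
        
        
        example = pd.DataFrame({
            "x": [1, 2, 3, 4, 5],
            "y": [2, 4, 6, 8, 10],  # rises with x
            "z": [5, 4, 3, 2, 1],   # falls as x rises
        })
        positive_pairs, negative_pairs = significant_pairs(example)
        ```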
        
        
        ```
        def get_unique_column_paris(df):
            """
                Purpose:
                    Get unique pairs of columns from a DataFrame. This
                    assumes there is no direction (A, B) and returns
                    a Set of column pairs that can be used for identifying
                    correlation, mapping columns, and other functions
                Args:
                    df (Pandas DataFrame): DataFrame to determine column pairs
                Return
                    unique_pairs (Set): Set of unique column pairs
            """
        ```
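        
        Undirected column pairs can be generated with `itertools.combinations`. The sketch uses `frozenset` so that (A, B) and (B, A) are the same pair; the exact container the library returns is an assumption.
        
        ```python
        import itertools
        
        import pandas as pd
        
        example = pd.DataFrame(columns=["a", "b", "c"])
        
        # Each pair appears once, regardless of column order.
        unique_pairs = {
            frozenset(pair) for pair in itertools.combinations(example.columns, 2)
        }
        ```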
        
        ### [model_persistence_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/model_persistence_helpers.py)
        
        Library for helping store/load/persist data science models using Python libraries
        
        Functions:
        
        ```
        def store_model_as_pickle(filename, config={}, metadata={}):
            """
            Purpose:
                Store a model in memory to a .pkl file for later
                usage. Also store a .config file and .metadata
                file with information about the model
            Args:
                filename (String): Filename of a pickled model (.pkl)
                config (Dict): Configuration data for the model
                metadata (Dict): Metadata related to the model/training/etc
            Return:
                N/A
            """
        ```
        
        
        ```
        def load_pickled_model(filename):
            """
            Purpose:
                Load a model that has been pickled and stored to
                persistent storage into memory
            Args:
                filename (String): Filename of a pickled model (.pkl)
            Return:
                model (Pickled Object): Pickled model loaded from .pkl
            """
        ```
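        
        The store/load round trip can be sketched with the standard library. This does not call the helpers above; the JSON format of the `.metadata` sidecar file and its naming are assumptions, since the library's file format is not documented here.
        
        ```python
        import json
        import os
        import pickle
        import tempfile
        
        # Stand-in for a trained model; any picklable object behaves the same way.
        model = {"coefficients": [0.5, -1.2], "intercept": 0.1}
        metadata = {"trained_on": "example data"}
        
        with tempfile.TemporaryDirectory() as directory:
            filename = os.path.join(directory, "model.pkl")
        
            # Persist the model itself.
            with open(filename, "wb") as model_file:
                pickle.dump(model, model_file)
        
            # Assumption: the sidecar file is JSON named after the model file.
            with open(filename.replace(".pkl", ".metadata"), "w") as metadata_file:
                json.dump(metadata, metadata_file)
        
            # Load the model back into memory.
            with open(filename, "rb") as model_file:
                restored = pickle.load(model_file)
        ```
        
        Unpickling executes code embedded in the file, so only load `.pkl` files from sources you trust.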
        
        ### [model_training_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/model_training_helpers.py)
        
        Library for helping train data science models using Python libraries
        
        Functions:
        
        ```
        def split_dataframe_for_model_training(
            df, dependent_variable, independent_variables=None, train_size=.70):
            """
                Purpose:
                    Takes in a DataFrame and creates 4 DataFrames:
                    2 DataFrames holding the X variables and 2 holding the Y variable.
                    Train size defaults to 70% and the split defaults to using
                    all passed-in columns.
                Args:
                    df (Pandas DataFrame): DataFrame to split
                    dependent_variable (string): dependent variable
                        that the model is being created to predict
                    independent_variables (List of strings): independent variables that
                        will be used to predict the dependent variable. If no columns
                        are passed, use all columns in the dataframe except the
                        dependent variable.
                    train_size (float): Percentage of rows in DataFrame
                        to use for training the model. The remaining percentage
                        will/can be used to test the model's effectiveness
                Return
                    train_x (Pandas DataFrame): DataFrame with all independent variables
                        for training the model. Size is equal to the size of the
                        base dataset multiplied by the train size
                    test_x (Pandas DataFrame): DataFrame with all independent variables
                        for testing the trained model. Size is equal to the size
                        of the base dataset multiplied by (1 - train size)
                    train_y_observed (Pandas DataFrame): DataFrame with the dependent
                        variable for training the model. Size is equal to the size
                        of the base dataset multiplied by the train size
                    test_y_observed (Pandas DataFrame): DataFrame with the dependent
                        variable for testing the trained model. Size is equal to the
                        size of the base dataset multiplied by (1 - train size)
            """
        ```
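        
        The four-way split can be sketched with `DataFrame.sample`. This is an illustration under assumptions, not the library's implementation: `split_for_training` is a hypothetical name, the `seed` parameter is added here for reproducibility, and the index is assumed to be unique.
        
        ```python
        import pandas as pd
        
        
        def split_for_training(df, dependent_variable, independent_variables=None,
                               train_size=0.70, seed=0):
            """Hypothetical stand-in: return train_x, test_x, train_y, test_y."""
            if independent_variables is None:
                independent_variables = [
                    column for column in df.columns if column != dependent_variable
                ]
            train = df.sample(frac=train_size, random_state=seed)
            test = df.drop(train.index)  # rows not sampled for training
            return (
                train[independent_variables],
                test[independent_variables],
                train[[dependent_variable]],
                test[[dependent_variable]],
            )
        
        
        example = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})
        train_x, test_x, train_y_observed, test_y_observed = split_for_training(
            example, "target"
        )
        ```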
        
        ```
        def split_dataframe_by_column(df, column):
            """
                Purpose:
                    Split dataframe into multiple dataframes based on the unique
                    values of the column passed in. The dataframe is split into smaller
                    dataframes, one for each value of the variable.
                Args:
                    df (Pandas DataFrame): DataFrame to split
                    column (string): string of the column name to split on
                Return
                    split_df (Dict of Pandas DataFrames): Dictionary with the
                        split dataframes and the value that the column maps to
                        e.g false/true/0/1
            """
        ```
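        
        A dictionary of per-value DataFrames like the one described above can be built from `groupby`. Whether the library uses `groupby` internally is an assumption; the sketch only shows the resulting shape.
        
        ```python
        import pandas as pd
        
        example = pd.DataFrame({
            "group": ["a", "b", "a"],
            "amount": [1, 2, 3],
        })
        
        # Keys are the distinct values of the column; values are the matching rows.
        split_df = {value: rows for value, rows in example.groupby("group")}
        ```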
        
        ## Example Scripts
        
        Example executable Python scripts/modules for testing and interacting with the library. These show example use-cases for the libraries and can be used as templates for developing with the libraries or to use as one-off development efforts.
        
        ### N/A
        
        ## Notes
        
         - Relies on f-string notation, which requires Python 3.6 or later. A refactor to remove f-strings could allow for development with Python 3.0.x through 3.5.x
        
        ## TODO
        
         - Unittest framework in place, but lacking tests
        
Keywords: python,libraries,numpy,pandas,data science
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >3.6
Description-Content-Type: text/markdown
