# pycarus
pycarus is a Python library designed to help you calibrate survey weights to match known population totals or proportions. This guide will walk you through the basic concepts, how to use the library, and provide examples to get you started.

[![Quality Gate Status](https://sonar.pleiade.edf.fr/api/project_badges/measure?project=gamme-conso-client_acacia_benchmarks-methodos_pycarus_AZSO3DNC2JThbS9_YjLQ&metric=alert_status&token=sqb_eb1273b5a712234f92948e6065b27e40c1c7ae34)](https://sonar.pleiade.edf.fr/dashboard?id=gamme-conso-client_acacia_benchmarks-methodos_pycarus_AZSO3DNC2JThbS9_YjLQ)
[![Coverage (unit tests)](https://sonar.pleiade.edf.fr/api/project_badges/measure?project=gamme-conso-client_acacia_benchmarks-methodos_pycarus_AZSO3DNC2JThbS9_YjLQ&metric=coverage&token=sqb_eb1273b5a712234f92948e6065b27e40c1c7ae34)](https://sonar.pleiade.edf.fr/dashboard?id=gamme-conso-client_acacia_benchmarks-methodos_pycarus_AZSO3DNC2JThbS9_YjLQ)

## Table of Contents

- [Introduction](#introduction)
- [Installation](#installation)
- [Usage](#usage)
- [Complete Example](#complete-example)

## Introduction

Survey weight calibration (known as "calage sur marges" in French) is a statistical technique used to adjust survey weights to match known population totals while preserving the structure of the survey data. This guide provides an overview of the mathematical foundations and explains how pycarus implements calibration efficiently.

### What are Population Margins?

Population margins are known values derived from the total population. These margins serve as benchmarks to ensure that the calibrated sample represents the population accurately. Margins can be expressed in two forms:

1. **Absolute Totals**
   - Total population aged 18-25: 1,000,000
   - Total male population: 600,000
   - Total female population: 650,000

2. **Means and Proportions**
   - Average age: 46 years
   - Male proportion: 48%
   - Female proportion: 52%

### The Calibration Process

The calibration process adjusts the initial survey weights ($d_k$) to produce calibrated weights ($w_k$) that satisfy predefined constraints based on the population margins. It has three main components:

1. **Initialization of Weights**
   - Every survey unit starts with an initial weight ($d_k$), which is the inverse of its probability of selection into the sample.
   - If no weights are provided, pycarus initializes them using a uniform approach: $d_k = \frac{N}{n}$, where:
     - $N$ = total population size
     - $n$ = sample size

2. **Specification of Margins**
   - Margins are provided in a dictionary format with specific structures:
     - For continuous variables:
       - Key: column name
       - Value: single margin value
     - For categorical variables:
       - Key: column name
       - Value: dictionary where:
         - Keys are unique categories
         - Values are margins for each category

3. **Adjustment of Weights**
   - The calibrated weights minimize the overall adjustment cost, as measured by a distance function, while satisfying the constraints defined by the margins.
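As an illustration of the adjustment step, the sketch below solves the simplest case, the linear distance, in closed form with NumPy. It is not pycarus's implementation: the toy sample, the population size, and the margins are made up, and `lam` (the vector of Lagrange multipliers) is a name introduced here. Under the linear distance, the calibrated weights take the form $w_k = d_k (1 + x_k^\top \lambda)$, where $x_k$ holds the calibration variables of unit $k$ and $\lambda$ solves a linear system.

```python
import numpy as np

# Hypothetical toy sample: columns are [age, is_male, is_female]
X = np.array([
    [25.0, 1.0, 0.0],
    [40.0, 0.0, 1.0],
    [31.0, 0.0, 1.0],
    [58.0, 1.0, 0.0],
])

# Uniform initial weights d_k = N / n with an assumed N = 1000 and n = 4
d = np.full(4, 250.0)

# Assumed population totals for [age, males, females]
T = np.array([39_000.0, 480.0, 520.0])

# Linear calibration: w_k = d_k * (1 + x_k' lam), where lam solves
# (sum_k d_k x_k x_k') lam = T - sum_k d_k x_k
A = X.T @ (d[:, None] * X)
lam = np.linalg.solve(A, T - X.T @ d)
w = d * (1.0 + X @ lam)

# The weighted totals X.T @ w now match T up to floating-point error
calibrated_totals = X.T @ w
```

The other distances have no closed-form solution and are usually solved iteratively, which is why they can fail to converge.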

### Mathematical Framework

Key Variables:

- $U$: The entire population (all units of interest)
- $s$: The sample, a subset of the population ($s \subset U$)
- $d_k$: The **initial weight** for unit $k$ in the sample
- $w_k$: The **calibrated weight** for unit $k$ in the sample
- $x_k$: The **calibration variables** for unit $k$ (e.g., age, sex)
- $T_x$: The known **population totals** for the calibration variables
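
With this notation, calibration is usually written as a constrained minimization. The formulation below is the standard one from the calibration literature (Deville and Särndal) and is assumed here to match what pycarus solves; $G$ denotes the chosen distance function and is notation introduced for this sketch.

$$\min_{(w_k)_{k \in s}} \; \sum_{k \in s} d_k \, G\left(\frac{w_k}{d_k}\right) \quad \text{subject to} \quad \sum_{k \in s} w_k x_k = T_x$$

The constraint states that the weighted sample totals of the calibration variables reproduce the known margins exactly, while the objective keeps each $w_k$ as close as possible to $d_k$.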

### Distance Functions

A **distance function** quantifies the penalty for deviations between the initial weights ($d_k$) and the calibrated weights ($w_k$). pycarus supports the following distance functions:

1. **Linear Distance**:
   - Advantages: Simple and almost always converges
   - Limitations: May produce negative weights

2. **Raking Distance**:
   - Advantages: Always produces positive weights; most widely used method
   - Limitations: May fail to converge when no feasible solution exists

3. **Logit Distance**:
   - Advantages: Behaves much like raking, but keeps the ratio of calibrated to initial weights within bounds
   - Limitations: May fail to converge when no feasible solution exists, especially if the bounds are tight; weight ratios tend to accumulate near the bounds

4. **Truncated Linear Distance**:
   - Advantages: Behaves much like the linear method, but keeps the ratio of calibrated to initial weights within bounds; can enforce positive weights
   - Limitations: May fail to converge when no feasible solution exists, especially if the bounds are tight; weight ratios tend to accumulate near the bounds
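
Each distance corresponds to a calibration function $F$ such that $w_k = d_k F(x_k^\top \lambda)$. The sketch below writes out the textbook forms of these functions (as given by Deville and Särndal) in plain Python. The function names and the `low`/`up` parameters are illustrative, and pycarus's internal implementation may differ.

```python
import math


def linear(u):
    """Linear distance: F(u) = 1 + u, which can become negative."""
    return 1.0 + u


def raking(u):
    """Raking distance: F(u) = exp(u), which is always positive."""
    return math.exp(u)


def truncated_linear(u, low, up):
    """Truncated linear distance: 1 + u clipped to [low, up]."""
    return min(max(1.0 + u, low), up)


def logit(u, low, up):
    """Logit distance: a smooth function bounded by low and up."""
    a = (up - low) / ((1.0 - low) * (up - 1.0))
    e = math.exp(a * u)
    numerator = low * (up - 1.0) + up * (1.0 - low) * e
    denominator = (up - 1.0) + (1.0 - low) * e
    return numerator / denominator
```

All four functions return 1 at `u = 0`, so a unit keeps its initial weight when no adjustment is needed. Only `linear` can return a negative ratio, which is the source of the negative weights mentioned above.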

### Bounded Methods: Advantages and Limitations

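Bounded methods (Logit and Truncated Linear distances) help control extreme weights:

- Restrict the weight ratio $w_k / d_k$ to the interval $[L, U]$ (here $U$ is the upper bound, not the population)
- Prevent over-representation of specific units
- Help maintain stability in estimates

Their main limitation is that calibration may fail to converge when the bounds are tight, and weight ratios tend to accumulate near $L$ and $U$.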

### Practical Considerations

1. **Convergence Issues**:
   - Calibration may fail if margins are inconsistent with sample data
   - Users should verify coherence of margins and calibration variables
   - When using bounded methods, ensure the interval $[L, U]$ is sufficiently wide

2. **Method Selection**:
   - Use the raking method for general purposes
   - Choose bounded methods when weight control is crucial
   - Avoid the linear method in production, since it may produce negative weights
   - Start with simpler methods and progress to more complex ones if needed
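
Some inconsistencies between margins and sample data can be caught before calibrating. The sketch below is one possible pre-check, written for this guide: `check_margins` is a hypothetical helper, not part of pycarus, and it only covers two obvious cases (categories that do not line up between sample and margins, and categorical variables whose margins imply different population sizes).

```python
def check_margins(sample, margins, tol=1e-9):
    """Return a list of detected problems; an empty list means none were found.

    `sample` maps column names to lists of values and `margins` follows the
    dictionary structure described above. Hypothetical helper, not pycarus API.
    """
    problems = []
    implied_sizes = {}
    for name, margin in margins.items():
        if name not in sample:
            problems.append(f"{name}: column not found in the sample")
            continue
        if isinstance(margin, dict):
            observed = set(sample[name])
            for category in margin:
                if category not in observed:
                    problems.append(
                        f"{name}: category {category!r} has a margin but no sampled unit"
                    )
            for category in observed:
                if category not in margin:
                    problems.append(
                        f"{name}: category {category!r} is sampled but has no margin"
                    )
            implied_sizes[name] = sum(margin.values())
    if implied_sizes:
        sizes = list(implied_sizes.values())
        if max(sizes) - min(sizes) > tol * max(1.0, abs(max(sizes))):
            problems.append(
                f"categorical margins imply different population sizes: {implied_sizes}"
            )
    return problems


# Toy data for illustration only
sample = {
    "age": [10, 24, 22, 28],
    "country": ["France", "France", "USA", "Japan"],
}
ok = check_margins(
    sample, {"age": 1000, "country": {"France": 10, "USA": 10, "Japan": 24}}
)
bad = check_margins(sample, {"country": {"France": 10, "Spain": 5}})
```

A category with a positive margin but no sampled unit makes the constraints impossible to satisfy, whatever the distance function, so it is worth detecting before any solver runs.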

## Installation

This section explains how to install pycarus using either `pip` or `uv`. Choose the method that best fits your workflow.

### Installing with uv

To install pycarus using `uv`, run the following command:
```bash
uv add python-icarus
```


### Installing with pip

To install pycarus using `pip`, run the following command:
```bash
pip install python-icarus
```

### Notes
- Dependencies: Both installation methods automatically handle pycarus dependencies based on the configuration in the `pyproject.toml` file.
- Python version: Ensure that your Python version is compatible with pycarus. Check the README for version requirements.
- Virtual environments: It is recommended to use a virtual environment (e.g., `venv` or `conda`) to avoid conflicts with system-wide Python packages.

## Usage

### Basic Usage
The main function in pycarus is `calibrate()`, which handles the calibration process. Here's a simple example to get you started:

```python
from pycarus import calibrate
import pandas as pd

# Sample survey data
data = pd.DataFrame(
    {
        "age": [10, 24, 22, 28, 30, 35, 50, 41, 16, 33, 8, 45],
        "country": [
            "France",
            "France",
            "France",
            "France",
            "USA",
            "USA",
            "USA",
            "Japan",
            "Japan",
            "Japan",
            "Japan",
            "Japan",
        ],
        "weights": [2, 2, 2, 2, 4, 4, 4, 5.5, 5.5, 5.5, 5.5, 5.5],
    }
)

# Margins: over the whole population, the variables are known to sum to these values
margins_dict = {
    "age": 1000,
    "country": {
        "France": 10,
        "USA": 10,
        "Japan": 24,
    },
}


# Calibration
weights, info = calibrate(
    data,
    margins_dict,
    initial_weights_column="weights",
    method="logit",
    bounds=(0.4, 2.5)
)

# Analyze results
print(f"Convergence achieved in {info.iterations} iterations")
print(f"Relative gaps: {info.relative_gaps}")
```

### Defining Margins
Margins are the target values you want your survey data to match. They can be specified for both continuous and categorical variables.

#### Continuous Variables
For continuous variables, specify the population total directly:
```python
margins = {
    'age': 1_035,        # Sum of ages in population
    'income': 1_500_000   # Total income in population
}
```
#### Categorical Variables
For categorical variables, specify the count for each unique category, as a dictionary:

```python
margins = {
    'gender': {
        'M': 520,   # Number of males
        'F': 480    # Number of females
    }
}
```

#### Working with Proportions
If your margins are expressed as means or proportions, set `margins_as_proportions=True` and specify the total population size with `population_total`:

```python
# Margins as proportions
margins = {
    "age": 45,  # Average age
    "income": 120_000,  # Average income
    'gender': {
        'M': 0.52,   # Proportion of males
        'F': 0.48    # Proportion of females
    }
}

# Calibration using proportions
weights, result = calibrate(
    survey_data=data,
    margins_dict=margins,
    margins_as_proportions=True,
    population_total=1000  # Total population size
)
```

For a continuous variable, this means the margin is the population mean rather than the population sum.
Margins must be either all totals or all means/proportions; the two kinds cannot be mixed.
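
The two kinds of margins differ only by a rescaling: multiplying a mean or a proportion by the population size gives the corresponding total. The sketch below makes this explicit. `proportions_to_totals` is a hypothetical helper written for this guide, and whether pycarus performs the conversion this way internally is an assumption.

```python
def proportions_to_totals(margins, population_total):
    """Convert mean/proportion margins into totals. Hypothetical helper, not pycarus API."""
    totals = {}
    for name, margin in margins.items():
        if isinstance(margin, dict):
            # Categorical variable: proportion of each category -> count
            totals[name] = {
                category: share * population_total
                for category, share in margin.items()
            }
        else:
            # Continuous variable: population mean -> population sum
            totals[name] = margin * population_total
    return totals


margins = {"age": 45, "gender": {"M": 0.52, "F": 0.48}}
totals = proportions_to_totals(margins, population_total=1000)
```

With a population of 1000, an average age of 45 corresponds to a total of 45,000 years, and the gender proportions correspond to about 520 males and 480 females.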

### Initial Weights
You can provide initial weights through the `initial_weights_column` parameter:

```python
weights, result = calibrate(
    survey_data=data,
    margins_dict=margins,
    initial_weights_column='sampling_weights'  # Column containing initial weights
)
```

If no initial weights are specified, pycarus assumes simple random sampling:
$$d_k = \frac{N}{n}$$

where $N$ is the population size and $n$ is the sample size.
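
The following sketch shows what uniform initial weights imply for the estimated totals before any calibration. The toy values, the population size, and the variable names are made up for illustration and are not taken from pycarus.

```python
# Hypothetical toy sample
ages = [25, 40, 31, 58]
genders = ["M", "F", "F", "M"]

N = 1000         # assumed population size
n = len(ages)    # sample size
d = [N / n] * n  # uniform initial weights d_k = N / n

# Totals estimated with the initial weights only
estimated_age_total = sum(weight * age for weight, age in zip(d, ages))
estimated_gender_totals = {
    category: sum(weight for weight, g in zip(d, genders) if g == category)
    for category in ("M", "F")
}
```

Calibration then adjusts these weights so that such estimated totals match the known margins instead of merely approximating them.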

### Bounded Methods
For bounded methods like `logit` and `truncated_linear`, you need to specify bounds on the ratio of calibrated to initial weights.

Bounds are specified as a tuple or list of two values: the lower bound ($L$) and the upper bound ($U$), with $L < 1 < U$. After calibration, every weight ratio satisfies $L \leq \frac{w_k}{d_k} \leq U$.

```python
weights, result = calibrate(
    survey_data=data,
    margins_dict=margins,
    method="logit",
    bounds=[0.5, 2.0],  # Weights will be between 0.5 and 2 times initial weights
    initial_weights_column="weights"
)
```
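
To confirm after the fact that a set of calibrated weights respects the requested bounds, the ratios can be checked directly. The sketch below uses a hypothetical helper, `ratios_within_bounds`, and made-up weights; it is not part of pycarus.

```python
def ratios_within_bounds(initial_weights, calibrated_weights, bounds, tol=1e-9):
    """Check that L <= w_k / d_k <= U for every unit. Hypothetical helper."""
    low, up = bounds
    if not low < 1 < up:
        raise ValueError("bounds must satisfy L < 1 < U")
    return all(
        low - tol <= w / d <= up + tol
        for d, w in zip(initial_weights, calibrated_weights)
    )


# Made-up weights for illustration
d = [2.0, 2.0, 4.0, 5.5]
w_ok = [1.5, 3.0, 4.4, 5.0]   # ratios: 0.75, 1.5, 1.1, about 0.91
w_bad = [0.5, 3.0, 4.4, 5.0]  # first ratio is 0.25, below the lower bound

inside = ratios_within_bounds(d, w_ok, bounds=(0.5, 2.0))
outside = ratios_within_bounds(d, w_bad, bounds=(0.5, 2.0))
```

The small tolerance allows for ratios that sit on a bound up to floating-point error, which is common since bounded methods tend to push ratios toward $L$ and $U$.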

## Complete Example
Here's a comprehensive example showcasing the main features:

```python
import pandas as pd
from pycarus import calibrate

# Sample survey data
data = pd.DataFrame(
    {
        "age": [10, 24, 22, 28, 30, 35, 50, 41, 16, 33, 8, 45],
        "country": [
            "France",
            "France",
            "France",
            "France",
            "USA",
            "USA",
            "USA",
            "Japan",
            "Japan",
            "Japan",
            "Japan",
            "Japan",
        ],
        "weights": [2, 2, 2, 2, 4, 4, 4, 5.5, 5.5, 5.5, 5.5, 5.5],
    }
)

# Margins: over the whole population, the variables are known to sum to these values
margins_dict = {
    "age": 1000,
    "country": {
        "France": 10,
        "USA": 10,
        "Japan": 24,
    },
}


# Calibration
weights, info = calibrate(
    data,
    margins_dict,
    initial_weights_column="weights",
    method="logit",
    bounds=(0.4, 2.5)
)

# Analyze results
print(f"Convergence achieved in {info.iterations} iterations")
print(f"Relative gaps: {info.relative_gaps}")
```