# Stapply.ai Jobs API Client

A production-ready Python client for interacting with the Stapply.ai jobs platform, reverse-engineered from HAR file analysis.

## Overview

This API client provides programmatic access to job listings on [Stapply.ai](https://map.stapply.ai/jobs). The website is built with Next.js and uses React Server Components (RSC), requiring browser automation for proper interaction.

## Discovered APIs

Based on HAR file analysis, the following API patterns were identified:

### 1. Main Jobs Page
- **Endpoint**: `GET /jobs`
- **Description**: Returns the main jobs listing page
- **Type**: HTML page with Next.js application
- **Authentication**: None required

### 2. React Server Component (RSC) Requests
These are internal Next.js API calls for client-side navigation:

- **Pattern**: `GET /jobs/{company}?_rsc={token}`
- **Pattern**: `GET /jobs/{company}/{job-slug}?_rsc={token}`
- **Type**: `text/x-component` (Next.js RSC format)
- **Headers Required**:
  - `rsc: 1`
  - `next-router-prefetch: 1`
  - `next-router-state-tree: [encoded-state]` (for navigation)
  - `referer: https://map.stapply.ai/jobs`
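
As a rough illustration of the header set above, the following sketch builds such a request with the standard library but does not send it. The function name `build_rsc_request` and its parameters are hypothetical and not part of this client. Real `_rsc` tokens and state-tree values have to come from a captured browser session, and the server may reject requests whose values do not match.

```python
from typing import Optional
from urllib.request import Request

BASE_URL = "https://map.stapply.ai"


def build_rsc_request(path: str, rsc_token: str, state_tree: Optional[str] = None) -> Request:
    """Build a GET request carrying the RSC headers seen in the HAR file.

    `rsc_token` and `state_tree` are placeholders: real values must be
    copied from a captured browser session.
    """
    headers = {
        "rsc": "1",
        "next-router-prefetch": "1",
        "referer": f"{BASE_URL}/jobs",
    }
    if state_tree is not None:
        # Only attached for navigation requests, per the pattern above.
        headers["next-router-state-tree"] = state_tree
    return Request(f"{BASE_URL}{path}?_rsc={rsc_token}", headers=headers, method="GET")


req = build_rsc_request("/jobs/vercel/content-engineer-shkfa9", rsc_token="3heop")
print(req.full_url)
print(req.header_items())
```

Note that `urllib` normalizes header names when storing them (for example `rsc` becomes `Rsc`), which is harmless because HTTP header names are case-insensitive.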

### 3. Analytics Endpoints
- **Endpoint**: `POST /_vercel/insights/view`
- **Endpoint**: `POST /_vercel/speed-insights/vitals`
- **Description**: Vercel analytics (not required for job data)

### 4. External Job Boards
Jobs link to external application systems:
- **Greenhouse**: `https://job-boards.greenhouse.io/`
- **Amazon Jobs**: `https://account.amazon.jobs/`
- Direct company career pages
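
When post-processing scraped jobs, it can help to label which external system an apply link points to. The sketch below is one possible approach, not something the client is known to do. The helper name `classify_apply_url` and the labels it returns are assumptions, and the host suffixes are taken only from the two boards listed above.

```python
from urllib.parse import urlparse

# Host suffixes taken from the list above; anything else is treated as a
# direct company career page.
KNOWN_BOARDS = {
    "greenhouse.io": "greenhouse",
    "amazon.jobs": "amazon",
}


def classify_apply_url(url: str) -> str:
    """Return a coarse label for the system that hosts an application URL."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, label in KNOWN_BOARDS.items():
        if host == suffix or host.endswith("." + suffix):
            return label
    return "other"


print(classify_apply_url("https://job-boards.greenhouse.io/vercel/jobs/5722313004"))
```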

## Architecture

The website uses:
- **Framework**: Next.js 13+ with App Router
- **Rendering**: React Server Components (RSC)
- **Hosting**: Vercel
- **Client-side Routing**: Next.js navigation with prefetching

## Installation

### Prerequisites
- Python 3.8+
- pip

### Install Dependencies

```bash
pip install playwright
python -m playwright install chromium
```

## Usage

### Basic Example

```python
from api_client import StapplyAPIClient

# Use context manager for automatic cleanup
with StapplyAPIClient(headless=True) as client:
    # Get all jobs
    jobs = client.get_jobs()
    print(f"Found {len(jobs)} jobs")

    # Get details for a specific job
    if jobs:
        first_job = jobs[0]
        details = client.get_job_details(first_job.relative_url)
        print(f"Job: {details['title']}")
        print(f"Company: {details['company']}")
```

### Replicate Original User Action

The HAR file captured this action: "Go to /jobs and click on the first job"

```python
from api_client import StapplyAPIClient

with StapplyAPIClient() as client:
    # Replicate the exact user action
    job_details = client.click_first_job()

    print(f"Clicked on: {job_details['title']}")
    print(f"Company: {job_details['company']}")
    print(f"Apply at: {job_details['external_apply_url']}")
```

### Manual Session Management

```python
from api_client import StapplyAPIClient

client = StapplyAPIClient(headless=False)  # Show browser
client.start()

try:
    jobs = client.get_jobs()

    for job in jobs:
        print(f"{job.title} - {job.company}")

    # Save to JSON
    client.save_jobs_to_json(jobs, 'jobs_output.json')

finally:
    client.stop()
```

### Advanced Configuration

```python
from api_client import StapplyAPIClient

client = StapplyAPIClient(
    headless=True,           # Run in headless mode
    timeout=60000,           # 60 second timeout
    user_agent='Custom UA'   # Custom user agent
)
```

## API Reference

### `StapplyAPIClient`

Main client class for interacting with Stapply.ai.

#### Methods

##### `start()`
Start the browser session. Must be called before making requests unless the client is used as a context manager, which starts and stops the session automatically.

##### `stop()`
Stop the browser session and clean up resources.

##### `get_jobs(wait_time: int = 2000) -> List[Job]`
Fetch all jobs from the main jobs page.

**Parameters:**
- `wait_time`: Time to wait for page load (milliseconds)

**Returns:** List of `Job` objects

**Example:**
```python
jobs = client.get_jobs()
for job in jobs:
    print(f"{job.title} at {job.company}")
```

##### `get_job_details(job_url: str, wait_time: int = 2000) -> Dict[str, Any]`
Fetch detailed information about a specific job.

**Parameters:**
- `job_url`: Relative job URL (e.g., `/jobs/vercel/content-engineer-shkfa9`)
- `wait_time`: Time to wait for page load (milliseconds)

**Returns:** Dictionary with job details

**Example:**
```python
details = client.get_job_details('/jobs/vercel/content-engineer-shkfa9')
print(details['title'])
print(details['description'])
```
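
The relative URL follows the `/jobs/{company}/{job-slug}` pattern, so the company and slug can be recovered from it without another page load. A minimal sketch, assuming the path has exactly that shape and no query string; `split_job_path` is a hypothetical helper, not a method of the client:

```python
from typing import Tuple


def split_job_path(relative_url: str) -> Tuple[str, str]:
    """Split '/jobs/{company}/{job-slug}' into (company, slug)."""
    parts = relative_url.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "jobs":
        raise ValueError(f"Not a job detail path: {relative_url!r}")
    return parts[1], parts[2]


company, slug = split_job_path("/jobs/vercel/content-engineer-shkfa9")
print(company, slug)
```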

##### `click_first_job(wait_time: int = 2000) -> Dict[str, Any]`
Navigate to /jobs and click on the first job listing.

**Returns:** Dictionary containing the clicked job details

**Example:**
```python
job = client.click_first_job()
print(f"Clicked: {job['title']}")
```

##### `save_jobs_to_json(jobs: List[Job], filepath: str)`
Save jobs list to a JSON file.

**Parameters:**
- `jobs`: List of Job objects
- `filepath`: Output file path

### `Job` Dataclass

Represents a job listing.

**Attributes:**
- `title`: Job title
- `company`: Company name
- `url`: Full URL to job page
- `relative_url`: Relative URL (e.g., `/jobs/company/slug`)
- `slug`: Job slug identifier
- `external_apply_url`: External application URL (if available)
- `location`: Job location (if available)
- `description`: Job description (if fetched)
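
For orientation, a dataclass with these attributes might look roughly like the sketch below. The field names come from the list above, but the types, the defaults, and the names `JobSketch` and `jobs_to_json` are assumptions. The actual definition in `api_client.py` may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class JobSketch:
    """Stand-in for `Job`; field names follow the attribute list above."""
    title: str
    company: str
    url: str
    relative_url: str
    slug: str
    external_apply_url: Optional[str] = None  # assumed optional
    location: Optional[str] = None            # assumed optional
    description: Optional[str] = None         # assumed optional


def jobs_to_json(jobs: List[JobSketch]) -> str:
    """Serialize jobs to a JSON array, one object per job."""
    return json.dumps([asdict(job) for job in jobs], indent=2)


job = JobSketch(
    title="Content Engineer",
    company="vercel",
    url="https://map.stapply.ai/jobs/vercel/content-engineer-shkfa9",
    relative_url="/jobs/vercel/content-engineer-shkfa9",
    slug="content-engineer-shkfa9",
)
print(jobs_to_json([job]))
```

Using `dataclasses.asdict` keeps the JSON keys identical to the attribute names, which is one straightforward way to arrive at output shaped like the job list JSON documented in this README.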

## Running the Example

```bash
# Change to the directory that contains api_client.py (path from the original run)
cd /Users/kalilbouzigues/.reverse-api/runs/scripts/1b71990efa04
python3 api_client.py
```

This will:
1. Navigate to the jobs page
2. Click on the first job
3. Extract job details
4. Fetch all available jobs
5. Save results to `jobs_list.json`

## Authentication

Currently, no authentication is required to browse job listings on Stapply.ai. The site is publicly accessible.

## Rate Limiting

- The client includes a 2-second default delay to avoid overwhelming the server
- Playwright does not throttle requests itself; driving a real browser simply keeps traffic close to that of an ordinary visitor
- No explicit rate limit headers were observed in the HAR file
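
If you loop over many job pages, an explicit minimum interval between requests is a simple safeguard. The following is a minimal sketch of such a throttle and is not part of the client. The class name `Throttle` and its parameters are hypothetical, and the clock and sleep functions are injectable only to make the logic easy to test.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Short interval so the demo finishes quickly; 2.0 would mirror the default delay.
throttle = Throttle(min_interval=0.05)
for path in ["/jobs/vercel", "/jobs/vercel/content-engineer-shkfa9"]:
    slept = throttle.wait()
    # A call such as client.get_job_details(path) would go here.
    print(f"{path}: slept {slept:.3f}s")
```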

## Error Handling

The client handles errors in the following ways:
- Automatic browser cleanup with context managers
- Detailed logging at INFO level
- Graceful handling of missing elements
- Timeout protection for long-running operations
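
The cleanup guarantee in the first bullet rests on Python's context manager protocol. The sketch below shows the general pattern with a stand-in class that opens no browser. `ManagedSession` is hypothetical, and the real client's internals are not shown in this README.

```python
import logging

logger = logging.getLogger("stapply_sketch")


class ManagedSession:
    """Stand-in showing the start/stop pattern; it opens no real browser."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        logger.info("Starting session")
        self.running = True

    def stop(self) -> None:
        logger.info("Stopping session")
        self.running = False

    def __enter__(self) -> "ManagedSession":
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Always release resources; returning False re-raises any exception.
        self.stop()
        return False


session = ManagedSession()
try:
    with session:
        raise RuntimeError("simulated scraping failure")
except RuntimeError as err:
    logger.warning("Operation failed: %s", err)

print(session.running)  # False: stop() ran despite the exception
```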

## Technical Notes

### Why Playwright Instead of Requests?

The website uses Next.js with React Server Components (RSC), which means:
1. The initial HTML has minimal content
2. Jobs are rendered client-side via JavaScript
3. RSC responses use an undocumented, React-internal wire format that is difficult to parse reliably
4. Browser automation ensures we get the fully rendered content

### RSC Request Format

When navigating client-side, Next.js makes requests like:
```
GET /jobs/company/job-slug?_rsc=3heop
Headers:
  rsc: 1
  next-router-prefetch: 1
  next-router-state-tree: [encoded]
```

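The response has content type `text/x-component`, the streaming serialization format that React uses internally for Server Components. It is largely text-based rather than binary, but it is undocumented and can change between React versions.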

### Bot Detection

The site doesn't appear to have aggressive bot detection based on HAR analysis:
- No reCAPTCHA on job browsing
- No rate limiting observed
- Standard Chromium user agent works fine

Even so, Playwright drives a real browser engine, so its traffic resembles that of an ordinary visitor, which helps avoid detection.

## Output Format

### Job List JSON

```json
[
  {
    "title": "Content Engineer",
    "company": "vercel",
    "url": "https://map.stapply.ai/jobs/vercel/content-engineer-shkfa9",
    "relative_url": "/jobs/vercel/content-engineer-shkfa9",
    "slug": "content-engineer-shkfa9",
    "external_apply_url": "https://job-boards.greenhouse.io/vercel/jobs/5722313004",
    "location": "San Francisco, CA"
  }
]
```

### Job Details

```json
{
  "url": "https://map.stapply.ai/jobs/vercel/content-engineer-shkfa9",
  "title": "Content Engineer",
  "company": "vercel",
  "location": "San Francisco, CA",
  "description": "Full job description text...",
  "external_apply_url": "https://job-boards.greenhouse.io/vercel/jobs/5722313004",
  "scraped_at": "2024-12-25T18:48:46.135000"
}
```
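
A consumer of this output could load it with the standard library as sketched below. The sample is an abbreviated copy of the example above. The `"unknown"` fallback for an absent `location` is an assumption about how a caller might treat missing fields, not documented client behavior.

```python
import json
from datetime import datetime

raw = """
{
  "url": "https://map.stapply.ai/jobs/vercel/content-engineer-shkfa9",
  "title": "Content Engineer",
  "company": "vercel",
  "scraped_at": "2024-12-25T18:48:46.135000"
}
"""

details = json.loads(raw)
# The timestamp carries no UTC offset, so this yields a naive datetime.
scraped_at = datetime.fromisoformat(details["scraped_at"])
# .get() tolerates fields that a given page did not provide.
location = details.get("location", "unknown")

print(scraped_at.date(), location)
```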

## Logging

The client uses Python's standard logging module:

```python
import logging

from api_client import StapplyAPIClient

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)

with StapplyAPIClient() as client:
    jobs = client.get_jobs()
```

## Troubleshooting

### Browser Not Starting
```bash
# Reinstall Playwright browsers
python -m playwright install --force chromium
```

### Timeout Errors
Increase the timeout parameter:
```python
client = StapplyAPIClient(timeout=60000)  # 60 seconds
```

### Missing Jobs
The site may load listings dynamically. Increase the wait time:
```python
jobs = client.get_jobs(wait_time=5000)  # 5 seconds
```

## Limitations

1. **No Search/Filter**: The current implementation doesn't support filtering jobs by company, location, or keyword
2. **No Pagination**: If the site has pagination, it's not currently handled
3. **External Apply Only**: Applications go through external systems (Greenhouse, etc.)
4. **Performance**: Browser automation is slower than pure API calls
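
Until filtering is supported by the client, the first limitation can be worked around by filtering the saved JSON after the fact. A minimal sketch, assuming jobs are plain dictionaries shaped like the job list JSON documented in this README; `filter_jobs` is a hypothetical helper and the second sample entry is made up for illustration:

```python
from typing import Dict, List, Optional

JobDict = Dict[str, Optional[str]]


def filter_jobs(
    jobs: List[JobDict],
    company: Optional[str] = None,
    keyword: Optional[str] = None,
) -> List[JobDict]:
    """Keep jobs whose company matches exactly and whose title contains
    the keyword; both comparisons ignore case."""
    matches = []
    for job in jobs:
        if company and (job.get("company") or "").lower() != company.lower():
            continue
        if keyword and keyword.lower() not in (job.get("title") or "").lower():
            continue
        matches.append(job)
    return matches


sample = [
    {"title": "Content Engineer", "company": "vercel"},
    {"title": "Example Role", "company": "example-co"},  # made-up entry
]
print(filter_jobs(sample, company="vercel"))
```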

## Future Enhancements

Potential improvements:
- [ ] Add filtering by company/location
- [ ] Support pagination if available
- [ ] Extract more job metadata (salary, remote status, etc.)
- [ ] Add caching to reduce duplicate requests
- [ ] Support for job alerts/monitoring
- [ ] Async version using Playwright async API

## License

This is a reverse-engineered client for educational purposes. Please respect Stapply.ai's terms of service and robots.txt when using this client.

## Support

For issues or questions about this client, please review the code and logs. The implementation is fully documented with docstrings and type hints.

---

**Generated**: Reverse-engineered from HAR file `1b71990efa04/recording.har`
**Test Status**: ✅ Tested and working
**Last Updated**: 2024-12-25
