# Extensibility

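NLQL is designed to be extensible: you can register custom operators, functions, metadata field types, text splitters, embedding providers, and data source adapters, and tune execution through configuration.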

## Registration Levels

NLQL supports two levels of registration:

1. **Global Registration**: Functions/operators/providers are registered globally and available to all NLQL instances
2. **Instance-Level Registration**: Functions/operators/providers are registered to a specific NLQL instance only

Instance-level registrations take precedence over global registrations, allowing you to override global behavior for specific instances.
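
The precedence rule can be pictured as a two-layer lookup. The sketch below is not NLQL's internal implementation; it uses the standard library's `collections.ChainMap` and hypothetical registry dictionaries purely to illustrate how an instance-level entry shadows a global one without modifying it.

```python
from collections import ChainMap

# Hypothetical stand-ins for the two registration levels.
global_registry = {"WORD_COUNT": lambda text: len(text.split())}
instance_registry = {"WORD_COUNT": lambda text: len(set(text.lower().split()))}

# ChainMap searches the first mapping before falling back to the second,
# so instance-level entries shadow global ones.
lookup = ChainMap(instance_registry, global_registry)

sample = "the cat and the hat"
total = global_registry["WORD_COUNT"](sample)   # global version: all words
unique = lookup["WORD_COUNT"](sample)           # instance version wins
```

The global entry is still intact after the lookup, which mirrors the rule that instance-level registrations do not affect the global registry.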

## Custom Operators

Register domain-specific operators:

```python
from nlql import register_operator
import re

@register_operator("HAS_EMAIL")
def has_email(text: str) -> bool:
    """Check if text contains an email address."""
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return bool(re.search(pattern, text))

@register_operator("HAS_URL")
def has_url(text: str) -> bool:
    """Check if text contains a URL."""
    pattern = r'https?://[^\s]+'
    return bool(re.search(pattern, text))
```

Use in queries:

```sql
SELECT CHUNK
WHERE HAS_EMAIL(content) AND HAS_URL(content)
```
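
Since the operators shown here are ordinary Python functions that return a boolean, their logic can be checked on sample strings before registration. A minimal sketch, independent of NLQL (it does not call `register_operator`), with made-up sample data:

```python
import re

EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
URL_PATTERN = re.compile(r"https?://\S+")

def has_email(text: str) -> bool:
    """Check if text contains an email address."""
    return bool(EMAIL_PATTERN.search(text))

def has_url(text: str) -> bool:
    """Check if text contains a URL."""
    return bool(URL_PATTERN.search(text))

# Illustrative inputs only.
samples = [
    "Contact alice@example.com",
    "Docs at https://example.com/docs",
    "No contact details here",
]
flags = [(has_email(s), has_url(s)) for s in samples]
```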

## Custom Functions

Add query functions:

```python
from nlql import register_function

@register_function("word_count")
def word_count(text: str) -> int:
    """Count words in text."""
    return len(text.split())

@register_function("days_ago")
def days_ago(days: int) -> str:
    """Get date N days ago."""
    from datetime import datetime, timedelta
    date = datetime.now() - timedelta(days=days)
    return date.strftime("%Y-%m-%d")
```

Use in queries:

```sql
SELECT CHUNK
WHERE word_count(content) > 100
  AND META("date") > days_ago(7)
```

### Instance-Level Registration

You can register functions, operators, and embedding providers to specific NLQL instances instead of globally. This is useful for:

- Multi-tenant applications with different business logic per tenant
- A/B testing different implementations
- Isolating test environments from production
- Domain-specific query engines with specialized functions

**Example: Different Function Implementations per Instance**

```python
from nlql import NLQL
from nlql.adapters import MemoryAdapter

# Create two NLQL instances, each with its own adapter
adapter1 = MemoryAdapter()
adapter2 = MemoryAdapter()
nlql1 = NLQL(adapter=adapter1)
nlql2 = NLQL(adapter=adapter2)

# Register different implementations to each instance
@nlql1.register_function("WORD_COUNT")
def word_count_total(text: str) -> int:
    """Count total words."""
    return len(text.split())

@nlql2.register_function("WORD_COUNT")
def word_count_unique(text: str) -> int:
    """Count unique words."""
    return len(set(text.lower().split()))

# Each instance uses its own implementation
results1 = nlql1.execute("SELECT CHUNK WHERE WORD_COUNT(content) > 10")
results2 = nlql2.execute("SELECT CHUNK WHERE WORD_COUNT(content) > 10")
```

**Example: Different Operator Implementations per Instance**

```python
# Register different operators to each instance
@nlql1.register_operator("CUSTOM_FILTER")
def filter_python(text: str) -> bool:
    return "Python" in text

@nlql2.register_operator("CUSTOM_FILTER")
def filter_ai(text: str) -> bool:
    return "AI" in text

# Each instance uses its own operator
results1 = nlql1.execute("SELECT CHUNK WHERE CUSTOM_FILTER(content)")
results2 = nlql2.execute("SELECT CHUNK WHERE CUSTOM_FILTER(content)")
```

**Example: Different Embedding Providers per Instance**

```python
# Register different embedding providers to each instance
@nlql1.register_embedding_provider
def embedding_word_based(texts: list[str]) -> list[list[float]]:
    return [[len(text.split()) / 10.0, 0.5, 0.5] for text in texts]

@nlql2.register_embedding_provider
def embedding_char_based(texts: list[str]) -> list[list[float]]:
    return [[len(text) / 50.0, 0.5, 0.5] for text in texts]

# Each instance uses its own embedding provider
results1 = nlql1.execute('SELECT CHUNK WHERE SIMILAR_TO("query") > 0.5')
results2 = nlql2.execute('SELECT CHUNK WHERE SIMILAR_TO("query") > 0.5')
```

**Priority Rules**:
- Instance-level registrations take precedence over global registrations
- If a function/operator is registered both globally and to an instance, the instance-level version is used
- Instance-level registrations do not affect the global registry or other instances

See the `examples/instance_registry_demo.py` file in the repository for a complete working example.

## Custom Types

Define metadata field types for type-safe comparisons:

```python
from nlql import register_meta_field
from nlql.types import BaseType, NumberType, DateType, TextType

# Register built-in types
register_meta_field("score", NumberType)
register_meta_field("created_at", DateType)
register_meta_field("status", TextType)

# Create custom type
class PriorityType(BaseType):
    """Custom priority type with special comparison logic."""

    LEVELS = {"low": 1, "medium": 2, "high": 3, "critical": 4}

    def __init__(self, value: str | int):
        if isinstance(value, str):
            value = self.LEVELS.get(value.lower(), 0)
        super().__init__(value)

    def __lt__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value < other_val

    def __gt__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value > other_val

    def __eq__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value == other_val

# Register custom type
register_meta_field("priority", PriorityType)
```

Use in queries:

```sql
SELECT CHUNK
WHERE META("priority") > "medium"
```

## Custom Splitters

Implement language-specific or domain-specific text splitting:

```python
from nlql import register_splitter

@register_splitter("SENTENCE")
def german_sentence_splitter(text: str) -> list[str]:
    """Split German text into sentences."""
    import nltk
    return nltk.sent_tokenize(text, language='german')

@register_splitter("PARAGRAPH")
def paragraph_splitter(text: str) -> list[str]:
    """Split text into paragraphs."""
    return [p.strip() for p in text.split('\n\n') if p.strip()]
```

Use in queries:

```sql
-- Uses custom German sentence splitter
SELECT SENTENCE
WHERE SIMILAR_TO("Künstliche Intelligenz")
```
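
Splitting on the exact string `'\n\n'` misses paragraph breaks that contain stray spaces or Windows line endings. A minimal sketch of a more tolerant splitter, written as a plain function (the name `split_paragraphs` is hypothetical, and registration with `register_splitter` is left out):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines, treating whitespace-only lines as blank."""
    normalized = text.replace("\r\n", "\n")
    # \n\s*\n matches a newline, any run of whitespace, then another newline.
    return [p.strip() for p in re.split(r"\n\s*\n", normalized) if p.strip()]

parts = split_paragraphs("First paragraph.\r\n\r\nSecond one.\n   \nThird.")
```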

## Custom Embedding Provider

NLQL uses embedding providers for semantic search (`SIMILAR_TO` operator). You can customize the embedding model used for vectorization.

### Default Provider

By default, NLQL uses `sentence-transformers` with the `all-MiniLM-L6-v2` model:

```python
from nlql import NLQL
from nlql.adapters import MemoryAdapter

# Uses default embedding provider automatically
adapter = MemoryAdapter()
nlql = NLQL(adapter=adapter)

# SIMILAR_TO will use all-MiniLM-L6-v2
results = nlql.execute('SELECT CHUNK WHERE SIMILAR_TO("AI") > 0.7')
```

### Custom Provider with OpenAI

Use OpenAI's embedding API:

```python
from nlql.registry.embedding import register_embedding_provider

# Use decorator syntax (recommended)
@register_embedding_provider
def openai_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using OpenAI API."""
    from openai import OpenAI

    # Requires openai>=1.0. The client reads the OPENAI_API_KEY environment
    # variable by default; avoid hard-coding keys in source.
    client = OpenAI()

    response = client.embeddings.create(
        input=texts,
        model="text-embedding-ada-002"
    )

    return [item.embedding for item in response.data]

# Now SIMILAR_TO will use OpenAI embeddings
# (NLQL and adapter are set up as in the Default Provider example)
nlql = NLQL(adapter=adapter)
results = nlql.execute('SELECT CHUNK WHERE SIMILAR_TO("AI") > 0.7')
```

**Note**: Embedding providers must be functions with signature `(list[str]) -> list[list[float]]`. They receive a batch of texts and return a batch of embedding vectors. You can use either decorator syntax (`@register_embedding_provider`) or function call syntax (`register_embedding_provider(my_func)`).

### Custom Provider with Different Sentence-Transformers Model

Use a different sentence-transformers model:

```python
from nlql.registry.embedding import register_embedding_provider
from sentence_transformers import SentenceTransformer

# Load model once (lazy loading)
_model = None

@register_embedding_provider
def custom_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using a different sentence-transformers model."""
    global _model
    if _model is None:
        _model = SentenceTransformer("all-mpnet-base-v2")

    embeddings = _model.encode(texts, convert_to_numpy=True)
    return embeddings.tolist()
```

### Custom Provider with Hugging Face Models

Use any Hugging Face model:

```python
from nlql.registry.embedding import register_embedding_provider
from transformers import AutoTokenizer, AutoModel
import torch

# Load model once (lazy loading)
_tokenizer = None
_model = None

@register_embedding_provider
def huggingface_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using Hugging Face transformers."""
    global _tokenizer, _model

    if _tokenizer is None or _model is None:
        _tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        _model = AutoModel.from_pretrained("bert-base-uncased")

    # Tokenize
    encoded = _tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )

    # Get model output
    with torch.no_grad():
        output = _model(**encoded)

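    # Mean pooling over real tokens only (padding positions are masked out)
    mask = encoded["attention_mask"].unsqueeze(-1).float()
    summed = (output.last_hidden_state * mask).sum(dim=1)
    embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)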

    return embeddings.tolist()
```

### Provider Interface

All embedding providers must be functions with this signature:

```python
def embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings for a list of texts.

    Args:
        texts: List of text strings to embed

    Returns:
        List of embedding vectors (each vector is a list of floats)

    Example:
        >>> embedding_provider(["hello", "world"])
        [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    """
    # Your embedding logic here
    pass
```
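
A provider that returns the wrong number of vectors, or vectors of differing lengths, can be hard to diagnose once it is inside a query. The sketch below shows one way to check the contract before registering a provider. The `checked` wrapper and `toy_provider` are hypothetical helpers, not part of NLQL.

```python
from typing import Callable

Provider = Callable[[list[str]], list[list[float]]]

def checked(provider: Provider) -> Provider:
    """Wrap a provider and verify the shape of what it returns."""
    def wrapper(texts: list[str]) -> list[list[float]]:
        vectors = provider(texts)
        if len(vectors) != len(texts):
            raise ValueError(f"expected {len(texts)} vectors, got {len(vectors)}")
        if len({len(v) for v in vectors}) > 1:
            raise ValueError("embedding vectors have inconsistent dimensions")
        # Normalize to plain Python floats (e.g. when values are NumPy scalars).
        return [[float(x) for x in v] for v in vectors]
    return wrapper

def toy_provider(texts: list[str]) -> list[list[float]]:
    # Hypothetical stand-in: two hand-made features per text.
    return [[float(len(t)), float(t.count(" "))] for t in texts]

vectors = checked(toy_provider)(["hello", "hello world"])
```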

### Best Practices

1. **Lazy Loading**: Load models only when needed to save memory:
   ```python
   _model = None

   def my_embedding_provider(texts: list[str]) -> list[list[float]]:
       global _model
       if _model is None:
           _model = load_expensive_model()
       return _model.encode(texts)
   ```

2. **Batch Processing**: Process multiple texts at once for efficiency:
   ```python
   def my_embedding_provider(texts: list[str]) -> list[list[float]]:
       # Process all texts in one batch (not one by one)
       return model.encode(texts)
   ```

3. **Error Handling**: Handle errors gracefully:
   ```python
   from nlql.errors import NLQLConfigError

   def my_embedding_provider(texts: list[str]) -> list[list[float]]:
       try:
           return model.encode(texts)
       except Exception as e:
           raise NLQLConfigError(f"Embedding failed: {e}") from e
   ```

4. **Caching**: Consider caching embeddings for frequently used texts:
   ```python
   from nlql.registry.embedding import register_embedding_provider

   # Cache for storing embeddings
   _cache = {}
   _base_model = None

   @register_embedding_provider
   def cached_embedding_provider(texts: list[str]) -> list[list[float]]:
       """Embedding provider with caching."""
       global _cache, _base_model

       if _base_model is None:
           from sentence_transformers import SentenceTransformer
           _base_model = SentenceTransformer("all-MiniLM-L6-v2")

       # Pre-size the result list so each embedding lands at its input position
       results = [None] * len(texts)
       uncached = []
       uncached_indices = []

       for i, text in enumerate(texts):
           if text in _cache:
               results[i] = _cache[text]
           else:
               uncached.append(text)
               uncached_indices.append(i)

       if uncached:
           new_embeddings = _base_model.encode(uncached).tolist()
           for i, text, emb in zip(uncached_indices, uncached, new_embeddings):
               _cache[text] = emb
               results[i] = emb

       return results
   ```
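
A plain dictionary cache grows without bound in a long-running process. One possible variation, sketched below under the assumption that a least-recently-used policy suits the workload, wraps any provider in a bounded cache. `with_lru_cache`, `max_size`, and `fake_provider` are hypothetical names used for illustration.

```python
from collections import OrderedDict
from typing import Callable

Provider = Callable[[list[str]], list[list[float]]]

def with_lru_cache(provider: Provider, max_size: int = 10_000) -> Provider:
    """Wrap a provider with a bounded, least-recently-used embedding cache."""
    cache: OrderedDict[str, list[float]] = OrderedDict()

    def wrapper(texts: list[str]) -> list[list[float]]:
        # Embed each distinct uncached text once, in a single batch.
        missing = list(dict.fromkeys(t for t in texts if t not in cache))
        fresh = dict(zip(missing, provider(missing))) if missing else {}

        # Assemble results in input order before touching the cache.
        results = [fresh[t] if t in fresh else cache[t] for t in texts]

        for text in texts:
            if text in fresh:
                cache[text] = fresh[text]
            cache.move_to_end(text)  # mark as most recently used
        while len(cache) > max_size:
            cache.popitem(last=False)  # evict the least recently used entry
        return results

    return wrapper

calls: list[list[str]] = []

def fake_provider(texts: list[str]) -> list[list[float]]:
    # Hypothetical stand-in for a real model; records each batch it receives.
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

cached = with_lru_cache(fake_provider, max_size=2)
first = cached(["bb", "a", "bb"])   # "bb" is embedded once despite appearing twice
second = cached(["a", "ccc"])       # only "ccc" reaches the underlying provider
```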

## Custom Adapters

Create adapters for new data sources:

```python
from nlql.adapters import BaseAdapter, QueryPlan
from nlql.text.units import Chunk, TextUnit

class ElasticsearchAdapter(BaseAdapter):
    """Adapter for Elasticsearch."""

    def __init__(self, es_client, index_name: str):
        self.client = es_client
        self.index = index_name

    def query(self, plan: QueryPlan) -> list[TextUnit]:
        # Build Elasticsearch query
        es_query = {"bool": {"must": []}}

        # Add metadata filters
        if plan.filters:
            for field, value in plan.filters.items():
                es_query["bool"]["must"].append({
                    "term": {field: value}
                })

        # Add semantic search (if using vector field)
        if plan.query_text:
            # Implement vector search
            pass

        # Execute query
        response = self.client.search(
            index=self.index,
            query=es_query,
            size=plan.limit or 10,
        )

        # Convert to TextUnit
        results = []
        for hit in response["hits"]["hits"]:
            chunk = Chunk(
                content=hit["_source"]["content"],
                metadata=hit["_source"].get("metadata", {}),
                chunk_id=hit["_id"],
                position=0,
            )
            results.append(chunk)

        return results

    def supports_semantic_search(self) -> bool:
        return True  # If you have vector fields

    def supports_metadata_filter(self) -> bool:
        return True
```
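
The filter translation in `query` can be pulled into a small pure function, which makes it testable without a running cluster. This is a sketch under two assumptions: that `plan.filters` is a flat mapping of field name to exact value, as the adapter above implies, and that metadata may live under a field prefix such as `metadata.` in the index mapping. The function name and `prefix` parameter are hypothetical. It uses the `filter` clause of the Elasticsearch bool query, which matches without contributing to scoring.

```python
def build_es_query(filters: dict[str, object], prefix: str = "") -> dict:
    """Translate exact-match metadata filters into an Elasticsearch bool query."""
    if not filters:
        # No constraints: match every document.
        return {"match_all": {}}
    return {
        "bool": {
            "filter": [
                {"term": {f"{prefix}{field}": value}}
                for field, value in filters.items()
            ]
        }
    }

query = build_es_query({"lang": "en", "year": 2024}, prefix="metadata.")
```

The resulting dictionary can be passed as the `query` argument of the client's `search` call, as in the adapter above.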

## Configuration

Customize NLQL behavior with `NLQLConfig`:

```python
from nlql import NLQL, NLQLConfig
from nlql.adapters import MemoryAdapter

# Create configuration
config = NLQLConfig(
    default_limit=100,  # Default LIMIT when query doesn't specify one
    debug_mode=True,    # Enable debug logging
)

# Create NLQL instance with config
adapter = MemoryAdapter()
nlql = NLQL(adapter=adapter, config=config)

# Query without LIMIT will use default_limit=100
results = nlql.execute("SELECT CHUNK WHERE CONTAINS('AI')")
```

**Available Configuration Options**:

- `default_limit` (int | None): Default LIMIT value when query doesn't specify one. Default: `None` (no limit)
- `debug_mode` (bool): Enable debug logging for query execution steps. Default: `False`
- `enable_caching` (bool): Reserved for future caching implementation. Default: `False`
- `custom_settings` (dict): Reserved for future extensibility. Default: `{}`

## Best Practices

### 1. Naming Conventions

- **Operators**: UPPERCASE (e.g., `HAS_EMAIL`, `CONTAINS_CODE`)
- **Functions**: lowercase (e.g., `word_count`, `days_ago`)
- **Types**: PascalCase (e.g., `PriorityType`, `CustomDateType`)

### 2. Type Hints

Always use type hints for better IDE support:

```python
@register_function("my_func")
def my_func(text: str, threshold: int) -> bool:
    ...
```

### 3. Documentation

Document custom extensions:

```python
@register_operator("CUSTOM_OP")
def custom_op(text: str) -> bool:
    """Check if text matches custom criteria.

    Args:
        text: Input text to check

    Returns:
        True if criteria is met
    """
    ...
```

### 4. Error Handling

Handle errors gracefully:

```python
from nlql.errors import NLQLExecutionError

@register_function("safe_func")
def safe_func(value: str) -> int:
    try:
        return int(value)
    except ValueError as e:
        raise NLQLExecutionError(f"Cannot convert '{value}' to int") from e
```

## Complete Example

For a comprehensive demonstration of all extensibility features, see the `examples/extensibility_demo.py` file in the repository.

This example shows:

- **Custom Functions**: `WORD_COUNT()`, `UPPERCASE()`, `EXTRACT_YEAR()`
- **Custom Operators**: `STARTS_WITH()`, `HAS_DIGIT()`, `REGEX_MATCH()`
- **Custom Embedding Provider**: Simple statistics-based embedding
- **Integration**: How custom extensions work with built-in NLQL features

Run the example:

```bash
python examples/extensibility_demo.py
```

## Common Issues and Solutions

### 1. Function/Operator Name Conflicts

**Problem**: Custom function names that start with a built-in keyword (like `COUNT`, `IS`) may cause parsing errors.

**Solution**: Avoid using built-in keywords as prefixes in custom names:

```python
# ❌ Bad - starts with built-in keyword "COUNT"
@register_function("COUNTWORDS")
def count_words(text: str) -> int:
    return len(text.split())

# ✅ Good - no keyword conflicts
@register_function("NUMWORDS")
def count_words(text: str) -> int:
    return len(text.split())
```

Built-in keywords to avoid:
- Functions: `LENGTH`, `NOW`, `COUNT`
- Operators: `MATCH`, `SIMILAR_TO`, `CONTAINS`, `IS`, `META`

### 2. Custom Operators Must Be Uppercase

**Problem**: Lowercase operator names cause registration errors.

**Solution**: Always use UPPERCASE names for operators:

```python
# ❌ Bad - lowercase name
@register_operator("starts_with")
def starts_with(text: str, prefix: str) -> bool:
    return text.startswith(prefix)

# ✅ Good - uppercase name
@register_operator("STARTS_WITH")
def starts_with(text: str, prefix: str) -> bool:
    return text.startswith(prefix)
```

### 3. Embedding Provider Signature

**Problem**: Custom embedding provider has wrong signature.

**Solution**: Embedding providers must accept `list[str]` and return `list[list[float]]`:

```python
# ❌ Bad - wrong signature (single text)
def my_embedding(text: str) -> list[float]:
    return [0.1, 0.2, 0.3]

# ✅ Good - correct signature (batch processing)
def my_embedding(texts: list[str]) -> list[list[float]]:
    return [[0.1, 0.2, 0.3] for _ in texts]
```
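
If an existing embedding function only handles one text at a time, it can be adapted to the batch signature with a thin wrapper. A minimal sketch with hypothetical names (`as_batch_provider`, `embed_one`):

```python
from typing import Callable

def as_batch_provider(
    embed_one: Callable[[str], list[float]],
) -> Callable[[list[str]], list[list[float]]]:
    """Adapt a single-text embedder to the (list[str]) -> list[list[float]] shape."""
    def provider(texts: list[str]) -> list[list[float]]:
        return [embed_one(text) for text in texts]
    return provider

def embed_one(text: str) -> list[float]:
    # Hypothetical single-text embedder: length and uppercase-letter count.
    return [float(len(text)), float(sum(ch.isupper() for ch in text))]

provider = as_batch_provider(embed_one)
vectors = provider(["NLQL", "query"])
```

This satisfies the signature but calls the underlying function once per text, so it gives up the efficiency of true batch processing.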

### 4. Handling None Values in Functions

**Problem**: Functions returning `None` cause comparison errors.

**Solution**: Return a default value instead of `None`:

```python
import re

# ❌ Bad - returns None
@register_function("EXTRACT_YEAR")
def extract_year(text: str) -> int | None:
    match = re.search(r'\b(19|20)\d{2}\b', text)
    return int(match.group()) if match else None

# ✅ Good - returns default value
@register_function("EXTRACT_YEAR")
def extract_year(text: str) -> int:
    match = re.search(r'\b(19|20)\d{2}\b', text)
    return int(match.group()) if match else 0
```

### 5. Operator Argument Evaluation

**Problem**: It may be unclear whether operators receive unevaluated AST nodes or plain values.

**Solution**: No extra handling is needed. The evaluator evaluates arguments before passing them to operators, so define your operator with the expected types:

```python
@register_operator("STARTS_WITH")
def starts_with(text: str, prefix: str) -> bool:
    # text and prefix are already evaluated strings
    return text.startswith(prefix)
```

## Next Steps

- Check [API Reference](../api/registry.md) for detailed API docs
- See [Architecture](../architecture.md) for system design
- Run the example files in the `examples/` directory for hands-on learning

