# Query Syntax Reference

Complete reference for NLQL query syntax.

## SELECT Clause

### Granularity Levels

```sql
-- Full documents
SELECT DOCUMENT

-- Chunks (the default unit returned by vector databases)
SELECT CHUNK

-- Individual sentences
SELECT SENTENCE

-- Sliding window with context
SELECT SPAN(SENTENCE, window=3)
SELECT SPAN(CHUNK, window=2)
```
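To make the sliding-window idea concrete, here is a minimal Python sketch of what a `SPAN` result might contain. It assumes `window` counts the neighbouring units included on each side of the matching unit; this reference does not spell out the exact semantics, and `span_around` is a hypothetical helper, not part of NLQL.

```python
def span_around(units: list[str], hit_index: int, window: int) -> str:
    """Join the unit at hit_index with up to `window` neighbours per side.

    Assumption: `window` means "units of context on each side";
    NLQL's actual definition may differ.
    """
    start = max(0, hit_index - window)               # clamp at the first unit
    end = min(len(units), hit_index + window + 1)    # clamp at the last unit
    return " ".join(units[start:end])


sentences = ["A.", "B.", "C.", "D.", "E."]
print(span_around(sentences, hit_index=2, window=1))  # B. C. D.
```

Under this assumption, a larger window trades precision for more surrounding context in each result.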

## WHERE Clause

### Semantic Operators

#### SIMILAR_TO - Semantic Similarity

The `SIMILAR_TO` operator performs vector-based semantic search:

```sql
-- Basic semantic search
WHERE SIMILAR_TO("query text") > 0.8

-- Combine with other conditions
WHERE SIMILAR_TO("AI agents") > 0.7 AND META("status") == "published"
```

**How it works:**

1. **Automatic Vectorization**: NLQL embeds both the query text and every document chunk using the configured embedding provider (default: `all-MiniLM-L6-v2`)

2. **Similarity Computation**: Computes cosine similarity between query and document vectors, returning a score between 0 and 1

3. **Score Storage**: The similarity score is stored in `metadata["similarity"]` for each result, making it accessible in:
   - WHERE clause: `WHERE SIMILAR_TO("query") > threshold`
   - ORDER BY clause: `ORDER BY SIMILARITY DESC`
   - Result metadata: `result.metadata['similarity']`
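The three steps above can be sketched in plain Python. This is an illustration only, not NLQL's implementation: the toy two-dimensional vectors stand in for real embeddings, and `cosine_similarity` and the `hits` dictionary are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Step 1 (stand-in): pretend these vectors came from the embedding provider.
query_vec = [1.0, 0.0]
chunks = {"chunk-a": [1.0, 0.0], "chunk-b": [1.0, 1.0], "chunk-c": [0.0, 1.0]}

# Steps 2 and 3: score every chunk, keep those above the threshold,
# and store the score under a "similarity" metadata key.
threshold = 0.6
hits = {}
for name, vec in chunks.items():
    score = cosine_similarity(query_vec, vec)
    if score > threshold:
        hits[name] = {"similarity": score}

print(sorted(hits))  # ['chunk-a', 'chunk-b']
```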

**Important Notes:**

- **Execution Order**: Similarity is computed on the original chunks BEFORE granularity transformation (SENTENCE/SPAN)
- **Score Inheritance**: When transforming to SENTENCE/SPAN, the similarity score is inherited from the parent chunk
- **Threshold Selection**: Typical starting points (tune them for your embedding model and data):
  - `> 0.8`: Very high similarity (near-duplicates)
  - `> 0.6`: High similarity (related content)
  - `> 0.4`: Moderate similarity (loosely related)
  - `> 0.2`: Low similarity (may include noise)
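The execution order and score inheritance described above can be pictured with a short sketch. It assumes a chunk is a dictionary with `content` and `metadata` keys and uses a deliberately naive regex splitter; `split_into_sentences` is a hypothetical helper, and NLQL's own sentence segmentation may behave differently.

```python
import re

# A chunk that has already been scored against the query.
chunk = {
    "content": "Transformers use attention. They scale well.",
    "metadata": {"similarity": 0.82},
}


def split_into_sentences(chunk: dict) -> list[dict]:
    """Split a scored chunk into sentences that copy the parent's metadata."""
    parts = re.split(r"(?<=[.!?])\s+", chunk["content"].strip())
    return [
        {"content": part, "metadata": dict(chunk["metadata"])}
        for part in parts
        if part
    ]


sentences = split_into_sentences(chunk)
for sentence in sentences:
    print(sentence["metadata"]["similarity"], sentence["content"])
```

One practical consequence: every sentence from the same chunk carries the same score, so ordering by similarity cannot rank sentences within a single chunk.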

**Example:**

```python
# `nlql` is assumed to be an already-configured NLQL instance (setup not shown here).
results = nlql.execute("""
    SELECT CHUNK
    WHERE SIMILAR_TO("machine learning applications") > 0.6
    ORDER BY SIMILARITY DESC
    LIMIT 5
""")

for result in results:
    print(f"[{result.metadata['similarity']:.3f}] {result.content}")
```

#### MATCH - Exact Text Match

Exact phrase matching (case-sensitive):

```sql
-- Match exact phrase
WHERE MATCH("exact phrase")

-- Combine with OR
WHERE MATCH("error") OR MATCH("warning")
```

#### CONTAINS - Substring Match

Case-insensitive substring matching:

```sql
-- Contains keyword (case-insensitive)
WHERE CONTAINS("keyword")

-- Combine with AND
WHERE CONTAINS("machine") AND CONTAINS("learning")
```
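The difference between the two text operators comes down to case handling. The sketch below mimics that behaviour with plain Python string tests; `match` and `contains` are hypothetical stand-ins, and it assumes `MATCH` succeeds when the phrase appears verbatim anywhere in the text.

```python
def match(text: str, phrase: str) -> bool:
    """Case-sensitive: the phrase must appear exactly as written."""
    return phrase in text


def contains(text: str, keyword: str) -> bool:
    """Case-insensitive: compare case-folded copies of both strings."""
    return keyword.casefold() in text.casefold()


text = "Machine Learning improves search."

print(match(text, "machine learning"))     # False
print(match(text, "Machine Learning"))     # True
print(contains(text, "machine learning"))  # True
```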

### Metadata Operators

```sql
-- Access metadata fields
WHERE META("field_name") == "value"
WHERE META("date") > "2024-01-01"
WHERE META("score") >= 0.5
```

### Comparison Operators

```sql
-- Numeric comparisons
WHERE META("score") > 0.8
WHERE META("count") <= 100
WHERE META("value") >= 50

-- Text comparisons
WHERE META("status") == "active"
WHERE META("category") != "archived"

-- Date comparisons (requires DateType registration)
WHERE META("created_at") > "2024-01-01"
```

### Boolean Logic

```sql
-- AND
WHERE SIMILAR_TO("AI") > 0.7 AND META("topic") == "ML"

-- OR
WHERE CONTAINS("machine learning") OR CONTAINS("deep learning")

-- NOT
WHERE NOT CONTAINS("deprecated")

-- Complex combinations
WHERE (SIMILAR_TO("AI") > 0.8 OR CONTAINS("artificial intelligence"))
  AND META("date") > "2024-01-01"
  AND NOT META("archived") == true
```
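For readers who think in Python, the complex combination above corresponds roughly to the predicate below. This is a sketch under assumptions: the similarity score is taken as already computed, `NOT` is read as applying to the whole `META("archived") == true` comparison, dates are ISO-formatted strings, and `keep` is a hypothetical name.

```python
def keep(similarity: float, content: str, meta: dict) -> bool:
    """Python rendering of the parenthesised WHERE clause above."""
    # (SIMILAR_TO("AI") > 0.8 OR CONTAINS("artificial intelligence"))
    semantic_or_keyword = (
        similarity > 0.8 or "artificial intelligence" in content.casefold()
    )
    # META("date") > "2024-01-01" (ISO dates sort correctly as strings)
    recent = meta.get("date", "") > "2024-01-01"
    # NOT META("archived") == true
    not_archived = meta.get("archived") is not True
    return semantic_or_keyword and recent and not_archived


print(keep(0.9, "Agents overview", {"date": "2024-03-01", "archived": False}))  # True
print(keep(0.9, "Agents overview", {"date": "2024-03-01", "archived": True}))   # False
```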

### Functions

```sql
-- Built-in functions
WHERE LENGTH(content) > 100
WHERE COUNT("AI") > 3

-- Custom functions (after registration)
WHERE word_count(content) > 50
```
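A custom function such as `word_count` could be an ordinary Python callable along these lines. The body is an assumption about what such a function might do, and the registration call is omitted because this section does not show it.

```python
def word_count(content: str) -> int:
    """Count whitespace-separated words in a piece of text."""
    return len(content.split())


print(word_count("NLQL queries text with SQL-like syntax"))  # 6
```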

## ORDER BY Clause

```sql
-- Order by similarity score (for SIMILAR_TO queries)
ORDER BY SIMILARITY DESC

-- Order by metadata fields
ORDER BY META("date") DESC
ORDER BY META("score") ASC

-- Multiple fields
ORDER BY META("priority") DESC, META("date") DESC
```
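Multi-field ordering behaves like sorting on a tuple of keys: the second field only breaks ties in the first. The sketch below reproduces `ORDER BY META("priority") DESC, META("date") DESC` on hypothetical rows; it illustrates the ordering semantics and is not NLQL's sorting code.

```python
rows = [
    {"id": "a", "priority": 1, "date": "2024-03-01"},
    {"id": "b", "priority": 2, "date": "2024-01-15"},
    {"id": "c", "priority": 2, "date": "2024-02-20"},
]

# Both fields are descending, so one reversed sort on the key tuple suffices.
ordered = sorted(rows, key=lambda r: (r["priority"], r["date"]), reverse=True)
ids = [r["id"] for r in ordered]

print(ids)  # ['c', 'b', 'a']
```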

## LIMIT Clause

```sql
-- Limit number of results
LIMIT 10
LIMIT 100
```

## Complete Examples

### Basic Retrieval

```sql
SELECT CHUNK
WHERE CONTAINS("machine learning")
LIMIT 10
```

### Semantic Search

```sql
SELECT SENTENCE
WHERE SIMILAR_TO("AI agents and autonomous systems") > 0.75
ORDER BY SIMILARITY DESC
LIMIT 5
```

### Metadata Filtering

```sql
SELECT DOCUMENT
WHERE META("author") == "Alice"
  AND META("date") > "2024-01-01"
ORDER BY META("date") DESC
```

### Hybrid Query

```sql
SELECT SENTENCE
WHERE SIMILAR_TO("neural networks") > 0.7
  AND META("topic") == "deep learning"
  AND LENGTH(content) > 50
ORDER BY SIMILARITY DESC
LIMIT 20
```

### Context Windows

```sql
SELECT SPAN(SENTENCE, window=2)
WHERE SIMILAR_TO("transformer architecture") > 0.7
ORDER BY SIMILARITY DESC
LIMIT 5
```

## Operator Reference

| Operator | Description | Returns | Example |
|----------|-------------|---------|---------|
| `SIMILAR_TO("text")` | Semantic similarity (vector-based) | Float (0-1) | `SIMILAR_TO("query") > 0.8` |
| `MATCH("text")` | Exact text match (case-sensitive) | Boolean | `MATCH("exact phrase")` |
| `CONTAINS("text")` | Substring match (case-insensitive) | Boolean | `CONTAINS("keyword")` |
| `META("field")` | Metadata field access | Any | `META("field") == value` |

## Function Reference

| Function | Description | Example |
|----------|-------------|---------|
| `LENGTH` | Text length | `LENGTH(content) > 100` |
| `COUNT` | Count occurrences | `COUNT("word") > 3` |
| `NOW` | Current timestamp | `META("date") < NOW()` |

## Type System

Register metadata field types for type-safe comparisons:

```python
from nlql import register_meta_field, NumberType, DateType, TextType

# Register field types
register_meta_field("score", NumberType)
register_meta_field("created_at", DateType)
register_meta_field("status", TextType)
```

Then use in queries:

```sql
WHERE META("score") > 0.8
WHERE META("created_at") > "2024-01-01"
WHERE META("status") == "active"
```
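A general reason typed comparisons matter is that values compared as raw strings can order differently from the numbers or dates they represent. The sketch below shows that pitfall in plain Python; it does not describe NLQL's internals, and `parse_date` is a hypothetical helper.

```python
from datetime import date

# Compared as strings, "9" sorts after "10" because "9" > "1".
string_result = "9" > "10"
# Compared as numbers, the result is what a reader expects.
number_result = 9 > 10


def parse_date(value: str) -> date:
    """Convert an ISO 8601 date string (YYYY-MM-DD) into a date object."""
    return date.fromisoformat(value)


# Typed date comparison, as in META("created_at") > "2024-01-01".
is_recent = parse_date("2024-03-15") > parse_date("2024-01-01")

print(string_result, number_result, is_recent)  # True False True
```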

