Metadata-Version: 2.3
Name: DAWG2-Python
Version: 0.9.0
Summary: Pure-python reader for DAWGs (DAFSAs) created by dawgdic C++ library or DAWG Python extension.
License: MIT
Author: Mikhail Korobov
Author-email: kmike84@gmail.com
Requires-Python: >=3.8,<4.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: typing-extensions (>=4.0) ; python_version < "3.11"
Project-URL: Repository, https://github.com/pymorphy2-fork/DAWG-Python/
Description-Content-Type: text/markdown

# DAWG2-Python

[![Python tests](https://github.com/pymorphy2-fork/DAWG-Python/actions/workflows/python-tests.yml/badge.svg)](https://github.com/pymorphy2-fork/DAWG-Python/actions/workflows/python-tests.yml)
[![Coverage Status](https://coveralls.io/repos/github/pymorphy2-fork/DAWG-Python/badge.svg?branch=master)](https://coveralls.io/github/pymorphy2-fork/DAWG-Python?branch=master)

This pure-Python package provides read-only access to files created by
the [dawgdic][1] C++ library and
the [DAWG][2] Python package.

This package cannot create DAWGs. It works with DAWGs built
by the [dawgdic][1] C++ library or
the [DAWG][2] Python extension module. The main
purpose of DAWG2-Python is to provide access to DAWGs without
requiring compiled extensions. It is also quite fast under PyPy (see
the benchmarks below).

# Installation

```commandline
pip install DAWG2-Python
```
# Usage

DAWG2-Python aims to be API- and binary-compatible with
[DAWG][2] wherever possible.

First, create a DAWG using
the [DAWG][2] module:

```python
import dawg

data = ['foo', 'bar', 'foobar']  # illustrative keys; any iterable of strings
d = dawg.DAWG(data)
d.save('words.dawg')
```
This DAWG can then be loaded without requiring C extensions:

```python
import dawg_python

d = dawg_python.DAWG().load('words.dawg')
```
Please consult the [DAWG][2] docs for detailed
usage. Some features (such as constructor parameters or the `save` method) are
intentionally unsupported.
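
As a rough illustration of two query methods that appear in the benchmarks
below, `prefixes` and `keys(prefix=...)`, the following sketch mimics their
results with a plain Python set. It is a toy stand-in, not the library's
implementation: the helper names `prefixes_of` and `keys_with_prefix` are
hypothetical, and it assumes the methods behave as described in the
[DAWG][2] docs.

```python
def prefixes_of(keys, word):
    """Return every stored key that is a prefix of `word`, shortest first."""
    return [word[:i] for i in range(1, len(word) + 1) if word[:i] in keys]


def keys_with_prefix(keys, prefix):
    """Return every stored key that starts with `prefix`, in sorted order."""
    return sorted(k for k in keys if k.startswith(prefix))


# Hypothetical key set standing in for the contents of a loaded DAWG.
words = {"foo", "foobar", "bar"}

print(prefixes_of(words, "foobarz"))   # ['foo', 'foobar']
print(keys_with_prefix(words, "foo"))  # ['foo', 'foobar']
```

A real DAWG answers both queries by walking shared graph nodes instead of
scanning every key, which is why prefix queries stay fast on large
dictionaries.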

# Benchmarks

Benchmark results (100k Unicode words, integer values (lengths of the
words), PyPy 1.9, MacBook Air, i5 1.8 GHz):

    dict __getitem__ (hits):        11.090M ops/sec
    DAWG __getitem__ (hits):        not supported
    BytesDAWG __getitem__ (hits):   0.493M ops/sec
    RecordDAWG __getitem__ (hits):  0.376M ops/sec

    dict get() (hits):              10.127M ops/sec
    DAWG get() (hits):              not supported
    BytesDAWG get() (hits):         0.481M ops/sec
    RecordDAWG get() (hits):        0.402M ops/sec

    dict get() (misses):            14.885M ops/sec
    DAWG get() (misses):            not supported
    BytesDAWG get() (misses):       1.259M ops/sec
    RecordDAWG get() (misses):      1.337M ops/sec

    dict __contains__ (hits):           11.100M ops/sec
    DAWG __contains__ (hits):           1.317M ops/sec
    BytesDAWG __contains__ (hits):      1.107M ops/sec
    RecordDAWG __contains__ (hits):     1.095M ops/sec

    dict __contains__ (misses):         10.567M ops/sec
    DAWG __contains__ (misses):         1.902M ops/sec
    BytesDAWG __contains__ (misses):    1.873M ops/sec
    RecordDAWG __contains__ (misses):   1.862M ops/sec

    dict items():           44.401 ops/sec
    DAWG items():           not supported
    BytesDAWG items():      3.226 ops/sec
    RecordDAWG items():     2.987 ops/sec

    dict keys():            426.250 ops/sec
    DAWG keys():            not supported
    BytesDAWG keys():       6.050 ops/sec
    RecordDAWG keys():      6.363 ops/sec

    DAWG.prefixes (hits):    0.756M ops/sec
    DAWG.prefixes (mixed):   1.965M ops/sec
    DAWG.prefixes (misses):  1.773M ops/sec

    RecordDAWG.keys(prefix="xxx"), avg_len(res)==415:       1.429K ops/sec
    RecordDAWG.keys(prefix="xxxxx"), avg_len(res)==17:      36.994K ops/sec
    RecordDAWG.keys(prefix="xxxxxxxx"), avg_len(res)==3:    121.897K ops/sec
    RecordDAWG.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 265.015K ops/sec
    RecordDAWG.keys(prefix="xxx"), NON_EXISTING:            2450.898K ops/sec

Under CPython, expect it to be about 50x slower. Memory consumption of
DAWG2-Python should be the same as that of
[DAWG][2].
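
The figures above are expressed in operations per second. A minimal sketch
of how such a number can be measured with the standard `timeit` module is
shown here; it is not the project's benchmark script, and the helper name
`ops_per_sec` and the sample dictionary are assumptions made for
illustration.

```python
import timeit


def ops_per_sec(func, number=100_000):
    """Call `func` `number` times and return the average calls per second."""
    seconds = timeit.timeit(func, number=number)
    return number / seconds


# Hypothetical data: keys mapped to their lengths, as in the benchmark setup.
words = {f"word{i}": len(f"word{i}") for i in range(1000)}

# The lambda adds call overhead, so this understates raw dict speed.
rate = ops_per_sec(lambda: words["word500"])
print(f"dict __getitem__ (hits): {rate / 1e6:.3f}M ops/sec")
```

Swapping the lambda for a lookup on a loaded `BytesDAWG` or `RecordDAWG`
would give a comparable figure for this package on your own hardware.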

# Current limitations

- This package is not capable of creating DAWGs;
- all the limitations of [DAWG][2] apply.

Contributions are welcome!

# Contributing

- Development happens on GitHub: <https://github.com/pymorphy2-fork/DAWG-Python>
- Issue tracker: <https://github.com/pymorphy2-fork/DAWG-Python/issues>

Feel free to submit ideas, bugs or pull requests.

## Running tests and benchmarks

Make sure [pytest][3] is installed and run

```commandline
$ pytest .
```
from the source checkout. Tests should pass under Python 3.8, 3.9, 3.10, 3.11 and PyPy3 \>= 7.3.

To run the benchmarks, type

```commandline
$ pypy3 -m bench.speed
```
This runs benchmarks under PyPy (they are about 50x slower under
CPython).

## Authors & Contributors

- Mikhail Korobov \<kmike84@gmail.com\>
- [@bt2901](https://github.com/bt2901)
- [@insolor](https://github.com/insolor)

The algorithms are from the [dawgdic][1]
C++ library by Susumu Yata & contributors.

# License

This package is licensed under the MIT License.

[1]: https://code.google.com/p/dawgdic/
[2]: https://github.com/pymorphy2-fork/DAWG
[3]: https://docs.pytest.org/en/7.4.x/getting-started.html

# Changes

## 0.8.1 (2024-08-01)

Minor technical update:

- fixed typo in github link
- updated dependencies

## 0.8.0 (2023-09-27)

- Allow more flexible char substitutes by [@bt2901](https://github.com/bt2901)
- minimal Python version changed to 3.8 by [@insolor](https://github.com/insolor)
- build system switched from setup.py to Poetry by [@insolor](https://github.com/insolor)

## 0.7.2 (2015-04-18)

- minor speedup;
- bitbucket mirror is no longer maintained.

## 0.7.1 (2014-06-05)

- Switch to setuptools;
- upload wheel to pypi;
- check Python 3.4 compatibility.

## 0.7 (2013-10-13)

IntDAWG and IntCompletionDAWG are implemented.

## 0.6 (2013-03-23)

Use less shared state internally. This should fix thread-safety bugs and
make iterkeys/iteritems reentrant.

## 0.5.1 (2013-03-01)

Internal tweaks: memory usage is reduced; some operations are a bit faster,
others are a bit slower.

## 0.5 (2012-10-08)

The storage scheme is updated to match DAWG==0.5. This enables
alphabetical ordering of `BytesDAWG` and `RecordDAWG` items.

To read a `BytesDAWG` or `RecordDAWG` created with versions of
DAWG \< 0.5, use the `payload_separator` constructor argument:

    >>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')
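
The link between the separator byte and alphabetical ordering can be
sketched as follows. This is a simplified model, not the actual storage
code: it assumes each entry is stored as key bytes, then the separator,
then the payload, and that the newer default separator is `b'\x01'`. The
helper name `stored_order` is hypothetical.

```python
def stored_order(keys, separator):
    """Return keys in the byte order of their 'key + separator + payload' entries."""
    # Assumption: a fixed dummy payload stands in for the real encoded value.
    entries = [k.encode("utf8") + separator + b"payload" for k in keys]
    return [e.split(separator, 1)[0].decode("utf8") for e in sorted(entries)]


keys = ["foo", "foobar"]

# A low separator byte sorts before any letter, so "foo" precedes "foobar".
print(stored_order(keys, b"\x01"))  # ['foo', 'foobar']

# A high separator byte sorts after any ASCII letter, so the order flips.
print(stored_order(keys, b"\xff"))  # ['foobar', 'foo']
```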

## 0.3.1 (2012-10-01)

Bug with empty DAWGs is fixed.

## 0.3 (2012-09-26)

- `iterkeys` and `iteritems` methods.

## 0.2 (2012-09-24)

`prefixes` support.

## 0.1 (2012-09-20)

Initial release.

