Metadata-Version: 2.1
Name: Text_Preprocessing_Toolkit
Version: 0.0.2
Summary: A package that automates text preprocessing
Home-page: https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit.git
Author: Gaurav Jaiswal
Author-email: Gaurav Jaiswal <jaiswalgaurav863@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit
Project-URL: Issues, https://github.com/Gaurav-Jaiswal-1/Text-Preprocessing-Toolkit/issues
Keywords: text preprocessing,toolkit,automation
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: spellchecker
Requires-Dist: matplotlib
Requires-Dist: pandas
Requires-Dist: spacy
Requires-Dist: nltk



# Text Preprocessing Toolkit (TPT)

**Version:** 0.0.1  
**Author:** Gaurav Jaiswal  
A comprehensive Python toolkit for preprocessing text, designed to simplify NLP workflows. This package provides various utilities like stopword removal, punctuation handling, spell-checking, lemmatization, and more to clean and preprocess text effectively.

---

## Features

- **Remove Punctuation:** Strips punctuation marks from text.
- **Remove Stopwords:** Removes common stopwords to reduce noise in textual data.
- **Remove Special Characters:** Cleans text by removing unnecessary symbols.
- **Lowercase Conversion:** Standardizes text to lowercase.
- **Spell Correction:** Identifies and corrects misspelled words.
- **Lemmatization:** Converts words to their base forms.
- **Stemming:** Reduces words to their root forms using a stemming algorithm.
- **HTML Tag Removal:** Cleans HTML tags from the text.
- **URL Removal:** Detects and removes URLs.
- **Customizable Pipeline:** Allows users to apply preprocessing steps in a specified order.
- **Quick Dataset Preview:** Provides a summary of text datasets, including word and character counts.

---

## Installation

Clone the repository or install the package using `pip`:

```bash
pip install Text_Preprocessing_Toolkit
```

---

## Usage

### Import the Package

```python
from TPT import TPT
```

### Initialize the Toolkit

You can add custom stopwords during initialization:

```python
tpt = TPT(custom_stopwords=["example", "custom"])
```

### Preprocess Text with Default Pipeline

```python
text = "This is an <b>example</b> sentence with a URL: https://example.com."
processed_text = tpt.preprocess(text)
print(processed_text)
```

### Customize Preprocessing Steps

```python
custom_steps = ["lowercase", "remove_punctuation", "remove_stopwords"]
processed_text = tpt.preprocess(text, steps=custom_steps)
print(processed_text)
```

### Quick Dataset Summary

```python
texts = [
    "This is a sample text.",
    "Another <b>example</b> with HTML tags and a URL: https://example.com.",
    "Spellngg errors corrected!",
]
tpt.head(texts, n=3)
```

---

## Available Methods

| Method                   | Description                                                     |
|--------------------------|-----------------------------------------------------------------|
| `remove_punctuation`     | Removes punctuation from text.                                 |
| `remove_stopwords`       | Removes stopwords from text.                                   |
| `remove_special_characters` | Cleans text by removing special characters.                  |
| `remove_url`             | Removes URLs from the text.                                    |
| `remove_html_tags`       | Strips HTML tags from text.                                    |
| `correct_spellings`      | Corrects spelling mistakes in the text.                       |
| `lowercase`              | Converts text to lowercase.                                   |
| `lemmatize_text`         | Lemmatizes text using WordNet.                                |
| `stem_text`              | Applies stemming to reduce words to their root forms.         |
| `preprocess`             | Applies a series of preprocessing steps to the input text.    |
| `head`                   | Displays a quick summary of a text dataset.                   |

---

## Example Output

### Input

```text
This is a <b>sample</b> text with a URL: https://example.com. Check spellngg errors!
```

### Output (Default Pipeline)

```text
sample text check spelling errors
```

---

## Requirements

- Python >= 3.8
- Libraries: `nltk`, `pandas`, `spellchecker`, `IPython`

To install the dependencies:

```bash
pip install -r requirements.txt
```

---

## Contributing

Contributions are welcome! To contribute:

1. Fork this repository.
2. Clone your forked repository.
3. Create a new branch for your feature.
4. Make your changes, write tests, and ensure the code passes.
5. Submit a pull request for review.

---

## Testing

To test the package locally:

1. Install development dependencies:
   ```bash
   pip install pytest
   ```
2. Run tests:
   ```bash
   pytest
   ```

---

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

---

## Author

- **Gaurav Jaiswal**  
  [GitHub](https://github.com/Gaurav-Jaiswal-1)  

