Commit 18940bc

Fixed issue 4. Huge bug in profanities list, ValX v0.2.6 update.
1 parent 2a828de commit 18940bc

2 files changed

Lines changed: 285 additions & 239 deletions

File tree

README.md

Lines changed: 57 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
# ValX
2+
23
![Python Version](https://img.shields.io/badge/python-3.12-blue.svg)
34
[![Code Size](https://img.shields.io/github/languages/code-size/infinitode/valx)](https://github.com/infinitode/valx)
45
![Downloads](https://pepy.tech/badge/valx)
@@ -16,18 +17,26 @@ An open-source Python library for data cleaning tasks. It includes functions for
1617
> [!NOTE]
1718
> ValX will automatically install a version of `scikit-learn` that is compatible with your device if you don't have one already.
1819
20+
## Changes in 0.2.6
21+
22+
ValX v0.2.6 fixes a major bug (see https://github.com/Infinitode/ValX/issues/4) in which profanity lists for multiple languages were missing, filed under the wrong language, or incomplete.
23+
24+
Version 0.2.6 fixes this using the original language list data and adds new language handling:
25+
26+
- Case insensitivity for language selection: "English", "EN", "en", or variants like "enGliSh" will all work for language selection in the `detect_profanity` and `remove_profanity` functions.
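
A minimal sketch of how case-insensitive language selection could be resolved, assuming a lookup table keyed by lowercased names and codes. The `LANGUAGE_ALIASES` table and the `resolve_language` helper are hypothetical illustrations, not part of ValX's API, and only a few entries from the supported-languages list are shown.

```python
# Hypothetical alias table: lowercased name or code -> canonical language name.
LANGUAGE_ALIASES = {
    "all": "All",
    "english": "English",
    "en": "English",
    "german": "German",
    "de": "German",
    "french (ca)": "French (CA)",
    "fr-ca-u-sd-caqc": "French (CA)",
}


def resolve_language(language):
    """Return the canonical language name for a name or code, ignoring case."""
    if language is None:
        # The 0.2.5 notes use None for "custom list only", so pass it through.
        return None
    key = language.strip().lower()
    try:
        return LANGUAGE_ALIASES[key]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None


for choice in ("English", "EN", "en", "enGliSh"):
    print(choice, "->", resolve_language(choice))
```

Lowercasing the input once and keeping every key lowercase means mixed-case variants need no special handling, and unknown values fail loudly instead of silently selecting the wrong list.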
27+
1928
## Changes in 0.2.5
2029

2130
ValX v0.2.5 introduces enhanced flexibility for profanity filtering by adding support for custom profanity lists:
2231

23-
- **Custom Profanity Word Lists**: Users can now provide their own lists of profane words directly as Python lists to the `detect_profanity` and `remove_profanity` functions via the new `custom_words_list` parameter.
24-
- **Standalone Custom Lists**: Utilize your custom profanity list exclusively by setting the `language` parameter to `None`. ValX will then only use the words provided in `custom_words_list`.
25-
- **Combined Lists**: Use a custom list in conjunction with ValX's built-in language-specific wordlists. Simply provide both a `language` (e.g., "English") and your `custom_words_list`. ValX will use the combined set of words.
26-
- **Loading Custom Lists from File**: A new helper function, `load_custom_profanity_from_file(filepath)`, allows you to easily load custom profanity words from a text file.
27-
- **File Format**: The file should contain one profanity word per line.
28-
- Lines starting with a hash symbol (`#`) are treated as comments and ignored.
29-
- Empty lines or lines containing only whitespace are also ignored.
30-
- **Updated Detection Reporting**: The `detect_profanity` function's output now specifies the source of detected profanity more clearly (e.g., "Custom", "Custom + English").
32+
- **Custom Profanity Word Lists**: Users can now provide their own lists of profane words directly as Python lists to the `detect_profanity` and `remove_profanity` functions via the new `custom_words_list` parameter.
33+
- **Standalone Custom Lists**: Utilize your custom profanity list exclusively by setting the `language` parameter to `None`. ValX will then only use the words provided in `custom_words_list`.
34+
- **Combined Lists**: Use a custom list in conjunction with ValX's built-in language-specific wordlists. Simply provide both a `language` (e.g., "English") and your `custom_words_list`. ValX will use the combined set of words.
35+
- **Loading Custom Lists from File**: A new helper function, `load_custom_profanity_from_file(filepath)`, allows you to easily load custom profanity words from a text file.
36+
- **File Format**: The file should contain one profanity word per line.
37+
- Lines starting with a hash symbol (`#`) are treated as comments and ignored.
38+
- Empty lines or lines containing only whitespace are also ignored.
39+
- **Updated Detection Reporting**: The `detect_profanity` function's output now specifies the source of detected profanity more clearly (e.g., "Custom", "Custom + English").
3140

3241
These features give users greater control over the profanity filtering process, allowing for more tailored and specific use cases.
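
The file format described above (one word per line, `#` comments, blank lines ignored) can be sketched as a small parser. This is not ValX's `load_custom_profanity_from_file` implementation; `read_word_list` is a hypothetical stand-in, and it assumes that leading whitespace before a `#` still marks a comment.

```python
import os
import tempfile


def read_word_list(filepath):
    """Parse a word-list file: one word per line, '#' comments and blank lines skipped."""
    words = []
    with open(filepath, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Assumption: whitespace is stripped before the comment check.
            if not line or line.startswith("#"):
                continue
            words.append(line)
    return words


# Write a small sample file and parse it back.
sample = "# my custom list\nfoo\n\n   \nbar\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    custom_words = read_word_list(path)
    print(custom_words)
finally:
    os.remove(path)
```

The resulting list has the same shape as the Python list that the notes above say `custom_words_list` accepts.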
3342

@@ -42,6 +51,7 @@ We've also removed `scikit-learn==1.2.2` as a dependency, as most versions of `s
4251
We have introduced a new optional `info_type` parameter to the `detect_sensitive_information` and `remove_sensitive_information` functions, giving you fine-grained control over which types of sensitive information are detected or removed.
4352

4453
We have also introduced detection patterns for more types of sensitive information, including:
54+
4555
- `"iban"`: International Bank Account Number.
4656
- `"mrn"`: Medical Record Number (may not work correctly, depending on provider and country).
4757
- `"icd10"`: International Classification of Diseases, Tenth Revision.
@@ -54,6 +64,7 @@ Also introduced more detection patterns for other types of sensitive information
5464
## Changes in 0.2.2
5565

5666
We have refactored and changed the `detect_profanity` function:
67+
5768
- Removed unnecessary printing
5869
- Now returns more information about each found profanity, including `Line`, `Column`, `Word`, and `Language`.
5970

@@ -95,36 +106,65 @@ Please ensure that you have one of these Python versions installed before using
95106
- **Remove Hate Speech**: Remove hate speech or offensive speech in text, using AI.
96107

97108
### List of supported languages for profanity detection and removal
109+
98110
Below is the complete list of languages supported by ValX's profanity detection and removal functions. Each entry is a valid value for `language`:
99111

100-
- **All**
112+
- All
101113
- Arabic
114+
- AR
102115
- Czech
116+
- CS
103117
- Danish
118+
- DA
104119
- German
120+
- DE
105121
- English
122+
- EN
106123
- Esperanto
124+
- EO
107125
- Persian
108126
- Finnish
127+
- FI
109128
- Filipino
129+
- FIL
110130
- French
131+
- FR
111132
- French (CA)
133+
- FR-CA-U-SD-CAQC
112134
- Hindi
135+
- HI
113136
- Hungarian
137+
- HU
114138
- Italian
139+
- IT
115140
- Japanese
141+
- JA
116142
- Kabyle
143+
- KAB
117144
- Korean
145+
- KO
118146
- Dutch
147+
- NL
119148
- Norwegian
149+
- NO
120150
- Polish
151+
- PL
121152
- Portuguese
153+
- PT
122154
- Russian
155+
- RU
156+
- Spanish
157+
- ES
123158
- Swedish
159+
- SV
124160
- Thai
161+
- TH
125162
- Klingon
163+
- TLH
126164
- Turkish
165+
- TR
127166
- Chinese
167+
- ZH
128168

129169
## Usage
130170

@@ -214,15 +254,15 @@ print(results_file_only)
214254
**Output Format for `detect_profanity`**
215255

216256
The `detect_profanity` function returns a list of dictionaries. Each dictionary includes:
257+
217258
- `"Line"`: The line number (1-indexed).
218259
- `"Column"`: The column number (1-indexed) where the profanity starts.
219260
- `"Word"`: The detected profanity word.
220261
- `"Language"`: Indicates the source of the word list:
221-
- `<LanguageName>` (e.g., "English"): If only a built-in language list was used.
222-
- `"Custom"`: If `language=None` and only a `custom_words_list` was used.
223-
- `"Custom + <LanguageName>"` (e.g., "Custom + English"): If both a built-in list and `custom_words_list` were used.
224-
- `"Custom + All"`: If `language='All'` and `custom_words_list` were used.
225-
262+
- `<LanguageName>` (e.g., "English"): If only a built-in language list was used.
263+
- `"Custom"`: If `language=None` and only a `custom_words_list` was used.
264+
- `"Custom + <LanguageName>"` (e.g., "Custom + English"): If both a built-in list and `custom_words_list` were used.
265+
- `"Custom + All"`: If `language='All'` and `custom_words_list` were used.
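
As an illustration of consuming this output format, the sketch below groups detections by line number. The `sample_results` list is hand-written in the documented shape and is not real ValX output; `group_by_line` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written sample in the documented shape; not real ValX output.
sample_results = [
    {"Line": 1, "Column": 9, "Word": "foo", "Language": "Custom + English"},
    {"Line": 3, "Column": 1, "Word": "bar", "Language": "Custom + English"},
    {"Line": 3, "Column": 14, "Word": "foo", "Language": "Custom + English"},
]


def group_by_line(results):
    """Map each 1-indexed line number to the (column, word) hits found on it."""
    hits = defaultdict(list)
    for entry in results:
        hits[entry["Line"]].append((entry["Column"], entry["Word"]))
    return dict(hits)


for line_no, found in sorted(group_by_line(sample_results).items()):
    print(f"line {line_no}: {found}")
```

Because `Line` and `Column` are both 1-indexed, subtract 1 from each before using them to index into Python lists or strings.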
226266

227267
**4. Removing Profanity**
228268

@@ -283,6 +323,7 @@ outcome_of_detection = detect_hate_speech("You are stupid.")
283323

284324
> [!IMPORTANT]
285325
> The model's possible outputs are:
326+
>
286327
> - `['Hate Speech']`: The text was flagged and contained hate speech.
287328
> - `['Offensive Speech']`: The text was flagged and contained offensive speech.
288329
> - `['No Hate and Offensive Speech']`: The text was not flagged for any hate speech or offensive speech.
@@ -299,7 +340,9 @@ Contributions are welcome! If you encounter any issues, have suggestions, or wan
299340
ValX is released under the terms of the **MIT License (Modified)**. Please see the [LICENSE](https://github.com/infinitode/valx/blob/main/LICENSE) file for the full text.
300341

301342
### Derived licenses
343+
302344
---
345+
303346
ValX uses data from this GitHub repository:
304347
https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/
305348
© 2012-2020 Shutterstock, Inc.
