Natural Language Processing (NLP) refers to the methodologies employed for computers to understand human language, called "natural language", with all its complexity: ambiguity, polysemous words, and irony. Throughout its development, different techniques have led to the current Large Language Models (LLMs). The initial favorable outcomes employed language rules and One-Hot vectors, later integrating different approaches within neural network connections, and finally adding an attention layer to figure out the most important correlations between words.
- The Beginning
- Preprocessing Techniques
- Natural Language Processing
- $${\color{orange}Bag \space of \space Words \space (BoW)}$$
- $${\color{orange}Term \space Frequency \space - \space Inverse \space Document \space Frequency \space (TF-IDF)}$$
- $${\color{orange}Language \space Models}$$ $${\color{lightskyblue}N-Grams}$$
- $${\color{orange}Word2Vec}$$ $${\color{lightskyblue}Neural \space Networks}$$ $${\color{lightskyblue}Continuous \space Bag \space of \space Words \space (CBOW)}$$ $${\color{lightskyblue}Skip-Gram}$$
- $${\color{orange}Transformers}$$ $${\color{lightskyblue}Seq2Seq \space (Encoder \space - \space Decoder)}$$
- Video
The start of natural language processing dates back to the Cold War, when translating Russian military documents, papers, and literature was essential for the United States to keep up to date. On January 7th, 1954, IBM and Georgetown University ran the "Georgetown-IBM experiment", which used carefully selected phrases to enhance results; nevertheless, outside those sentences the system struggled to capture context, eventually leading research funding to cease.
ELIZA, released in 1966 by Joseph Weizenbaum, was the first natural language processing program. It employed basic queries against a script of patterns combined with simple language rules, producing the first favorable outcomes: it tricked people at the time into thinking they were talking with a human being instead of a machine, evoking the Turing test.
Human: My boyfriend made me come.
ELIZA: Your boyfriend made you come?
Human: He says I'm depressed most of the time.
ELIZA: I’m sorry to hear that you are depressed.
Human: It’s true. I am unhappy.
ELIZA: Do you think coming here will help you not be unhappy?
“The original question, ‘Can machines think?’ I believe to be too meaningless to deserve discussion.”
— Alan Turing
The complexity of natural language processing lies in producing a reliable numeric representation of text, called an embedding. To make this tractable, preprocessing techniques are applied after segmenting text into individual units named "tokens". A successful tokenization process should be able to adjust and clean the data, split off punctuation marks, and normalize grammar.
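As a minimal sketch of the tokenization step described above, the following regex-based tokenizer (a toy assumption, not any particular library's implementation) lowercases the text and keeps only word-like runs, dropping punctuation:

```python
import re

def tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercases and splits off punctuation marks."""
    # Keeps runs of letters, digits, and apostrophes; everything else separates tokens
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Machines can't think, can they?"))
# ['machines', "can't", 'think', 'can', 'they']
```

Real tokenizers (e.g. subword tokenizers used by modern LLMs) are far more elaborate, but the cleaning and splitting idea is the same.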
- Stemming: Strips word suffixes using heuristic rules.
- Lemmatization: Removes word affixes, returning a standard, normalized form (the lemma).
Note
Both techniques reduce word sparsity by collapsing derived word forms, therefore shrinking dimensions. Search engines applied these techniques to improve result quality.
| Original Word | Stemming | Lemmatization |
|---|---|---|
| running | run | run |
| studies | studi | study |
| better | better | good |
| leaves | leav | leave |
| cars | car | car |
| went | went | go |
| flying | fli | fly |
| happiness | happi | happy |
| cats | cat | cat |
| playing | play | play |
Important
Stemming has a lower computational complexity; however, its results are not guaranteed to be good enough, struggling with words that dramatically change their structure under conjugation, such as the verb "to be". On the other hand, lemmatization uses more computational power to achieve its result.
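The contrast above can be sketched in a few lines. This is a deliberately naive suffix stripper (not the Porter algorithm) plus a tiny hand-made lemma lookup table; real lemmatizers rely on a full vocabulary and part-of-speech information, so both tables here are illustrative assumptions:

```python
def stem(word: str) -> str:
    """Toy stemmer: strip a common suffix, then collapse a doubled consonant."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            if word[-1] == word[-2]:   # e.g. "runn" -> "run"
                word = word[:-1]
            return word
    return word

# Hypothetical mini-dictionary; a real lemmatizer (e.g. WordNet-based) has thousands of entries
LEMMAS = {"better": "good", "went": "go", "studies": "study"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word, stem(word))

print(stem("running"))     # run
print(stem("leaves"))      # leav  (cheap rules can produce non-words)
print(lemmatize("better")) # good  (irregular forms need the dictionary)
```

Notice the trade-off the text describes: the rule-based stemmer is fast but can emit non-words like "leav", while the dictionary lookup gives clean lemmas at the cost of maintaining that dictionary.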
Natural language tends to be saturated with articles and connectors, which give structure and sense to sentences but do not carry additional information. These "stop words" are recommended to be removed when working with simple models such as One-Hot vectors.
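Stop-word removal is usually a simple set-membership filter. The word list below is a small assumption for illustration; real lists (such as NLTK's English stop-word list) contain over a hundred entries:

```python
# Hypothetical mini stop-word list; production lists are much longer
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "it"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that carry structure but little information."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "is", "in", "the", "garden"]))
# ['cat', 'garden']
```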
Preprocessing-Techniques-Pathinker.mp4
Natural Language Processing techniques aim to provide a useful numeric representation of text. Naturally, these representations need to correctly handle all the different meanings a word can have depending on its context, its semantics, which is not even static, since natural language and communication change as humanity develops.
The following techniques summarize strategies developed over time to find better data correlations, paving the way for modern-day transformers.
Note
You will see more refined techniques enhance their results by reusing previous ones with additional adjustments.
Given a set of texts or documents, defined as a corpus, Bag of Words returns a single vector representation named One-Hot by tracking only the appearance (binary) or the frequency of words in a given group of sentences. Although simple, it provides a quick way to distinguish documents and perform feature extraction.
The main problem with Bag of Words is the lack of word order: both the binary and the frequency-based methods return the same One-Hot vector for sentences such as "I almost told him the truth" and "I told him almost the truth", whose meanings differ dramatically despite producing the same output. Other issues are limited semantics and insensitivity to grammar.
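The word-order problem is easy to demonstrate. The sketch below (a minimal frequency-based Bag of Words over a fixed vocabulary, written for illustration) produces identical vectors for the two example sentences:

```python
from collections import Counter

def bag_of_words(sentence: str, vocabulary: list[str]) -> list[int]:
    """Frequency-based Bag of Words: count each vocabulary word in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["i", "almost", "told", "him", "the", "truth"]
v1 = bag_of_words("I almost told him the truth", vocab)
v2 = bag_of_words("I told him almost the truth", vocab)
print(v1)        # [1, 1, 1, 1, 1, 1]
print(v1 == v2)  # True: word order is lost, both sentences collapse to one vector
```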
The frequency-based Bag of Words can therefore be upgraded with the following correlation: Term Frequency - Inverse Document Frequency.
Bag-of-Words-Pathinker.mp4
$${\color{orange}Term \space Frequency \space - \space Inverse \space Document \space Frequency \space (TF-IDF)}$$
Term Frequency-Inverse Document Frequency first computes, for each document, the term frequency: the number of appearances of each word relative to the document's length.

$$tf(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
Once completed, it computes the Inverse Document Frequency by comparing each word across all the documents through a logarithmic expression: the ratio inside the logarithm shrinks toward one the more documents the word appears in, so the result approaches zero. This method boosts words that are not likely to appear in the other documents of the given corpus.

$$idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

where $N$ is the number of documents in the corpus $D$.
Finally, the Term Frequency and Inverse Document Frequency results are multiplied, combining both properties: the largest weights go to words that appear often in a document but rarely in the others, extracting the most significant words, those that best describe the given text. This enables tasks such as document clustering, with the drawback of being sensitive to corpora with similar contents, where results become less significant.
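The whole pipeline can be sketched in plain Python. This assumes the common convention tf = count / document length and idf = log(N / document frequency); libraries such as scikit-learn use smoothed variants of these formulas:

```python
import math

def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """TF-IDF with tf = count/len(doc) and idf = log(N / doc_frequency)."""
    n_docs = len(corpus)
    vocabulary = {word for doc in corpus for word in doc}
    # Document frequency: in how many documents each word appears
    df = {w: sum(1 for doc in corpus if w in doc) for w in vocabulary}
    scores = []
    for doc in corpus:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
scores = tf_idf(corpus)
print(scores[0]["the"])  # 0.0: "the" appears in every document, log(3/3) = 0
print(scores[0]["sat"])  # > 0: "sat" appears only here, so it gets weight
```

This shows the behavior described above: a word present in every document scores exactly zero, while words unique to one document receive the largest weights.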