Introduction
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. NLP powers chatbots, translation services, sentiment analysis, and large language models like GPT and Claude.
Why NLP is Challenging
Human language is:
- Ambiguous: "I saw the man with the telescope" - Who has the telescope?
- Context-dependent: "Bank" means different things in different contexts
- Evolving: New words, slang, and meanings constantly emerge
- Nuanced: Sarcasm, irony, and tone are hard to detect
NLP Pipeline
Raw Text
    │
    ▼
┌─────────────────┐
│ Preprocessing   │
│ - Tokenization  │
│ - Normalization │
│ - Stop words    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Feature Extract │
│ - Embeddings    │
│ - TF-IDF        │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Model/Task      │
│ - Classification│
│ - Generation    │
│ - Translation   │
└────────┬────────┘
         │
         ▼
      Output
Text Preprocessing
1. Tokenization
Breaking text into smaller units (tokens).
Word Tokenization:
Input: "The quick brown fox jumps."
Output: ["The", "quick", "brown", "fox", "jumps", "."]
Subword Tokenization (BPE, WordPiece):
Input: "unhappiness"
Output: ["un", "##happiness"] or ["un", "happi", "ness"]
Modern LLMs use subword tokenization to handle rare and unseen words.
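A minimal sketch of both approaches, assuming Python with the Hugging Face transformers package available for the subword half (exact subword splits depend on the tokenizer's vocabulary):

```python
# Word tokenization with a simple regex; production tokenizers handle
# contractions, URLs, and other edge cases this pattern does not.
import re

text = "The quick brown fox jumps."
print(re.findall(r"\w+|[^\w\s]", text))
# -> ['The', 'quick', 'brown', 'fox', 'jumps', '.']

# Subword tokenization with a pre-trained WordPiece vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unhappiness"))
# e.g. ['un', '##hap', '##pi', '##ness']
```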
2. Normalization
| Technique | Example |
|-----------|---------|
| Lowercasing | "Hello World" → "hello world" |
| Stemming | "running", "runs" → "run" (rule-based suffix stripping; misses irregular forms like "ran") |
| Lemmatization | "better" → "good", "ran" → "run" (dictionary-based) |
| Removing punctuation | "Hello!" → "Hello" |
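A short sketch with NLTK, one common choice (assumes nltk is installed and the WordNet corpus has been downloaded):

```python
# Stemming vs. lemmatization with NLTK. Requires: pip install nltk,
# then nltk.download("wordnet") for the lemmatizer's dictionary.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "runs", "ran"]])
# -> ['run', 'run', 'ran']  (crude suffix rules: "ran" is left untouched)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # -> 'good'
print(lemmatizer.lemmatize("ran", pos="v"))     # -> 'run'
```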
3. Stop Words Removal
Common words with little meaning: "the", "is", "at", "which"
Before: "The cat is sitting on the mat"
After: "cat sitting mat"
Note: Modern deep learning models often keep stop words, since word order and function words carry signal the model can use.
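A minimal sketch with a hand-rolled stop list (illustrative only; NLTK and spaCy ship much fuller lists):

```python
# Remove common function words from a sentence.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an"}

def remove_stop_words(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words("The cat is sitting on the mat"))
# -> 'cat sitting mat'
```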
Text Representation
1. Bag of Words (BoW)
Counts word occurrences, ignores order.
Doc1: "I love cats"
Doc2: "I love dogs"
Vocabulary: [I, love, cats, dogs]
Doc1 vector: [1, 1, 1, 0]
Doc2 vector: [1, 1, 0, 1]
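The same vectors with scikit-learn (assumed installed). CountVectorizer lowercases and sorts the vocabulary alphabetically, so the columns are ordered differently from the hand-built example:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love cats", "I love dogs"]
# The default token pattern drops one-character tokens like "I"; widen it.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # ['cats' 'dogs' 'i' 'love']
print(X.toarray())                         # [[1 0 1 1]
                                           #  [0 1 1 1]]
```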
Limitations:
- Ignores word order
- No semantic understanding
- High dimensionality
2. TF-IDF (Term Frequency-Inverse Document Frequency)
Weights words by importance:
- TF: How often a word appears in a document
- IDF: How rare a word is across all documents
TF-IDF(t, d) = TF(t, d) × log(N / df(t))
where:
- TF(t, d) = frequency of term t in document d
- N = total number of documents
- df(t) = number of documents containing term t
Common words get lower scores; rare but informative words get higher scores.
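A direct implementation of the formula above (an illustrative sketch; scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF term):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf_idf(term, doc, all_docs):
    tf = doc.split().count(term)                   # term frequency in document
    df = sum(term in d.split() for d in all_docs)  # documents containing term
    return tf * math.log(len(all_docs) / df) if df else 0.0

print(tf_idf("the", docs[0], docs))  # 2 * log(3/2) ≈ 0.81: common, low weight
print(tf_idf("cat", docs[0], docs))  # 1 * log(3/1) ≈ 1.10: rare, high weight
```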
3. Word Embeddings
Dense vector representations that capture semantic meaning.
Word2Vec:
- Words with similar meanings have similar vectors
- Captures relationships: king - man + woman ≈ queen
"king" - "man" + "woman" = "queen"
[0.2] [0.1] [0.3] [0.4]
[0.8] - [0.2] + [0.5] ≈ [1.1]
[0.3] [0.4] [0.2] [0.1]
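This analogy can be checked with gensim's pre-trained vectors (a sketch, assuming gensim is installed; the small GloVe model is downloaded on first use):

```python
import gensim.downloader as api

# Small pre-trained GloVe vectors (~66 MB download on first run).
model = api.load("glove-wiki-gigaword-50")
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', 0.85...)]
```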
GloVe (Global Vectors):
- Trained on word co-occurrence statistics
- Pre-trained embeddings widely available
Contextual Embeddings (BERT, GPT):
- Same word gets different embeddings based on context
- "bank" near "river" vs "bank" near "money"
NLP Tasks
1. Text Classification
Assign categories to text.
Applications:
- Spam detection
- Sentiment analysis (positive/negative/neutral)
- Topic classification
- Intent detection in chatbots
Example - Sentiment Analysis:
Input: "This product is amazing! Best purchase ever."
Output: Positive (confidence: 0.95)
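The same example via the transformers sentiment pipeline (assumes the library is installed; the default English model is downloaded on first use):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("This product is amazing! Best purchase ever."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```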
2. Named Entity Recognition (NER)
Identify and classify named entities.
Input: "Apple CEO Tim Cook announced new products in Cupertino."
Output:
- Apple → ORGANIZATION
- Tim Cook → PERSON
- Cupertino → LOCATION
Entity Types:
- PERSON, ORGANIZATION, LOCATION
- DATE, TIME, MONEY
- PRODUCT, EVENT
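The example above, reproduced with spaCy, one common choice (assumes the en_core_web_sm model was fetched with `python -m spacy download en_core_web_sm`; spaCy's label names differ slightly from the generic types listed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple CEO Tim Cook announced new products in Cupertino.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. Apple -> ORG, Tim Cook -> PERSON, Cupertino -> GPE
```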
3. Machine Translation
Translate between languages.
Input (English): "The weather is nice today."
Output (Spanish): "El clima está agradable hoy."
Evolution:
- Rule-based (dictionaries + grammar rules)
- Statistical (phrase-based, learned from parallel texts)
- Neural (sequence-to-sequence, attention, transformers)
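A sketch of the neural approach using a pre-trained open MarianMT model via transformers (assumes the library is installed; the model is downloaded on first use):

```python
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
print(translator("The weather is nice today."))
# e.g. [{'translation_text': 'El tiempo es agradable hoy.'}]
```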
4. Question Answering
Extract or generate answers from text.
Extractive QA:
Context: "Paris is the capital of France. It has the Eiffel Tower."
Question: "What is the capital of France?"
Answer: "Paris" (extracted from context)
Generative QA:
Question: "Explain photosynthesis."
Answer: Generated explanation (not extracted)
5. Text Summarization
Condense long text into shorter summary.
- Extractive: select the most important sentences from the source text
- Abstractive: generate new text that paraphrases the source
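An abstractive sketch via the transformers summarization pipeline (assumes the library is installed; the passage is made up for illustration):

```python
from transformers import pipeline

summarizer = pipeline("summarization")
article = (
    "Paris is the capital of France and its largest city. It is known for "
    "landmarks such as the Eiffel Tower and the Louvre, and millions of "
    "tourists visit every year."
)
print(summarizer(article, max_length=30, min_length=5, do_sample=False))
# e.g. [{'summary_text': 'Paris is the capital of France...'}]
```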
6. Text Generation
Generate coherent text from prompts.
Applications:
- Chatbots and virtual assistants
- Content creation
- Code generation
- Creative writing
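A minimal sketch with GPT-2, a small open model that runs on CPU (assumes transformers is installed; output varies between runs):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time", max_new_tokens=30, num_return_sequences=1))
```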
Large Language Models (LLMs)
What are LLMs?
- Neural networks trained on massive text data
- Billions of parameters
- Understand and generate human-like text
Key Characteristics:
| Feature | Description |
|---------|-------------|
| Scale | Billions of parameters (GPT-4 reportedly ~1.7T) |
| Training Data | Internet-scale text (books, websites, code) |
| Capabilities | Few-shot learning, reasoning, coding |
| Architecture | Transformer-based |
How LLMs Work:
- Pre-training: Learn language patterns from huge datasets
- Fine-tuning: Adapt to specific tasks
- RLHF: Align with human preferences
Prompt Engineering:
Getting the best results from LLMs:
Poor prompt: "Write about dogs"
Better prompt: "Write a 200-word informative paragraph
about the history of dog domestication, suitable for
a middle school science class."
Techniques:
- Zero-shot: Direct question
- Few-shot: Include examples
- Chain-of-thought: "Think step by step"
- System prompts: Set behavior/persona
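The techniques above as plain prompt strings (the review texts and arithmetic problem are made up for illustration; any chat-style LLM API would accept similar text):

```python
# Zero-shot: a direct instruction with no examples.
zero_shot = "Classify the sentiment of: 'The service was slow and rude.'"

# Few-shot: include labeled examples so the model infers the pattern.
few_shot = """Classify the sentiment of each review.
Review: 'Great value, arrived fast.' -> Positive
Review: 'Broke after one day.' -> Negative
Review: 'The service was slow and rude.' ->"""

# Chain-of-thought: ask the model to reason before answering.
chain_of_thought = (
    "A store has 12 apples, sells 5, then receives 8 more. "
    "How many apples does it have now? Think step by step."
)
```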
Cloud NLP Services
Azure:
- Azure AI Language: NER, sentiment, summarization
- Azure OpenAI Service: GPT-4, embeddings
- Azure Translator: 100+ languages
- Azure Bot Service: Conversational AI
AWS:
- Amazon Comprehend: NLP analysis
- Amazon Translate: Neural translation
- Amazon Lex: Chatbot building
- Amazon Bedrock: Claude, Titan, etc.
Google Cloud:
- Cloud Natural Language API: Entity, sentiment
- Cloud Translation: 100+ languages
- Dialogflow: Conversational agents
- Vertex AI: Custom NLP models
Exam Tips
Common exam questions test:
- Choosing right NLP service for a task
- Understanding tokenization approaches
- Embeddings vs bag-of-words
- Extractive vs abstractive summarization
- LLM capabilities and limitations
Watch for keywords:
- "Understand customer feedback" → Sentiment analysis
- "Extract names and places" → NER
- "Summarize documents" → Text summarization
- "Build a chatbot" → Conversational AI
- "Translate content" → Machine translation
Key Takeaway
NLP has evolved from rule-based systems to powerful neural models that can understand and generate human language. Modern LLMs represent a paradigm shift, enabling few-shot learning and general-purpose language understanding. Understanding NLP fundamentals helps you choose the right approach and service for language-related AI tasks.
