How Kaypoh Aunty Works

Kaypoh Aunty is a web application that automatically scrapes and classifies Google Maps reviews for any business or location. The name is a playful nod to "kaypoh", the Singaporean colloquial term for a nosy, inquisitive person, and reflects the application's job of examining user-generated content thoroughly.

User Workflow

Input a Name

Enter the name of any business, landmark, or location you want to analyze into the search bar.

Scrape Reviews via Apify

The backend calls the Apify API. Apify searches Google Maps for the specified location and scrapes its reviews in real time.
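
As a rough illustration of this step, the sketch below builds (but does not send) a request to Apify's synchronous "run actor and return dataset items" endpoint. The function name `buildApifyRequest`, the actor ID, and the input keys `query` and `maxReviews` are hypothetical placeholders; the real input shape depends on whichever Apify actor the backend uses.

```javascript
// Builds a request description for Apify's run-sync-get-dataset-items route.
// actorId and the input field names are assumptions made for illustration.
function buildApifyRequest(actorId, placeName, token, maxReviews = 50) {
  const url =
    "https://api.apify.com/v2/acts/" +
    encodeURIComponent(actorId) +
    "/run-sync-get-dataset-items";
  return {
    url,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      // Input keys are actor-specific; these two are hypothetical.
      body: JSON.stringify({ query: placeName.trim(), maxReviews }),
    },
  };
}

// Usage (performs a network call, so it is left commented out):
// const { url, options } = buildApifyRequest("user~actor", "Some Cafe", process.env.APIFY_TOKEN);
// const reviews = await (await fetch(url, options)).json();
```

Keeping the request construction separate from the network call makes the step easy to unit test without an Apify token.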

Classify Each Review

As reviews are collected, each one is passed through the Kaypoh Aunty classification engine using our two-stage process.

View Results

The application displays the scraped reviews along with the labels assigned by the classifier, allowing for immediate analysis and filtering.

Classification Categories

Kaypoh Aunty classifies reviews into the following categories. A single review can be assigned multiple labels if it meets the criteria for different categories:

🛍️ Advertisements

Identifies promotional content: URLs, phone numbers, and marketing language with a call to action.

🚫 Spam

Filters out low-quality content marked by excessive punctuation or capitalization, spam-like usernames, or generic text.

😤 Rant Without Visit

Catches feedback from users who may not have had a firsthand experience, using phrases like "never visited" or hearsay language.

❓ Irrelevant Content

Flags reviews that are nonsensical or extremely short, or that consist of test content, standalone questions, or gibberish.

✅ Useful Reviews

Recognizes helpful reviews that give specific details, use recommendation language, and offer balanced opinions.

Classification Engine

To balance speed and accuracy, every review is processed through a two-stage pipeline:

Stage 1: Rule-Based Classification

A fast rule-based classifier written in JavaScript runs first. It is designed to catch obvious cases quickly using predefined patterns and keywords.

Detection Triggers:

Advertisements: URLs, phone numbers, promotional language with a call to action
Spam: Excessive punctuation and capitalization, spam-like usernames, very short generic text
Rant Without Visit: Phrases like "never visited," hearsay language, lack of personal details
Irrelevant Content: Extremely short text, test content, standalone questions, gibberish/symbols
Useful Reviews: Detailed text, specific details mentioned, recommendation language, balanced opinion
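
The triggers above could be expressed as a small table of rules, each pairing a label with a test function. This is a minimal sketch only: every regular expression, keyword, and threshold below is an illustrative assumption, not the application's actual rule set, and username checks are left out because the sketch sees only the review text.

```javascript
// Share of letters that are uppercase; ignores very short texts.
function capsRatio(text) {
  const letters = text.replace(/[^a-z]/gi, "");
  if (letters.length < 10) return 0;
  return letters.replace(/[^A-Z]/g, "").length / letters.length;
}

// Hypothetical rule table: one entry per category.
const RULES = [
  {
    label: "Advertisement",
    test: (t) =>
      /https?:\/\/|www\./i.test(t) || // URL
      /\+?\d[\d\s-]{6,}\d/.test(t) || // phone-like run of digits
      /\b(call now|visit our|promo code|discount)\b/i.test(t),
  },
  {
    label: "Spam",
    test: (t) => /[!?]{4,}/.test(t) || capsRatio(t) > 0.7,
  },
  {
    label: "Rant Without Visit",
    test: (t) =>
      /\b(never (visited|been)|i heard|my friend (said|told))\b/i.test(t),
  },
  {
    label: "Irrelevant Content",
    test: (t) =>
      t.trim().length < 5 ||
      /^\s*test\s*$/i.test(t) ||
      /^[^a-z0-9]+$/i.test(t.trim()), // symbols only
  },
  {
    label: "Useful Review",
    test: (t) =>
      t.length >= 80 && /\b(recommend|worth|would return)\b/i.test(t),
  },
];

// Returns every label whose rule fires, so one review can get several.
function classifyByRules(text) {
  return RULES.filter((rule) => rule.test(text)).map((rule) => rule.label);
}
```

Because every rule is evaluated independently, a review such as an all-caps rant based on hearsay can receive both "Spam" and "Rant Without Visit". An empty result means no rule matched.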

Stage 2: AI Model Classification

If a review doesn't match any predefined rule, it is sent to a fine-tuned DistilBERT model hosted on Hugging Face for more nuanced analysis.
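
The hand-off between the two stages can be sketched as a small dispatcher. Both stages are passed in as functions, so the sketch assumes nothing about the real rule set or about the request format of the hosted model; the names `classifyReview`, `ruleStage`, and `modelStage` are hypothetical.

```javascript
// Two-stage dispatch: run the cheap rules first and call the model
// only when no rule fires.
// ruleStage(text)  -> array of labels (synchronous)
// modelStage(text) -> Promise resolving to an array of labels
async function classifyReview(text, ruleStage, modelStage) {
  const ruleLabels = ruleStage(text);
  if (ruleLabels.length > 0) {
    return { labels: ruleLabels, source: "rules" };
  }
  const modelLabels = await modelStage(text);
  return { labels: modelLabels, source: "model" };
}

// Usage with stand-in stages (a real modelStage would call the hosted model):
// const result = await classifyReview(
//   "Great value, friendly staff.",
//   (t) => (t.includes("http") ? ["Advertisement"] : []),
//   async () => ["Useful Review"]
// );
```

Returning the `source` alongside the labels makes it possible to measure how often the slower model stage is actually needed.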

Why DistilBERT?

DistilBERT is a distilled version of BERT. According to its authors, it is about 40% smaller and 60% faster than BERT while retaining 97% of its language understanding capabilities, which makes it a practical choice for a web application that classifies reviews on demand.

Model Training & Performance

Dataset & Feature Engineering

The model was trained on a large dataset of Google Local reviews published by the UC San Diego McAuley Lab. To give the model more context, structured fields such as the review rating and whether the review includes photos were appended to the review text after a [SEP] separator:

"Very good place nice things... [SEP] rating:5.0 has_pics:0"

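A model trained on this format would presumably need its inputs formatted the same way at prediction time. The helper below is a minimal sketch that reproduces the example string above; the function name `buildModelInput` and the field names `text`, `rating`, and `hasPics` are assumptions.

```javascript
// Appends structured fields to the review text in the format shown above.
// The shape of the review object is hypothetical.
function buildModelInput(review) {
  const rating = Number(review.rating).toFixed(1); // 5 -> "5.0"
  const hasPics = review.hasPics ? 1 : 0;
  return `${review.text} [SEP] rating:${rating} has_pics:${hasPics}`;
}

const example = buildModelInput({
  text: "Very good place nice things...",
  rating: 5,
  hasPics: false,
});
console.log(example);
// Very good place nice things... [SEP] rating:5.0 has_pics:0
```
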
The Challenge: Severe Class Imbalance

The initial label distribution was extremely imbalanced: 3,910 "Useful" samples but only 4 "Advertisement" samples. A model trained on this data would see too few minority-class examples to learn those categories reliably.

The Solution: LLM-Generated Synthetic Data

A Large Language Model was used to generate realistic synthetic reviews for the underrepresented categories, resulting in a roughly balanced dataset of about 3,900 to 4,000 samples per category.
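
To make the scale of the gap concrete, the sketch below computes the imbalance ratio and the number of synthetic samples each class would need to reach the size of the largest class. Only the two counts stated on this page are used; the helper names are hypothetical, and the balancing target is an assumption.

```javascript
// Ratio between the largest and smallest class.
function imbalanceRatio(counts) {
  const values = Object.values(counts);
  return Math.max(...values) / Math.min(...values);
}

// Number of extra samples each class needs to reach the target size
// (by default, the size of the largest class).
function synthesisPlan(counts, target = Math.max(...Object.values(counts))) {
  const plan = {};
  for (const [label, n] of Object.entries(counts)) {
    plan[label] = Math.max(0, target - n);
  }
  return plan;
}

const knownCounts = { Useful: 3910, Advertisement: 4 };
console.log(imbalanceRatio(knownCounts)); // 977.5
console.log(synthesisPlan(knownCounts)); // { Useful: 0, Advertisement: 3906 }
```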

Final Performance

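Overall F1-Score: 99.64%

The F1-score is the harmonic mean of precision and recall, so a high value means both are high.

Perfect Match Accuracy: 99.57%

The share of reviews for which the predicted label combination matches the true combination exactly.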
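
For readers who want to see how such multi-label metrics are computed, here is a minimal sketch. The page does not state how the F1-score was averaged, so micro-averaging is an assumption, and the tiny example data is invented purely to exercise the functions.

```javascript
// Fraction of reviews whose predicted label set equals the true set.
function exactMatchAccuracy(truth, pred) {
  let hits = 0;
  truth.forEach((labels, i) => {
    const a = [...new Set(labels)].sort().join("|");
    const b = [...new Set(pred[i])].sort().join("|");
    if (a === b) hits += 1;
  });
  return hits / truth.length;
}

// Micro-averaged F1: pool true/false positives and false negatives
// over all reviews and labels, then F1 = 2TP / (2TP + FP + FN).
function microF1(truth, pred) {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  truth.forEach((labels, i) => {
    const t = new Set(labels);
    const p = new Set(pred[i]);
    for (const label of p) {
      if (t.has(label)) tp += 1;
      else fp += 1;
    }
    for (const label of t) {
      if (!p.has(label)) fn += 1;
    }
  });
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

// Toy data: the second review is missing its "Spam" label.
const truthLabels = [["Spam"], ["Advertisement", "Spam"], ["Useful"]];
const predLabels = [["Spam"], ["Advertisement"], ["Useful"]];
console.log(microF1(truthLabels, predLabels)); // 6/7, about 0.857
console.log(exactMatchAccuracy(truthLabels, predLabels)); // 2/3, about 0.667
```

Exact match is the stricter of the two: one missing label makes the whole review count as wrong, while the F1-score still credits the labels that were predicted correctly.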

Ready to Try Kaypoh Aunty?

Experience the power of intelligent review classification for yourself!

Start Analyzing Reviews