Kaypoh Aunty is a web application that automatically scrapes and classifies Google Maps reviews for any business or location. The name "Kaypoh Aunty" is a playful nod to the Singaporean colloquial term for a nosy, inquisitive person, reflecting the application's ability to thoroughly examine and understand user-generated content.
Enter the name of any business, landmark, or location you want to analyze into the search bar.
The backend then calls the Apify API, which searches for the specified location on Google Maps and scrapes its reviews in real time.
As reviews are collected, each one is passed through the Kaypoh Aunty classification engine using our two-stage process.
The application displays the scraped reviews along with the labels assigned by the classifier, allowing for immediate analysis and filtering.
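As a rough illustration of the filtering step, the sketch below assumes each displayed review is an object holding its text and an array of labels. The object shape, the sample reviews, and the helper names `filterByLabel` and `countByLabel` are hypothetical; only the label names "Useful" and "Advertisement" come from this page.

```javascript
// Hypothetical shape of a classified review: the text plus the labels
// assigned by the classifier (a review may carry more than one label).
const classifiedReviews = [
  { text: "Great laksa, friendly staff, will return.", labels: ["Useful"] },
  { text: "Visit www.example.com for 50% off!!!", labels: ["Advertisement"] },
  { text: "Good food. Call 6123 4567 to order today!", labels: ["Useful", "Advertisement"] },
];

// Keep only the reviews that carry the requested label.
function filterByLabel(reviews, label) {
  return reviews.filter((review) => review.labels.includes(label));
}

// Count how many reviews carry each label
// (a multi-label review is counted once per label).
function countByLabel(reviews) {
  const counts = {};
  for (const review of reviews) {
    for (const label of review.labels) {
      counts[label] = (counts[label] || 0) + 1;
    }
  }
  return counts;
}

console.log(filterByLabel(classifiedReviews, "Advertisement").length); // 2
console.log(countByLabel(classifiedReviews)); // { Useful: 2, Advertisement: 2 }
```

Because labels are stored as an array, a review that is both helpful and promotional shows up under both filters.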
Kaypoh Aunty classifies reviews into the following categories. A single review can be assigned multiple labels if it meets the criteria for different categories:
Identifies promotional content, URLs, phone numbers, and marketing language with a call to action.
Filters out undesirable content with excessive punctuation, caps, spam-like usernames, or generic text.
Catches feedback from users who may not have had a firsthand experience, using phrases like "never visited" or hearsay language.
Flags reviews that are nonsensical or extremely short, consist of test content or standalone questions, or contain gibberish.
Recognizes detailed, helpful reviews with specific details, recommendation language, and balanced opinions.
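To make the multi-label idea concrete, here is a minimal sketch of how pattern-based checks for some of these categories might look. Everything in it is an assumption made for illustration: the regular expressions, the word-count threshold, and every label name except "Advertisement" are placeholders, not the application's actual rule set.

```javascript
// Assumed patterns, for illustration only.
const URL_PATTERN = /(https?:\/\/|www\.)\S+/i;
const PHONE_PATTERN = /\+?\d[\d\s-]{6,}\d/;
const REPEATED_PUNCTUATION = /[!?]{3,}/;
const ALL_CAPS_RUN = /\b[A-Z]{2,}(?:\s+[A-Z]{2,}){2,}\b/;
const NO_VISIT_PHRASES = /\b(never (visited|been)|haven't (visited|been)|i heard)\b/i;

// Number of whitespace-separated words in the text.
const wordCount = (text) => text.trim().split(/\s+/).filter(Boolean).length;

// One entry per category; label names other than "Advertisement" are placeholders.
const RULES = [
  { label: "Advertisement", matches: (t) => URL_PATTERN.test(t) || PHONE_PATTERN.test(t) },
  { label: "Spam", matches: (t) => REPEATED_PUNCTUATION.test(t) || ALL_CAPS_RUN.test(t) },
  { label: "NoVisit", matches: (t) => NO_VISIT_PHRASES.test(t) },
  { label: "LowQuality", matches: (t) => wordCount(t) < 3 },
];

// Return every label whose rule fires; an empty array means no rule matched.
function applyRules(text) {
  return RULES.filter((rule) => rule.matches(text)).map((rule) => rule.label);
}

console.log(applyRules("Visit www.example.com now!!!")); // [ 'Advertisement', 'Spam' ]
console.log(applyRules("The chicken rice was tender and the queue moved fast.")); // []
```

Each rule is evaluated independently, which is what lets a single review collect several labels at once.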
To ensure both speed and accuracy, every review is processed through a two-stage pipeline:
A fast, rule-based classifier written in JavaScript runs first, designed to quickly catch obvious cases using predefined patterns and keywords.
If a review doesn't match predefined rules, it's sent to a fine-tuned DistilBERT model hosted on Hugging Face for nuanced analysis.
DistilBERT is a distilled version of BERT that its authors report to be about 40% smaller and 60% faster while retaining roughly 97% of BERT's language understanding performance, which makes it well suited to web applications.
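The two-stage flow described above can be sketched as a simple fallback: try the rules first and consult the model only when no rule fires. The function names `ruleStage` and `classifyReview`, the single URL rule, and the stand-in model are assumptions for illustration. The model is passed in as a plain function so the sketch does not depend on any particular hosting API; a real call to a hosted model would be asynchronous.

```javascript
// Stage 1: a tiny stand-in rule set (one assumed pattern, for illustration only).
function ruleStage(text) {
  const labels = [];
  if (/(https?:\/\/|www\.)\S+/i.test(text)) labels.push("Advertisement");
  return labels;
}

// Two-stage classification: use the rule result when any rule fires,
// otherwise defer to the model function supplied by the caller.
function classifyReview(text, modelStage) {
  const ruleLabels = ruleStage(text);
  if (ruleLabels.length > 0) {
    return { labels: ruleLabels, stage: "rules" };
  }
  return { labels: modelStage(text), stage: "model" };
}

// Stand-in for the fine-tuned DistilBERT call; it always answers "Useful".
const fakeModel = (text) => ["Useful"];

console.log(classifyReview("Visit www.example.com today", fakeModel).stage); // rules
console.log(classifyReview("The chicken rice was tender.", fakeModel).stage); // model
```

Returning the stage alongside the labels makes it easy to see how often the slower model path is actually taken.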
The model was trained on a large dataset of Google local reviews from the UC San Diego McAuley Lab. To give the model more context, structured data like review ratings and photo presence were engineered into the text:
"Very good place nice things... [SEP] rating:5.0 has_pics:0"
The initial data distribution was extremely imbalanced: 3,910 "Useful" samples but only 4 "Advertisement" samples. A model trained on data this skewed would see too few minority-class examples to learn to recognize them.
A large language model (LLM) was used to generate high-quality, realistic synthetic data for the underrepresented categories, resulting in a roughly balanced dataset of ~3,900-4,000 samples per category.
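To put the imbalance in numbers, the sketch below computes the ratio between the two class counts quoted above and how many synthetic samples each class would need to reach a chosen target. The helper names and the target of 3,910 are assumptions for illustration; counts for the other categories are not given here, so they are left out.

```javascript
// Class counts quoted in the text; counts for other categories are not given.
const counts = { Useful: 3910, Advertisement: 4 };

// Ratio between the largest and the smallest class.
function imbalanceRatio(classCounts) {
  const values = Object.values(classCounts);
  return Math.max(...values) / Math.min(...values);
}

// For each class, how many synthetic samples would bring it up to `target`.
function syntheticNeeded(classCounts, target) {
  const needed = {};
  for (const [label, count] of Object.entries(classCounts)) {
    needed[label] = Math.max(0, target - count);
  }
  return needed;
}

console.log(imbalanceRatio(counts)); // 977.5
console.log(syntheticNeeded(counts, 3910)); // { Useful: 0, Advertisement: 3906 }
```

With these figures, almost the entire "Advertisement" class would have to come from generated samples.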
Experience the power of intelligent review classification for yourself!