AskSauron
Overview
AskSauron is a high-performance distributed search engine built in Java to support the full search lifecycle, from crawling and indexing to ranking and retrieval.
The system is powered by a custom Key-Value Store and Flame, a distributed processing framework built from scratch in Java and modeled after Apache Spark, designed to execute parallel dataflow jobs across the pipeline. The pipeline processed over 1 million crawled pages and produced an indexed corpus of about 520,000 validated documents.
Although the software architecture is fully distributed, deployment was consolidated onto a single high-performance AWS EC2 instance running multiple concurrent worker processes across the crawl, indexing, and ranking stages.
Results & Core Features
- Data Pipeline Scale: Processed over 1,000,000 raw pages, filtered down to a high-quality index of ~520,000 unique validated documents.
- High-Performance Frontend UX: Includes real-time search suggestions (binary search), spellcheck (Levenshtein distance), and a capped in-memory query cache to quickly serve repeated searches.
- Cached Page Access: Served stored snapshots of indexed pages so content remained viewable even if the original source was down or had changed.
- Search Result Preview Snippets: Built dynamic 350-character contextual previews from page content, with matched query terms highlighted for readability.
- Linguistic Normalization: Integrated Apache OpenNLP's PorterStemmer to stem words (e.g., "computing" and "computer" both reduce to the stem "comput") at both index and query time, substantially improving search recall.
- Resilient Infrastructure: Engineered checkpoint-based start/resume functionality across the Crawler, Indexer, and PageRank pipeline jobs to automatically recover from worker timeouts or Out-of-Memory (OOM) crashes.
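The spellcheck feature above is built on Levenshtein edit distance. The sketch below shows that building block with the classic dynamic-programming recurrence; the class and method names are illustrative, not the project's actual API.

```java
// Sketch of the spellcheck building block: classic DP Levenshtein distance.
// Class and method names are illustrative, not AskSauron's actual API.
public class SpellCheck {
    // Returns the minimum number of insertions, deletions, and
    // substitutions needed to turn string `a` into string `b`.
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;  // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;  // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

A spellchecker would suggest the dictionary word with the smallest distance to the mistyped query term, typically cutting off suggestions above a small threshold such as 2.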
System Architecture & Data Flow
The system operates through a coordinator-worker model using Flame, our Spark-like distributed processing engine. Flame provides RDD-style functional operators (map, flatMapToPair, foldByKey, join) that allow us to write large-scale jobs as parallel dataflow pipelines.
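Flame's API is custom, but its operator shape will be familiar from Spark. As an analogy only (this is not Flame's actual API), a flatMapToPair-then-foldByKey word count has the same dataflow shape as Java streams' flatMap followed by groupingBy/counting:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Analogy only: not Flame's actual API. The flatMap step plays the role of
// flatMapToPair (one line -> many (word, 1) pairs) and the grouping collector
// plays the role of foldByKey (sum counts per word).
public class DataflowAnalogy {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }
}
```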
- Distributed Crawler: Discovers and archives raw HTML into a persistent KVS table, filtering out infinite content traps as it extracts new URLs.
- Inverted Indexer: A multi-threaded Flame job that strips boilerplate tags, tokenizes text, and maps term positions to compute Term Frequencies.
- Offline Ranking Engine: Computes global authority via iterative PageRank jobs and outputs pt-pageranks for online lookup.
- Search Retrieval Frontend: A low-latency service that computes scores online using pt-index + pt-pageranks, applies dynamic boosts, reads titles from pt-documentStats, and lazily extracts 350-character contextual previews from cached HTML.
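The Inverted Indexer's tokenize-and-map-positions step can be sketched as follows. The names are illustrative, not the project's actual code; the real Flame job runs this per partition and writes the results to the KVS.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the indexer's per-document pass: tokenize the visible text and
// map each term to its positions. Term frequency falls out as the size of a
// term's position list. (Illustrative only.)
public class TermPositions {
    public static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int pos = 0;
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (tok.isEmpty()) continue;
            positions.computeIfAbsent(tok, k -> new ArrayList<>()).add(pos++);
        }
        return positions;
    }
}
```

Keeping positions (rather than bare counts) is what makes phrase boosting possible later in the pipeline.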
Ranking Engine: TF-IDF & PageRank
TF-IDF (Relevance)
Term Frequency-Inverse Document Frequency measures how relevant a page is to your search.
- TF: How many times your word appears on a specific page.
- IDF: How “rare” that word is across our ~520,000 pages. Common words like “the” are penalized, while unique words receive a boost.
- Weights: We give visible text a weight of 1.0 but penalize hidden metadata at 0.001 to prevent “keyword stuffing.”
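Under one common formulation (a standard logarithmic IDF, which the text does not pin down), the zone-weighted score described above can be sketched as:

```java
// Sketch of TF-IDF with the visible/metadata zone weighting described above.
// The log-form IDF is an assumption; the 1.0 and 0.001 weights come from the text.
public class TfIdf {
    static final double VISIBLE_WEIGHT = 1.0;
    static final double METADATA_WEIGHT = 0.001;

    // visibleTf / metadataTf: raw term counts in each zone of the page.
    // df: documents containing the term; n: corpus size (~520,000).
    public static double score(int visibleTf, int metadataTf, long df, long n) {
        double weightedTf = visibleTf * VISIBLE_WEIGHT + metadataTf * METADATA_WEIGHT;
        double idf = Math.log((double) n / df);  // rare terms score higher
        return weightedTf * idf;
    }
}
```

Because hidden metadata is weighted at 0.001, a page that stuffs a keyword 100 times into meta tags still scores below a page that uses it once in visible text.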
Iterative PageRank (Authority)
PageRank measures “trust” by treating links as votes. We compute this on hashed URLs using an iterative loop. Each page state tracks currentRank, previousRank, and outlinks.
- Transfer Step: A page takes 85% of its rank and divides it equally among its outlinks. A link from a page with few outlinks provides a “stronger” vote than one with thousands.
- Aggregation: Pages sum all incoming contributions and add a 0.15 baseline to find their newRank.
- Convergence: The job continues until 95% of pages are "stable", meaning their score changed by less than 0.01 since the last iteration. This specific threshold ensures accuracy without wasting compute time or crashing the server due to OOM.
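One transfer-plus-aggregation iteration, as described above, can be sketched like this. The graph representation and names are illustrative; the real job operates on hashed URLs in KVS tables.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of one PageRank iteration: each page sends 85% of its rank split
// equally across its outlinks, and every page starts from a 0.15 baseline.
// Graph shape is simplified for illustration.
public class PageRankStep {
    public static Map<String, Double> iterate(Map<String, Double> rank,
                                              Map<String, List<String>> outlinks) {
        Map<String, Double> next = new HashMap<>();
        for (String page : rank.keySet()) next.put(page, 0.15);  // baseline
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            List<String> targets = e.getValue();
            if (targets.isEmpty()) continue;
            // 85% of the source's rank, divided equally among its outlinks
            double share = 0.85 * rank.get(e.getKey()) / targets.size();
            for (String t : targets) next.merge(t, share, Double::sum);
        }
        return next;
    }
}
```

A convergence loop would call this repeatedly, comparing newRank against previousRank per page until 95% of pages move by less than 0.01.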
Final Scoring: Typing vs. Searching
Our system distinguishes between quick suggestions and full search results:
- Suggestions (/suggest): When you type, the engine uses a fast binary search on a sorted dictionary to finish your words. No complex scoring happens here.
- Full Search (/search): Once you hit enter, the engine runs the full retrieval pipeline:
  - Batch KVS Read: pt-index + pt-pageranks + pt-documentStats
  - Core Combine: (base_tfidf * phrase_boost) * (1.0 + pagerank)
  - Heuristic Multipliers:
    - Multi-term coverage: Documents matching all query terms get a 4.0x boost.
    - Title Alignment: Matches in the page <title> receive up to a 4.0x boost.
    - URL Matching: Terms found in the URL path grant up to a 5.0x boost.

Final Score = Core Combine * Coverage Boost * Title Boost * URL Boost
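The formulas above combine as in this sketch. Parameter names are illustrative, and boost detection is reduced to the caps given in the text (how partial title/URL matches scale up to those caps is simplified away):

```java
// Sketch of the final scoring combine described above. The 4.0x coverage
// boost comes from the text; titleBoost and urlBoost are assumed to be
// precomputed values up to their 4.0x / 5.0x caps.
public class FinalScore {
    public static double combine(double baseTfidf, double phraseBoost,
                                 double pagerank, boolean allTermsMatch,
                                 double titleBoost, double urlBoost) {
        double core = (baseTfidf * phraseBoost) * (1.0 + pagerank);
        double coverage = allTermsMatch ? 4.0 : 1.0;  // multi-term coverage
        return core * coverage * titleBoost * urlBoost;
    }
}
```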
Engineering Challenges
Defeating Black-Hat SEO and Malformed HTML
Early on, broken HTML and keyword stuffing ruined search quality. We replaced slow regex-based cleaning with efficient, character-level tokenization. By strictly separating visible text from metadata and applying a 0.001 penalty to hidden keywords, we ensured only actual content influenced the rankings.
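A minimal single-pass, character-level cleaner in the spirit described above might look like this. It is a sketch only: the production cleaner also handles script/style bodies, HTML entities, and the visible-versus-metadata zone split.

```java
// Minimal sketch of character-level HTML cleaning: one linear pass that
// drops everything between '<' and '>' and keeps visible text. The real
// cleaner also strips <script>/<style> bodies, decodes entities, and
// routes metadata into its separately weighted zone.
public class VisibleText {
    public static String strip(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) out.append(c);
        }
        return out.toString().trim();
    }
}
```

Unlike a regex pass, this never backtracks, so even pathological or unclosed markup costs O(n) time.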
Universal Job Resiliency
Running 10+ worker processes on a single instance often led to memory exhaustion. To solve this, we implemented bucket-based hash checkpoints across the Crawler, Indexer, and PageRank jobs. This allows the system to partition data and save progress; if a worker crashes, it resumes from the last completed hash bucket rather than restarting the entire 1-million-page process.
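The resume logic can be sketched as follows. The bucket count and the in-memory checkpoint set are stand-ins for the real KVS-backed state; the point is that a crashed worker re-runs only unfinished buckets, never the whole corpus.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of bucket-based checkpointing: items are hashed into fixed buckets,
// processed bucket-by-bucket, and completed bucket ids are persisted so a
// restarted worker skips finished work. Persistence is mocked as an in-memory
// set; the real system checkpoints to the KVS.
public class BucketCheckpoint {
    static final int NUM_BUCKETS = 16;                 // illustrative size
    private final Set<Integer> done = new HashSet<>(); // stand-in for KVS state

    static int bucketOf(String url) {
        return Math.floorMod(url.hashCode(), NUM_BUCKETS);
    }

    // Processes only items in buckets not yet checkpointed; returns how many
    // items were actually handled on this run.
    public int run(List<String> urls) {
        int processed = 0;
        for (int b = 0; b < NUM_BUCKETS; b++) {
            if (done.contains(b)) continue;            // resume: skip finished buckets
            for (String u : urls) {
                if (bucketOf(u) == b) processed++;     // "process" the item
            }
            done.add(b);                               // checkpoint after the bucket
        }
        return processed;
    }
}
```

After a crash, re-running against the same checkpoint state touches zero already-completed buckets, which is exactly the resume behavior described above.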
Strategic Triage and Prioritization
During testing, our KVS debugging UI reported 5x more pages than actually existed. After a time-boxed investigation, we proved the core indexer was working perfectly and the issue was simply a bug in the monitoring tool’s pagination logic. We chose to deprioritize the UI fix to focus on production features—a key lesson in prioritizing core system integrity over internal tool perfection.