AskSauron

Overview

AskSauron is a high-performance distributed search engine built in Java to support the full search lifecycle, from crawling and indexing to ranking and retrieval.

The system is powered by a custom Key-Value Store (KVS) and Flame, a distributed processing framework built from scratch in Java and modeled after Apache Spark, designed to execute parallel dataflow jobs across the pipeline. The pipeline processed over 1 million crawled pages and produced an indexed corpus of roughly 520,000 validated documents.

Although the software architecture is fully distributed, deployment was consolidated onto a single high-performance AWS EC2 instance running multiple concurrent worker processes across the crawl, indexing, and ranking stages.

[Screenshot: AskSauron search experience]


Results & Core Features

  • Data Pipeline Scale: Processed over 1,000,000 raw pages, filtered down to a high-quality index of ~520,000 unique validated documents.
  • High-Performance Frontend UX: Includes real-time search suggestions (binary search), spellcheck (Levenshtein distance), and a capped in-memory query cache to quickly serve repeated searches.
[Screenshots: query spellcheck and suggestions; the capped in-memory query cache]
  • Cached Page Access: Served stored snapshots of indexed pages so content remained viewable even if the original source was down or had changed.
[Screenshot: cached-page view in search results]
  • Search Result Preview Snippets: Built dynamic 350-character contextual previews from page content, with matched query terms highlighted for readability.
[Screenshot: result previews with PageRank scores]
  • Linguistic Normalization: Integrated the OpenNLP PorterStemmer library to stem words (e.g., “computing” and “computed” both reduce to the stem “comput”) at both index and query time, substantially improving search recall.
  • Resilient Infrastructure: Engineered checkpoint-based start/resume functionality across the Crawler, Indexer, and PageRank pipeline jobs to automatically recover from worker timeouts or Out-of-Memory (OOM) crashes.
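Two of the frontend features above — Levenshtein-distance spellcheck and the capped in-memory query cache — can be sketched in a few lines of Java. This is a minimal illustration, not the production code; the class and method names are invented for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FrontendSketch {

    // Levenshtein edit distance: used to suggest a close dictionary word
    // when a query term matches nothing in the index. Two rolling rows
    // keep memory at O(len(b)) instead of a full DP matrix.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Capped in-memory query cache: a LinkedHashMap in access order evicts
    // the least-recently-used entry once the cap is exceeded.
    static <K, V> Map<K, V> cappedCache(int maxEntries) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }
}
```

The `removeEldestEntry` override is the standard JDK idiom for a bounded LRU map, which is one straightforward way to implement the "capped in-memory map" described above.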


System Architecture & Data Flow

The system operates through a coordinator-worker model using Flame, our Spark-like distributed processing engine. Flame provides RDD-style functional operators (map, flatMapToPair, foldByKey, join) that allow us to write large-scale jobs as parallel dataflow pipelines.

  1. Distributed Crawler: Discovers and archives raw HTML into a persistent KVS table, filtering out infinite content traps as it extracts new URLs.
  2. Inverted Indexer: A multi-threaded Flame job that strips boilerplate tags, tokenizes text, and maps term positions to compute Term Frequencies.
  3. Offline Ranking Engine: Computes global authority via iterative PageRank jobs and writes the scores to a pt-pageranks table for online lookup.
  4. Search Retrieval Frontend: A low-latency service that computes scores online using pt-index + pt-pageranks, applies dynamic boosts, reads titles from pt-documentStats, and lazily extracts 350-character contextual previews from cached HTML.
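The core mapping of the Inverted Indexer (step 2) can be sketched in plain Java — in the real system this logic runs inside Flame's flatMapToPair/foldByKey operators, but the transformation itself is just term → (document → positions). All names here are illustrative:

```java
import java.util.*;

public class IndexSketch {

    // Builds an inverted index: term -> (docId -> positions of the term in that doc).
    // Input is a map of docId -> already-cleaned page text.
    static Map<String, Map<String, List<Integer>>> invert(Map<String, String> docs) {
        Map<String, Map<String, List<Integer>>> index = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            String[] tokens = doc.getValue().toLowerCase().split("\\W+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (tokens[pos].isEmpty()) continue; // skip empty splits
                index.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                     .computeIfAbsent(doc.getKey(), d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }
}
```

The position lists are what make term-frequency counts (list length) and phrase matching (adjacent positions) cheap at query time.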

Ranking Engine: TF-IDF & PageRank

TF-IDF (Relevance)

Term Frequency-Inverse Document Frequency measures how relevant a page is to your search.

  • TF: How many times your word appears on a specific page.
  • IDF: How “rare” that word is across our ~520,000 pages. Common words like “the” are penalized, while unique words receive a boost.
  • Weights: We give visible text a weight of 1.0 but penalize hidden metadata at 0.001 to prevent “keyword stuffing.”
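The weighted TF-IDF score described above reduces to a short formula; here is a hedged sketch using the weights from the text (the method signature and class name are invented for illustration):

```java
public class TfIdf {
    static final double VISIBLE_WEIGHT = 1.0;    // visible body text
    static final double METADATA_WEIGHT = 0.001; // hidden metadata, to deter keyword stuffing

    // tfVisible/tfMetadata: raw term counts in each zone of one page;
    // docFreq: number of pages containing the term; totalDocs: corpus size (~520,000).
    static double score(int tfVisible, int tfMetadata, int docFreq, int totalDocs) {
        double tf = tfVisible * VISIBLE_WEIGHT + tfMetadata * METADATA_WEIGHT;
        double idf = Math.log((double) totalDocs / docFreq); // rarer terms score higher
        return tf * idf;
    }
}
```

Note the two effects called out above: a word like “the” that appears in every document gets idf = log(1) = 0, and a hundred stuffed metadata keywords contribute less than a single visible occurrence.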

Iterative PageRank (Authority)

PageRank measures “trust” by treating links as votes. We compute this on hashed URLs using an iterative loop. Each page state tracks currentRank, previousRank, and outlinks.

  1. Transfer Step: A page takes 85% of its rank and divides it equally among its outlinks. A link from a page with few outlinks provides a “stronger” vote than one with thousands.
  2. Aggregation: Pages sum all incoming contributions and add a 0.15 baseline to find their newRank.
  3. Convergence: The job continues until 95% of pages are “stable,” meaning their score changed by less than 0.01 since the last loop. This threshold preserves ranking accuracy while avoiding the extra iterations that would waste compute time or push workers toward an OOM crash.
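The three steps above can be sketched as a small in-memory loop. This is a simplified single-process analogue of the distributed Flame job (constants come from the text; the data layout is invented for the example):

```java
import java.util.*;

public class PageRankSketch {
    static final double DAMPING = 0.85, BASELINE = 0.15;
    static final double STABLE_DELTA = 0.01, STABLE_FRACTION = 0.95;

    // graph: page -> its outlinks. Returns page -> converged rank.
    static Map<String, Double> rank(Map<String, List<String>> graph) {
        Map<String, Double> rank = new HashMap<>();
        for (String p : graph.keySet()) rank.put(p, 1.0);
        while (true) {
            // Transfer step: each page splits 85% of its rank across its outlinks.
            Map<String, Double> incoming = new HashMap<>();
            for (var e : graph.entrySet()) {
                List<String> out = e.getValue();
                if (out.isEmpty()) continue;
                double share = DAMPING * rank.get(e.getKey()) / out.size();
                for (String target : out)
                    incoming.merge(target, share, Double::sum);
            }
            // Aggregation: sum incoming contributions and add the 0.15 baseline.
            int stable = 0;
            Map<String, Double> next = new HashMap<>();
            for (String page : graph.keySet()) {
                double newRank = BASELINE + incoming.getOrDefault(page, 0.0);
                if (Math.abs(newRank - rank.get(page)) < STABLE_DELTA) stable++;
                next.put(page, newRank);
            }
            rank = next;
            // Convergence: stop once 95% of pages moved by less than 0.01.
            if (stable >= STABLE_FRACTION * graph.size()) return rank;
        }
    }
}
```

In the real pipeline each of these steps is a distributed Flame operation over hashed URLs rather than an in-memory map.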

Final Scoring: Typing vs. Searching

Our system distinguishes between quick suggestions and full search results:

  • Suggestions (/suggest): As you type, the engine uses a fast binary search on a sorted dictionary to complete your words. No complex scoring happens here.

  • Full Search (/search): Once you hit enter, the engine runs the full retrieval pipeline:

    1. Batch KVS Read: pt-index + pt-pageranks + pt-documentStats
    2. Core Combine: (base_tfidf * phrase_boost) * (1.0 + pagerank)
    3. Heuristic Multipliers:
      • Multi-term coverage: Documents matching all query terms get a 4.0x boost.
      • Title Alignment: Matches in the page <title> receive up to a 4.0x boost.
      • URL Matching: Terms found in the URL path grant up to a 5.0x boost.

Final Score = Core Combine * Coverage Boost * Title Boost * URL Boost
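Putting the combine and the heuristic multipliers together, the final score is one expression. A minimal sketch, with invented parameter names standing in for values read from the KVS tables:

```java
public class Scorer {
    // baseTfidf, phraseBoost, pagerank come from pt-index / pt-pageranks lookups;
    // titleBoost and urlBoost are the per-document heuristics (1.0 = no boost).
    static double finalScore(double baseTfidf, double phraseBoost, double pagerank,
                             boolean coversAllTerms, double titleBoost, double urlBoost) {
        double core = (baseTfidf * phraseBoost) * (1.0 + pagerank);
        double coverage = coversAllTerms ? 4.0 : 1.0; // all query terms present
        return core * coverage * titleBoost * urlBoost;
    }
}
```

The (1.0 + pagerank) term means a page with zero authority still keeps its full relevance score; PageRank can only amplify, never zero out, a TF-IDF match.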



Engineering Challenges

Defeating Black-Hat SEO and Malformed HTML

Early on, broken HTML and keyword stuffing ruined search quality. We replaced slow regex-based cleaning with efficient, character-level tokenization. By strictly separating visible text from metadata and applying a 0.001 penalty to hidden keywords, we ensured only actual content influenced the rankings.
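The character-level pass described above can be illustrated with a single-scan tokenizer that drops everything between angle brackets. This is a simplified sketch of the idea, not the production tokenizer (which also separates metadata zones for the 0.001 weighting):

```java
import java.util.*;

public class VisibleTextTokenizer {
    // One pass over the HTML: skip characters inside tags, lowercase
    // the rest, and emit words at every non-alphanumeric boundary.
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') { inTag = true; flush(word, tokens); }
            else if (c == '>') { inTag = false; }
            else if (inTag) { /* skip markup */ }
            else if (Character.isLetterOrDigit(c)) { word.append(Character.toLowerCase(c)); }
            else { flush(word, tokens); }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) { tokens.add(word.toString()); word.setLength(0); }
    }
}
```

Unlike a regex over the whole document, this stays O(n) with no backtracking, even on pathologically malformed HTML.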

Universal Job Resiliency

Running 10+ worker processes on a single instance often led to memory exhaustion. To solve this, we implemented bucket-based hash checkpoints across the Crawler, Indexer, and PageRank jobs. This allows the system to partition data and save progress; if a worker crashes, it resumes from the last completed hash bucket rather than restarting the entire 1-million-page process.
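The bucket-based checkpoint scheme can be sketched as follows — the bucket count and helper names are illustrative, and in the real system the completed-bucket set would be persisted to the KVS rather than held in memory:

```java
import java.util.*;

public class CheckpointSketch {
    static final int NUM_BUCKETS = 64; // illustrative; real bucket count may differ

    // Hash a URL into a stable bucket. Math.floorMod keeps the result
    // non-negative even when hashCode() is negative.
    static int bucketOf(String url) {
        return Math.floorMod(url.hashCode(), NUM_BUCKETS);
    }

    // After a crash, a restarted worker filters its work list against the
    // persisted set of completed buckets instead of starting over.
    static List<String> remainingWork(List<String> urls, Set<Integer> completedBuckets) {
        List<String> todo = new ArrayList<>();
        for (String url : urls)
            if (!completedBuckets.contains(bucketOf(url)))
                todo.add(url);
        return todo;
    }
}
```

Because String.hashCode() is deterministic, the same URL always lands in the same bucket, so a resumed job partitions the corpus identically to the crashed one.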

Strategic Triage and Prioritization

During testing, our KVS debugging UI reported 5x more pages than actually existed. After a time-boxed investigation, we proved the core indexer was working perfectly and the issue was simply a bug in the monitoring tool’s pagination logic. We chose to deprioritize the UI fix to focus on production features—a key lesson in prioritizing core system integrity over internal tool perfection.