The goal of this assignment is to implement a basic search engine using the concepts and tools
covered in the first half of the course. To complete this assignment, you will need to implement
a web crawler, a RESTful server, and a browser-based client that allows a user to perform
searches over the crawled data.
The code for your assignment must be submitted on Brightspace before the deadline. You do
not have to submit your database files. Grading for the assignment will be done via
demonstration in the week following the deadline. Scheduling of the demonstrations will be done
closer to the deadline. Partners submitting the assignment should make a single submission
that contains both partners’ names and student numbers in the README file.
The web crawler portion of your assignment must be capable of crawling the following:
1. The fruit example site. Start at people.scs.carleton.ca/~davidmckenney/fruitgraph/N-0.html
and crawl the entire site (1000 pages).
2. Another site of your choosing. Limit the total number of crawled pages (roughly 500-1000).
It is suggested that you keep your crawl within the domain where you begin. You can design
your selection policy to focus on any pages you deem important. You are not required to
crawl non-HTML resources but can choose to do so.
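A crawler meeting the requirements above can be sketched as a breadth-first traversal with a visited set and a page cap. The `fetch` parameter and helper names below are illustrative, not part of the assignment; a real crawler would fetch each page over HTTP and write it to your database.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=1000):
    """Visit pages breadth-first; returns {url: html} for every page crawled.

    `fetch(url)` should return the page's HTML, or None if unavailable
    (in a real crawler this would be an HTTP GET).
    """
    frontier = deque([start_url])
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in pages:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page's URL
            frontier.append(urljoin(url, href))
    return pages
```

The page cap (`max_pages`) is what keeps the personal-site crawl within the 500-1000 range; a selection policy can be applied by filtering links before they enter the frontier.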
Your crawled data must be stored in a database for persistence. Your crawler must also perform
PageRank calculations and store the values for each page in the database.
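PageRank can be computed with a simple power iteration over the link graph once crawling finishes. The sketch below assumes the graph is a dict mapping each URL to its outgoing links; the function name and damping choice are illustrative, and any converging method may be acceptable.

```python
def pagerank(links, alpha=0.85, iterations=100):
    """Return {page: rank}; ranks sum to 1 across all pages.

    `links` maps each page to the list of pages it links to.
    `alpha` is the damping factor (0.85 is the conventional choice).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Base probability of jumping to a random page
        new = {p: (1 - alpha) / n for p in pages}
        for p, outs in links.items():
            targets = [t for t in outs if t in links]
            if targets:
                # Split this page's rank evenly among its out-links
                share = alpha * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: distribute its rank evenly to all pages
                for t in pages:
                    new[t] += alpha * rank[p] / n
        rank = new
    return rank
```

Running this after the crawl and storing each page's value alongside its other data satisfies the persistence requirement above.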
Your RESTful web server must read the data from the database, perform required indexing, and
provide relevant, ranked search results for any valid request. Your server must support GET
requests for at least the following endpoints:
1. /fruits – represents a request to search the data from the fruit example
2. /personal – represents a request to search the data in the alternate site you selected
Both of your search endpoints (/fruits and /personal) must support at least the following query
parameters:
1. q – a string representing the search query the user has entered, which may contain
one or more words
2. boost – either true or false, indicating whether each page's position in the search
results should be boosted using its PageRank score
3. limit – a number specifying how many results the user wants returned (minimum 1,
maximum 50, default 10)
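On the server side, these parameters need validating before use. The helper below is a hypothetical sketch that applies the defaults and clamps limit to the required 1-50 range; it takes a plain dict of raw string parameters, however your framework parses them.

```python
def parse_search_params(query):
    """Validate raw query parameters for the /fruits and /personal endpoints.

    `query` is a dict of raw string values, e.g. parsed from the URL.
    Returns (q, boost, limit) with the spec's defaults applied.
    """
    q = query.get("q", "")
    boost = query.get("boost", "false").lower() == "true"
    try:
        limit = int(query.get("limit", 10))
    except ValueError:
        limit = 10  # fall back to the default on malformed input
    limit = max(1, min(50, limit))  # clamp to the allowed 1-50 range
    return q, boost, limit
```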
The browser-based interface for searching must allow the client to specify:
1. The text for their search
2. Whether they want the results to be boosted or not using PageRank
3. The number of results they want to receive (minimum 1, maximum 50, default 10)
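These three client inputs map directly onto the query parameters of the search endpoints. As a sketch (the helper name is hypothetical, and a real client would build this string in the browser), the request URL can be assembled like so:

```python
from urllib.parse import urlencode

def search_url(endpoint, q, boost, limit):
    """Build the GET URL for a search request, e.g. for /fruits or /personal."""
    params = {"q": q, "boost": str(boost).lower(), "limit": limit}
    return endpoint + "?" + urlencode(params)
```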
The search results displayed in the browser must contain:
1. The URL to the original page
2. The title of the original page
3. The PageRank of the page within your crawled network
4. A link to view the data your search engine has for this page. This must include at least
the URL, title, list of incoming links to this page, list of outgoing links from this page, and
word frequency information for the page (e.g., banana occurred 6 times, apple occurred
9 times, etc.). You can also display any additional data you produced during the crawl.
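One common way to apply the boost parameter when ordering results is to multiply each page's content-relevance score by its PageRank; the function below is a sketch of that idea, not a mandated formula, and the field names are illustrative of the per-result data described above.

```python
def rank_results(scored_pages, boost, limit):
    """Order search results and apply the result limit.

    `scored_pages` is a list of dicts, each with at least 'url', 'title',
    'score' (content relevance), and 'pagerank'. When `boost` is true,
    each page's score is multiplied by its PageRank before sorting.
    """
    def key(page):
        return page["score"] * page["pagerank"] if boost else page["score"]
    return sorted(scored_pages, key=key, reverse=True)[:limit]
```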