Web Search Engines and Information Retrieval
ISYS 1078/1079
Assignment 2
Assessment Type: Group assignment. Submit online via Canvas → Assignments → Assignment 2. Marks awarded for meeting requirements as closely as possible. Clarifications/updates may be made via announcements/relevant discussion forums.
Due Date: Midday, Wednesday 9 October 2019 (Week 11)
Marks: 100 (30% of total course mark)
Overview
In this assignment, you will implement a ranked querying system, run a set of sample queries, and evaluate system performance. You will need to design and implement appropriate data structures for efficient searching.
In the first assignment, you implemented text processing and indexing. This second assignment focuses on search; the aims of the assignment are to enhance your understanding of retrieval models and evaluation, and to give you practical experience with the implementation of search algorithms.
For this assignment, you should build on (by re-using and modifying where required) your indexing code from assignment 1. As usual, wherever code is re-used (including your own) this must be clearly identified in your submission.
The "Web Search Engines and Information Retrieval" Canvas contains further announcements and a discussion board for this assignment. Please be sure to check these on a regular basis; it is your responsibility to stay informed with regard to any announcements or changes.
Learning Outcomes
This assessment relates to the following learning outcomes:
CLO1: apply information retrieval principles to locate relevant information in large collections of data
CLO3: implement features of retrieval systems for web-based and other search tasks
CLO4: analyse the performance of retrieval systems using test collections
CLO5: make practical recommendations about deploying information retrieval systems in different search domains, including considerations for document management and querying
In addition, it relates to the program learning outcomes of Enabling Knowledge; Critical Analysis; Problem Solving; and Communication/Teamwork.
Teaching Servers
Three CSIT teaching servers are available for your use:
(titan|saturn|jupiter).csit.rmit.edu.au
You can log in to these machines using ssh, for example:
% ssh s1234567@titan.csit.rmit.edu.au
where s1234567 is your student number. These servers host the document collection
and other data for the assignments in this course. You are encouraged to develop your
code on these machines. If you choose to develop your code elsewhere, it is your responsibility to ensure that your assignment submission can be compiled and executed on (titan|saturn|jupiter).csit.rmit.edu.au, as this is where your code will be run for marking purposes. If your code cannot be compiled and run on the teaching servers at marking time, you will receive zero marks for the programming component.
You are required to make regular backups of all of your work. This is good practice, no
matter where you are developing your assignment solutions.
Academic Integrity and Plagiarism
Academic integrity is about honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge and ideas.
You should take extreme care that you have:
Acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods.
Provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from Internet sites. If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person without appropriate referencing, as if they were your own.
RMIT University treats plagiarism as a very serious offence constituting misconduct.
Plagiarism covers a variety of inappropriate behaviours, including:
Failure to properly document a source
Copyright material from the internet or databases
Collusion between students
For further information on our policies and procedures, please refer to the following:
https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/academic-integrity
General Requirements
This section contains information about the general requirements that your assignment
must meet. Please read all requirements carefully before you start.
You must implement your programs in Java, C, C++, or Python. Your programs
should be well written, using good coding style and including appropriate use of
comments. Your markers will look at your source code, and coding style may form
part of the assessment of this assignment.
You must include a plain text file called "README.txt" with your submission. This file should include the name and student ID of all team members at the top. It needs to clearly explain how to compile and run your programs on (titan|saturn|jupiter).csit.rmit.edu.au. The clarity of the instructions and usability of your programs may form part of the assessment of this assignment.
Your programs may be developed on any machine, but must compile and run
on the course machines, (titan|saturn|jupiter).csit.rmit.edu.au. If your
submission does not compile and run on these machines, it will not be marked.
The submitted programs must be your own code. You should not use existing
external libraries that implement advanced data structures. Where libraries (or in
the case of scripting languages, built-in features beyond simple low-level data types)
are used for data structures such as hash tables, they must be clearly attributed,
and it is up to you to demonstrate a clear understanding of how the library is
implemented in the discussion in your assignment report.
Paths should not be hard-coded.
Where your programs need to create auxiliary files, these should be stored in the current working directory.
Please ensure that your submission follows the file naming rules specified in the tasks below. File names are case sensitive, i.e. if it is specified that the file name is gryphon, then that is exactly the file name you should submit; Gryphon, GRYPHON, griffin, and anything else but gryphon will be rejected.
Assignment Teams
This assignment must be carried out in groups of two. Since this assignment builds on
the indexing functionality of assignment 1, it is suggested that you continue to work in
the same group.
You need to register your group through Canvas prior to assignment submission. Further
details are provided in the "What to Submit, When, and How" section of this document.
To manage your teamwork, you should use Trello (https://trello.com).
Each team member should sign up using their RMIT email address.
Each team must create a shared Trello board called "WSEIR Assignment 2".
You must invite the teaching staff (lecturer and tutor) to join your board.
To allow for flexibility, you may use Trello in the way that best suits your team. However, as a minimum you must demonstrate activity that involves:
Creating cards that correspond to the main assignment components.
Assigning particular cards/tasks to each team member.
Showing regular progress on tasks over each week of the assignment period (e.g. updating cards, and progressing them from "To Do" to "Doing" to "Done").
Trello will be covered in more detail in the first tutorial.
Note that your assignment report submission must include a participation statement, indicating the proportion of work contributed by each team member. This should reflect your Trello board.
Programming Tasks
You will find all files needed for this assignment in the directory
/home/inforet/a2
on (titan|saturn|jupiter).csit.rmit.edu.au.
First, have a look at the file latimes. It is part of the TREC ad hoc retrieval test collection, and comprises 131,896 newspaper documents.
You will need to write programs to perform searches on this data. As a preliminary step, you should therefore be able to index the data efficiently, for example using an inverted index. This will enable you to easily access the various term occurrence statistics that you need to use for the searching tasks described below.
For indexing, you are encouraged to re-use your code (possibly with modifications) from Assignment 1. However, it is your responsibility to verify the accuracy of your indexing code.
Note: If you have concerns about the functionality of your indexing code from assignment 1, you may optionally make use of a provided inverted index dump file
index_dump/invlist-TermFreq.txt
and document map file
index_dump/map
If you choose to make use of these files, you will need to load them into memory and retrieve the term occurrence statistics using appropriate data structures. A README.txt file explaining the format is included in the same directory.
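To illustrate one way of loading such a dump, the following Python sketch reads an inverted list file into a dictionary keyed by term. The line format assumed here (term, document frequency, then docid:tf pairs) is hypothetical; your loader must follow the actual format described in the provided README.txt.

    # Minimal sketch: load an inverted list dump into memory.
    # ASSUMPTION: each line looks like "term df docid:tf docid:tf ...";
    # adapt this to the actual format documented in the provided README.txt.
    def load_inverted_index(invlist_path):
        index = {}
        with open(invlist_path) as f:
            for line in f:
                parts = line.split()
                term, df = parts[0], int(parts[1])
                postings = [tuple(int(x) for x in entry.split(":"))
                            for entry in parts[2:]]
                index[term] = (df, postings)  # df plus (doc_id, fdt) pairs
        return index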
1 Ranked Retrieval (30%)
Write a program called search that implements the Okapi BM25 similarity function to
rank documents in a text collection. An example of how your program should be invoked
is:
./search -BM25 -q <query-label> -n <num-results> -l <lexicon> -i <invlists>
         -m <map> [-s <stoplist>] <queryterm-1> [<queryterm-2> ... <queryterm-N>]
(or equivalent invocation in another programming language) where
-BM25 specifies that the BM25 similarity function is to be used.
-q <query-label> is an integer that identifies the current query. For example, you might indicate that the current query label is "1133". This item is only used for output formatting (see below).
-n <num-results> is an integer specifying the number of top-ranked documents that should be returned as an answer.
-l <lexicon> and -i <invlists> are the inverted index lexicon and inverted list files; and -m <map> is the mapping table from internal document numbers to actual document identifiers. (See Assignment 1 for details. You are encouraged to re-use your indexing code from Assignment 1, with modifications where required.)
-s <stoplist> is an optional argument which may indicate a stoplist; note that if the inverted index was created with stopping, then queries should be stopped in exactly the same way.
<queryterm-1> [<queryterm-2> ... <queryterm-N>] are a variable number of query terms (but at least one query term) that the program will receive as command-line arguments. Each of these terms should be looked up in the lexicon, and the appropriate inverted list data fetched from the inverted list file.
Your search program should be efficient, using suitable data structures and algorithms. As a minimum, your program should:
Process a query, one term at a time.
Use accumulators to accrue partial similarity scores as each term is processed (see Retrieval Models I lecture notes).
Use a min-heap data structure to select the top n accumulators with the highest similarity scores.
Calculate document weights at indexing time, and store these in the mapping table (map file) for fast lookup at query time.
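As an illustration of the pattern these points describe (not a prescribed implementation), here is a minimal Python sketch of term-at-a-time processing with accumulators, using heapq (which maintains a min-heap internally) to select the top n results. The index layout and the bm25_weight callback are placeholder assumptions standing in for your own structures.

    import heapq
    from collections import defaultdict

    def rank(query_terms, index, n, bm25_weight):
        accumulators = defaultdict(float)      # doc_id -> partial score
        for term in query_terms:               # one term at a time
            if term not in index:
                continue
            ft, postings = index[term]         # df and (doc_id, fdt) pairs
            for doc_id, fdt in postings:
                accumulators[doc_id] += bm25_weight(ft, fdt, doc_id)
        # nlargest keeps a size-n min-heap while scanning the accumulators
        return heapq.nlargest(n, accumulators.items(), key=lambda kv: kv[1])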
The specific simplified version of the BM25 similarity function that you should implement is:

    BM25(Q, D_d) = \sum_{t \in Q} \log\left( \frac{N - f_t + 0.5}{f_t + 0.5} \right) \cdot \frac{(k_1 + 1)\, f_{d,t}}{K + f_{d,t}}    (1)

    K = k_1 \left( (1 - b) + \frac{b \cdot L_d}{AL} \right)    (2)
where:
Q is a query consisting of terms t
D_d is a document in the collection
N is the number of documents in the collection
f_{d,t} is the number of occurrences of t in d
f_t is the number of documents containing term t
L_d and AL are the document length and average document length, measured as the number of characters in the data file
k_1 and b are parameters that should initially be set as k_1 = 1.2 and b = 0.75
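For reference, Equations 1 and 2 transcribe almost directly into code. The sketch below is a plain Python rendering in which the arguments correspond to the statistics defined above; it is an illustration, not a required interface. Note that with this simplified formulation the log component is negative for terms that appear in more than half of the documents.

    import math

    def bm25_term_weight(N, ft, fdt, Ld, AL, k1=1.2, b=0.75):
        K = k1 * ((1 - b) + b * Ld / AL)             # Equation 2
        idf = math.log((N - ft + 0.5) / (ft + 0.5))  # log part of Equation 1
        return idf * ((k1 + 1) * fdt) / (K + fdt)    # one term's contribution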
As output, your program should produce answer lists using the following format:
401 LA010189-0003 1 110.541
401 LA010190-0064 2 95.259
401 LA010389-0019 3 86.806
401 LA010489-0026 4 65.082
401 LA010689-0066 5 21.118
where
column 1 is the query label, passed to your program via the optional command-line
argument -q
column 2 is the document identier
column 3 is the rank position of the document (starting from rank 1)
column 4 is the similarity score produced by the BM25 similarity function (the
numbers given are examples only)
As the final line of output, your program should print the running time in milliseconds, for example:
Running time: 298 ms
Your program should produce no other output.
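One possible way to produce this output and the timing line in Python (the results list, doc_map, and query_label names are illustrative assumptions; they stand in for whatever your own search code provides):

    import time

    def print_results(query_label, results, doc_map):
        # results: (internal doc number, score) pairs, highest score first
        for position, (doc_id, score) in enumerate(results, start=1):
            print(f"{query_label} {doc_map[doc_id]} {position} {score:.3f}")

    start = time.perf_counter()
    # ... run the query and print the results here ...
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"Running time: {elapsed_ms:.0f} ms")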
2 Ranked Retrieval: Report (10%)
Create a PDF file called report.pdf. Create a heading called "Ranked Retrieval" in your report file. In this section you should:
Explain your implementation of the BM25 similarity function.
Clearly describe the steps by which you process a query of one or more terms.
Explain the key algorithms and data structures that you have used. You should refer to specific sections of your code by citing line numbers in the source files to demonstrate where you have implemented these.
This section should be no longer than one and a half pages.
For all questions that involve writing a report, you must remember to cite any sources
(including books, research papers, course notes, etc.) that you referred to while designing
aspects of your programs.
3 Advanced Information Retrieval Feature (30%)
In this section, you will choose one advanced IR feature, which you will then implement
and evaluate.
You may choose from the following list. Some of these items have been covered in lectures,
while others are new.
All of these topics will require you to first carry out some research: you should aim to identify and read at least one research paper that provides a foundation for the feature that you will implement. This must be cited in your report.
Whichever advanced feature you choose, your implementation/program must be called advanced, and clear details of how to run it must be provided in a README.txt file.
1. Automatic query expansion (pseudo-relevance feedback): Implement an extension to your ranked retrieval approach that implements an automatic query expansion approach (for example, the Okapi Term Selection Value approach, or another approach).
   Such an approach should include at least two parameters that can be set at run-time: R, the number of top-ranked documents that are assumed to be relevant; and E, the number of terms that should be appended to the original query.
2. Manual relevance feedback: Implement an extension to your ranked retrieval approach that implements manual relevance feedback.
   Such approaches must include an interactive feedback phase, where a user is presented with a list of documents in response to an initial query, and is given the opportunity to mark certain documents as relevant (and, optionally, to mark some as non-relevant).
   The initial query then needs to be updated, for example using Rocchio's approach. Finally, the updated query is run to retrieve the final answer list.
3. Phrase search: Extend your ranked retrieval implementation to support phrase queries (for example, the query "cold fusion" would return all documents (and only those documents) that contain the term cold followed directly by the term fusion). A minimal implementation sketch is given after this list.
   Note that to support phrase search, you will first need to extend your inverted index to also store term positions within documents.
4. Diversity ranking: Extend your ranked retrieval approach to include a diversity component. For example, for an ambiguous query such as java, a diversity approach would seek to ensure that the top answers include documents that cover various interpretations of the query (the island, the coffee, the programming language). An example of a diversity approach is Maximal Marginal Relevance (MMR) [1].
5. Disk-based inverted index construction: Modern collections are often too large to fit into main memory. Constructing an inverted index therefore requires various efficiency considerations.
   The lecture on indexing included information on an external sort-based approach for constructing an inverted index. Implement this as a baseline, then explore and implement a more advanced approach.

[1] http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf
6. Document summaries: Extend your search system to produce short document summaries with each answer item returned in a search results list.
   You must implement at least two summary creation approaches. One of these should be based on query-biased information, taking the user's query into account. The second should include evidence other than the user's current query (for example, a very simple choice would be to simply return the first chunk of a document).
7. Other advanced feature: If you wish to propose another advanced IR feature, you are welcome to do so. However, you must send an email to the lecturer to discuss this, and to agree on the scope of what such a feature needs to cover, and how it should be evaluated.
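To make the phrase-search option (feature 3) concrete, the sketch below shows one possible core routine in Python. It assumes a positional index mapping each term to {doc_id: sorted list of positions}, which your extended inverted index would need to provide; this is a hedged illustration of the adjacency check, not a complete system.

    def phrase_match(positional_postings, phrase_terms):
        # positional_postings: term -> {doc_id: sorted list of word positions}
        # Returns the set of doc_ids containing the terms as an exact phrase.
        matches = set()
        first = positional_postings.get(phrase_terms[0], {})
        for doc_id, positions in first.items():
            starts = set(positions)
            for offset, term in enumerate(phrase_terms[1:], start=1):
                term_positions = positional_postings.get(term, {}).get(doc_id)
                if not term_positions:
                    starts = set()
                    break
                term_positions = set(term_positions)
                # keep phrase starts followed by this term at the right offset
                starts = {p for p in starts if p + offset in term_positions}
                if not starts:
                    break
            if starts:
                matches.add(doc_id)
        return matches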
4 Advanced Information Retrieval Features: Report (15%)
In your report file, create a heading "Advanced IR Feature". In this section, you should explain:
Your chosen advanced IR feature. This must include a clear description of what you are aiming to achieve. You must include a reference to at least one research paper that you have referred to as part of your design.
A detailed explanation of how you implemented the feature.
This section should be no longer than one and a half pages.
5 Evaluation: Report (15%)
You have implemented at least two aspects of an IR system. Your task now is to experimentally evaluate the relative performance of these. Depending on which advanced IR feature you have chosen, you should perform one of the following evaluations.
5.1 Effectiveness evaluation (advanced features 1 and 2)
The file topics contains five queries (one on each line) with the following format:
401 foreign minorities germany
402 behavioral genetics
403 osteoporosis
405 cosmic events
408 tropical storms
where the first item is a query number (identifier) followed by a space, and the subsequent terms on the line are the query that is to be submitted to your search system.
A corresponding file, qrels, contains relevance judgements for these queries and the collection that you have indexed. These judgements were made by a human assessor, who considered the query and the answer document, and decided whether the document is relevant for the given query. Example lines from the file are:
401 0 LA010189-0003 0
401 0 LA010489-0026 1
401 0 LA010689-0066 0
where the first item is a query number, the second item should be ignored, the third is a document identifier, and the fourth item is a relevance value (a "1" indicates that this document is relevant for the query, and a "0" indicates not relevant). So, the example above indicates that the document LA010489-0026 is relevant for query number 401, and the documents LA010189-0003 and LA010689-0066 are not relevant.
Answer the following questions in your report:
1. Your first task is to run the queries using both of your retrieval approaches. You should generate answer lists of length 20. You must include the full answer lists for both systems in your report.
2. Compare the systems using precision at 10 documents retrieved (P@10) as a metric. Explain this performance measurement, and discuss which system is better based on the scores. (You do not need to carry out statistical significance tests.) A sketch of how P@10 can be computed follows this list.
   Is this a sensible metric for comparing the two systems that you have implemented? Why or why not? You should write no more than half a page for this section.
3. Based on your knowledge of retrieval evaluation, choose a different evaluation metric to compare the two systems.
   Explain whether plain ranked retrieval or your chosen advanced approach is better according to your chosen metric. Describe why your chosen metric is appropriate for comparing the performance of these two systems. You should write no more than half a page for this section.
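Since several of the evaluations here use P@10 (the fraction of the first 10 returned documents that are judged relevant), the following small Python sketch shows how it can be computed from a ranked list and the qrels file in the format described above; the function and variable names are illustrative only.

    def load_qrels(path):
        # Line format: "query_id ignored doc_id relevance"
        relevant = {}
        with open(path) as f:
            for line in f:
                qid, _, doc_id, rel = line.split()
                if rel == "1":
                    relevant.setdefault(qid, set()).add(doc_id)
        return relevant

    def precision_at_k(ranked_doc_ids, relevant_docs, k=10):
        # Fraction of the first k results that are judged relevant.
        return sum(1 for d in ranked_doc_ids[:k] if d in relevant_docs) / k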
5.2 Effectiveness evaluation for phrase search (advanced feature 3)
Consider the set of test queries described in section 5.1. Then answer the following questions in your report:
1. Your first task is to run the queries using both of your retrieval approaches (ranked retrieval, and phrase retrieval). You should generate answer lists of length 20 for ranked retrieval, and report the full answer list for your phrase search. Include the answer lists in your report.
2. Compare the systems using precision at 10 documents retrieved (P@10) as a metric. Explain this performance measurement, and discuss which system is better based on the scores. (You do not need to carry out statistical significance tests.)
   Is this a sensible metric for comparing the two systems that you have implemented? Why or why not? You should write no more than half a page for this section.
3. Propose a way to evaluate your phrase retrieval search. You should explain the proposed process, and any components that you would need, in detail. However, you do not need to create the components and carry out an actual evaluation. You should write no more than half a page for this section.
5.3 Effectiveness evaluation for diversity (advanced feature 4)
Consider the set of test queries described in section 5.1. Then answer the following questions in your report:
1. Your first task is to run the queries using both of your retrieval approaches (ranked retrieval, and diversity retrieval). You should generate answer lists of length 20 for both your ranked retrieval and diversity approaches. Include the answer lists in your report.
2. Compare the systems using precision at 10 documents retrieved (P@10) as a metric. Explain this performance measurement, and discuss which system is better based on the scores. (You do not need to carry out statistical significance tests.)
   Is this a sensible metric for comparing the two systems that you have implemented? Why or why not? You should write no more than half a page for this section.
3. Propose a way to evaluate your diversity ranking. You should explain the proposed process, and any components that you would need, in detail. However, you do not need to create the components and carry out an actual evaluation. You should write no more than half a page for this section.
5.4 Efficiency evaluation (advanced feature 5)
You should include efficiency experiments that compare the time and resource requirements of the two approaches (the baseline and the more advanced approach). For example, as a minimum you should include the size of all index components, and the time taken on various phases of index construction. These phases must be clearly described and compared in your report. You should write no more than a page for this section.
5.5 Extrinsic summary evaluation (advanced feature 6)
Document summaries can be evaluated in various ways. For this advanced feature, you should carry out an extrinsic evaluation. That is, you will need to show pairs of summaries (that is, both summaries created for the same document, using the two different methods that you implemented) to human evaluators, and ask them to indicate which of the two they prefer. This must be repeated for a reasonable number of documents (at least 20), for which the preference scores can then be aggregated.
In your report, you need to describe the evaluation process in detail: who made the judgements; how were the summaries presented to the judges; how did you draw final conclusions about whether one method is better than another. You should write no more than a page for this section.
6 Participation
Create a separate PDF file, called participation.pdf, in which you indicate the proportion of assignment work contributed by each team member. This proportion should reflect the tasks (assigned, progressed and completed) in your Trello board. The total must come to 100%.
As part of your assignment submission, you must print this statement, obtain a signature from both team members, and upload a scanned signed version of the document.
7 REQUIRED Assignment Presentation (5% Bonus Marks)
In week 12, your group will give a short presentation about your assignment. The presentation will be a maximum of 3 minutes, and needs to cover:
A brief description of the advanced IR feature that you have chosen, and how you implemented it.
A brief explanation of your evaluation, and what you learned about the approaches.
You should create at most 2 slides to display as part of your presentation. To avoid the overheads of connecting various electronic devices, these should be printed out, and will be displayed using the document camera.
Note: the presentation is a compulsory part of this assignment. You can receive up to 5 bonus marks for your presentation. All team members should be present for the presentation.
Marking guidelines
The marks available for each component are indicated in each section above. A more detailed rubric is also available on the course Canvas for your reference.
What to Submit, When, and How
The assignment is due at Midday, Wednesday 9 October 2019 (Week 11).
After the due date, you will have 5 business days to submit your assignment as a late submission. Late submissions will incur a penalty of 10% per day. After these five days, Canvas will be closed and you will lose ALL the assignment marks.
Assessment declaration: When you submit work electronically, you agree to the assessment declaration: https://www.rmit.edu.au/students/student-essentials/assessment-and-exams/assessment/assessment-declaration
Group registration:
Before submitting the assignment, you must register your group in Canvas. Choose the "People" item in the course menu on the left. Then choose the "Assignment 2 Groups" tab at the top of the next screen. On this page you will be able to assign yourself to a group (groups are listed on the right). One team member should sign up for a group first, and then notify their partner of the group number, so that they can then sign up for the same group. You are encouraged to register your assignment group in Canvas as soon as you decide on a team. Group registration is a separate step from assignment submission, and groups need to be re-registered for Assignment 2 (they will not carry over from Assignment 1). Please note: you must choose a pre-created group from the "Assignment 2 Groups" tab at the top of the Canvas "People" page (with names such as "Assignment 2 Groups 21").
Submission process
Assignments should be submitted via the WSEIR course Canvas.
Carefully re-check the "General Requirements" section of this assignment specification and ensure that you have followed the instructions listed there.
All assignment files (including the README.txt file, source code, report.pdf and participation.pdf) must be submitted as ONE single zip file, named "WSEIR-A2-Group-X.zip" where "X" is replaced by the group number that you signed up for in Canvas.
Please do NOT submit the latimes document collection file, or any index or map files that your programs create as output when run.
You only need to submit one copy of the assignment per team. Please remember to also indicate all team members at the top of your README.txt file.
