IFN647 Advanced Information Retrieval and Storage
Project: EduQuiz Question Answering System
Total Marks: 100 (Scaled down to 40% of overall course mark)
Due Dates:
Demonstration of functionalities: 23rd/24th October 2019, during allocated tutorial.
Report: Monday 29th October 2019
Instructions: You should complete the project in teams of one to four members (maximum). All
group members must be present for the demonstration. Only one student in each group submits the
assignment. However, all group members’ names and student numbers need to be included in the
written report, and all group members’ student numbers must also be included in the final zip filename.
Section 1. Scenario
You have been hired by Kingsland University of Technology to implement a new question answering
system for students to refer to in their learning process (hereafter referred to as “the application”)
that will replace their existing system (hereafter referred to as the “baseline system”). The application
should search and retrieve relevant paragraphs that contain answers to questions entered by the user.
The application is to be implemented in C#1 and utilise the Lucene search engine library
(Lucene.NET version 3.0.3). You are required to:
• Implement an information search and retrieval system in C# using Lucene.NET 3.0.3;
• Implement changes to the application to differentiate it from the baseline system. As part of
these changes you must: (See Section 3 for more details)
1. Pre-process the information need to create a query as the input into the application.
Query creation must be automated and require no manual input from the user.
2. Implement one or more sensible and justifiable changes to Lucene’s
“DefaultSimilarity” class.
• Evaluate the performance of the application and compare it to the baseline system;
• Produce a written report that describes, amongst other things, the design, testing,
performance and instructions on how to use your application.
Important Points
• You have been provided with the MS MARCO collection.json file, which contains 82326 dictionaries
formatted as dict_keys([‘passages’, ‘query_id’, ‘answers’, ‘query_type’, ‘query’]). Each element
contains the results from one specific query.
• For each query contained in the “MS MARCO collection.json” file, the text contained in the
passages field must be pre-processed for the indexing process.
• Each passage contains a URL, from which you could extract a title for your indexing.
This processing must be automatic and require no manual alteration by the user.
• Users may enter any type of natural language question, which you need to process into a
query and match against your index.
• You should follow a suitable approach for utilising the “Passage Text”, URL and the Query
for your system development (i.e. which parts are to be indexed, how to do the indexing, etc.).
• You should try to implement changes that improve the performance of your application with
respect to the baseline system. However, you will not be penalised if your changes do not
improve on the baseline as long as: your changes are reasonable and you have adequately
evaluated and described the changes.
1 You can request an exemption to complete your project in a different programming language, by emailing
[email protected] with the list of team members before the 10th of October 2019.
Section 2. Collection
You will be using a version of the Microsoft MARCO test collection to evaluate your application. All
files relating to the test collection are available on Blackboard (collection.json). Microsoft Machine
Reading Comprehension (MS MARCO) is a large-scale dataset for deep learning related to search.
In MS MARCO, all questions are sampled from real, anonymised user queries. The context
passages, from which answers in the dataset are derived, are extracted from real web documents
using the most advanced version of the Bing search engine. The answers to the queries are human
generated where the annotators could summarise the answer.
The collection consists of a set of source documents provided together with information needs and
relevance judgements (collection.json file). This is a collection of 82326 dictionaries with the
following structure: dict_keys([‘passages’, ‘query_id’, ‘answers’, ‘query_type’, ‘query’]). Each element
contains the results from one specific query. Each item is described below (an illustrative C# sketch of reading this structure follows the list):

• passages: a list of dictionaries, where each item of the list contains the
information regarding one passage. The number of passages varies from 1 to 12 for each
query. The keys are described below:
o is_selected: whether or not this passage is relevant for the query. Each list of
passages contains one or more relevant passages.
o url: Information about the URL or title.
o passage_text: Content of the passage.
o passage_id: unique id for the passage.
• query_id: Unique identifier for the query.
• answers: the ideal result for the query. This could be a paragraph in which the question is
correctly answered, or a plain answer (e.g. Yes or No).
• query_type: there are 5 types of queries:
o ‘description’ : 4496
o ‘entity’: 8529
o ‘location’: 4052
o ‘numeric’: 22758
o ‘person’: 2026
• query: the string of characters that constitutes the query
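
For illustration only, the following minimal C# sketch shows one way this structure might be read with the Json.NET (Newtonsoft.Json) library. It assumes the file deserialises as a JSON array of objects carrying the keys above; the class and member names are placeholders, and the exact layout of the provided collection.json should be verified before relying on this.

```csharp
// Minimal sketch (not part of the specification): reading collection.json with Json.NET.
// Assumes the file is a JSON array of query records; adjust if the provided file differs.
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class PassageRecord
{
    public int is_selected { get; set; }      // 0 or 1
    public string url { get; set; }
    public string passage_text { get; set; }
    public long passage_id { get; set; }
}

public class QueryRecord
{
    public List<PassageRecord> passages { get; set; }
    public long query_id { get; set; }
    public List<string> answers { get; set; }
    public string query_type { get; set; }
    public string query { get; set; }
}

public static class CollectionLoader
{
    public static List<QueryRecord> Load(string path)
    {
        string json = File.ReadAllText(path);
        return JsonConvert.DeserializeObject<List<QueryRecord>>(json);
    }
}
```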

1. Source Documents
The source documents for this task are all the passages (url/title and passage text).
2. Information Needs
Each user question must be processed to formulate a query, which will be used as the search input
into your application. Example queries, along with the relevance judgements of the passages, as well
as the answers, are contained in the collection. You will only be using the relevant passages.
The following sample contains extracted content related to a query and its answer.

Description:
– Query: what is rba
– Answer: “Results-Based Accountability is a disciplined way of thinking and taking action
that communities can use to improve the lives of children, youth, families, adults and the
community as a whole.”
– Relevant Passage: “Results-Based Accountability® (also known as RBA) is a disciplined
way of thinking and taking action that communities can use to improve the lives of
children, youth, families, adults and the community as a whole. RBA is also used by
organizations to improve the performance of their programs. Creating Community Impact
with RBA. Community impact focuses on conditions of well-being for children, families
and the community as a whole that a group of leaders is working collectively to improve.
For example: “Residents with good jobs,” “Children ready for school,” or “A safe and clean
neighborhood”.”

Section 3. System Requirements
You are required to produce an application written in C# that allows users to index, search and
retrieve relevant passages based on a specific user question. It is assumed that you will produce a
user-friendly graphical user interface (GUI). The functionality of the application is broken up into the
following tasks:
Task 1: Index
• On start-up your application should enable the user to specify two directory paths. The first
directory path will contain the collection (and therefore all the source documents in the json
file). The other directory path will contain the location where a Lucene index will be built.
• The user will then be able to indicate that they would like to create the index from the
specified collection (e.g. through a button click or similar) which is then done
programmatically.
• The application should build the index and report the time it takes to index to the user (an illustrative indexing sketch follows this task description).
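
A minimal, non-prescriptive sketch of this indexing step with Lucene.NET 3.0.3 is shown below. It reuses the QueryRecord/PassageRecord classes from the sketch in Section 2, and the field names (“passage_id”, “url”, “passage”), the analyzer choice and the decision to store field values are illustrative assumptions rather than requirements.

```csharp
// Illustrative Task 1 sketch: build a Lucene.NET 3.0.3 index from the loaded collection
// and time the process so it can be reported on the GUI.
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class Indexer
{
    public static long BuildIndex(string indexPath, IEnumerable<QueryRecord> records)
    {
        var timer = Stopwatch.StartNew();
        var indexDir = FSDirectory.Open(new DirectoryInfo(indexPath));
        var analyzer = new SimpleAnalyzer();

        using (var writer = new IndexWriter(indexDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var record in records)
            {
                foreach (var passage in record.passages)
                {
                    var doc = new Document();
                    doc.Add(new Field("passage_id", passage.passage_id.ToString(),
                                      Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("url", passage.url ?? "", Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new Field("passage", passage.passage_text ?? "", Field.Store.YES, Field.Index.ANALYZED));
                    writer.AddDocument(doc);
                }
            }
            writer.Optimize();
        }

        timer.Stop();
        return timer.ElapsedMilliseconds;   // indexing time to display to the user
    }
}
```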
Task 2: Search
• Once the index has been built the user must be able to search it.
• The user should be able to enter a natural language question into a search box and the
application must then process this question to create a final query that is submitted into the
application automatically.
• The final query created programmatically from the natural language question submitted by
the user must be displayed on the GUI (a simple display of the query object submitted).
• How the application processes the information need entered by the user is an
implementation/design choice
• The user should have the option (via a checkbox/radio button or similar control) to submit
multi-term and/or multi-phrase queries “as is” with no pre-processing
• The application should match the final query to relevant passages and rank the passages
according to relevance. The application should report how long it took to search the index
and display this on the GUI, including the time required for query creation (an illustrative
search sketch follows this task description).
• The application should display in the GUI how many relevant documents were returned from
the search
• The application should present a ranked list of at least the top 10 relevant passages. For
each document the following pieces of information should be presented:
➢ The title;
➢ URL;
➢ Highlighted set of matching passage text for the query
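
An illustrative search sketch in the same spirit (Lucene.NET 3.0.3, the SimpleAnalyzer and the field names from the indexing sketch) is shown below; the query-creation step shown (a plain QueryParser parse of the question) is only a placeholder for your own pre-processing design.

```csharp
// Illustrative Task 2 sketch: turn a question into a Lucene query, search the index,
// time the whole step (query creation included) and walk the top 10 hits.
using System.Diagnostics;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class PassageSearcher
{
    public static void Search(string indexPath, string question)
    {
        var timer = Stopwatch.StartNew();

        // Placeholder query creation: replace with your own processing of the question.
        var analyzer = new SimpleAnalyzer();
        var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "passage", analyzer);
        Query finalQuery = parser.Parse(QueryParser.Escape(question));

        using (var searcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo(indexPath)), true))
        {
            TopDocs hits = searcher.Search(finalQuery, 10);
            timer.Stop();

            // Display finalQuery.ToString(), hits.TotalHits and timer.ElapsedMilliseconds on the GUI.
            foreach (ScoreDoc scoreDoc in hits.ScoreDocs)
            {
                var doc = searcher.Doc(scoreDoc.Doc);
                string url = doc.Get("url");
                string passage = doc.Get("passage");
                // Render the title/URL and a highlighted snippet of the passage for this hit.
            }
        }
    }
}
```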
Task 3: Retrieve Results
• Using an appropriate interface control, the user should be able to view the entire passage. This
can either open in the existing window or a new window.
Task 4: Save Results
• The user must have the option of saving the list of all of the retrieved results (not just the
first 10) in a text file. To do this the user will need to specify the name of the file in which to
save the results and the query identification. New results should be appended to the end of an
existing results file.
• The format of the text file should be compatible with the trec_eval program. The format is as
follows (an illustrative writing sketch is given at the end of this task):

TopicID Q0 DocID rank score student_numbers_groupname

Where:
➢ TopicID is the query identification as entered by the user;
➢ Q0 is simply the two characters “Q” and “0”;
➢ DocID is the respective passage id;
➢ rank is the rank of the file as returned by your application;
➢ score is the relevancy score of the file as determined by your application;
➢ student_numbers_groupname are the QUT student numbers for all members of the
group (delimited by underscores) and a group name (e.g
0123456798_0987654321_ourteam). This is used to identify the result file from your
application.
All parameters are separated by whitespace. A sample of the results file from the BaselineSystem is
provided next.
Note that in order to be compatible with trec_eval the results file needs to be a unix formatted file.
You can use the program dos2unix to assist with this.
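
For illustration of this format only, the sketch below appends results lines in the trec_eval layout; the run tag is a placeholder for your own student numbers and group name, and writing “\n” line endings directly is one way to keep the file unix-formatted without needing dos2unix.

```csharp
// Illustrative Task 4 sketch: append one trec_eval-formatted line per retrieved passage.
// Line format: TopicID Q0 DocID rank score student_numbers_groupname
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public class RetrievedResult
{
    public string DocId;
    public int Rank;
    public float Score;
}

public static class ResultsWriter
{
    // runTag example: "0123456798_0987654321_ourteam" (placeholder only).
    public static void AppendResults(string filePath, string topicId,
                                     List<RetrievedResult> results, string runTag)
    {
        using (var writer = new StreamWriter(filePath, true))   // true = append
        {
            writer.NewLine = "\n";   // unix line endings for trec_eval compatibility
            foreach (var r in results)
            {
                writer.WriteLine("{0} Q0 {1} {2} {3} {4}",
                    topicId, r.DocId, r.Rank,
                    r.Score.ToString(CultureInfo.InvariantCulture), runTag);
            }
        }
    }
}
```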
Task 5: Custom Similarity Modification
As part of the application’s implementation you are required to make an improvement to the
baseline scoring functionality of Lucene. In workshops and lectures you have been shown how
Lucene calculates its scoring function. Lucene search is based on a Boolean and Vector Space Model
with some minor modifications. Your application must make sensible and justifiable modifications to
the Lucene scoring function by implementing a custom similarity class which inherits from the
“DefaultSimilarity” base class.
https://svn.apache.org/repos/infra/websites/production/lucene.net/content/docs/3.0.3/d6/d95/class_lucene_1_1_net_1_1_search_1_1_default_similarity.html
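
By way of illustration only, a custom similarity class might look like the sketch below. The particular overrides and formulas are examples of the kind of change that is possible, not the required modification, and the exact member used to attach the similarity to your IndexWriter/IndexSearcher should be checked against the 3.0.3 documentation linked above (for example, setting the searcher’s Similarity before searching, and applying it at index time as well if length normalisation is changed).

```csharp
// Illustrative Task 5 sketch: a custom similarity inheriting from DefaultSimilarity.
// The overrides below are example modifications, not prescribed ones.
using System;
using Lucene.Net.Search;

public class CustomSimilarity : DefaultSimilarity
{
    // Example: dampen raw term frequency compared with the default sqrt(freq).
    public override float Tf(float freq)
    {
        return (float)(1.0 + Math.Log(1.0 + freq));
    }

    // Example: weaken document-length normalisation compared with the default 1/sqrt(numTokens).
    public override float LengthNorm(string fieldName, int numTokens)
    {
        int tokens = numTokens <= 0 ? 1 : numTokens;
        return (float)(1.0 / Math.Sqrt(Math.Sqrt(tokens)));
    }
}
```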
Task 6: Improved Query Processing
Implement the following features as advanced searching options of your system.
• Connect a lexical database (e.g. WordNet) to perform query expansion in your searching
operations. Your system should include a set of customizable options to select different
linguistic levels in your query expansion.
• Sometimes in the results from the expanded query, documents that contain the original term
are ranked lower than documents that contain the expanded terms. For example, documents
that contain “moving” are ranked higher than documents that contain “move”. You need to
address this problem in your expanded search queries and give more weight to the original
term.
• You should have an option in your system to demonstrate these query expansion and
weighted-query options.
• Implement field-level boosting for your search queries using the two fields, Title and Passage,
derived from/available in your collection. You should be able to show different sets of results
for two different field-level boosting configurations. (An illustrative sketch combining query
expansion, term weighting and field boosting follows this list.)
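
The following sketch is illustrative only and combines these ideas: GetSynonyms is a hypothetical placeholder for whatever lexical database (e.g. WordNet) interface you adopt, the boost values are arbitrary examples, and it assumes that separate “title” and “passage” fields are indexed for this task.

```csharp
// Illustrative Task 6 sketch: expand a query with synonyms while boosting the original
// terms above the expansion terms, and apply field-level boosting across two fields.
using System;
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class ExpandedQueryBuilder
{
    public static Query Build(IEnumerable<string> originalTerms,
                              Func<string, IEnumerable<string>> GetSynonyms,   // hypothetical WordNet lookup
                              float titleBoost, float passageBoost)
    {
        var query = new BooleanQuery();
        foreach (string term in originalTerms)
        {
            AddTerm(query, term, 2.0f, titleBoost, passageBoost);          // original term: higher weight

            foreach (string synonym in GetSynonyms(term))
                AddTerm(query, synonym, 0.5f, titleBoost, passageBoost);   // expansion term: lower weight
        }
        return query;
    }

    private static void AddTerm(BooleanQuery query, string term, float termWeight,
                                float titleBoost, float passageBoost)
    {
        var titleQuery = new TermQuery(new Term("title", term)) { Boost = termWeight * titleBoost };
        var passageQuery = new TermQuery(new Term("passage", term)) { Boost = termWeight * passageBoost };
        query.Add(titleQuery, Occur.SHOULD);
        query.Add(passageQuery, Occur.SHOULD);
    }
}
```

Running the same expanded query with two different (titleBoost, passageBoost) pairs would then give the two different result sets required above.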
Section 4. Report Requirements
In addition to the application you are also required to produce a report. The report should contain
the following sections.
Part 1: Statement of Completeness and Work Distribution.
• The Statement of Completeness should be completed with respect to the functionality
outlined in Section 3. If any item has not been completed then it should be detailed.
• The Work Distribution should list which pieces of work were completed by which team
members. It should also indicate how marks should be split between team members.
• The Statement of Completeness and Work Distribution needs to be signed by all members.
Part 2: Design
• A description of your overall system design, identifying which are the major classes and
methods in your application and how they relate to the major components of an IR system
(indexing, searching, human computer interface). A UML diagram may assist in this
description.
• A description of your indexing strategy, including which classes and methods are responsible
for indexing.
• A description of how you are handling the structure of the source document, including which
parts of the source documents are being indexed or stored within the index and the reasons
why you made these decisions.
• A description of how you are handling any document errors.
• A description of your search strategy, including which classes and methods are responsible
for searching. Include your modifications to the “DefaultSimilarity” base class
• A description of your user interface, including interface mock-ups. You may use a number of
applications to produce the mock-ups, however, you are not allowed to use screen shots of
your final application. You need to provide mock-ups for all of the functionality described
in Section 3 to show your GUI design process.
Part 3: Changes to Baseline
• A list of all the changes that you have made in comparison with the baseline system, with a
short description (1-2 sentences) of the change.
• Describe the change(s) made to the “DefaultSimilarity” class and the reason for the
change(s)
• For two (2) of the changes listed above a longer description on:
o The motivation behind the change and why you think the change will lead to
improved search performance;
o Where in your application the changes have been implemented.
• You should try to implement changes that improve the performance of your application with
respect to the baseline system. However, you will not be penalised if your changes do not
improve on the baseline as long as your changes are reasonable and you have adequately
evaluated and described the changes.
• Describe your approach to implementing the Task 6, “Improved Query Processing”, options.
You could describe your changes with a short description and include screen shots of the
results obtained in each step.
Part 4: System Evaluation
• Provide a suitable test plan and subsequent results for the system evaluation presented in
Section 6. You may use screen shots of your application to prove that it is able to search and
retrieve documents and screen shots of Luke.Net to show that your application is able to
index the collection.
Part 5: Comparison with Baseline
• Provide a comparison with the baseline system as indicated using the criteria in Section 7.
Part 6: User Guide
• Provide a user guide that instructs how to perform each of the tasks outlined in Section 3.
Part 7: Advanced Features to Answer the Questions
• Identify two “advanced features” that will allow the system to provide an answer to the
questions instead of a passage, either identified from the material covered in class or from
your own research.
• For each of the advanced features provide:
➢ A description of the advanced feature;
➢ An explanation of why you think it would aid in answering the questions;
➢ A description of what changes would be required in your application to implement
the advanced features and where in your application (i.e. class/methods) you would
need to make them.
NOTE: For these advanced features you are not required to implement them but rather describe
and justify why they would improve the application.
Section 5. Baseline System
The baseline system should be the first system you develop in C# and Lucene.Net. You then save it
as an alternative system and start building on it to produce your final system. The properties of the
baseline system are as follows (an illustrative sketch of the corresponding field configuration follows
this list):
• For each passage in the collection, all of the text and url is indexed as a single field.
• The Analyzer used during indexing and search is the Lucene.Net.Analysis.SimpleAnalyzer.
• The index does not save information related to field normalisation.
• The index does not save information related to term vectors.
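
To make the last two properties concrete, a single baseline content field could be configured roughly as in the sketch below (illustrative only; the field name and the decision to store the combined text are placeholder choices, not part of the specification).

```csharp
// Illustrative baseline field configuration: url and passage text combined in one field,
// indexed without field norms and without term vectors.
using Lucene.Net.Documents;

public static class BaselineFields
{
    public static Field MakeContentField(string url, string passageText)
    {
        string combined = (url ?? "") + " " + (passageText ?? "");
        return new Field("content", combined,
                         Field.Store.YES,                 // stored so the passage can be displayed (a choice)
                         Field.Index.ANALYZED_NO_NORMS,   // no field-normalisation information saved
                         Field.TermVector.NO);            // no term vectors saved
    }
}
```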
Section 6. System Evaluation
Your application needs to be evaluated both in terms of efficiency and effectiveness.
The required efficiency metrics are:
• Index size;
• Time to produce index;
• Time to search.
The required efficiency metrics should be provided across the source documents (for the index size
and time to produce the index) and across all five test information needs (for the time to search).
The time must include the time for processing the information need to produce the final query into
the application. You may use Luke.Net to find the size of the index.
There are two sets of required effectiveness metrics, one that considers the retrieval of all possibly
relevant passages, and one that considers the retrieval of at least one passage that contains the
correct answer.
Then, you need to apply the four following metrics, allocating two of them to each type of
retrieval task:
– Precision @ 10
– Interpolated Recall/Precision curve
– Mean Reciprocal Rank
– MAP
You then need to create a simulation, using the query, associated passages and selection indicators
from the collection. In order to evaluate, it is recommended that you assign ids to the different
passages. You also need to create appropriate qrels files for each of the 2 sets of metrics (retrieving
the list of passages linked to the query, or retrieving only the selected passage); an illustrative sketch
is given below.
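
For illustration, qrels lines follow the standard trec_eval layout (query id, an unused “0” column, passage id, relevance) and could be generated along the lines of the sketch below, which reuses the QueryRecord/PassageRecord classes from the Section 2 sketch; which passages you treat as relevant for each of the two metric sets remains your design decision.

```csharp
// Illustrative sketch: write a qrels file for trec_eval from the collection.
// One qrels file could treat every passage of a query as relevant, the other only
// the passages with is_selected == 1; the selectedOnly flag switches between them.
using System.Collections.Generic;
using System.IO;

public static class QrelsWriter
{
    public static void WriteQrels(string path, IEnumerable<QueryRecord> records, bool selectedOnly)
    {
        using (var writer = new StreamWriter(path, false))
        {
            writer.NewLine = "\n";   // unix line endings for trec_eval
            foreach (var record in records)
            {
                foreach (var passage in record.passages)
                {
                    int relevance = selectedOnly ? passage.is_selected : 1;
                    // Line format: query_id 0 passage_id relevance
                    writer.WriteLine("{0} 0 {1} {2}", record.query_id, passage.passage_id, relevance);
                }
            }
        }
    }
}
```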
The program trec_eval can be used to assist you in calculating these metrics. Depending on how you
have formatted your results file, you may also need to use the program dos2unix.
Section 7. Comparison with Baseline System
You should compare the performance of your application with the baseline. This should be done in
four ways:
1. A comparison of the performance of your application against the baseline system using the
three efficiency metrics outlined in Section 6.
2. A comparison of the performance of your application against the baseline system using the
four effectiveness metrics outlined in Section 6, justifying their selection for each task.
3. A graph that shows the overall Interpolated Recall-Precision Averages at 11 standard recall
values for both your application and the baseline.
4. A short description that describes the reasons why you believe your system outperforms or
does not outperform the baseline system.
Section 8. Late submission and plagiarism policies
QUT’s standard policy for late submissions and plagiarism will apply. Information on these policies
can be found at the following
• https://www.student.qut.edu.au/studying/assessment/late-assignments-and-extensions
• http://www.mopp.qut.edu.au/C/C_05_03.jsp
• http://www.mopp.qut.edu.au/E/E_08_01.jsp#E_08_01.08.mdoc
Section 9. What to Submit
By the demonstration due date, you need to submit:
• A zip file containing all the source code associated with your application. This should include
a Visual Studio project file that can be opened and built using Visual Studio 2017/Community
on the computers in your workshop labs or the QUT virtual cloud VS environment. If the
project is not able to be built using Visual Studio 2017/Community on the computers in your
workshop labs, then the associated functionality marks will be 0.
• A demonstration of the functionalities of your system of at most 10 minutes. The
demonstration may also include some slides. Penalties may apply for exceeding the 10-minute
timeframe. The demonstration will be to one of the tutors, typically during the tutorial where
the group members are registered. A schedule of presentations will be organised in October,
and groups with members registered in several tutorials will have the opportunity to express
schedule preferences and/or constraints.
By the report due date, you need to submit:
• A PDF or Word document of your report. Ensure that all group members’ names and student
numbers are on the report. The report should be well presented and formatted and use
appropriate headings where necessary. The report should be professional in nature.
• A results file generated from your application as specified in Section 3. The results file should
be able to be processed by the trec_eval program on the computers in your workshop lab.
Section 10. Assessment Criteria
The assessment criteria are separated into two parts:
1. System functionality
2. Written report.
Marks will be assigned as follows. Please note that marks may be deducted for failing to meet
criteria, poor communication, inadequate descriptions or other reasons.
Functionality Assessment Criteria

Part Marks
System with Baseline Functionality 15
Development of Evaluation Framework 5
Processing the Query and Storing 5
Pre-process Natural Language Question 5
Implement changes to Lucene Scoring 5
Implement two changes to Baseline System 8
Improved Query Processing Functionality 7
Functionality Total 50

Report Assessment Criteria

Part Marks
Statement of Completeness 1
Design 10
Changes to Baseline and Query Processing 5
Justification of evaluation metrics 6
System Evaluation 6
Comparison to Baseline 8
User Guide 4
Advanced Features (Answer Questions) 10
Report Total 50

Total Marks

Part Marks
Functionality 50
Report 50
Total 100

<< END OF PROJECT SPECIFICATION >>
