Implement a state-of-the-art information retrieval method

INFS7410 Project – Part 2
Preamble
The due date for this assignment is 19 September 2019 17:00 Eastern Australia Standard Time,
together with part 1.
This part of the project is worth 10% of the overall mark for INFS7410 (part 1 is worth 5%, and
thus the whole submission of part 1 + 2 is worth 15%). A detailed marking sheet for this
assignment is provided at the end of this document.
Aim
Project aim: The aim of this project is to implement a state-of-the-art information retrieval
method, evaluate it and compare it to the baseline and rank fusion methods obtained in part 1 in
the context of a real use-case.
Project Part 2 aim
The aim of part 2 is to:
• use the evaluation infrastructure set up for part 1
• implement state-of-the-art information retrieval methods based on query reduction
• evaluate, compare and analyse the developed state-of-the-art methods against the baseline and rank fusion methods
The Information Retrieval Task: Ranking of studies for
Systematic Reviews
Part 2 of the project considers the same problem described in part 1: re-rank a set of documents
retrieved for the compilation of a systematic review. A description of the wider task is provided in
part 1.
What we provide you with (same as part 1)
We provide:
• for each dataset, a list of topics to be used for training. Each topic is organised into a file. Each topic contains a title and a Boolean query.
• for each dataset, a list of topics to be used for testing. Each topic is organised into a file. Each topic contains a title and a Boolean query.
• each topic file (both those for training and those for testing) includes a list of retrieved documents in the form of their PMIDs: these are the documents that you have to rank. Take note: you do not need to perform the retrieval from scratch (i.e. execute the query against the whole index); instead, you need to rank (order) the provided documents.
• for each dataset, and for each train and test partition, a qrels file containing relevance assessments for the documents to be ranked. This is to be used for evaluation.
• for each dataset, and for the test partitions, a set of runs from retrieval systems that participated in CLEF 2017/2018, to be considered for fusion.
• a Terrier index of the entire PubMed collection. This index has been produced using the Terrier stopword list and the Porter stemmer.
• a Java Maven project that contains the Terrier dependencies and skeleton code to give you a start. NOTE: Tip #1 provides a restructured version of the skeleton code that makes the processing of queries more efficient.
• a template for your project report.
What you need to produce
You need to produce:
• correct implementations of the state-of-the-art methods required by this project's specifications
• a correct evaluation, analysis and comparison of the state-of-the-art methods, including a comparison with the methods implemented in part 1. This should be written up in a report following the provided template.
• a project report that, following the provided template, details: an explanation of the state-of-the-art retrieval methods used (in your own words), an explanation of the evaluation settings followed, the evaluation results (as described above) inclusive of analysis, and a discussion of the findings. Note that you will need to provide a single report that encompasses both part 1 and part 2.
Required methods to implement
In part 2 of the project you are required to implement the following query reduction retrieval methods:

Query reduction using IDF-r. We discussed this method in the week 6 lecture (online video) and in the week 6 tutorial. The method is described in Koopman, Bevan, Liam Cripwell, and Guido Zuccon, “Generating clinical queries from patient narratives: A comparison between machines and humans.” Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017. (See the first paragraph of section 3.1 if you want a description from the literature; ignore the settings described in that publication.) You may have already implemented this for part 1 for reducing the Boolean queries (tip 4), and in the relevant tutorial.

Query reduction using Kullback-Leibler informativeness (KLI). This reduction method is partially described in Daniel Locke, Guido Zuccon, and Harrisen Scells, “Automatic Query Generation from Legal Texts for Case Law Retrieval.” Asia Information Retrieval Symposium. Springer, Cham, 2017. (Top of page 187.)
For IDF-r, we ask you to explore reduction of the query formed by the topic title. Queries are reduced at a retention rate r, where r is the fraction of the original query terms that is retained, i.e. r = 0.85 means retaining 85% of the original terms. We ask you to explore three retention rates on the training set: 85%, 50% and 30%. When rounding the number of query terms to retain to an integer, use the ceiling function.
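As a concrete illustration, the sketch below (not part of the provided skeleton; the class and method names are hypothetical) shows one way IDF-r reduction could be implemented: terms are ranked by IDF and the top ceil(r × |q|) terms are kept. In practice the IDF of each term would be derived from the provided Terrier index (e.g. from its lexicon's document frequencies); here the values are passed in precomputed.

```java
import java.util.*;
import java.util.stream.Collectors;

/**
 * Illustrative sketch (not the reference solution) of IDF-r query reduction:
 * rank the title-query terms by IDF and keep the ceil(r * |q|) highest-IDF terms.
 * The IDF values are assumed to be precomputed, e.g. as log(N / df_t) from the
 * Terrier lexicon; class and method names here are hypothetical.
 */
public class IdfRReduction {

    /** Keep the top ceil(retentionRate * |queryTerms|) terms by descending IDF. */
    public static List<String> reduce(List<String> queryTerms,
                                      Map<String, Double> idf,
                                      double retentionRate) {
        int toKeep = (int) Math.ceil(retentionRate * queryTerms.size());
        return queryTerms.stream()
                // terms with no IDF entry get 0 and are dropped first
                .sorted(Comparator.comparingDouble(
                        (String t) -> -idf.getOrDefault(t, 0.0)))
                .limit(toKeep)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("rapid", "diagnostic", "tests",
                "diagnosing", "uncomplicated", "falciparum", "malaria",
                "endemic", "countries");
        // toy IDF values, for illustration only
        Map<String, Double> idf = new HashMap<>();
        idf.put("falciparum", 9.1); idf.put("malaria", 7.3);
        idf.put("uncomplicated", 6.0); idf.put("diagnosing", 4.8);
        idf.put("endemic", 5.5); idf.put("diagnostic", 4.2);
        idf.put("rapid", 3.9); idf.put("tests", 3.1); idf.put("countries", 2.7);

        // r = 0.5 keeps ceil(0.5 * 9) = 5 terms
        System.out.println(reduce(query, idf, 0.5));
    }
}
```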
For implementing KLI, consider the following revised definition of the method. The KLI of a term t is formally defined by

$KLI(t) = P(t|D) \log \frac{P(t|D)}{P(t|C)}$

where D is the set of documents provided to rank (i.e. the documents initially retrieved by the Boolean query), and C is the entire collection as indexed in the provided index. Thus, for each query term you need to compute the probability of the term appearing in the provided retrieved set (i.e. its term frequency in the set; note that here D is not one document, but the set of initially retrieved documents): use MLE to compute this. Similarly, use MLE to compute the probability of the term appearing in the collection. Query reduction is then performed by ranking the query terms in decreasing value of KLI(t) and applying the retention rate r. For KLI, perform the same exploration of retention rates as for IDF-r.
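The following sketch (again illustrative, with hypothetical names) shows how KLI could be computed from MLE probabilities and used to reduce a query, assuming the term and token counts for the retrieved set D and for the collection C have already been gathered, e.g. from the provided index.

```java
import java.util.*;
import java.util.stream.Collectors;

/**
 * Illustrative sketch (not the reference solution) of KLI-based query reduction.
 * P(t|D) and P(t|C) are maximum likelihood estimates: the term count divided by
 * the total number of tokens in the retrieved set D and in the collection C.
 * All counts are assumed to be gathered beforehand; names here are hypothetical.
 */
public class KliReduction {

    /** KLI(t) = P(t|D) * log( P(t|D) / P(t|C) ); 0 if the term is unseen. */
    public static double kli(long tfInSet, long setTokens,
                             long tfInCollection, long collectionTokens) {
        if (tfInSet == 0 || tfInCollection == 0) {
            return 0.0;
        }
        double pD = (double) tfInSet / setTokens;
        double pC = (double) tfInCollection / collectionTokens;
        return pD * Math.log(pD / pC);
    }

    /** Rank query terms by decreasing KLI and keep ceil(r * |q|) of them. */
    public static List<String> reduce(List<String> queryTerms,
                                      Map<String, Double> kliScores,
                                      double retentionRate) {
        int toKeep = (int) Math.ceil(retentionRate * queryTerms.size());
        return queryTerms.stream()
                .sorted(Comparator.comparingDouble(
                        (String t) -> -kliScores.getOrDefault(t, 0.0)))
                .limit(toKeep)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // toy counts: term occurs 120 times in ~50k tokens of D,
        // and 3,000 times in ~2 billion tokens of C
        System.out.printf("KLI = %.6f%n", kli(120, 50_000, 3_000, 2_000_000_000L));
    }
}
```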
For both methods, rank documents according to the reduced queries using BM25 with the best parameters found in part 1 for the dataset you are experimenting with. When tuning, tune with respect to MAP.
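A possible shape for the training sweep is sketched below; buildRun and evaluateMap are hypothetical placeholders (not part of the provided code) standing in for the skeleton's BM25 retrieval with the part 1 parameters and for the MAP computation via trec_eval on the training qrels.

```java
import java.util.*;

/**
 * Illustrative sketch of the training sweep: for each retention rate, build a run
 * with the reduced queries, score it with MAP on the training topics, and keep the
 * best rate for use on the test set. RunBuilder and MapEvaluator are hypothetical
 * hooks for the skeleton's retrieval and evaluation steps.
 */
public class RetentionRateTuning {

    interface RunBuilder { String buildRun(double retentionRate); }   // returns a run file path
    interface MapEvaluator { double evaluateMap(String runFile); }    // MAP on training qrels

    public static double selectBestRate(RunBuilder builder, MapEvaluator evaluator) {
        double[] rates = {0.85, 0.50, 0.30};   // retention rates explored on the training set
        double bestRate = rates[0];
        double bestMap = Double.NEGATIVE_INFINITY;
        for (double r : rates) {
            double map = evaluator.evaluateMap(builder.buildRun(r));
            System.out.printf("r = %.2f -> MAP = %.4f%n", r, map);
            if (map > bestMap) {                // tune with respect to MAP
                bestMap = map;
                bestRate = r;
            }
        }
        return bestRate;
    }
}
```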
We strongly recommend you use and extend the Maven project provided for part 1 to implement
these methods. You should have already attempted the implementation of IDF-r as part of the relevant tutorial exercise.
In the report, detail how the methods were implemented, including which formula you
implemented.
What queries to use
For part 2, we ask you to consider the queries for each topic created from the title field of each
topic. For example, consider the example (partial) topic listed below: the query will be Rapid
diagnostic tests for diagnosing uncomplicated P. falciparum malaria in endemic
countries (you may consider performing text processing). This is the same query type used in
part 1.
Topic: CD008122
Title: Rapid diagnostic tests for diagnosing uncomplicated P. falciparum malaria in endemic countries
Query:
1. Exp Malaria/
2. Exp Plasmodium/
3. Malaria.ti,ab
4. 1 or 2 or 3
5. Exp Reagent kits, diagnostic/
6. rapid diagnos* test*.ti,ab
7. RDT.ti,ab
8. Dipstick*.ti,ab

Above: example topic file
Required evaluation to perform
In part 2 of the project you are required to perform the following evaluation:

1. For all methods, train on the training set for the 2017 topics with respect to the retention rate and test on the testing set for the 2017 topics (using the parameter value you selected from the training set). Report the results of every method on the training set (the best selected) and on the testing set, separately, in one table. Perform statistical significance analysis across the results of the methods.
2. Comment on the results reported in the previous table by comparing the methods on the 2017 dataset.
3. For all methods, train on the training set for the 2018 topics with respect to the retention rate and test on the testing set for the 2018 topics (using the parameter value you selected from the training set). Report the results of every method on the training set (the best selected) and on the testing set, separately, in one table. Perform statistical significance analysis across the results of the methods.
4. Comment on the results reported in the previous table by comparing the methods on the 2018 dataset.
5. Perform a topic-by-topic gains/losses analysis for both the 2017 and 2018 results on the testing datasets, considering (tuned) BM25 as the baseline.
6. Comment on trends and differences observed when comparing the findings from the 2017 and 2018 results. Is there a query reduction method that consistently outperforms the others?
In terms of evaluation measures, evaluate the retrieval methods with respect to mean average precision (MAP) using trec_eval. Remember to set the cut-off value (-M, i.e. the maximum number of documents per topic to use in evaluation) to the number of documents to be re-ranked for each of the queries. Using trec_eval, also compute R-precision (Rprec), which is the precision after R documents have been retrieved (by default, R is the total number of relevant documents for the topic).

For all statistical significance analyses, use the paired t-test; distinguish between p < 0.05 and p < 0.01.
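For example, the paired t-test could be run over per-topic average precision scores (obtainable with trec_eval -q) along the lines of the sketch below. It assumes the Apache Commons Math library (commons-math3), which is not necessarily included in the provided Maven project and may need to be added as a dependency.

```java
import org.apache.commons.math3.stat.inference.TTest;

/**
 * Illustrative sketch of the significance test: a paired t-test over per-topic
 * average precision scores of two methods (same topics, same order).
 * Assumes the commons-math3 dependency is available; the per-topic AP values
 * used in main() are toy numbers for illustration only.
 */
public class SignificanceTest {

    public static void report(double[] methodA, double[] methodB) {
        double p = new TTest().pairedTTest(methodA, methodB);   // two-sided p-value
        String marker = p < 0.01 ? "** (p < 0.01)"
                      : p < 0.05 ? "* (p < 0.05)"
                      : "not significant";
        System.out.printf("paired t-test p = %.4f %s%n", p, marker);
    }

    public static void main(String[] args) {
        double[] bm25   = {0.21, 0.35, 0.10, 0.42, 0.27};
        double[] idfR50 = {0.25, 0.33, 0.15, 0.44, 0.31};
        report(bm25, idfR50);
    }
}
```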
How to submit
You will have to submit 3 files:
1. the report, formatted according to the provided template, saved as a PDF or MS Word document. Note: write the report by combining the part 1 (the previous assignment) and part 2 (this assignment) results and methods. Make sure you clearly label the methods and results that belong to the different assignments.
2. a zip file containing a folder called runs-part2, which itself contains the runs (result files) you have created for the implemented methods.
3. a zip file containing a folder called code-part2, which itself contains all the code to re-run your experiments. You do not need to include in this zip file the runs we have given to you. You may need to include additional files, e.g. if you manually process the topic files into an intermediate format (rather than automatically processing them from the files we provide), so that we can re-run your experiments to confirm your results and implementation.

If your set of runs is too big, please do the following:
• include in the zip the test run
• include in the zip the best train run you used to decide upon the parameter tuning
• create a separate zip file with all the runs; upload it to a file sharing service like Dropbox or Google Drive (or similar), make sure it is visible without login, and add the link to it to your report. Please ensure that the link to the resources remains available for at least 6 days after the submission of the assignment.

All items need to be submitted via the relevant Turnitin links in the INFS7410 Blackboard site by 19 September 2019 17:00 Eastern Australia Standard Time, together with part 1, unless you have been given an extension (according to UQ policy) before the due date of the assignment. Note: appropriate, separate links are provided in the Assignment 2 folder in Blackboard for submission of the report, runs-part1, runs-part2, code-part1, and code-part2.
INFS 7410 Project Part 2 – Marking Sheet

Each criterion carries a weight (%) and is marked against three bands: 7 (100%), 4 (50%), and 1 (FAIL, 0%).

IMPLEMENTATION (4%)
The ability to:
• understand, implement and execute common IR baselines
• understand, implement and execute rank fusion methods
• perform text processing

7 (100%):
• Correctly implements both query reduction methods

4 (50%):
• Correctly implements only one of the specified query reduction methods

1 (FAIL, 0%):
• No implementation

EVALUATION (5%)
The ability to:
• empirically evaluate and compare IR methods
• analyse the results of empirical IR evaluation
• analyse the statistical significance of differences between IR methods' effectiveness

7 (100%):
• Correct empirical evaluation has been performed
• Uses all required evaluation measures
• Correct handling of the tuning regime (train/test)
• Reports all results for the provided query sets in appropriate tables
• Provides graphical analysis of results on a query-by-query basis using appropriate gain-loss plots
• Provides correct statistical significance analysis within the result table, and correctly describes the statistical analysis performed
• Provides a written understanding and discussion of the results with respect to the methods
• Provides examples of where query reduction works, and where it does not, and why, e.g. a discussion with respect to queries and runs

4 (50%):
• Correct empirical evaluation has been performed
• Uses all required evaluation measures
• Correct handling of the tuning regime (train/test)
• Reports all results for the provided query sets in appropriate tables
• Provides graphical analysis of results on a query-by-query basis using appropriate gain-loss plots
• Does not perform statistical significance analysis, or errors are present in the analysis

1 (FAIL, 0%):
• No or only partial empirical evaluation has been conducted, e.g. only on one topic set, or on a subset of topics
• Reports only a partial set of evaluation measures
• Fails to correctly handle training and testing partitions, e.g. trains on test, reports only overall results

WRITE UP (1%; binary score: 0/2)
The ability to:
• use fluent language with correct grammar, spelling and punctuation
• use appropriate paragraph and sentence structure
• use an appropriate style and tone of writing
• produce a professionally presented document, according to the provided template

Full marks (100%):
• Structure of the document is appropriate and meets expectations
• Clarity promoted by consistent use of standard grammar, spelling and punctuation
• Sentences are coherent
• Paragraph structure is effectively developed
• Fluent, professional style and tone of writing
• No proofreading errors
• Polished, professional appearance

Fail (0%):
• Written expression and presentation are incoherent, with little or no structure, well below the required standard
• Structure of the document is not appropriate and does not meet expectations
• Meaning unclear as grammar and/or spelling contain frequent errors
• Disorganised or incoherent writing
