Exam Bank

First Exam
Q 01: Merge Algorithm [4 Points]
Given Data:
Term 1: [1, 4, 5, 7, 9, 11, 13, 15]
Term 2: [3, 4, 6, 7, 10, 13, 14]
Query: term1 AND term2
Task: Trace the merge algorithm for the two postings lists above.
# | P1 | P2 | ACTION | RESULT

✅ Solution Key:

# | P1 | P2 | ACTION | RESULT
1 | 1 | 3 | Move P1 (1 < 3) |
2 | 4 | 3 | Move P2 (4 > 3) |
3 | 4 | 4 | Match -> advance both | Add 4
4 | 5 | 6 | Move P1 (5 < 6) |
5 | 7 | 6 | Move P2 (7 > 6) |
6 | 7 | 7 | Match -> advance both | Add 7
7 | 9 | 10 | Move P1 (9 < 10) |
8 | 11 | 10 | Move P2 (11 > 10) |
9 | 11 | 13 | Move P1 (11 < 13) |
10 | 13 | 13 | Match -> advance both | Add 13
11 | 15 | 14 | Move P2 (15 > 14); P2 exhausted | End
Final answer: term1 AND term2 = [4, 7, 13]
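The trace above follows the standard linear-time postings merge. A minimal Python sketch (not part of the original exam):

```python
def intersect(p1, p2):
    """Postings-list merge (AND): advance the pointer holding the
    smaller docID; on a match, record it and advance both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 4, 5, 7, 9, 11, 13, 15], [3, 4, 6, 7, 10, 13, 14]))
# -> [4, 7, 13]
```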
Q 02 [4 Points]
Given Data: a Positional Index for the following terms:
Term | Postings List (DocID: Positions)
self | {doc1: [2]} {doc2: [1, 5]} {doc3: [12]} {doc4: [5]}
driving | {doc1: [3]} {doc2: [6]} {doc3: [10, 13]} {doc4: [6]}
car | {doc1: [4]} {doc2: [8, 10]} {doc3: [8, 11, 14]} {doc4: [8]}
is | {doc1: [5]} {doc2: [3]} {doc3: [5, 15]} {doc4: [1, 2, 9]}
research | {doc1: [6]} {doc2: [13]} {doc3: [7]} {doc4: [3, 7]}
Required: trace the Positional Search for the phrase query "self driving car".
DocID | self | driving | car | Match?

✅ Solution Key:

DocID | Sequence | Match?
doc1 | 2, 3, 4 | YES
doc2 | 5, 6, 8 | NO
doc3 | 12, 13, 14 | YES
doc4 | 5, 6, 8 | NO
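The positional search can be checked with a short sketch. The index below is transcribed from the question's table; `phrase_match` is a simplified illustration of the positional-intersect idea (it requires the terms at consecutive positions p, p+1, p+2):

```python
# Positional index for the query terms, transcribed from the table above.
index = {
    "self":    {"doc1": [2], "doc2": [1, 5],  "doc3": [12],         "doc4": [5]},
    "driving": {"doc1": [3], "doc2": [6],     "doc3": [10, 13],     "doc4": [6]},
    "car":     {"doc1": [4], "doc2": [8, 10], "doc3": [8, 11, 14],  "doc4": [8]},
}

def phrase_match(index, terms):
    """Return docs where the terms occur at consecutive positions."""
    # Candidate docs must contain every term.
    docs = set.intersection(*(set(index[t]) for t in terms))
    hits = []
    for d in sorted(docs):
        starts = index[terms[0]][d]
        # For some start p, term k must occur at position p + k.
        if any(all(p + k in index[t][d] for k, t in enumerate(terms)) for p in starts):
            hits.append(d)
    return hits

print(phrase_match(index, ["self", "driving", "car"]))
# -> ['doc1', 'doc3']
```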
Q 03 [4 Points]
Skip Pointers Optimization: use skip pointers to optimize the merge, based on the attached diagram.
Skip Pointers Diagram

1. ما هو ناتج العملية (t1 AND t2)؟

2. ما هي مسارات القفز المثلى؟

3. ما هي المستندات التي تم تخطيها؟

✅ Solution Key:
1. Matches: [2, 40].
2. Optimal skips: P1 (12 -> 19) and P2 (20 -> 30).
3. Skipped documents: [17, 18] on P1; [22, 25] on P2.
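Since the diagram itself is not reproduced here, the following is a generic sketch of merge-with-skip-pointers, assuming evenly spaced skips of length sqrt(L); the exact skip layout in the exam's diagram may differ:

```python
import math

def intersect_with_skips(p1, p2):
    """AND-merge that follows a skip pointer when the skip target
    does not overshoot the other list's current docID."""
    def span(postings):
        return max(1, int(math.sqrt(len(postings))))  # sqrt(L) skip spacing

    s1, s2 = span(p1), span(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1          # take the skip pointer
            else:
                i += 1           # plain advance
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer
```

The result is always identical to the plain merge; skips only reduce the number of comparisons.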

Q 04 [2 Points]
Determine optimal query processing order for:
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Given Doc Frequencies: eyes(213k), kaleidoscope(87k), marmalade(107k), skies(271k), tangerine(46k), trees(316k).

✅ Solution Key:
Correct: B. Process the conjuncts in increasing order of estimated result size (the size of an OR is bounded by the sum of its df values):
(kaleidoscope OR eyes) = 87k + 213k = 300k (smallest, so process first)
(tangerine OR trees) = 46k + 316k = 362k
(marmalade OR skies) = 107k + 271k = 378k
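The ordering heuristic can be sketched as follows (document frequencies from the question, using the rounded k-values):

```python
# Estimate each OR-clause size as the sum of its document frequencies
# (an upper bound), then process clauses smallest-first.
df = {"eyes": 213_000, "kaleidoscope": 87_000, "marmalade": 107_000,
      "skies": 271_000, "tangerine": 46_000, "trees": 316_000}

clauses = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]
order = sorted(clauses, key=lambda c: sum(df[t] for t in c))
for c in order:
    print(c, sum(df[t] for t in c))
```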

Q 05 [6 Points]
True/False questions. For each of the following statements, state whether it is True or False:
# | Statement | Answer
1 In the extended Boolean retrieval model, both positional indexes and biword indexes can be used for proximity queries.
2 The total number of 1s in a term-document incidence matrix represents the total number of occurrences of all terms in the document collection.
3 In an IR system, a "collection of documents" refers to the total set of documents being indexed and searched, which may include different document types like scientific papers, news articles, and social media posts, emails, and HTML files.
4 In Information Retrieval systems, both stemming and/or lemmatization are typically applied to improve query matching and document retrieval.
5 If the term t1 has a term frequency of 0.1% in document D1, and the document contains 90,000 terms, the number of positional postings for t1 in the positional index would be 900.
6 In Westlaw-style proximity queries, the operator /p ensures that the specified terms must appear in the same sentence.
1. False (general proximity queries require a positional index; a biword index directly supports only phrase queries)
2. False (the incidence matrix is binary, so the number of 1s counts distinct term-document pairs, not total occurrences; note that the original exam key marked this True)
3. True (a collection is the full set of documents being indexed and searched, whatever their type)
4. True (stemming and lemmatization are standard normalization steps that improve recall)
5. False (90,000 × 0.1% = 90,000 × 0.001 = 90, not 900)
6. False (in Westlaw syntax /p means "within the same paragraph"; /s is the same-sentence operator; note that the original exam key marked this True)
Second Exam

Q 01 [1 Point]
Given the following term frequencies in 3 documents:
Term Document D1 Document D2 Document D3
apple 1 4 10
For documents D1 and D2, write the log-frequency weight. Do not calculate the log value; just substitute into the expression in the table below.
Document log-frequency weight for the term "apple"
D1
D2

D1: $1 + \log_{10}(1) = 1$
D2: $1 + \log_{10}(4) \approx 1.60$
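The log-frequency weights above can be checked numerically (a small sketch, not required by the exam):

```python
import math

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(log_tf(1))             # D1
print(round(log_tf(4), 2))   # D2
```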

Q 02 [1 Point]
Given a collection of 1,000 documents, and the following document frequencies for some terms:
Term Document Frequency (df)
Data 200
science 30
Fill in the Inverse Document Frequency (IDF) weight for each term in the table below. Do not calculate the log value; just substitute into the formula.
Term IDF formula filled
Data
science

Data: $\log_{10}(\frac{1000}{200}) = \log_{10}(5) \approx 0.7$
Science: $\log_{10}(\frac{1000}{30}) \approx 1.52$
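The IDF expressions can likewise be evaluated with a short sketch:

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

print(round(idf(1000, 200), 2))  # Data
print(round(idf(1000, 30), 2))   # science
```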

Q 03 [3 Points]
Cosine Similarity with Vector Normalization. You are given the following TF-IDF weights for a query and a document:
Term Query Weight (qi) Document Weight (di)
Machine 2 3
Learning 3 4
Data 1 5
Compute the cosine similarity between the query and the document. Show all the steps including normalizing the query and document vectors and computing the dot-product. You are not required to compute the final answer, but you must substitute the values in the cosine similarity formula correctly.

$Cos(q,d) = \frac{(2 \times 3) + (3 \times 4) + (1 \times 5)}{\sqrt{2^2 + 3^2 + 1^2} \times \sqrt{3^2 + 4^2 + 5^2}}$
$= \frac{6 + 12 + 5}{\sqrt{14} \times \sqrt{50}}$
$= \frac{23}{\sqrt{14} \times \sqrt{50}}$
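The substituted formula can be evaluated with a small sketch:

```python
import math

def cosine(q, d):
    """Cosine similarity: dot(q, d) / (|q| * |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(x * x for x in d))
    return dot / (norm_q * norm_d)

print(round(cosine([2, 3, 1], [3, 4, 5]), 3))
# -> 0.869
```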

Q 04 [4 Points]
A search engine retrieved a ranked list of documents for a given query. Each document has a known relevance grade from a human assessor on a 0–3 scale (0 = irrelevant, 3 = highly relevant). The top 5 retrieved documents and their relevance scores are shown in the table below.
Rank Retrieved Doc Relevance Grade
1 D2 3
2 D4 2
3 D3 3
4 D1 0
5 D5 1
Compute the DCG@5 and IDCG@5 for the retrieved documents' list that is shown in the table. Write the equation and substitute the numbers (values).

DCG@5: $3 + \frac{2}{\log_2 2} + \frac{3}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5}$
IDCG@5: (Ideal ranking: 3, 3, 2, 1, 0)
$3 + \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{1}{\log_2 4} + \frac{0}{\log_2 5}$
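Both sums can be evaluated with a small sketch using the same convention (rel at rank 1 counted fully, rel_i / log2(i) for i >= 2):

```python
import math

def dcg(rels):
    """DCG: rels[0] + sum of rels[i-1] / log2(i) for ranks i >= 2."""
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

ranked = [3, 2, 3, 0, 1]
print(round(dcg(ranked), 3))                        # DCG@5
print(round(dcg(sorted(ranked, reverse=True)), 3))  # IDCG@5 (ideal order 3, 3, 2, 1, 0)
```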

Q 05 [5 Points]
A search engine based on the Vector Space Model was used to retrieve documents for a query Q1. The system returns ranked document lists based on estimated relevance. Answer the following parts based on the retrieved results and known relevant documents.
  • The relevant documents for Q1 are: D1, D2, D5, D7.
  • The top 8 retrieved documents for Q1, ranked from most to least relevant by the search engine, are listed in the table below.
Part 1: Fill in the table the Precision@k and Recall@k values for k=1 to k=8:
K Ranked list for Q1 Precision@K Recall@K
1 D2
2 D3
3 D1
4 D8
5 D4
6 D5
7 D6
8 D7
Part 2: compute the Average Precision for the ranked list of Q1:
Part 3: compute the Reciprocal Rank score (RR) for the ranked list of Q1:
✅ Solution Key (Part 1):
K | Doc | Precision@K | Recall@K
1 | D2 | 1/1 = 1.00 | 1/4 = 0.25
2 | D3 | 1/2 = 0.50 | 1/4 = 0.25
3 | D1 | 2/3 ≈ 0.67 | 2/4 = 0.50
4 | D8 | 2/4 = 0.50 | 2/4 = 0.50
5 | D4 | 2/5 = 0.40 | 2/4 = 0.50
6 | D5 | 3/6 = 0.50 | 3/4 = 0.75
7 | D6 | 3/7 ≈ 0.43 | 3/4 = 0.75
8 | D7 | 4/8 = 0.50 | 4/4 = 1.00

Part 2 (Average Precision):
Relevant at ranks: 1, 3, 6, 8.
$AP = \frac{1 + 0.67 + 0.5 + 0.5}{4} = \frac{2.67}{4} \approx 0.67$

Part 3 (RR): First relevant doc (D2) is at rank 1.
$RR = \frac{1}{1} = 1$
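All three parts can be verified with a short sketch (relevance set and ranking taken from the question):

```python
relevant = {"D1", "D2", "D5", "D7"}
ranked = ["D2", "D3", "D1", "D8", "D4", "D5", "D6", "D7"]

def precision_at(k):
    return sum(d in relevant for d in ranked[:k]) / k

def recall_at(k):
    return sum(d in relevant for d in ranked[:k]) / len(relevant)

# Average Precision: mean of precision@k over the ranks holding relevant docs.
ap = sum(precision_at(k) for k, d in enumerate(ranked, 1) if d in relevant) / len(relevant)

# Reciprocal Rank: 1 / rank of the first relevant document.
rr = next(1 / k for k, d in enumerate(ranked, 1) if d in relevant)

print(round(ap, 3), rr)
# -> 0.667 1.0
```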

Q 06 [3 Points]
Probabilistic Retrieval. The table below shows user-annotated relevance judgments for various documents (D) in response to different queries (Q). The relevance value (R) is binary:
  • 1 indicates the document is relevant to the query.
  • 0 indicates it is not relevant.
Query (Q) Document (D) Relevancy (R)
Q1 D1 1
Q1 D2 0
Q1 D2 1
Q2 D1 0
Q2 D2 0
Q2 D1 1
Q3 D3 1
Q3 D3 1
Q3 D5 1
Q1 D1 1
Q1 D2 0
Q2 D2 1
Q3 D3 0
Q1 D1 1
Q2 D2 0
For each of the following probabilities, compute the probability that the document Di is relevant to the query Qi. For each probability, show the substituted values used in the computation and not just the final answer.
Probability Show Calculation & Answer
P(R=1 | Q1, D1) =
P(R=1 | Q1, D2) =
P(R=1 | Q2, D1) =
P(R=1 | Q2, D2) =
P(R=1 | Q3, D3) =
P(R=1 | Q3, D5) =
  • P(R=1 | Q1, D1): Rows=(1,1,1) ⇒ Total=3, Rel=3. Answer: $\frac{3}{3} = 1$
  • P(R=1 | Q1, D2): Rows=(0,1,0) ⇒ Total=3, Rel=1. Answer: $\frac{1}{3} \approx 0.33$
  • P(R=1 | Q2, D1): Rows=(0,1) ⇒ Total=2, Rel=1. Answer: $\frac{1}{2} = 0.5$
  • P(R=1 | Q2, D2): Rows=(0,1,0) ⇒ Total=3, Rel=1. Answer: $\frac{1}{3} \approx 0.33$
  • P(R=1 | Q3, D3): Rows=(1,1,0) ⇒ Total=3, Rel=2. Answer: $\frac{2}{3} \approx 0.67$
  • P(R=1 | Q3, D5): Rows=(1) ⇒ Total=1, Rel=1. Answer: $\frac{1}{1} = 1$
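The maximum-likelihood estimates above can be reproduced from the judgment table (rows transcribed from the question):

```python
# Judgment rows (query, doc, relevance) from the table above.
rows = [("Q1","D1",1), ("Q1","D2",0), ("Q1","D2",1), ("Q2","D1",0), ("Q2","D2",0),
        ("Q2","D1",1), ("Q3","D3",1), ("Q3","D3",1), ("Q3","D5",1), ("Q1","D1",1),
        ("Q1","D2",0), ("Q2","D2",1), ("Q3","D3",0), ("Q1","D1",1), ("Q2","D2",0)]

def p_relevant(q, d):
    """MLE: relevant judgments / total judgments for the (query, doc) pair."""
    judged = [r for (qq, dd, r) in rows if (qq, dd) == (q, d)]
    return sum(judged) / len(judged)

for q, d in [("Q1","D1"), ("Q1","D2"), ("Q2","D1"),
             ("Q2","D2"), ("Q3","D3"), ("Q3","D5")]:
    print(q, d, round(p_relevant(q, d), 2))
```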
Q 07 [8 Points]
Multiple Choice and True/False Questions. Choose the correct answer.

1. In the bag-of-words model used for document representation:

2. Which of the following terms would have the highest IDF score in a collection of 1 million documents?

3. What is the effect of IDF on ranking when the query contains only one term?

4. Why is Euclidean distance not ideal for comparing query and document vectors?

5. In SMART notation, the query always uses the same weighting scheme as the document.

6. In SMART notation for weighting, the 't' refers to Term frequency.

7. In a unigram language model, the word order in a query affects its probability.

8. The probability p(R=1|d,q) is used directly in the Query Likelihood Model.

1. D (Word order ignored)
2. C (Rare terms have highest IDF)
3. C (Constant factor for all docs)
4. C (Penalizes long docs)
5. B (False) (Can be different, e.g. lnc.ltc)
6. B (False) (in the SMART weighting triple, 't' appears in the document-frequency slot and denotes idf; the term-frequency options are n, l, a, b, L)
7. B (False) (Unigram ignores order)
8. B (False) (QLM uses P(q|d))