Exam Bank
| # | P1 | P2 | ACTION | RESULT |
|---|---|---|---|---|
| 1 | 1 | 3 | Move P1 (1 < 3) | ∅ |
| 2 | 4 | 3 | Move P2 (4 > 3) | ∅ |
| 3 | 4 | 4 | Match -> Add | Match: 4 |
| 4 | 5 | 6 | Move P1 (5 < 6) | ∅ |
| 5 | 7 | 6 | Move P2 (7 > 6) | ∅ |
| 6 | 7 | 7 | Match -> Add | Match: 7 |
| 7 | 9 | 10 | Move P1 (9 < 10) | ∅ |
| 8 | 11 | 10 | Move P2 (11 > 10) | ∅ |
| 9 | 11 | 13 | Move P1 (11 < 13) | ∅ |
| 10 | 13 | 13 | Match -> Add | Match: 13 |
| 11 | 15 | 14 | End | End |
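The trace above is the standard two-pointer merge of two sorted postings lists. A minimal sketch, with the list contents read off the table (P1 = 1, 4, 5, 7, 9, 11, 13, 15 and P2 = 3, 4, 6, 7, 10, 13, 14):

```python
def intersect(p1, p2):
    """Two-pointer intersection of two sorted docID lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:       # Match -> Add, advance both pointers
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:      # Move P1
            i += 1
        else:                    # Move P2
            j += 1
    return answer

print(intersect([1, 4, 5, 7, 9, 11, 13, 15], [3, 4, 6, 7, 10, 13, 14]))
# -> [4, 7, 13], the three matches found in the trace
```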
| Term | Postings List (DocID: Positions) |
|---|---|
| self | {doc1: [2]} {doc2: [1, 5]} {doc3: [12]} {doc4: [5]} |
| driving | {doc1: [3]} {doc2: [6]} {doc3: [10, 13]} {doc4: [6]} |
| car | {doc1: [4]} {doc2: [8, 10]} {doc3: [8, 11, 14]} {doc4: [8]} |
| is | {doc1: [5]} {doc2: [3]} {doc3: [5, 15]} {doc4: [1, 2, 9]} |
| research | {doc1: [6]} {doc2: [13]} {doc3: [7]} {doc4: [3, 7]} |
| DocID | Sequence | Status |
|---|---|---|
| doc1 | 2, 3, 4 | YES |
| doc2 | 5, 6, 8 | NO |
| doc3 | 12, 13, 14 | YES |
| doc4 | 5, 6, 8 | NO |
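The phrase check above ("self driving car") can be sketched directly from the positional index: a document matches when some position p of the first term has p+1 for the second term and p+2 for the third. A minimal sketch with the postings from the table:

```python
# Positional index from the table above: term -> {docID: [positions]}
index = {
    "self":    {"doc1": [2], "doc2": [1, 5], "doc3": [12], "doc4": [5]},
    "driving": {"doc1": [3], "doc2": [6], "doc3": [10, 13], "doc4": [6]},
    "car":     {"doc1": [4], "doc2": [8, 10], "doc3": [8, 11, 14], "doc4": [8]},
}

def phrase_match(terms, index):
    """Return docIDs containing the terms at consecutive positions."""
    # Only documents containing every term can match
    docs = set.intersection(*(set(index[t]) for t in terms))
    hits = []
    for doc in sorted(docs):
        starts = index[terms[0]][doc]
        # Check positions p, p+1, p+2, ... for each candidate start p
        if any(all(p + k in index[t][doc] for k, t in enumerate(terms))
               for p in starts):
            hits.append(doc)
    return hits

print(phrase_match(["self", "driving", "car"], index))
# -> ['doc1', 'doc3']  (doc2 and doc4 fail: "car" is not adjacent)
```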
1. What is the result of the operation (t1 AND t2)?
2. What are the optimal skip paths?
3. Which documents were skipped?
Matches: [2, 40].
Skips: P1(12->19) skips [17, 18], P2(20->30) skips [22, 25].
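The exercise's actual postings lists are not reproduced above, but the skip-pointer traversal it tests can be sketched generically: skips are placed every ⌊√L⌋ positions (the common textbook heuristic), and a skip is taken only when its target does not overshoot the other list's current value.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, using skip pointers every isqrt(L)."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target is still <= the other value
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```

Skipping is safe because all entries between the pointer and the skip target are strictly smaller than the target, so no potential match is jumped over.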
Correct: B. Start with the smallest estimated size.
(kaleidoscope OR eyes) = 87,009 + 213,312 = 300,321 (Smallest)
(tangerine OR trees) = 46,653 + 316,812 = 363,465
(marmalade OR skies) = 107,913 + 271,658 = 379,571
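The heuristic above can be sketched as a one-line sort: estimate each OR group's result size by the sum of its terms' document frequencies, then process the smallest estimate first (the df values are those used in the worked answer):

```python
# Document frequencies for each term (from the exercise)
df = {"kaleidoscope": 87_009, "eyes": 213_312,
      "tangerine": 46_653, "trees": 316_812,
      "marmalade": 107_913, "skies": 271_658}

groups = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

# Upper bound on |t1 OR t2| is df(t1) + df(t2); process smallest first
order = sorted(groups, key=lambda g: sum(df[t] for t in g))
print(order[0])  # -> ('kaleidoscope', 'eyes'), estimated 300,321
```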
| # | Statement | Answer |
|---|---|---|
| 1 | In the extended Boolean retrieval model, both positional indexes and biword indexes can be used for proximity queries. | |
| 2 | The total number of 1s in a term-document incidence matrix represents the total number of occurrences of all terms in the document collection. | |
| 3 | In an IR system, a "collection of documents" refers to the total set of documents being indexed and searched, which may include different document types like scientific papers, news articles, and social media posts, emails, and HTML files. | |
| 4 | In Information Retrieval systems, both stemming and/or lemmatization are typically applied to improve query matching and document retrieval. | |
| 5 | If the term t1 has a term frequency of 0.1% in document D1, and the document contains 90,000 terms, the number of positional postings for t1 in the positional index would be 900. | |
| 6 | In Westlaw-style proximity queries, the operator /p ensures that the specified terms must appear in the same sentence. | |
| Term | Document D1 | Document D2 | Document D3 |
|---|---|---|---|
| apple | 1 | 4 | 10 |
| Document | log-frequency weight for the term "apple" |
|---|---|
| D1 | |
| D2 | |
D1: $1 + \log_{10}(1) = 1$
D2: $1 + \log_{10}(4) \approx 1.60$
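The log-frequency weighting used above, as a small helper (w = 1 + log₁₀(tf) when tf > 0, and 0 otherwise):

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(log_tf_weight(1))   # -> 1.0   (D1)
print(log_tf_weight(4))   # -> ~1.60 (D2)
print(log_tf_weight(10))  # -> 2.0   (D3, tf = 10 from the table)
```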
| Term | Document Frequency (df) |
|---|---|
| Data | 200 |
| science | 30 |
| Term | IDF formula filled |
|---|---|
| Data | |
| science | |
Data: $\log_{10}(\frac{1000}{200}) = \log_{10}(5) \approx 0.7$
Science: $\log_{10}(\frac{1000}{30}) \approx 1.52$
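The idf computation above as code, assuming a collection of N = 1000 documents (the value implied by the worked answers):

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

print(round(idf(1000, 200), 2))  # -> 0.7  ("Data")
print(round(idf(1000, 30), 2))   # -> 1.52 ("science")
```

The rarer term ("science", df = 30) gets the higher idf, so it contributes more to ranking.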
| Term | Query Weight (qi) | Document Weight (di) |
|---|---|---|
| Machine | 2 | 3 |
| Learning | 3 | 4 |
| Data | 1 | 5 |
$Cos(q,d) = \frac{(2 \times 3) + (3 \times 4) + (1 \times 5)}{\sqrt{2^2 + 3^2 + 1^2} \times \sqrt{3^2 + 4^2 + 5^2}}$
$= \frac{6 + 12 + 5}{\sqrt{14} \times \sqrt{50}}$
$= \frac{23}{\sqrt{700}} \approx 0.87$
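The cosine computation above, for the (unnormalized) query and document weight vectors from the table:

```python
import math

def cosine(q, d):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(x * x for x in d))
    return dot / (norm_q * norm_d)

# Query weights (Machine, Learning, Data) = (2, 3, 1); document = (3, 4, 5)
print(round(cosine([2, 3, 1], [3, 4, 5]), 2))  # -> 0.87
```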
| Rank | Retrieved Doc | Relevance Grade |
|---|---|---|
| 1 | D2 | 3 |
| 2 | D4 | 2 |
| 3 | D3 | 3 |
| 4 | D1 | 0 |
| 5 | D5 | 1 |
DCG@5: $3 + \frac{2}{\log_2 2} + \frac{3}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5} \approx 7.32$
IDCG@5 (ideal ranking: 3, 3, 2, 1, 0): $3 + \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{1}{\log_2 4} + \frac{0}{\log_2 5} \approx 7.76$
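The DCG variant used above takes the rank-1 gain as-is and discounts later ranks by log₂(rank); IDCG is the same sum over the relevance grades sorted in decreasing order. A minimal sketch:

```python
import math

def dcg(rels):
    """DCG: rels[0] + sum(rel_i / log2(i)) for ranks i >= 2."""
    return rels[0] + sum(r / math.log2(i)
                         for i, r in enumerate(rels[1:], start=2))

ranking = [3, 2, 3, 0, 1]                  # grades in retrieved order
ideal = sorted(ranking, reverse=True)      # [3, 3, 2, 1, 0]

print(round(dcg(ranking), 2))              # -> 7.32 (DCG@5)
print(round(dcg(ideal), 2))                # -> 7.76 (IDCG@5)
print(round(dcg(ranking) / dcg(ideal), 2)) # -> 0.94 (NDCG@5)
```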
| K | Ranked list for Q1 | Precision@K | Recall@K |
|---|---|---|---|
| 1 | D2 | ||
| 2 | D3 | ||
| 3 | D1 | ||
| 4 | D8 | ||
| 5 | D4 | ||
| 6 | D5 | ||
| 7 | D6 | ||
| 8 | D7 |
| K | Doc | Precision@K | Recall@K |
|---|---|---|---|
| 1 | D2 | 1/1 = 1.0 | 1/4 = 0.25 |
| 2 | D3 | ||
| 3 | D1 | 2/3 ≈ 0.67 | 2/4 = 0.50 |
| 4 | D8 | ||
| 5 | D4 | ||
| 6 | D5 | 3/6 = 0.50 | 3/4 = 0.75 |
| 7 | D6 | ||
| 8 | D7 | 4/8 = 0.50 | 4/4 = 1.0 |
Part 2 (Average Precision):
Relevant at ranks: 1, 3, 6, 8.
$AP = \frac{1 + 0.67 + 0.50 + 0.50}{4} = \frac{2.67}{4} \approx 0.67$
Part 3 (RR): First relevant doc (D2) is at rank 1.
$RR = \frac{1}{1} = 1$
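The three parts above can be checked with a short script, given the ranked list for Q1 and its relevant set {D2, D1, D5, D7} (the four relevant documents implied by the answer table):

```python
ranked = ["D2", "D3", "D1", "D8", "D4", "D5", "D6", "D7"]
relevant = {"D2", "D1", "D5", "D7"}

def precision_at(k):
    """Fraction of the top-k results that are relevant."""
    return sum(d in relevant for d in ranked[:k]) / k

def recall_at(k):
    """Fraction of all relevant documents found in the top k."""
    return sum(d in relevant for d in ranked[:k]) / len(relevant)

# Average precision: mean of P@K at the ranks of relevant documents
ap = sum(precision_at(i) for i, d in enumerate(ranked, 1)
         if d in relevant) / len(relevant)

# Reciprocal rank: 1 / rank of the first relevant document
rr = 1 / next(i for i, d in enumerate(ranked, 1) if d in relevant)

print(round(ap, 2))  # -> 0.67
print(rr)            # -> 1.0
```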
| Query (Q) | Document (D) | Relevancy (R) |
|---|---|---|
| Q1 | D1 | 1 |
| Q1 | D2 | 0 |
| Q1 | D2 | 1 |
| Q2 | D1 | 0 |
| Q2 | D2 | 0 |
| Q2 | D1 | 1 |
| Q3 | D3 | 1 |
| Q3 | D3 | 1 |
| Q3 | D5 | 1 |
| Q1 | D1 | 1 |
| Q1 | D2 | 0 |
| Q2 | D2 | 1 |
| Q3 | D3 | 0 |
| Q1 | D1 | 1 |
| Q2 | D2 | 0 |
| Probability | Show Calculation & Answer |
|---|---|
| P(R=1 \| Q1, D1) = | 3/3 = 1.0 |
| P(R=1 \| Q1, D2) = | 1/3 ≈ 0.33 |
| P(R=1 \| Q2, D1) = | 1/2 = 0.50 |
| P(R=1 \| Q2, D2) = | 1/3 ≈ 0.33 |
| P(R=1 \| Q3, D3) = | 2/3 ≈ 0.67 |
| P(R=1 \| Q3, D5) = | 1/1 = 1.0 |
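Each probability is the maximum-likelihood estimate over the judgment table: the number of judgments with R = 1 for a (query, document) pair divided by the total number of judgments for that pair. A sketch using the rows above:

```python
# Judgment table from above: (query, document, relevancy)
judgments = [
    ("Q1", "D1", 1), ("Q1", "D2", 0), ("Q1", "D2", 1), ("Q2", "D1", 0),
    ("Q2", "D2", 0), ("Q2", "D1", 1), ("Q3", "D3", 1), ("Q3", "D3", 1),
    ("Q3", "D5", 1), ("Q1", "D1", 1), ("Q1", "D2", 0), ("Q2", "D2", 1),
    ("Q3", "D3", 0), ("Q1", "D1", 1), ("Q2", "D2", 0),
]

def p_relevant(q, d):
    """P(R=1 | q, d): fraction of judgments for (q, d) that have R = 1."""
    rows = [r for qq, dd, r in judgments if (qq, dd) == (q, d)]
    return sum(rows) / len(rows)

print(p_relevant("Q1", "D1"))           # -> 1.0 (3 of 3 judgments)
print(round(p_relevant("Q3", "D3"), 2)) # -> 0.67 (2 of 3 judgments)
```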
1. In the bag-of-words model used for document representation:
2. Which of the following terms would have the highest IDF score in a collection of 1 million documents?
3. What is the effect of IDF on ranking when the query contains only one term?
4. Why is Euclidean distance not ideal for comparing query and document vectors?
5. In SMART notation, the query always uses the same weighting scheme as the document.
6. In SMART notation for weighting, the 't' refers to Term frequency.
7. In a unigram language model, the word order in a query affects its probability.
8. The probability p(R=1|d,q) is used directly in the Query Likelihood Model.
1. D (Word order ignored)
2. C (Rare terms have highest IDF)
3. C (Constant factor for all docs)
4. C (Penalizes long docs)
5. B (False) (Can be different, e.g. lnc.ltc)
6. B (False) (In the SMART weighting triple, 't' denotes idf (the document-frequency component), not term frequency)
7. B (False) (Unigram ignores order)
8. B (False) (QLM uses P(q|d))