Microsoft Research Paraphrase Corpus


Bill Dolan, Chris Brockett, and Chris Quirk

Microsoft Research

March 2, 2005


This document provides some information about the creation of the corpus, along with results of the annotation effort. If you use the corpus in your research, we would appreciate your citing one or both of the following papers, which give some details of our work on paraphrase and our data annotation efforts. (A paper describing in detail how this corpus was created is currently in progress.) We are continuing to tag data, and hope to release a larger version of this corpus to the research community in the future.

Quirk, C., C. Brockett, and W. B. Dolan. 2004. Monolingual Machine Translation for Paraphrase Generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain.

Dolan, W. B., C. Quirk, and C. Brockett. 2004. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. COLING 2004, Geneva, Switzerland.


1.      Introduction to the paraphrase tagging task


This dataset consists of 5801 pairs of sentences gleaned over a period of 18 months from thousands of news sources on the web. Accompanying each pair is a judgment reflecting whether multiple human annotators considered the two sentences to be close enough in meaning to be considered close paraphrases.


Each pair of sentences has been examined by 2 human judges who were asked to give a binary judgment as to whether the two sentences could be considered “semantically equivalent”. Disagreements were resolved by a 3rd judge. This annotation task was carried out by an independent company, the Butler Hill Group, LLC. Mo Corston-Oliver directed the effort, with Jeff Stevenson, Amy Muia, and David Rojas acting as raters. Mo Corston-Oliver and Jeff Stevenson also helped with the preparation of this document.


After resolving differences between raters, 3900 (67%) of the original 5801 pairs were judged “semantically equivalent”.


In many instances, the pair of sentences rated by 2 judges as “semantically equivalent” will in fact diverge semantically to at least some degree. If a full paraphrase relationship can be described as “bidirectional entailment”, then the majority of the “equivalent” pairs in this dataset exhibit “mostly bidirectional entailments”, with one sentence containing information that differs from or is not contained in the other. Some specific rating criteria are included in a tagging specification (Section 3), but by and large the degree of mismatch allowed before the pair was judged “non-equivalent” was left to the discretion of the individual rater: did a particular set of asymmetries alter the meanings of the sentences enough that they couldn’t be considered “the same” in meaning? This task was ill-defined enough that we were surprised at how high interrater agreement was (averaging 83%).


A series of experiments aimed at making the judging task more concrete resulted in uniformly degraded interrater agreement. Providing a checkbox to allow judges to specify that one sentence entailed another, for instance, left the raters frustrated and had a negative impact on agreement. Similarly, efforts to identify classes of syntactic alternations that would not count against an “equivalent” judgment resulted, in most cases, in a collapse in interrater agreement. The relatively few situations where we found firm guidelines of this type to be helpful (e.g. in dealing with anaphora) are included in Section 3.


The decision to tag sentences as being “more or less semantically equivalent”, rather than “semantically equivalent”, was ultimately a practical one: insisting on complete sets of bidirectional entailments would have ruled out all but the most trivial sorts of paraphrase relationships, such as sentence pairs differing only in a single word or in the presence of titles like “Mr.” and “Ms.”. Our interest was in identifying more complex paraphrase relationships, which required a somewhat looser definition of what “semantic equivalence” means. In an effort to focus on these more interesting pairs, the dataset was restricted to pairs with a minimum word-based Levenshtein distance of 8.
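As a rough illustration of the distance filter, word-based Levenshtein distance can be computed with the standard dynamic-programming recurrence applied to words instead of characters. The sketch below assumes plain whitespace tokenization, and the function names are hypothetical; the tokenizer and implementation actually used to build the corpus are not described in this document.

```python
def word_levenshtein(sentence_a: str, sentence_b: str) -> int:
    """Edit distance counted in whole words (insert, delete, substitute)."""
    a = sentence_a.split()  # assumption: whitespace tokenization
    b = sentence_b.split()
    # previous[j] is the distance between the words of a processed so far
    # (before the current one) and the first j words of b.
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(
                previous[j] + 1,         # delete word_a
                current[j - 1] + 1,      # insert word_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


MIN_DISTANCE = 8  # threshold stated in the text


def passes_distance_filter(sentence_a: str, sentence_b: str) -> bool:
    """True if the pair is at least MIN_DISTANCE word edits apart."""
    return word_levenshtein(sentence_a, sentence_b) >= MIN_DISTANCE
```

Under this sketch, a pair that differs only by a dropped title such as “Mr.” has a distance of 1 and would be filtered out.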


Given our relatively loose definition of equivalence, any 2 of the following sentences would probably have been considered “paraphrases”, despite obvious differences in information content:


·        The genome of the fungal pathogen that causes Sudden Oak Death has been sequenced by US scientists

·        Researchers announced Thursday they've completed the genetic blueprint of the blight-causing culprit responsible for sudden oak death

·        Scientists have figured out the complete genetic code of a virulent pathogen that has killed tens of thousands of California native oaks

·        The East Bay-based Joint Genome Institute said Thursday it has unraveled the genetic blueprint for the diseases that cause the sudden death of oak trees


Raters were presented with sentences in which several classes of named entities were replaced by generic tags, so that “Tuesday” became “%%DAY%%”, “$10,000” became “%%MONEY%%”, and so on. The release versions, however, preserve the original strings.
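The kind of substitution described above can be sketched with two regular expressions. This is only an illustration: the patterns and the function name are assumptions, they cover just the two entity classes named in the text, and the tagger actually used for the rating task is not documented here.

```python
import re

# Hypothetical patterns for two of the entity classes mentioned in the text.
DAY_PATTERN = re.compile(
    r"\b(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b"
)
# A dollar sign, digits with optional thousands separators, optional decimals.
MONEY_PATTERN = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


def mask_entities(sentence: str) -> str:
    """Replace weekday names and dollar amounts with generic tags."""
    sentence = DAY_PATTERN.sub("%%DAY%%", sentence)
    sentence = MONEY_PATTERN.sub("%%MONEY%%", sentence)
    return sentence
```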


Note that many of the sentence pairs judged to be “not equivalent” will still overlap significantly in information content and even wording. A variety of automatic filtering techniques were used to create an initial dataset that was rich in paraphrase relationships, and the success of these techniques meant that approximately 70% of the pairs examined by raters were, by our criteria, semantically equivalent. The remaining 30% represent a range of relationships, from pairs that are completely unrelated semantically, to those that are partially overlapping, to those that are almost-but-not-quite semantically equivalent. For this reason, this “not equivalent” set should not be used as negative training data.


We have made every effort to ensure that each sentence in this dataset has been given proper attribution. If you encounter any errors/omissions, please contact Bill Dolan, and we will promptly modify the data to reflect the correct information.





2.      Methodology and Results


This data set consists of 5801 sentence pairs, with a binary human judgment of whether or not the pairing constitutes a paraphrase.


2.1.     Methodology


To generate the judgments, we used 3 raters to score the sentence pairs according to a given specification. Rater 1 scored all 5801 sentence pairs. Rater 2 scored 3533 pairs, and Rater 3 scored the remaining 2268 pairs. For the pairs where Raters 1 and 2 did not agree on the judgment, Rater 3 gave a final judgment, while Rater 2 gave the final judgment on pairs where Raters 1 and 3 did not agree.
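The resolution rule can be written down in a few lines. The following is a minimal sketch, assuming each judgment is stored as a boolean (True for “equivalent”); the function name and data representation are hypothetical and are not taken from the annotation tooling.

```python
from typing import Optional


def final_judgment(first: bool, second: bool,
                   tiebreak: Optional[bool] = None) -> bool:
    """Return the resolved label for one sentence pair.

    `first` and `second` are the two independent judgments; `tiebreak`
    is the remaining rater's judgment, needed only when they disagree.
    """
    if first == second:
        return first
    if tiebreak is None:
        raise ValueError("raters disagree; a third judgment is required")
    return tiebreak
```

For a pair scored by Raters 1 and 2, `tiebreak` would be Rater 3's judgment; for a pair scored by Raters 1 and 3, it would be Rater 2's.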


2.2.     Interrater Agreement


To test interrater agreement, we took a simple percentage:



                 Total scored     Total agreements     Percentage agreement

Raters 1 & 2

Raters 1 & 3
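The “simple percentage” is the number of pairs on which two raters gave the same label, divided by the number of pairs both scored. A minimal sketch follows; the function name is hypothetical, and any labels passed to it in an example are invented toy data, not figures from this corpus.

```python
from typing import Sequence


def percentage_agreement(labels_a: Sequence[bool],
                         labels_b: Sequence[bool]) -> float:
    """Percentage of items on which two raters gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both raters must have scored the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    agreements = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * agreements / len(labels_a)
```

Note that this measure is not corrected for chance agreement.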





2.3.     Overall scoring results


We computed scoring results for each individual (raw scores, before resolving differences):



            Total scored     Number “yes”     Percentage “yes”

Rater 1

Rater 2

Rater 3





After resolving differences, we judged 3900 out of 5801 sentence pairs to be valid paraphrases, for a final percentage of 67.23%.


2.4.     Test/training


We assigned a random sequence ID to each sentence pair, sorted them, and assigned the first 30% of the data to be “test” and the last 70% to be “training” data. For obscure technical reasons, the final test/train split is inexact: 29.7% (1725 sentence pairs) vs. 70.3% (4076 sentence pairs).
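The procedure of assigning random sequence IDs, sorting, and cutting at a fixed fraction can be sketched as follows. The function name, the seed, and the use of Python's `random` module are assumptions made for illustration; they do not reproduce the actual split of the released files.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_random_id(pairs: Sequence[T], first_fraction: float = 0.3,
                       seed: int = 0) -> Tuple[List[T], List[T]]:
    """Give each pair a random sequence ID, sort by it, and cut the list.

    Returns (first_part, second_part), where first_part holds roughly
    `first_fraction` of the data.
    """
    rng = random.Random(seed)  # assumption: a fixed seed for repeatability
    keyed = [(rng.random(), index, pair) for index, pair in enumerate(pairs)]
    # Sort by the random ID; the original index breaks any ties.
    keyed.sort(key=lambda item: (item[0], item[1]))
    ordered = [pair for _, _, pair in keyed]
    cut = round(len(ordered) * first_fraction)
    return ordered[:cut], ordered[cut:]
```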




3.      Detailed Tagging Guidelines



3.1.     “Equivalent” vs. “not equivalent” content


·        In this task, we are trying to determine if two sentences express the same content.

·        As is true for paraphrase in general, this may be realized by means of alternative but similar syntactic constructions and lexical items, etc.

·        In general, the standard as to whether two sentences express the same content should be relatively high, meaning that many ambiguous cases should be marked “not equivalent” rather than “equivalent”.


Examples of sentences with “equivalent” content expressed via alternative lexical items:


The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.


American intelligence leading up to the war on Iraq will be criticised by a powerful US Congressional committee due to report soon, officials said today.



A strong geomagnetic storm was expected to hit Earth today with the potential to affect electrical grids and satellite communications.


A strong geomagnetic storm is expected to hit Earth sometime %%DAY%% and could knock out electrical grids and satellite communications.


These sentences are clearly paraphrases. The different lexical items are still expressing the same content. This type of sentence pair should be tagged as “equivalent”.



3.2.     “Equivalent” sentence pairs with minor differences in content


Minor differences between sentences can be overlooked when determining if two sentences are paraphrases. For example:


An autopsy found Hatab's death was caused by "strangulation/asphyxiation," Rawson said %%DAY%%.


An autopsy found that Nagem Sadoon Hatab's death on %%DATE%% was caused by "strangulation/asphyxiation,” Marine spokesman %%NUMBER%% st Lt. Dan Rawson said %%DAY%%.


The following sentences also express “equivalent” content:


Mr. Concannon had been doused in petrol, set himself alight and jumped onto a bike to leap eight metres onto a mattress below.


A SYDNEY man suffered serious burns after setting himself alight before attempting to jump a BMX bike off a toilet block into a pile of mattresses , police said.


The agent (Mr. Concannon), the predicated actions (set himself alight, jumped a bike), and important details (onto a pile of mattresses) are present in both sentences. Additional lexical material in either sentence mainly serves to embellish the main propositions (for example, “. . .suffered serious burns”, which is logically entailed by “set himself alight”). Also notice that the details of a given proposition need not be exact: a mattress (sing.) vs. a pile of mattresses (plur.). Finally, notice that the second sentence of the pair in the previous example is “attributed” to the police where the first is not. This difference between sentences is also acceptable for purposes of tagging them as paraphrases.


For this type of sentence pair, we want to mark them as equivalent (paraphrases). Notice that the sentences, while clearly similar overall in content, differ in additional, modifying content. As the main content of the sentences is similar in meaning, we “allow” some minor content mismatch.


3.3.     Anaphora


Sometimes the difference between two sentences involves anaphora (NPs and pronominal). These sentences can be tagged as paraphrases despite the (sometimes) fairly large gap between them in terms of their corresponding full-form NPs. Examples follow.


3.3.1.                  Demonstratives


But Secretary of State Colin Powell brushed off this possibility  %%day%%.


Secretary of State Colin Powell last week ruled out a non-aggression treaty.



3.3.2.       NP -> pro


Meteorologists predicted the storm would become a category %%number%% hurricane before landfall.


It was predicted to become a category 1 hurricane overnight.


3.3.3.       Proper NP (+animate) -> pro


Earlier, he told France Inter-Radio , ''I think we can now qualify what is happening as a genuine epidemic.''


''I think we can now qualify what is happening as a genuine epidemic,'' health minister Jean-Francois Mattei said on France Inter Radio.


3.3.4.      Title + proper NP (+animate) -> pro


''United is continuing to deliver major cost reductions and is now coupling that effort with significant unit revenue improvement, '' chief financial officer Jake Brace said in a statement.


''United is continuing to deliver major cost reductions and is now coupling that effort with significant unit revenue improvement,'' he said.


3.3.5.       NP (-animate) -> pro


''Spoofing is a problem faced by any company with a trusted domain name that uses e-mail to communicate with its customers.


It is a problem for Amazon and others that have a trusted domain name and use e-mail to communicate with customers.



3.4.     Inherent ambiguity of the task


The relatively holistic/vague criteria established above should work well for most sentence pairs. In the end, we’re tagging something that’s not quite paraphrase, but something like “semantic near-equivalence”: sentence pairs that ideally involve complete sets of bidirectional entailments, but which in fact often have some entailment asymmetries or other mismatches. The issue here is when those asymmetries/differences become significant enough that you no longer think the two sentences mean more or less the same thing, where “more or less” becomes a personal judgment call.



3.5.     Sentence pairs with “different” content


3.5.1.       “Different” content: prototypical example


In contrast to the examples above, the following sentences clearly express “different” content:


Prime Minister Junichiro Koizumi did not have to dissolve parliament until next summer , when elections for the upper house are also due .


Prime Minister Junichiro Koizumi has urged Nakasone to give up his seat in accordance with the new age rule .


While the principal agent (Koizumi) is the same, the predicated actions, i.e. verbs (dissolve / urge), and other arguments (parliament / Nakasone) are clearly different. The additional material found in either sentence does not embellish the main proposition but instead contains important content itself. These two sentences should be marked as “not equivalent”: while they share an agent, “Koizumi,” they are about unrelated events. Again, ambiguous cases should be marked “not equivalent” rather than “equivalent”.



3.5.2.      Shared content of the same event, etc. but lacking details (one sentence is a superset of the other)


Researchers have identified a genetic pilot light for puberty in both mice and humans .


The discovery of a gene that appears to be a key regulator of puberty in humans and mice could lead to new infertility treatments and contraceptives.


These sentences are similar in content and refer to a similar key piece of information, but cannot be marked as “equivalent”. The sentences should be tagged as “not equivalent” because, even though their content is similar, one sentence is a significantly larger superset of the other: all the content of the first sentence is in the second, but not vice versa. The superset sentence contains important content (that the discovery could lead to new infertility treatments and contraceptives) not present in the first sentence.


Some similar sentence pairs follow (missing content in superset sentence is in bold):


SOME %%NUMBER%% jobs are set to go at Cadbury Schweppes , the confectionery and drinks giant , as part of a sweeping cost reduction programme announced today .


Confectionery group Cadbury Schweppes has warned of further cuts to its %%NUMBER%% -strong UK workforce .



This sentence pair is difficult in that, while one sentence is a superset of the other, it is also arguably the case that the sentences are “almost” paraphrases, except that the content of the underlined portions in the two sentences above is exclusive to one sentence. In the end, however, the material in bold is an important difference in content between the sentences, leading us to prefer tagging them as “not equivalent”.


Please use your best judgment in choosing to tag sentences as “equivalent” or “not equivalent”. Many of the sentence pairs you see differ due to the way editors eliminate language/content they deem unnecessary. Sometimes the two sentences will differ in material that conveys important additional information. Sentences like these should be tagged as “not equivalent”:



The former wife of rapper Eminem has been electronically tagged after missing two court appearances .


After missing two court appearances in a cocaine possession case , Eminem's ex-wife has been placed under electronic house arrest .



The issue of whether or not the extra/missing information is significant enough to warrant treating the sentences as “not equivalent” amounts to a judgment call. Minor differences between sentences can be overlooked when determining if two sentences are paraphrases. As seen in a previous example sentence pair, the only differences in content between the following sentences are the reduced forms of names and adverbial modifiers (dates). There are no major differences in content between these sentences. They can be marked as “equivalent”.



An autopsy found Hatab's death was caused by "strangulation/asphyxiation," Rawson said %%DAY%%  .


An autopsy found that Nagem Sadoon Hatab's death on %%DATE%% was caused by " strangulation/asphyxiation , "   Marine spokesman %%NUMBER%% st Lt. Dan Rawson said %%DAY%%.


The role of content asymmetries in determining whether sentences should be marked as equivalent/not equivalent is also linked to sentence length. In a pair of 20-word sentences, the presence/absence of a single modifier might be lost in the noise, while in a pair of 5-word sentences it might take on much greater significance. There is no good way to normalize for length in such cases, so again, just depend on your own judgment.



3.5.3.      Cannot determine if sentences refer to the same event


More than %%NUMBER%% acres burned and more than %%NUMBER%% homes were destroyed in the massive Cedar Fire .


Major fires had burned %%NUMBER%% acres by early last night.


In this example, both sentences could be about the same series of events (fires). However, these are possibly about two events: one is about a specific fire, the other about a cluster of fires. This should lead us to annotate these sentences as expressing “not equivalent” content. Another such example follows:


The spokeswoman said four soldiers were wounded in the attack, which took place just before noon around %%NUMBER%% km ( %%NUMBER%% miles ) north of the capital Baghdad.


Two US soldiers were killed in a mortar attack near the Iraqi town of Samarra yesterday , a US military spokeswoman said.


Notice that both sentences report casualties among soldiers in an attack in Iraq. However, it is clear that the two sentences could be describing two isolated events. The discrepancy in the reported casualties (four wounded vs. two killed) should add to one’s suspicions that this might be the case. Since the sentences share some content, but we cannot be sure they refer to the same event, we should err on the side of caution and mark them as “not equivalent”.



3.5.4.      Shared content but different rhetorical structure


The search feature works with around %%NUMBER%% titles from %%NUMBER%% publishers, which translates into some %%NUMBER%% million pages of searchable text .


This innovative search feature lets Amazon customers search the full text of a title to find a book , supplementing the existing search by author or title .


In this sentence pair, both sentences clearly make statements about a new search feature. However, notice the emphasis placed on the amount of data in the first sentence via the rhetorical device of reiterated citation of numbers. The two sentences are about the same subject matter, but they are significantly different in that the first might occur as a detailed exploration of the second. This leads us to mark the sentences as “not equivalent”.


3.5.5.      Same event but different details and emphasis


A Hunter Valley woman sentenced to %%NUMBER%% years jail for killing her four babies was only a danger to children in her care, a court was told.


As she stood up yesterday to receive a sentence of %%NUMBER%% years for killing her four babies, Kathleen Folbigg showed no emotion.


These sentences clearly report information related to the same event, but the first sentence emphasizes a particular legal argument presented by the convicted woman’s lawyer, while the second focuses on her apparent mental state at the trial. Given the magnitude of the semantic divergence between these two sentences, both in terms of content and emphasis, this type of sentence pair should be tagged as “not equivalent”.


More example sentence pairs which, while clearly significantly overlapping in content, should be tagged as “not equivalent”:



Authorities dubbed the investigation Operation Rollback , a reference to Wal-Mart's name for price reductions .


The ICE's investigation , known as " Operation Rollback " , targeted workers at %%NUMBER%% Wal-Mart stores in %%NUMBER%% states .



Researchers also found that women with mutations in the BRCA1 or BRCA2 gene have a %%NUMBER%% % to %%NUMBER%% % risk of ovarian cancer , depending on which gene is affected .


Earlier studies had suggested that the breast cancer risk from the gene mutations ranged from %%NUMBER%% % to %%NUMBER%% % .


Note that while the sentences may refer to the same piece of information, the inclusion of “earlier studies….” suggests this may not be the case. Therefore, they should be tagged as “not equivalent”.