This report discusses the collaborative work of the ErasmusMC, University of Twente, and the University of Amsterdam on the TREC 2011 Medical track. Here, the task is to retrieve patient visits from the University of Pittsburgh NLP Repository for 35 topics. The repository consists of 101,711 patient reports, and a patient visit was recorded in one or more reports.

Because the training set provided by the track organization was small and not made available until quite late in the competition, we decided to create a small training set ourselves. Not only did this allow us to test several ideas before submitting runs to TREC, it also led to several insights into the data. One finding was that synonyms are widely used. Query expansion was therefore deemed essential to achieve reasonable performance. Query expansion has been used before in Information Retrieval (IR), and is often divided into statistical and knowledge-based query expansion. Statistical query expansion uses data derived from the corpus itself; a well-known example is pseudo-relevance feedback. In contrast, we investigated knowledge-based query expansion, which uses a knowledge base such as an ontology or a dictionary to find related terms. This type of query expansion has not always proven successful. For instance, Hersh et al. found a decrease in overall search performance when using the Unified Medical Language System (UMLS) to find related terms. Liu et al. found slight improvements with scenario-specific expansion strategies using UMLS. In a previous TREC track, we also found reduced performance when using concept-based query expansion, but found slightly improved results when using an approach combining concepts with a statistical model of related words. Similarly, Zhou found promising results when using a combination of both the original words in the text and the synonyms found for concepts in the text.

An often-used resource for knowledge-based query expansion in the biomedical domain is the UMLS. However, initial explorations indicated that there is only limited overlap between the terms used in topics and medical records and the terms found in the UMLS. The main reason for this appears to be that the UMLS is mainly constructed from vocabularies used for classifying clinical data, and is not intended for use in text mining. Terms in the UMLS tend to be more specific than what a physician would use in free-text reporting. For instance, a physician might use the term "upper endoscopy", but this term is not found in the UMLS. Instead, the term "upper GI endoscopy" is found. We have therefore explored a different source of synonyms: Wikipedia. We expected Wikipedia to have better coverage of the terms encountered in medical records.
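
To make the idea of knowledge-based query expansion concrete, the following is a minimal sketch in Python. It is not the method used in our runs: the synonym table, the `expand_query` function, and the example query are all hypothetical, and a real system would draw its synonyms from a resource such as the UMLS or Wikipedia rather than from a hand-written dictionary. The sketch only shows how a query containing a free-text term such as "upper endoscopy" could be expanded with alternative wordings while keeping the original query.

```python
from typing import Dict, List

# Hypothetical, hand-written synonym table used only for illustration.
# In practice the entries would come from a knowledge base.
SYNONYMS: Dict[str, List[str]] = {
    "upper endoscopy": ["upper GI endoscopy", "esophagogastroduodenoscopy", "EGD"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original query plus one variant per known synonym.

    The original query always comes first, so the original words are
    retained alongside the expanded forms.
    """
    expanded = [query]
    lowered = query.lower()
    for term, alternatives in synonyms.items():
        # Case-insensitive search for the first occurrence of the term.
        start = lowered.find(term.lower())
        if start == -1:
            continue
        end = start + len(term)
        for alternative in alternatives:
            variant = query[:start] + alternative + query[end:]
            if variant not in expanded:
                expanded.append(variant)
    return expanded


if __name__ == "__main__":
    for variant in expand_query("patients with upper endoscopy", SYNONYMS):
        print(variant)
```

Keeping the original query in the expanded set mirrors the observation above that combining the original words with synonyms tends to work better than replacing them outright.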

  • M. Schuemie, D. Trieschnigg, and E. Meij, "DutchHatTrick: Semantic query modeling, ConText, section detection, and match score maximization," in TREC 2011 Working Notes, 2011.
    @inproceedings{TREC:2011:schuemie,
      Author = {Schuemie, M. and Trieschnigg, Dolf and Meij, Edgar},
      Booktitle = {TREC 2011 Working Notes},
      Date-Added = {2011-10-22 12:14:30 +0200},
      Date-Modified = {2011-10-22 12:15:47 +0200},
      Title = {DutchHatTrick: Semantic query modeling, {ConText}, section detection, and match score maximization},
      Year = {2011}}