Book Search Experiments: Investigating IR Methods for the Indexing and Retrieval of Books
Advances in Information Retrieval (2008), pp. 234-245.
Abstract
Through mass-digitization projects and with the use of OCR technologies, digitized books are becoming available on the Web and in digital libraries. The unprecedented scale of these efforts, the unique characteristics of the digitized material, and the unexplored possibilities of user interactions make full-text book search an exciting area of information retrieval (IR) research. Emerging research questions include: How appropriate and effective are traditional IR models when applied to books? What book-specific features (e.g., back-of-book index) should receive special attention during the indexing and retrieval processes? How can we tackle scalability? In order to answer such questions, we developed an experimental platform to facilitate rapid prototyping of a book search system as well as to support large-scale tests. Using this system, we performed experiments on a collection of 10 000 books, evaluating the efficiency of a novel multi-field inverted index and the effectiveness of the BM25F retrieval model adapted to books, using book-specific fields.
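To make the idea of field-weighted scoring concrete, the following is a minimal sketch of the commonly published BM25F formulation, not the implementation evaluated in the paper. The abstract does not give the fields, weights, or parameter values used, so the field names (`title`, `body`), the weights, the per-field `b` values, `k1`, the toy statistics, and the smoothed IDF variant are all assumptions chosen for illustration.

```python
import math


def bm25f_score(query_terms, doc_fields, field_weights, field_b,
                avg_field_len, doc_freq, num_docs, k1=1.2):
    """Score one document with a simple BM25F variant.

    doc_fields maps a field name to its list of tokens. All parameter
    values are illustrative assumptions, not those of the paper.
    """
    score = 0.0
    for term in set(query_terms):
        # Combine length-normalized term frequencies across fields
        # into one weighted pseudo-frequency.
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            b = field_b[field]
            norm = 1.0 + b * (len(tokens) / avg_field_len[field] - 1.0)
            pseudo_tf += field_weights[field] * tf / norm
        if pseudo_tf == 0.0:
            continue
        # Smoothed IDF (assumed variant) that stays non-negative.
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturate the combined frequency once, after field weighting.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Hypothetical toy data: two "books" with a title and a body field.
FIELD_WEIGHTS = {"title": 5.0, "body": 1.0}
FIELD_B = {"title": 0.5, "body": 0.75}
AVG_FIELD_LEN = {"title": 2.0, "body": 3.0}
DOC_FREQ = {"information": 2}
NUM_DOCS = 10

book_a = {"title": ["information", "retrieval"],
          "body": ["books", "about", "search"]}
book_b = {"title": ["cooking", "basics"],
          "body": ["information", "about", "recipes"]}


def score(book, query):
    return bm25f_score(query, book, FIELD_WEIGHTS, FIELD_B,
                       AVG_FIELD_LEN, DOC_FREQ, NUM_DOCS)


if __name__ == "__main__":
    for name, book in (("book_a", book_a), ("book_b", book_b)):
        print(name, score(book, ["information"]))
```

In this sketch a match in the more heavily weighted field contributes more to the pseudo-frequency, so `book_a` (title match) ranks above `book_b` (body match only). A book-oriented setup could add further fields, such as the back-of-book index mentioned in the abstract, by extending the three per-field dictionaries.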