oatd.org Discovery for Open Access Theses and Dissertations An ASERL Webinar, October 15, 2013 These slides: http://goo.gl/muxq15 Thomas Dowling dowlintp@wfu.edu
I Can Haz ASERL ETDs?
  34 of 37 ASERL universities provide open access ETDs.
  25 of 37 members provide OA ETDs through a harvestable repository.
  [Or, 9 members provide OA ETDs but do not make them harvestable.]
  As of September 30, 2013, OATD indexes 99,857 records from ASERL members.
What Is OATD? A discovery service for Open Access Graduate Level Theses and Dissertations Harvested from Repositories Worldwide
What Is OATD?
  1.89 million records
  850+ universities
  360+ repositories
  75,000 records with "semi-full text turbo boost"
    Search hits from first 40 pages
    Sample images from PDF
    Not a full-text index
  One Amazon small server + 200GB
What Is OATD? Steering Committee: Martin Courtois (Kansas State), John Hagen (WVU), Molly Keener (WFU), Caitlin Nelson (Florida Virtual Lib), Ryan Steans (Texas Digital Lib), Zoe Stewart-Marshall (past president, LITA) Generous support from the Z. Smith Reynolds Library, Wake Forest University.
What Needs Does OATD Meet?
  Current search tools for ETDs:
    Point to closed-access copies when OA is available
    Lump OA ETDs in with overwhelming numbers of other documents
    Rely on the kindness of vendors
    Have under-developed, uninformative user interfaces
    Have no enhancement request process
What Needs Does OATD Meet?
  Meanwhile, in every other library search interface...
    Massive investment of time, energy, and money
    Google-driven user expectations
    Simpler search
    Concentration on what users can do with results
OATD Components
  Getting Metadata (OAI-PMH harvesting)
  Cleaning Up Metadata (XML conversion)
  Indexing Metadata (Solr)
  Web User Interface
  Web Crawler for PDFs
  Web Crawlers for [a few] non-OAI repositories
OAI-PMH
  Built into most major repository platforms (DSpace, Digital Commons, CONTENTdm, EPrints...)
  [Chart: OAI-harvested repositories in OATD, by platform]
OAI-PMH
  But...
    It may not be enabled
    It may not be configured well
    It may break without alerting you
  Wait. Didn't Google quit using OAI-PMH? Why should we still care about it?
OAI-PMH
  Talks to our highly structured metadata:
    <title>Emulating Data Synthesis for Virtual Simulations</title>
    <dc:creator>Aaronson, A. Arthur</dc:creator>
    <dc:contributor role="Committee Chair">Berenson, Barbara B.</dc:contributor>
OAI-PMH
  Talks to our highly structured metadata:
    <mods:dateaccessioned>2013-09-16T09:15:00Z</mods:dateaccessioned>
    <mods:dateavailable>2014-09-16T09:15:00Z</mods:dateavailable>
OAI-PMH Six "Verbs" (and various adverbs)
  [Odds are, you don't need to know any of this...]
  Identify: Tell me about yourself
  ListMetadataFormats: Tell me what metadata "flavors" you offer (DC, Qualified DC; ETD-MS, UKETD, xmetadiss; METS, MODS)
  ListSets: Tell me how you subdivide your repository
OAI-PMH Six "Verbs" (and various adverbs)
  ListIdentifiers: List record identifiers [in this set] [from a date] [until a date] [available in this metadata format]
  GetRecord: Give me one record [with this identifier] [in this metadata format]
  ListRecords: Give me all records [in this set] [from a date] [until a date] [in this metadata format]
OAI-PMH Six "Verbs" (and various adverbs)
  So for example:
    http://archive.foo.edu/oai?verb=ListRecords&set=etds&from=2013-10-01&metadataPrefix=oai_etdms
  (Verb and argument names are case-sensitive: ListRecords, metadataPrefix.)
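The request on this slide can be assembled programmatically. The following is a minimal Python sketch, not OATD's harvester: `oai_request_url` is a hypothetical helper name, and the base URL, set name, and metadata prefix are the illustrative values from the slide.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL (hypothetical helper, for illustration).

    Verb and argument names are case-sensitive in OAI-PMH, so they are
    passed through exactly as given.
    """
    params = {"verb": verb}
    # Skip arguments left as None so optional ones can simply be omitted.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


# "from" is a Python keyword, so the arguments are supplied as a dict.
url = oai_request_url(
    "http://archive.foo.edu/oai",
    "ListRecords",
    **{"set": "etds", "from": "2013-10-01", "metadataPrefix": "oai_etdms"},
)
print(url)
```

Changing the verb to `Identify` or `ListSets` and dropping the extra arguments would produce the simpler requests described on the earlier slides.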
Cleanup and Conversion
  ...
  <metadata>
    <title>A New Theory on the Brontosaurus</title>
    <creator>Elke, Anne</creator>
    <degree>
      <name>Doctor of Philosophy</name>
      <level>doctoral</level>
      <discipline>Dinosaur Studies</discipline>
      <grantor>Foo Tech</grantor>
    </degree>
  </metadata>
  ...
Cleanup and Conversion
  Harvested:
    <title>A New Theory on the Brontosaurus</title>
    <creator>Elke, Anne</creator>
    <degree>Doctor of Philosophy</degree>
    <grantor>Foo Tech</grantor>
  Indexed:
    <field name="title">A New Theory on the Brontosaurus</field>
    <field name="author">Elke, Anne</field>
    <field name="degree">PhD</field>
    <field name="publisher">Foo Institute of Technology and Science</field>
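The conversion on this slide could be sketched roughly as follows. This is an illustrative Python sketch, not OATD's actual conversion code: the lookup tables and function names are hypothetical, and the sample record omits XML namespaces, which real ETD-MS records would carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables; a real harvester would need far larger ones.
DEGREE_ABBREVIATIONS = {"doctor of philosophy": "PhD", "master of science": "MS"}
PUBLISHER_NAMES = {"Foo Tech": "Foo Institute of Technology and Science"}

SAMPLE = """<metadata>
  <title>A New Theory on the Brontosaurus</title>
  <creator>Elke, Anne</creator>
  <degree>
    <name>Doctor of Philosophy</name>
    <level>doctoral</level>
    <discipline>Dinosaur Studies</discipline>
    <grantor>Foo Tech</grantor>
  </degree>
</metadata>"""


def to_index_fields(metadata_xml):
    """Map a harvested record to flat, normalized index fields."""
    root = ET.fromstring(metadata_xml)

    def text(path):
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    degree = text("degree/name")
    grantor = text("degree/grantor")
    fields = {
        "title": text("title"),
        "author": text("creator"),
        # Fall back to the raw value when no normalized form is known.
        "degree": DEGREE_ABBREVIATIONS.get((degree or "").lower(), degree),
        "publisher": PUBLISHER_NAMES.get(grantor, grantor),
    }
    return {name: value for name, value in fields.items() if value is not None}


def to_solr_doc(fields):
    """Serialize the fields as a <doc> element in Solr's XML update format."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(doc, encoding="unicode")


fields = to_index_fields(SAMPLE)
print(to_solr_doc(fields))
```

Keeping the normalization in plain lookup tables makes it easy to add a new school or degree name without touching the parsing logic.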
Cleanup and Conversion
  Harvested:
    <subject>Thesis (M.S.) - Archeology</subject>
    <contributor>Foo Tech, School of Social Work</contributor>
    <date>Spring 2010</date>
    <date>2010-04</date>
  Indexed:
    <field name="degree">MS</field>
    <field name="level">masters</field>
    <field name="discipline">Archeology</field>
    <field name="discipline">Social Work</field>
    <field name="date">2010-04-01</field>
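Two of the cleanups on this slide, padding a partial date and pulling a degree out of a free-text subject, might look like the Python sketch below. The regular expressions, the `LEVELS` table, and the function names are assumptions for illustration; the slide does not say how OATD actually does this.

```python
import re

ISO_DATE = re.compile(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")
# Accepts a hyphen or an en dash between the degree and the discipline.
SUBJECT = re.compile(
    r"thesis\s*\((?P<degree>[^)]+)\)\s*[-\u2013]\s*(?P<discipline>.+)",
    re.IGNORECASE,
)
# Hypothetical mapping from degree abbreviation to degree level.
LEVELS = {"MS": "masters", "MA": "masters", "PHD": "doctoral"}


def normalize_date(values):
    """Return the first YYYY[-MM[-DD]] value, padded to a full date."""
    for value in values:
        match = ISO_DATE.fullmatch(value.strip())
        if match:
            year, month, day = match.groups()
            return f"{year}-{month or '01'}-{day or '01'}"
    return None  # nothing machine-readable, e.g. only "Spring 2010"


def parse_subject(subject):
    """Extract degree, level, and discipline from 'Thesis (M.S.) - Field'."""
    match = SUBJECT.fullmatch(subject.strip())
    if not match:
        return {}
    degree = match.group("degree").replace(".", "").upper()
    fields = {"degree": degree, "discipline": match.group("discipline").strip()}
    if degree in LEVELS:
        fields["level"] = LEVELS[degree]
    return fields


print(normalize_date(["Spring 2010", "2010-04"]))
print(parse_subject("Thesis (M.S.) - Archeology"))
```

Both functions return nothing useful when the input does not match, which is the case the later "Frequently Unanswered Questions" slides ask repositories to avoid by using consistent values.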
Solr
  Free, open source search engine
  Used by Netflix, StubHub, Instagram, Internet Archive, Zappos, Smithsonian...
  Also the VuFind and Blacklight library catalog interfaces
  Reindexes records without creating dups
Semi-Full Text Turbo Boost
  Covers ~75k mostly recent ETDs
  Very lightweight web crawler
    Pauses if it has hit a site within the last 2 minutes
    Should not re-pull the same PDF more than once a year
    Gets record IDs from OATD: does not run searches on your site
    Pulls OATD's URL and looks for first likely PDF link
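The two politeness rules above (a two-minute pause per site, and at most one pull of a given PDF per year) can be expressed as a small bookkeeping class. This is a hedged Python sketch; the class name, method names, and in-memory storage are assumptions, not a description of the actual OATD crawler.

```python
import time

SITE_PAUSE_SECONDS = 2 * 60                 # pause after hitting a site
PDF_REFETCH_SECONDS = 365 * 24 * 60 * 60    # roughly one year


class PolitenessTracker:
    """Remembers when each site and each PDF was last fetched."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so the rules can be tested
        self._last_site_hit = {}     # host -> timestamp
        self._last_pdf_fetch = {}    # PDF URL -> timestamp

    def may_fetch(self, host, pdf_url):
        now = self._clock()
        site_hit = self._last_site_hit.get(host)
        if site_hit is not None and now - site_hit < SITE_PAUSE_SECONDS:
            return False  # this site was hit within the last 2 minutes
        pdf_hit = self._last_pdf_fetch.get(pdf_url)
        if pdf_hit is not None and now - pdf_hit < PDF_REFETCH_SECONDS:
            return False  # this PDF was already pulled within the last year
        return True

    def record_fetch(self, host, pdf_url):
        now = self._clock()
        self._last_site_hit[host] = now
        self._last_pdf_fetch[pdf_url] = now


tracker = PolitenessTracker()
if tracker.may_fetch("archive.foo.edu", "http://archive.foo.edu/etd/1.pdf"):
    tracker.record_fetch("archive.foo.edu", "http://archive.foo.edu/etd/1.pdf")
```

A long-running crawler would persist these timestamps rather than hold them in memory, since the one-year rule has to survive restarts.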
Semi-Full Text Turbo Boost
  Grabs page images for "front matter" (first 7 pages)
  Indexes pages 8 to 40
    Used for search highlighting only
    Not searchable in the UI
  Extracts first 11 "real" images
    Not front matter
    Not too big, not too small
    No PDF "soft mask" images
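The page split described above can be written down as a small planning function. This Python sketch only computes which page numbers fall into each group; the function and key names are hypothetical, and the actual PDF rendering and text extraction are out of scope here.

```python
FRONT_MATTER_PAGES = 7    # pages captured as images
LAST_INDEXED_PAGE = 40    # text beyond this page is not indexed


def plan_pages(page_count):
    """Split a PDF's 1-based page numbers into the two groups on the slide."""
    front = range(1, min(page_count, FRONT_MATTER_PAGES) + 1)
    indexed = range(FRONT_MATTER_PAGES + 1, min(page_count, LAST_INDEXED_PAGE) + 1)
    return {
        "front_matter_images": list(front),
        "highlight_text": list(indexed),
    }


plan = plan_pages(120)
print(len(plan["front_matter_images"]), len(plan["highlight_text"]))
```

A short document simply yields a shorter (possibly empty) `highlight_text` list, so no special case is needed for PDFs under eight pages.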
Web Crawling for Non-OAI Sites
  Last Resort
  Depends on findable, parseable browse pages; parseable record pages; persistent URLs
  Very labor-intensive
  Prone to breaking whenever you tweak your site
  Requires re-crawling the entire site every time
  If you really can't do OAI, but have good metadata, call me.
Frequently Unanswered Questions (And How Your Metadata Can Help)
  Is this an ETD?
    Entire repository is ETDs
    Set ABCD is all of our ETDs
    dc:type=thesis, dc:type=dissertation
  Is it open access? How about Creative Commons?
    dc:rights=unrestricted
    dc:rights=licensed under a Creative Commons CC-BY-SA license...
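To show why these values matter to a harvester, here is a minimal Python sketch of the checks they make possible. The record layout (a dict of lists keyed by `sets`, `type`, and `rights`), the function names, and the rule that a Creative Commons statement implies open access are all assumptions for illustration.

```python
ETD_TYPES = {"thesis", "dissertation"}


def looks_like_etd(record, etd_sets=frozenset(), repository_is_all_etds=False):
    """Answer 'Is this an ETD?' using the three signals on the slide."""
    if repository_is_all_etds:
        return True
    if set(etd_sets) & set(record.get("sets", [])):
        return True
    return any(t.strip().lower() in ETD_TYPES for t in record.get("type", []))


def access_hints(record):
    """Answer 'Is it open access? How about Creative Commons?' from dc:rights."""
    rights = [r.strip().lower() for r in record.get("rights", [])]
    creative_commons = any("creative commons" in r for r in rights)
    return {
        # Assumption: a CC-licensed record is treated as open access.
        "open_access": creative_commons or "unrestricted" in rights,
        "creative_commons": creative_commons,
    }


record = {
    "sets": ["ABCD"],
    "type": ["Thesis"],
    "rights": ["Licensed under a Creative Commons CC-BY-SA license"],
}
print(looks_like_etd(record, etd_sets={"ABCD"}), access_hints(record))
```

With free-text or missing values in `dc:type` and `dc:rights`, both functions fall back to "no", which is how a record ends up excluded from an open access index.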
Frequently Unanswered Questions (And How Your Metadata Can Help)
  What school is this from?
    Use a good dc:publisher value
  What department or discipline is this from?
    Use ETD-MS or UKETD
    dc.contributor=[something very consistent]
Frequently Unanswered Questions (And How Your Metadata Can Help)
  Is this a doctoral dissertation or a master's thesis? What's the degree?
    Use ETD-MS or UKETD
    dc.subject=[something very consistent]
  What's the complete citation?
    Confirm that you export author, title, year, URL
Frequently Unanswered Questions (And How Your Metadata Can Help)
  What's the embargo situation?
    [Now] dc:rights=restricted
    [Later] dc:rights=unrestricted
    dc:date.accessioned=...
    dc:date.available=...
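A machine-readable availability date lets a harvester work out the embargo status on its own. The Python sketch below assumes that `dc:date.available` holds the date the embargo lifts, in the UTC timestamp form shown on the earlier metadata slide; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a timestamp such as 2014-09-16T09:15:00Z as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_embargoed(date_available, now=None):
    """True while the availability date is still in the future."""
    if now is None:
        now = datetime.now(timezone.utc)
    return parse_utc(date_available) > now


webinar_day = parse_utc("2013-10-15T00:00:00Z")
print(is_embargoed("2014-09-16T09:15:00Z", now=webinar_day))
```

A free-text statement in `dc:rights` cannot be evaluated this way, which is why a structured date is the more useful signal.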
Not Really That Helpful
  <publisher>Digitorium @ State!</publisher>
  <dc:rights>No access until 2010. Campus-only access until 2012.</dc:rights>
  <dc:title>Embargo Test #2</dc:title>
The Mad Libs Guide To Next Steps
  OATD is a harvested index of ETD records in repositories from around the world.
  [way cool name] is a harvested index of [discipline OR content type] records in repositories from [geographical region OR library consortium OR repository type].
Remember When Every Library Presentation Had an Obligatory Silo Slide?
  [Photo: "Three Silos," Theresa L. Wysocki, from Flickr]