Web Archiving at the Bavarian State Library

Web Archiving FAQs

1. Basic Information

During a pilot phase, the Bavarian State Library’s Munich Digitization Center (MDZ) tested the archiving of scientifically relevant web resources. These resources are made accessible, with considerable effort, through the Bavarian State Library’s DFG-funded subject gateways (b2i, Chronicon, Propylaeum, ViFaMusik, ViFaOst and ViFarom) and the Bavarian Regional Library Online (BLO), and they are recorded in the respective subject gateway databases. In January 2012 web archiving was transferred into routine operation.

From that point on, the Bavarian State Library also regularly collects, archives and provides access to the websites of authorities, departments and agencies of the State of Bavaria. According to the notice of the Bavarian State Government - Bavarian State Promulgation of 2 December 2008 (Az.: B II 2-480-30) - the Bavarian State Library, as the archival library of the Free State of Bavaria, is obliged to store official publications in electronic format for the long term and to provide access to them. The Bavarian State Library collects these websites twice a year by means of a harvesting process; no action is required on the part of the Bavarian authorities and agencies. This procedure complies with the standards of the German National Library.

Objectives and Definitions
A website is a virtual place on the World Wide Web that usually comprises several web pages or documents (files) and other resources, all accessible through an HTTP address. The Internet can be described as an active "publication system" that continually produces new and changing content, and in which much information is lost because it is replaced, moved or deleted. At the same time, the display formats and applications used to present content keep changing.

The objective of web archiving is therefore to collect (selected) websites, store them permanently and provide long-term access to them, and not least to prevent the loss of knowledge when scientifically relevant offerings on the World Wide Web are taken offline.

Web archiving distinguishes three basic approaches: "domain harvesting", which covers the entire content of a domain (e.g. *.de); "selective harvesting", which stores only selected websites; and "event harvesting", which archives Internet content relating to a particular event (e.g. the German federal election of 2009). The MDZ employs selective harvesting in the context of indexing Internet resources in the subject gateways.

What web archiving can achieve is limited by the rapid and constant change of content and display formats. For example, only specific captures of a website can ever be reproduced, e.g. archive copies of a web presence made once or twice a year. Changes made between captures cannot be reconstructed.

Crawlers also rarely capture the complete content of a website, because much of it is generated dynamically, for example by a database query ("Deep Web" or "hidden Web"), and is therefore not statically available. In addition, dynamic content such as JavaScript, Flash or YouTube videos cannot be harvested at present. External links are generally "cut off", because otherwise the archive would extend too far beyond the selected website. The BSB uses state-of-the-art technology for archiving digital objects. Simply storing the websites is not enough, however: given the rapid technological change on the Internet, additional technical effort will likely be needed in future to keep the archived websites usable.
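The "cutting off" of external links can be pictured as a scope check applied to every link a crawler finds. The following is only a minimal sketch in Python under stated assumptions: the function names `in_scope` and `filter_links` and the example URLs are hypothetical, and the rule used here (same host as the seed URL) is a simplification, not the actual scope configuration used by the BSB.

```python
from urllib.parse import urljoin, urlparse


def in_scope(seed_url: str, link: str) -> bool:
    """Return True if the link stays on the same host as the seed URL.

    Assumption: "external" simply means "different host name".
    """
    seed = urlparse(seed_url)
    # Resolve relative links against the seed before comparing hosts.
    target = urlparse(urljoin(seed_url, link))
    return target.scheme in ("http", "https") and target.hostname == seed.hostname


def filter_links(seed_url: str, links: list[str]) -> list[str]:
    """Keep only in-scope links, returned as absolute URLs."""
    return [urljoin(seed_url, link) for link in links if in_scope(seed_url, link)]


found = [
    "page2.html",                   # relative link, same site
    "https://other.example.com/x",  # external link, cut off
    "/about",                       # absolute path, same site
    "mailto:info@example.org",      # not a web page, cut off
]
kept = filter_links("https://www.example.org/start/", found)
```

In this sketch only the two links on `www.example.org` remain in `kept`; everything pointing elsewhere is dropped before it would be fetched.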

2. Authorisation Process and Authorisation Form

For legal reasons the BSB only harvests, archives and provides access to websites if explicit authorisation has been received, or if they must be stored for the long term and made accessible as electronic publications according to the notice of the Bavarian State Government - Bavarian State Promulgation of 2 December 2008 (Az.: B II 2-480-30). For this reason, if a website is not an official publication of an authority, department or agency of the State of Bavaria, an authorisation request is sent by email to the website operator as the first step of the web archiving workflow. Authorisation for harvesting and archiving can be given by email or by returning the completed authorisation form to the BSB. Because German law applies in this context, the authorisation forms are available only in German. An explanation in other common languages can, however, be obtained at any time. The harvesting and archiving process can be started only once the authorisation has been received in written form (email or authorisation form). Please use this authorisation form [PDF] to authorise the archiving of a website or other Internet presence.
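The rule described above reduces to a simple check before any harvest is started. The sketch below is illustrative only; the class `Website`, its field names and the function `may_harvest` are hypothetical and do not correspond to any actual BSB system or to the Web Curator Tool data model.

```python
from dataclasses import dataclass


@dataclass
class Website:
    url: str
    # True for official publications of Bavarian authorities, departments or agencies.
    is_official_bavarian_publication: bool = False
    # True once authorisation has arrived in written form (email or form).
    written_authorisation_received: bool = False


def may_harvest(site: Website) -> bool:
    """A site may be harvested if it is an official publication
    or if written authorisation has been received."""
    return site.is_official_bavarian_publication or site.written_authorisation_received


agency = Website("https://agency.example.org/", is_official_bavarian_publication=True)
pending = Website("https://journal.example.org/")
approved = Website("https://journal.example.org/", written_authorisation_received=True)
```

Here `may_harvest(agency)` and `may_harvest(approved)` are true, while `may_harvest(pending)` is false until the operator's written authorisation has been recorded.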

3. Technology and Workflows

The Bavarian State Library uses the Web Curator Tool software for the web archiving process. This open source software was developed by the British Library and the National Library of New Zealand and has been used successfully by other institutions for several years.

The Web Curator Tool offers an integrated archiving process that covers the request for authorisation, automated harvesting according to a schedule, quality control and archiving.

For automatic harvesting at regular intervals (the current plan is to harvest the selected websites twice a year), a "target" (one or more URLs) is created once and linked with the harvest authorisation, and the first harvest is started. From then on the website is harvested according to schedule and made available within the Web Curator Tool for quality control. Harvesting is performed by the crawler "Heritrix", which was developed specifically for web archiving by the Internet Archive.
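A twice-yearly schedule of this kind can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the scheduling logic of the Web Curator Tool: the helper names `add_months` and `harvest_schedule`, the six-month interval and the example start date are all chosen here for illustration.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def harvest_schedule(first: date, count: int, interval_months: int = 6) -> list[date]:
    """Return `count` harvest dates, starting with the first harvest.

    Assumption: "twice a year" is modelled as a fixed six-month interval.
    """
    return [add_months(first, i * interval_months) for i in range(count)]


# Hypothetical first harvest in mid-January 2012.
schedule = harvest_schedule(date(2012, 1, 15), 3)
```

With this start date the schedule contains 15 January 2012, 15 July 2012 and 15 January 2013. Each of these dates would correspond to one capture of the target.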

Presentation of Archived Websites
Access to a website archive is provided in two ways. First, it is offered directly in the subject gateways by means of a supplemental link, which is added to the record of the current website. Second, the archived websites are searchable in, and made available through, the BSB catalogue. The web archive is catalogued as an intellectual entity that combines the individual captures of the website into one unit. The user receives a link that, by means of the Wayback Machine, presents a chronological list of all captures, each of which can be browsed separately.
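How individual captures are combined into one chronological list can be sketched as follows. The example assumes that each capture is identified by a 14-digit timestamp of the form YYYYMMDDhhmmss, as is common in Wayback-style replay tools; the function name `chronological_captures` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def chronological_captures(timestamps: list[str]) -> list[str]:
    """Sort capture timestamps (YYYYMMDDhhmmss) and format them for display."""
    parsed = sorted(datetime.strptime(t, "%Y%m%d%H%M%S") for t in timestamps)
    return [p.strftime("%Y-%m-%d %H:%M") for p in parsed]


# Three hypothetical captures of one website, in arbitrary order.
captures = ["20130701090000", "20120115103000", "20120716081500"]
listing = chronological_captures(captures)
```

The resulting `listing` starts with the oldest capture and ends with the most recent one, which mirrors the list a user would browse for a single catalogued web archive.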

4. Process Model

[Image: process model of the web archiving workflow]

V. 4.2.1 en
