Appendix O: Annual Deduplication Process
Last Updated 02/06/2024
The annual deduplication process is a two-stage process. It begins with a cleanup of bib records that either do not have a format icon or have the wrong one applied. The process of identifying these records and applying the correct icon is called the Waves clean up. This clean up looks at all of the bibliographic records in the catalog.
After the Waves clean up process is complete, a deduplication script compares records by format type. By comparing data in the record, such as title, author, ISBN, and other identifiers, the script determines whether there are duplicate records with the same data. After duplicate records are identified, the records are scored for quality using a variety of criteria. The record with the highest score is marked as the lead bib record.
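The matching and scoring idea described above can be sketched in a few lines of Python. This is only an illustration, not the script Mobius runs: the dictionary layout, the `match_key` and `quality_score` helpers, and the use of a `field_count` value as the quality measure are all assumptions made for the example.

```python
from collections import defaultdict


def match_key(bib):
    """Build a comparison key; bibs with equal keys are duplicate candidates.

    The format is part of the key, so different formats never match.
    """
    return (
        bib["format"],
        bib["title"].strip().lower(),
        bib["author"].strip().lower(),
        bib["isbn"].replace("-", ""),
    )


def quality_score(bib):
    # Hypothetical quality measure: a fuller record scores higher.
    # The real process uses a variety of criteria not reproduced here.
    return bib["field_count"]


def find_duplicate_groups(bibs):
    """Group bibs by match key and pick the highest-scoring bib as lead."""
    groups = defaultdict(list)
    for bib in bibs:
        groups[match_key(bib)].append(bib)

    results = []
    for candidates in groups.values():
        if len(candidates) < 2:
            continue  # no duplicate for this key
        ranked = sorted(candidates, key=quality_score, reverse=True)
        results.append(
            {"lead": ranked[0]["id"], "subs": [b["id"] for b in ranked[1:]]}
        )
    return results
```

In this sketch, a regular print bib and a large print bib with the same title, author, and ISBN fall into different groups because the format is part of the key.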
Records are placed in a spreadsheet for examination by catalogers, who review the two bibs identified in each line item to make sure that they are for the same material (content and format) and to determine whether the “lead” bib identified is actually the better record. After the review is complete, the script is run: items attached to duplicate records are merged onto the lead record, and the subordinate records are deleted.
This presentation outlines the Waves clean up and deduplication process and describes the criteria used for the process.
A “Wave” is a format (e.g. electronic, audiobook, large print, video). The Waves clean up is the first step in the annual deduplication process. It is run by Mobius, which attempts to correct the format icons for records in the system based on the data in each record. The Waves clean up comes first because the deduplication process uses the format icons to make sure that different formats are not merged, such as regular and large print books, or physical books and audiobooks.
Please be aware that the Waves clean up does not convert all format icons. It can only convert icons to the following formats: Electronic, Audiobook, Video, Large Print, and Music.
Important note: The initial Waves clean up list Mobius provides is not precisely the list that will be affected when the actual process is run "for real." It is a test run, applying the algorithm on a copy of the bibliographic data at a point in time. When Mobius runs it for real on the production data, Mobius will produce a new and "final" spreadsheet of the bibs that were affected. In all likelihood a large percentage of the initial “test” spreadsheet will be included, but there will be new data and some items that will not get touched (because the data changed between the time of the “test” run and the time of the “real” run).
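One way to see how the “test” and “final” spreadsheets differ is to compare the bib IDs listed in each. The sketch below assumes the IDs have already been read out of the two spreadsheets into plain lists; the function name `compare_runs` and the result labels are made up for illustration.

```python
def compare_runs(test_ids, final_ids):
    """Compare the bib IDs from the test run with those from the real run."""
    test_ids, final_ids = set(test_ids), set(final_ids)
    return {
        # Bibs flagged in the test run and also changed in the real run.
        "in_both": sorted(test_ids & final_ids),
        # Bibs that only appeared once the process ran on production data.
        "new_in_final": sorted(final_ids - test_ids),
        # Bibs flagged in the test run but left untouched by the real run.
        "dropped_from_test": sorted(test_ids - final_ids),
    }
```

For example, `compare_runs([1, 2, 3], [2, 3, 4])` reports bibs 2 and 3 in both runs, bib 4 as new in the final run, and bib 1 as dropped.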
In the resulting spreadsheets for both the Waves clean up and the deduplication process, there are two categories present:
Once the initial Waves clean up has been run, members of the Cataloging Committee will review the Waves clean up spreadsheet and see if there is anything that jumps out to them as a problem in the “Auto” sheet/s.
The reviewers’ role at this stage is to review the "Auto" sheet/s and see if the "winning" format icon is correct. If not, it will then be important to figure out how best to tweak the criteria used to determine the winning format icon so that the correct format icon is selected. Basically, the Cataloging Committee is trying to confirm that the process is making good determinations on what the format icons should be.
Once the review of the Waves clean up has been completed and approved, then the Committee moves on to scrutinizing the actual deduplication process.
Once the Waves clean up process has finished, and the Cataloging Committee has reviewed and signed off on the results, the actual deduplication process can begin. This process compares a number of MARC fields between records to determine if they are a good match for merging or not.
From the LDR field, the following data elements are taken into consideration:
From the =008 field, the following data elements are taken into consideration:
From the =020 field, the following subfield is taken into consideration:
From the =100, =110, =111 fields, the following subfield, specifically the first occurrence, is taken into consideration:
From the =245 field, the following subfields are taken into consideration:
From the =264 _1 field, the following subfield is taken into consideration:
From the =700 field/s, the following subfield is taken into consideration:
In addition to the above fields and subfields, the deduplication process also checks to see if a bib record is an audio format or a video format, because those require different scoring.
After the format icons have been updated by the Waves clean up process, Mobius will also generate a list of records that would be merged if the deduplication process were to be run immediately. This information will be provided in a spreadsheet for the Cataloging Committee to review. Members of the Cataloging Committee will review the matched records in the “Auto” sheet and see if there is anything that jumps out to them as a problem.
The reviewers’ role at this stage is to review the "Auto" sheet and see if the records marked Lead Bib and Sub Bib are a good match that in fact should be merged on to the Lead Bib. If not, it will then be important to figure out how to differentiate the records as non-duplicates, so that the criteria used to consider two records as duplicates can be tweaked. Basically, the Cataloging Committee is trying to confirm that the process is making good determinations regarding what records should be matched and merged.
=020 (ISBN) is merged onto the final bib. Any unique =020 is “merged/melted” onto the winning bib, such that the final bib may have multiple =020 fields.
=035 (OCLC number) is merged onto the final bib. Any unique =035 is “merged/melted” onto the winning bib, such that the final bib may have multiple =035 fields.
=037 (source of acquisition) is merged onto the final bib. Any unique =037 is “merged/melted” onto the winning bib, such that the final bib may have multiple =037 fields.
=086 (government document classification number) is merged onto the final bib. Any unique =086 is “merged/melted” onto the winning bib, such that the final bib may have multiple =086 fields.
=856 (URL to electronic resource) is merged onto the final bib. Any unique =856$u, along with any accompanying $9, is "merged/melted" onto the winning bib, such that the final bib may have multiple =856 fields.
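The “merge/melt” behavior for these five fields can be illustrated with a small Python sketch. It is not the actual merge code. It assumes each bib is a dictionary mapping a MARC tag to a list of values, that =856 values are stored as `(url, subfield_9)` tuples, and that an =856 counts as unique when its $u is not already on the winning bib. The names `MELTED_TAGS` and `melt_fields` are hypothetical.

```python
# Tags whose unique values are carried over to the winning bib.
MELTED_TAGS = ("020", "035", "037", "086", "856")


def _uniqueness_key(tag, value):
    # Assumption: an 856 is identified by its $u (the URL); the $9 rides along.
    return value[0] if tag == "856" else value


def melt_fields(lead, sub):
    """Return a copy of the lead bib with unique melted fields from the sub bib."""
    merged = {tag: list(values) for tag, values in lead.items()}
    for tag in MELTED_TAGS:
        seen = {_uniqueness_key(tag, v) for v in merged.get(tag, [])}
        for value in sub.get(tag, []):
            key = _uniqueness_key(tag, value)
            if key not in seen:
                merged.setdefault(tag, []).append(value)
                seen.add(key)
    return merged
```

Fields outside `MELTED_TAGS` on the subordinate bib are ignored in this sketch, and the lead bib passed in is left unmodified.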
Any of the bibs listed here that have an OPAC Icon of:
will NOT be automatically merged, because they are too often false positives. They need to be handled by hand, so they are part of the "Needs Humans" results. The rest of the bibs are merged. The merging process is fairly involved internally: it takes into account all of the holds and metarecord issues, which are reconciled and merged as well.
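The split between automatically merged pairs and pairs set aside for manual handling can be pictured as a simple routing step. In the sketch below, the set of excluded OPAC icons is passed in by the caller rather than hard-coded, and the `route_pairs` function and the `icon` key are assumptions made for illustration.

```python
def route_pairs(pairs, manual_review_icons):
    """Split matched pairs into "Auto" and "Needs Humans" results.

    `manual_review_icons` is the set of OPAC icons excluded from
    automatic merging; it is supplied by the caller.
    """
    auto, needs_humans = [], []
    for pair in pairs:
        if pair["icon"] in manual_review_icons:
            needs_humans.append(pair)  # too many false positives to automate
        else:
            auto.append(pair)
    return {"Auto": auto, "Needs Humans": needs_humans}
```

Keeping the icon set as a parameter means the exclusion list can change between runs without touching the routing logic.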
Cataloging Committee (~2 weeks)
Mobius (~2 weeks; may take as little as a day depending on the number of bibs and the parameters used)
Cataloging Committee (~2 weeks)
Mobius (~1 month; may take as little as a day depending on the number of bibs and the parameters used)
Deleted bibs may be included in the deduplication process. How does this affect the merging of records, if at all?
The deduplication process can and may prefer a deleted bib over a non-deleted bib. If that is the decision it makes, it will do all of the things necessary for it to bring the deleted bib back to life and move everything to it. The resurrected bib will be searchable and sound in the catalog.
While the deduplication process is running, can catalogers still edit bibs, merge bibs, import bibs, etc.?
Yes. Unlike the quarterly authorities update, during which many cataloging functions must cease lest they be overwritten or undone by the update, catalogers can continue to perform their normal cataloging duties without fear of wasted time or lost work.