NC Cardinal Support and Staff Education

9.18. Appendix R: Annual Deduplication Process

Last Updated 04/03/2025


What is the Annual Deduplication Process?

*Last updated 02/06/2024

The annual deduplication process is a two-stage process.  It begins with a cleanup of bib records that either do not have a format icon or have the wrong one applied.  The process of identifying these records and applying the correct icon is called the Waves clean up.  This clean up looks at all of the bibliographic records in the catalog.

After the Waves clean up process is complete, a deduplication script compares records by format type.  By comparing data in the record, such as title, author, ISBN, and other identifiers, the script determines whether there are duplicate records with the same data.  After duplicate records are identified, the records are scored for quality using a variety of criteria.  The record with the highest score is marked as the lead bib record.

Records are placed in a spreadsheet for examination by catalogers, who review the two bibs identified in each line item to make sure that they are for the same material (content and format) and to determine whether the “lead” bib identified is actually the better record.  After the review is complete, the script is run and items for duplicate records are merged onto the lead record and the subordinate records are deleted.

Deduplication Process Outline

  1. Waves clean up for each format
  2. Voting system to identify “Automatic” versus “Needs Humans”
  3. Fingerprints matched
  4. Bibs scored for record quality
  5. Higher-scoring record becomes the lead bib
  6. Catalogers examine records and confirm choice of bibs to merge and lead bib
  7. Less complete records merged onto lead bib

Waves Clean Up and Deduplication Presentation (“Scrubaloging”)

This presentation outlines the Waves clean up and deduplication process and describes the criteria used for the process.

What Is the Waves Clean Up?

A “Wave” is a format (e.g. electronic, audiobook, large print, video).  The Waves clean up is the first step in the annual deduplication process.  It is run by Mobius, which attempts to correct the format icons for records in the system based on the data in each record.  The Waves clean up comes first because the format icons are used in the deduplication process to make sure that different formats are not merged, such as regular and large print books, or physical books and audiobooks.

Please be aware that the Waves clean up does not convert all format icons.  The formats that it can convert icons to are Electronic, Audiobook, Video, Large Print, and Music.

Important note:  The initial Waves clean up list Mobius provides is not precisely the list that will be affected when the actual process is run "for real."  It is a test run, applying the algorithm on a copy of the bibliographic data at a point in time.  When Mobius runs it for real on the production data, Mobius will produce a new and "final" spreadsheet of the bibs that were affected.  In all likelihood a large percentage of the initial “test” spreadsheet will be included, but there will be new data and some items that will not get touched (because the data changed between the time of the “test” run and the time of the “real” run).

Results Categories

In the resulting spreadsheets for both the Waves clean up and the deduplication process, there are two categories present:

  • Auto (or Converted)
    • For the Waves clean up, these are records that the process is confident enough about that they will be changed to the format described, unless we see a problem with the logic.
    • For the deduplication, these are records that the process is confident enough about that they will be merged, with one record selected as the lead bib and the other as the sub, unless we see a problem with the logic.
    • So, essentially, the “Auto” sheets show what the Waves clean up and deduplication process would do if we were to run them now as is.
    • “Auto” is listed as “Converted” once the actual Waves clean up has been performed and the format changes have been made.
  • Needs Humans
    • For the Waves clean up, these are records that the process has a guess about but is not sure enough to actually go ahead and make the change.
    • For the deduplication, these are records that the process has a guess about but is not sure enough to actually go ahead and merge.

Reviewing the Waves Clean Up "Auto" Sheet/s

Once the initial Waves clean up has been run, members of the Cataloging Committee will review the Waves clean up spreadsheet and see if there is anything that jumps out to them as a problem in the “Auto” sheet/s.

The reviewers’ role at this stage is to review the "Auto" sheet/s and see if the "winning" format icon is correct.  If not, it will then be important to figure out how best to tweak the criteria used to determine the winning format icon so that the correct format icon is selected.  Basically, the Cataloging Committee is trying to confirm that the process is making good determinations on what the format icons should be.

Once the review of the Waves clean up has been completed and approved, then the Committee moves on to scrutinizing the actual deduplication process.

Reviewing the Waves Clean Up "Needs Humans" Sheet/s

The “Needs Humans” lists can be worked through at your leisure.  Each record needs to be examined and evaluated for its format.  Once a record has been reviewed, the reviewer should add a =903 tag so that no one else spends time on it.

=903$a should contain your name, initials, library, or whatever identifier you choose for yourself.

=903$b is the action taken.

=903$c is the date. A strict date format works best.

=903$d is any extra information you want to include.

Here is an example of a =903 tag that the software created on a bib from the “Auto” list.  Note that in this software-generated tag the date appears in $b and the action in $c:

=903 \\$amobius-catalog-fix$b05-15-2021$cformatted$dL a r g e P r i n t

When adding the =903, it is crucial that you do not introduce any of the keywords that the software looks for; otherwise the bib will receive additional format votes.  The “Needs Humans” lists can take time to work through, and they do not have to be completed before the deduplication is performed.  Having them done, however, makes the deduplication process that much better.
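As a rough illustration of the reviewer tag described above, the following Python sketch builds a =903 line using the subfield meanings listed in this section ($a identifier, $b action, $c date, $d extra information).  The function name `build_903`, the MM-DD-YYYY date layout (borrowed from the software-generated example), and especially the keyword list are assumptions made for illustration; the actual keywords the software votes on are not listed here.

```python
from datetime import date

# ASSUMPTION: a stand-in keyword list. The real list of keywords that the
# Waves clean up software looks for is not given in this document.
ASSUMED_FORMAT_KEYWORDS = ("large print", "audiobook", "video", "electronic", "music")


def build_903(reviewer, action, when, note=""):
    """Return a =903 line with $a reviewer, $b action, $c date, $d note."""
    # Refuse text that contains one of the assumed format keywords, so the
    # tag itself does not add format votes to the bib.
    for value in (reviewer, action, note):
        lowered = value.lower()
        for keyword in ASSUMED_FORMAT_KEYWORDS:
            if keyword in lowered:
                raise ValueError(f"avoid format keyword {keyword!r} in =903 text")
    # A strict, fixed date layout (MM-DD-YYYY) keeps the tags consistent.
    line = f"=903 \\\\$a{reviewer}$b{action}$c{when.strftime('%m-%d-%Y')}"
    if note:
        line += f"$d{note}"
    return line


example_903 = build_903("jdoe-examplelib", "reviewed", date(2024, 2, 6), "icon confirmed")
print(example_903)
```

The reviewer name and library in the usage line are placeholders.  A note such as "set Large Print icon" would be rejected by this sketch because it contains one of the assumed keywords.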

What MARC Fields are Compared During the Deduplication Process?

Once the Waves clean up process has finished, and the Cataloging Committee has reviewed and signed off on the results, the actual deduplication process can begin.  This process compares a number of MARC fields between records to determine if they are a good match for merging or not.

From the LDR field, the following data elements are taken into consideration:

  • Type of record
  • Bibliographic level

From the =008 field, the following data elements are taken into consideration:

  • Form of item
  • Date 1

From the =020 field, the following subfield is taken into consideration:

  • $a (ISBN)

From the =100, =110, =111 fields, the following subfield, specifically the first occurrence, is taken into consideration:

  • $a (Personal name; Corporate name; Meeting name)

From the =245 field, the following subfields are taken into consideration:

  • $a (Title)
  • $b (Subtitle)
  • $h (Medium/GMD)
  • $n (Number of part/section of work)
  • $p (Name of part/section of work)

From the =264 _1 field, the following subfield is taken into consideration:

  • $c (Date of publication)

From the =700 field/s, the following subfield is taken into consideration:

  • $a (Added entry/Personal name)

In addition to the above fields and subfields, the deduplication process also checks to see if a bib record is an audio format or a video format, because those require different scoring.

How Candidate Bibs are Selected

  1. Candidate bibs are selected based upon Evergreen's fingerprint.
  2. Each candidate gets a new fingerprint built from the following:
    • Form of item
    • Date 1
    • Type of record
    • Bibliographic level
    • Title (=245$a)
    • Subtitle/Remainder of title (=245$b)
    • Medium/GMD (=245$h)
    • Number of part/section of work (=245$n)
    • Name of part/section of work (=245$p)
    • Author (=100$a, =110$a, =111$a)
    • Added entry/Personal name (=700$a)
    • Audio format
    • Video format
    • Date of publication
    • Normalized ISBNs
  3. All of the new fingerprints are recorded, and any set of bibs that share the exact same fingerprint is considered to be a set of duplicates.  This is the set of bibs that will be merged.  The winning bib is decided based upon a complex "score."  The bib with the highest score "wins" and the other bibs are merged onto that one.
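The matching steps above can be sketched in Python as follows.  This is only a minimal illustration, not Mobius's actual script: the dictionary keys, the `normalize` rules (lowercasing, collapsing whitespace, sorting multi-valued fields), and the precomputed `score` value are all assumptions, since the real normalization and scoring criteria are not detailed here.

```python
from collections import defaultdict

# Hypothetical key names standing in for the fingerprint elements listed above.
FINGERPRINT_KEYS = (
    "form_of_item", "date1", "record_type", "bib_level",
    "title", "subtitle", "medium", "part_number", "part_name",
    "author", "added_entries", "audio_format", "video_format",
    "pub_date", "isbns",
)


def normalize(value):
    """Assumed normalization: lowercase, collapse whitespace, sort lists."""
    if isinstance(value, (list, tuple, set)):
        return tuple(sorted(normalize(v) for v in value))
    return " ".join(str(value or "").lower().split())


def fingerprint(bib):
    """Build a hashable fingerprint from the listed data elements."""
    return tuple(normalize(bib.get(key)) for key in FINGERPRINT_KEYS)


def group_duplicates(bibs):
    """Return (lead_id, [sub_ids]) for every set of bibs sharing a fingerprint."""
    groups = defaultdict(list)
    for bib in bibs:
        groups[fingerprint(bib)].append(bib)
    merge_sets = []
    for members in groups.values():
        if len(members) > 1:
            # The highest-scoring bib "wins" and becomes the lead bib.
            ranked = sorted(members, key=lambda b: b["score"], reverse=True)
            merge_sets.append((ranked[0]["id"], [b["id"] for b in ranked[1:]]))
    return merge_sets


bibs = [
    {"id": 101, "title": "Dune", "author": "Herbert, Frank", "date1": "1965",
     "isbns": ["9780441013593"], "score": 40},
    {"id": 102, "title": "DUNE ", "author": "herbert, frank", "date1": "1965",
     "isbns": ["9780441013593"], "score": 55},
    # Same work, but a different form of item ("d" = large print in the 008).
    {"id": 103, "title": "Dune", "author": "Herbert, Frank", "date1": "1965",
     "isbns": ["9780441013593"], "form_of_item": "d", "score": 70},
]
merge_sets = group_duplicates(bibs)
print(merge_sets)
```

In this made-up data, bibs 101 and 102 share a fingerprint and 102 leads because of its higher score, while bib 103 stays separate because its form of item differs.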

Reviewing the Deduplication Process “Auto” Sheet

After the format icons have been updated by the Waves clean up process, Mobius will also generate a list of records that would be merged if the deduplication process were to be run immediately.  This information will be provided in a spreadsheet for the Cataloging Committee to review.  Members of the Cataloging Committee will review the matched records in the “Auto” sheet and see if there is anything that jumps out to them as a problem.

The reviewers’ role at this stage is to review the "Auto" sheet and see if the records marked Lead Bib and Sub Bib are a good match, such that the Sub Bib should in fact be merged onto the Lead Bib.  If not, it will then be important to figure out how to differentiate the records as non-duplicates, so that the criteria used to consider two records as duplicates can be tweaked.  Basically, the Cataloging Committee is trying to confirm that the process is making good determinations regarding what records should be matched and merged.

What Happens When Two Bibs are Merged?

=020 (ISBN) is merged onto the final bib.  Any unique =020 is “merged/melted” onto the winning bib, such that the final bib may have multiple =020 fields.

=035 (OCLC number) is merged onto the final bib.  Any unique =035 is “merged/melted” onto the winning bib, such that the final bib may have multiple =035 fields.

=037 (source of acquisition) is merged onto the final bib.  Any unique =037 is “merged/melted” onto the winning bib, such that the final bib may have multiple =037 fields.

=086 (government document classification number) is merged onto the final bib.  Any unique =086 is “merged/melted” onto the winning bib, such that the final bib may have multiple =086 fields.

=856 (URL to electronic resource) is merged onto the final bib.  Any unique =856$u along with any accompanied $9 are "merged/melted" onto the winning bib, such that the final bib may have multiple =856 fields.
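The field handling above can be pictured with a short Python sketch.  It assumes a simplified record layout (a dictionary mapping each tag to a list of values, with each =856 held as a small dictionary of subfields) and a hypothetical `merge_fields` helper; it is not the actual merge code.  It also assumes that an =856 is "unique" when its $u is not already on the lead bib, which is one reading of the description above.

```python
# Tags whose unique values are carried over to the lead bib.
MERGED_TAGS = ("020", "035", "037", "086")


def merge_fields(lead, sub):
    """Return a copy of the lead bib with the sub bib's unique fields added."""
    merged = {tag: list(values) for tag, values in lead.items()}
    for tag in MERGED_TAGS:
        existing = merged.setdefault(tag, [])
        for value in sub.get(tag, []):
            if value not in existing:
                existing.append(value)
    # =856: uniqueness is judged on $u (assumption); any accompanying $9
    # travels with the URL it belongs to.
    urls = merged.setdefault("856", [])
    seen = {field["u"] for field in urls}
    for field in sub.get("856", []):
        if field["u"] not in seen:
            urls.append(dict(field))
            seen.add(field["u"])
    return merged


lead = {
    "245": ["Dune"],
    "020": ["9780441013593"],
    "035": ["(OCoLC)12345"],
    "856": [{"u": "https://example.org/a", "9": "LIB1"}],
}
sub = {
    "245": ["Dune."],
    "020": ["9780441013593", "0441013597"],
    "035": ["(OCoLC)67890"],
    "856": [{"u": "https://example.org/a", "9": "LIB2"},
            {"u": "https://example.org/b", "9": "LIB2"}],
}
merged = merge_fields(lead, sub)
print(merged["020"], merged["035"], merged["856"])
```

All identifiers and URLs in the example are invented.  Fields outside the listed tags, such as the =245 here, are left exactly as they are on the lead bib.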

What Is Not Merged?

Bibs that have any of the following OPAC icons will NOT be automatically merged:

  • Serial
  • DVD
  • VHS
  • Blu-Ray
  • Microform
  • Software

These formats too often produce false positives, so they need to be handled by hand and are part of the "Needs Humans" results.  The rest of the bibs are merged.  The merging process is fairly involved internally: it takes into account all of the holds and the metarecord issues, which are reconciled and merged as well.
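The icon exclusion can be expressed as a simple check, sketched below in Python.  The function name `route_pair` and the icon strings are assumptions for illustration, and the sketch models only this exclusion rule; the real process also weighs its own confidence in the match before placing a pair in the "Auto" results.

```python
# OPAC icons that are never merged automatically (from the list above).
MANUAL_REVIEW_ICONS = {"serial", "dvd", "vhs", "blu-ray", "microform", "software"}


def route_pair(lead_icon, sub_icon):
    """Send a matched pair to "Needs Humans" if either bib has an excluded icon."""
    icons = {lead_icon.strip().lower(), sub_icon.strip().lower()}
    return "Needs Humans" if icons & MANUAL_REVIEW_ICONS else "Auto"


print(route_pair("Book", "Book"))
print(route_pair("DVD", "DVD"))
```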

Approximate Timeline for Project

Cataloging Committee (~2 weeks)

  • Review Waves clean up Auto sheets

Mobius (~2 weeks; may take as little as a day depending on the number of bibs and the parameters used)

  • Run Waves clean up process to update format icons
  • Generate list of what would be deduped

Cataloging Committee (~2 weeks)

  • Review list of what would be merged (to be provided by Mobius)

Mobius (~1 month; may take as little as a day depending on the number of bibs and the parameters used)

  • Run deduplication process

FAQ

Deleted bibs may be included in the deduplication process.  How does this affect the merging of records, if at all?

The deduplication process can and may prefer a deleted bib over a non-deleted bib.  If that is the decision it makes, it will do all of the things necessary for it to bring the deleted bib back to life and move everything to it.  The resurrected bib will be searchable and sound in the catalog.

While the deduplication process is running, can catalogers still edit bibs, merge bibs, import bibs, etc.?

Yes.  Unlike the quarterly authorities update, during which many cataloging functions must cease lest they be overwritten or undone by the update, the deduplication process allows catalogers to continue performing their normal cataloging duties without fear of wasted time or lost work.



NC Cardinal is supported by the Institute of Museum and Library Services under the provisions of the federal Library Services and Technology Act (LSTA), as administered by the State Library of North Carolina, a division of the Department of Natural and Cultural Resources.