Author: Amit Verma

Decoding the Importance of Metadata in Digitization and Preservation of Content

Introduction

Digital media has come a long way over the past decade. The shift from a single screen to multiple screens and devices, and from the subscription-based model to OTT service providers, has been apparent. To keep up with demand, broadcasters are also broadening their distribution channels.

With audiences able to choose from a wide variety of video across platforms, at the time they prefer, broadcasters are leaving no stone unturned to digitize video content, including material dating back decades.

Broadcasters are now focused on aggregating and distributing highly targeted content that reaches narrow-interest audiences. As broadcasters develop and store digital content to use and reuse across devices and platforms, the value of good shareable content is increasing.

However, the problem lies elsewhere. An estimated 98% of archived media is not available for digital distribution.[1]

Why?

Migrating hours of media content from tape to digital storage is time-consuming. Though automated migration systems convert tapes to multiple digital formats simultaneously, tagging these files to make them searchable is a challenge.

Have you ever wondered how some videos top the results when you search on Google? With an average of 300 hours[2] of video uploaded to YouTube alone every minute, content producers and owners work hard to optimize their content for search results.

The solution

The key to ensuring that your content doesn’t get lost in the crowd is tagging it with relevant keywords. While search engines have evolved over the years, they are still not human and cannot read or watch your content. They need a hint (metadata) to understand the content and apply analytics to rank it. When filtering, a search engine considers the title, the description, and the tags, in that order. If you optimize these three, half the battle is won.

In this paper, we will explore:

  • What is metadata?
  • Types of metadata
  • Metadata Schema Models
  • The importance of metadata in content digitization
  • Optimizing metadata for content digitization

What is metadata?

Metadata is “data about data.”[3] It provides a detailed description of the data within an object: its title, date and time of creation, format, length, language, year of reference, a narration describing the object’s identity and purpose, and so on.

In long-term digital archiving, metadata also records the preservation techniques applied to the digital objects in the archive. Metadata does the following:

  • Helps in easy identification, location, and retrieval of information by the end-users
  • Provides information about quality aspects or issues of the created object along with its access privileges/rights
  • Ensures smooth data management
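
As a minimal sketch of what such a description can look like in practice, the Python example below models one asset's metadata as a record and serializes it to JSON so it can be stored alongside the content file. The class name, field names, and sample values are illustrative assumptions, not part of any standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AssetMetadata:
    """Metadata for one digitized asset (hypothetical field names)."""
    title: str
    created: str            # date and time of creation, ISO 8601
    file_format: str
    length_seconds: int
    language: str
    description: str = ""   # narration of the object's identity and purpose
    rights: str = "unspecified"
    keywords: list = field(default_factory=list)


def to_sidecar_json(record: AssetMetadata) -> str:
    """Serialize the record to JSON text that can sit next to the media file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Sample values are invented for illustration only.
example = AssetMetadata(
    title="Evening News Bulletin",
    created="1994-03-12T19:00:00",
    file_format="MXF",
    length_seconds=1800,
    language="en",
    description="Full evening bulletin digitized from broadcast tape.",
    keywords=["news", "bulletin", "1994"],
)
sidecar = to_sidecar_json(example)
```

Keeping the record separate from the media file is only one option; the same fields could equally be embedded in the file's container format.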

Types of metadata

Depending on the nature of data and usability in a real-world scenario, metadata can be categorized as:

  • Descriptive: Helps to identify, locate, and retrieve information related to an object through indexing and navigation to related links. It includes elements such as title, creator, identity, and description
  • Structural: Defines the complexity of an object along with the role of individual data files, ordering of pages to form a chapter, file names, and their organization, etc.
  • Administrative: Helps to manage resources in terms of their creation, methods, access rights, associated copyright, and the techniques required for preserving them
  • Rights: Defines access permissions and constraints, at different levels, over the stored objects and the information they contain
  • Preservation: Records the activities or methodology adopted in the archive for preserving digital data
  • Technical: Provides technical information embedded in the digital object (content files). It describes the attributes of the digital image (not the analog source of the picture), its capture process, and any transformations, and helps ensure that the image is rendered accurately
  • Provenance: Records an object’s origin and the changes made to it, such as changes to its resolution, format, or perspective
  • Tracking: Keeps track of the data at different stages of the workflow (data automation processes, digital capturing, transformation, processing filters and toolsets, enhancement, quality control and management, and data archival and deliverables)

For long-term digital preservation, two types of metadata play a crucial role:

  1. Packaging Metadata

Defines three kinds of information packages:

     • Submission Information Package (SIP) – Contains the information delivered to the archive by the content provider
     • Archival Information Package (AIP) – Contains the content information stored in the archive
     • Dissemination Information Package (DIP) – Contains the information delivered to the user on request

  2. Preservation Metadata

Records the processes that support the preservation of digital data
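
The relationship between the three packages and preservation metadata can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class, the function names `ingest` and `disseminate`, and the log messages are hypothetical, and a real archive would do far more at each step (validation, fixity checks, format migration).

```python
from dataclasses import dataclass, field


@dataclass
class InformationPackage:
    kind: str                      # "SIP", "AIP", or "DIP"
    content_id: str
    metadata: dict = field(default_factory=dict)
    preservation_log: list = field(default_factory=list)


def ingest(sip: InformationPackage) -> InformationPackage:
    """Turn a submitted package (SIP) into an archived package (AIP)."""
    if sip.kind != "SIP":
        raise ValueError("only a SIP can be ingested")
    return InformationPackage(
        kind="AIP",
        content_id=sip.content_id,
        metadata=dict(sip.metadata),
        # Preservation metadata: a record of what was done to the object.
        preservation_log=["ingested from SIP"],
    )


def disseminate(aip: InformationPackage, requested_by: str) -> InformationPackage:
    """Build the package delivered to a user (DIP) from an archived package."""
    if aip.kind != "AIP":
        raise ValueError("only an AIP can be disseminated")
    aip.preservation_log.append(f"disseminated to {requested_by}")
    return InformationPackage(
        kind="DIP",
        content_id=aip.content_id,
        metadata=dict(aip.metadata),
    )


sip = InformationPackage("SIP", "tape-0001", {"title": "Evening News Bulletin"})
aip = ingest(sip)
dip = disseminate(aip, "newsroom")
```

The preservation log stays with the archived copy, while the delivered copy carries only the descriptive metadata the user needs.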

Metadata Schema Models

According to ISO 23081[4], a schema is “a logical plan showing the relationships between metadata elements, normally through establishing rules for the use and management of metadata specifically as regards the semantics, the syntax and the optionality (obligation level) of values.”

The amount of metadata that needs to be stored for an object depends on its functional usage and significance. With a large amount of metadata already in existence, and more being published regularly for different purposes by different communities, metadata schema designers need experience with the Semantic Web when designing a schema.

For long-term preservation of data, a variety of metadata schema models have been developed, including the following:

  • MARC: Machine Readable Cataloguing
  • MARCXML: XML version of MARC 21
  • METS: Metadata Encoding & Transmission Standard
  • MODS: Metadata Object Description Schema
  • DCMI: Dublin Core Metadata Initiative
  • CDWA: Categories for the Description of Works of Art
  • CRM: CIDOC Conceptual Reference Model
  • MPEG-7: Multimedia Content Description Interface, from the Moving Picture Experts Group
  • EAD: Encoded Archival Description
  • RDF: Resource Description Framework
  • VRA CORE: Visual Resources Association
  • DDI: Data Documentation Initiative
  • MIX: Metadata for Images in XML Standard
  • IEEE LOM: IEEE Learning Object Metadata, a standard for the description of “learning objects”

The importance of metadata in content digitization

Metadata plays a key role in processing, managing, accessing, and preserving digital content, be it audio, video, or image collections. Metadata has the following key functionalities:

  • Search: To search on data associated with a file, such as author, date published, and keywords
  • Distribute: To determine when and where the content will be distributed
  • Access: To determine delivery of targeted content based upon preset rules matching metadata values
  • Retain: To determine which records to archive
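
To make the idea of preset rules matching metadata values concrete, here is a small Python sketch. The rule format (a field name mapped to a set of allowed values), the function names, and the sample assets are assumptions made for illustration.

```python
def matches_rules(metadata: dict, rules: dict) -> bool:
    """True if, for every rule, the asset's value is among the allowed values."""
    return all(metadata.get(name) in allowed for name, allowed in rules.items())


def select_for_delivery(assets: list, rules: dict) -> list:
    """Return the assets whose metadata satisfies every preset rule."""
    return [asset for asset in assets if matches_rules(asset, rules)]


# Invented sample catalogue.
assets = [
    {"title": "Evening News Bulletin", "language": "en", "rights": "cleared"},
    {"title": "Cricket Highlights", "language": "hi", "rights": "cleared"},
    {"title": "Archive Interview", "language": "en", "rights": "restricted"},
]

# Hypothetical rule set: deliver only English content with cleared rights.
english_cleared = {"language": {"en"}, "rights": {"cleared"}}
deliverable = select_for_delivery(assets, english_cleared)
```

The same matching function could drive the other decisions in the list, for example a retention rule keyed on a date or a content-type field.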

Optimizing metadata for content digitization

The importance of metadata lies in the fact that it makes content searchable, both online and offline. When filtering, a search engine considers the title, the description, and the tags, in that order. Some key points to remember when using metadata for content digitization are:

Optimize the title

Grab attention with a catchy and compelling title. To make a title search engine (and mobile) friendly, limit it to 120 characters and include your top keywords. Think about what the audience would relate to, and make the title informative and relevant.

Optimize the description

Include the keywords, and explain what the content is about. Put the most critical information in the first 22 words of your description, since that is what the search engine displays in the results list before a viewer clicks the ‘see more’ button.

Optimize the tags

A few things to keep in mind while tagging a digital asset are:

  1. Assign keywords that cover the 5 W’s – what, when, who, why, and where – to make it a well-captured asset
  2. Avoid grammatical errors while assigning keywords
  3. Avoid ambiguous words or words with multiple meanings
  4. Be consistent with abbreviations and acronyms
  5. Use at least 8 to 12 tags per asset
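
The numeric guidelines above (a 120-character title, the first 22 words of the description, and 8 to 12 tags) can be checked automatically before an asset is published. The Python sketch below is one possible check, not a definitive implementation: the function names are hypothetical, keyword matching is a plain case-insensitive substring test, and treating 12 as an upper limit on tags is an assumption.

```python
MAX_TITLE_CHARS = 120
DESCRIPTION_PREVIEW_WORDS = 22
MIN_TAGS, MAX_TAGS = 8, 12  # treating 12 as an upper limit is an assumption


def description_preview(description: str) -> str:
    """Return the first 22 words, the part shown before 'see more'."""
    return " ".join(description.split()[:DESCRIPTION_PREVIEW_WORDS])


def check_listing(title: str, description: str, tags: list, top_keywords: list) -> list:
    """Return a list of problems; an empty list means the listing passes."""
    problems = []
    keywords = [keyword.lower() for keyword in top_keywords]

    if len(title) > MAX_TITLE_CHARS:
        problems.append("title is longer than 120 characters")
    if not any(keyword in title.lower() for keyword in keywords):
        problems.append("title contains none of the top keywords")

    preview = description_preview(description).lower()
    if not any(keyword in preview for keyword in keywords):
        problems.append("first 22 words of the description contain no top keyword")

    # Count tags case-insensitively so "News" and "news" are not counted twice.
    unique_tags = {tag.strip().lower() for tag in tags if tag.strip()}
    if not MIN_TAGS <= len(unique_tags) <= MAX_TAGS:
        problems.append("number of distinct tags is outside the 8 to 12 range")

    return problems


# Invented sample listing.
issues = check_listing(
    title="Mumbai Floods 2005: News Archive Footage",
    description="News archive footage of the 2005 Mumbai floods, "
                "digitized from broadcast tape and tagged for search.",
    tags=["mumbai", "floods", "2005", "monsoon", "news",
          "archive", "rescue", "maharashtra"],
    top_keywords=["mumbai floods", "news archive"],
)
```

A check like this cannot judge whether a title is compelling or a tag is ambiguous; it only catches the mechanical slips, so editorial review is still needed.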

Conclusion

Metadata plays a crucial role in keeping track of content, from its inception through its processing to its accessibility. It provides a complete description of the purpose and functionality of the data, making it easier for end-users to locate and retrieve the data. It is therefore essential that all content carries embedded metadata.

[1] https://www.recode.net/2014/4/8/11625358/modernizing-the-entertainment-industry-supply-chain-in-the-age-of

[2] https://merchdope.com/youtube-stats/

[3] https://www.techopedia.com/definition/1938/metadata

[4] https://committee.iso.org/sites/tc46sc11/home/projects/published/iso-23081-metadata-for-records.html

Pros and Cons of Linear Tape Open (LTO) for Long-term Content Archiving

Originally developed in the late 1990s, LTO (Linear Tape Open) is a magnetic tape data storage technology. For almost two decades it has offered extensive storage for a variety of applications, including long-term archiving, data backup, high-capacity data transfer, and offline storage.

LTO technology has been upgraded substantially, with new features added across its generations (1 to 8), including write-once, read-many (WORM); data encryption; and partitioning, which enables the Linear Tape File System (LTFS). Successive generations have also improved storage capacity, data transfer rate (MB/s), digital encoding methods, and compression techniques.

An overview of the LTO generations is shown below:

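| LTO Type | Year of Introduction | Generation | Native Capacity | Compressed Capacity | Compression Ratio | Native Data Transfer Rate | Compressed Data Transfer Rate |
|----------|----------------------|------------|-----------------|---------------------|-------------------|---------------------------|-------------------------------|
| LTO-1    | 2000                 | 1          | 100 GB          | up to 200 GB        | 2:1               | 20 MB/s                   | 40 MB/s                       |
| LTO-2    | 2003                 | 2          | 200 GB          | 400 GB              | 2:1               | 40 MB/s                   | 80 MB/s                       |
| LTO-3    | Late 2004            | 3          | 400 GB          | 800 GB              | 2:1               | 80 MB/s                   | 160 MB/s                      |
| LTO-4    | 2007                 | 4          | 800 GB          | 1.6 TB              | 2:1               | 120 MB/s                  | 240 MB/s                      |
| LTO-5    | 2010                 | 5          | 1.5 TB          | 3 TB                | 2:1               | 140 MB/s                  | 280 MB/s                      |
| LTO-6    | 2012                 | 6          | 2.5 TB          | 6.25 TB             | 2.5:1             | 160 MB/s                  | 400 MB/s                      |
| LTO-7    | 2015                 | 7          | 6 TB            | 15 TB               | 2.5:1             | 300 MB/s                  | 700 MB/s                      |
| LTO-8    | 2017                 | 8          | 12 TB           | 30 TB               | 2.5:1             | 360 MB/s                  | 750 MB/s                      |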
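
The capacity columns in the overview follow directly from the compression ratio: compressed capacity is native capacity multiplied by the ratio. The Python sketch below checks that relationship for a few generations and also estimates how long it takes to fill a tape at the native transfer rate. It assumes decimal units (1 TB = 1,000,000 MB) and a drive that sustains its rated speed for the whole write, which real workloads rarely achieve.

```python
MB_PER_TB = 1_000_000  # decimal units, an assumption


def compressed_capacity(native: float, ratio: float) -> float:
    """Compressed capacity = native capacity x compression ratio (same unit)."""
    return native * ratio


def hours_to_fill(capacity_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to write a full tape at a constant transfer rate."""
    seconds = capacity_tb * MB_PER_TB / rate_mb_per_s
    return seconds / 3600


# (native TB, ratio, compressed TB) as listed in the overview.
generations = {
    "LTO-5": (1.5, 2.0, 3.0),
    "LTO-6": (2.5, 2.5, 6.25),
    "LTO-7": (6.0, 2.5, 15.0),
    "LTO-8": (12.0, 2.5, 30.0),
}

for name, (native_tb, ratio, listed_tb) in generations.items():
    assert compressed_capacity(native_tb, ratio) == listed_tb, name

# Filling an LTO-8 tape with 12 TB at 360 MB/s takes a little over 9 hours.
lto8_hours = hours_to_fill(12.0, 360.0)
```

The fill time is a useful planning figure: even at full speed, writing a single LTO-8 cartridge is an overnight job.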

Pros & Cons of data storage on LTO

Pros:

Storage Capacity & Costs

For industries that handle very large volumes of data, archiving on LTO is cheaper and more effective than storing on internal hard drives. LTO archiving has grown rapidly in sectors such as media, entertainment, data analytics, and science, where data flows continuously throughout operations. With the latest generation, LTO-8 (see the overview above), a single tape costing about $100 can store 12 TB of uncompressed data at a transfer rate of 360 MB/s, or 30 TB of compressed data at 750 MB/s.

Life Span/Durability

LTO cartridges offer a long lifespan, averaging about 30 years, along with dependable backup and recovery throughout their life cycle.

Data Mobility

Transferring large volumes of data over networks is expensive and time-consuming, and a link or interoperability failure can lead to data loss or corruption. There is also a risk of unauthorized access over the internet, which is a serious threat to the confidentiality of data.

LTO, on the other hand, provides an easy and practical means of exchanging data by physically moving tape from one location to another.

Technology Upgrade

LTO technology has advanced steadily over the years, with a new release every 2 to 3 years bringing greater storage capacity, higher data transfer rates, and improved data compression and encryption.

The LTO Program has laid out a product roadmap with new releases up to LTO-12, each delivering increased storage capacity and performance.

Disaster Recovery

Because backup data stored on LTO is kept offline, it is safe from viruses and malware, and the entire data set can be restored whenever required.

Cons:

Operational costs

The overall operational cost of tape-based archiving is comparatively high: the LTO drives needed to record data onto magnetic tape cost between $2,000 and $3,500, and enterprise versions can cost more.

Keeping up with the technology

LTO-1 was introduced in 2000 and LTO-6 followed 12 years later, so a new generation has arrived roughly every two to three years. Typically, data on LTO is migrated every second generation, since drives can read and write only 2 or 3 generations of tape. If we record on LTO-6 and leave the tape on the shelf for 60 years, there will almost certainly be no device available to read it, and with very high probability most of the data will be lost.
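
The migration rule described above can be expressed as a simple compatibility check. In the Python sketch below, the number of older generations a drive can read is a parameter with an assumed default of 2; actual support varies by drive generation and should be confirmed against the drive's specification. The function names are hypothetical.

```python
def can_read(drive_generation: int, tape_generation: int,
             generations_back: int = 2) -> bool:
    """True if the drive reads the tape.

    Assumes a drive reads its own generation plus `generations_back`
    older ones, and never a newer generation.
    """
    return 0 <= drive_generation - tape_generation <= generations_back


def needs_migration(tape_generation: int, current_generation: int,
                    generations_back: int = 2) -> bool:
    """True if tapes of this generation are unreadable on current drives."""
    return not can_read(current_generation, tape_generation, generations_back)


# Which generations would an LTO-8 drive read under the default assumption?
readable_on_lto8 = [gen for gen in range(1, 9) if can_read(8, gen)]
```

Running the check across an archive inventory whenever new drives are purchased shows which cartridges must be copied forward before the old drives are retired.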

Tapes are not random access like hard drives

An LTO drive records data along the length of the magnetic tape and must wind the tape to reach a given position. As a result, tape supports only sequential access, and this constraint of linear technology slows both storage and retrieval compared with random-access media.

The same constraint affects updates. Inserting new data, or modifying existing data, in the middle of a tape erases the data beyond the point of insertion or modification. To avoid deleting existing data, new data must be appended after the last written sector. This sometimes leads to data replication and reduces the effective use of the tape's storage space.
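
The effect of writing in the middle of a tape can be illustrated with a toy model. The Python class below is a deliberately simplified sketch, not a description of how LTO or LTFS is implemented: the tape is a list of blocks, writing at a position discards everything after it, and a safe update appends a new copy at the end.

```python
class TapeModel:
    """Toy model of a sequentially written tape (illustrative only)."""

    def __init__(self):
        self.blocks = []

    def append(self, data: str) -> int:
        """Write after the last written block and return its position."""
        self.blocks.append(data)
        return len(self.blocks) - 1

    def write_at(self, position: int, data: str) -> None:
        """Write at a position; everything from there onward is lost."""
        if not 0 <= position <= len(self.blocks):
            raise IndexError("position is beyond the written area")
        del self.blocks[position:]
        self.blocks.append(data)

    def safe_update(self, data: str) -> int:
        """Keep the old copy and append the new version at the end."""
        return self.append(data)


tape = TapeModel()
for block in ["clip-a v1", "clip-b v1", "clip-c v1"]:
    tape.append(block)

# Safe update: the old version stays on tape, so space is used twice.
tape.safe_update("clip-b v2")
after_safe_update = list(tape.blocks)

# Writing in the middle destroys everything after that point.
tape.write_at(1, "clip-b v3")
after_overwrite = list(tape.blocks)
```

The safe path is what causes the replication mentioned above: every revision of a file occupies fresh space until the tape is reformatted.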

Conclusion:

Whether LTO and LTFS are the best choice for storage depends on the amount of data to be archived and how frequently end users access it. There is no doubt that LTO is an ideal medium for the offline preservation and protection of data from completed projects. Its features make it well suited to long-term data retention and content archiving.

LTO tape backup, supported by an offsite tape vaulting service, is a consistent, durable, and cost-efficient option for long-term data archiving.

LTO tape is the better option for archiving large amounts of data over the long term, especially for industries that produce substantial amounts of data throughout their lifecycle, such as media, entertainment, surveys, medical records, court verdicts, and libraries.

In contrast, an in-house disk system, or even cloud storage, works efficiently for data that must be accessed frequently and with low latency. The ability to randomly access and modify existing data on disk also minimizes the chance of data replication.