UNSTRUCTURED DATA CORPUS

The emergence of open data goes hand in hand with the rapid growth of electronic collections through mass digitization.

Cultural institutions face two requirements. On the one hand, with their budgets decreasing, they need to rethink how they manage their funds. On the other hand, they need to add value to their data.

Named Entity Recognition (NER) partially addresses both requirements. Indeed, the traditional model of manual cataloging and indexing has been under serious pressure for several years.

With ever-shrinking budgets, institutions have to do more with less, and the trend towards semi-automatic cataloging is strong.

This trend is also supported by funding bodies, which encourage the enrichment of data by linking it to external sources of information.

This context has given prominence to the concepts of linked data and open data in the cultural world. Recent initiatives such as OpenGLAM and LODLAM illustrate how these developments are permeating the field of cultural heritage.

In both the United States and the European Union, digital libraries are adopting linked data principles. In France, the National Library of France has a similar project.

The enrichment and integration of heterogeneous collections can be facilitated by using dictionaries published according to linked data principles.
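To make the idea of dictionary-based enrichment concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method of any particular institution: the dictionary contents, the example.org identifiers, the function name link_entities, and the sample record are all hypothetical. It simply looks up dictionary terms in a free-text record and attaches an identifier to each match.

```python
import re

# Hypothetical dictionary: surface forms mapped to made-up identifiers.
# A real project would load these from a published vocabulary instead.
GAZETTEER = {
    "quebec": "http://example.org/place/quebec",
    "national library of france": "http://example.org/org/bnf",
}


def link_entities(text, gazetteer=GAZETTEER):
    """Return sorted (start, matched_text, uri) tuples for dictionary terms found in text."""
    matches = []
    for term, uri in gazetteer.items():
        # Whole-word, case-insensitive match of the dictionary term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.group(0), uri))
    return sorted(matches)


# Hypothetical catalog record with unstructured description text.
record = "Map of Quebec, held by the National Library of France."
links = link_entities(record)
for start, surface, uri in links:
    print(start, surface, uri)
```

A plain lookup like this does not disambiguate: it cannot tell the city of Quebec from the province, which is one reason a dictionary alone is usually not enough.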

An introductory overview of linked data and the semantic web is presented. It allows information professionals to get up to date on the challenges of a rapidly evolving field.

Following the same logic, it describes the technological building blocks of the semantic web and the new possibilities they offer to libraries.

Research Questions

Here we will try to answer a few questions. First, we will discuss the possibilities and limitations of NER and other extraction methods for enriching an unstructured data corpus.

A gold standard corpus (GSC) will be used to calculate the precision, recall, and F-score of the results returned by NER services. More systemic issues will also be addressed, such as the benefits of using a GSC and how to counteract its shortcomings.
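As a reminder of what these three measures mean, the sketch below compares a set of extracted entities against a manually annotated reference set (the gold standard). The function name evaluate and both entity sets are hypothetical and chosen only for illustration. Matching here is exact string equality, which is a simplifying assumption.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f_score) for two collections of entity strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of extracted entities that are in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard entities that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    # F-score: harmonic mean of precision and recall.
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical gold standard and hypothetical service output.
gold_entities = {"Quebec", "National Library of France", "OpenGLAM"}
extracted = {"Quebec", "OpenGLAM", "paleontology", "France"}

precision, recall, f_score = evaluate(extracted, gold_entities)
print(f"precision={precision:.2f} recall={recall:.2f} f_score={f_score:.2f}")
```

In this toy case "paleontology" counts as an error because it is not a named entity in the gold standard, even though it may be a useful term for describing the record.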

Indeed, terms like "paleontology" or "space exploration" enrich a corpus and are an undeniably interesting source of information, but a GSC does not consider them because they are not named entities.

Also, the GSC is, at first glance, concerned with the granularity and frequency of a term. It cannot distinguish between the omission of the term "Quebec" from a collection and the term's actual significance there. The city adds little value, but it contributes to a high recall score.

Next, we will give an overview of the creation and current evolution of NER, with special attention to its use in linked data and cultural heritage.

We will then present a case study and its methodology, followed by a contextualization of the results of our study.

The problem of multilingualism will also be addressed, at the level of both extraction and disambiguation. What effect does the language of a corpus have on an extraction algorithm?

At the same time, how can meaning still be assigned to the extracted entities, and how can their definition be found in the correct language?

Finally, we will consider the general risks of using NER in bulk, especially with regard to the language used.

Dr.Yaşam Ayavefe

© Copyright 2020 KNK GLOBAL MEDYA LTD. All rights to the text, images, and news on our site are reserved. They may not be used without permission or without citing the source.