The Relation Between the Composition of Corpora (Genre Balance and Representativeness) and Their Reliability in Compiling General Explanatory Dictionary

Authors

DOI:

https://doi.org/10.3986/16.1.07

Keywords:

Corpora, Dictionaries, Reference corpus, Representativeness, Balance, Meanings Proportion, Lexicology, Lexicography, Slovene

Abstract

This paper examines the genre composition of several Slovenian corpora used as sources for lexicographic analysis, especially in compiling dictionaries such as eSSKJ, the general explanatory dictionary. In particular, it considers the largest corpus, Gigafida 2.0 (divided into two sub-corpora: one of non-fiction and literary texts and one of journalistic texts), the Corpus of Slovenian School Texts, the Corpus of Scientific Texts of Contemporary Slovenian, and the KRES corpus. We argue that corpora with major discrepancies in the proportions of text genres, when used as lexicographic resources, do not reflect the proportions between meanings that originate in semantic extension processes. Thus Gigafida, one of the largest corpora available for Slovene (in both versions, 1.0 and 2.0, updated in 2019), can hardly be regarded as a reference source of data for a general explanatory dictionary: journalistic and web texts predominate in it, while non-fiction and literary texts together account for no more than 10%. We suggest that a corpus should be at least approximately balanced, which in turn would support its representativeness.


References

eSSKJ: Slovar slovenskega knjižnega jezika 2016–, www.fran.si (1. 1.-31. 5. 2024).

Gigafida 2.0: Korpus pisne standardne slovenščine. https://viri.cjvt.si/gigafida/ (subcorpora PUBL and STVL available within search options: https://www.clarin.si/noske/run.cgi/first_form?corpname=gfida20_dedup;align=)

Gigafida 2.0. Corpus Compilation: Specifications. https://www.cjvt.si/gigafida/wp-content/uploads/sites/10/2019/06/Gigafida2.0_specifikacije.pdf

KRES. http://www.korpus-kres.net/ (October 2024)

Korpus šolskih besedil slovenskega jezika (KŠBSJ). Internal materials.

Korpus znanstvenih besedil (KZB). https://www.clarin.si/ske/#dashboard?corpname=kzb10

British National Corpus. https://www.english-corpora.org/bnc/

Czech National Corpus. https://www.korpus.cz/kontext/query?corpname=syn2020 (October 2024); https://wiki.korpus.cz/doku.php/cnk:syn2020 (October 2024)

Polish National Corpus. https://nkjp.pl/poliqarp/

Russian National Corpus. https://ruscorpora.ru/stats

Slovak National Corpus. https://korpus.sk/en/corpora-and-databases/snc-corpora/publiclyavailable-snc-corpora/structure-of-the-corpus-prim-10-0/

Atkins, Sue, Clear, Jeremy, Ostler, Nicholas. 1992. Corpus Design Criteria. Literary and Linguistic Computing 7/1: 1–16.

Biber, Douglas. 1993. Representativeness in Corpus Design. Literary and Linguistic Computing 8/4: 243–257.

Centa Strahovnik, Mateja. 2023. Čustva, človekova odnosnost in doseganje dobrega življenja. Ljubljana: Teološka fakulteta.

Corpas Pastor, Gloria, Seghiri, Miriam. 2010. Size Matters: A Quantitative Approach to Corpus Representativeness. In: R. Rabadán, M. Fernández López, and T. Guzmán González (ed.). Lengua, traducción, recepción en honor de Julio César Santoyo. León: Universidad de León Área de Publicaciones: 111–145. http://hdl.handle.net/2436/622560

Gabrovšek, Dejan. 2023. Povedkov prilastek v slovenščini. Slavistična revija 71/2: 113–128. https://doi.org/10.57589/srl.v71i2.4108

Gorjanc, Vojko. 2005. Uvod v korpusno jezikoslovje. Domžale: Izolit.

Górski, Rafał L. 2008. Representativeness of a written part of a Polish general-reference corpus. Primary notes. In: B. Lewandowska-Tomaszczyk (ed.). Corpus Linguistics, Computer Tools, and Applications – State of the Art. Frankfurt am Main: Peter Lang. 119–123. http://nkjp.pl/settings/papers/representativeness_primary_notes.pdf

Górski, Rafał L., Łaziński, Marek. 2012. Reprezentatywność i zrównoważenie korpusu. In: A. Przepiórkowski, M. Bańko, R. L. Górski, B. Lewandowska-Tomaszczyk (ed.). Narodowy korpus języka polskiego. Warszawa: Wydawnictwo naukowe PWN. 25–36.

Gregorčič, Rok. 2023. Tehnološki razvoj v luči Habermasove etike diskurza. Bogoslovni vestnik 83/4: 911–922.

Jakobson, Roman. 1996. Lingvistični in drugi spisi. Ljubljana: Inštitut za humanistične študije.

Korošec, Tomo. 2005. Jezik in stil oglaševanja. Ljubljana: Fakulteta za družbene vede.

Kosem, Iztok, Čibej, Jaka, Dobrovoljc, Kaja, Kuzman, Taja, Ljubešić, Nikola. 2023. Spremljevalni korpus Trendi in avtomatska kategorizacija. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave 11/1: 161–188. https://doi.org/10.4312/slo2.0.2023.1.161-188

About KRES. http://www.korpus-kres.net/Support/About (November 2023).

Krek, Simon, Kilgarriff, Adam. 2006. Slovene Word Sketches. Proceedings of 5th Slovenian/First International Languages Technology Conference. Ljubljana. https://www.kilgarriff.co.uk/Publications/2006-KrekKilg-Ljub-SloveneWS.pdf

Krvina, Domen. 2018. Glagolski vid v sodobni slovenščini 1. Besedotvorje in pomen. Ljubljana: Založba ZRC. https://doi.org/10.3986/9789610500742

Krvina, Domen. 2022. The Growing Dictionary of the Slovenian Language (2014-) and Slovenian Neologisms: Study on Types of Data and Their Use. Slovenski jezik / Slovene Linguistic Studies 14: 117–151. https://doi.org/10.3986/sjsls.14.1.05

Ledinek, Nina, Jemec Tomazin, Mateja, Trojar, Mitja, Perdih, Andrej, Ježovnik, Janoš, Romih, Miro, Erjavec, Tomaž. 2022. Korpus šolskih besedil slovenskega jezika: zasnova in gradnja. Jezikoslovni zapiski 28/1: 123–137. https://doi.org/10.3986/JZ.28.1.07

Logar Berginc, Nataša, Grčar, Miha, Brakus, Marko, Erjavec, Tomaž, Arhar Holdt, Špela, Krek, Simon. 2012. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko, in Fakulteta za družbene vede.

Logar Berginc, Nataša, Gorjanc, Vojko, Arhar Holdt, Špela. 2023. Korpus Gigafida 2.0: Mnenje uporabnikov. Jezik in Slovstvo 68/2: 75–91.

Novak, France. 2004. Samostalniška večpomenskost v jeziku slovenskih protestantskih piscev 16. stoletja. Ljubljana: Založba ZRC.

Petric Žižić, Špela. 2020. Tipologija razlag v Šolskem slovarju slovenskega jezika. Slavistična revija 68/3: 391–409. https://srl.si/ojs/srl/article/view/3875

Petric Žižić, Špela (tran.). 2022. School Dictionary of the Slovenian Language on the Franček Web Portal. Slavistica Vilnensis, 67/2: 126–140. https://orcid.org/0000-0001-7451-4264

Rundell, Michael, Atkins, Sue. 2013. Criteria for the design of corpora for monolingual lexicography. In: R. H. Gouws, U. Heid, W. Schweickard, H. E. Wiegand (eds.). Dictionaries. An International Encyclopedia of Lexicography. Berlin/Boston: De Gruyter Mouton. 1336–1343.

Snoj, Jerica. 2004. Tipologija slovarske večpomenskosti slovenskih samostalnikov. Ljubljana: Založba ZRC. https://doi.org/10.3986/9616500309

Stefanowitsch, Anatol. 2020. Corpus linguistics: A guide to the methodology (Textbooks in Language Sciences 7). Berlin: Language Science Press.

Suhadolnik, Stane. 1963. Problemi slovenske leksikografije. Sodobnost 11/10: 926–934.

Suhadolnik, Stane, Janežič, Marija. 1962. Plasti in pogostnost leksike. Jezik in slovstvo 8/1–2: 45–49.

Svetina, Peter. 2009. Kaj naj beremo z otroki? In: Livija Knaflič, N. Bucik (ed.). Branje za znanje in branje za zabavo: priročnik za spodbujanje družinske pismenosti. Ljubljana: Andragoški center Slovenije. 67–69. https://arhiv.acs.si/publikacije/Branje_za_znaje_in_branje_za_zabavo-prirocnik.pdf

Vidovič Muha, Ada. 2013. Slovensko leksikalno pomenoslovje. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani.

Vodičar, Janez. 2023. Avtoriteta na področju vzgoje in verovanja v digitalni dobi. Bogoslovni vestnik 83/4: 1035–1047.

Published

2024-12-11

How to Cite

Krvina, D., & Petric Žižić, Š. (2024). The Relation Between the Composition of Corpora (Genre Balance and Representativeness) and Their Reliability in Compiling General Explanatory Dictionary. Slovenski Jezik / Slovene Linguistic Studies, 16. https://doi.org/10.3986/16.1.07