The Relation Between the Composition of Corpora (Genre Balance and Representativeness) and Their Reliability in Compiling General Explanatory Dictionary
DOI:
https://doi.org/10.3986/16.1.07
Keywords:
Corpora, Dictionaries, Reference corpus, Representativeness, Balance, Meanings Proportion, Lexicology, Lexicography, Slovene
Abstract
This paper examines the genre composition of several Slovenian corpora used as sources for lexicographic analysis, especially in compiling dictionaries such as eSSKJ, the general explanatory dictionary. The corpora considered are the largest one, Gigafida 2.0 (divided into two sub-corpora: one of non-fiction and literary texts and one of journalistic texts), the Corpus of Slovenian School Texts, the Corpus of Scientific Texts of Contemporary Slovenian, and the KRES corpus. We argue that corpora with a major discrepancy in the proportions of different text genres, when used as lexicographic resources, do not reflect the proportions between meanings that originate in semantic extension processes. Thus, one of the largest corpora available for Slovene, Gigafida (in both versions, 1.0 and 2.0, updated in 2019), can hardly be regarded as a reference source of data for a general explanatory dictionary: journalistic and web texts predominate in Gigafida, while the share of non-fiction and literary texts does not exceed 10% in total. We suggest that a corpus should be at least approximately balanced, which in turn could help ensure its representativeness.
References
eSSKJ: Slovar slovenskega knjižnega jezika 2016–, www.fran.si (1. 1.-31. 5. 2024).
Gigafida 2.0: Korpus pisne standardne slovenščine. https://viri.cjvt.si/gigafida/ (subcorpora PUBL and STVL available within search options: https://www.clarin.si/noske/run.cgi/first_form?corpname=gfida20_dedup;align=)
Gigafida 2.0. Corpus Compilation: Specifications. https://www.cjvt.si/gigafida/wp-content/uploads/sites/10/2019/06/Gigafida2.0_specifikacije.pdf
KRES. http://www.korpus-kres.net/ (October 2024)
Korpus šolskih besedil slovenskega jezika (KŠBSJ). Internal materials.
Korpus znanstvenih besedil (KZB). https://www.clarin.si/ske/#dashboard?corpname=kzb10
British National Corpus. https://www.english-corpora.org/bnc/
Czech National Corpus. https://www.korpus.cz/kontext/query?corpname=syn2020 (October 2024); https://wiki.korpus.cz/doku.php/cnk:syn2020 (October 2024)
Polish National Corpus. https://nkjp.pl/poliqarp/
Russian National Corpus. https://ruscorpora.ru/stats
Slovak National Corpus. https://korpus.sk/en/corpora-and-databases/snc-corpora/publiclyavailable-snc-corpora/structure-of-the-corpus-prim-10-0/
Atkins, Sue, Clear, Jeremy, Ostler, Nicholas. 1992. Corpus Design Criteria. Literary and Linguistic Computing 7/1: 1–16.
Biber, Douglas. 1993. Representativeness in Corpus Design. Literary and Linguistic Computing 8/4: 243–257.
Centa Strahovnik, Mateja. 2023. Čustva, človekova odnosnost in doseganje dobrega življenja. Ljubljana: Teološka fakulteta.
Corpas Pastor, Gloria, Seghiri, Miriam. 2010. Size Matters: A Quantitative Approach to Corpus Representativeness. In: R. Rabadán, M. Fernández López, and T. Guzmán González (ed.). Lengua, traducción, recepción en honor de Julio César Santoyo. León: Universidad de León Área de Publicaciones: 111–145. http://hdl.handle.net/2436/622560
Gabrovšek, Dejan. 2023. Povedkov prilastek v slovenščini. Slavistična revija 71/2: 113–128. https://doi.org/10.57589/srl.v71i2.4108
Gorjanc, Vojko. 2005. Uvod v korpusno jezikoslovje. Domžale: Izolit.
Górski, Rafał L. 2008. Representativeness of a written part of a Polish general-reference corpus. Primary notes. In: B. Lewandowska-Tomaszczyk (ed.). Corpus Linguistics, Computer Tools, and Applications – State of the Art, Frankfurt am Main: Peter Lang. 119–123. http://nkjp.pl/settings/papers/representativeness_primary_notes.pdf
Górski, Rafał L., Łaziński, Marek. 2012. Reprezentatywność i zrównoważenie korpusu. In: A. Przepiórkowski, M. Bańko, R. L. Górski, B. Lewandowska-Tomaszczyk (ed.). Narodowy korpus języka polskiego. Warszawa: Wydawnictwo naukowe PWN. 25–36.
Gregorčič, Rok. 2023. Tehnološki razvoj v luči Habermasove etike diskurza. Bogoslovni vestnik 83/4: 911–922.
Jakobson, Roman. 1996. Lingvistični in drugi spisi. Ljubljana: Inštitut za humanistične študije.
Korošec, Tomo. 2005. Jezik in stil oglaševanja. Ljubljana: Fakulteta za družbene vede.
Kosem, Iztok, Čibej, Jaka, Dobrovoljc, Kaja, Kuzman, Taja, Ljubešić, Nikola. 2023. Spremljevalni korpus Trendi in avtomatska kategorizacija. Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave 11/1: 161–188. https://doi.org/10.4312/slo2.0.2023.1.161-188
About KRES. http://www.korpus-kres.net/Support/About (November 2023).
Krek, Simon, Kilgarriff, Adam. 2006. Slovene Word Sketches. Proceedings of 5th Slovenian/First International Languages Technology Conference. Ljubljana. https://www.kilgarriff.co.uk/Publications/2006-KrekKilg-Ljub-SloveneWS.pdf
Krvina, Domen. 2018. Glagolski vid v sodobni slovenščini 1. Besedotvorje in pomen. Ljubljana: Založba ZRC. https://doi.org/10.3986/9789610500742
Krvina, Domen. 2022. The Growing Dictionary of the Slovenian Language (2014-) and Slovenian Neologisms: Study on Types of Data and Their Use. Slovenski jezik / Slovene Linguistic Studies 14: 117–151. https://doi.org/10.3986/sjsls.14.1.05
Ledinek, Nina, Jemec Tomazin, Mateja, Trojar, Mitja, Perdih, Andrej, Ježovnik, Janoš, Romih, Miro, Erjavec, Tomaž. 2022. Korpus šolskih besedil slovenskega jezika: zasnova in gradnja. Jezikoslovni zapiski 28/1: 123–137. https://doi.org/10.3986/JZ.28.1.07
Logar Berginc, Nataša, Grčar, Miha, Brakus, Marko, Erjavec, Tomaž, Arhar Holdt, Špela, Krek, Simon. 2012. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko, in Fakulteta za družbene vede.
Logar Berginc, Nataša, Gorjanc, Vojko, Arhar Holdt, Špela. 2023. Korpus Gigafida 2.0: Mnenje uporabnikov. Jezik in Slovstvo 68/2: 75–91.
Novak, France. 2004. Samostalniška večpomenskost v jeziku slovenskih protestantskih piscev 16. stoletja. Ljubljana: Založba ZRC.
Petric Žižić, Špela. 2020. Tipologija razlag v Šolskem slovarju slovenskega jezika. Slavistična revija 68/3: 391–409. https://srl.si/ojs/srl/article/view/3875
Petric Žižić, Špela (tran.). 2022. School Dictionary of the Slovenian Language on the Franček Web Portal. Slavistica Vilnensis, 67/2: 126–140. https://orcid.org/0000-0001-7451-4264
Rundell, Michael, Atkins, Sue. 2013. Criteria for the design of corpora for monolingual lexicography. In: R. H. Gouws, U. Heid, W. Schweickard, H. E. Wiegand (eds.). Dictionaries. An International Encyclopedia of Lexicography. Berlin/Boston: De Gruyter Mouton. 1336–1343.
Snoj, Jerica. 2004. Tipologija slovarske večpomenskosti slovenskih samostalnikov. Ljubljana: Založba ZRC. https://doi.org/10.3986/9616500309
Stefanowitsch, Anatol. 2020. Corpus linguistics: A guide to the methodology (Textbooks in Language Sciences 7). Berlin: Language Science Press.
Suhadolnik, Stane. 1963. Problemi slovenske leksikografije. Sodobnost 11/10: 926–934.
Suhadolnik, Stane, Janežič, Marija. 1962. Plasti in pogostnost leksike. Jezik in slovstvo 8/1–2: 45–49.
Svetina, Peter. 2009. Kaj naj beremo z otroki? In: Livija Knaflič, N. Bucik (ed.). Branje za znanje in branje za zabavo: priročnik za spodbujanje družinske pismenosti. Ljubljana: Andragoški center Slovenije. 67–69. https://arhiv.acs.si/publikacije/Branje_za_znaje_in_branje_za_zabavo-prirocnik.pdf
Vidovič Muha, Ada. 2013. Slovensko leksikalno pomenoslovje. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani.
Vodičar, Janez. 2023. Avtoriteta na področju vzgoje in verovanja v digitalni dobi. Bogoslovni vestnik 83/4: 1035–1047.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors guarantee that the work is their own original creation and does not infringe any statutory or common-law copyright or any proprietary right of any third party. In case of claims by third parties, authors commit themselves to defending the interests of the publisher and shall cover any potential costs.