Design and Construction of the Corpus of Scientific Texts of Contemporary Slovenian
DOI: https://doi.org/10.3986/JZ.31.2.06
Keywords: corpus of scientific texts, specialized corpus, corpus annotation, CoNLL-U
Abstract
This paper presents the Corpus of Scientific Texts of Contemporary Slovenian, a specialized written corpus of Slovenian comprising 33,604,256 tokens from 884 scientific and expert texts, primarily in the fields of social sciences and the humanities, published mainly between 2000 and 2023. We focus on describing the text-type composition of the corpus, the technical procedures used in the preprocessing of corpus texts, corpus annotation, text encoding formats and corpus accessibility. We also discuss the rationale for constructing the corpus and its practical applications, aiming to outline the specific characteristics and advantages of the Corpus of Scientific Texts of Contemporary Slovenian in comparison with other Slovenian corpora that include specialized texts.
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors guarantee that the work is their own original creation and does not infringe any statutory or common-law copyright or any proprietary right of any third party. In case of claims by third parties, authors commit themselves to defending the interests of the publisher and shall cover any potential costs.
