Czech National Corpus (CNC)
Identification
Hosting Legal Entity
Self-standing RI
Legal Status
University or higher education institution
Location
Panská 890/7, Institute of the Czech National Corpus, Prague, PO: 11000 (Czech Republic)
Structure
Type Of RI
Virtual
Coordinating Country
Czech Republic
Status
Current Status:
Operational since 2000
Scientific Description
The CNC continuously maps the Czech language by building large general-purpose language corpora and providing access to them. Its linguistic data cover a wide range of genres and language varieties, including written, spoken and diachronic Czech. In addition, the InterCorp parallel corpus contains original and translated texts in Czech and more than 30 other languages. The CNC corpora constitute a unique resource of authentic language data for basic and applied linguistic research as well as for other domains of the social sciences and humanities. They are widely used thanks to their continuously growing size, varied and well-defined composition, reliable metadata and high-quality data processing with state-of-the-art tools.

The CNC provides intuitive access to its corpora through efficient, specialized web-based applications and user support available at the CNC research portal www.korpus.cz, which also includes a User Forum (with Q&A, bug reporting, etc.) and a corpus linguistics Wiki. It also provides data packages tailored to specific users’ needs. The CNC is the only research infrastructure in the Czech Republic that systematically develops the methodology of corpus linguistics.

Despite its national character, the CNC is widely used by international users, and the exceptional range of its corpora attracts collaborative corpus-based research in contrastive language study, which requires comparable data in different languages. The CNC cooperates closely with the research infrastructure LINDAT/CLARIN, the Czech national node of the pan-European research infrastructure CLARIN ERIC.

RI Keywords
Corpus, Language data, Czech, Corpus linguistics, Language, Written language corpora, Parallel corpus, Corpora, Language resources, Spoken language corpora, Language and Communication, Representative corpus
Classifications
RI Category
Data Archives, Data Repositories and Collections
Scientific Domain
Information Science and Technology
Social Sciences
Humanities and Arts
ESFRI Domain
Social and Cultural Innovation
Services
Provision of language data

A language corpus is a large set of authentic texts (written or spoken) converted into electronic form in a uniform format so that it can be easily searched for language phenomena, especially words and phrases. It is built to serve as a record of the language and, as far as possible, as an objective empirical model of it. Corpora are naturally a source of data for linguistic research, but today they are also used in other fields that treat texts as sources of knowledge (reality, history, sociology, psychology, etc.).

Equipment
Servers

Regular Linux servers.

Access
Access Type
Remote
Access Mode
Wide
Access Webpage
Users
Number of Users
Number Year
6,481 2016
Users Definition
Individuals
User Demographics
National Users - 79.0% in 2016
Extra-European Users - 1.0% in 2016
European Users - 20.0% in 2016
Type of Users
Academic - 84.0%
General public - 16.0%
Date of last update: 02/04/2019