Czech National Corpus (CNC)
Hosting Legal Entity
Self-standing RI
Legal Status
University or higher education institution
Panská 890/7, Institute of the Czech National Corpus, Prague, PO: 11000 (Czech Republic)
Type Of RI
Coordinating Country
Czech Republic
Current Status
Operational since 2000
Scientific Description
The CNC continuously maps the Czech language by building large general-purpose language corpora and providing access to them. Its linguistic data covers a wide range of genres and language varieties, including written, spoken and diachronic Czech. In addition, the InterCorp parallel corpus contains original and translated texts in Czech and more than 30 other languages. The CNC corpora constitute a unique resource of authentic language information for both basic and applied linguistic research as well as for other domains of the social sciences and humanities. They are widely used thanks to their continuously growing size, varied and well-defined composition, reliable metadata and high-quality data processing with state-of-the-art tools. The CNC provides intuitive access to its corpora through efficient, specialized web-based applications and user support available on the CNC research portal, which also includes a User Forum (with Q&A, bug reporting, etc.) and a corpus linguistics Wiki. The CNC is the only research infrastructure in the Czech Republic that focuses systematically on developing the methodology of corpus linguistics. It also provides data packages tailored to users’ specific needs. Despite its national character, the CNC is widely used internationally, and the exceptional range of its corpora attracts collaborative corpus-based research in contrastive language study, which requires comparable data in different languages. The CNC cooperates closely with the research infrastructure LINDAT/CLARIN, the Czech national node of the pan-European research infrastructure CLARIN ERIC.

RI Keywords
Language, Corpus, Language resources, Corpus linguistics, Language data, Spoken language corpora, Parallel corpus, Czech, Corpora, Representative corpus, Language and Communication, Written language corpora
RI Category
Data Archives, Data Repositories and Collections
Scientific Domain
Humanities and Arts
Information Science and Technology
Social Sciences
ESFRI Domain
Social and Cultural Innovation
Provision of language data

A corpus is a large set of authentic texts (written or spoken) converted into electronic form in a uniform format so that it can be easily searched for language phenomena, especially words and phrases. It is built to serve as a record of language use and, as far as possible, as an objective empirical model of the language. Corpora are naturally a source of data for linguistic research, but today they are also used in other fields that treat texts as sources of knowledge (reality, history, sociology, psychology, etc.).


Regular Linux servers.

Access Type
Access Mode
Access Webpage
Number of Users
6,481 in 2016
Users Definition
User Demographics
European Users - 20.0% in 2016
National Users - 79.0% in 2016
Extra-European Users - 1.0% in 2016
Type of Users
Academic - 84.0%
General public - 16.0%
Date of last update: 02/04/2019