Building a National Corpus of Turkish. Design & Implementation

Information on Turkish National Corpus Project
Design of Turkish National Corpus
Construction of an Electronic Database
Selection and sampling procedures of texts
Computerizing data
Annotation
Parts of Speech Tagging
Developing parts of speech software
Developing the corpus interface
Interface features
Releases of the corpus

Information on Turkish National Corpus Project
Funded by the Scientific and Technological Research Council of Turkey for a period of three years (2008-2011)
Size: 50 million words
Time period: 1990-2008
Annotation: Parts of speech tagged
Medium: 95% writing, 5% speech
User-friendly web-based interface
Following the BNC model

Turkish National Corpus will be
a general corpus: it will not be restricted to any particular subject field, register or genre
a mixed corpus: it will contain examples of both written and spoken language
a monolingual corpus: it will consist of text samples produced by speakers of Turkish
a sample corpus: it will be composed of text samples generally no longer than 45,000 words
a synchronic corpus: it will include imaginative and informative texts from 1990

Time / Workflow

Corpus Representativeness
Balance: The range of genres included in a corpus
  Intended uses of a corpus
  Covering a wide range of text categories which are representative of a language
  Sampling text categories proportionally
  No scientific measure for corpus balance
  Adopting an existing corpus model when building your corpus
Sampling: How text chunks for each genre are selected
View of a corpus: Static or dynamic language model

Construction of an Electronic Database I
Selection Features
Domain: Subject matter
Written data selection
Written texts included in the database:
Imaginative: Fiction
Informative: Social science, art, commerce-finance, belief-thought, world affairs, applied science, natural and pure science, leisure
Time: 1990-2008
Medium of text: Books, periodicals, miscellaneous (published-unpublished), written to be spoken

Construction of an Electronic Database II
Selection Features
Sampling
Sampling population for a general-purpose corpus: Sampling from produced and received language population
Sampling frame: Information on published material
Catalogues of books, books in print lists, best seller lists, particularly prize winners of competitions (Yunus Nadi, Orhan Kemal Roman, Haldun Taner short story prizes), library lending statistics, lists of current magazines and periodicals, periodical circulation figures

Construction of an Electronic Database III
Selection Features
Sampling procedure
For books: a target sample size of 45,000 words will be chosen.
A convenient break point will be chosen for text samples.
Samples will be taken randomly from the beginning, middle or end of longer texts.
Individual texts contained in newspapers and magazines will be separated and classified on the basis of the selection and classification features.
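The sampling procedure for books can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `take_sample` and `extend_to_break` are hypothetical, the text is assumed to be already split into word tokens, and a "convenient break point" is approximated as the next token that ends a sentence. The start of a middle or end sample is not aligned to a sentence boundary here.

```python
import random

SENTENCE_END = (".", "!", "?")

def extend_to_break(words, end):
    """Move `end` forward until the sample closes on a sentence-final token."""
    while end < len(words) and not words[end - 1].endswith(SENTENCE_END):
        end += 1
    return end

def take_sample(words, target=45_000, rng=None):
    """Return (position, sample) for one text.

    Texts no longer than `target` words are kept whole; longer texts are
    sampled at random from the beginning, middle or end.
    """
    rng = rng or random.Random()
    if len(words) <= target:
        return "whole", list(words)
    position = rng.choice(["beginning", "middle", "end"])
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - target) // 2
    else:
        start = len(words) - target
    end = extend_to_break(words, start + target)
    return position, words[start:end]
```

Passing a seeded `random.Random` makes the choice of position reproducible, which helps when the same sampling frame has to be re-run.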

Construction of an Electronic Database V
Selection Features
Designated data sources
Books from bestseller lists, lists of prize winners of competitions, and books in print lists: Publication date: 2000-2007
Periodicals
  National newspapers: Radikal, Milliyet, Türkiye, Cumhuriyet, Zaman
  Local newspapers: Bugün Mersin, Bolu Detay
  Magazines: Birikim, Sızıntı, Aksiyon, Aktüel, Telapati
Unpublished written material
  Student essays (Primary, secondary, and higher education)

Construction of an Electronic Database VI
Metatextual information: Book header
Title: Excerpt from Benim Adım Kırmızı. Sample containing about 38,764 words
Spoken or written: Written
Number of words: 38,764
Derived text type: Fiction
Text type: Fiction: Prose
Publication date: 1998
Age of author: 46
Sex of author: Male
Type of author: Sole
Age of audience: Adult

Construction of an Electronic Database VII
Text information: Book header
Text domain: Imaginative: Fiction
Medium of text: Book
Text sample: Middle sample
Name of author: Orhan Pamuk
Name of text: Benim Adım Kırmızı
Target audience sex: Mixed
Publisher: İletişim
Place of publication: İstanbul
Key words: History, crime and adventure
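The book header fields on these two slides map naturally onto a simple record type. The sketch below is one possible representation, not the project's actual storage format: the class name `BookHeader` and the snake_case field names are assumptions, while the values are the ones shown in the header above.

```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class BookHeader:
    # Metatextual information
    title: str
    spoken_or_written: str
    number_of_words: int
    derived_text_type: str
    text_type: str
    publication_date: int
    age_of_author: int
    sex_of_author: str
    type_of_author: str
    age_of_audience: str
    # Text information
    text_domain: str
    medium_of_text: str
    text_sample: str
    name_of_author: str
    name_of_text: str
    target_audience_sex: str
    publisher: str
    place_of_publication: str
    key_words: List[str]

header = BookHeader(
    title="Excerpt from Benim Adım Kırmızı",
    spoken_or_written="Written",
    number_of_words=38764,
    derived_text_type="Fiction",
    text_type="Fiction: Prose",
    publication_date=1998,
    age_of_author=46,
    sex_of_author="Male",
    type_of_author="Sole",
    age_of_audience="Adult",
    text_domain="Imaginative: Fiction",
    medium_of_text="Book",
    text_sample="Middle sample",
    name_of_author="Orhan Pamuk",
    name_of_text="Benim Adım Kırmızı",
    target_audience_sex="Mixed",
    publisher="İletişim",
    place_of_publication="İstanbul",
    key_words=["History", "crime and adventure"],
)

# asdict() gives a plain dictionary that can be filtered on or serialized.
record = asdict(header)
```

Keeping the header as structured fields rather than free text is what later allows a search to be constrained by domain, medium or publication date.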

Construction of an Electronic Database VIII
Newspaper header
Title: Art news from Radikal. Sample containing about 25,184 words
Spoken or written: Written
Number of words: 25,184
Derived text type: Newspaper National
Text type: Written News
Publication date: 2005
Age of author: Unknown
Sex of author: Mixed
Type of author: Multiple
Age of audience: Adult

Construction of an Electronic Database IX
Newspaper header
Text domain: Informative: Arts
Medium of text: Periodical
Text sample: Composite
Target audience sex: Mixed
Publisher: Radikal
Place of publication: İstanbul

Construction of an Electronic Database X
Selection Features
Spoken data selection procedure
Spoken data included in the database: Demographic part of spoken database
Transcriptions of spontaneous natural conversations: Equal numbers of men and women recruits will use a personal stereo to record all their conversations over two to seven days, and will log details of each conversation in a notebook.
Permission will be obtained from the participants of the conversations upon each recording.
Information about the participants, such as age, sex, accent and occupation, will be recorded when available.

Construction of an Electronic Database XI
Selection Features
Spoken data selection procedure
Spoken data included in the database: Context-governed part of spoken database
Educational-informative events: Lectures, news broadcasts, classroom discussions
Institutional-public events: Political speeches, parliamentary proceedings, council meetings
Leisure events: Sport commentaries, club meetings, after-dinner speeches

Construction of an Electronic Database XII
Computerizing and checking data - I
Re-use of existing electronic texts: Newspapers, magazines, journals, books, other materials
Scanning: Scanning and hand editing of books
Keyboarding: Typing of hand-written items, leaflets, recorded speech

Construction of an Electronic Database XIII Computerizing and checking data - II

Construction of an Electronic Database XIV Computerizing and checking data III

Construction of an Electronic Database XV
Computerizing and checking data - IV
Illustrating the re-use of existing electronic texts: Downloading materials from newspapers
Layout of a downloaded file: header, \blank line, \blank line, title, news text, \blank line, subtitle, subtitle text, \blank line, \blank line, 2nd news text
Sample news text:
MGK: Önce laiklik
AKP hükümeti ile Genelkurmay arasında Milli Görüş genelgesiyle başlayıp, TBMM resepsiyonunun boykotu ile süren gerilimin gölgesinde toplanan Milli Güvenlik Kurulu (MGK), 'ince ve diplomatik' laiklik uyarısı ile son buldu. 'Postmodern darbe' olarak nitelenen 28 Şubat 1997 tarihli toplantısının ardından en uzun toplantısını yapan MGK, bildirisinde laikliği, Irak ve Kıbrıs gibi konuların önüne koydu.
7.5 saatlik toplantı
Cumhurbaşkanı Ahmet Necdet Sezer başkanlığındaki MGK toplantısına Başbakan Recep Tayyip Erdoğan'ın yanı sıra Genelkurmay Başkanı Orgeneral Hilmi Özkök, Başbakan Yardımcıları Abdüllatif Şener ve Mehmet Ali Şahin, Dışişleri Bakanı Abdullah Gül, Milli Savunma Bakanı Vecdi Gönül, İçişleri Bakanı Abdülkadir Aksu, Adalet Bakanı Cemil Çiçek, kuvvet komutanları katıldı. Toplantıya ayrıca Genelkurmay Harekat Başkanı Korgeneral Köksal Karabay, Genelkurmay İstihbarat Başkanı Korgeneral Hüseyin Göksu, MİT Müsteşarı Şenkal Atasagun, Dışişleri Bakanlığı Müsteşarı Büyükelçi Uğur Ziyal de katıldı. İrticayla mücadele, Irak, Kıbrıs, Petrol Boru Hatları gibi konuların ele alındığı toplantı 7.5 saat sürdü. Yemek arası vermeyen MGK üyeleri sandviç atıştırarak toplantıya devam ettiler.
'Laiklik titiz korunsun'
Toplantının ardından yayımlanan kısa bildiride şöyle denildi:
1- MGK 30 Nisan 2003 tarihinde aylık olağan toplantısını yapmıştır.
2- Toplantıda ülkemizin güvenlik ve asayişini etkileyen iç ve dış gelişmeler gözden geçirilmiştir. Bu kapsamda;
a) Devletin temel niteliklerinden olan laiklik ilkesinin önemi ve titizlikle korunması vurgulanmıştır.
b) Irak'a ilişkin gelişmeler ayrıntılı değerlendirilmiş gelişmelerin yakından izlenmeye devam edilmesi ve gerekli temasların sürdürülmesi kararlaştırılmıştır.
c) Kıbrıs'la ilgili tüm gelişmeler kapsamlı olarak gözden geçirilmiştir.
d) Ayrıca, petrol boru hatlarına ilişkin son gelişmeler hakkında kurula bilgi sunulmuş ve alınması gereken önlemler görüşülmüştür.
Toplantıda tartışmaların önemli bir bölümü, son günlerde devletin zirvesinde büyük gerilim yaratan Milli Görüş genelgesi, türban tartışmaları, kadrolaşma gibi konular tartışılarak geçti.

Mersin Dil Derlemi Annotation - I Parts of Speech Tagging

Annotation -II
Developing part-of-speech tagging software
Rule-based morphological analyser: Construct an electronic lexicon of Turkish, run it on the corpus data (at least 10 million words), and obtain the most likely Turkish grammatical categories, such as nouns, verbs, adjectives, adverbs, conjunctions, etc.
Probabilistic analyser: It will be run on corpus data containing manually assigned part-of-speech tags, and will return the most probable word classes of Turkish.
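To make the idea of a probabilistic analyser concrete, the sketch below shows the simplest frequency-based baseline: each word receives the tag it carried most often in manually tagged training data. This is an illustrative stand-in, not the analyser developed in the project; the function names, the tag labels and the toy Turkish examples are all assumptions.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count how often each word form carries each tag in hand-tagged data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            # Words are used as written; case folding is left out on purpose
            # because Turkish dotted/dotless i needs locale-aware handling.
            counts[word][tag] += 1
    # Keep only the most frequent tag per word form.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="noun"):
    """Assign the most probable tag; unseen words fall back to `default`."""
    return [(word, model.get(word, default)) for word in words]

# Toy hand-tagged data (hypothetical tags).
training = [
    [("güzel", "adj"), ("ev", "noun")],
    [("güzel", "adj"), ("gün", "noun")],
    [("güzel", "noun")],
]
model = train(training)
tagged = tag(["güzel", "kitap"], model)
```

A real probabilistic tagger would also use the surrounding context, but even this baseline shows why the manually tagged data has to come first.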

Annotation -II
Developing a part-of-speech tagger
Comparison: Compare the word classes assigned by the rule-based analyser with those assigned by the probabilistic analyser.
To achieve precision in the POS tagging software, human post-editing will be carried out on the software's output.
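The comparison step can be pictured as follows: run both analysers over the same tokens, measure how often they agree, and hand the disagreements to a human post-editor. The sketch assumes each analyser returns a list of (word, tag) pairs in the same order; the function name `compare_taggings` and the tag labels are hypothetical.

```python
def compare_taggings(rule_based, probabilistic):
    """Return (agreement rate, list of disputed tokens) for two taggings."""
    if len(rule_based) != len(probabilistic):
        raise ValueError("both taggers must tag the same token sequence")
    disputed = []
    pairs = zip(rule_based, probabilistic)
    for index, ((word_a, tag_a), (word_b, tag_b)) in enumerate(pairs):
        if word_a != word_b:
            raise ValueError(f"token mismatch at position {index}")
        if tag_a != tag_b:
            # Keep both candidate tags so the post-editor can choose.
            disputed.append((index, word_a, tag_a, tag_b))
    if not rule_based:
        return 1.0, disputed
    agreement = 1 - len(disputed) / len(rule_based)
    return agreement, disputed

rule_output = [("güzel", "adj"), ("ev", "noun"), ("gel", "verb"), ("çok", "adv")]
prob_output = [("güzel", "adj"), ("ev", "noun"), ("gel", "noun"), ("çok", "adv")]
agreement, disputed = compare_taggings(rule_output, prob_output)
```

Only the disputed tokens need manual review, which keeps the post-editing effort proportional to the disagreement between the two analysers.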

Annotation -III
A part-of-speech tagged token
It is a dream come true for me she said I love you.
<p>
<s n="1">
<w c1="pnp" hw="it" pos="pron">it </w>
<w c1="vbz" hw="be" pos="verb">is </w>
<w c1="at0" hw="a" pos="art">a </w>
<w c1="nn1" hw="dream" pos="subst">dream </w>
<w c1="vvb" hw="come" pos="verb">come </w>
<w c1="aj0" hw="true" pos="adj">true </w>
<w c1="prp" hw="for" pos="prep">for </w>
<w c1="pnp" hw="i" pos="pron">me</w>
<w c1="pnp" hw="she" pos="pron">she </w>
<w c1="vvd" hw="say" pos="verb">said</w>
</s>
<s n="2">
<w c1="pnp" hw="i" pos="pron">i </w>
<w c1="vvb" hw="love" pos="verb">love </w>
<w c1="pnp" hw="you" pos="pron">you </w>
<c c1="pun">.</c>
</s>
</p>
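Markup of this kind can be read back with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened, well-formed fragment in the same style (the first four words of the example sentence plus a full stop added for illustration); the function name `read_tokens` is an assumption.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<p><s n="1">'
    '<w c1="pnp" hw="it" pos="pron">it </w>'
    '<w c1="vbz" hw="be" pos="verb">is </w>'
    '<w c1="at0" hw="a" pos="art">a </w>'
    '<w c1="nn1" hw="dream" pos="subst">dream </w>'
    '<c c1="pun">.</c>'
    '</s></p>'
)

def read_tokens(xml_text):
    """Return (sentence number, word form, headword, pos, c1 tag) per <w>."""
    root = ET.fromstring(xml_text)
    tokens = []
    for sentence in root.iter("s"):
        for element in sentence:
            if element.tag != "w":
                continue  # skip punctuation (<c>) elements
            form = (element.text or "").strip()
            tokens.append((
                sentence.get("n"),
                form,
                element.get("hw"),
                element.get("pos"),
                element.get("c1"),
            ))
    return tokens

tokens = read_tokens(SAMPLE)
```

Each tuple carries the surface form, the headword (`hw`), the coarse word class (`pos`) and the fine-grained tag (`c1`), which is the information a query over word classes needs.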

Developing the Corpus Interface I
Interface features
Query: Detailed specification of the categories of text, and/or sections of texts, to constrain a search
Concordance output: Navigating through concordance output and manipulating the concordance
Text information - Statistical information: Facility to access metatextual information and frequency data
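A concordance lists every occurrence of a search word together with its context, and the text categories in the header can be used to narrow the search. The following is a minimal keyword-in-context sketch under stated assumptions: texts are held in memory as dictionaries with a `header` and a list of `tokens`, and the names `concordance` and `search` are hypothetical rather than part of the planned interface.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def search(texts, keyword, width=3, **constraints):
    """Search only texts whose header matches every given constraint."""
    hits = []
    for text in texts:
        header = text["header"]
        if all(header.get(key) == value for key, value in constraints.items()):
            hits.extend(concordance(text["tokens"], keyword, width))
    return hits

texts = [
    {"header": {"medium": "Book"},
     "tokens": "it is a dream come true for me".split()},
    {"header": {"medium": "Periodical"},
     "tokens": "the dream team won again".split()},
]
book_hits = search(texts, "dream", width=2, medium="Book")
all_hits = search(texts, "dream", width=2)
```

A production interface would query an index instead of scanning token lists, but the shape of the result stays the same: one line per hit, with the keyword aligned between its left and right context.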

Developing the Corpus Interface IV Visual Design

Developing the Corpus Interface V
Cross-platform compatibility
Character set: Texts in the corpus data will be saved in the UTF-8 character encoding.
Platform-free: The corpus interface will be built according to W3C standards and will be accessible through any web browser on any platform.
Result set: Corpus results will be usable on any platform and compatible with other software.

Prerelease of the Corpus
Releases
Beta Release I - Local Testing: Bringing out the corpus on the local network for testing
Beta Release II - National Testing: Upon local testing, opening the corpus to national users and academics
Release Candidate I - International Testing: Upon national testing, making the corpus available to international users
Release Candidate II - Turkish National Corpus: The final candidate release before the release of the corpus

And the volunteers working...

Thank you