Newest Data Arrivals
Last update: July 20, 2019
Recently added and updated files
    First DIHARD Challenge Development - Eight Sources was developed by the Linguistic Data Consortium (LDC) and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization, that is, speech diarization for challenging corpora where existing state-of-the-art systems were expected to fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions, including, but not limited to: clinical interviews, extended child language acquisition recordings, YouTube recordings, and conversations collected in restaurants.

    Data

    This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10), contains the development set audio data and annotation as well as the official scoring tool. The evaluation data for the First DIHARD Challenge is also available from LDC as Nine Sources (LDC2019S12) and SEEDLingS (LDC2019S13). The source data was drawn from the following (all sources are in English unless otherwise indicated):

    - Autism Diagnostic Observation Schedule (ADOS) interviews
    - DCIEM/HCRC map task (LDC96S38)
    - Audiobook recordings from LibriVox
    - Meeting speech from 2004 Spring NIST Rich Transcription (RT-04S) Development (LDC2007S11) and Evaluation (LDC2007S12) releases
    - 2001 U.S. Supreme Court oral arguments
    - Sociolinguistic interviews from SLX Corpus of Classic Sociolinguistic Interviews (LDC2003T15)
    - Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project
    - YouthPoint radio interviews

    All audio is provided in the form of 16 kHz, mono-channel FLAC files. The diarization for each recording is stored as a NIST Rich Transcription Time Marked (RTTM) file. RTTM files are space-separated text files containing one turn per line. Segmentation files are stored as HTK label files; each of these files contains one speech segment per line. Both annotation file types are encoded as UTF-8. More information about the file formats and data sources is in the included documentation.
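
    Since each RTTM line carries a turn's onset, duration, and speaker label in fixed space-separated fields, reading the diarization reduces to simple field splitting. Below is a minimal sketch in Python, assuming the standard ten-field NIST RTTM layout (type, file ID, channel, onset, duration, placeholder fields around the speaker name, and a final placeholder); the path argument is a hypothetical placeholder.

        # Minimal sketch: read speaker turns from a NIST RTTM file.
        # Assumed field layout (standard RTTM):
        #   SPEAKER <file-id> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> [<NA>]
        from collections import namedtuple

        Turn = namedtuple("Turn", ["file_id", "onset", "duration", "speaker"])

        def read_rttm(path):
            turns = []
            with open(path, encoding="utf-8") as f:
                for line in f:
                    fields = line.split()
                    if not fields or fields[0] != "SPEAKER":
                        continue  # keep only speaker-turn records
                    turns.append(Turn(file_id=fields[1],
                                      onset=float(fields[3]),
                                      duration=float(fields[4]),
                                      speaker=fields[7]))
            return turns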

    Published on: 19 July 2019

    Permanent URL: http://hdl.handle.net/11272/EIEQG

    The 2016 Census public use microdata file (PUMF) on households contains 140,720 private households with a total of 343,330 individual records, representing 1% of the population in private households in occupied private dwellings in Canada. These records were drawn from a sample of one quarter of the Canadian population (sample data from questionnaire 2A-L). The 2016 PUMF contains 95 variables. Of these, 81 variables, or 85%, come from the individual universe and 14 variables, or 15%, are drawn from the family, household and dwelling universes. In addition, the file contains four unique record identifiers (ID), an individual weighting factor and 16 replicate weights for the purpose of estimating sampling variability. The file does not include: people living in institutions; Canadian citizens living temporarily in other countries; full-time members of the Canadian Forces stationed outside Canada; persons living in institutional collective dwellings such as hospitals, nursing homes and penitentiaries; and persons living in non-institutional collective dwellings such as work camps, hotels and motels, and student residences.
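
    The replicate weights support design-based variance estimation: recompute the statistic of interest once with each replicate weight set, then scale the squared deviations from the full-sample estimate. A minimal sketch for a weighted mean follows; the scaling constant c is deliberately left as a parameter, since the correct value for this file's 16 replicate weights is documented in the PUMF user guide and not assumed here.

        # Minimal sketch: replicate-weight variance estimate for a weighted mean.
        # The scaling constant c is a parameter on purpose; consult the PUMF
        # user guide for the value appropriate to these 16 replicate weights.
        def weighted_mean(values, weights):
            return sum(v * w for v, w in zip(values, weights)) / sum(weights)

        def replicate_variance(values, full_weights, replicate_weight_sets, c):
            theta = weighted_mean(values, full_weights)  # full-sample estimate
            return c * sum((weighted_mean(values, rw) - theta) ** 2
                           for rw in replicate_weight_sets)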

    Published on: 17 July 2019

    Permanent URL: http://hdl.handle.net/11272/10735

    Orthorectified aerial imagery of the UBC Vancouver campus, 2019. Orthophoto pixel size: 10 cm.

    Published on: 17 July 2019

    Permanent URL: http://hdl.handle.net/11272/SVGVA

    Korean Telephone Conversations Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2003T08 and ISBN 1-58563-264-3, was produced by LDC. The telephone conversations on which these transcripts are based were originally recorded as part of the CALLFRIEND project. The CALLFRIEND Korean telephone speech was collected by LDC primarily in support of the Language Identification (LID) project, sponsored by the U.S. Department of Defense. The calls were later transcribed for use in other projects. This publication consists of 100 transcribed telephone conversations in Korean. The corresponding speech is published as Korean Telephone Conversations Speech. The Korean orthographic forms from the 100 transcription files serve as the head-words in the associated Korean Telephone Conversations Lexicon. The recorded conversations are between native speakers of Korean and last up to 30 minutes, of which the transcribed speech covers between 15 and 18 minutes. All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends. All calls originated in either the United States or Canada.

    Data

    There are 100 time-aligned text files, totalling approximately 190K words and 25K unique words. All files are in Korean orthography: orthographic Korean characters are in Hangul, encoded in the KSC5601 (Wansung) system, also known as EUC-KR or ISO-2022-KR.
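
    Because the transcripts use a legacy encoding, tools that expect UTF-8 need an explicit decoding step. A minimal sketch using Python's built-in euc_kr codec (which covers KS C 5601/Wansung); the file names are hypothetical placeholders.

        # Minimal sketch: convert a KSC5601/EUC-KR transcript to UTF-8.
        with open("transcript.txt", "rb") as f:
            raw = f.read()

        text = raw.decode("euc_kr")  # Python's codec for KS C 5601 (Wansung)

        with open("transcript_utf8.txt", "w", encoding="utf-8") as f:
            f.write(text)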

    Published on: 15 July 2019

    Permanent URL: http://hdl.handle.net/11272/KHZ73

    The Social Policy Simulation Database and Model (SPSD/M) is a tool designed to assist those interested in analyzing the financial interactions of governments and individuals in Canada. It can help one to assess the cost implications or income redistributive effects of changes in the personal taxation and cash transfer system. As the name implies, SPSD/M consists of two integrated parts: a database (SPSD) and a model (SPSM). The SPSD is a non-confidential, statistically representative database of individuals in their family context, with enough information on each individual to compute taxes paid to and cash transfers received from government. The SPSM is a static accounting model which processes each individual and family on the SPSD, calculates taxes and transfers using legislated or proposed programs and algorithms, and reports on the results. A sophisticated software environment gives the user a high degree of control over the inputs and outputs to the model and can allow the user to modify existing programs or test proposals for entirely new programs. The model comes with full documentation including an on-line help facility.

    This is a model update: the database remains unchanged and is still based on the 2015 survey data from the Canadian Income Survey (CIS). This version is capable of modeling the tax/transfer system for all years from 1997 through 2025. The update includes the following:

    - Changes resulting from the 2019-20 budgets, announced before May 1, 2019, have been incorporated.
    - Changes resulting from the 2018 tax forms have been incorporated.
    - Changes resulting from the 2018 and 2019 TD1s have been incorporated.
    - The regulatory charge on fossil fuels has been included in the commodity tax model.
    - The weights have been updated to reflect changes in Statistics Canada's population estimates.
    - The most recent economic growth projections from the Parliamentary Budget Office have been incorporated.
    - Income and expenditures are now grown on a per capita basis in order to reflect demographic changes.

    Published on: 02 July 2019

    Permanent URL: http://hdl.handle.net/11272/10734

    Multi-Language Conversational Telephone Speech 2011 – English Group was developed by the Linguistic Data Consortium (LDC) and comprises approximately 18 hours of telephone speech in two general varieties of English: American and South Asian. The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related. LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

    - Slavic Group (LDC2016S11)
    - Turkish (LDC2017S09)
    - South Asian (LDC2017S14)
    - Central Asian (LDC2018S03)
    - Central European (LDC2018S08)
    - Spanish (LDC2018S12)
    - Arabic (LDC2019S02)

    Data

    Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data was collected using LDC's telephone collection infrastructure, comprising three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. Demographic information about the participants was not collected. All audio data are presented in FLAC-compressed MS-WAV (RIFF) file format (*.flac); when uncompressed, each file is 2 channels, recorded at 8000 samples/second with samples stored as 16-bit signed integers, representing a lossless conversion from the original mu-law sample data as captured digitally from the public telephone network. The following table summarizes the total number of calls, total hours of recorded audio, and total size of the compressed data:

    group     lng      #calls   #hours   #MB
    english   eng      62       13.5     589
    english   eni      26       4.9      242
    english   totals   88       18.4     831
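
    Since each call is stored as a two-channel file with one side of the conversation per channel, per-speaker processing starts by splitting the channels. A minimal sketch using the third-party soundfile library; the file name is a hypothetical placeholder.

        # Minimal sketch: load one call and separate the two telephone channels.
        import soundfile as sf  # pip install soundfile

        audio, sample_rate = sf.read("call.flac")  # audio shape: (n_samples, 2)
        assert sample_rate == 8000                 # 8000 samples/second per the corpus spec
        side_a = audio[:, 0]                       # one side of the call per channel
        side_b = audio[:, 1]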

    Published on: 27 June 2019

    Permanent URL: http://hdl.handle.net/11272/HKSYZ

    CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed by the social service program "Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website. CIEMPIESS Experimentation is a set of three different data sets: Complementary, Fem and Test. Complementary is a phonetically-balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to balance by gender the number of recordings from male speakers in other CIEMPIESS collections. Test consists of 10 hours of broadcast speech and transcripts and is intended for use as a standard test data set alongside other CIEMPIESS corpora. See the included documentation for more details on each corpus. LDC has released the following data sets in the CIEMPIESS series:

    - CIEMPIESS (LDC2015S07)
    - CHM150 (LDC2016S04)
    - CIEMPIESS Light (LDC2017S23)
    - CIEMPIESS Balance (LDC2018S11)

    Data

    The majority of the speech recordings in Fem and Test were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). Those two channels feature videos with speech about legal issues and topics related to UNAM. The Complementary recordings consist of read speech collected for that corpus. Complementary includes specifications for creating transcripts using the phonetic alphabet Mexbet and for converting Mexbet output to the International Phonetic Alphabet and X-SAMPA. An automatic phonetizer for Mexbet, written in Python 2.7, is provided as well for creating pronouncing dictionaries. The audio files are presented as 16 kHz, 16-bit PCM FLAC files for this release. Transcripts are presented as UTF-8 encoded plain text.
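
    The Mexbet-to-IPA conversion described above is essentially a symbol-table lookup. A minimal sketch of that idea follows; the mapping entries are illustrative placeholders rather than the actual Mexbet inventory, which ships with the corpus documentation.

        # Minimal sketch: convert a sequence of Mexbet symbols to IPA via a
        # lookup table.  The entries here are hypothetical examples; use the
        # correspondence tables included with the corpus for real work.
        MEXBET_TO_IPA = {
            "a": "a",
            "tS": "tʃ",
            "n~": "ɲ",
        }

        def mexbet_to_ipa(symbols):
            """symbols: list of Mexbet symbols for one word."""
            return "".join(MEXBET_TO_IPA.get(s, s) for s in symbols)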

    Published on: 21 June 2019

    Permanent URL: http://hdl.handle.net/11272/HDIAS

    The poetry tape contains the collected poems of fourteen Canadian poets, stored in a compact format and keyed by uniquely constructed poem code numbers. These poems were collected and keypunched under the supervision of Mrs. Sandra Djwa of the Department of English at UBC.

    Note: The original FORTRAN IV programs to parse the text data are not available, although their functionality was limited to:

    - creating a list of poem titles
    - displaying poems by index number

    UBC Library Data Services note, 21 June 2019

    Published on: 21 June 2019

    Permanent URL: http://hdl.handle.net/11272/GZXWC

    TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 2014. The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base, extract novel information about entities from a document collection, and add it to a new or existing knowledge base. The Chinese Regular Slot Filling evaluation track involved mining information about entities from text. Slot Filling can be viewed as more traditional Information Extraction, or alternatively, as a Question Answering task in which the questions are static but the targets change. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection. For more information about Chinese Slot Filling, please refer to the 2014 track home page.

    Data

    This release contains all evaluation and training data developed in support of TAC KBP Chinese Regular Slot Filling. This includes queries, the "manual runs" (human-produced responses to the queries), the final rounds of assessment results and the complete set of Chinese source documents. All text data is encoded as UTF-8.

    Published on: 20 June 2019

    Permanent URL: http://hdl.handle.net/11272/TYWMQ

    FactBank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2009T23 and ISBN 1-58563-522-7, consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they correspond to real situations in the world. FactBank 1.0 was built on top of TimeBank 1.2 and a fragment of the AQUAINT TimeML Corpus, both of which used the TimeML specification language. This resulted in a double-layered annotation of event factuality: TimeBank 1.2 and AQUAINT TimeML encode most of the basic structural elements expressing factuality information, while FactBank 1.0 represents the resulting factuality interpretation. The combination of the factuality values in FactBank with the structural information in TimeML-annotated corpora facilitates the development of tools aimed at automatically identifying the factuality values of events, a fundamental component in tasks requiring some degree of text understanding, such as Textual Entailment, Question Answering, or Narrative Understanding.

    FactBank annotations indicate whether the event mention describes actual situations in the world, situations that have not happened, or situations of uncertain interpretation. Event factuality is not an inherent feature of events but a matter of perspective. Different discourse participants may present divergent views about the factuality of the very same event. Consequently, in FactBank, the factuality degree of events is assigned relative to the relevant sources at play. In this way, it can adequately reflect the divergence of opinions regarding the factual status of events, as is common in news reports. The annotation language is grounded in established linguistic analyses of the phenomenon, which facilitated the creation of a battery of discriminatory tests for distinguishing between factuality values. Furthermore, the annotation procedure was carefully designed and divided into basic, sequential annotation tasks. This made it possible for hard tasks to be built on top of simpler ones, while at the same time allowing annotators to become incrementally familiar with the complexity of the problem. As a result, FactBank annotation achieved a relatively high inter-annotator agreement, kappa = 0.81, a positive result when considered against similar annotation efforts.

    Data

    All FactBank markup is standoff and is represented through a set of 20 tables which can be easily loaded into a database. Each table resides in an independent text file, where fields are separated by three consecutive bars (i.e., |||). The data in fields of string type are presented between simple quotes ('). Because FactBank 1.0 was built on top of TimeBank 1.2 and AQUAINT TimeML, both of which are marked up with inline XML-based annotation, this release contains the TimeBank 1.2 and AQUAINT TimeML annotation in standoff, table-based format as well.
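
    Given the ||| field separator and quoted string fields described above, loading one of the 20 tables takes only a few lines. A minimal sketch in Python; the function is hypothetical, and stripping the simple quotes from every field is a simplification (strictly, only string-typed fields carry them).

        # Minimal sketch: load a FactBank standoff table into a list of rows.
        def read_factbank_table(path):
            rows = []
            with open(path, encoding="utf-8") as f:
                for line in f:
                    fields = line.rstrip("\n").split("|||")
                    # Drop the simple quotes around string-typed fields.
                    rows.append([field.strip("'") for field in fields])
            return rows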

    Published on: 20 June 2019

    Permanent URL: http://hdl.handle.net/11272/XRITF