Newest Data
Arrivals
Last update: May 8, 2015
Recently added and updated files
    Introduction
    2010 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and speech recorded over a microphone channel involving an interview scenario used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation (SRE). The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text-independent speaker recognition. To this end the evaluations are designed to be simple, to focus on core technology issues, to be fully supported and to be accessible to those wishing to participate. The 2010 evaluation was similar to the 2008 evaluation in that the training and test conditions for the core test included not only conversational telephone speech (CTS) recorded over ordinary telephone channels, but also CTS and conversational interview speech recorded over a room microphone channel. Unlike prior evaluations, some of the conversational telephone style speech was collected in a manner to produce particularly high, or particularly low, vocal effort on the part of the speaker of interest.

    Data
    The speech recordings in this release were collected in 2009 and 2010 by LDC at its Human Subjects Collection facility in Philadelphia. This collection was part of the Mixer 6 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones. The telephone speech segments include two-channel excerpts of approximately 10 seconds and 5 minutes. There are also summed-channel excerpts in the range of 5 minutes. The microphone excerpts are 3-15 minutes in duration. As in prior evaluations, intervals of silence were not removed. The data included in this release is 8-bit u-law with a sample rate of 8000 Hz. In addition to evaluation data, this package also includes answer keys, trial and train files, development data and evaluation documentation.

    Published on: 28 April 2017

    Permanent URL: http://hdl.handle.net/11272/V7OXL
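
    The 8-bit u-law encoding at 8000 samples per second described for the 2010 NIST SRE test set can be expanded to 16-bit linear samples with the standard G.711 formula. The following is a minimal sketch, not part of the release: it assumes the input is headerless u-law bytes (any file header has already been stripped), and the function names are illustrative only.

```python
from typing import List

SAMPLE_RATE_HZ = 8000  # sample rate stated in the description (unit assumed to be Hz)


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one 8-bit u-law code word into a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # u-law code words are stored bit-inverted
    sign = u & 0x80                  # top bit carries the sign
    exponent = (u >> 4) & 0x07       # 3-bit segment number
    mantissa = u & 0x0F              # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(raw: bytes) -> List[int]:
    """Decode a buffer of headerless u-law bytes into linear samples."""
    return [ulaw_byte_to_pcm16(b) for b in raw]


def duration_seconds(raw: bytes, channels: int = 1) -> float:
    """One byte per sample per channel, so duration follows from the byte count."""
    return len(raw) / (SAMPLE_RATE_HZ * channels)


# Three hand-picked code words: silence, most negative, most positive.
samples = decode_ulaw(bytes([0xFF, 0x00, 0x80]))
```

    At one byte per sample, a single-channel excerpt of about 10 seconds occupies roughly 80,000 bytes of audio data, and a two-channel excerpt twice that.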

    Introduction
    CHiME2 Grid was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 120 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments. CHiME2 Grid reflects the small vocabulary track of the CHiME2 Challenge. The target utterances were taken from the Grid corpus and consist of 34 speakers reading simple 6-word sequences.

    Data
    Data is divided into training, development and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The noisy utterances are provided both in isolated form and in embedded form. The latter either include five seconds of background noise before and after the utterance (in the training set) or are mixed into continuous five-minute background noise recordings (in the development and test sets). Seven hours of background noise that are not part of the training set are also included. The data is accompanied by one annotation file per speaker that includes additional technical information. Also included are a baseline Hidden Markov Model (HMM)-based speech recogniser and a scoring tool designed for the 2nd CHiME Challenge to allow users to obtain keyword recognition scores from formatted result files, perform recognition and score the challenge data, and estimate parameters of speaker-dependent HMMs.

    Published on: 27 April 2017

    Permanent URL: http://hdl.handle.net/11272/GJ8WY

    Introduction
    BOLT Egyptian Arabic SMS/Chat and Transliteration was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Egyptian Arabic. The corpus contains 5,691 conversations totaling 1,029,248 words across 262,026 messages. Messages were natively written in either Arabic orthography or romanized Arabizi. A total of 1,856 Arabizi conversations (287,022 words) were transliterated from the original romanized Arabizi script into standard Arabic orthography. The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources – discussion forums, text messaging and chat – in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

    Data
    The data in this release was collected using two methods: new collection via LDC’s collection platform, and donation of SMS or chat archives from BOLT collection participants. All data collected were reviewed manually to exclude any messages/conversations that were not in the target language or that had sensitive content, such as personal identifying information (PII). A portion of the source conversations containing Arabizi tokens were automatically transliterated into Arabic script. Once the Arabizi source was transliterated into Arabic script automatically, LDC annotators reviewed, corrected and normalized the transliteration according to “Conventional Orthography for Dialectal Arabic” (CODA). All data is presented in XML.

    Published on: 26 April 2017

    Permanent URL: http://hdl.handle.net/11272/LHYC4

    TransLink route and station data created from the General Transit Feed Specification (GTFS) feed, downloaded 24 April 2017. Esri shapefiles and GeoJSON were created by UBC Library from the TransLink GTFS feed.
    Stops shapefile: transit stops as a point shapefile.
    Shapes, routes and trips shapefile and GeoJSON: bus routes as a polyline shapefile with trip information. No time codes are included.

    Published on: 25 April 2017

    Permanent URL: http://hdl.handle.net/11272/10476
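
    To illustrate how point features like the TransLink stops layer can be derived from a GTFS feed, the sketch below turns rows of a GTFS stops.txt file into a GeoJSON FeatureCollection. It is an assumption-laden example, not the workflow UBC Library used: the function name and the sample row are made up, and only the standard GTFS fields stop_id, stop_name, stop_lat and stop_lon are read.

```python
import csv
import io
import json


def stops_to_geojson(stops_csv_text: str) -> dict:
    """Convert the text of a GTFS stops.txt file into a GeoJSON FeatureCollection."""
    reader = csv.DictReader(io.StringIO(stops_csv_text))
    features = []
    for row in reader:
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [float(row["stop_lon"]), float(row["stop_lat"])],
            },
            "properties": {
                "stop_id": row["stop_id"],
                "stop_name": row["stop_name"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical one-row feed used only to exercise the function.
sample_stops = (
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "1,Example Stop,49.2606,-123.2460\n"
)
collection = stops_to_geojson(sample_stops)
geojson_text = json.dumps(collection)
```

    The resulting text can be saved with a .geojson extension and opened in most GIS software. Route polylines would need the shapes.txt, trips.txt and routes.txt files joined in a similar way.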

    The Inter-corporate ownership product is the most authoritative and comprehensive source of information available on corporate ownership; a unique directory of "who owns what" in Canada. It provides up-to-date information reflecting recent corporate takeovers and other substantial changes. Ultimate corporate control is determined through a careful study of holdings by corporations, the effects of options, insider holdings, convertible shares and interlocking directorships. The number of corporations that make up the hierarchy of structures totals approximately 45,000. The information that is presented is based on non-confidential returns filed by Canadian corporations under the Corporations Returns Act and on research using public sources such as internet sites. The data are presented in an easy-to-read tiered format, illustrating at a glance the hierarchy of subsidiaries within each corporate structure. The entries for each corporation provide both the country of control and the country of residence. The product covers every individual corporation that is part of a group of commonly controlled corporations with combined assets exceeding 600 million dollars or combined revenue exceeding 200 million dollars. Individual corporations with debt obligations or equity owing to non-residents exceeding a net book value of 1 million dollars are covered as well.

    Published on: 13 April 2017

    Permanent URL: http://hdl.handle.net/11272/10475

    The Inter-corporate ownership product is the most authoritative and comprehensive source of information available on corporate ownership; a unique directory of "who owns what" in Canada. It provides up-to-date information reflecting recent corporate takeovers and other substantial changes. Ultimate corporate control is determined through a careful study of holdings by corporations, the effects of options, insider holdings, convertible shares and interlocking directorships. The number of corporations that make up the hierarchy of structures totals approximately 45,000. The information that is presented is based on non-confidential returns filed by Canadian corporations under the Corporations Returns Act and on research using public sources such as internet sites. The data are presented in an easy-to-read tiered format, illustrating at a glance the hierarchy of subsidiaries within each corporate structure. The entries for each corporation provide both the country of control and the country of residence. The product covers every individual corporation that is part of a group of commonly controlled corporations with combined assets exceeding 600 million dollars or combined revenue exceeding 200 million dollars. Individual corporations with debt obligations or equity owing to non-residents exceeding a net book value of 1 million dollars are covered as well.

    Published on: 13 April 2017

    Permanent URL: http://hdl.handle.net/11272/10361

    The Labour Force Survey provides estimates of employment and unemployment which are among the most timely and important measures of performance of the Canadian economy. With the release of the survey results only 10 days after the completion of data collection, the LFS estimates are the first of the major monthly economic data series to be released. The Canadian Labour Force Survey was developed following the Second World War to satisfy a need for reliable and timely data on the labour market. Information was urgently required on the massive labour market changes involved in the transition from a war to a peace-time economy. The main objective of the LFS is to divide the working-age population into three mutually exclusive classifications - employed, unemployed, and not in the labour force - and to provide descriptive and explanatory data on each of these. LFS data are used to produce the well-known unemployment rate as well as other standard labour market indicators such as the employment rate and the participation rate. The LFS also provides employment estimates by industry, occupation, public and private sector, hours worked and much more, all cross-classifiable by a variety of demographic characteristics. Estimates are produced for Canada, the provinces, the territories and a large number of sub-provincial regions. For employees, wage rates, union status, job permanency and workplace size are also produced. For a full listing and description of LFS variables, see the Guide to the Labour Force Survey (71-543-G), available through the "Publications" link above. These data are used by different levels of government for evaluation and planning of employment programs in Canada. Regional unemployment rates are used by Employment and Social Development Canada to determine eligibility, level and duration of insurance benefits for persons living within a particular employment insurance region. 
The data are also used by labour market analysts, economists, consultants, planners, forecasters and academics in both the private and public sectors.

    Published on: 10 April 2017

    Permanent URL: http://hdl.handle.net/11272/10439
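
    The three-way classification used by the Labour Force Survey (employed, unemployed, not in the labour force) leads directly to the headline indicators named in the description. The sketch below shows the usual arithmetic: the unemployment rate is taken relative to the labour force, while the participation and employment rates are taken relative to the working-age population. The function name and the input counts are invented for illustration and are not LFS estimates.

```python
def labour_market_rates(employed: float, unemployed: float, not_in_labour_force: float) -> dict:
    """Compute standard labour market indicators, in percent, from the three classes."""
    labour_force = employed + unemployed
    working_age_population = labour_force + not_in_labour_force
    return {
        "unemployment_rate": 100.0 * unemployed / labour_force,
        "participation_rate": 100.0 * labour_force / working_age_population,
        "employment_rate": 100.0 * employed / working_age_population,
    }


# Made-up counts (thousands of persons), chosen so the results are round numbers.
rates = labour_market_rates(employed=900, unemployed=100, not_in_labour_force=1000)
```

    With these made-up counts the unemployment rate is 10%, the participation rate 50% and the employment rate 45%.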

    Protein interactions shape proteome function and thus biology. Identification of protein interactions is a major goal in molecular biology, but biochemical methods, although improving, remain limited in coverage and accuracy. Whereas computational predictions can guide biochemical experiments, low validation rates of predictions remain a major limitation. Here, we investigated computational methods in the prediction of a specific type of interaction, the inhibitory interactions between proteases and their inhibitors. Proteases generate thousands of proteoforms that dynamically shape the functional state of proteomes. Despite the important regulatory role of proteases, knowledge of their inhibitors remains largely incomplete with the vast majority of proteases lacking an annotated inhibitor. To link inhibitors to their target proteases on a large scale, we applied computational methods to predict inhibitory interactions between proteases and their inhibitors based on complementary data including coexpression, phylogenetic similarity, structural information, co-annotation, and colocalization, and also surveyed general protein interaction networks for potential inhibitory interactions. In testing nine predicted interactions biochemically, we validated the inhibition of kallikrein 5 by serpin B12. Despite the use of a wide array of complementary data, we found a high false positive rate of computational predictions in biochemical follow-up. Based on a protease-specific definition of true negatives derived from the biochemical classification of proteases and inhibitors, we analyzed prediction accuracy of individual features. Thereby we identified feature-specific limitations, which also affected general protein interaction prediction methods. Interestingly, proteases were often not coexpressed with most of their functional inhibitors, contrary to what is commonly assumed and extrapolated predominantly from cell culture experiments. 
Predictions of inhibitory interactions were indeed more challenging than predictions of non-proteolytic and non-inhibitory interactions. In summary, we describe a novel and well-defined but difficult protein interaction prediction task, and thereby highlight limitations of computational interaction prediction methods.

    Published on: 03 April 2017

    Permanent URL: http://hdl.handle.net/11272/10472

    Introduction
    Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been modified; the original arrangement of the TIMIT corpus is still as described by the TIMIT documentation.

    Data
    The additive noise types are white, pink, blue, red, violet and babble noise, with noise levels varying in 5 dB (decibel) steps over a range of 5 to 50 dB. The color of noise refers to the power spectrum of a noise signal. Sound waves have two characteristics: frequency, which describes how fast the waveform vibrates per second; and amplitude, the size of the waveform. Colored noises are named in an analogy to the colors of light. For instance, white noise contains all audible frequencies just as white light contains all frequencies in the visible range. Non-white colored noises have more energy concentrated at the high or low end of the sound spectrum. White, pink and blue noise are officially defined in the federal telecommunications standard. The white, pink, blue, red and violet noise types added to the TIMIT data in this release were generated artificially using MATLAB. For the babble noise, a random segment of recorded babble speech was selected and scaled relative to the power of the original TIMIT audio signal. All audio files are presented as single-channel 16 kHz, 16-bit FLAC.

    Published on: 31 March 2017

    Permanent URL: http://hdl.handle.net/11272/UFA9N
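
    Scaling noise relative to the power of a speech signal, as described for the Noisy TIMIT babble noise, can be sketched as follows. This is not the procedure used to build the corpus (which was generated with MATLAB): the sketch assumes the stated 5 to 50 dB levels are signal-to-noise ratios, uses a synthetic tone in place of speech, and uses white Gaussian noise only. All names are illustrative.

```python
import math
import random

# Noise levels named in the description: 5 dB steps from 5 to 50 dB.
NOISE_LEVELS_DB = list(range(5, 55, 5))


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def scale_noise_to_snr(signal, noise, snr_db):
    """Rescale noise so that signal power / noise power equals the requested SNR."""
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [gain * v for v in noise]


def measure_snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


rng = random.Random(0)  # fixed seed so the example is repeatable
sample_rate = 16000
# Stand-in for one second of speech: a 440 Hz tone.
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
white_noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

scaled_noise = scale_noise_to_snr(signal, white_noise, snr_db=20)
noisy_signal = [s + n for s, n in zip(signal, scaled_noise)]
measured_snr_db = measure_snr_db(signal, scaled_noise)
```

    Colored noise (pink, blue, red, violet) would additionally require shaping the noise spectrum before scaling, which this sketch omits.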

    The Boundary Files portray the geographic limits used for census dissemination and are available for Provinces and Territories, Census Divisions, Economic Regions, Census Metropolitan Areas and Census Agglomerations, Census Consolidated Subdivisions, Census Subdivisions and Aggregate Dissemination Areas. There are two types of boundary files: digital and cartographic. Digital files depict the full extent of the geographical areas, including the coastal water area. Cartographic files depict the geographical areas using only the major land mass of Canada and its coastal islands. The files provide a framework for mapping and spatial analysis using commercially available geographic information systems (GIS) or other mapping software. The Boundary Files are portrayed in Lambert conformal conic projection and are based on the North American Datum of 1983 (NAD83). A reference guide is available (92-160-G).

    Published on: 31 March 2017

    Permanent URL: http://hdl.handle.net/11272/10436