Newest Data
Arrivals
Last update: October 21, 2017
Recently added and updated files
    Fisher English Training Speech Part 1 Transcripts represents the first half of a collection of conversational telephone speech (CTS) that was created at LDC in 2003. It contains time-aligned transcript data for 5,850 complete conversations, each lasting up to 10 minutes. In addition to the transcriptions, which are found under the trans directory, there is a complete set of tables describing the speakers, the properties of the telephone calls, and the set of topics that were used to initiate the conversations. The corresponding speech files are contained in Fisher English Training Speech Part 1 Speech (LDC2004S13). The Fisher telephone conversation collection protocol was created at LDC to address a critical need of developers trying to build robust automatic speech recognition (ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II, and the resulting corpora have been adapted for ASR research but were in fact developed for language and speaker identification respectively. Although the CALLHOME protocol and corpora were developed to support ASR technology, they feature small numbers of speakers making telephone calls of relatively long duration with narrow vocabulary across the collection. CALLHOME conversations are challengingly natural and intimate. Under the Fisher protocol, a large number of participants each call another participant, whom they typically do not know, for a short period of time to discuss the assigned topics. This maximizes inter-speaker variation and vocabulary breadth while also increasing formality. Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive the collection. Fisher is unique in being platform driven rather than participant driven. Participants who wish to initiate a call may do so; however, the collection platform initiates the majority of calls.
Participants need only answer their phones at the times they specified when registering for the study. To encourage a broad range of vocabulary, Fisher participants are asked to speak about an assigned topic chosen from a randomly generated list that changes every 24 hours. All participants that day will be assigned subjects from that list. Some topics are inherited or refined from previous Switchboard studies while others were developed specifically for the Fisher protocol.
Data: Overall, about 12% of the conversations were transcribed at LDC, and the rest were transcribed by BBN and WordWave using a significantly different approach to the task. A central goal in both sets was to maximize the speed and economy of the transcription process. This in turn involved forgoing certain aspects of mark-up detail and quality control that may have been common in previous, smaller corpora. The LDC transcripts were based on automatic segmentation of the audio data to identify the utterance end-points on both channels of each conversation. Given these time stamps, manual transcription was simply a matter of typing in the words for each segment and doing a rudimentary spell-check. No attempt was made to modify the segmentation boundaries manually, or to locate utterances that the segmenter might have missed. Portions of speech where the transcriber could not be sure exactly what was said were marked with double parentheses, (( … )), and the transcriber could hazard a guess as to what was said, or leave the region between parentheses blank. The LDC transcription process yields one plain-text transcript file per conversation, in which the first two lines show the call-ID and the fact that the transcript was developed at LDC. The remainder of the file contains one utterance per line (with blank lines separating the utterances), with the start-time, end-time, speaker/channel-ID and utterance text. Data collection and transcription were sponsored by DARPA and the U.S. Department of Defense, as part of the EARS project for research and development in automatic speech recognition.
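
The LDC transcript layout described above (two header lines, then one utterance per line carrying start-time, end-time, speaker/channel-ID and text) could be read with something like the following minimal sketch. The exact field separators are an assumption: the pattern below expects "<start> <end> <speaker>: <text>", and the names parse_transcript, uncertain_regions and Utterance are hypothetical helpers for illustration, not part of any LDC tooling.

```python
import re
from typing import List, NamedTuple


class Utterance(NamedTuple):
    start: float   # utterance start time (assumed to be in seconds)
    end: float     # utterance end time (assumed to be in seconds)
    speaker: str   # speaker/channel ID
    text: str      # transcribed words, possibly containing (( ... )) regions


# ASSUMPTION: each utterance line looks like "<start> <end> <speaker>: <text>".
_LINE = re.compile(r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(\S+?):\s*(.*)$")


def parse_transcript(text: str, header_lines: int = 2) -> List[Utterance]:
    """Skip the header lines, then parse one utterance per non-blank line."""
    utterances = []
    for line in text.splitlines()[header_lines:]:
        line = line.strip()
        if not line:
            continue  # blank lines only separate utterances
        match = _LINE.match(line)
        if match is None:
            continue  # ignore lines that do not fit the assumed layout
        start, end, speaker, words = match.groups()
        utterances.append(Utterance(float(start), float(end), speaker, words))
    return utterances


def uncertain_regions(utterance_text: str) -> List[str]:
    """Return the transcriber's guesses inside (( ... )); empty string if left blank."""
    return [guess.strip() for guess in re.findall(r"\(\((.*?)\)\)", utterance_text)]
```

Lines that do not match the assumed pattern are skipped rather than raising an error, so the sketch degrades quietly if the real files differ; the documentation shipped with the corpus remains the authority on the actual format.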

    Published on: 19 October 2017

    Permanent URL: http://hdl.handle.net/11272/IQON1

    Fisher English Training Speech Part 1 Speech represents the first half of a collection of conversational telephone speech (CTS) that was created at the LDC during 2003. It contains 5,850 audio files, each one containing a full conversation of up to 10 minutes. Additional information regarding the speakers involved and types of telephones used can be found in the companion text corpus of transcripts, Fisher English Training Speech Part 1 Transcripts (LDC2004T19). The Fisher telephone conversation collection protocol was created at LDC to address a critical need of developers trying to build robust automatic speech recognition (ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II, and the resulting corpora have been adapted for ASR research but were in fact developed for language and speaker identification respectively. Although the CALLHOME protocol and corpora were developed to support ASR technology, they feature small numbers of speakers making telephone calls of relatively long duration with narrow vocabulary across the collection. CALLHOME conversations are challengingly natural and intimate. Under the Fisher protocol, a very large number of participants each make a few calls of short duration speaking to other participants, whom they typically do not know, about assigned topics. This maximizes inter-speaker variation and vocabulary breadth while also increasing formality. Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive the collection. Fisher is unique in being platform driven rather than participant driven. Participants who wish to initiate a call may do so; however, the collection platform initiates the majority of calls. Participants need only answer their phones at the times they specified when registering for the study.
To encourage a broad range of vocabulary, Fisher participants are asked to speak on an assigned topic selected at random from a list that changes every 24 hours and applies to all subjects paired on that day. Some topics are inherited or refined from previous Switchboard studies while others were developed specifically for the Fisher protocol.
Data: The individual audio files are presented in NIST SPHERE format, and contain two-channel mu-law sample data; “shorten” compression has been applied to all files. Data collection and transcription were sponsored by DARPA and the U.S. Department of Defense, as part of the EARS project for research and development in automatic speech recognition.
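
To illustrate what "two-channel mu-law sample data" implies for a consumer of these files, here is a minimal sketch of mu-law expansion and channel separation. It rests on several assumptions not stated in the abstract: that the "shorten" compression has already been undone and the SPHERE header removed by other tools, that the encoding follows the usual G.711 mu-law convention, and that the two channels are interleaved sample by sample. The function names are hypothetical.

```python
from typing import List, Tuple


def mulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed linear PCM value."""
    u = ~byte & 0xFF                # mu-law bytes are stored bit-inverted
    sign = u & 0x80                 # top bit carries the sign
    exponent = (u >> 4) & 0x07      # next three bits select the segment
    mantissa = u & 0x0F             # low four bits are the step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(payload: bytes) -> Tuple[List[int], List[int]]:
    """Decode raw samples and de-interleave them into two channels.

    ASSUMPTION: `payload` is uncompressed, header-free sample data with the
    two channels interleaved one byte at a time.
    """
    decoded = [mulaw_to_linear(b) for b in payload]
    return decoded[0::2], decoded[1::2]
```

In practice a dedicated audio tool that understands SPHERE headers and shorten compression would normally handle all of these steps; the sketch only shows the sample-level arithmetic.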

    Published on: 19 October 2017

    Permanent URL: http://hdl.handle.net/11272/64I08

    The Canadian Centre for Justice Statistics (CCJS), in co-operation with the policing community, collects police-reported crime statistics through the Uniform Crime Reporting Survey (UCR). The UCR Survey was designed to measure the incidence of crime in Canadian society and its characteristics. UCR data reflect reported crime that has been substantiated by police. Information collected by the survey includes the number of criminal incidents, the clearance status of those incidents and persons-charged information. The UCR Survey produces a continuous historical record of crime and traffic statistics reported by every police agency in Canada since 1962. In 1988, a new version of the survey, UCR2, was created; it has since been referred to as the “incident-based” survey, in which microdata on characteristics of incidents, victims and accused are captured. Data from the UCR Survey provide key information for crime analysis, resource planning and program development for the policing community. Municipal and provincial governments use the data to aid decisions about the distribution of police resources, to define provincial standards and to make comparisons with other departments and provinces. To the federal government, the UCR survey provides information for policy and legislative development, evaluation of new legislative initiatives, and international comparisons. To the public, the UCR survey offers information on the nature and extent of police-reported crime and crime trends in Canada. As well, media, academics and researchers use these data to examine specific issues about crime.
Statistical activity: The survey is currently administered as part of the National Justice Statistics Initiative (NJSI). Since 1981, the Federal, Provincial and Territorial Deputy Ministers responsible for the administration of justice in Canada, with the Chief Statistician, have been working together in an enterprise known as the National Justice Statistics Initiative.
The mandate of the NJSI is to provide information to the justice community as well as the public on criminal and civil justice in Canada. Although this responsibility is shared among Federal, Provincial and Territorial departments, the lead responsibility for the development of Canada’s statistical system remains with Statistics Canada.

    Published on: 18 October 2017

    Permanent URL: http://hdl.handle.net/11272/O6UQ6

    Introduction: IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Lao conversational and scripted telephone speech collected in 2013 along with corresponding transcripts. The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.
Data: The Lao speech in this release represents that spoken in the Vientiane dialect region in Laos. The gender distribution among speakers is approximately equal; speakers’ ages range from 16 years to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle. Audio data is presented as 8 kHz 8-bit a-law encoded audio in SPHERE format and 48 kHz 24-bit PCM encoded audio in WAV format. Transcripts are encoded in UTF-8. The romanization scheme was developed by Appen and was based on the scheme developed by the American Library Association and Library of Congress. Further information about transcription methodology is contained in the documentation accompanying this release. Evaluation data is available from NIST in support of OpenKWS.

    Published on: 17 October 2017

    Permanent URL: http://hdl.handle.net/11272/EQQQR

    The Tuition and Living Accommodation Costs (TLAC) survey collects data for full-time students at Canadian degree-granting institutions that are publicly funded. This annual survey was developed to provide an overview of tuition and additional compulsory fees, and living accommodation costs for an academic year. The TLAC survey data are used to: provide stakeholders, the public and students with annual tuition costs and changes in tuition fees from the previous year; contribute to a better understanding of the costs to obtain a degree; contribute to education policy development; contribute to the Consumer Price Index; facilitate interprovincial comparisons; and facilitate comparisons between institutions. Reference period: Academic year (September 1 to April 30). Collection period: April through June.

    Published on: 13 October 2017

    Permanent URL: http://hdl.handle.net/11272/LN2IO

    Since the beginning of 2005, the Travel Survey of Residents of Canada (TSRC) has been conducted to measure domestic travel in Canada. It replaces the Canadian Travel Survey (CTS). Featuring several definitional changes and a new questionnaire, this survey provides estimates of domestic travel that are more in line with the international guidelines recommended by the World Tourism Organization (WTO) and the United Nations Statistical Commission. In 2011, TSRC underwent a redesign. The Travel Survey of Residents of Canada is sponsored by Statistics Canada, the Canadian Tourism Commission, and the provincial governments. It measures the size of domestic travel in Canada from the demand side. The objectives of the survey are to provide information about the volume of trips and expenditures for Canadian residents by trip origin, destination, duration, type of accommodation used, trip reason, mode of travel, etc.; to provide information on travel incidence and to provide the socio-demographic profile of travellers and non-travellers. Estimates allow quarterly analysis at the national, provincial and tourism region level (with varying degrees of precision) on: total volume of same-day and overnight trips taken by the residents of Canada with destinations in Canada, same-day and overnight visits in Canada, main purpose of the trip/key activities on trip, spending on same-day and overnight trips taken in Canada by Canadian residents in total and by category of expenditure, modes of transportation (main/other) used on the trip, person-visits, household-visits, spending in total and by expense category for each location visited in Canada, person- and household-nights spent in each location visited in Canada, in total and by type of accommodation used, use of travel packages and associated spending and source of payment (household, government, private employer), demographics of adults that took or did not take trips, and travel party composition. 
The main users of the TSRC data are Statistics Canada, the Canadian Tourism Commission, the provinces, and tourism boards. Other users include the media, businesses, consultants and researchers.

    Published on: 13 October 2017

    Permanent URL: http://hdl.handle.net/11272/10511

    The Inter-corporate ownership product is the most authoritative and comprehensive source of information available on corporate ownership; a unique directory of "who owns what" in Canada. It provides up-to-date information reflecting recent corporate takeovers and other substantial changes. Ultimate corporate control is determined through a careful study of holdings by corporations, the effects of options, insider holdings, convertible shares and interlocking directorships. The number of corporations that make up the hierarchy of structures totals approximately 45,000. The information that is presented is based on non-confidential returns filed by Canadian corporations under the Corporations Returns Act and on research using public sources such as internet sites. The data are presented in an easy-to-read tiered format, illustrating at a glance the hierarchy of subsidiaries within each corporate structure. The entries for each corporation provide both the country of control and the country of residence. The product covers every individual corporation that is part of a group of commonly controlled corporations with combined assets exceeding 600 million dollars or combined revenue exceeding 200 million dollars. Individual corporations with debt obligations or equity owing to non-residents exceeding a net book value of 1 million dollars are covered as well.

    Published on: 12 October 2017

    Permanent URL: http://hdl.handle.net/11272/10475

    The interprovincial and international trade flows dataset shows the origin and destination of trade flows by product among Canadian provinces and territories and from and to the rest of the world. The information is available at the four hierarchy levels (Detail, Link-1997, Link-1961 and Summary) of the Supply and Use Product Classification (SUPC). The data is provided in spreadsheet format for ease of use. Please note that the tables for 2010 and 2011 have been replaced, and new tables were created for 2012 and 2013. The data is now available at the detail level without any data suppressions.

    Published on: 12 October 2017

    Permanent URL: http://hdl.handle.net/11272/10510

    No abstract available.

    Published on: 12 October 2017

    Permanent URL: http://hdl.handle.net/11272/YG1DY

    The Labour Force Survey provides estimates of employment and unemployment which are among the most timely and important measures of performance of the Canadian economy. With the release of the survey results only 10 days after the completion of data collection, the LFS estimates are the first of the major monthly economic data series to be released. The Canadian Labour Force Survey was developed following the Second World War to satisfy a need for reliable and timely data on the labour market. Information was urgently required on the massive labour market changes involved in the transition from a war to a peace-time economy. The main objective of the LFS is to divide the working-age population into three mutually exclusive classifications - employed, unemployed, and not in the labour force - and to provide descriptive and explanatory data on each of these. LFS data are used to produce the well-known unemployment rate as well as other standard labour market indicators such as the employment rate and the participation rate. The LFS also provides employment estimates by industry, occupation, public and private sector, hours worked and much more, all cross-classifiable by a variety of demographic characteristics. Estimates are produced for Canada, the provinces, the territories and a large number of sub-provincial regions. For employees, wage rates, union status, job permanency and workplace size are also produced. For a full listing and description of LFS variables, see the Guide to the Labour Force Survey (71-543-G), available through the "Publications" link above. These data are used by different levels of government for evaluation and planning of employment programs in Canada. Regional unemployment rates are used by Employment and Social Development Canada to determine eligibility, level and duration of insurance benefits for persons living within a particular employment insurance region. 
The data are also used by labour market analysts, economists, consultants, planners, forecasters and academics in both the private and public sector.
Important note (4 August 2017): Labour Force Survey (LFS) data from January 2017 to July 2017 contained errors in numerical variables. Variables such as HRLYARN and UHRSMAIN were missing decimal place holders, so their values were off by a factor of 100. The issue has been addressed and the data for the year have been re-released.
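
The three mutually exclusive classifications described above (employed, unemployed, not in the labour force) are what the standard indicators are built from. The following sketch assumes the usual definitions: the unemployment rate is taken relative to the labour force, while the employment and participation rates are taken relative to the working-age population. The function name and the example counts are hypothetical; the Guide to the Labour Force Survey remains the authoritative source for definitions and weighting.

```python
from typing import Dict


def labour_market_rates(employed: float,
                        unemployed: float,
                        not_in_labour_force: float) -> Dict[str, float]:
    """Compute the three headline rates (in percent) from the three LFS classes.

    ASSUMPTION: inputs are (weighted) population counts for the working-age
    population, already split into the three mutually exclusive groups.
    """
    labour_force = employed + unemployed
    population = labour_force + not_in_labour_force
    return {
        # share of the labour force that is without work
        "unemployment_rate": 100.0 * unemployed / labour_force,
        # share of the working-age population that is employed
        "employment_rate": 100.0 * employed / population,
        # share of the working-age population that is in the labour force
        "participation_rate": 100.0 * labour_force / population,
    }


# Hypothetical counts, for illustration only.
rates = labour_market_rates(employed=900, unemployed=100, not_in_labour_force=1000)
```

Note that the unemployment rate and the employment rate use different denominators, so the two do not sum to 100.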

    Published on: 06 October 2017

    Permanent URL: http://hdl.handle.net/11272/10439