Data

Table of Contents


Wiki WordNet ConceptNet FrameNet VerbNet PropBank TreeBank SyntaxNet SyntagNet OpenHowNet Framebase VerbAtlas HowNet BabelNet OpenCyc


Morfessor WordNet UniMorph MorphoLEX
MorphyNet


NLTK spaCy Stanza


ConvoKit


Data Search

  • To search for datasets

Dataset Search
https://datasetsearch.research.google.com/

Kaggle: Your Machine Learning and Data Science Community
https://www.kaggle.com/

Hugging Face – The AI community building the future.
https://huggingface.co/


Corpus

Google N-gram Corpus

Google Books Ngram Exports
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
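A minimal sketch of reading one line of an export file, assuming the v3 layout of an n-gram followed by tab-separated `year,match_count,volume_count` records. The sample line and the function name `parse_ngram_line` are illustrative, not taken from the dataset; check the layout against the export page before relying on it.

```python
def parse_ngram_line(line):
    """Split one export line into the n-gram and its per-year records.

    Assumed layout: ngram<TAB>year,match_count,volume_count<TAB>...
    """
    fields = line.rstrip("\n").split("\t")
    ngram, records = fields[0], fields[1:]
    rows = []
    for record in records:
        year, match_count, volume_count = (int(x) for x in record.split(","))
        rows.append((year, match_count, volume_count))
    return ngram, rows


# Made-up sample line for illustration only.
sample_line = "analysis\t1990,120,45\t1991,150,52\n"
ngram, rows = parse_ngram_line(sample_line)

# Total occurrences of the n-gram across all listed years.
total_matches = sum(match_count for _, match_count, _ in rows)
print(ngram, total_matches)
```

The export files are large, so in practice it is probably better to stream them line by line than to load a whole file into memory.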

Children Stories Text Corpus
Cleaned Gutenberg books (file: cleaned_merged_fairy_tales_without_eos.txt)
https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus

Index of /convokit/datasets/
https://zissou.infosci.cornell.edu/convokit/datasets/


Song Lyrics

Tamil Songs Lyrics Dataset: Lyrics from Tamil Movie Songs
https://listed.to/@prasanth/65025/tamil-songs-lyrics-dataset-lyrics-from-tamil-movie-songs

LRCLIB
https://lrclib.net/db-dumps

tranxuanthang/lrclib: LRCLIB server written in Rust with Axum and SQLite3 database
https://github.com/tranxuanthang/lrclib

Music Dataset: Lyrics and Metadata from 1950 to 2019 - Mendeley Data
https://data.mendeley.com/datasets/3t9vbwxgr5/2

Genius Song Lyrics
https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information

brunokreiner/genius-lyrics · Datasets at Hugging Face
https://huggingface.co/datasets/brunokreiner/genius-lyrics

Song Lyrics Dataset
https://www.kaggle.com/datasets/deepshah16/song-lyrics-dataset


Conversation

  • Conversation
  • Quotes
  • Dialogue

Index of /convokit/datasets/
https://zissou.infosci.cornell.edu/convokit/datasets/

ConvoKit
https://convokit.cornell.edu/

Cornell Movie-Dialogs Corpus
https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

Memorability (Cornell Movie-Quotes Corpus)
https://www.cs.cornell.edu/~cristian/memorability.html

https://www.cs.cornell.edu/~cristian/memorability_files/cornell_movie_quotes_corpus.zip

Datasets — convokit 3.4.1 documentation
https://convokit.cornell.edu/documentation/datasets.html

Quotes-500k
https://www.kaggle.com/datasets/manann/quotes-500k

Quotes Dataset
https://www.kaggle.com/datasets/akmittal/quotes-dataset

Abirate/english_quotes · Datasets at Hugging Face
https://huggingface.co/datasets/Abirate/english_quotes

jstet/quotes-500k · Datasets at Hugging Face
https://huggingface.co/datasets/jstet/quotes-500k


Word list


Usage Frequency

word-freq-top5000
https://github.com/filiph/english_words/blob/master/data/word-freq-top5000.csv

English Word Frequency
⅓ Million Most Frequent English Words on the Web
This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
https://www.kaggle.com/datasets/rtatman/english-word-frequency
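
A rough sketch of how a count table like this might be used, assuming the download is a CSV with `word` and `count` columns (the column names are an assumption; check the header of the actual file). The three rows are illustrative sample data, and the helper names are hypothetical.

```python
import csv
import io

# Illustrative sample in the assumed "word,count" layout; the real file is far larger.
SAMPLE_CSV = """word,count
the,23135851162
of,13151942776
and,12997637966
"""


def load_counts(handle):
    """Read word,count rows from an open text file into a dict of word -> int."""
    reader = csv.DictReader(handle)
    return {row["word"]: int(row["count"]) for row in reader}


def relative_frequencies(counts):
    """Convert raw counts to fractions of the total count in the table."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


counts = load_counts(io.StringIO(SAMPLE_CSV))
freqs = relative_frequencies(counts)

# The two most frequent words in the sample.
top = sorted(counts, key=counts.get, reverse=True)[:2]
print(top)
```

For the real file, replace the `io.StringIO(...)` handle with `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded CSV.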

English Word Frequency List
List of English word frequencies from the Google Books Ngram dataset.
https://www.kaggle.com/datasets/wheelercode/english-word-frequency-list

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
https://github.com/first20hours/google-10000-english

orgtre/google-books-ngram-frequency: Word/n-gram frequency lists for the Google Books Ngram Corpus (v3, all languages) with Python code
https://github.com/orgtre/google-books-ngram-frequency


Category

By Age

English age of acquisition (AoA) ratings (Kuperman et al., 2012)
https://osf.io/d7x6q/files/osfstorage
https://files.de-1.osf.io/v1/resources/d7x6q/providers/osfstorage/?zip=



NLTK
spaCy
Stanza
Morfessor

WordNet
UniMorph
MorphoLEX
MorphyNet



MorphoLex-en
Lexical database for ~70k English words with morphological variables
https://github.com/hugomailhot/MorphoLex-en


Universal Morphology (UniMorph)
Repositories for the Universal Morphology (UniMorph) project
https://github.com/unimorph/eng
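
A small sketch of grouping UniMorph entries by lemma, assuming the usual tab-separated layout of lemma, inflected form, and a semicolon-separated feature bundle. The sample lines and the function name `parse_unimorph` are illustrative; any extra columns are ignored here.

```python
from collections import defaultdict

# Illustrative lines in the assumed lemma<TAB>form<TAB>features layout.
SAMPLE = (
    "walk\twalked\tV;PST\n"
    "walk\twalking\tV;V.PTCP;PRS\n"
    "walk\twalks\tV;PRS;3;SG\n"
)


def parse_unimorph(text):
    """Group (form, feature list) pairs under their lemma."""
    paradigms = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between paradigms
        lemma, form, features = line.split("\t")[:3]
        paradigms[lemma].append((form, features.split(";")))
    return dict(paradigms)


paradigms = parse_unimorph(SAMPLE)
print(paradigms["walk"])
```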


English Lemma Database - Compiled by Referencing British National Corpus
https://github.com/skywind3000/lemma.en
https://github.com/skywind3000/lemma.en/blob/master/lemma.en.txt

Morphemic Segmentation of English Words
A dataset of English words and their morphemic segmentations

https://www.kaggle.com/datasets/thedevastator/morphemic-segmentation-of-english-words

morphemes
Common English morphemes, organized for automated access.
https://github.com/colingoldberg/morphemes/blob/master/data/morphemes.json

MorphyNet
MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology (+morpheme segmentation)
https://github.com/kbatsuren/MorphyNet
https://github.com/kbatsuren/MorphyNet/tree/main/eng


Linguistic Variation Interventions
https://github.com/aarsri/interventions-linguistic-variation/blob/main/README.md

Base code for the Tool for Automatic Measurement of Morphological Information (TAMMI).
https://github.com/scrosseye/tammi/blob/main/README.md

https://pypi.org/project/morphemes/
pip install morphemes

Anagram derivation finder
https://github.com/HappySeaFox/anagrams


Words and Dictionaries

https://listed.to/@prasanth/64046/


Semantic Web

Wiki WordNet ConceptNet FrameNet VerbNet PropBank TreeBank SyntaxNet SyntagNet OpenHowNet Framebase VerbAtlas HowNet BabelNet OpenCyc


The World Factbook by CIA
The World Factbook, the indispensable source for basic information.

https://www.kaggle.com/datasets/lucafrance/the-world-factbook-by-cia

