Data
September 16, 2025•542 words
Contents
- Data search
- Corpus
- Song lyrics
- Conversation
- [Word list](#word-list)
- Dictionary
- Semantic web
- Linked Data
- Graph and Glyph, Unicode
Wiki WordNet ConceptNet FrameNet VerbNet PropBank TreeBank SyntaxNet SyntagNet OpenHowNet Framebase VerbAtlas HowNet BabelNet OpenCyc
Morfessor WordNet UniMorph MorphoLEX
MorphyNet
NLTK spaCy Stanza
ConvoKit
Data Search
- To search for datasets
Dataset Search
https://datasetsearch.research.google.com/
Kaggle: Your Machine Learning and Data Science Community
https://www.kaggle.com/
Hugging Face – The AI community building the future.
https://huggingface.co/
Corpus
Google N-gram Corpus
Google Books Ngram Exports
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
Children Stories Text Corpus
Cleaned Gutenberg books
https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus
cleaned_merged_fairy_tales_without_eos.txt
Index of /convokit/datasets/
https://zissou.infosci.cornell.edu/convokit/datasets/
Song Lyrics
Tamil Songs Lyrics Dataset: Lyrics from Tamil Movie Songs
https://listed.to/@prasanth/65025/tamil-songs-lyrics-dataset-lyrics-from-tamil-movie-songs
LRCLIB
https://lrclib.net/db-dumps
tranxuanthang/lrclib: LRCLIB server written in Rust with Axum and SQLite3 database
https://github.com/tranxuanthang/lrclib
Music Dataset: Lyrics and Metadata from 1950 to 2019 - Mendeley Data
https://data.mendeley.com/datasets/3t9vbwxgr5/2
Genius Song Lyrics
https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information
brunokreiner/genius-lyrics · Datasets at Hugging Face
https://huggingface.co/datasets/brunokreiner/genius-lyrics
Song Lyrics Dataset
https://www.kaggle.com/datasets/deepshah16/song-lyrics-dataset
Conversation
- Conversation
- Quotes
- Dialogue
Index of /convokit/datasets/
https://zissou.infosci.cornell.edu/convokit/datasets/
ConvoKit
https://convokit.cornell.edu/
Cornell Movie-Dialogs Corpus
https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Memorability (Cornell Movie-Quotes Corpus)
https://www.cs.cornell.edu/~cristian/memorability.html
https://www.cs.cornell.edu/~cristian/memorability_files/cornell_movie_quotes_corpus.zip
Datasets — convokit 3.4.1 documentation
https://convokit.cornell.edu/documentation/datasets.html
Quotes- 500k
https://www.kaggle.com/datasets/manann/quotes-500k
Quotes Dataset
https://www.kaggle.com/datasets/akmittal/quotes-dataset
Abirate/englishquotes · Datasets at Hugging Face
https://huggingface.co/datasets/Abirate/englishquotes
jstet/quotes-500k · Datasets at Hugging Face
https://huggingface.co/datasets/jstet/quotes-500k
Word list
Usage frequency
word-freq-top5000
https://github.com/filiph/english_words/blob/master/data/word-freq-top5000.csv
English Word Frequency
⅓ Million Most Frequent English Words on the Web
This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
https://www.kaggle.com/datasets/rtatman/english-word-frequency
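A count list like this one is usually turned into relative frequencies before use. The following is a minimal sketch, assuming the file is a CSV with a `word,count` header (an assumption, check the downloaded file); the sample rows and the names `load_counts` and `relative_frequencies` are made up for illustration.

```python
import csv
import io

# Toy rows in the assumed `word,count` layout; the counts are invented.
SAMPLE = """word,count
the,300
of,150
and,50
"""


def load_counts(handle):
    """Read `word,count` rows into a dict of word -> integer count."""
    reader = csv.DictReader(handle)
    return {row["word"]: int(row["count"]) for row in reader}


def relative_frequencies(counts):
    """Convert raw counts into shares of the total count in `counts`."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


counts = load_counts(io.StringIO(SAMPLE))
freqs = relative_frequencies(counts)
print(freqs["the"])  # share of "the" among the sampled rows
```

For the real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample.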
English Word Frequency List
List of English word frequencies from the Google Books Ngram dataset.
https://www.kaggle.com/datasets/wheelercode/english-word-frequency-list
google-10000-english: a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.
https://github.com/first20hours/google-10000-english
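Because the list is ordered by frequency, a word's position can serve as its rank. A small sketch, assuming the file holds one word per line with the most frequent word first (an assumption about the layout); the sample words and the helpers `load_ranks` and `is_common` are hypothetical.

```python
import io

# Toy stand-in for the list: one word per line, most frequent first (assumed layout).
SAMPLE = "the\nof\nand\nto\n"


def load_ranks(handle):
    """Map each word to its 1-based rank, keeping the first occurrence only."""
    ranks = {}
    for line in handle:
        word = line.strip()
        if word and word not in ranks:
            ranks[word] = len(ranks) + 1
    return ranks


def is_common(word, ranks, cutoff):
    """Return True if `word` is ranked within the top `cutoff` entries."""
    rank = ranks.get(word.lower())
    return rank is not None and rank <= cutoff


ranks = load_ranks(io.StringIO(SAMPLE))
print(is_common("and", ranks, 3), is_common("zebra", ranks, 3))
```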
orgtre/google-books-ngram-frequency: Word/n-gram frequency lists for the Google Books Ngram Corpus (v3, all languages) with Python code
https://github.com/orgtre/google-books-ngram-frequency
Category
By age
English age of acquisition (AoA) ratings (Kuperman et al., 2012)
https://osf.io/d7x6q/files/osfstorage
https://files.de-1.osf.io/v1/resources/d7x6q/providers/osfstorage/?zip=
NLTK
spaCy
Stanza
Morfessor
WordNet
UniMorph
MorphoLEX
MorphyNet
MorphoLex-en
Lexical database for ~70k English words with morphological variables
https://github.com/hugomailhot/MorphoLex-en
Universal Morphology (UniMorph)
Repositories for the Universal Morphology (UniMorph) project
https://github.com/unimorph/eng
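UniMorph data is typically distributed as tab-separated rows of lemma, inflected form, and a semicolon-separated feature bundle. A rough sketch for grouping such rows by lemma, assuming that three-column layout; the sample rows and the function name `load_paradigms` are illustrative only.

```python
import io
from collections import defaultdict

# Toy rows in the assumed layout: lemma <TAB> form <TAB> features.
SAMPLE = (
    "walk\twalked\tV;PST\n"
    "walk\twalking\tV;V.PTCP;PRS\n"
    "walk\twalks\tV;PRS;3;SG\n"
)


def load_paradigms(handle):
    """Group rows by lemma as a list of (form, [feature, ...]) pairs."""
    paradigms = defaultdict(list)
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        lemma, form, features = parts[:3]
        paradigms[lemma].append((form, features.split(";")))
    return dict(paradigms)


paradigms = load_paradigms(io.StringIO(SAMPLE))
for form, features in paradigms["walk"]:
    print(form, features)
```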
English Lemma Database - Compiled by Referencing British National Corpus
https://github.com/skywind3000/lemma.en
https://github.com/skywind3000/lemma.en/blob/master/lemma.en.txt
Morphemic Segmentation of English Words
A dataset of English words and their morphemic segmentations
https://www.kaggle.com/datasets/thedevastator/morphemic-segmentation-of-english-words
morphemes
Common English morphemes, organized for automated access.
https://github.com/colingoldberg/morphemes/blob/master/data/morphemes.json
MorphyNet
MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology (+morpheme segmentation)
https://github.com/kbatsuren/MorphyNet
https://github.com/kbatsuren/MorphyNet/tree/main/eng
Linguistic Variation Interventions
https://github.com/aarsri/interventions-linguistic-variation/blob/main/README.md
Base code for the Tool for Automatic Measurement of Morphological Information (TAMMI).
https://github.com/scrosseye/tammi/blob/main/README.md
https://pypi.org/project/morphemes/
pip install morphemes
Anagram derivation finder
https://github.com/HappySeaFox/anagrams
Words and Dictionaries
https://listed.to/@prasanth/64046/
Semantic Web
Wiki WordNet ConceptNet FrameNet VerbNet PropBank TreeBank SyntaxNet SyntagNet OpenHowNet Framebase VerbAtlas HowNet BabelNet OpenCyc
The World Factbook by CIA
The World Factbook, the indispensable source for basic information.
https://www.kaggle.com/datasets/lucafrance/the-world-factbook-by-cia