
It's always interesting to see new people approach the summarization problem, but I find these summaries have the defects common to automatic keyphrase-extraction summaries: they feel very artificial and are usually not accurate. The summary of "four steps to Google" is a good example.

I hope this kind of technology comes to fruition, but I'm very skeptical about it working on general-purpose content rather than just structured content such as news, as it does today.



As the author of TextTeaser noted, there are two approaches to automatic summarization: abstraction and extraction.

Abstraction combines huge portions of two young research fields -- NLP and NLG (Natural Language Processing and Generation). NLG is even harder than NLP, and less researched. Without a good NLG algorithm for presenting the summary, you can't produce more human-like summaries.

Extraction simply takes sentences (or portions of them), ranks them, and presents the few best results.

Two years ago, I attended a PhD presentation on text summarization. There I figured out that you can build a fair summarization algorithm in a few hours. Here is a prototype: https://bitbucket.org/ivan444/textsum/src/1d09b0f4f72a60903d... It's dirty prototype code; it took me only about 10 hours of work to prepare the dataset, design the algorithm, write the program, and tune it. (This works only for Croatian; if you want another language, you'll need a list of function words for that language -- http://en.wikipedia.org/wiki/Function_word .) There is also a Java version of the text summarizer (somewhere in the repository) and a simple tool to extract clean, article-only text from any page containing longer texts (it isn't well tuned; I didn't spend more than an hour of work on it, so I don't expect it to work well).

The algorithm is simple: (1) break the text into sentences, (2) extract features, (3) compute the feature scores and sum them, (4) present the ranked sentences (and, later, choose the few best).
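Those four steps can be sketched in a few dozen lines. This is a minimal illustration, not the prototype's actual code: the feature set is simplified, the English function-word list and all names are my own placeholders, and the sentence splitter is deliberately naive.

```python
import re

# Illustrative function-word list (English, for demonstration only;
# the prototype uses a ~700-entry Croatian list from fwords.txt).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "is", "it", "to"}

def split_sentences(text):
    # (1) Naive split on terminal punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence, index):
    # (2)-(3) Extract a few simple features and sum their scores.
    words = re.findall(r"\w+", sentence.lower())
    if not words:
        return 0.0
    length_score = min(len(words) / 20.0, 1.0)        # normalized word count
    position_score = 1.0 / (index + 1)                # first-sentence boost
    content_ratio = sum(w not in FUNCTION_WORDS for w in words) / len(words)
    return length_score + position_score + content_ratio

def summarize(text, n=2):
    sentences = split_sentences(text)
    ranked = sorted(
        enumerate(sentences),
        key=lambda pair: score_sentence(pair[1], pair[0]),
        reverse=True,
    )
    # (4) Keep the n best sentences, restored to document order.
    chosen = sorted(ranked[:n], key=lambda pair: pair[0])
    return " ".join(s for _, s in chosen)
```

Because of the position boost, the opening sentence almost always survives into the summary, which matches the intuition that leads usually carry the main point.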

Features used: normalized number of words; type of sentence (declarative, interrogative, exclamatory); order score (give the first sentence a boost, as the first sentence is usually the most important one); ratio between the number of function words and all words (function words are words without semantic content; fwords.txt in the repository contains ~700 Croatian function words); and the normalized sum of the three minimum TF-IDF scores (treating each sentence as a document).
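The last feature is the least obvious one: each sentence is treated as its own "document" when computing TF-IDF, and the sum of its three lowest word scores is a rough proxy for how much common, low-information vocabulary it contains. A sketch of that feature, under those assumptions (the function name and tokenization are mine, not the prototype's):

```python
import math
import re
from collections import Counter

def min_tfidf_feature(sentences, k=3):
    """For each sentence (treated as one document), return the
    normalized sum of its k lowest TF-IDF word scores."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n_docs = len(tokenized)
    # Document frequency: in how many sentences does each word appear?
    df = Counter(w for words in tokenized for w in set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)
        tfidf = [
            (tf[w] / len(words)) * math.log(n_docs / df[w])
            for w in set(words)
        ]
        lowest = sorted(tfidf)[:k]
        scores.append(sum(lowest) / k if lowest else 0.0)
    return scores
```

A word appearing in every sentence gets IDF log(1) = 0, so sentences built mostly from corpus-wide vocabulary score near zero on this feature.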

I don't know the state of the code (it's more than a year old), but anyone is free to use it for anything they like.


As a PhD graduate in NLG, I wouldn't say NLG is a "young" research field. For example, the oldest NLG book I have is Eduard Hovy's PhD work on the PAULINE system ("Generating Natural Language Under Pragmatic Constraints"), published back in 1988. The seminal reference book for NLG ("Building Natural Language Generation Systems") was published in 2000. What has made NLG more interesting recently is that the computing environment has changed considerably: we have much larger pools of time-series data than were available in the past, and we now also have a standardised data-to-text pipeline architecture for creating NLG applications for such data.

Nevertheless, I do agree there are still considerable challenges in text-to-text generation, which involves combining NLP and NLG to abstract, interpret, and then summarise unstructured free text.



