Hacker Times | _hzw's comments

I'd love to see a monolingual Japanese model sometime in the future. Qwen3-tts works for Japanese in general, but from time to time it mixes in some Mandarin, which makes it unusable.


Our next model (ETA 3ish weeks) will support Japanese. We'd love to get your feedback on the quality then. Can you share what use case you have in mind? We'd love to support it.


I have a pipeline of jp epub>m4b, just need to swap tts models in between :)


You could try a preprocessing step where you convert to hiragana, but I guess that would lose pitch accent information (e.g. 飴 vs 雨)
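A minimal sketch of why that step loses information, assuming a tiny hand-written reading table (`READINGS` and `to_hiragana` are hypothetical names, not part of any real TTS pipeline; a real converter would use a morphological analyzer). Both words collapse to the same kana string, so the pitch pattern stored next to each entry has nowhere to go:

```python
# Toy reading table: a hypothetical stand-in for a real morphological analyzer.
# Each entry maps a kanji word to (hiragana reading, Tokyo-dialect pitch pattern),
# where H and L mark high and low morae.
READINGS = {
    "雨": ("あめ", "HL"),  # rain
    "飴": ("あめ", "LH"),  # candy
}


def to_hiragana(text: str) -> str:
    """Replace every known kanji with its reading; the pitch pattern is dropped."""
    return "".join(READINGS[ch][0] if ch in READINGS else ch for ch in text)


# Both words produce the same output, so a TTS model fed this text
# cannot tell which pitch pattern was intended.
print(to_hiragana("雨"), to_hiragana("飴"))
```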


Exactly. Qwen only has one pitch accent for pure hiragana words. The conversion does actually work (it removes the mixed-in Mandarin), though it takes considerable effort to normalize the text and disambiguate heteronyms. The result, if you use voice cloning, is your favorite CV speaking in some weird, unknown accent :)


That got me wondering whether "you convert to hiragana" is a solved task or a "research team and five years" problem[0]. Google showed me an article[1] that made me facepalm. Quoting it via Google Translate (square brackets are mine):

  > - As a result,
  >   - When the string "明日["tomorrow"]" is entered into TTS, the TTS model [・皿・] outputs an ambiguous pronunciation that sounds like a mix of "asu" and "ashita" (something like "[asyeta]").

  > From this, we found that by using the proposed method, it is possible to obtain data from private data in which the consistency between speech, graphemes, and phonemes is almost certainly maintained for more than 80% of the total.

  > Another possible cause is a mismatch between the domain of the training data's audio (all [in read-aloud tones]) and the inference domain.
My resulting ramblings follow:

  1. Sounds like the general state of Japanese speech datasets is a mess
    1.1. they don't maintain a useful correspondence between symbols and audio
    1.2. they tend to contain too many "transatlantic" voices and too little casual speech
  2. Japanese speakers generally don't annotate pronunciations in text
    2.1. therefore web crawls might not contain enough information about how the text is actually pronounced
    2.2. (potentially) there could be some texts that don't map to pronunciations
    2.3. (potentially) maybe the spoken and written Japanese languages are still a bit divergent from each other
  3. The situation for Chinese/Sinitic languages is likely __nowhere__ near as absurd, and so Chinese STT/TTS might not be well equipped to deal with this mess
  4. This feels like a much deeper mess than the commonly observed "a cloud in a sky" Japanese TTS problems, such as obvious basic alignment errors (e.g. pronouncing "potatoes" as "tato chi")
---

  0: https://xkcd.com/1425/
  1: https://zenn.dev/parakeet_tech/articles/2591e71094ea58
  2: https://qiita.com/maishikawa/items/dcadfeebf693080f0415


An (un)obvious connection between Eraserhead and Bloodborne (spoiler!):

https://www.reddit.com/r/bloodborne/comments/xgu21c/eraserhe...


Apple Books has some rare exclusives. To name one, all of Norman Berrow's novels are only available there.


x


Tangent. Yesterday I tried Gemini Ultra with a Django template question (HTML + Bootstrap v5 related), and here's its totally unrelated answer:

> Elections are a complex topic with fast-changing information. To make sure you have the latest and most accurate information, try Google Search.

I know how to do it myself, I just want to see if Gemini can solve it. And it did (or didn't?) disappoint me.

Links: https://g.co/gemini/share/fe710b6dfc95

And ChatGPT's: https://chat.openai.com/share/e8f6d571-127d-46e7-9826-015ec3...


I've seen multiple people get that exact denial response on prompts that don't mention elections in any way. I think they tried to make it avoid ever answering a question about a current election and were so aggressive it bled into everything.


They probably have a basic "election detector" which might just be a keyword matcher, and if it matches either the query or the response they give back this canned string.

For example, maybe it looks for the word "vote", yet the response contained "There are many ways to do this, but I'd vote to use django directly".
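A minimal sketch of that kind of filter, purely as a guess at the mechanism: the keyword set and the function names are hypothetical, and nothing here is known about what Gemini actually runs. It checks both the query and the draft response, and swaps in the canned string on any hit:

```python
import re

# Hypothetical keyword list; the real one (if any) is unknown.
ELECTION_KEYWORDS = {"election", "elections", "vote", "ballot"}

CANNED_REPLY = (
    "Elections are a complex topic with fast-changing information. "
    "To make sure you have the latest and most accurate information, "
    "try Google Search."
)


def mentions_election(text: str) -> bool:
    """True if any whole word in the text is on the keyword list."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(word in ELECTION_KEYWORDS for word in words)


def guarded_reply(query: str, draft_response: str) -> str:
    """Return the draft unless the query or the draft trips the keyword filter."""
    if mentions_election(query) or mentions_election(draft_response):
        return CANNED_REPLY
    return draft_response
```

With this sketch, a Django question whose draft answer happens to say "I'd vote to use django directly" gets replaced by the canned string, even though the query never mentions elections.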


I'm pretty certain that there is a layer before the LLM that just checks to see if the embedding of the query is near "election", because I was getting this canned response to several queries that were not about elections, but I could imagine them being close in embedding space. And it was always the same canned response. I could follow up saying it has nothing to do with the election and the LLM would respond correctly.
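A toy sketch of what such a pre-LLM layer could look like. Everything here is an assumption for illustration: the 3-dimensional vectors are hand-made rather than produced by a real embedding model, and the threshold and names are hypothetical.

```python
import math

# Hand-made "embeddings" (hypothetical values, illustration only).
# Axes: [civic/voting, web development, cooking]
TOY_EMBEDDINGS = {
    "election": [1.0, 0.0, 0.0],
    "polls app in django": [0.7, 0.7, 0.0],
    "bootstrap navbar": [0.0, 1.0, 0.0],
}

THRESHOLD = 0.6  # hypothetical cutoff


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_blocked(query: str) -> bool:
    """Block the query if its vector is close enough to the 'election' vector."""
    score = cosine(TOY_EMBEDDINGS[query], TOY_EMBEDDINGS["election"])
    return score >= THRESHOLD
```

In this toy setup the "polls app in django" query lands close enough to "election" to be blocked, while "bootstrap navbar" does not, which is the kind of near-miss described above.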

I'm guessing Google really just wants to keep Gemini away from any kind of election information for PR reasons. Not hard to imagine how it could be a PR headache.


I wonder if it had to do with the Django hello-world example app being called "Polls"

https://docs.djangoproject.com/en/5.0/intro/tutorial01/#crea...


If that was the reason, Gemini must have been doing some very convoluted reasoning...


It's not reasoning, it's fuzzy-matching in an extremely high dimensional space. So things get weird.


I think the only difference between what you described and how we reason is the amount of fuzziness.


I asked it to write code for a React notes project and it gave me the same response. Bizarre and embarrassing.


"Parents" and "center" maybe? Weird.


It gave me this response when I simply asked who a current US congressman was.


> try Google Search.

Anti-trust issue right there.


It's only the other way around, no? Abusing your monopoly position in one area to advance your product in another is wrong, but I don't see a clear issue in the other direction.


It's interesting to note that you consider having children a selfish choice, because in some cultures, not having children is considered selfish.


Cultural anthropology might have an answer there. I guess that some cultures and traditions view not having kids as selfish because there's a cultural push for preserving society (kids will become workers, help parents, pay taxes, fight in the army, and so on). I consider myself selfish because I've done it to fill my life with something I felt was lacking. I've no rational excuse.


I think both options are selfish.

You don't do it or not do it for humanity, society or whatever random shit you make up.


If both are selfish, then neither is selfish. You conveniently deconstructed the concept of selfishness and I like it a lot.


I live in Germany. After dual-using Google Maps and Apple Maps for a year, I've finally uninstalled Google Maps and now use only Apple Maps.


Try OsmAnd. It's based on OpenStreetMap data and you can download the maps, so it also works with no internet.

In Germany it should be very good.


I resonate with the top comment on Tildes that suggests we need a singular conglomerate to cover all media types. We actually had one from China, douban.com, but the situation of censorship there makes it unusable. There are also a few new alternatives, such as nicedb.org and its fork, neodb.social, but I find them somehow lacking.


Does anyone remember scriptogr.am? It was a similar idea, but it's long gone now.

https://hn.algolia.com/?q=scriptogr.am


As the saying goes, "no news is good news".


Once upon a time, if the BBC didn't have anything else to report, they'd play music for the rest of the news time slot.

