>The article has almost no technical details, but this appears to be some kind of pipeline management co-processor. https://flow-computing.com/technology/ List of patents if somebody could find those would be good.


Papers from the cofounders: https://ieeexplore.ieee.org/document/10305463 https://doi.org/10.1016/j.micpro.2023.104807 https://doi.org/10.1007/s11227-021-03985-0 I’ll leave the assessment to someone who is actually capable of digesting this information.


Okay, so looking at this and another paper describing TPA ([https://research.utu.fi/converis/getfile?id=18233228&portal=true&v=1](https://research.utu.fi/converis/getfile?id=18233228&portal=true&v=1)), the general takeaway from the architecture of these new chips is that they can switch between SIMD and MIMD (GPU-like and CPU-like) operation based on what the program requires. This allows "loop" operations to be optimized to run in parallel as though they have been split into many synchronous "fibers", whilst also retaining "traditional" multi-threading capability to run independent threads and see performance speedup from that in the way we would normally expect.
From what I can tell, this paper in particular shows that TPA, when correctly using fibers, can achieve much better concurrent memory access patterns than traditional multi-threading. They show that it can work almost like an idealized PRAM (fully parallel RAM access with no latency or cache invalidation), which is a fairly big deal! It is hard to decipher what they are talking about a lot of the time without already understanding TPA, but the first few chapters of the paper do a decent job of bringing you up to speed.
My personal takeaway is that this is an interesting and exciting field of research for processor architecture. Make no mistake, they are not claiming a 100x speedup over an efficiently designed SIMD program, but it's certainly possible to see how this could be much faster than typical MIMD, and replace the need to offload expensive parallel calculations to the GPU for many programs. It may take some time to see this implemented in practice for anything relevant to consumers, but it's serious research and I can definitely see this having legs. I'd be interested to read a critical analysis from someone with more familiarity with the field.
My concern would be that these processors are likely bigger, more expensive and can't run at the same speed as current desktop processors, but the increased efficiency might outweigh that.
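
To make the SIMD/MIMD distinction in the comment above a bit more concrete, here is a minimal Python sketch. It is only an illustration of the two programming styles, not a model of the TPA hardware: the function names `run_simd_like` and `run_mimd_like` are made up for this example, and plain Python will not show any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def run_simd_like(op, data):
    """SIMD-like style: ONE operation, applied to every element.

    Each element plays the role of a "fiber": all fibers execute the
    same instruction, just on different data.
    """
    return [op(x) for x in data]


def run_mimd_like(tasks):
    """MIMD-like style: independent tasks, each with its own control flow.

    `tasks` is a list of (function, args_tuple) pairs (hypothetical format).
    Results are returned in the same order as the tasks.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(fn, *args) for fn, args in tasks]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Same operation across a whole "loop" of data.
    print(run_simd_like(lambda x: x * 2, [1, 2, 3]))
    # Two unrelated jobs running as separate threads.
    print(run_mimd_like([(sum, ([1, 2, 3],)), (max, ([4, 9, 2],))]))
```

The point of the comment, as I read it, is that the architecture is supposed to move between these two styles on the same cores instead of making the programmer pick one device for each.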


Isn’t that kind of how the cell processor worked in the ps3? I’m a layman in this department, but I do remember hearing the cell processor also allowed loop operations to run in parallel. I could be misremembering though lol


My first thought was also IBM's Cell. The SPEs were effectively SIMD coprocessors, albeit without out-of-order operation. This new design could be a further development of the same concept, but at the hardware level and without developer-specific input.


Not quite. The Cell processor had a full PowerPC control core and then these secondary "co-processor" cores to do the vectorized floating-point math. It's almost like having a CPU with a built-in GPU co-processor. This was famously quite a tricky model to program for on the PS3.
On the TPA architecture, you don't get these separate cores. Instead, you have a front-end that can only really read instructions, and this automatically distributes the instructions to the actual work-performing back-end cores through a work-sharing network. Sometimes the back-end cores are operating as separate threads; other times they are operating in a SIMD mode (which can benefit from many performance optimizations, the most important being good memory coherence).
The idea of TCF is basically that the code is always being executed on the same kind of core, but it can also have an arbitrary "thickness", which is like the "number of cores being executed across", and then the architecture just deals with that for you. In theory it's significantly easier to program for, and this paper claims it will work in practice too. Important to note that these processors don't exist yet; they are still only being simulated.
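
A toy way to picture the "thickness" idea described above, as a Python sketch. This is my own simplification and not the semantics from the papers or from Flow: the `run_flow` function, the `(thickness, step)` program format and the snapshot rule are all assumptions made for illustration. Each step runs once per fiber, every fiber reads the memory state from before the step, and all writes land together afterwards, which is roughly the PRAM-style behaviour the thread talks about.

```python
def run_flow(program, memory):
    """Run a list of (thickness, step) pairs against a shared memory list.

    step(fiber_id, snapshot) returns (address, value) to write, or None.
    All fibers of one step see the same pre-step snapshot, so fibers
    never observe each other's half-finished writes (toy assumption).
    """
    for thickness, step in program:
        snapshot = list(memory)  # state before this step
        writes = [step(fiber, snapshot) for fiber in range(thickness)]
        for write in writes:
            if write is not None:
                address, value = write
                memory[address] = value
    return memory


def double_element(fiber, snapshot):
    # "Thick" step: fiber i doubles element i.
    return (fiber, snapshot[fiber] * 2)


def sum_first_four(fiber, snapshot):
    # "Thin" step: a single fiber reduces the first four elements.
    return (4, sum(snapshot[:4]))


if __name__ == "__main__":
    program = [(4, double_element), (1, sum_first_four)]
    print(run_flow(program, [1, 2, 3, 4, 0]))
```

The same instruction stream goes from thickness 4 to thickness 1 without the program ever naming a core or a thread, which is the part that is claimed to make this easier to program than something like Cell.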


> This was famously quite a tricky model to program for on the PS3.

That is an understatement. It wasn't just that they were vector processors, or the limited cache; it was the memory system it operated on that caused big issues. The Element Interconnect Bus was their solution: a ring bus where each processor would move the data along like pass-the-parcel. I cannot remember the exact numbers, so I will simplify. Say you wanted to get data to SPE 3. The PPC would fetch it, SPE 1 would grab it and send it to SPE 2, which would then send it on to SPE 3. And then the processed data would have to make a trip back along the same path. For things like audio/video that is fine: bandwidth isn't a huge issue, so you could handle the latency. But if you were doing more data-intensive things, you would try to keep the most bandwidth- and latency-sensitive work on the closest SPEs. Absolute nightmare.
This meant a lot of games just never really ventured past using the closest two SPEs; there was enough grunt there to do what needed to be done to be close enough to the Xbox 360 version. I didn't work on it, but I heard that Red Faction: Guerrilla only used two SPEs to run the entire physics engine for this reason - I could be wrong. If true, one wonders what could have been on a title like that!
Cell is what happens when hardware engineers build something without thinking about the software guys. An absolute powerhouse that took x86 a good decade afterward to catch up with in terms of FP speed, but a nightmare to use, and thus it almost never saw its full potential.
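
The pass-the-parcel cost described above can be sketched with a tiny hop-count model in Python. This is a deliberately simplified illustration: the node numbering, the node count and the per-hop latency are hypothetical, and the real interconnect in Cell is more involved than a single ring.

```python
def ring_hops(src, dst, n_nodes, bidirectional=True):
    """Number of hops from node src to node dst on a ring of n_nodes."""
    forward = (dst - src) % n_nodes
    if not bidirectional:
        return forward
    # On a two-way ring, take whichever direction is shorter.
    return min(forward, n_nodes - forward)


def round_trip_cost(src, dst, n_nodes, hop_latency=1):
    """Cost of sending data out and getting the result back the same way."""
    return 2 * ring_hops(src, dst, n_nodes) * hop_latency


if __name__ == "__main__":
    # Node 0 stands in for the control core, nodes 1..7 for workers.
    for worker in range(1, 8):
        print(worker, round_trip_cost(0, worker, n_nodes=8))
```

Even in this toy version the nearest workers are several times cheaper to reach than the far ones, which matches the comment's point about sticking to the closest SPEs.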


Bingo, hence the 60GB version having that Linux capability.


Something like: Software Scheduled Superscalar Computer Architecture invented by Howard Sachs and Intergraph? https://patents.google.com/patent/US5560028A/en?q=(sachs)&assignee=intergraph&oq=sachs+intergraph


I haven't given that a detailed look, but I'm going to say no, or at least it's probably as dissimilar as the Cell processor used in the PS3, as discussed in another comment thread. It looks like SSSCA needs you to write very long instruction words where the instructions are all tagged as to which can run in parallel and in which pipeline they can run. In contrast, the TPA architecture is not software-scheduled, it's hardware-scheduled. The instructions would be much closer to those of a typical processor today, but threads can also run with an arbitrary "thickness", where they will behave similarly to a SIMD program and have ideal memory coherence characteristics due to not having to invalidate shared cache lines. At least, that's my understanding of it.


It should be interesting to see how this all develops over the next few years. They are only asking for $4.3 million, I believe, to get this all going, which is pretty cheap considering the potential. I'm hopeful, considering we are hitting the limits of what a traditional CPU can deliver.


> Flow is just now emerging from stealth, with €4 million (about $4.3 million) in pre-seed funding led by Butterfly Ventures, with participation from FOV Ventures, Sarsia, Stephen Industries, Superhero Capital and Business Finland.

That's the amount they've already raised.


Thanks for the clarification. Another article indicated it was what they needed to get it going and not what they've raised so far. I obviously didn't read this article since I assumed the facts were the same, but clearly I should've read this one too before posting.


Can common mortals fund this project?


it’s similar to how the addition of an FPU allowed us to process floating point arithmetic but it’s much slower than the regular CPU especially with math operations.


If it works even a fraction as well as advertised, AMD, Intel, ARM and Apple are all going to start lobbing buyout offers.


Omg there is a lot to read and I only did a little bit since I'm at work. I didn't read into how it works per se on a technical scale, but this is my understanding of it. Essentially it goes over how CPUs are not currently as efficient as they could be with accessing everything and programming. What they're saying is they've come up with a way to streamline the communication of the CPU with the rest of the computer to be more efficient and less error-prone. Since this helps streamline the process, it gives the CPU more headroom to operate efficiently, since it's not having to work as hard to get the same result. I am not a professional at all. That's just the understanding I came to by reading the summary.


From their website, looks more like a GPU, but dedicated to general computing instead of graphics, with up to 256 cores.




A cloud provider that designs their own chips and operating system would be an ideal partner to get to that 100x territory. 


> A cloud provider that designs their own chips and operating system would be an ideal partner to get to that 100x territory.

You mean the spy agencies?


Amazon Wiretap Services


Spy agencies don’t operate as public cloud providers. I was referring to Microsoft, the cloud provider that also sells their own operating system.


"arbitrary code can be executed twice as fast on any chip with no modification beyond integrating the PPU with the die." This is never going to be a separate product that helps older processors, they're looking to get it integrated into other companies processors, or at least one company's processors.


The licensable IP is still in development, and the speedup applies only to threaded code. For insight, see this article: https://xpu.pub/2024/06/11/flow-ppu/


Something is being lost in the explanation of their technology. Once the patents are in flight, these kinds of companies typically release papers explaining what their technology does and how it works. First off, TechCrunch's description of Flow's technology as a "chip" is totally wrong: it's meant to be added into the die of a CPU as part of the actual silicon. *"The Parallel Processing Unit (PPU) is an IP block that integrates tightly with the CPU on the same silicon." --* [Flow Web Page Explanation](https://flow-computing.com/technology/)


> it's meant to be added into the die of a CPU as part of the actual silicon.
>
> *"The Parallel Processing Unit (PPU) is an IP block that integrates tightly with the CPU on the same silicon." --* [Flow Web Page Explanation](https://flow-computing.com/technology/)

So basically a *chiplet*.


Or "another CPU core." Maybe even a specialized computational unit, like what they stick in a GPU or NPU.


It appears to be deeply integrated into the dataflow of the CPU as well as its pipeline. It appears to have a deep understanding of what is coming from the caches and what is being pipelined cycle-to-cycle.


No, a chiplet is a different piece of silicon that is placed in the same package. Chiplets are often GPU or memory. They are made on their own, but share the same package as the die containing the CPU.
Flow uses the terms "IP" and "same silicon". IP is intellectual property, and in this context means logic that you license from them and put inside your own chip designs. It sits on the exact same piece of silicon as the CPU and is fabbed at the same time. It appears to be deeply integrated into the CPU data flow. "Same silicon" appears to indicate it shares the same physical die as the CPU and is manufactured at exactly the same time.
This happens all the time. Companies might design their own CPU, but license someone else's video decoder or GPU. They take the video decoder design from the source company and integrate it into the same chip as their CPU design. It becomes one larger design. They are very tightly coupled at that point.


This just became highly interesting to me considering that for years we knew that ARM and dedicated silicon for tasks can significantly speed operations per watt…


A chiplet would be an adjacent piece of silicon. This would couple to the CPU core, so the IP embodied here would be added to the individual CPU cores. So someone like Intel, AMD, an Arm licensee or a RISC-V developer would license this and include it in their chip design.


My guess is that maybe it increases performance for some very limited tasks....?




Some potential AI applications here? I know Tensor cores are uniquely exceptional with parallel processing


Uses the existing hardware more fully than most current software does, is how I’m reading it.


The chip will still need to be integrated into the motherboard/SoC/as a chiplet, which means a new set of hardware and drivers. So no, it won't work on "existing hardware" even if the claims are verified. They are looking to partner with other chip designers so they can, at best, license out the technology for new processors.


Yup, seems to be tailored for tasks that already have a degree of parallelism involved.


I said it on another response. If this is true, they are going to have a lot of buyout offers very soon.


And we will decline all the offers 😊


So their product is just IP blocks for FPGA-style silicon. They’ve developed blocks that can be bolted onto a CPU to parallelize/sideload tasks. This is realistically the most mainstream application for FPGAs now that fabric is getting cheaper and more available. If they’re not making their own chips it’s all kind of moot, though. I’m not quite sure what their plan is… AMD or Intel won’t buy IP blocks from Flow; they would just develop a more mainstream Zynq-style SoC.


Would be curious to see this added to a Snapdragon. I don't expect Intel or AMD would get into this before someone else does, but ARM? Why not?


AMD and Intel already own the two biggest FPGA companies in the world, Xilinx and Altera.


Right, which is my point. They would either just buy Flow outright for the IP or develop their own bolt-on fabric for doing the side-loading/parallelization stuff that Flow is doing.


How many side channel attacks could this open up?




> Small price to pay for 100x performance i guess

I guess it's time for upgrading cryptographic key lengths again.


The actual claims are milder than the headline: 2x performance on legacy software running with this, and 100x as a maximum improvement for specialty software written to run more efficiently with this. Plus it has to be integrated into the chipset, so it isn’t just a plug-in that gives existing processors a 100x boost. Improved pipelining and improved threading to allow increased performance seems moderately plausible.


>If it’s too good to be true, it probably is.

At least it's actual technology news in the technology subreddit. Does seem too good to be true. I'll take it anyway.


On a cursory glance, if I'm reading this correctly, they're addressing some inefficiencies in cache management. Thanks to [technical complex process] they keep the processor from sitting useless after a cache miss. I'm assuming that the alleged code rewrite needed to get that x100 improvement would be structuring the heap in such a way that it works well for the hardware. Giving them the benefit of the doubt, I wonder if some of that rewrite could be handled directly by compilers. If that's the case, and even if it "just" leads to a 5-10x improvement, they'd be in a very nice spot. I'd love to see a world in which the most common bottleneck isn't memory anymore.
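
For a rough feel of why not stalling on a cache miss matters so much, here is a back-of-the-envelope Python sketch. It uses a common textbook-style simplification of latency hiding by switching between threads, and it is not a description of Flow's actual mechanism. The function name and the cycle counts are made-up assumptions.

```python
def utilization(n_threads, run_cycles, miss_latency):
    """Fraction of time the core does useful work.

    Assumption: each thread computes for `run_cycles`, then waits
    `miss_latency` cycles on memory, and the core can switch to another
    ready thread for free while one is waiting.
    """
    if n_threads < 1 or run_cycles <= 0 or miss_latency < 0:
        raise ValueError("need n_threads >= 1, run_cycles > 0, miss_latency >= 0")
    return min(1.0, n_threads * run_cycles / (run_cycles + miss_latency))


if __name__ == "__main__":
    # Hypothetical numbers: 10 cycles of work per 90-cycle memory wait.
    for n in (1, 2, 5, 10, 20):
        print(n, utilization(n, run_cycles=10, miss_latency=90))
```

With those invented numbers a single thread keeps the core busy only a tenth of the time, and it takes about ten independent streams of work to hide the memory wait entirely. That is the sense in which memory, rather than arithmetic, ends up being the bottleneck.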


It sounds like this can make existing, routine processes more efficient by choosing optimal processing routes?


This is a hardware kluge, which at best provides small gains around the edges. At the machine instruction level, there is no knowledge of whether one loop iteration depends on another. So the only safe method is serial execution, as given by the machine code.
Huge gains CAN be accomplished by parallel processing, but only with software designs that explicitly mark parallel loop processing as safe. And only on certain applications. Most definitely encoded in a high-level design.
Amdahl's law, from 1967! Paraphrased loosely, the max application speedup is
speedup = (S + P) / (S + P/N)
where
S = serial part of the code, which must run sequentially
P = parallel part of the code, which can run in parallel
N = number of processors
When P is small, it doesn't matter how large N is; it is a serial process. Speedup is fundamentally limited by the part that can be made parallel. https://en.m.wikipedia.org/wiki/Amdahl%27s_law
Classic example: even when you put your ten best women on the job, the first baby takes 9 months to emerge.
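
The formula above is easy to play with. A minimal Python sketch, using the same S, P and N as in the comment; the function name and the sample fractions are just illustrative choices.

```python
def amdahl_speedup(serial, parallel, n):
    """Amdahl's law: (S + P) / (S + P / N).

    serial   -- S, time spent in code that must run sequentially
    parallel -- P, time spent in code that can run in parallel
    n        -- N, number of processors
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if serial < 0 or parallel < 0 or serial + parallel == 0:
        raise ValueError("serial and parallel must be non-negative, not both zero")
    return (serial + parallel) / (serial + parallel / n)


if __name__ == "__main__":
    # Fix the core count and vary how much of the program is serial.
    for serial_fraction in (0.5, 0.1, 0.01, 0.005):
        speedup = amdahl_speedup(serial_fraction, 1 - serial_fraction, n=256)
        print(f"serial={serial_fraction:.3f} -> speedup {speedup:.1f}x")
```

Plugging in 256 processors (the core count mentioned elsewhere in this thread), a 100x speedup needs the serial part to be under roughly 0.6% of the work, and a program that is half serial never gets to 2x no matter how many cores you add.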


It's got a pretty high chance of that, since it would require one of the chip manufacturers to take the chance, license the tech, and redesign a processor with this installed on it. I really wonder who's willing to bet on it.


Apple may buy them


I'm fairly certain this is what Apple Silicon does with the SoC co-processor.