SparkFun Forums 

Where electronics enthusiasts find answers.

Have questions about a SparkFun product or board? This is the place to be.
By silic0re
Hi everyone,

I've just finished reading a book entitled "The Connection Machine" -- it's about a massively parallel machine built in the 1980s consisting of 65,536 single-bit processors organized into a roughly 1 m cube. The architecture was designed by an MIT graduate student who wanted both to model intelligence at a level not yet attained, and to design an architecture that departed from the traditional von Neumann design, which separates memory and processing resources into distinct areas with a very low-bandwidth connection between the two. As the author notes, while a modern processor may have a huge number of gates, at any given time only a very small fraction of the gates devoted to 'processing' are active, and many orders of magnitude fewer still are active in the memory portion -- only a few bytes out of many kilobytes or megabytes.

The Connection Machine was designed to be something much different -- a huge number of extremely simple processors with very little memory each, connected by a flexible network topology so that the structure of the network could be modified to resemble the physical or structural arrangement of a problem -- individual processors representing individual neurons in a neural network, or individual particles in fluid dynamics or cellular automata. From what I can understand from the book (which was based on Hillis' dissertation), each processor is essentially a simple 1-bit arithmetic logic unit, and every processor receives the same instructions (with the option not to execute a given instruction), which are carried out on the data in its very small local memory (~4 Kbit). With 64K processors, one essentially has 65,536 very simple arithmetic logic units acting in parallel on a huge amount of data! To picture this in a conceptually different way, I sometimes think of it as though someone had taken a traditional array of memory and added a very simple processor to arbitrarily small chunks of it. As a result, a far greater fraction of the gates are active at any given time, and a great deal of data can move through the machine.
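To make that broadcast-instruction model concrete, here's a toy C sketch of a SIMD array of 1-bit ALUs -- my own illustration, not actual Connection Machine microcode. Each processor has a context flag that lets it sit out an instruction, just as described above:

```c
#include <stdint.h>

/* Each processor: a few 1-bit registers plus a context flag that lets it
   ignore a broadcast instruction (the conditional-execution idea above).
   The real machine had 65,536 of these; here we model a handful. */
struct proc { uint8_t a, b, carry, flag; };

/* Broadcast one full-adder step to every processor: a = a XOR b XOR carry,
   with the carry updated. Processors with flag == 0 sit this one out. */
static void broadcast_add(struct proc p[], int n) {
    for (int i = 0; i < n; i++) {
        if (!p[i].flag) continue;
        uint8_t sum = (uint8_t)(p[i].a ^ p[i].b ^ p[i].carry);
        p[i].carry = (uint8_t)((p[i].a & p[i].b) | (p[i].carry & (p[i].a ^ p[i].b)));
        p[i].a = sum;
    }
}
```

In the real machine one such broadcast would hit all 64K ALUs at once, which is where the parallelism comes from.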

I am quite fascinated by this architecture, and it reminds me a lot of modern FPGA stuff. It also has me thinking about either emulating these ideas in a cluster of microcontrollers (for instance, PICs), or building a small, interesting, fun cluster of PICs (or some other microcontroller) to sit on a desk in a corner and chug away at problems. Does this idea fascinate anyone else? Is anyone familiar with efforts to make computing clusters out of microcontrollers, and with their conceptual design or performance? Some initial googling hasn't turned up anything promising -- it seems like a relatively unexplored problem.
By KamPutty
I am very interested. I've been playing with the PIC10F206 for some time -- only 6 pins, 4 data lines. Very small! I was also thinking about hooking many of them up.

Modeling a 1-bit one using transistors would be nice, too...

Well, nothing more to add...just saying "me too"

~Kam (^8*
By Philba
There's a reason that type of machine hasn't caught on -- no one has figured out how to program it effectively. There just aren't enough massively parallel algorithms out there, and interprocessor communication becomes the defining bottleneck. On top of that, Moore's law has continued unabated. Maybe it's slacking off now, but by the time researchers finished off a massively parallel machine, the next-generation single processor would be out and run rings around it.

Sorry to be a naysayer, but lots of people have traveled down that road.
By samcheetah

Actually, the multicore concept has now been adopted by the PC industry, and Intel and AMD are trying to get ahead of each other in this race. The concept isn't new for us embedded system designers, because we have been doing it for years, but now the whole PC industry is after it, and I believe this will push the development of parallel processing algorithms.

And I don't think that by the time the multicore processors arrive, some single core will beat them. I think it's the end of single-core processors.

Just my 2 cents
By SOI_Sentinel
Note how multicore is built: highly specialized data busses on-chip, and for chip-to-chip links things like HyperTransport and high-speed serial. Even with the ultra-small design laid out above, the issue is getting the data to the processors. The faster and more seamlessly you can get the instructions there, the better it'll be.

So, how to do this in a microcontroller? You might be able to convince an ARM chip to do it. Each chip would require 1-4 hardware SPI interfaces -- 4 would be ideal -- and each would have to be serviceable by DMA at full speed. That would let the bulk of your processing power go to processing, not bit-banging interfaces (sorry, that's the issue with using small PICs). Even with just hardware SPI, you'd lose so much time handling SPI interrupt requests it wouldn't be funny.

One chip that could do this is made by TI: ... 1b768.html

Given that ARMs see their data/code as a single block, you could packetize the main executable and put a "send along" code wrapper around it. This would let you actually execute a native machine-code wrapper instead of trying to translate it the way modern TCP/IP does. Might prove interesting.

I know that some dsPIC33s (and PIC24s?) could do this, since some models have dual hardware SPI and DMA. Two data lines, a clock line, and an alert/data-direction line would probably suffice, or maybe two to keep it unidirectional.

Parallel interfaces could be implemented too, for even higher bandwidth, but they would need more thought -- and more I/O lines.
By Philba
Multicore is a far, far cry from the Connection Machine. Multiprocessors have been around since the 1950s, and the issues of dealing with a couple of processors are child's play compared to tens, hundreds, or thousands. In general, 2 processors can perform 1.4 to 1.7 times the work of 1. With 3 it gets worse, and so on, because communication and synchronization overhead take time. For example, how do you do disk I/O from multiple processors? One of them has to wait. Structuring your application to take this into account helps a lot, but you will never get to 100% efficiency.

I'm not trying to brag here, but I designed and implemented operating systems and systems software (including multiprocessors) for 25 years. There is no free lunch.
By SOI_Sentinel
OK, you got me there.

I'm mostly worried about communications overhead in emulating the Connection Machine. There's no little, cheap processor with the comm hardware I'd want, freeing it to spin its little ALU the way I'd like. I have the same issue with the Propeller multicore: emulating a peripheral in software may provide flexibility, but I see so much CPU time going up in smoke where a little logic would provide a lower-power, faster interface.
By Philba
I think the place where something like the Propeller might be a real benefit is in creating customized I/O channels with simple, well-defined interfaces -- make one a UART, one I2C, one an Ethernet stack, one sensor conditioning, and so on. Writing multiprocessor apps is pretty tricky, and debugging can be challenging -- lots of potential race conditions and so on.
By silic0re
Hi there,

I've been giving this idea some thought over the last few weeks, and it occurred to me a few days ago that the majority of the complexity in machines with large numbers of processors (such as the Connection Machine) lies in designing an architecture with sufficient communications performance for the set of problems it aims to solve. Since the mandate of the Connection Machine research program was (in part) to solve problems involving a large number of operations that depend heavily on data from operations completed by other processors (such as neural networks, and simulating cellular automata and physical systems -- e.g. fluid dynamics), the Connection Machine required a creative and nimble method of interprocessor communication (and such an elegant method was indeed created).

So then, it also occurred to me that if one restricts oneself to the subset of problems requiring relatively little interprocessor communication, designing a system -- in this case, a cluster of microcontrollers -- becomes significantly simpler. Here I should probably digress for a moment and mention both what I've found of other people's attempts to build a cluster of microcontrollers, and what I mean by 'problems that require relatively low interprocessor communications'. The only other microcontroller cluster of interest I was able to locate was developed as a fourth-year project (I think) by a group in an engineering discipline who built a cluster of PICs, though the class website contained precious little information and I have since lost the link (and can't find it again on Google). The problem they had decided to solve involved rendering images of fractals: each processor would render a small subset of the image, and in turn would send this subset to a host machine that assembled the entire image. This is an example of a problem requiring low interprocessor communication: a given processor is given a small, computationally intensive piece of a problem to solve (one that doesn't rely on information from other processors), and when it's done it sends its results back.

Another example of this class of problem could be computing a list of primes (or checking a single large number for primality). One might send each processor an integer to check, and when the processor returned its result (after some time), it could be sent another, and so on. I would be particularly interested if anyone has other ideas of this sort that would be well suited to low-interprocessor-communication problems.

Grounding this idea of relatively low interprocessor communication in the realm of a cluster of microcontrollers, I wonder if one could use a simple multiple-device interface (e.g. I²C -- one shared data line, one shared clock) to connect multiple microcontrollers, then write some conceptually simple test program (for instance, the list-of-primes program: one host serving the to-be-verified numbers, many nodes verifying the primality of a given number) to see if the idea is sound.
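As a sketch of how that test program might be structured (the function names and the dispatch loop are my own; real I²C transfers are replaced by direct calls here):

```c
#include <stdint.h>

/* Worker-side routine: a node receives one candidate over the bus and
   reports whether it's prime. Trial division is simple enough for a PIC. */
static int is_prime(uint32_t n) {
    if (n < 2) return 0;
    for (uint32_t d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Host-side loop: hand candidates out to the nodes and tally the results.
   Here every "node" just runs locally; a real host would address each
   worker over I2C and wait for its reply. */
static int count_primes(uint32_t lo, uint32_t hi) {
    int found = 0;
    for (uint32_t n = lo; n <= hi; n++)
        found += is_prime(n);
    return found;
}
```

The worker code is identical on every node; only the bus address would differ.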

That's where I'm at right now -- I've been looking at all the different models of PICs to find ones that combine speed and memory with a relatively small size/low pin count (and inexpensive is good, too). I've also begun reading up on implementing simple communications protocols like I²C with PICs -- I know that I²C's 7-bit addressing supports up to 128 devices, but in most of the examples I've found it's simply used as an interface between two devices: usually a PIC and a sensor or an EEPROM. I'm not sure if it becomes difficult to implement with more than two devices.
By silic0re
a bit of an update:

I've looked over most of the PICs and dsPICs on Microchip's site, and tried to find the ones that best satisfy a few criteria I think are important for the 'building your first supercomputer' project. :) Here they are for the ideal case:

- ideally, a large amount of program memory
- large amount of RAM
- hardware-supported communications protocol (like I²C) so that the processor can spend more time processing and less time worrying about the low-level specifics of communicating
- processor should be as fast as possible
- ideally, some math functions supported in hardware (such as multiply, divide, and shifts)
- low pin count, easily manageable package (eg. DIP)
- low cost so that many can be combined
- relatively easy to program with a low initial investment in programming hardware

Some of these criteria relate to the performance and capabilities of the cluster (such as program memory size, RAM, communications, speed, and math support), whereas others relate to the practicalities of building it (such as the package, cost, and programming hardware).

The microcontroller that so far tops my list is the dsPIC30F3012, a 30 MIPS (max) microcontroller with 24K of flash and 2K of RAM (plus 1K of E²PROM), with onboard I²C support (both the 7- and 10-bit addressing versions) in an easy-to-use 18-pin DIP package. It also includes a 16-bit hardware multiplier (and divider) that supports both integer and fractional representations, which is really handy. (I hope that with some register shifting one could fairly easily extend this to a 32-bit multiply; it would just take a few clock cycles instead of one.) On top of all that there are some timers and ADC features, but I don't really envision using the ADC capabilities unless there are thoughts of actually acquiring data sometime down the road.
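That register-shifting idea is just schoolbook multiplication on 16-bit halves. A quick C sketch of the four partial products, with `mul16` standing in for the single-cycle hardware multiply:

```c
#include <stdint.h>

/* Stand-in for the dsPIC's 16x16 -> 32-bit hardware multiply. */
static uint32_t mul16(uint16_t a, uint16_t b) { return (uint32_t)a * b; }

/* 32x32 -> 64-bit multiply built from four 16x16 partial products,
   shifted into place -- a few cycles on hardware that multiplies
   16 bits at a time. */
static uint64_t mul32(uint32_t a, uint32_t b) {
    uint16_t al = (uint16_t)a, ah = (uint16_t)(a >> 16);
    uint16_t bl = (uint16_t)b, bh = (uint16_t)(b >> 16);
    uint64_t lo   = mul16(al, bl);
    uint64_t mid1 = mul16(al, bh);
    uint64_t mid2 = mul16(ah, bl);
    uint64_t hi   = mul16(ah, bh);
    return lo + (mid1 << 16) + (mid2 << 16) + (hi << 32);
}
```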

It looks as though the processors can be purchased in small quantities for ~$5 per unit, and programming hardware (an Olimex ICD2) would be on the order of $100.

So far this looks to be the best option, combining an accessible package (DIP) with speed and functionality, although to be honest I would feel better with more than 2K of memory. I had the thought that one could probably increase the amount of storage per microcontroller by adding a small flash memory, but that would likely be much slower to access than the onboard SRAM...

Any thoughts? Any other suggestions?
By saipan59
The microcontroller that so far tops my list is the dsPIC30F3012
Given your constraints, I think you found the best choice.
You can find I2C code for this class of chip here:
The code there implements an I2C Slave on a dsPIC30F4011, and Master on an 18F4620. It compiles with Microchip's C30 and C18 compilers.

I've often thought about making a 'cluster' of the tiny 10F222 chips, because they come in a 6-pin SOT23 package -- you could cram a *bunch* of them onto a very small board.
But I haven't thought of a 'practical' application that couldn't be done more easily with a single larger MPU.

One interesting random thought is to use *analog* signals to communicate between small PICs, since many of them have A/Ds.

By reklipz
The programmer would just be ICSP, would it not? A PIC-PG2 would do the trick, clone or not.

Also, wouldn't the ADC have a fairly large acquisition time? Especially when sampling multiple times, e.g. four times with an 8-bit ADC to send 32 bits. Plus, you would then need a DAC on the host -- or, if you're doing interconnects between the processors, one on each processor.

BTW, this topic is very interesting, to answer the original question. :D

I may try to do a simple prime/not-prime finder using a host and 2 processors, although I'm not too certain how the host code would scale when adding more processors to the cluster. (The neat thing, though, is that all of the processors in the cluster can use the same firmware.)

Something that just occurred to me: the processors in the cluster could be of basically any architecture and run at any speed they want, so long as the communication protocol is implemented the same on all of them. It would be really neat if someone who specializes in ARM wrote a "client" for it using the same protocol as someone who wrote a "client" for a PIC, and so on. :lol:
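For clients on different architectures to interoperate like that, the protocol only needs to pin down a byte-level frame. A hypothetical example (the opcodes and layout are invented for illustration, not from any existing protocol): fixed-size little-endian frames that a PIC or an ARM would encode identically:

```c
#include <stdint.h>

/* A made-up frame format for the cluster bus: every client, whatever its
   architecture or speed, reads and writes these six bytes the same way. */
struct frame {
    uint8_t  addr;     /* 7-bit I2C-style node address */
    uint8_t  opcode;   /* e.g. 0x01 = CHECK_PRIME, 0x02 = RESULT (invented) */
    uint32_t payload;  /* candidate number, or result flags */
};

/* Serialize little-endian, byte by byte, so the endianness of the host
   CPU never leaks onto the wire. */
static void encode(const struct frame *f, uint8_t out[6]) {
    out[0] = f->addr;
    out[1] = f->opcode;
    out[2] = (uint8_t)(f->payload & 0xFF);
    out[3] = (uint8_t)((f->payload >> 8) & 0xFF);
    out[4] = (uint8_t)((f->payload >> 16) & 0xFF);
    out[5] = (uint8_t)((f->payload >> 24) & 0xFF);
}
```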
By silic0re
Thanks for the code -- I'll definitely check it out. I have a few dsPIC30F3012s on the way, as well as an Olimex ICD2 -- hopefully they'll be in this week.

I think it's interesting to consider the case of very small processors too -- things like the 6-pin 10F222s you mention. The problem is that with 23 bytes of memory, it might be a little tricky to find a problem you could compute. It would be really interesting to try -- especially with a numerical problem that can be solved in a small memory workspace using simple arithmetic -- additions and shifts and such.

I've chosen the dsPIC30F3012 because it seems to be a good balance of memory space, speed (30 MIPS -- for a PIC!) and cost per unit of performance. I think 30 MIPS is around the same rating as a 486DX/33 ( ... per_second), though it's of course difficult to compare MIPS scores across processors with different architectures.

I think the ADC approach would need quite a bit of external hardware (like a DAC). If it's an 8-bit ADC at 200k samples/sec, that's essentially a 200 kbyte/sec transfer line, though I'm not sure whether it would be noisy or whether you'd actually get that kind of performance.

reklipz, you're absolutely right that one could in theory tack on anything that communicates via I2C, although the host might have to take the speed of each unit into account, depending on the problem and on whether any interprocessor communication is needed to solve it. Indeed, for PICs, I think one could use the same firmware on every chip, with the exception of the I2C address.

For the prime problem, consider the following. If each processor simply accepts a number to test and, when it's done, communicates back that number and whether it's prime (or its smallest factor if it isn't), then for large numbers that take some time to compute, one should have relatively little communications overhead. I think this sort of problem would get pretty close to the limit of:

time_to_solve = amount_of_processing_time / number_of_processors

since each unit works essentially independently and communicates infrequently. While it may not be maximally cool (as cool as a problem/topology with lots of interprocessor communication), I still think it's plenty cool for a first 'PIC supercomputer' project. :)
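To put toy numbers on that limit (all figures invented for illustration): if the shared bus serializes communication, wall-clock time is roughly the compute time divided by the node count plus the total communication time, so scaling stays near-linear only while compute dominates:

```c
/* Toy scaling model for the cluster: compute divides across nprocs, but
   traffic on a single shared I2C bus does not. Figures are hypothetical. */
static double time_to_solve(double compute_s, double total_comm_s, int nprocs) {
    return compute_s / nprocs + total_comm_s;
}
```

With 1000 s of compute and 1 s of total bus traffic, 10 nodes give about a 9.9x speedup; at 100 nodes the fixed communication cost starts to bite.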
By saipan59
like the 6-pin 10F222's that you mention. The problem is that with 23 bytes of memory, it might be a little tricky to find a problem you could compute. It would be really interesting to try -- especially if you have a numerical problem that can be solved with relatively little memory workspace and simple arithmetic functions -- things like additions and shifts and such.
I suspect a 'numerical' problem is not the sort of thing to do with the tiny PICs. Instead, something more like a neural network, where each little PIC performs the work of, say, 2 or 3 neurons (or maybe just 1).
But again, it's still easier to use one big processor to simulate many small processors...
About 20 years ago I was doing work with neural networks and such, but a lot of that stuff became impractical when CPUs and memory and storage became so fast and so cheap.
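A single neuron of that sort really is tiny -- with integer weights it's a handful of multiply-accumulates and a threshold, easily within a small PIC's reach. A made-up sketch (not from any of the neural-network work mentioned):

```c
#include <stdint.h>

/* One integer neuron, the kind of workload a tiny PIC could host: a
   weighted sum of a few 8-bit inputs, then a threshold. All values
   here are arbitrary illustration. */
static int neuron(const int8_t in[3], const int8_t w[3], int16_t threshold) {
    int16_t sum = 0;
    for (int i = 0; i < 3; i++)
        sum += (int16_t)(in[i] * w[i]);
    return sum > threshold;  /* fires (1) or not (0) */
}
```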

Anyway, for the tiny PICs, it seems like the problem has to be something where it is necessary for the individual small CPUs to be physically intimate with their interfaces -- some sort of *physical* requirement for small, local processing.

By silic0re
ICD2 is here, but the dsPIC30F3012s are taking the scenic route. I'm getting anxious, so I think I'll try to put together the board for them this weekend. :)