
http://6371.lcs.mit.edu/


1. Transmitter Equalizer.

The goal of this project is to build the 4GHz equalizing FIR filter for a differential
transmitter. The equalizer cancels the frequency-dependent attenuation caused by the
skin-effect resistance of copper wire, giving a frequency response that is flat to within 5%
over the band from 200MHz to 2GHz even over wires with 6dB of high-frequency
attenuation, and thereby allows dependable transmission of high-frequency signals with
no intersymbol interference.


Transmitter Equalizer -- high speed data transmission

Overview

The goal of this project is to build the 4GHz equalizing FIR filter for a differential
transmitter. The equalizer cancels the frequency-dependent attenuation caused by the
skin-effect resistance of copper wire, giving a frequency response that is flat to within 5%
over the band from 200MHz to 2GHz even over wires with 6dB of high-frequency
attenuation, and thereby allows dependable transmission of high-frequency signals with
no intersymbol interference.

Our design will focus especially on tradeoffs that allow us to meet the minimum-delay
constraint, which is on the order of hundreds of picoseconds.

Team Members and Responsibilities

Team Member   Responsibility
xxxxxxx       Design, testing and layout of distribution and prefiltering of data.
              Share responsibility for top-level routing.
yyyyyyy       Design, testing and layout of the RAM. Share responsibility for
              top-level routing.

Schedule

(starting from November zzzz th, due date of this project proposal)

Task                                       Duration   Completion date
circuit design and schematic entry
initial simulation using schematics
cell layout (overlap with simulation)
block layout
block simulation with layout capacitance
top-level layout
final simulation
presentation preparation
report preparation



Theory of Operation

Our 400MHz clock cycle is divided into 10 equal phases. During each phase, we select a
current-supply drive strength depending on how recently a transition in the bit pattern has
occurred.

During the first phase, for example, we'll look at the LSB of the 10 input bits, and the 4
MSBs of the 10 input bits from the previous cycle. (This is because we are implementing
a 5-tap FIR filter.)

Taking the XOR of consecutive bits will give us a "transition" vector. We then calculate
the "distance" between the current bit and the most recent transition by passing the
transition vector into a "Find the First '1'" circuit.

We'll use this "distance" information to look up a current-supply drive strength value
which will be stored in RAM. In general, we plan to drive the current supplies harder
when distances between transitions are short. This accomplishes our goal of
pre-emphasizing the high frequencies in the transmitted signal.

The RAM gives us the ability to make our design flexible. We will be able to adjust (with
3-bit precision) the drive strength depending on whether a transition occurred 1, 2, 3, 4, or
"5 or more" bits away.

Each DAC corresponds to one of ten phases during which one group of five bits (the
current bit to transmit and the 4 previous) is sent through the FIR filter and the respective
drive strength determined.
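
The per-phase lookup described above can be modeled in software. The following Python
sketch is only a behavioral illustration, not the circuit itself: the function names and the
3-bit values in DRIVE_RAM are hypothetical, and it assumes the LSB of each 10-bit word is
transmitted first, as the description of the first phase suggests.

```python
# Hypothetical 5x3 RAM contents: distance -> 3-bit drive strength.
# Shorter distances get stronger drive (pre-emphasis); the values are made up.
DRIVE_RAM = {1: 7, 2: 5, 3: 3, 4: 2, 5: 1}


def transition_distances(prev_word, cur_word, n_bits=10, taps=5):
    """For each phase, return the distance (1..taps) from the current bit back
    to the most recent transition; `taps` stands for "taps or more"."""
    # The 4 MSBs of the previous word supply the history for the first phases.
    history = [(prev_word >> i) & 1 for i in range(n_bits - (taps - 1), n_bits)]
    current = [(cur_word >> i) & 1 for i in range(n_bits)]
    stream = history + current
    distances = []
    for phase in range(n_bits):
        window = stream[phase:phase + taps]  # 4 previous bits + current bit
        # XOR of consecutive bits, newest pair first: the "transition" vector.
        transitions = [window[-1 - k] ^ window[-2 - k] for k in range(taps - 1)]
        # "Find the first 1": its position is the distance to the transition.
        distance = next((k + 1 for k, t in enumerate(transitions) if t), taps)
        distances.append(distance)
    return distances


def drive_strengths(prev_word, cur_word):
    """Look up the drive strength for each of the ten phases."""
    return [DRIVE_RAM[d] for d in transition_distances(prev_word, cur_word)]
```

For a single 1 sent after a long run of 0s, the distances grow from 1 up to the
"5 or more" bucket as the line settles, so the drive strength falls off accordingly.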


Pinout

Because we are not designing a single chip to be actually fabricated, we have not created
a pinout. Our work will be included as part of Professor YY's transmitter chip (see
references).

Floorplan

Again, we are only doing part of a large design effort, so we are unsure of the floorplan,
especially in terms of the I/O ports of the chip. However, we plan to lay out our section of
the design in a similar fashion to that in the diagram above.



Cells

The following cells need to be designed:

Transition detector ( ).
Distance detector ( ).
5x3 RAM ( ).
Digital to Analog Converter ( ).

References

The Concurrent VLSI Architecture Group Home Page. Our project will be incorporated into
one of the CVA Group's projects called the Reliable Router.

"The Reliable Router: A Reliable and High-Performance Communication Substrate for
Parallel Computers". Appears in the Proceedings of the First International Parallel
Computer Routing and Communication Workshop, Seattle WA, May 1994.

"Architecture and Implementation of the Reliable Router". Appears in the Proceedings of
Hot Interconnects II, Stanford CA, August 1994.

"Low-Latency Plesiochronous Data Retiming". To appear in the Proceedings of the 1995
Advanced Research in VLSI Conference, Chapel Hill NC, March 1995.

"High-Performance Bidirectional Signalling in VLSI Systems". Appears in the Proceedings
of the 1993 Symposium on Research on Integrated Systems, Seattle WA, January 1993.

Transmitter Equalizer -- high speed data transmission

Goal of Project:

Given a working implementation of a transmitter equalizer, we wanted to redesign the
circuit to improve size and speed in the hopes of creating a smaller chip with higher
bandwidth.

How we approached it:

Because we were modifying an existing design as opposed to creating something new
(from scratch), our first task was to examine what the transmitter was trying to do so that
we could make implementation changes and optimizations while being careful to
preserve functionality.

After investigating the documentation associated with the first transmitter equalizer, we
decided that the following methods could be used to improve the circuit.

Previously, the information for each phase of current-driving was computed separately.
This fails to take advantage of an inherent recursive nature of calculating the "distance to
the nearest transition". For example, in the sequence:


bit0 bit1 bit2 bit3 bit4
0    1    0    0    0

if you had already calculated that the transition at bit2 had just occurred (the nearest
transition was a distance of "1" away), then you know that the farthest a transition could
be from bit3 is "2" away.
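
The recursion in this example can be written out as a short Python sketch. It is only a
behavioral model of the idea, not the circuit: the function name is hypothetical, and the
distance assumed for the first bit (no recent transition) is an assumption made for
illustration.

```python
def recursive_distances(bits, first_distance=5, cap=5):
    """Distance from each bit back to the most recent transition, computed
    from the previous bit's distance instead of rescanning the history.

    `cap` plays the role of the "5 or more" bucket; `first_distance` is an
    assumed starting value because nothing is known before bits[0].
    """
    distances = [first_distance]
    for i in range(1, len(bits)):
        if bits[i] != bits[i - 1]:
            # A transition just occurred: distance is 1.
            distances.append(1)
        else:
            # No new transition: one bit farther than before, up to the cap.
            distances.append(min(distances[-1] + 1, cap))
    return distances
```

For the sequence 0 1 0 0 0, this gives a distance of 1 at bit2 and 2 at bit3, matching
the example.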

Sung created the "distance-detector" (DD) circuitry, which used a series of xnor gates to
first change the data stream into a "transition" vector. A schematic diagram for her circuit
is as follows:

In this transition vector, 0's represent transitions and 1's represent no transition. A
Manchester carry chain implementation of a "find the distance to the most recent 0"
circuit took advantage of the largely recursive nature of this circuitry.

Sung considered the layout of her circuitry very carefully in order to pass the 50 bits of
information coming out of the distance detector in a way that would be conducive to
compact RAM design. We theorized that we could save a large amount of space in the
RAM if the DD passed information grouped by similar RAM address.

This theory came from the observation that wiring was the limiting factor on the area of
the RAM. The outputs of the DD would be inputs to NFET passgates in the RAM, and
grouping them together by RAM address allowed us to pass information using poly (gate)
wires, which gave us more room to route things in metal1 and metal2.



As you can see, the layout of the RAM did make extensive use of poly, metal1, and
metal2 layers, and resulted in a nice, compact design.

The Digital-to-Analog Converters were simply laid out in a shape that better suited our
design. This reshaping also helped to limit some capacitance. By changing the structure
from a long rectangle to something close to squares, we decreased the perimeter length of
the well, and thus the "sidewall" capacitance, which turns out to be a significant amount
considering the relative sizes of these transistors which drive the differential output
signal.
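
The geometric point can be checked with a small calculation. The Python sketch below
uses made-up dimensions (in lambda) purely for illustration; the actual well dimensions
are not given here.

```python
def perimeter(width, height):
    """Perimeter of a rectangular well; sidewall capacitance scales with it."""
    return 2 * (width + height)


# Two hypothetical wells with the same area (64 square lambda).
long_rectangle = (2, 32)
near_square = (8, 8)

assert long_rectangle[0] * long_rectangle[1] == near_square[0] * near_square[1]

long_perimeter = perimeter(*long_rectangle)    # 68 lambda
square_perimeter = perimeter(*near_square)     # 32 lambda
```

For a fixed area, the square has the shortest perimeter of any rectangle, so reshaping
toward a square reduces the sidewall contribution while leaving the bottom-plate area
unchanged.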


All of our proposed changes had to first be verified using HSPICE simulation of the
netlist. Because layout was a key factor in our project, we had to be sure that our transistor
sizes sustained functionality, especially in terms of timing (the circuit is pipelined, one
stage for drive strength calculation, another for driving the DACs in turn). Using some
very rough guesses as to the capacitances of nodes with long global wires, we worked for
some time before getting our differential voltage to be able to swing the 200-300 mV
needed during a 250ps interval.


From here, we proceeded to learn how to use Magic (not the standard 6.371 layout editor)
and began the massive layout portion of the project.

As of this time, we are trying to use some associated tools of Magic (the layout is done,
unless we find bugs while simulating...) in order to simulate our laid-out circuit. Because
the clock buffering for the circuit included quite a bit of complexity, we did not take on
its responsibility in our project. However, we will be careful to

Assuming that our laid-out circuit indeed works (!?), we performed some area
comparisons (in terms of lambda) and found our area to be approximately 40% of the
total original area of the analogous part of the previous circuit.

The diagram below compares our circuitry with the previous circuitry. Clearly, simply
summed total area may be a misleading specification. Because most modules in VLSI
tend to have square shapes, we also compared areas by rectangle size.

That is, we found the smallest rectangle that would encompass our circuitry and then
compared it to the previous design (which was already a rectangle). In this format, we
found that we still had a 20% area decrease.

Our conclusions were that we definitely could have made even more improvements to
area if we had more time. Other improvements that we thought were possible, but that we
realized only after we had done simulation, and that would have been painful to
re-simulate and re-interface with the other parts that we had already designed, were....

This leads us to believe what Professor Termann said in talks throughout the year: that a
building full of engineers working together on the same project can squeeze out a very
compact and high-performance circuit (at large cost). The timing analysis which has yet to
come will be significant in telling us what kind of bandwidth gains we received (if any)
from the decreased capacitance (and thus latency) of our more compact circuitry (i.e., can
the thing be clocked at a faster rate, remembering that the phase of the clock decreases by
ten times the decrease in system clock frequency).

References

The Concurrent VLSI Architecture Group Home Page. Our project will be incorporated into
one of the CVA Group's projects called the Reliable Router.

"The Reliable Router: A Reliable and High-Performance Communication Substrate for
Parallel Computers". Appears in the Proceedings of the First International Parallel
Computer Routing and Communication Workshop, Seattle WA, May 1994.

"Architecture and Implementation of the Reliable Router". Appears in the Proceedings of
Hot Interconnects II, Stanford CA, August 1994.

"Low-Latency Plesiochronous Data Retiming". To appear in the Proceedings of the 1995
Advanced Research in VLSI Conference, Chapel Hill NC, March 1995.

"High-Performance Bidirectional Signalling in VLSI Systems". Appears in the Proceedings
of the 1993 Symposium on Research on Integrated Systems, Seattle WA, January 1993.















2. Prime Factorization Chip.

The goal of this project is to create a prime factorization algorithm in VLSI. Repeated
subtraction is used to implement the required division operation. The chip accepts a
15-bit input number and outputs its prime factors. It begins by testing the numbers 2, 3, 5,
and every subsequent odd number. When the system discovers a number that is a factor
of the input, it checks to see whether the factor is prime or not. If the factor is prime, it is
sent to the output, otherwise the system proceeds to identify the next factor.


A Prime Factorization Chip

Overview

The goal of this project is to create a prime factorization algorithm in VLSI. Repeated
subtraction is used to implement the required division operation. The chip accepts a
15-bit input number and outputs its prime factors. It begins by testing the numbers 2, 3, 5,
and every subsequent odd number. When the system discovers a number that is a factor
of the input, it checks to see whether the factor is prime or not. If the factor is prime, it is
sent to the output, otherwise the system proceeds to identify the next factor. An example
of a factor that may be identified (but that is not prime), given an input of 75, is the factor
15. It would be caught as a factor, but would be found to have factors of its own, and
would therefore be discarded.

Top Level Functionality

1. Place number on input.
2. Enable GO.
3. Prime factors will appear on output until system enables DONE.
4. If DONE is enabled with no factors on output, the input is prime.

Team Members and Responsibilities

Team Member   Responsibility
xxxxxxxx      Design, testing and layout of control logic FSM and assorted cells.
              Share responsibility for top-level routing.
yyyyyyyy      Design, testing and layout of 16-bit adder, and 16-bit multiplexors.
              Share responsibility for top-level routing.




Schedule

Task                                       Duration   Completion date
circuit design and schematic entry         14 days
initial simulation using schematics        5 days
cell layout (overlap with simulation)      11 days
block layout                               4 days
block simulation with layout capacitance   4 days
top-level layout                           4 days
final simulation                           3 days
presentation preparation                   3 days
report preparation                         on going


Theory of Operation

Given an input number N, its prime factors can be found with a two-step process:

1. Outer divisor search loop: Find any number, P, which divides evenly into N.
2. Inner prime checker loop: Check to see if P is prime.

In this implementation the division will be implemented by repeatedly subtracting a
potential divisor from the dividend until the result is zero or negative. If the result is
negative, the number does not divide evenly into the dividend. To check to see if P is
prime, the same algorithm can be run, such that if any number is found to divide evenly
into P, then it is not prime.
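
The two loops can be sketched in software as follows. This Python model is only meant to
illustrate the algorithm described above; the function names are hypothetical, and it
models behavior only, not the datapath or the FSM timing.

```python
def divides(divisor, dividend):
    """Repeated-subtraction test: subtract until the result is zero or
    negative; a zero result means the divisor divides evenly."""
    remainder = dividend
    while remainder > 0:
        remainder -= divisor
    return remainder == 0


def candidates(limit):
    """Yield 2, then every odd number from 3 up to `limit` (2, 3, 5, 7, 9, ...)."""
    if limit >= 2:
        yield 2
    yield from range(3, limit + 1, 2)


def is_prime(p):
    """Inner loop: P is prime if no smaller candidate divides it evenly."""
    return p >= 2 and not any(divides(d, p) for d in candidates(p - 1))


def prime_factors(n):
    """Outer loop: report each candidate that divides N and is itself prime.
    An empty result corresponds to DONE with no factors, i.e. N is prime."""
    return [p for p in candidates(n - 1) if divides(p, n) and is_prime(p)]
```

With an input of 75, the candidates 15 and 25 are found to divide evenly but fail the
prime check, so only 3 and 5 are reported.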


A simple schematic for this algorithm is shown below.

This hardware performs both the outer divisor search loop and the inner prime checking
loop. Within either of these loops, the B input to the adder will represent the potential
divisor, while the A input will contain the updated value for the purpose of
subtraction-based division. Thus, N must be latched into a register so it can be recovered
later. Also, a P found in the outer loop will be placed into the A input for the inner loop
and the B input will be reset, requiring us to store P for reinitialization of the B input upon
exiting the inner loop.

Control of the circuit is achieved through an FSM. Required control signals are as
follows:





Signal Name   Purpose                          Value=0       Value=1        Value=2         Value=3
AMUX          Select A input to adder          N             A-B feedback   B feedback      P
BMUX          Select B input to adder          2 (reset)     B feedback     incremented B   P + 1
INCB          select B+1 or B+2 for BMUX       B + 1         B + 2          X               X
STRB          latch P (contents of B)          don't latch   latch          X               X
STROUT        latch output (P)                 don't latch   latch          X               X
ZERO?         Status signal: A-B is zero       A-B = 0       A-B != 0       X               X
NEG?          Status signal: A-B is negative   A-B >= 0      A-B < 0        X               X

Initially we'll try 16-bit datapaths and a ripple-carry adder. If time permits we may try
some sort of carry-lookahead adder.

Pinout

To make the chip simple to test, all the on-chip registers are connected in a scan chain
accessed using the IEEE 1149 Boundary Scan protocol [Weste, section 7.5.1]. Other
interesting data and control signals are brought to the pads so that timing measurements
can be made.

               +----------+
IN<14:0> ----->|          |
GO       ----->|          |
               |          |
               |          |
               |          |----->FACTOR<14:0>
CLK      ----->|          |----->done
               +----------+

Floorplan

We'll be using the standard Tiny Chip pad frame which allows a core of up to 1830
lambda wide and 1800 lambda high.

[Floorplan figure: pad frame with pads Vdd, GND, D0-D14 (in), Q0-Q14 (out), GO (in),
CLK (in), DONE (out) and NC; core blocks InReg, Amux, StrB, Strout, Adder, zero?, Inc,
Pmux, Bmux, Inv, +1 and FSM, separated by routing channels.]

Cells

The following cells need to be designed:

16-bit Adder (Ameet).
4-input 16-bit MUX (Ameet).
2-input 16-bit MUX (Ameet).
+1 Adder (Grant).
16-bit Register (Grant).
16-bit inverter (Grant).
16-input NOR gate (Grant).
FSM (Grant).



3. MIDI decoder and sound synthesis controller.

The VLSI system will take MIDI input signals with the aid of a clocked UART chip, and
will produce control signals for synthesis subsystems, which will consist of EPROMs,
D/A converter(s), and PALs. The system must produce the necessary control signals at
specific rates, depending on the frequency/tone requested by the MIDI input. The system
should provide controls for at least four subsystems so that at least four voices/tones may
play simultaneously. Mixing of the waveforms may be done either before or after the
D/A conversion with external circuitry.


MIDI decoder and sound synthesis controller

Overview:

The VLSI system will take MIDI input signals with the aid of a clocked UART chip, and
will produce control signals for synthesis subsystems, which will consist of EPROMs,
D/A converter(s), and PALs. The system must produce the necessary control signals at
specific rates, depending on the frequency/tone requested by the MIDI input. The system
should provide controls for at least four subsystems so that at least four voices/tones may
play simultaneously. Mixing of the waveforms may be done either before or after the
D/A conversion with external circuitry.

Since much of the external circuitry will be independent of the VLSI design, needing
only generalized control signals, it is not discussed in this proposal but will be presented
in later documents.

Depending on time constraints, the system may include the following improvements:

The system will provide for more than four subsystems (expensive in terms of
D/A converters, depending on the external implementation).

The following are more complex and will require reception of data from the
EPROMs, some limited mathematical calculations, and transmission of data to a
D/A converter:

    o Subsystems will consist only of EPROMs and counters implementing the
      addressing logic; the chip will handle the mixing (addition) of waveforms
      and send the resulting waveform to a single D/A converter. (Digital
      mixing could be external as well, given more control signals (for the
      adder) and proper synchronization to avoid glitches in the signals to the
      D/A converter.)

    o The system will scale the waveforms based on MIDI velocity data and
      system volume information.

Team Members and Responsibilities:................


Tentative Schedule:

11/11   Logic fully designed
11/15   Transistor-level schematics completed
11/18   Schematic entry completed
11/22   Initial simulation completed
12/02   All layout completed
12/06   Final simulation completed
12/09   Presentation
12/11   Final report completion

Theory of Operation:

The system will take MIDI signals from a UART in parallel (8-bit) format for timing
simplicity. (The UART will be set up in such a way to receive MIDI signals correctly.)
Whenever the UART signals that it has received a byte of MIDI information, the system
will then request that information from the UART and clear the UART so that it may
receive another byte of data. (To save pins, the request/clear may be handled through a
single-bit line to a PAL FSM.)

The system will then check the byte to determine whether the byte is a command byte or
a data byte (e.g., check whether the msb is 1 or 0). If the byte is a command byte, then the
system must determine if the command is valid for this implementation. If the command
is valid, then the system must either execute that command (if it is a simple command,
such as reset, or all stop) by sending the appropriate control signals, or it must wait for
the following data bytes that describe the desired note and the velocity (which is thrown
away in the primary implementation). If the byte it receives is a data byte without a valid
preceding command, it will ignore it and wait until it receives a valid command before
issuing further control signals. The system will store each full command (command byte
and two data bytes) in dedicated registers; only after the command is executed will the
values be replaced.
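
The byte-handling rules above can be illustrated with a short Python sketch. It is a
software model only, under stated assumptions: the set of commands treated as valid
(Note Off 0x8n, Note On 0x9n and System Reset 0xFF) is a simplification chosen for the
example, and the function name is hypothetical.

```python
NOTE_OFF, NOTE_ON, SYSTEM_RESET = 0x80, 0x90, 0xFF


def decode(stream):
    """Collect complete messages from an iterable of received bytes."""
    messages = []
    pending = None  # command byte plus data bytes being assembled, or None
    for byte in stream:
        if byte & 0x80:  # msb is 1: command byte
            if byte == SYSTEM_RESET:
                messages.append((byte,))  # simple command: act immediately
                pending = None
            elif byte & 0xF0 in (NOTE_OFF, NOTE_ON):
                pending = [byte]  # wait for the note and velocity bytes
            else:
                pending = None  # command not valid for this implementation
        elif pending is not None:  # msb is 0: data byte for a valid command
            pending.append(byte)
            if len(pending) == 3:  # command byte and two data bytes
                messages.append(tuple(pending))
                pending = None  # a new command byte is required next
        # otherwise: data byte without a valid preceding command, ignored
    return messages
```

In this model a data byte is accepted only while a valid command is pending, which is
the "ignore it and wait" behavior described above.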

Depending on the commands, the system will produce up to four different clock signals
(based on four different counters) derived from its own clock. These will be implemented
using standard counting arrays of flip-flops and will be set via lookup tables. The system
will use the desired note opcode as the lookup into the table. These clocks will serve as
the "increment" control signals to counters (either external or internal -- to be decided)
which will index the external EPROMs. The varying counting rates will cause the
EPROMs to output waveforms at varying rates, (hopefully) producing tones at many
frequencies. (The trick will be choosing a system clock rate that may be divided neatly to
correctly produce any desired tone. Another possible implementation may be to have a
second clock input that can be divided neatly and is related to the sampling frequency
used when sampling the waveforms for the EPROMs. I will have to do some calculations
in the next couple of days to determine which method to use.) The current design
assumes that the D/A converter needs no control signals but can run in continuous mode,
updating from its inputs whenever it signals that it has a valid output. One of the
extra/unused control pins can be used to sync D/A converters if necessary.
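
The divider calculation mentioned above might look like the following Python sketch. It
uses the standard equal-temperament relation (MIDI note 69 is A at 440 Hz); the system
clock rate and the number of samples per waveform period are hypothetical values chosen
only to make the example concrete, and the function names are not part of the design.

```python
SYSTEM_CLOCK_HZ = 8_000_000   # hypothetical system clock
SAMPLES_PER_PERIOD = 64       # hypothetical samples per waveform period


def note_frequency(note):
    """Equal-temperament frequency in Hz of a MIDI note number."""
    return 440.0 * 2 ** ((note - 69) / 12)


def divider(note):
    """Number of system clock cycles per EPROM address increment."""
    increment_rate = note_frequency(note) * SAMPLES_PER_PERIOD
    return round(SYSTEM_CLOCK_HZ / increment_rate)


def actual_frequency(note):
    """Tone actually produced once the divider is rounded to an integer."""
    return SYSTEM_CLOCK_HZ / (divider(note) * SAMPLES_PER_PERIOD)


# A lookup table indexed by note number, as the text proposes.
DIVIDER_TABLE = [divider(n) for n in range(128)]
```

Because the divider must be an integer, the produced tone is slightly off from the ideal
frequency; comparing actual_frequency with note_frequency for each note is one way to
judge whether a candidate clock rate divides "neatly" enough.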

The system will choose the appropriate subsystem using 2 pins/bits, and will always send
a start/stop signal on 1 additional pin, to indicate whether or not a particular subsystem
(or D/A converter) should continue producing analog output.

Still under debate is whether the counters will be internal or external to the chip. If
internal, the system will require at least 14 output pins for EPROM addressing (assuming
16k EPROMs). If any of the improvements requiring input from EPROMs and output to
a D/A are implemented, the counters will have to be external to the chip in order to have
enough available I/O pins (at least 8). A conclusion will be reached within the next
couple of days.

Also note that the system assumes that each EPROM is full and contains only one
waveform that may be repeated over and over. This restriction may be difficult to
implement in a practical situation, probably requiring serious tweaking of the waveform
to get the beginning and ends to match so that it may be repeated seamlessly.

Pinout:

These pinouts are numbered with respect to the 34 available signal pins, not actual pins.

There are currently two possible versions, depending on the decision to have internal or
external addressing of the EPROMs. This will be determined within a day or so. (I'm
leaning toward external addressing, since external counters will be fairly easy to
implement and will allow more flexibility for improvements.)

Version 1 provides internal addressing of the EPROMs, which requires at least 14 output
pins, unless a complicated clocking scheme (with external registers) is implemented.
Using Version 1, none of the complicated improvements may be implemented, since not
enough pins are left over to get values from the EPROMs or to send values out to a D/A
converter.

Version 2 provides only control signals to a subsystem's external addressing counters, in
order to save pins and internal complexity. Version 2 allows room for all of the
improvements described in the overview, which require taking inputs from the EPROMs
and sending output values to a single D/A converter.

Version 1 (internal addressing):

Pin 1        External CLK input
Pins 2-3     UART I/O control signals
Pins 4-11    UART 8-bit input
Pin 12       Subsystem continue/stop
Pins 13-14   Subsystem selection (1-4)
Pins 15-20   Additional pins that may be used for more subsystems or for a larger
             address space
Pins 21-34   EPROM addressing

Version 2 (external addressing):

Pin 1        External CLK input
Pins 2-3     UART I/O control signals
Pins 4-11    UART 8-bit input
Pin 12       Subsystem continue/stop
Pins 13-14   Subsystem selection (1-4)
Pins 15-22   Additional space for more subsystems and for EPROM control signals
             (particularly if system clock is faster than EPROM delay), or put 8-bit
             input from EPROMs here (for the improvements mentioned above)
Pins 23-24   Increment/Reset of subsystem addressing
Pins 25-34   Room for 10-bit output to D/A converter and 8-bit input from EPROMs
             (one of the improvements). Can be reduced to 8-bit output (with scaled
             inputs), in order to have more control signal space, but not recommended.

Floorplan:

The (rather small) MIDI code lookup table and the main system logic will probably be in
the upper left corner.

In the lower left of the chip will most likely be the UART interaction block/unit. It will
contain the logic for the control signals to the UART and for the storage of the data
coming from the UART.

The larger timing lookup table, the four timing clocks, the four (possible) addressing
counters, and the output logic will sit in the right half of the chip.

If external addressing and the improvements are implemented, then the necessary adders
and registers will be added first in the extra space on the right side and then in any
remaining extra space that can be found.

Cells:

Jon Shoemaker will be laying out all of the functional units described above.

References:

Current references include the following web pages on MIDI opcodes/specifications:

The MIDI Manufacturers Association:
http://home.earthlink.net/~mma/

Harmony Central:
http://www.harmony-central.com/MIDI/

Circuit design will most likely reference the course text by Weste, and other related
computer architecture/system design texts.

A Half-Baked (but functional) MIDI Decoder

Abstract:

I designed, laid out, and tested the CMOS logic for a limited MIDI decoder. While I
could not implement all of the originally proposed functionality due to space constraints,
I managed to redefine the project guidelines so that I could still design and build a useful
device. The final result works well in simulation and could possibly play a role (albeit a
smaller one than before) in a music synthesis system. The following report analyzes the
thoughts and logic which led to my final design, and shows the results.


Introductory Questions:

"What is MIDI?"

MIDI is the standard used by the music industry for the digital description and
transmission of music data. It has existed for years and the entire specification is quite
extensive. MIDI signals basically consist of bit serial transmissions of data (please visit
some of the Web sites referenced at the end of the document for specific technical
information) sent in message groups of one to three bytes. The shorter messages represent
system messages and the longer messages describe note or control change information.

What would the MIDI decoder do?

Well, that depends. If I had enough chip space, it would do everything that I had
originally proposed and more, which includes the reception of MIDI messages and the
translation of those into control signals for wave tables (which hold discrete samples of
sound data describing what various instruments sound like), digital to analog signal
converters (DAC), and the volume control of the system amplifier. With the
manufacturing process that I had to use for the design of the chip, however, the decoder is
limited to parsing note on and off requests and a few control/mode change signals. It
currently grabs the data and collates the pertinent information in one large parallel format,
making it easy for a following subsystem to concentrate on controlling wave table and
DAC subsystems.

Why did I choose to design a MIDI decoder?

(It's probably a very good question; I often asked it myself while working on the
project...) I could say that I did it because I love music, and I would not be lying, but I
imagine that ever since I came up with the idea for the project I wanted to do it just to
continue my tradition of doing music-related projects for my lab classes. (In both 6.111
and 6.115 a partner and I tried to build some implementation of a CD party mixer.) That
notwithstanding, the project also gave me an excuse to learn how MIDI actually works.
The implementation seemed structured as well, since MIDI has very strict guidelines.


Variations from the Proposal and the Revised Specifications

I discovered, after I had already half laid it out in CMOS, that the 16-bit counter I
needed for my note-specific timers would take up between a third and a half of the chip. I
had not realized before then just how small the total chip area was in relation to the
minimum feature size of the manufacturing process. I had proposed four of these
counters! Obviously, I could not continue with the original plan since I could not
reasonably implement the note-specific timing idea without the counters.

I spent a few days brainstorming a solution, deciding that inventing a whole new project
was not a good idea (it was much too late in the scheme of things for any sane person to
try it). Instead, I decided to simplify matters, redefining the scope of the project, pulling
in the reins, until the system could fit on the chip. Doing so meant that I could still use
much of the preliminary work and research that I had already done.

Eventually, as layout progressed, I found myself limited to deciphering and collating the
data contained in the messages. Using my original plans for implementing the FSM, the
control logic took up enough space on the chip that I could not even keep state
information about the notes currently being played. Thus, the revised specifications really
include only the decoder FSM and memory for holding the collated message data. The
decoder now expects the next subsystem to take the specific note (or control change)
information from its pins and to do what is necessary for keeping state of the allocated
notes and voices and for producing the requested sounds.

The resulting specification is not necessarily bad. In simplifying the specifications, I
made the system more flexible and more general. Even though it only pays attention to a
limited subset of MIDI messages, it has enough functionality for it to be useful in a
simple system. It could probably be used in more cases, since it completes one specific
task that practically all complex systems would have to deal with. The original
specification was particularly special purpose, and could probably have been used in only
one system design.


Logic Design Det
ails


24

Even in the original proposal, I had intimated that I would honor only Note On, Note Off,
and (a few) control messages. They are truly all that are necessary to implement the basic
functionality of a MIDI system. Those messages say enough to turn specific notes on or
off and about how voices should be allocated. The specific list of messages that I honor is
as follows:



Note On
Note Off
All Notes Off
Omni Mode on/off (affects voice allocation method)
Mono/Poly Mode (affects voice allocation method)
System Reset
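
The list above can be sketched as a small classifier. This is only an illustration, not the chip's gate-level logic: the function names are hypothetical, while the status nibbles and channel mode controller numbers are the standard MIDI 1.0 values.

```python
# Illustrative sketch only; helper names are hypothetical.
# Status nibbles and controller numbers are the standard MIDI 1.0 values.
NOTE_OFF, NOTE_ON, CONTROL_CHANGE = 0x8, 0x9, 0xB
SYSTEM_RESET = 0xFF

CONTROLLERS = {
    123: "All Notes Off",
    124: "Omni Mode Off",
    125: "Omni Mode On",
    126: "Mono Mode On",
    127: "Poly Mode On",
}


def classify_status(status):
    """Return (message class, channel) for an honored status byte, else None."""
    if status == SYSTEM_RESET:
        return ("System Reset", None)
    kind = status >> 4        # upper nibble selects the message class
    channel = status & 0x0F   # lower nibble is the channel number
    if kind == NOTE_ON:
        return ("Note On", channel)
    if kind == NOTE_OFF:
        return ("Note Off", channel)
    if kind == CONTROL_CHANGE:
        return ("Control Change", channel)
    return None               # every other message is ignored


def classify_controller(number):
    """Return the honored channel mode action for a controller number, else None."""
    return CONTROLLERS.get(number)
```

For instance, `classify_status(0x95)` reports a Note On for channel 5, which matches the test message used later in the system simulation.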

While the fine print of the MIDI spec also mentions the potential of "running status
bytes," which a given MIDI signal producer may opt to use, I have chosen to ignore it.
Essentially, "running status bytes" means that a signal producer may send one byte
specifying an action (such as Note On), and follow it with as many pieces of information
as it desires as long as they all relate to the same action. For example, a rolled chord on a
keyboard could be sent with one Note On status byte and followed by several pairs of
note and velocity data bytes. Since not all manufacturers bother to implement this
efficiency feature, I decided that it was not worth the trouble of expanding the FSM logic. I
imagine that some products include some way of enabling and disabling the feature if it
causes problems.
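
For readers unfamiliar with running status, the following sketch shows what a receiver that did support it would have to do. It is not part of this design (the decoder ignores running status), the function name is hypothetical, and it assumes only two-data-byte messages such as Note On and Note Off.

```python
def expand_running_status(stream):
    """Yield (status, data1, data2) tuples, reusing the last status byte
    whenever the sender omits it (running status).

    Assumption: every message carries exactly two data bytes, as Note On
    and Note Off do. Other message lengths are not handled here.
    """
    status = None
    data = []
    for byte in stream:
        if byte & 0x80:            # status bytes have the high bit set
            status = byte
            data = []
        elif status is not None:   # data byte belonging to the current status
            data.append(byte)
            if len(data) == 2:
                yield (status, data[0], data[1])
                data = []          # keep the status for the next pair
```

A rolled chord sent as `[0x90, 60, 100, 64, 100, 67, 100]` expands to three separate Note On messages, each with status `0x90`.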

My preliminary objective for the FSM was basically for it to process and parallelize the data
as quickly as possible. (Given the rather slow transmission of MIDI, the speed objective
was irrelevant, and it forced a rather specific set of circuitry instead of a general,
expandable set, as will become obvious through the development discussion.) I played
with several lists of possible discrete states and kept trying to reduce them, making each
state accomplish more and more, while separating specific tests and functions that had to
be kept apart for the system to work correctly. I eventually came up with 15 states. Even
so, three of those states were dummy states that existed for timing purposes (and could be
removed given the final implementation of the memory storage as SRAM rather than
loadable D registers). The resulting state diagram follows:




Basically, the system first waits for an incoming message byte. After it receives a
message, it acknowledges the message by sending a reset signal to the subsystem that
sent the message (most likely a UART subsystem). It checks to see if the message is a
valid Note On, Note Off, system reset, or control change message. If not, then it returns
to the first state and waits for another message. Except for system reset, the messages that
the decoder accepts are at least two bytes long. It thus waits for a second byte, just like it
did for the first one. This time, given the type of message described by the first byte, it uses
the second byte to determine what it should do. If the message was Note Off, then it now
knows the note that it must turn off. It can go to state 6, turn off the note, and wait at the
beginning for a new message. If the message was a control message, then it knows if it
should set the omni mode, the mono/poly mode, or turn off all the notes. If the message
was Note On, however, it must wait for the third message byte, which specifies the
"velocity" or volume of the note. Since a Note On message with a velocity of zero is a
widely used synonym for Note Off, the system also branches to one of two states
depending on the results of a zero test of the third data byte. Lastly, during forced resets
(accomplished through pulling high a particular pin on the chip), the system must go to
state 15, send appropriate reset messages to following subsystems, and then go to state 0
and wait for new messages. Since there are only fifteen states, one is missing from the
possible sixteen states that we could have with a four-bit FSM. The choice of leaving out
14 for 15 was not entirely arbitrary. One initial design would have had the reset state reach
state 0 through an increment, and thus state 15 would have then wrapped around to state
0.
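
The behavior described above can be modeled in a few lines. This is a behavioral sketch only: it does not reproduce the fifteen hardware states, the event names are hypothetical, and it assumes the standard MIDI controller numbers for the channel mode messages.

```python
# Behavioral model of the decoder, not the 15-state hardware FSM.
# Event names are hypothetical; controller numbers are standard MIDI values.
CONTROL_ACTIONS = {
    123: "all_notes_off",
    124: "omni_off",
    125: "omni_on",
    126: "mono_on",
    127: "poly_on",
}


def decode(stream):
    """Turn a sequence of message bytes into a list of decoded events."""
    events = []
    status = None   # status byte of the message being assembled, None when idle
    note = None     # first data byte of a pending Note On
    for byte in stream:
        if status is None:
            if byte == 0xFF:                        # system reset is one byte long
                events.append(("reset",))
            elif byte >> 4 in (0x8, 0x9, 0xB):      # Note Off, Note On, control change
                status = byte
            # any other byte is ignored and the decoder keeps waiting
            continue
        channel = status & 0x0F
        kind = status >> 4
        if kind == 0x8:                             # second byte names the note
            events.append(("note_off", channel, byte))
            status = None
        elif kind == 0xB:                           # second byte selects the action
            if byte in CONTROL_ACTIONS:
                events.append((CONTROL_ACTIONS[byte], channel))
            status = None
        elif note is None:                          # Note On: remember the note
            note = byte
        else:                                       # Note On: third byte is velocity
            if byte == 0:                           # zero velocity means Note Off
                events.append(("note_off", channel, note))
            else:
                events.append(("note_on", channel, note, byte))
            status, note = None, None
    return events
```

Trailing data bytes that the decoder does not wait for (such as the release velocity of a Note Off) arrive while the model is idle and are simply dropped, which mirrors the "return to the first state" rule in the text.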

Given this set of states, I then began to formulate the logic for detecting the various
messages that the decoder should accept. The following diagram shows the rather small
chunk of logic which detects the three main message classes:


After that, I needed the following logic to determine the specific control/mode changes
that I wanted to accept:


Of course, designing those particular pieces was a rather easy task compared to the
implementation of the state selection logic. Given my desire to limit the size of the FSM
and its ROM, I ended up overly complicating the design. Instead of wiring the possible
branch points in the ROM (which would have taken relatively little space) I hardwired
them in a MUX and register subsystem. Some of the selection logic for these MUXes
became as complicated as that already shown above. While I was able to simplify some
of them a great deal by rearranging them, some resisted simplification and became more
complex if disturbed. After a bit of frustrated tooling, I eventually came up with the
following design:


The implementation could have probably been much cleaner with a general test and
branch subsystem, rather than the specialized version shown. If I were to redesign or
further develop
this system, I would certainly consider performing that modification.


Hardware Implementation of the Logic

After the time-consuming fiasco with the 16-bit counter, I had little time left to formally
try the system on the schematic level. I felt that I knew how the system should work, and
that I had planned it out thoroughly enough to be satisfied that I could lay it out in CMOS
from the logic gate diagrams. I was partially right, and partially wrong. Preliminary
results would have told me approximately how much chip area the design would take,
since I obviously could not tell from experience and since logic diagrams are typically
over-simplified. Despite that drawback, however, I was able to lay out all of the logic
cells in CMOS and have them work with few problems. For speed and dependability, I
relied on two-input NANDs and NORs as much as possible. While this "bulked up" my
system, I feel the end results were worth the additional effort.

All of my CMOS cell implementations may be found on the following page:
CMOS Implementation

Please refer to that page for details about how I laid out a particular cell. The pictures
are all labelled with the particular piece of logic that they implement, and all of their
inputs and outputs have relatively clear designations. Some particularly interesting cells
include the test cells for all ones (for the system reset message) and for all zeros (for zero
note velocities). I was able to exploit the symmetry of both of these cells to cut the layout
work in half and the required chip area by a third or a fourth. Other optimized cells
include the 4-bit MUX, which I derived from the optimal 2-bit versions shown in the text,
and the modified single-row SRAM cell, which I discovered could do well without
precharge pullups and differential sense amplifiers, even if the implementation probably
uses a lot more power than it would otherwise.

I'm rather disappointed with the D register cell, however, since I could not find a
satisfactory static implementation other than that based on the logic-gate level definition,
described in the 6.115 text by Wakerly. I managed to extract some symmetry from it,
though, and simplified the layout, but I am positive that better implementations exist. The
size of my D register cell affected and subtly changed my system design, since I built the
more compact 1x8 SRAMs to take the place of eight bulky loadable parallel registers.

The final layout of the cells was the most painful. Routing definitely took a substantial
amount of room and forced strange rearrangements of the cells. The wiring also trapped
white space in the middle of the design.

After doing all of the layout and looking back at the design, I have learned that I could
have probably generalized much of it and could have reduced the amount of specialized
logic. I will definitely keep that in mind for my next hardware design project.


Simulation of the Individual Cells: Bottom-up Verification

Again, I will refer all questions about the simulations to another page, since it was much
easier for me while drafting this report to have the pictures on another screen.

Independent Cell Simulations


I will, however, attempt to explain a few of the more important cell simulations here. I
have omitted pictures for the simpler cells, since they were relatively easy to verify on
sight by typing two or three commands at the Lsim main window. I wrote short C
programs, however, to script batch initialization files for the more complicated cells, so
that I could easily start simulations running for tens to hundreds of individual test cases.
All of the tests eventually worked. Testing each cell permitted me to find small but nasty
bugs that would have been much harder to find later.

Of the cell simulations I have included, perhaps the most important are those for the state
selection logic. Those cells definitely have to work correctly for the entire system to
function at all. Their simulation results also help estimate the total propagation delay and
thus how fast I can push the entire system. Even before I ran the final system simulations,
I could have guessed that the operating limit was near 30 MHz from the preliminary
results. The slowest of the state selection logic is the incrementer cell. For the worst case,
where 15 wraps around to 0, the result takes nearly 9 ns to compute. With the propagation
delays of about 1.5 ns through each set of muxes, and the approximately 8 ns worst case
delays through the MUX selection logic, I estimated that nearly 25 ns were necessary for
the state selection logic to stabilize. In addition, the ROM provides inputs to the selection
logic, and tests (which I did not save) showed that it required somewhere between 3 and
8 ns to produce stable outputs. Thus, from the preliminary results, I estimated the system
delay to be in the 30-35 ns range, which is certainly much better than is required for a
system of this type, which does not necessarily need to operate at or benefit from
megahertz clock speeds.


Wrestling the Routing Demon

After I simulated and verified each of the individual cells, I was ready to wire them all
together. It took me nearly two days of rearranging and rewiring until I reached the point
where I refused to optimize the layout any more. I then added all the routing that was
necessary to accommodate the final layout, which is shown below:


The heftiest and most time-consuming routing was actually around the FSM ROM and
the MUX selection logic cells. Since I had split the MUX selection logic into so many
pieces, finding an optimal arrangement for their routing proved to be extremely difficult,
if not impossible. In addition, the width of the FSM ROM forced me to search for ways
to route around it without encroaching too much on the global routing space that I would
need to wire the system to the chip pads.

After it was all wired, though, the most rewarding sight was the one shown below. I
placed the completed system in the pad frame and verified visually that it would just
barely fit with the additional global routing to the pad frame.


The final pinouts for the system would be the following, beginning counter-clockwise
from GND in the top center:

Clock, C4-C6 (SRAM C, bits 4-6), I0-I7 (the message inputs), A0-A3 (SRAM A, bits 0-3),
assign, clear, resetU, polySet, omniSet, B0-B6 (SRAM B, bits 0-6), C0-C3 (SRAM C,
bits 0-3), UReady, and global reset.

All of the pins have a specific purpose, and all of the necessary signals can be sent using
the available pins. For example, only the first seven bits of SRAMs B and C are
significant, since MIDI specs limit the use of the highest order bit for status bytes that
initiate messages. Note values, velocity values, and control requests all fall in the 0-127
range and can be represented with seven bits. In addition, after decoding the actual class
of the message from the status byte, only the lower order four bits of it are necessary to
keep around as the identification for the particular channel number (or voice subsystem)
for which the message is intended. Lastly, I have defined the logic so that the clearAll
and reset omni/poly mode signals can be respectively represented by pulling both clear
and assign high, or by pulling omniSet and polySet high. While it also seems that I have
left out the actual omni/poly mode decisions, the following subsystems can deduce the
decision by looking at the lowest order bit of byte B. If that bit is zero, then the mode is
off; otherwise, it is on.
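
As a rough illustration of how a following subsystem might read these pins, consider the sketch below. The two combined encodings (clear with assign, omniSet with polySet) and the bit-0 rule come from the paragraph above; the meaning given to each signal when asserted alone, and the function and label names, are assumptions made for the example.

```python
def interpret_outputs(assign, clear, omni_set, poly_set, byte_b):
    """Guess the request encoded on the decoder's control pins.

    From the text: assign+clear together mean "clear all notes",
    omniSet+polySet together mean "reset the omni/poly modes", and the
    lowest bit of byte B says whether a mode is being turned on or off.
    Assumed for illustration: assign alone allocates a note, clear alone
    releases one, and omniSet/polySet alone select which mode changes.
    """
    if assign and clear:
        return "clear all notes"
    if omni_set and poly_set:
        return "reset omni/poly modes"
    mode_on = bool(byte_b & 1)
    if omni_set:
        return "omni on" if mode_on else "omni off"
    if poly_set:
        return "poly on" if mode_on else "poly off"
    if assign:
        return "allocate note"
    if clear:
        return "release note"
    return "idle"
```

The bit-0 rule lines up with the standard MIDI controller numbers: 124 and 126 (omni off, mono on) are even, while 125 and 127 (omni on, poly on) are odd.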


System Simulation Results

Again, I have placed all of the simulation graphs on another page:

Full System Simulation Results

The system-wide simulations took hours to run. (I would hate to imagine the amount of
time that the more complicated projects for this class required.) Since I designed the
system to ignore trashy inputs and to return to state 0 whenever a particular message did
not match its selection criteria, I was not really worried about an extensive fault test. In
addition, I had tested the system's cells from the ground up and had confidence that they
would "do the right thing" when carefully wired together.

The painstaking planning and preliminary testing rewarded me with immediately
successful simulations. For each simulation, I began with a reset period, since a real
implementation of this system should have the reset pin pulled high on startup. In the
initial test, I sent a string of messages, containing a Note On request (0x95, 0x70, 0x31
--> note on, channel 5; note 0x70, velocity 0x31), a Note Off request (0x85) for the same
note, a system reset message, a couple of omni/poly control change requests, and an All
Notes Off message. Each worked as expected. I did notice, however, with one simulation
where I had a typo, that I had forgotten to include a specification on the UReady/ResetU
signal relationship.

If the UReady signal does not stay set for the entire clock cycle that the reset signal is
asserted, the system becomes stuck in that state. This could potentially be called a bug,
since the system could be altered to get rid of this possibility. The problem can also be
solved, however, by requiring the reset of the ready signal to wait a cycle (easy to do with
a flip-flop) before it takes effect. Since this is an external specification, all I need to do is
state the restriction. If the external subsystem supplying the message signals abides by the
restriction, the system operates perfectly.

The group of simulation graphs includes verification for all three of the SRAMs and
shows how the system responds to different messages. In particular, I can verify that the
system asserts "assign" and "clr" when all notes should be turned off, and that all control
signals are pushed high for a reset. While the graphs also show the glitches in the outputs
due to the ROM stabilization delays, I am not particularly worried, since the following
subsystem will be expected to synchronize these outputs to the system clock. Proper
synchronization (again with flip-flops) will remove any problems, since the glitches
disappear before the next clock edge.

The final simulation tested how fast I could push the system. I designed a batch file that
would keep reducing the clock period while repeating the same message test sequence. I
designed the simulation to push the present 1/2 cycle value into the SRAM so I could
easily tell from inspection what the particular clock periods were. The system failed once
the 1/2 cycle reached 16 ns. During this segment of the test, the system did not assert the
assign signal as it should have. (While this timing test was not an exhaustive test of all of
the possible functions, it still gave a reasonable approximate timing specification.) Thus, the
minimum time it should be given for each cycle is 2*17, or 34 ns, practically the same
time as that predicted through the preliminary tests.
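
To put the 34 ns figure in perspective, the short calculation below converts the shortest passing half cycle into a clock frequency and compares it with the MIDI byte rate. The 31,250 baud rate and the 10-bit serial frame are the standard MIDI values, not numbers from this report, and the function names are made up for the example.

```python
MIDI_BAUD = 31_250       # bits per second (standard MIDI serial rate)
BITS_PER_BYTE = 10       # start bit + 8 data bits + stop bit


def max_clock_mhz(half_cycle_ns):
    """Clock frequency in MHz for a given half-cycle time in nanoseconds."""
    return 1e3 / (2 * half_cycle_ns)


def cycles_per_midi_byte(half_cycle_ns):
    """How many decoder clock cycles fit in the time one MIDI byte takes to arrive."""
    byte_time_ns = BITS_PER_BYTE / MIDI_BAUD * 1e9
    return byte_time_ns / (2 * half_cycle_ns)


if __name__ == "__main__":
    print(f"{max_clock_mhz(17):.1f} MHz")
    print(f"{cycles_per_midi_byte(17):.0f} cycles per MIDI byte")
```

With a 17 ns half cycle the clock tops out at about 29.4 MHz, and more than nine thousand clock cycles pass while a single MIDI byte arrives, which is why the speed objective turned out to be irrelevant.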


Conclusion

The system works perfectly as designed, and clearly exceeds any timing demands that
could be placed on it, since MIDI signals are transmitted at relatively slow rates. I feel
that I have learned a lot through the design and implementation of the project and will
carry the knowledge and experience with me into future hardware design projects. While
it has been very frustrating at times, it felt absolutely wonderful to see the final system
wired together and simulating correctly. I would like to thank Chris Terman, Rich Lethin,
and Andy Allen for their help, guidance, and understanding throughout the project and
the entire course.


References

MIDI: http://home.earthlink.net/~mma/

Harmony Central: http://www.harmony-central.com/MIDI/

VLSI and digital logic design:

Weste, Neil H. E. and Kamran Eshraghian, Principles of CMOS VLSI Design,
Addison-Wesley Publishing Company, Reading, MA. (1993)

Wakerly, John F., Digital Design: Principles & Practices, Prentice Hall,
Englewood Cliffs, NJ. (1994)


ZF8081 -- a small microprocessor.

The goal of the project is to implement a fully functional microprocessor using VLSI
technology. The chip includes a timing generator, a memory address register (MAR), a
memory data register (MDR), a program counter (PC), an opcode latch and decoder, an
accumulator and an arithmetic-logic unit (ALU). The data width is 16 bits, the address
length is 13 bits and the opcode length is 3 bits (load, store, add, AND, XOR, OR...).


ZF8081 -- a small microprocessor

Overview

The goal of the project is to implement a fully functional microprocessor using VLSI
technology. The chip includes a timing generator, a memory address register (MAR), a
memory data register (MDR), a program counter (PC), an opcode latch and decoder, an
accumulator and an arithmetic-logic unit (ALU). The data width is 16 bits, the address
length is 13 bits and the opcode length is 3 bits (load, store, add, AND, XOR, OR...).

Team Members and Responsibilities

Team Member   Responsibility

xxxxxxxx      Design, testing and layout of the ALU, accumulator, memory data
              register, memory address register and random control circuit. Share
              responsibility for top-level routing.

yyyyyyyy      Design, testing and layout of the timing circuit, program counter, and
              opcode latch and decoder. Share responsibility for top-level routing.

Schedule

Task                                        Duration   Completion date

circuit design and schematic entry          10 days
initial simulation using schematics         5 days
cell layout (overlap with simulation)       7 days
block layout                                4 days
block simulation with layout capacitance    4 days
top-level layout                            4 days
final simulation                            3 days
presentation preparation                    3 days
report preparation                          ongoing    12/9/96

Theory of Operation

The execution of an instruction generally requires two cycles: a FETCH cycle during
which the CPU reads memory to get the instruction and an EXECUTE cycle during
which it executes the instruction. Each FETCH and EXECUTE cycle is subdivided into 2
phases.

FETCH cycle:

o Phase 0: The memory is read. This brings the instruction into the MDR.

o Phase 1: The opcode portion of the instruction (bits 13-15) is sent to the opcode
  latch, which is transparent during this phase, and the opcode decoder activates the one
  output line that corresponds to the instruction being executed. At the same time,
  the Program Counter is incremented and the address portion of the instruction
  (bits 0-12) is placed in the MAR.

EXECUTE cycle: what occurs depends on the instruction.

1. LOAD

   o Phase 0: The memory is read, which brings the data into the MDR. The
     Accumulator is cleared to zero and the ALU is commanded to ADD.

   o Phase 1: The ALU adds the MDR contents to the Accumulator contents, which are
     zero at this time. The result goes to the Accumulator and (PC) goes into the MAR
     at the end of the cycle.

2. STORE

   o Phase 0: The Accumulator output is sent to the MDR.

   o Phase 1: The memory is placed in WRITE mode. (PC) goes into the MAR at the
     end of the cycle.

3. ADD, AND, XOR, OR...

   o Phase 0: The memory is read, which brings the data into the MDR.

   o Phase 1: The ALU adds (ANDs, XORs, ORs...) the MDR contents with the
     Accumulator contents. The result goes to the Accumulator and (PC) goes into the
     MAR at the end of the cycle.
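
The FETCH and EXECUTE cycles described above can be modeled at the register-transfer level. The sketch below is only an illustration: the opcode numbering and the `encode` and `step` helpers are hypothetical, and it assumes the 3-bit opcode sits in the top three bits of the 16-bit word above the 13-bit address.

```python
# Hypothetical opcode numbering; the proposal does not give an encoding.
LOAD, STORE, ADD, AND, XOR, OR = range(6)

ADDR_MASK = 0x1FFF   # 13-bit address
WORD_MASK = 0xFFFF   # 16-bit data


def encode(opcode, address):
    """Pack a 3-bit opcode and a 13-bit address into one 16-bit word."""
    return ((opcode & 0x7) << 13) | (address & ADDR_MASK)


def step(mem, pc, acc):
    """Run one FETCH cycle and one EXECUTE cycle; return the new (pc, acc)."""
    # FETCH phase 0: read the instruction into the MDR.
    mdr = mem[pc]
    # FETCH phase 1: latch the opcode, increment the PC, load the MAR.
    opcode = mdr >> 13
    mar = mdr & ADDR_MASK
    pc = (pc + 1) & ADDR_MASK
    # EXECUTE: phase 0 reads (or prepares) the data, phase 1 updates state.
    if opcode == LOAD:
        acc = (0 + mem[mar]) & WORD_MASK     # accumulator cleared, then ALU adds
    elif opcode == STORE:
        mem[mar] = acc                       # accumulator -> MDR -> memory
    elif opcode == ADD:
        acc = (acc + mem[mar]) & WORD_MASK
    elif opcode == AND:
        acc = acc & mem[mar]
    elif opcode == XOR:
        acc = acc ^ mem[mar]
    elif opcode == OR:
        acc = acc | mem[mar]
    else:
        raise ValueError(f"unassigned opcode {opcode}")
    return pc, acc
```

A three-instruction program (LOAD, ADD, STORE) run through `step` three times adds two memory words and writes the sum back, with the PC advancing by one per instruction.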

The block diagram of this microprocessor is shown below:















All the logic is implemented as "dual-rail" domino gates so that we can determine when each
gate has finished its computation. The precharge and evaluation cycle for each cell will be
designed individually. Initially we will try 16-bit datapaths and a Manchester ripple-carry
adder. If the space does not permit, we may try 8-bit datapaths.

Pinout

We will be using the standard Tiny Chip pad frame, which has a total of 40 pins and up to
34 signal pins. This microprocessor chip will have 16 data pins (D0:D15), 13 address pins
(A0:A12), one pin for memory read (R), one pin for memory write (W), one pin for clock
input (CLK) and one pin for chip reset (Res). If we only have space to implement 8-bit
datapaths, then the chip will have 8 data pins (D0:D7) and 5 address pins (A0:A4).

Floorplan

The standard Tiny Chip pad frame allows a core of up to 1830 lambda wide and 1800
lambda high. The floorplan for the possible 16-bit microprocessor is shown below. For the
8-bit microprocessor, simply replace D8-D15 and A5-A12 with NC.






39

Vdd

D3

(i/o)

D2

(i/o)

D1

(i/o)

D0

(i/o)

GND

A12

(out)

A11

(out)

A10

(out)

A9

(out)

Vdd

D4

(i/o)

r

o

u

t

i

n

g

routing

r

o

u

t

i

n

g

A8

(out)

D5

(i/o)

M

D

R

A

c

c

u

m

u

l

a

t

o

r

A

L

U

P

C

M

A

R

A7

(out)

D6

(i/o)

A6

(out)

D7

(i/o)

A5

(out)

D8

(i/o)

A
4

(out)

D9

(i/o)

A3

(out)

D10

(i/o)

Timing circuit and Random logic

A2

(out)

D11

(i/o)

A1

(out)

D12

(i/o)

Opcode latch and decoder

A0

(out)

GND

D13

(i/o)

D14

(i/o)

D15

(i/o)

CLK

(in)

Vdd

RES

(in)

R

(out)

W

(out)

NC

(out)

GND

Cells

The following cells need to be designed:

Arithmetic logic unit (ALU) which has the following functions: ADD, AND, XOR, OR (Zhibo).
Accumulator (Zhibo).
Bidirectional memory data register (Zhibo).
Memory address register (Zhibo).
Random control logic (Zhibo).
Timing circuit (Maya).
Opcode latch and decoder (Maya).
Program counter (Maya).

References

1. Greenfield, J. D., "Practical Digital Design Using ICs," Prentice-Hall, 1994.

2. Johnson, M., "Superscalar Microprocessor Design," Prentice-Hall, 1991.

3. Hennessy, J.L., Patterson, D., "Computer Architecture -- A Quantitative
   Approach," Prentice-Hall, 1990.



5. El Cheapo PIC.

The goal of this 6.371 project is to design and implement a PIC-like microcontroller
called the ecPIC. Like a PIC, this processor will implement a load-store Harvard
architecture capable of executing 1 instruction per cycle. The ecPIC will feature 8 output
lines for controlling devices. Unlike a PIC, this processor will not feature any pipelining
and will have fewer arithmetic and control operations. Furthermore, the instruction
memory will be located off the chip.

el cheapo PIC - A minimalist's microcontroller

Overview

The goal of this 6.371 project is to design and implement a PIC-like microcontroller
called the ecPIC. Like a PIC, this processor will implement a load-store Harvard
architecture capable of executing 1 instruction per cycle. The ecPIC will feature 8 output
lines for controlling devices. Unlike a PIC, this processor will not feature any pipelining
and will have fewer arithmetic and control operations. Furthermore, the instruction
memory will be located off the chip.


Team Members and Responsibilities

Team Member   Responsibility

xxxxxxxx      Design, testing and layout of the Control Logic, Muxes,
              Incrementer and Registers.

yyyyyyyy      Design, testing and layout of the SRAM and ALU.


Schedule

Task                                   Duration   Completion Date

Circuit Design and Schematic Entry     14 Days
Initial Simulation Using Schematics    3 Days
Cell Layout                            8 Days
Block Layout                           3 Days
Block Layout Simulation                3 Days
Top Level Layout                       4 Days
Final Simulation                       2 Days
Presentation Preparation               2 Days
Presentation Preparation               4 Days
Project Webpage                        Ongoing




Theory of Operation

The ecPIC implements a simple, fully static, classic Harvard, load-store architecture.

On each clock cycle, the PC register will contain the address of the current instruction,
which is stored in the ROM located outside of the chip. The control register will
interpret this instruction by setting the appropriate inputs of the muxes, the write-back load
enables of the SRAM and W register, and the correct ALU operation. The SRAM will
essentially be a bank of 16 8-bit registers which can be written and read in one clock
cycle. One of these registers will serve as an output register for interfacing with the
outside world. The ALU will be an 8-bit ALU with two functions: Z=A and Z=Nand(A,
B). All muxes and registers will be 8 bits wide. The following is a simple block diagram
of the ecPIC:


To simplify the required control logic and ALU logic, the ecPIC implements a very small
instruction set. The instructions will include: Jump, Move and Nand. Jumps make it
possible to perform loops by jumping the PC register to a specified literal. Move makes it
possible to move literals into the W register. Moves will also allow NOPs to be
performed. Nand is our basic arithmetic building block.
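
To see why NAND alone is enough as a logic building block, the sketch below derives NOT, AND, OR and XOR from an 8-bit NAND. The Python function names are illustrative only; on the ecPIC each of these would be a short sequence of Nand and Move instructions rather than a function call.

```python
MASK = 0xFF  # the ecPIC datapath is 8 bits wide


def nand(a, b):
    """8-bit bitwise NAND, the only logic primitive assumed here."""
    return ~(a & b) & MASK


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))       # De Morgan: a | b == ~(~a & ~b)


def xor_(a, b):
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))  # the classic four-NAND XOR
```

Each derived operation costs between one and four NAND steps, which is the price paid for keeping the ALU and control logic small.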




Pinout



A0-A7 - Instruction address output
D0-D11 - Instruction word input
O0-O7 - Data output
CLK - Clock input
Spc - Control Logic Output
Sop - Control Logic Output
Salu - Control Logic Output
enS - Control Logic Output
enW - Control Logic Output

A0-A7 are the address outputs to the instruction word ROM. D0-D11 are the instruction
word inputs. O0-O7 are the data output pins for the PIC. CLK is simply the clock input.
Spc, Sop, Salu, enS and enW are control lines that come from the Control Logic module.
These lines are only put in place for the sake of testing and debugging.


Layout

We will be using the standard Tiny Chip pad frame which allows a core of 1830 lambda
wide by 1800 lambda high.





Cells

The following cells will be implemented:

1. 8-bit mux
2. 8-bit load enabled registers (== mux + register)
3. SRAM
4. ALU
5. Control Logic
6. Incrementer

The breakdown of who implements what is outlined in Team Members and
Responsibilities.


References

1. Microchip Databook, 1994.

el cheapo PIC - A minimalist's microcontroller

Abstract

The goal of this 6.371 project is to design and implement a PIC-like microcontroller
called the ecPIC. Like a PIC, this processor will implement a load-store Harvard
architecture capable of executing one 12-bit instruction per cycle. The ecPIC will feature 8
output lines for controlling devices. Unlike a PIC, this processor will not feature any
pipelining and will have fewer arithmetic and control operations. Furthermore, the
instruction memory will be located off the chip.


Table of Contents



Proposal
Final Report
Instruction Set
Schematics
Layouts
Software


Overview

The goal of this 6.371 project is to design and implement a PIC-like microcontroller
called the ecPIC. Like a PIC, this processor will implement a load-store Harvard
architecture capable of executing one 12-bit instruction per cycle. The ecPIC will feature 8
output lines for controlling devices. Unlike a PIC, this processor will not feature any
pipelining and will have fewer arithmetic and control operations. Furthermore, the
instruction memory will be located off the chip.



Team Members and Responsibilities

.............................................................
........


Theory of Operation

The ecPIC implements a simple, fully static, classic Harvard, load-store architecture.

On each clock cycle, the PC register will contain the address of the current instruction,
which is stored in the ROM located outside of the chip. The control register will
interpret this instruction by setting the appropriate inputs of the muxes, the write-back load
enables of the SRAM and W register, and the correct ALU operation. The SRAM will
essentially be a bank of 2 8-bit registers which can be written and read in one clock
cycle. One of these registers will serve as an output register for interfacing with the
outside world. The ALU will be an 8-bit ALU with four functions: Z=A+B, Z=B+1,
Z=Nand(A, B) and Z=Nor(A, B). All muxes and registers will be 8 bits wide. The
following is a simple block diagram of the ecPIC:




To simplify the required control logic and ALU logic, the ecPIC implements a very small
instruction set. The instructions will include: Jump, Move and Nand. Jumps make it
possible to perform loops by jumping the PC register to a specified literal. Move makes it
possible to move literals into the W register. Moves will also allow NOPs to be
performed. Nand is our basic arithmetic building block.



Pinout



PC0-PC7 - Instruction address output
I0-I11 - Instruction word input
O0-O7 - Data output
CLK - Clock input
Rst - Reset
Spc - Control Logic Output
Sop - Control Logic Output
enS - Control Logic Output
enW - Control Logic Output

PC0-PC7 are the address outputs to the instruction word ROM. I0-I11 are the instruction
word inputs. O0-O7 are the data output pins for the PIC. CLK is simply the clock input.
Spc, Sop, Salu, enS and enW are control lines that come from the Control Logic module.
These lines are only put in place for the sake of testing and debugging.



Layout

We will be using the standard Tiny Chip pad frame which allows a core of 1830 lambda
wide by 1800 lambda high.




Cells

The following cells will be implemented:

1. 8-bit mux
2. 8-bit load enabled registers (== mux + register)
3. SRAM
4. ALU
5. Control Logic
6. Incrementer

The breakdown of who implements what is outlined in Team Members and
Responsibilities.

References

1. Microchip Databook, 1994.

Overview

The ecPIC is a simple microcontroller based loosely upon the design of PICs, popular
microcontrollers manufactured by Microchip. Like PICs, the ecPIC is based upon a
load-store Harvard architecture capable of executing one 12-bit instruction per cycle. Like
PICs, the ecPIC features output pins which can be used to control devices. Unlike a PIC, this
processor does not feature any pipelining and has fewer arithmetic and control
operations. Furthermore, the instruction memory is located off the chip. If fabbed using
the mosis2n process and the TinyChip frame, the ecPIC can be clocked at about 3.7MHz,
which translates to 3.7 Million Instructions Per Second (MIPS). Internally, the ecPIC is
completely static, so there is no minimum clock speed.

The ecPIC has been successfully implemented and the Led ".L" file is available here.

This document describes the features, design, implementation and testing of the ecPIC. It
is organized in the following sections:

1. Features
2. Block Diagram
3. Schematics
4. Layout
5. Testing
6. Timing Analysis
7. Conclusions
8. Appendix
   1. Instruction Set
   2. Software


1. Features

The ecPIC is a simple microcontroller with the following features:

o 8-bit instruction address and data bus sizes.
o 11 different 12-bit instructions.
o Executes one instruction per cycle.
o Can be run at about 3.7MHz.
o One 1-byte general purpose register (the "W" register).
o Static design.
o Two bytes of SRAM that can be accessed independently.
o SRAM location zero is used as an output register.
o External Instruction Memory

The instruction set of the ecPIC consists of three distinct classes of operations
which operate on 8-bit values and addresses. The first class performs operations
on the general purpose W register and an 8-bit literal and returns the
result to the W register. These include nandl, norl, incl and addl. The second
class performs operations on an SRAM location and the W register and
returns the result to either the SRAM or the W register (one or the other, not both).
These include nand, nor, inc and add. The final class consists of operations on the
PC register. These include two absolute jumps (jmp and jmpl) and a conditional jump (cjmp).

The instructions have a very simple encoding that allows the ecPIC to be run at
about 3.7MHz, which translates to 3.7 MIPS. The ecPIC is completely static, so
there is no minimum clock speed. For more information on the encoding see the
Instruction Set section. Note that unmapped encodings result in unspecified
ecPIC behavior.

A primitive assembler is available in the Software section. The instructions that
perform operations on the W register and the SRAM can specify one of the two
byte-sized locations for reading and optionally writing back. The outputs of
SRAM location 0 are connected to the pins of the chip, which makes it possible for
the ecPIC to send signals to another device.

The instructions are located on a ROM outside of the ecPIC. Therefore, the ecPIC
has another 8 outputs that specify the address of the instruction to be executed
and 12 inputs for the specified instruction. The following summarizes the ecPIC
pins.

o PC0-PC7 - Instruction address output
o I0-I11 - Instruction word input
o O0-O7 - Data output
o CLK - Clock input
o Rst - Resets the PC to zero when asserted high for one clock cycle. Must
  be used to initialize the ecPIC after power up.
o Spc - Control Logic Output (for debugging purposes)
o Sop - Control Logic Output (for debugging purposes)
o enS - Control Logic Output (for debugging purposes)
o enW - Control Logic Output (for debugging purposes)


2. Block Diagram

The image below is the block diagram of the ecPIC.




The output of the PC register points to the instruction located in the Instruction
Memory ROM outside of the chip. The Instruction Memory ROM output is an input to
the ecPIC. The ControlLogic block decodes the instruction by asserting the select
signals on various muxes and the enable signals to the registers. The instruction
directly specifies the desired SRAM location or literal which, in addition to the
W register, are operand inputs to the ALU. The output of the ALU is an input to the
SRAM and W register, which may capture the value on the next falling clock edge.
Similarly, the next PC value is latched on the next falling clock edge. Note that
the block diagram does not include the inverting clock buffer: all of the registers
are actually positive edge triggered, and they see a rising edge when the external
clock falls.




3. Schematic Diagrams


This section contains all of the schematics of the ecPIC organized in a top-down manner.
Each schematic is preceded by a brief explanation of the functionality of the circuit.
Unless specified otherwise, all PFETs have widths and lengths of 6 and 2 microns while
all NFETs have widths and lengths of 3 and 2 microns respectively.


ecPIC

This is the high-level schematic of the ecPIC. This is essentially the same as the block
diagram except that all bus wires are drawn.






Control Logic

This module is responsible for decoding the current instruction. Inputs I8-I11 are the
most significant bits of the instruction word, which determine the operation that is
being performed. SRAM_OUT7 is the most significant bit of the output of the SRAM
location selected by I7; this bit is used to determine whether or not a conditional jump
should be taken.

The Spc2 and Spc1 outputs are connected to the muxes which determine the PC value for
the next instruction. The Sop output determines the operand for the ALU operation being
performed: either the literal embedded in the instruction word or the value of the SRAM
location that the instruction word selects. The enS and enW outputs are asserted low
when the ALU result is written back to the SRAM or W register respectively. The Salu1
and Salu0 outputs determine the ALU function being performed.

Most of the decoding is trivial due to the simple instruction set.



ALU

This module performs the following four functions: NOR, NAND, INCREMENT, and
ADDITION. There are only three modules because the add and increment functions are
combined. The 8-bit A and B inputs are fed into the three modules. The 8-bit mux in the
lower left chooses between the add and increment functions. The increment function is
performed on the B input. The demultiplexer on the left chooses which function output is
let through its tri-state gate and placed on the output bus. Notice that the NAND plus
inverter gates allow the select signal to the first mux to let both the add and increment
functions pass. The tri-state gates pass when their select is low.
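
The four ALU functions can be summarized behaviorally as follows. This is only a
sketch: the string selector `op` is a hypothetical stand-in for the Salu1 and Salu0
signals (their bit encoding is not given in this section), and it assumes that any
carry out of bit 7 is simply dropped.

```python
MASK = 0xFF  # the ecPIC datapath is 8 bits wide


def alu(op: str, a: int, b: int) -> int:
    """Return the 8-bit result of one ALU function.

    `op` is a hypothetical symbolic selector; the real chip uses the
    Salu1/Salu0 control signals, whose encoding is not stated here.
    """
    a &= MASK
    b &= MASK
    if op == "nor":
        return ~(a | b) & MASK
    if op == "nand":
        return ~(a & b) & MASK
    if op == "inc":
        # The increment is performed on the B input; A is ignored.
        return (b + 1) & MASK
    if op == "add":
        # Assumption: the carry out of bit 7 is discarded.
        return (a + b) & MASK
    raise ValueError(f"unknown ALU function: {op}")
```

For example, `alu("inc", 0x00, 0xFF)` wraps around to `0x00` under the stated
assumption about the dropped carry.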





Adder

This is a zoomed view of the adder shown above. The modules were implemented with
varying schemes, starting with the Manchester carry chain. This was replaced by the prod
block used in problem set four. Finally, two xor blocks were used for the sum function,
and the logic equation CARRY = AB + C(A+B) was implemented with an AOI and an OAI gate.
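
The final bit slice can be checked with a short behavioral model. The sketch below
uses two XORs for the sum and the carry equation given above. Chaining the slices as
a plain ripple-carry adder, and dropping the final carry, are assumptions made for
illustration; the function names are hypothetical.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """One bit slice: return (sum, carry_out) for bits a, b and carry-in c."""
    s = (a ^ b) ^ c                   # two XOR blocks form the sum
    carry = (a & b) | (c & (a | b))   # CARRY = AB + C(A+B)
    return s, carry


def ripple_add(a: int, b: int, width: int = 8) -> int:
    """Add two `width`-bit values by chaining bit slices (assumed ripple carry)."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result  # the final carry out is dropped (assumption)
```

Comparing `ripple_add(a, b)` against `(a + b) & 0xFF` for all 8-bit operands is a
quick way to confirm that the carry equation is equivalent to the usual majority
function.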





RAM

This is the original SRAM of twelve 8-bit cells, implemented using the register cells
shown in the following schematic. Although the final version was scaled down to two
8-bit cells, this schematic is basically the same and represents the desired expansion
if the chip were larger.



RAM Cell

These are two individual RAM cells, showing the tri-state gates after the register and
the mux in front. Also visible is the enable input at the top left corner. The gate is a
single NOR gate that completes the logic necessary for the mux select input.





Demultiplexer

The demultiplexer shown below has 16 output lines, with the selected line going low.
It was used in the original RAM shown above, before the scaling down was implemented.
This schematic was kept because the first two inputs on the left, along with the first
row of NAND gates, represent the four-output demultiplexer which was used for the
reduced SRAM and is also implemented in the ALU.
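
The four-output, active-low behavior can be sketched with one NAND gate per output.
The ordering of the outputs and the names `s1` and `s0` are assumptions; the
schematic may number its lines differently.

```python
def nand(*bits: int) -> int:
    """N-input NAND on 0/1 values."""
    return 0 if all(bits) else 1


def demux4(s1: int, s0: int) -> list[int]:
    """Four-output demultiplexer: the selected line goes low, the rest stay high.

    Output index 2*s1 + s0 is the selected line (assumed ordering).
    """
    n1, n0 = 1 - s1, 1 - s0  # inverted select inputs
    return [
        nand(n1, n0),  # line 0: low when s1=0, s0=0
        nand(n1, s0),  # line 1: low when s1=0, s0=1
        nand(s1, n0),  # line 2: low when s1=1, s0=0
        nand(s1, s0),  # line 3: low when s1=1, s0=1
    ]
```

An active-low selected line matches the tri-state gates and register enables
described elsewhere in this section, which are also asserted low.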





Incrementer

The incrementer is used to increment the PC register when a non-control instruction is
performed or a conditional jump is not taken. This incrementer was implemented by
cascading four 2-bit partial adders in series.





Register with Enable

This 8-bit register is used in the SRAM module and W register. It uses an 8-bit mux and a
standard 8-bit register to implement a register with an enable input. The register
maintains its value unless the enable input is asserted low on a rising clock edge.



Register with Reset

This 8-bit register is used as the PC register, which one must be able to "boot-strap" to
some known location. It is similar to the Register With Enable except that one of the
inputs of the mux is connected to ground instead of the output of the register. The
register acts like a normal register unless Rst is asserted high. If Rst is asserted
high on a rising clock edge, then the register will output 0x00.
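
Both register variants are a mux in front of a plain register, which the following
behavioral sketch tries to capture. The class and method names are hypothetical, and
each call to `rising_edge` stands for one rising edge at the register's clock input.

```python
class EnableRegister:
    """8-bit register with an active-low enable (mux + register)."""

    def __init__(self, value: int = 0) -> None:
        self.q = value & 0xFF

    def rising_edge(self, d: int, en_n: int) -> int:
        # Mux: enable low selects the new data, enable high recirculates
        # the register's own output so the stored value is kept.
        self.q = (d & 0xFF) if en_n == 0 else self.q
        return self.q


class ResetRegister:
    """8-bit register whose mux selects ground (0x00) when Rst is high."""

    def __init__(self, value: int = 0) -> None:
        self.q = value & 0xFF

    def rising_edge(self, d: int, rst: int) -> int:
        # Rst high forces 0x00; otherwise it behaves like a normal register.
        self.q = 0x00 if rst else (d & 0xFF)
        return self.q
```

For instance, a `ResetRegister` holding any value returns `0x00` after
`rising_edge(d, rst=1)`, which is how the PC would be initialized after power up.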





Register

This positive edge triggered 8-bit register is used in the SRAM, PC register and W
register. It is composed of two jam latches connected in series. The grey inverters in
the schematics are weak inverters with PFET and NFET widths of 3 microns and lengths of
4 microns. The inverters connected to clock are stronger inverters with PFET and NFET
lengths of 2 microns. The PFETs for these inverters have channel widths of 12 microns
while the NFETs for these inverters have channel widths of 6 microns.
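
Two latches in series give edge-triggered behavior when they are transparent on
opposite clock phases. The sketch below assumes the usual master-slave arrangement
(first latch transparent while the clock is low, second while it is high); it models
logic levels only and says nothing about the jam-latch circuit or transistor sizing.
All names are hypothetical.

```python
class Latch:
    """Level-sensitive latch: follows d while transparent, otherwise holds."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, transparent: bool) -> int:
        if transparent:
            self.q = d
        return self.q


class PosEdgeRegister:
    """Two latches in series, transparent on opposite clock phases."""

    def __init__(self) -> None:
        self.master = Latch()
        self.slave = Latch()

    def set_inputs(self, d: int, clk: int) -> int:
        # Master follows d while clk is low; slave passes the master's
        # held value while clk is high. The output therefore only changes
        # when clk goes from low to high.
        m = self.master.update(d, transparent=(clk == 0))
        return self.slave.update(m, transparent=(clk == 1))
```

Changing `d` while the clock stays high has no effect on the output in this model,
which is the property that distinguishes the register from a single latch.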





Partial Adder

The partial adder is used in the incrementer. It adds a 1-bit value (Cin0) to a 2-bit
value (A1,A0) and outputs the sum (Cout2, Z1, Z0).
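
A behavioral sketch of the partial adder, and of four of them cascaded into the 8-bit
PC incrementer, is given here. The internal gate structure (two half-adder stages),
the carry-in of the first stage being tied to 1, and the final carry being dropped
are assumptions; only the port names come from the description above.

```python
def partial_adder(a1: int, a0: int, cin0: int) -> tuple[int, int, int]:
    """Add the 1-bit cin0 to the 2-bit value (a1, a0); return (cout2, z1, z0)."""
    z0 = a0 ^ cin0
    c1 = a0 & cin0      # carry out of bit 0
    z1 = a1 ^ c1
    cout2 = a1 & c1     # carry out of bit 1
    return cout2, z1, z0


def increment(pc: int) -> int:
    """8-bit increment built from four cascaded 2-bit partial adders."""
    carry = 1           # assumption: first stage carry-in is tied high
    result = 0
    for stage in range(4):
        a0 = (pc >> (2 * stage)) & 1
        a1 = (pc >> (2 * stage + 1)) & 1
        carry, z1, z0 = partial_adder(a1, a0, carry)
        result |= (z0 << (2 * stage)) | (z1 << (2 * stage + 1))
    return result       # final carry dropped, so 0xFF wraps to 0x00 (assumption)
```

Because the carry only propagates through a stage when both of its bits are 1, each
stage needs no full adder, which is presumably why a dedicated partial adder cell was
used instead of the ALU adder slice.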





Mux

The mux is an 8-bit mux which is used within the registers with reset and enable inputs.
It is also used to select the PC register values and operand values to the ALU. It is
composed of transmission gates and an inverter. The mux selects input A if S is asserted
high and the B input if S is asserted low.



Xor

A standard xor implementation used throughout the ecPIC.



Nand4

A standard implementation of a four-input nand gate which is used only in the Control
Logic.





Nand3

A standard implementation of a three-input nand gate which is used only in the Control
Logic.



Nand

A standard implementation of a two-input nand gate which is used throughout the ecPIC.





Nor