A custom FPGA soft processor with a high-level language compiler, or the Song of the Mouse

Experience in adapting a high-level language compiler to a stack processor core.

A common problem with soft processors is the lack of development tools for them, especially if their instruction set is not a subset of the instructions of one of the popular processor cores. In that case the developers have to solve the problem themselves. The direct solution is to write an assembler for the processor. However, nowadays it is not always convenient to work in assembly language, since the instruction set may change during the project's development due to, for example, changing requirements. This makes the task of easily implementing a high-level language (HLL) compiler for a soft processor relevant.

The Uzh Python compiler turns out to be a light and convenient toolkit for developing software for soft processors. Its mechanism for defining primitives and macros as functions of the target language allows critical places to be implemented in the processor's assembler. This paper discusses the main points of adapting the compiler to stack architecture processors.

Instead of an epigraph:

If you take an ordinary mouse
and, holding it carefully,
stick needles into it,
you will get a hedgehog.

If you take this hedgehog,
pinch its nose so it cannot breathe
and throw it where the river is deeper,
you will get a ruff.

If you take this ruff,
clamp its head in a vice
and pull harder on its tail,
you will get a grass snake (uzh).

If you take this grass snake,
having prepared two knives...
well, it will probably die,
but the idea is good!


Introduction


In many cases, when building measuring instruments or research equipment, it is preferable to use reconfigurable FPGA-based solutions as the core of the system. This approach has many advantages, thanks to the ability to quickly and easily change the logic of operation, as well as to hardware acceleration of data processing and control signals.

For a wide range of tasks - digital signal processing, embedded control systems, data acquisition and analysis systems - a proven approach is to combine, in one design, blocks implemented in FPGA logic for time-critical processes with program control based on one or several soft processors for overall management and coordination, as well as for interaction with the user or with external devices and nodes. Using soft processors here somewhat reduces the time spent on debugging and verifying the system's control algorithms or the interaction algorithms of individual nodes.

Typical Wishlist


Soft processors in such designs usually do not need ultra-high performance (it is easier to achieve it using the FPGA's logic and hardware resources). They can be quite simple (almost primitive by the standards of modern microcontrollers): they can do without a complex interrupt system, work only with specific nodes or interfaces, and there is no need to support any particular instruction set. There can be many of them, each executing only a certain set of algorithms or subroutines. The bit width of a soft processor can also be anything, including widths that are not a multiple of a byte, depending on the requirements of the task at hand.

Typical wishes for a soft processor are:

  • sufficient functionality of the instruction set, possibly optimized for the task;
  • compactness, i.e. low consumption of FPGA logic and memory resources;
  • simplicity of the core - ease of modification when the requirements change.

Of course, a problem for soft processors is the lack of development tools for them, especially if their instruction set is not a subset of the instructions of one of the popular processor cores. Developers then have to solve this problem themselves. The direct solution is to create an assembler for the soft processor. However, nowadays it is not always convenient to work in assembly language, especially if the instruction set changes during the project's development due to, for example, changing requirements. So it is logical to add one more requirement to the list above: easy implementation of a high-level language (HLL) compiler for the soft processor.

Source components


Stack processors satisfy these requirements to a large degree: there is no need to address registers, so the instruction width can be small. The data width can vary and is not tied to the width of the instruction set. Being a de facto (albeit with a few caveats) hardware implementation of the intermediate representation used during compilation (a virtual stack machine, or, in terms of context-free grammars, a pushdown automaton), such a processor makes it possible to translate the grammar of almost any language into executable code with little effort. In addition, Forth is practically the "native" language of stack processors. The effort of implementing a Forth compiler for a stack processor is comparable to that of an assembler, with much greater flexibility and efficiency when writing programs later on.

Facing the task of building a system for collecting data from smart sensors in near-real-time mode, the Forth processor described in [1] was chosen as the reference design of the soft processor (below it will sometimes be referred to as the whiteTiger processor, after its author's nickname).

Its main features:

  • separate data and return stacks;
  • Harvard memory architecture (separate program and data memories, including address spaces);
  • expansion with peripherals over a simple parallel bus;
  • no pipeline; each instruction is executed in two cycles:

    1. fetch of the instruction and operands;
    2. execution of the instruction and saving of the result.

The processor is supplemented by a UART loader of program code, which allows the executable program to be changed without recompiling the FPGA project.

To match the configuration of block memory in the FPGA, the instruction width is set to 9 bits. The data width is set to 32 bits, but can in principle be anything.

The processor code is written in VHDL without any vendor-specific libraries, so the project can be used with FPGAs from any manufacturer.

For relatively widespread use, to lower the "entry threshold", and to allow the reuse of existing code, it is more expedient to move to a high-level language other than Forth (this is partly due to the superstitions and misconceptions of mainstream programmers about the complexity of this language and the readability of its code; incidentally, one of the authors of this work holds a similar opinion about C-like languages).

Based on a number of factors, the Python language was chosen for the experiment in pairing the soft processor with a high-level language. It is a high-level general-purpose programming language focused on developer productivity and code readability, supporting several programming paradigms: structural, object-oriented, functional, imperative and aspect-oriented [2].

For novice developers, its extension MyHDL [3, 4] is interesting: it allows hardware elements and structures to be described in Python and translated into VHDL or Verilog code.

Some time ago the Uzh compiler [5] was announced - a small compiler for the Zmey FPGA soft processor (a 32-bit stack architecture with multithreading support; following the chain of versions and modifications, Zmey is a distant descendant of the whiteTiger processor).
Uzh compiles a statically compiled subset of Python and is based on the promising raddsl toolkit (a set of tools for rapid prototyping of DSL compilers) [6, 7].

Thus, the factors that influenced the choice of the direction of work can be formulated approximately like this:

  • interest in tools that lower the "entry threshold" for novice developers of devices and systems on FPGAs (syntactically Python is not as "scary" for a beginner as VHDL);
  • striving for harmony and a single style in the project (theoretically, both the required hardware blocks and the soft processor's software can be described in Python);
  • random coincidence.

Small, “almost” meaningless nuances


The source code of the Zmey processor is not open, but a description of its operating principles and some architectural features is available. Although it is also a stack processor, there are a number of key differences from the whiteTiger processor:

  • the stacks are software stacks, i.e. they are represented by pointers and located in data memory at different addresses;
  • the instructions are 8 bits wide, and there are dedicated instructions for working with literals;
  • hardware multithreading is supported;
  • function arguments and local variables are accessed through the return stack pointer.

Accordingly, the Uzh compiler takes these features into account. The compiler accepts Python code and outputs a boot stream that initializes the processor's program and data memory; the key point is that all of the language's functionality is available at compile time.

To install the Uzh compiler, it is enough to download its archive and unpack it into any convenient folder (it is better to follow the usual recommendations for specialized software and avoid paths containing Cyrillic characters or spaces). You also need to download the raddsl toolkit and unpack it into the compiler's main folder.

The compiler's tests folder contains example programs for the soft processor; the src folder contains the source code of the compiler's components. For convenience, it is better to create a small batch file (extension .cmd) with the contents c.py C:\D\My_Docs\Documents\uzh-master\tests\abc.py, where abc.py is the name of the file with the program for the soft processor.

A snake biting its own tail, or fitting the hardware to the software


To adapt Uzh to the whiteTiger processor, some changes are required in the compiler, and the processor itself has to be slightly modified as well.

Fortunately, there are not many places in the compiler that need adjusting. The main "hardware-dependent" files are:

  • asm.py - assembler and the formation of numbers (literals);
  • gen.py - low-level code generation rules (functions, variables, transitions and conditions);
  • stream.py - forming a boot stream;
  • macro.py - macro definitions, in fact - extensions of the base language with hardware-specific functions.

In the original whiteTiger design, the UART loader initializes only the program memory. The loader algorithm is simple, but well proven and reliable:

  • upon receiving a certain control byte, the loader asserts the internal processor reset line;
  • a second command byte resets the memory address counter;
  • then follows a sequence of tetrads (nibbles) of the transmitted word, starting from the least significant one, each packed together with its tetrad number;
  • after the packed tetrads of each word comes a pair of control bytes, the first of which asserts the memory write-enable line and the second releases it (advancing the address counter);
  • once the sequence of packed tetrads is finished, a control byte releases the reset line.

Since the compiler also uses data memory, the loader has to be modified so that it can initialize the data memory as well.

Since the data memory is involved in the processor core logic, its data and control lines have to be multiplexed. For this, additional signals DataDinBtemp, LoaderAddrB and DataWeBtemp are introduced - the data, address and write-enable for port B of the data memory.

The bootloader code now looks like this:

uart_unit: entity work.uart
--uart_unit: entity uart
  Generic map(
    ClkFreq => 50_000_000,
    Baudrate => 115200)
  port map(
    clk => clk,
    rxd => rx,
    txd => tx,
    dout => receivedByte,
    received => received,
    din => transmitByte,
    transmit => transmit);
    
process(clk)
begin
  if rising_edge(clk) then
    if received = '1' then
      case conv_integer(receivedByte) is
      -- 0x00-0x0F : tetrad 0 (bits 0-3)
        when 0 to 15   => CodeDinA(3 downto 0) <= receivedByte(3 downto 0);
                          DataDinBtemp(3 downto 0) <= receivedByte(3 downto 0);
      -- 0x10-0x1F : tetrad 1 (bits 4-7)
        when 16 to 31  => CodeDinA(7 downto 4) <= receivedByte(3 downto 0);
                          DataDinBtemp(7 downto 4) <= receivedByte(3 downto 0);
      -- 0x20-0x2F : tetrad 2 (bit 8 of code, bits 8-11 of data)
        when 32 to 47  => CodeDinA(8) <= receivedByte(0);
                          DataDinBtemp(11 downto 8) <= receivedByte(3 downto 0);
        when 48 to 63  => DataDinBtemp(15 downto 12) <= receivedByte(3 downto 0);
        when 64 to 79  => DataDinBtemp(19 downto 16) <= receivedByte(3 downto 0);
        when 80 to 95  => DataDinBtemp(23 downto 20) <= receivedByte(3 downto 0);
        when 96 to 111 => DataDinBtemp(27 downto 24) <= receivedByte(3 downto 0);
        when 112 to 127 => DataDinBtemp(31 downto 28) <= receivedByte(3 downto 0);

      -- F0 addr=0
        when 240 => CodeAddrA <= (others => '0');
      -- F1 - WE=1
        when 241 => CodeWeA <= '1';
      -- F2 WE=0 addr++
        when 242 => CodeWeA <= '0'; CodeAddrA <= CodeAddrA + 1;
      -- F3 RESET=1
        when 243 => int_reset <= '1';
      -- F4 RESET=0
        when 244 => int_reset <= '0';

      -- F5 addr=0
        when 245 => LoaderAddrB <= (others => '0');
      -- F6 - WE=1
        when 246 => DataWeBtemp <= '1';
      -- F7 WE=0 addr++
        when 247 => DataWeBtemp <= '0'; LoaderAddrB <= LoaderAddrB + 1;
		  
		  
        when others => null;
      end case;
    end if;
  end if;
end process;

---- end of loader
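For reference, the loader's case statement can also be modelled on the host side; a minimal sketch (purely illustrative, not part of the project) that replays a boot stream into code and data images, useful for checking what the compiler generated before sending it to the FPGA:

def replay(stream):
  # Host-side model of the loader's case statement (illustrative only).
  code, data = {}, {}
  code_addr = data_addr = 0
  code_word = data_word = 0
  for b in stream:
    if b < 128:                     # packed tetrad: high nibble = position
      i, v = b >> 4, b & 0xF
      if i < 3:                     # tetrads 0..2 also form the 9-bit instruction
        code_word = (code_word & ~(0xF << (4 * i)) | (v << (4 * i))) & 0x1FF
      data_word = (data_word & ~(0xF << (4 * i)) | (v << (4 * i))) & 0xFFFFFFFF
    elif b == 240: code_addr = 0                 # F0: reset the code address counter
    elif b == 241: code[code_addr] = code_word   # F1: WE = 1, the word is written
    elif b == 242: code_addr += 1                # F2: WE = 0, address++
    elif b == 245: data_addr = 0                 # F5: reset the data address counter
    elif b == 246: data[data_addr] = data_word   # F6: WE = 1, the word is written
    elif b == 247: data_addr += 1                # F7: WE = 0, address++
    # F3 / F4 assert / release the processor reset - nothing to model here
  return code, data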


While the reset line is asserted, the DataDinBtemp, LoaderAddrB and DataWeBtemp signals are routed to the corresponding data memory ports.

…
    if reset = '1' or int_reset = '1' then
      DSAddrA <= (others => '0');      
      
      RSAddrA <= (others => '0');
      RSAddrB <= (others => '0');
      RSWeA <= '0';
      
      DataAddrB <= LoaderAddrB;
      DataDinB <= DataDinBtemp;
      DataWeB <= DataWeBtemp;
      DataWeA <= '0';
…

In accordance with the loader algorithm, the stream.py module has to be modified. It now contains two functions. The first, get_val(), splits an input word into the required number of tetrads: 9-bit whiteTiger instructions become groups of three tetrads, and 32-bit data words become sequences of eight tetrads. The second function, make(), builds the boot stream itself.
The final form of the stream module:

def get_val(x, by_4):
  r = []
  for i in range(by_4):
    r.append((x & 0xf) | (i << 4))
    x >>= 4
  return r

def make(code, data, core=0):
  # 243: assert reset, 245: set the data memory address counter to 0
  stream = [243, 245]
  for x in data:
    # pack each 32-bit data word into eight tetrads,
    # then pulse write-enable and advance the address (246, 247)
    stream += get_val(x, 8) + [246, 247]
  # 240: set the program memory address counter to 0
  stream += [240]
  for x in code:
    # pack each 9-bit instruction into three tetrads,
    # then pulse write-enable and advance the address (241, 242)
    stream += get_val(x, 3) + [241, 242]
  # 244: release reset
  stream.append(244)

  return bytearray(stream)
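As a quick sanity check, the packing can be tried in the interpreter; the values below follow directly from get_val() above:

>>> [hex(b) for b in get_val(0x1A5, 3)]        # a 9-bit instruction word
['0x5', '0x1a', '0x21']
>>> [hex(b) for b in get_val(0xDEADBEEF, 8)]   # a 32-bit data word
['0xf', '0x1e', '0x2e', '0x3b', '0x4d', '0x5a', '0x6e', '0x7d']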


The next changes in the compiler affect the asm.py module, which describes the processor's instruction set (instruction mnemonics and opcodes) and the way numeric values - literals - are represented and compiled.

The instructions are packed into a dictionary, and the lit() function is responsible for literals. While the instruction set part is simple - the list of mnemonics and the corresponding opcodes just changes - the situation with literals is a little different. The Zmey processor has 8-bit instructions and a number of specialized instructions for working with literals. In whiteTiger, the 9th bit indicates whether a word is an instruction opcode or part of a number.

If the most significant (9th) bit of a word is 1, the word is interpreted as part of a number: for example, four consecutive words with the number flag set form a 32-bit number. The end of a number is signalled by any instruction opcode; for definiteness and uniformity, the NOP ("no operation") opcode is used as the terminator.

As a result, the modified lit () function looks like this:


def lit(x):
  x &= 0xffffffff
  r = [] 
  if (x>>24) & 255 :
    r.append(int((x>>24) & 255) | 256)
  if (x>>16) & 255:
    r.append(int((x>>16) & 255) | 256)
  if (x>>8) & 255:
    r.append(int((x>>8) & 255) | 256)
  r.append(int(x & 255) | 256)
  r += asm("NOP")
  return list(r)
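A couple of worked examples (the concrete NOP opcode depends on the asm() table, so it is shown symbolically): the zero high bytes of a small number are simply skipped.

# lit(0x305)      -> [0x103, 0x105] + asm("NOP")
# lit(0x12345678) -> [0x112, 0x134, 0x156, 0x178] + asm("NOP")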


The main and most important changes and definitions are in the gen.py module. This module defines the basic logic for executing high-level code at the assembly level:

  • conditional and unconditional jumps;
  • calling functions and passing arguments to them;
  • return from functions and returning results;
  • adjustments to the sizes of program memory, data memory and stacks;
  • sequence of actions at startup of the processor.

To support a high-level language, the processor must be able to work with memory and pointers freely and must have a memory area for storing the local variables of functions.

In the Zmey processor, the return stack is used for local variables and function arguments: arguments are moved onto it, and later they are accessed through the return stack pointer register (read it, move it up or down, read at the pointed-to address). Since this stack physically resides in data memory, such operations essentially reduce to memory operations, and global variables live in the same memory.

In whiteTiger, the return and data stacks are dedicated hardware stacks with their own address spaces, and there are no stack pointer instructions. Consequently, passing arguments to functions and working with local variables has to be organized through data memory. There is little sense in enlarging the data and return stacks just to store relatively large data arrays in them; it is more logical to have a somewhat larger data memory.

To work with local variables, a dedicated LocalReg register was added; its task is to hold a pointer to the memory area allocated for local variables (a kind of heap). Operations for working with it were also added (cpu.vhd, instruction definition area):


          -- group 1; pop 0; push 1;
          when cmdLOCAL     => DSDinA <= LocalReg;
          when cmdLOCALadd  => DSDinA <= LocalReg; LocalReg <= LocalReg + 1;
          when cmdLOCALsubb => DSDinA <= LocalReg; LocalReg <= LocalReg - 1;
…
          -- group 2; pop 1; push 0;
          when cmdSETLOCAL => LocalReg <= DSDinA;
…

LOCAL - pushes the current value of the LocalReg pointer onto the data stack;
SETLOCAL - sets a new pointer value taken from the data stack;
LOCALadd - pushes the current pointer value onto the data stack and increments the pointer by 1;
LOCALsubb - pushes the current pointer value onto the data stack and decrements the pointer by 1.
LOCALadd and LOCALsubb were added to reduce the number of clock cycles when passing parameters into functions and back.
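The new mnemonics also have to appear in the asm.py instruction dictionary mentioned above. A minimal sketch of what the additions might look like; the dictionary name and the opcode values here are assumptions and must match the cmdLOCAL* codes defined in cpu.vhd:

OPCODES = {
  # ... existing whiteTiger mnemonics ...
  "LOCAL":     0x30,  # push LocalReg onto the data stack
  "SETLOCAL":  0x31,  # pop the data stack into LocalReg
  "LOCALadd":  0x32,  # push LocalReg, then LocalReg += 1
  "LOCALsubb": 0x33,  # push LocalReg, then LocalReg -= 1
}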

Unlike the original whiteTiger, the data memory connections were changed slightly: port B of the data memory is now permanently addressed by the output of the first cell of the data stack, and the output of the second cell of the data stack is fed to its data input:

-- ++
DataAddrB <= DSDoutA(DataAddrB'range);
DataDinB <= DSDoutB;

The logic for executing the STORE and FETCH commands was also slightly adjusted: FETCH takes the output value of data memory port B onto the top of the data stack, and STORE simply drives the write-enable signal of port B:

…
          -- group 3; pop 1; push 1;
          when cmdFETCH => DSDinA <= DataDoutB;
…
          when cmdSTORE =>            
            DataWeB <= '1';
…

As an exercise, and also to provide some hardware support for loops at a low level (and at the level of a Forth compiler), a stack of loop counters was added to the whiteTiger core (the declarations mirror those of the data and return stacks):

…
-- loop counter stack
type TCycleStack is array(0 to LocalSize-1) of DataSignal;
signal CycleStack: TCycleStack;
signal CSAddrA, CSAddrB: StackAddrSignal;
signal CSDoutA, CSDoutB: DataSignal;
signal CSDinA, CSDinB: DataSignal;
signal CSWeA, CSWeB: std_logic;
…
-- counter stack memory
process(clk)
begin
  if rising_edge(clk) then
    if CSWeA = '1' then
      CycleStack(conv_integer(CSAddrA)) <= CSDinA;
      CSDoutA <= CSDinA;
    else
      CSDoutA <= CycleStack(conv_integer(CSAddrA));
    end if;
  end if;
end process;


Cycle counter commands have been added.

DO - moves the number of loop iterations from the data stack onto the counter stack and pushes the incremented instruction counter onto the return stack.

LOOP - checks whether the counter has reached zero; if not, the top element of the counter stack is decremented and a jump is made to the address on top of the return stack. If the top of the counter stack is zero, that element is dropped, and the return address (the start of the loop) is dropped from the return stack as well.


      when cmdDO =>                     -- DO: set up a hardware loop
        RSAddrA <= RSAddrA + 1;         -- push the return address (loop start)
        RSDinA <= ip + 1;
        RSWeA <= '1';

        CSAddrA <= CSAddrA + 1;         -- push the iteration count onto the counter stack
        CSDinA <= DSDoutA;
        CSWeA <= '1';
        DSAddrA <= DSAddrA - 1;         -- pop the count from the data stack
        ip <= ip + 1;                   -- go to the next instruction

      when cmdLOOP =>                   -- LOOP: end of a hardware loop
        if conv_integer(CSDoutA) = 0 then
          ip <= ip + 1;                 -- counter exhausted: fall through
          RSAddrA <= RSAddrA - 1;       -- drop the return address
          CSAddrA <= CSAddrA - 1;       -- drop the counter
        else
          CSDinA <= CSDoutA - 1;        -- decrement the counter
          CSWeA <= '1';
          ip <= RSDoutA(ip'range);      -- jump back to the loop start
        end if;

Now you can start modifying the code for the gen.py module.

The *_SIZE variables need no comment: they only require substituting the values specified in the processor core project.

The STUB list is a temporary stub used to reserve space for jump addresses, which the compiler then fills in (the current values correspond to a 24-bit address space of the code memory).

The STARTUP list sets the sequence of actions performed by the core after a reset: here the starting address of the local-variable memory area is set to 900, followed by a jump to the entry point (if nothing is changed, the compiler writes the start/entry point at data memory address 2):

STARTUP = asm("""
900  SETLOCAL
2 NOP FETCH JMP
""")

The func() definition prescribes the actions performed when a function is called, namely transferring the function's arguments into the local variable area and allocating memory for the function's own local variables.

@act
def func(t, X):
  t.c.entry = t.c.globs[X]
  t.c.entry["offs"] = len(t.c.code) # - 1
  args = t.c.entry["args"]
  temps_size = len(t.c.entry["locs"]) - args
  # move the function arguments from the data stack into local variable memory
  t.out = asm("LOCALadd STORE " * args)
  if temps_size:
    # reserve space for the function's own local variables
    t.out += asm("LOCAL %d PLUS SETLOCAL" % temps_size)
  return True

epilog() defines the actions performed when returning from a function: freeing the memory of temporary variables and returning to the call site.

def epilog(t, X):
  locs_size = len(t.c.entry["locs"])
  # return to the call site
  t.out = asm("RET")
  if locs_size:
    # free the memory occupied by the local (temporary) variables
    t.out = asm("LOCAL %d MINUS SETLOCAL" % locs_size) + t.out
  return True


Variables are accessed through their addresses; the key definition here is push_local(), which leaves the address of a "high-level" variable on the data stack.

def push_local(t, X):
  # the address of a local variable is computed as an offset
  # down from the current LocalReg pointer
  t.out = asm("LOCAL %d MINUS" % get_loc_offset(t, X))
  return True

The next key elements are conditional and unconditional jumps. The conditional jump in the whiteTiger processor checks the second element of the data stack against 0 and jumps to the address on top of the stack if the condition holds. The unconditional jump simply loads the instruction counter with the value on top of the stack.

@act
def goto_if_0(t, X):
  push_label(t, X)
  t.out += asm("IF")
  return True

@act
def goto(t, X):
  push_label(t, X)
  t.out += asm("JMP")
  return True


The next two definitions implement shift-by-constant operations; at the low level they use the hardware loop, which gives some gain in code size (the original compiler simply emits the required number of single-bit shift operations in a row):

@act
def shl_const(t, X):
  t.out = asm("%d DO SHL LOOP" %(X-1))
  return True

@act
def shr_const(t, X):
  t.out = asm("%d DO SHR LOOP" %(X-1))
  return True
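For example, a constant left shift by 4 compiles into the sequence 3 DO SHL LOOP: the counter is loaded with X-1 = 3 and, given the LOOP semantics described above, SHL is executed four times.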

And the main definition of the compiler at a low level is a set of rules for language operations and working with memory:

stmt = rule(alt(
  seq(Push(Int(X)), to(lambda v: asm("%d" % v.X))),
  seq(Push(Local(X)), push_local),
  seq(Push(Global(X)), push_global),
  seq(Load(), to(lambda v: asm("NOP FETCH"))),
  seq(Store(), to(lambda v: asm("STORE"))),
  seq(Call(), to(lambda v: asm("CALL"))),
  seq(BinOp("+"), to(lambda v: asm("PLUS"))),
  seq(BinOp("-"), to(lambda v: asm("MINUS"))),
  seq(BinOp("&"), to(lambda v: asm("AND"))),
  seq(BinOp("|"), to(lambda v: asm("OR"))),
  seq(BinOp("^"), to(lambda v: asm("XOR"))),
  seq(BinOp("*"), to(lambda v: asm("MUL"))),
  seq(BinOp("<"), to(lambda v: asm("LESS"))),
  seq(BinOp(">"), to(lambda v: asm("GREATER"))),
  seq(BinOp("=="), to(lambda v: asm("EQUAL"))),
  seq(BinOp("~"), to(lambda v: asm("NOT"))),
  seq(ShlConst(X), shl_const),
  seq(ShrConst(X), shr_const),
  seq(Func(X), func),
  seq(Label(X), label),
  seq(Return(X), epilog),
  seq(GotoIf0(X), goto_if_0),
  seq(Goto(X), goto),
  seq(Nop(), to(lambda v: asm("NOP"))),
  seq(Asm(X), to(lambda v: asm(v.X)))
))
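To see how these rules combine, here is a rough, purely illustrative lowering of a single assignment; the local variable offsets are hypothetical, and the literals are NOP-terminated as described above:

# a = b + 1, with a and b as local variables:
#
#   Push(Local("b")) -> LOCAL 1 MINUS    (address of b)
#   Load()           -> NOP FETCH        (read b)
#   Push(Int(1))     -> 1                (literal, NOP-terminated)
#   BinOp("+")       -> PLUS
#   Push(Local("a")) -> LOCAL 2 MINUS    (address of a)
#   Store()          -> STORE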

The macro.py module allows the dictionary of the target language to be "extended" somewhat with macro definitions written in the assembler of the target processor. From the HLL compiler's point of view, the definitions in macro.py are indistinguishable from the "native" operators and functions of the language. For example, the original compiler defined functions for writing a value to an external port. Here, test sequences of operations with memory and local variables, as well as a time-delay operation, were added.

@macro(1,0)
def testasm(c,x):
  return Asm("1 1 OUTPORT 0 1 OUTPORT 11 10 STORE 10 FETCH 1 OUTPORT  15 100 STORE 100  FETCH 1 OUTPORT")

@macro(1,0)
def testlocal(c,x):
   return Asm("1 100 STORE 2 101 STORE 100 SETLOCAL LOCAL NOP FETCH 1 OUTPORT LOCAL 1 PLUS NOP FETCH 1 OUTPORT")

@prim(1, 0)
def delay(c, val):
  return [val, Asm("DO LOOP")]
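The test program in the next section uses digital_write(); its real definition ships with the original compiler, but, judging by the delay() primitive above and the OUTPORT sequences in testasm(), it might look roughly like this (the @prim signature and the argument order are assumptions):

@prim(2, 0)
def digital_write(c, port, val):
  # push the value, then the port number, then write it out
  return [val, port, Asm("OUTPORT")]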


Testing


A small high-level test program for our processor contains a function that computes the factorial and a main function that, in an infinite loop, sequentially writes the factorials of 1 through 7 to the output port.

def fact(n):
  r = 1
  while n > 1:
    r *= n
    n -= 1
  return r


def main():
  n = 1
  while True:
    digital_write(1, fact(n))
    delay(10)
    n = (n + 1) & 0x7
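Since fact() above is ordinary Python, the expected output sequence can be checked directly on the host:

print([fact(n) for n in range(1, 8)])   # [1, 2, 6, 24, 120, 720, 5040]
print(fact(0))                          # 1 (after n wraps around to 0)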


It can be compiled, for example, with a simple script or from the command line:

c.py C:\D\My_Docs\Documents\uzh-master\tests\fact2.py

As a result, a boot file stream.bin is generated, which can be transferred to the processor core in the FPGA over a serial port (in modern realities - over any virtual serial port provided by a USB-UART converter). The resulting program occupies 146 (9-bit) words of program memory and 3 words of data memory.
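The boot stream can be sent with any terminal program or a small script; a minimal host-side sender sketch (assumes pyserial is installed; the port name is system-specific):

import serial  # pyserial

with open("stream.bin", "rb") as f:
  data = f.read()

port = serial.Serial("COM3", baudrate=115200)  # adjust the port name as needed
port.write(data)
port.close()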




Conclusion


Overall, the Uzh compiler proves to be a light and convenient toolkit for developing software for soft processors. It is a good alternative to an assembler, at least in terms of programmer convenience. The mechanism for defining primitives and macros as functions of the target language allows critical places to be implemented in the processor's assembler. For stack architecture processors, the adaptation procedure is neither too complicated nor too lengthy. One can say this is exactly the case where having the compiler's source code helps: only its key sections need to be changed.

The synthesis results for the processor (32-bit data width, 4K words of program memory and 1K words of RAM) on an Altera Cyclone V FPGA are as follows:

Family	Cyclone V
Device	5CEBA4F23C7
Logic utilization (in ALMs)	694 / 18,480 ( 4 % )
Total registers	447
Total pins	83 / 224 ( 37 % )
Total virtual pins	0
Total block memory bits	72,192 / 3,153,920 ( 2 % )
Total DSP Blocks	2 / 66 ( 3 % )

Literature

  1. Forth processor on VHDL // m.habr.com/en/post/149686
  2. Python - Wikipedia // en.wikipedia.org/wiki/Python
  3. Getting started with FPGA in Python // Habr // m.habr.com/en/post/439638
  4. MyHDL // www.myhdl.org
  5. GitHub - true-grue/uzh: Uzh compiler // github.com/true-grue/uzh
  6. GitHub - true-grue/raddsl: Tools for rapid prototyping of DSL compilers // github.com/true-grue/raddsl
  7. sovietov.com/txt/dsl_python_conf.pdf

The author is grateful to the developers of the Zmey software processor and Uzh compiler for consultations and patience.
