Cortex-M7 Launches: Embedded, IoT and Wearables

Author: Stephen Barrett

by Stephen Barrett on September 23, 2014 7:01 PM EST


The Cortex-M7 CPU

The primary focus of the Cortex-M7 is improved performance. ARM’s goal was to raise M series performance to a level not previously seen in the family while maintaining the M series' signature small die size and tiny power consumption. There are at least two reasons ARM focused on performance for the M7. First, ARM wants to widen the gap between its cores and traditional 8- and 16-bit microcontrollers, further differentiating its market position. Second, the M7 will help support the IoT (Internet of Things) and wearable device markets. With its enhanced DSP capabilities, the M7 is better suited to audio and visual sensor hub processing than any previous M series design.

Digging into the details, the Cortex-M7 features a six-stage, in-order, dual-issue superscalar pipeline with single- and double-precision floating-point units, instruction and data caches, branch prediction, SIMD support, and tightly coupled memory. Here's the high-level view of the pipeline:

The instruction and data caches, branch prediction, and tightly coupled memory differentiate the M7 from previous M series processors. Microcontrollers often forgo caches and sometimes even operate with flash as the only memory interface. By providing high-performance instruction and data caches, the M7 moves closer to typical high-performance processor design.

Tightly coupled memory (TCM) is a technology ARM’s partners can use to extend the effective caching of a single M7 processor; until now it has been seen only in A and R series designs. In use, it can have the performance of a cache but, unlike a cache, its contents are directly controlled by the developer. That is, TCM is part of the physical memory map of the microcontroller. Developers can place critical code and data inside TCM, where they can be accessed deterministically and with high performance by code such as interrupt service routines. The M7 supports up to 16 MB of tightly coupled memory.
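To illustrate why developer-controlled placement matters for determinism, here is a toy model in Python. It is only a sketch: the cache geometry, the addresses, the TCM window and the background workload are all invented for illustration and do not describe the M7's real caches or any partner's memory map. It compares an interrupt routine fetched through a small direct-mapped cache with the same routine placed in TCM.

```python
CACHE_LINES = 4                            # hypothetical tiny direct-mapped cache
LINE_BYTES = 32                            # hypothetical cache line size
TCM_BASE, TCM_SIZE = 0x0000_0000, 0x1000   # hypothetical TCM window


def in_tcm(addr):
    """True if the address falls inside the (assumed) TCM region."""
    return TCM_BASE <= addr < TCM_BASE + TCM_SIZE


class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * CACHE_LINES

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        line = addr // LINE_BYTES
        index, tag = line % CACHE_LINES, line // CACHE_LINES
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return hit


def isr_misses(isr_base, interrupts=3):
    """Cache misses seen by a four-line ISR on each of several interrupts."""
    cache = DirectMappedCache()
    isr = [isr_base + i * LINE_BYTES for i in range(4)]
    background = [0x0800_8000 + i * LINE_BYTES for i in range(8)]
    per_interrupt = []
    for _ in range(interrupts):
        for addr in background:        # other code evicts the ISR's lines
            cache.access(addr)
        misses = 0
        for addr in isr:
            # TCM accesses bypass the cache entirely, so they never miss.
            if not in_tcm(addr) and not cache.access(addr):
                misses += 1
        per_interrupt.append(misses)
    return per_interrupt


print("ISR in flash:", isr_misses(0x0800_0000))
print("ISR in TCM:  ", isr_misses(0x0000_0100))
```

In this model the flash-resident routine misses on every line at every interrupt because the background code keeps evicting it, while the TCM-resident routine never touches the cache. Real behavior depends on cache size, associativity and workload, but the contrast is the point of TCM.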

Adding branch prediction allows ARM to target dedicated DSP devices with its Cortex-M7 microcontroller. DSP code often consists of filters on analog data streams for applications such as audio input keyword detection, audio output equalization, and frequency domain amplitude peak searching. When running on an always-on microcontroller, these tasks are almost always looped. Without a branch predictor, the processor must wait for a loop condition to resolve on every iteration, even though 99.9% of the time it results in the same outcome. Branch predictors cost extra die space, but when DSP is your target they are an obvious design benefit.
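The loop argument can be made concrete with a small simulation. This is a sketch of a textbook 2-bit saturating counter, not a description of the M7's actual predictor, and the loop length and call count are assumptions chosen to match the "99.9%" figure above. It counts how often the loop-closing branch of a repeatedly executed 1000-iteration loop is mispredicted.

```python
def simulate_loop_branch(iterations, calls):
    """Return (mispredictions, branches) for a loop-closing branch.

    Toy 2-bit saturating counter: states 0-1 predict not-taken,
    states 2-3 predict taken.
    """
    state = 0            # start in "strongly not-taken"
    mispredictions = 0
    branches = 0
    for _ in range(calls):
        for i in range(iterations):
            taken = i < iterations - 1   # branch back except on the last pass
            predicted_taken = state >= 2
            if predicted_taken != taken:
                mispredictions += 1
            # Saturating update toward the actual outcome.
            state = min(state + 1, 3) if taken else max(state - 1, 0)
            branches += 1
    return mispredictions, branches


missed, total = simulate_loop_branch(iterations=1000, calls=100)
print(f"{missed} of {total} branches mispredicted ({missed / total:.2%})")
```

After a short warm-up the counter mispredicts only the loop exit, once per call, so nearly every loop-back branch is predicted correctly and the pipeline is kept fed.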

Summarizing the M series cores can be done from an instruction features standpoint and from a die size and performance standpoint. Unfortunately ARM, which provides HDL (Hardware Description Language) that can be synthesized into physical chips, is not willing to provide die size numbers until its partners announce Cortex-M7 products, since the processor does not become physical until a partner gets involved. Until a partner releases data, we can simply assume the M7 is somewhat larger than its predecessors.

ARM Cortex-M Instruction Sets
	M0	M0+	M3	M4	M7
Thumb	Most	Most	Entire	Entire	Entire
Thumb-2	Subset	Subset	Entire	Entire	Entire
Hardware multiply	1 or 32 cycles	1 or 32 cycles	1 cycle	1 cycle	1 cycle
Hardware divide	No	No	Yes	Yes	Yes
Saturated math	No	No	Yes	Yes	Yes
DSP Extensions	No	No	No	Yes	Yes, enhanced
Floating-point	No	No	No	Optional single precision	Yes
Tightly coupled memory	No	No	No	No	Yes
Architecture	ARMv6-M	ARMv6-M	ARMv7-M	ARMv7-M	ARMv7-M
Cache Architecture	Von Neumann	Von Neumann	Harvard	Harvard	Harvard

ARM Cortex-M Area, Power, Performance
	M0	M0+	M3	M4	M7
90nm LP dynamic power (µW/MHz)	16	9.8	32	33	n/a
90nm LP area mm²	0.04	0.035	0.12	0.17	n/a
40nm G dynamic power (µW/MHz)	4	3	7	8	n/a
40nm G area mm²	0.01	0.009	0.03	0.04	n/a
Dhrystone (official) DMIPS/MHz	0.84	0.94	1.25	1.25	2.14
Dhrystone (max options) DMIPS/MHz	1.21	1.31	1.89	1.95	3.23
CoreMark/MHz	2.33	2.42	3.32	3.40	5.04

ARM did state that the M7's power consumption is roughly in line with its predecessors' performance/mW, so, given the per-MHz benchmark gains over the M4 in the table above, we could estimate a corresponding 50% to 75% increase in power consumption. Area is anyone's guess at the moment.
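The 50% to 75% estimate can be sanity-checked against the benchmark table. The sketch below assumes (our reading, not a method ARM published) that constant performance/mW means power per MHz scales with performance per MHz relative to the M4. The resulting M7 power figures are therefore rough estimates, not published numbers.

```python
# Per-MHz benchmark scores taken from the table above.
M4 = {"dhrystone_official": 1.25, "dhrystone_max": 1.95, "coremark": 3.40}
M7 = {"dhrystone_official": 2.14, "dhrystone_max": 3.23, "coremark": 5.04}

M4_POWER_40NM_G = 8.0  # µW/MHz, from the table above


def relative_gain(new, old):
    """Fractional performance increase of `new` over `old`."""
    return new / old - 1.0


gains = {name: relative_gain(M7[name], M4[name]) for name in M4}

for name, gain in gains.items():
    # Assumption: same performance/mW, so power/MHz grows by the same factor.
    est_power = M4_POWER_40NM_G * (1.0 + gain)
    print(f"{name}: {gain:.0%} more work per MHz, "
          f"~{est_power:.1f} µW/MHz estimated at 40nm G")
```

Depending on the benchmark, the gain over the M4 comes out between roughly 48% (CoreMark) and 71% (official Dhrystone), which is where a 50% to 75% power estimate comes from.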

43 Comments


Wilco1 - Wednesday, September 24, 2014 - link
Embedded (M) was traditionally a microcontroller using on-chip flash and SRAM, with no MMU, no DSP, and no FP support. The R series are higher performance realtime CPUs with TCM, caches, branch prediction and often external DRAM and FP. Now that M also supports DSP, FP, caches, and is becoming high performance, things have become blurred. The ISA differences are now the main distinction: M only supports Thumb-1 and Thumb-2 and uses a different interrupt model, while the R architecture is basically A series plus TCM minus MMU. So many TLAs...
hammer256 - Wednesday, September 24, 2014 - link
Oh that's right, different interrupt model. M series is generally lower latency because it's directly coupled, if I recall. It's just that this new M7 line blurs that line even further than the M4 did...
For TCM, is it generally DRAM integrated on the MCU, or a tight interface between the MCU and the DRAM chips?
Didn't one of Samsung's SSDs use a few Cortex-R3 cores for their controller?
Wilco1 - Wednesday, September 24, 2014 - link
3 R4 cores are used in Samsung SSDs.

Simply put, TCM is fast on-core instruction/data SRAM, similar to an I- or D-cache. It is fully under user control and thus without the non-deterministic effects of a traditional cache. TCM can be used in addition to a cache. TCM allows high frequencies like a cache, and thus is faster than an external SRAM.

The usage model is that you put all your critical realtime code/data in the instruction/data TCMs and run the rest from flash/DRAM. When an interrupt occurs, you start executing realtime code from the TCM immediately rather than having to wait for the cache misses that would inevitably occur if you didn't have TCM. So the TCMs are actually necessary for realtime on a fast CPU; having a low interrupt latency alone is not the whole story.
hammer256 - Wednesday, September 24, 2014 - link
Oooh I see. It sounds like TCM is a big distinguishing feature between the M and R series then. So even if performance is equal, R series actually allows for applications with even tighter latency requirements than the M series.
Well, learned something new today, thanks!
toyotabedzrock - Wednesday, September 24, 2014 - link
It is not a good idea to put this in a wearable or a car. The lack of an MMU seems tone deaf given the security environment we live in.
Wilco1 - Wednesday, September 24, 2014 - link
Most of the M series support an optional MPU for OS task protection. That said, security and MMU are 2 orthogonal things - an MMU doesn't stop exploits as otherwise we wouldn't have any viruses/trojans/rootkits/etc on PCs. For microcontrollers security is easier as there are far fewer possible security breaches, so it's more down to not setting default passwords or using old, already broken encryption algorithms.
ah06 - Thursday, September 25, 2014 - link
Which one makes most sense in a wearable? M4, M7, Rx, A7, A53?
Wilco1 - Thursday, September 25, 2014 - link
IMHO only M3 or M4 - anything else is way overkill for e.g. a watch. You definitely don't want to run anything as big/complex as Linux/Android if you want to provide at least a week of battery life.
DIYEyal - Sunday, September 28, 2014 - link
Actually the WeLoop Tommy smartwatch has the M0; they claim 3 weeks of battery life with a 110mAh battery.
RomanR - Thursday, September 25, 2014 - link
Hi,

Can anyone tell me how many clock cycles are needed to compute one output sample of a ten-tap 32-bit FIR filter?
A 1-cycle MAC instruction is O.K., but what about data transfer?
